SBAIJ issue

Andreas Grassl Andreas.Grassl at student.uibk.ac.at
Tue Oct 13 03:05:50 CDT 2009


Hong Zhang wrote:
> Ando,
> 
> I do not see any error message from attached info below.
> Even '-log_summary' gives correct display.
> I guess you sent us the working output (np=2).
> 

I attached three files. The one in which you found the -log_summary output is
indeed the working case.

The other two runs hang.

Output of top for np=4 when still "running":

 8466 csae1801  25   0 1442m 704m 5708 R  100  5.9   1:20.87 externalsolver
 8468 csae1801  25   0 1413m 697m 5052 R  100  5.8   1:13.45 externalsolver
 8469 csae1801  25   0 1359m 614m 5148 R  100  5.1   1:12.75 externalsolver
 8467 csae1801  25   0 1415m 702m 5096 R   96  5.9   1:13.01 externalsolver

Output of top for np=4 when hanging:

 8466 csae1801  18   0 1443m 769m 6120 S    0  6.4   2:09.47 externalsolver
 8468 csae1801  15   0 1413m 759m 5420 S    0  6.3   2:00.87 externalsolver
 8467 csae1801  15   0 1415m 748m 5396 S    0  6.2   2:01.21 externalsolver
 8469 csae1801  18   0 1359m 688m 5460 S    0  5.7   2:01.39 externalsolver

Other processes use about 12% of memory in total.
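
The hang can be checked mechanically from such a top snapshot: every rank is
in state S with 0% CPU, which points to processes blocked waiting rather than
computing or swapping. Below is a minimal sketch, assuming the standard top
column order (PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND); the
helper names parse_top and all_idle are made up for illustration.

```python
# Rows copied from the "hanging" top output in this message.
# Assumed column order (standard top layout, not shown in the original):
# PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
HANGING = """\
 8466 csae1801  18   0 1443m 769m 6120 S    0  6.4   2:09.47 externalsolver
 8468 csae1801  15   0 1413m 759m 5420 S    0  6.3   2:00.87 externalsolver
 8467 csae1801  15   0 1415m 748m 5396 S    0  6.2   2:01.21 externalsolver
 8469 csae1801  18   0 1359m 688m 5460 S    0  5.7   2:01.39 externalsolver
"""


def parse_top(text):
    """Turn top process rows into dicts; lines without 12 fields are skipped."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 12:
            continue
        rows.append({
            "pid": int(fields[0]),
            "state": fields[7],
            "cpu": float(fields[8]),
            "mem": float(fields[9]),
        })
    return rows


def all_idle(rows):
    """True if every listed process is sleeping (S) and uses 0% CPU."""
    return bool(rows) and all(r["state"] == "S" and r["cpu"] == 0.0 for r in rows)


rows = parse_top(HANGING)
print("all ranks idle:", all_idle(rows))
print("memory used by the ranks (%):", round(sum(r["mem"] for r in rows), 1))
```

The memory sum of the four ranks plus the roughly 12% used by other processes
stays well below 100%, so the snapshot gives no sign of the machine running
out of memory.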

> I would suggest you run your code with debugger,
> e.g., '-start_in_debugger'.
> When it hangs, type Control-C,
> and type 'where' to check where it hangs.
> 

I guess it hangs somewhere after the numerical factorization, because the
extrapolated run time would match.

Using the debug or the non-debug build doesn't change the behaviour.
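
As an alternative to an interactive session, a backtrace can be collected from
every hanging rank by attaching gdb in batch mode. Below is a minimal sketch,
assuming gdb is installed and may attach to the processes; the function names
and the script name in the usage note are hypothetical, and the PIDs would be
the ones shown by top.

```python
import subprocess
import sys


def gdb_backtrace_cmd(pid):
    """Build a gdb command line that attaches to pid, runs 'where', and exits."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "where"]


def collect_backtraces(pids, timeout=60):
    """Run gdb once per PID and return {pid: captured stdout}."""
    traces = {}
    for pid in pids:
        result = subprocess.run(
            gdb_backtrace_cmd(pid),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        traces[pid] = result.stdout
    return traces


if __name__ == "__main__":
    # PIDs are taken from the command line; with no arguments nothing is run.
    pids = [int(arg) for arg in sys.argv[1:]]
    for pid, trace in collect_backtraces(pids).items():
        print(f"=== PID {pid} ===")
        print(trace)
```

It could be called as "python collect_bt.py 8466 8467 8468 8469". Batch mode
should also avoid the interactive pager prompt, so long frames are not cut off.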

Output of 'where' (in gdb):


#0  0x0000003a0ccc5cdf in poll () from /lib64/libc.so.6
#1  0x00000000011d1024 in MPIDU_Sock_wait (sock_set=0x4464890, millisecond_timeout=4,
    eventp=0xffffffffffffffff) at sock_wait.i:124
#2  0x00000000011a3203 in MPIDI_CH3I_Progress (blocking=71714960, state=0x4)
    at ch3_progress.c:1038
#3  0x00000000011843ce in PMPI_Recv (buf=0x4464890, count=4, datatype=-1, source=-1,
    tag=108517088, comm=168072704, status=0x4f503b0) at recv.c:156
#4  0x0000000000ea9926 in BI_Srecv (ctxt=0x4f522d0, src=-2, msgid=2, bp=0x1813ad8)
    at BI_Srecv.c:8
#5  0x0000000000ea9414 in BI_SringBR (ctxt=0x4f522d0, bp=0x1813ad8, send=0xea9800 <BI_Ssend>,
    src=1) at BI_SringBR.c:16
#6  0x0000000000ea22b1 in igebr2d_ (ConTxt=0x7fff0afeb110, scope=0x12a57f8 "Rowwise",
    top=0x17b9094 "S", m=0x12a57b8, n=0x12a57b8, A=0x7fff0afeb560, lda=0x12a57b8,
    rsrc=0x7fff0afeb118, csrc=0x7fff0afeb090) at igebr2d_.c:198
#7  0x0000000000e3b0f5 in pdpotf2 (uplo=Invalid C/C++ type code 13 in symbol table.
) at pdpotf2.f:340
#8  0x0000000000e2c818 in pdpotrf (uplo=Invalid C/C++ type code 13 in symbol table.
) at pdpotrf.f:327
#9  0x0000000000c5daf6 in dmumps_146 (myid=0, root=
      {mblock = 48, nblock = 48, nprow = 2, npcol = 2, myrow = 0, mycol = 0,
root_size = 2965, tot_root_size = 2965, cntxt_blacs = 0, rg2l_row = 0x676f0bf,
rg2l_col = 0x676f107, ipiv = 0x676f14f, descriptor = {1, 0, 2965, 2965, 48, 48,
0, 0, 1488}, descb = {0, 0, 0, 0, 0, 0, 0, 0, 0}, yes = 4294967295,
gridinit_done = 4294967295, lpiv = 1, schur_pointer = 0x676f1eb, schur_mloc = 0,
schur_nloc = 0, schur_lld = 0, qr_tau = 0x676f23f, qr_rcond = 0, maxg = 0, gind
= 0, grow = 0x676f297, gcos = 0x676f2df, gsin = 0x676f327, elg_max = 0, null_max
= 0, elind = 0, euind = 0, nlupdate = 0, nuupdate = 0, perm_row = 0x676f387,
perm_col = 0x676f3cf, elrow = 0x676f417, eurow = 0x676f45f, ptrel = 0x676f4a7,
ptreu = 0x676f4ef, elelg = 0x676f537, euelg = 0x676f57f, dl = 0x676f5c7},
n=446912, iroot=266997, comm=-2080374780, iw=0x2aaaf5c49010, liw=8275423,
    ifree=1646107, a=0x2aaab9d6e010, la=125678965, ptrast=0xb09d2fc,
ptlust_s=0xb05d200,
    ptrfac=0xb0727b0, step=0xb4aca20, info={0, 0}, ldlt=1, qr=0, wk=0x2aaacab98ba0,
    lwk=90267651, keep=
      {8, 2571, 96, 24, 16, 48, 150, 120, 400, 6875958, 2147483646, 200,
3015153, 3259551, 1655023, 0, 0, 0, 0, 0, 0, 0, 0, 18, 0, 1646982, 3705, 21863,
8275423, 0, 0, 0, 0, 4, 8, 1, 800, 266997, 160000, -456788, 8, 0, 190998,
190998, 0, 1, 2, 5, 12663, 1, 48, 0, 0, 3, 0, 5, 500, 250, 0, 0, 0, 100, 60, 10,
120, 28139, 84754429, 0, 1, 0, 21863, 0, 0, 0, 1, 2, 30, 0, 2147483647, 1, 0, 5,
4, -8, 100, 1, 70, 70, 0, 1, 4, 0, 0, 0, 1, 0, 0, 0, 4, 12000000, 8791225, 150,
0, 16, 0, 1, 0, 1370, 0, 0, 0, 0, 11315240, 12209064, 0 <repeats 11 times>,
6167135, 3705, 0 <repeats 74 times>, 2214144, 0, 0, 0, 0, 0, 0, -1, 2, 2,
2214144, 201, 2, 0, 1, 0, 50, 1, 0, 0, 5, 2291986, 1670494, 1678547, 142320, 32,
0, 0, 0, 1, 3, 0, 1, 0, 0, 0, 12, 1, 10, 0 <repeats 260 times>}, keep8=
      {0, 407769668, 177587312, 0, 0, 0, 0, 0, 31341437, 30351541, 35301388,
41892965, 125678965, 12496233, 574564, 0, 37488833, 0 <repeats 91 times>,
120657071, 0, 137362626, 0 <repeats 39 times>}) at dmumps_part7.F:286
#10 0x0000000000c17921 in dmumps_251 (n=446912, iw=0x2aaaf5c49010, liw=8275423,
    a=0x2aaab9d6e010, la=125678965, nstk_steps=0xb0dd3d0, nbprocfils=0xb0f296c,
iflag=0,
    nd=0x4dbe8f0, fils=0xb661130, step=0xb4aca20, frere=0x4dd3ea0, dad=0x4de9450,
    cand=0x6a24830, istep_to_iniv2=0x4dfea00, tab_pos_in_pere=0x67bbff0,
maxfrt=0, ntotpv=0,
    ptrist=0xb087d60, ptrast=0xb09d2fc, pimaster=0xb0b2898, pamaster=0xb0c7e34,
    ptrarw=0xb9c9f40, ptraiw=0xb815840, itloc=0xb107f08, ierror=0, ipool=0xb2bc608,
    lpool=21867, rinfo={28139655699.833332, 0 <repeats 19 times>}, posfac=35411315,
    iwpos=1646106, lrlu=90267651, iptrlu=125678965, lrlus=90267651, leaf=1865,
nbroot=1,
    nbrtot=4, uu=0, icntl=
      {6, 0, 6, -1, 0, 0, 7, 77, 1, 0, 0, 1, 0, 200, 0, 0, 0, 3, 0, 0, 0, 0, 0,
0, 0, 0, -8, 0 <repeats 11 times>, 1, 0}, ptlust_s=0xb05d200, ptrfac=0xb0727b0,
nsteps=5877, info=
      {0, 0, 35301388, 1646982, 3705, 0, 8275423, 125678965, 0, 0, 0, 0, 0, 0,
1112, 1112, 421, 0, 8392947, 37488833, 0, 0, 0, 31341437, 0 <repeats 16 times>},
keep=
      {8, 2571, 96, 24, 16, 48, 150, 120, 400, 6875958, 2147483646, 200,
3015153, 3259551, 1655023, 0, 0, 0, 0, 0, 0, 0, 0, 18, 0, 1646982, 3705, 21863,
8275423, 0, 0, 0, 0, 4, 8, 1, 800, 266997, 160000, -456788, 8, 0, 190998,
190998, 0, 1, 2, 5, 12663, 1, 48, 0, 0, 3, 0, 5, 500, 250, 0, 0, 0, 100, 60, 10,
120, 28139, 84754429, 0, 1, 0, 21863, 0, 0, 0, 1, 2, 30, 0, 2147483647, 1, 0, 5,
4, -8, 100, 1, 70, 70, 0, 1, 4, 0, 0, 0, 1, 0, 0, 0, 4, 12000000, 8791225, 150,
0, 16, 0, 1, 0, 1370, 0, 0, 0, 0, 11315240, 12209064, 0 <repeats 11 times>,
6167135, 3705, 0 <repeats 74 times>, 2214144, 0, 0, 0, 0, 0, 0, -1, 2, 2,
2214144, 201, 2, 0, 1, 0, 50, 1, 0, 0, 5, 2291986, 1670494, 1678547, 142320, 32,
0, 0, 0, 1, 3, 0, 1, 0, 0, 0, 12, 1, 10, 0 <repeats 260 times>}, keep8=
      {0, 407769668, 177587312, 0, 0, 0, 0, 0, 31341437, 30351541, 35301388,
41892965, 12567896
---Type <return> to continue, or q <return> to quit---
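
The long argument dumps make the call chain hard to see. Below is a minimal
sketch that reduces the backtrace to the function names; it assumes the gdb
frame format shown in this message, and FRAME_RE and frame_functions are
illustrative names. Only the first line of each frame is copied here.

```python
import re

# First line of each frame, copied from the gdb output in this message
# (the remaining argument lines are left out).
BACKTRACE = """\
#0  0x0000003a0ccc5cdf in poll () from /lib64/libc.so.6
#1  0x00000000011d1024 in MPIDU_Sock_wait (sock_set=0x4464890,
#2  0x00000000011a3203 in MPIDI_CH3I_Progress (blocking=71714960, state=0x4)
#3  0x00000000011843ce in PMPI_Recv (buf=0x4464890, count=4, datatype=-1,
#4  0x0000000000ea9926 in BI_Srecv (ctxt=0x4f522d0, src=-2, msgid=2, bp=0x1813ad8)
#5  0x0000000000ea9414 in BI_SringBR (ctxt=0x4f522d0, bp=0x1813ad8,
#6  0x0000000000ea22b1 in igebr2d_ (ConTxt=0x7fff0afeb110, scope=0x12a57f8
#7  0x0000000000e3b0f5 in pdpotf2 (uplo=Invalid C/C++ type code 13 in symbol table.
#8  0x0000000000e2c818 in pdpotrf (uplo=Invalid C/C++ type code 13 in symbol table.
#9  0x0000000000c5daf6 in dmumps_146 (myid=0, root=
#10 0x0000000000c17921 in dmumps_251 (n=446912, iw=0x2aaaf5c49010, liw=8275423,
"""

# A frame line looks like: "#<n>  0x<address> in <function> (<args>..."
FRAME_RE = re.compile(r"^#(\d+)\s+0x[0-9a-f]+ in (\w+) \(", re.MULTILINE)


def frame_functions(text):
    """Return the function names of all frames, innermost frame first."""
    return [name for _, name in FRAME_RE.findall(text)]


print(" <- ".join(frame_functions(BACKTRACE)))
```

Read from the inside out, the chain suggests that this rank is blocked in an
MPI receive (PMPI_Recv) inside a BLACS broadcast (igebr2d_), called from the
ScaLAPACK Cholesky routine pdpotrf, which MUMPS calls from dmumps_146 on the
root front (nprow = 2, npcol = 2 in the dump above).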

-- 
 /"\                               Grassl Andreas
 \ /    ASCII Ribbon Campaign      Uni Innsbruck Institut f. Mathematik
  X      against HTML email        Technikerstr. 13 Zi 709
 / \                               +43 (0)512 507 6091

