[MOAB-dev] Problem with running mbparallelcomm_test

Dmitry Karpeev karpeev at mcs.anl.gov
Sun Mar 14 10:58:44 CDT 2010


I keep running into problems with parallel reads of tjunc6 on more
than 4 procs on cosmea.
1. Whenever I read tjunc6 partitioned into N parts with N >= 4
(e.g., N=4,8) on np=4 procs, the run (equivalent to
mbparallelcomm_test 0 3 1 tjunc6_N.h5m PARALLEL_PARTITION) completes.
With N=4,np=4 I get the full output with per-proc entity counts, etc.
With N=8,np=4 I only get the total times, but no per-proc output.
2. Whenever I read tjunc6 partitioned into N parts with N equal to the
number of procs np, but np > 4 (e.g., N=8, np=8), the run "hangs" --
it times out after exhausting all of the allotted walltime
(e.g., 5 minutes).
3. The above is with BCAST_DELETE, but I see similar problems with
READ_PART (a sketch of the corresponding read options follows this list).
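
For reference, the two read modes above are selected through MOAB's
parallel read options string. The following is a minimal sketch of that
kind of read (not the mbparallelcomm_test driver itself), assuming the
moab::Core::load_file API; the file name and this particular option
combination are illustrative:

// Minimal sketch: a parallel read of a partitioned .h5m file through
// MOAB's options string. Swap BCAST_DELETE for READ_PART to exercise
// the other read mode mentioned above.
#include <iostream>
#include "moab/Core.hpp"
#include "mpi.h"

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);

  moab::Core mb;
  // PARALLEL=BCAST_DELETE: rank 0 reads and broadcasts the file, then
  // each rank deletes the entities it does not own;
  // PARTITION=PARALLEL_PARTITION selects the partition sets written
  // into tjunc6_N.h5m.
  const char* opts = "PARALLEL=BCAST_DELETE;"
                     "PARTITION=PARALLEL_PARTITION;"
                     "PARTITION_DISTRIBUTE;"
                     "PARALLEL_RESOLVE_SHARED_ENTS";
  moab::ErrorCode rval = mb.load_file("tjunc6_8.h5m", 0, opts);
  if (moab::MB_SUCCESS != rval)
    std::cerr << "load_file failed" << std::endl;

  MPI_Finalize();
  return (moab::MB_SUCCESS == rval) ? 0 : 1;
}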

The problem seems to be related to the fact that cosmea has 4-core
nodes, so running with np=4 amounts to using a single node (the
mpi.nodes file is constructed that way), while np=8 uses 2 nodes,
and then the run never completes.
I have confirmed that on my laptop I can run with any number of procs
np and a corresponding number of partitions N >= np.
I'm trying to replicate the problem on fusion.
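
Since np=4 stays within one node and np=8 is the first configuration
that crosses nodes, it may also be worth ruling out a plain inter-node
MPI problem before blaming the reader. Here is a minimal,
MOAB-independent sanity check (the buffer size is arbitrary); if this
already hangs with np=8 across n030/n032, the culprit would be the
MPI/interconnect setup rather than BCAST_DELETE:

// Inter-node MPI sanity check: broadcast a moderately large buffer
// from rank 0 and have every rank report back with its host name.
#include <cstdio>
#include <vector>
#include "mpi.h"

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);

  int rank = 0, size = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  char host[MPI_MAX_PROCESSOR_NAME];
  int len = 0;
  MPI_Get_processor_name(host, &len);

  // ~8 MB payload, roughly the order of a broadcast mesh file.
  std::vector<char> buf(8 * 1024 * 1024, rank == 0 ? 1 : 0);
  MPI_Bcast(&buf[0], (int)buf.size(), MPI_CHAR, 0, MPI_COMM_WORLD);
  MPI_Barrier(MPI_COMM_WORLD);

  std::printf("rank %d of %d on %s: bcast ok (buf[0]=%d)\n",
              rank, size, host, (int)buf[0]);

  MPI_Finalize();
  return 0;
}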

Anybody have any idea what may be going on?
Below I'm quoting the output for tjunc6_N with (N=4,np=4), (N=8,np=4),
and (N=8,np=8); the first two runs complete, while the last one doesn't.
I see similar problems with N=32,np=32, etc.
I can send results from my laptop if necessary; they always look right.

Any insight is appreciated.
Thanks!
Dmitry.

====================================================================================================================
N=4,np=4
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Batch output:
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Using MPI from /home/karpeev/fathom/mpi2/mpich2-1.2.1/icc
Running on 1 nodes
4 cores per node, one MPI process per core
for a total of 4 MPI processes
.................................................
PBS nodefile:
n030
n030
n030
n030
.................................................
mpd ring nodefile:
n030:4
.................................................
Running mpdboot ...
done
Running mpdtrace ...
n030
done
Running mpdringtest ...
time for 1 loops = 3.69548797607e-05 seconds
done
Using MPIEXEC_CMD=/home/karpeev/fathom/mpi2/mpich2-1.2.1/icc/bin/mpiexec -n 4
Commencing parallel run tjunc6_4 of executable ../mbparead
Proc 2 iface entities:
    913 0d iface entities.
    1780 1d iface entities.
    868 2d iface entities.
    0 3d iface entities.
    (913 verts adj to other iface ents)
Proc 1 iface entities:
    946 0d iface entities.
    1843 1d iface entities.
    Proc 3 iface entities:
    946 0d iface entities.
    1843 1d iface entities.
    898 2d iface entities.
    0 3d iface entities.
    (946 verts adj to other iface ents)
898 2d iface entities.
    0 3d iface entities.
    (946 verts adj to other iface ents)
Proc 0 iface entities:
    913 0d iface entities.
    1780 1d iface entities.
    868 2d iface entities.
    0 3d iface entities.
    (913 verts adj to other iface ents)
Proc Proc 2: Success.
1: Success.
Proc 0Proc 3: Success.
: Success.
Finished ........................................
Running mpdallexit ...
done
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Actual output:
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Read times: 0.189004 3.38554e-05 1.19209e-06 0.028718 0.109702 Proc 2
owns 4004 3d entities.
0.558844 0.096935 (PARALLEL READ/PARALLEL CHECK_GIDS_SERIAL/PARALLEL
GET_FILESET_ENTS/PARALLEL BROADCAST/PARALLEL DELETE NONLOCAL/PARALLEL
RESOLVE_SHARED_ENTS/PARALLEL EXCHANGE_GHOSTS/)
Proc 3 owns 4004 3d entities.
Proc 1 owns 4004 3d entities.
Proc 0 owns 4004 3d entities.
Total # owned regions = 16016
Times: 0.99444 0.994439 1.26849e+09 (total/read/delete)


====================================================================================================================
N=8,np=4
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Batch output:
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Using MPI from /home/karpeev/fathom/mpi2/mpich2-1.2.1/icc
Running on 1 nodes
4 cores per node, one MPI process per core
for a total of 4 MPI processes
.................................................
PBS nodefile:
n018
n018
n018
n018
.................................................
mpd ring nodefile:
n018:4
.................................................
Running mpdboot ...
done
Running mpdtrace ...
n018
done
Running mpdringtest ...
time for 1 loops = 0.000272989273071 seconds
done
Using MPIEXEC_CMD=/home/karpeev/fathom/mpi2/mpich2-1.2.1/icc/bin/mpiexec -n 4
Commencing parallel run tjunc6_8 of executable ../mbparead
Proc 0: Success.
Proc 3: Success.
Proc 1: Success.
Proc 2: Success.
Finished ........................................
Running mpdallexit ...
done
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Actual output:
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Times: 0.000132084 -1.26858e+09 1.26858e+09 (total/read/delete)


====================================================================================================================
N=8,np=8
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Batch output:
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Using MPI from /home/karpeev/fathom/mpi2/mpich2-1.2.1/icc
Running on 2 nodes
4 cores per node, one MPI process per core
for a total of 8 MPI processes
.................................................
PBS nodefile:
n030
n030
n030
n030
n032
n032
n032
n032
.................................................
mpd ring nodefile:
n030:4
n032:4
.................................................
Running mpdboot ...
=>> PBS: job killed: walltime 332 exceeded limit 300
30216
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Actual output:
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
<No output, as the run times out>

