[mpich-discuss] Errors related to the increased number of tasks

Bernard Chambon chambon at cc.in2p3.fr
Sat Dec 17 02:56:01 CST 2011


Hello,

Thank you Gustavo and thank you Dave for your interest in my problem.


On 16 Dec 2011, at 16:35, Gustavo Correa wrote:

> Hi Bernard
> 
> Am I mistaken, or does your main routine perhaps call only 
> MPI_Init?
> Your main seems to call only 'basicTest', but not 'rank',
> where  other MPI routines appear.
> 
> The MPICH2 developers may shed some light here,
> but I think MPI_Init alone doesn't compose a minimal MPI program.
> You need at least MPI_Finalize, I guess.
> Or not?
> 

No, you probably missed an opening brace "{" (due to my bad indentation, sorry).
In fact my basicTest function includes: MPI_Init + MPI_Comm_rank + MPI_Comm_size + MPI_Finalize

> Also, not related to your C program, but 
> since you are in Linux, why did you choose g77 to compile the Fortran-77 bindings,
> and f95 [is this g95?] to compile the Fortran-90 bindings of MPICH2?
> g77 is quite old, I have been luckier using gfortran to compile both 
> the Fortran 77 and 90 bindings.

mpich2 1.0.x was the software already installed on our machines, not by me,
so I can't tell you, at the moment, why g77 was used.
(The try with mpich2 1.0.x was just a reference to show that I can run more than 512 tasks.)


What I can tell you is that, after compiling the latest mpich2 version (1.4.1p1),
I encountered the failure when the number of tasks reached ~160.
I checked the number of file descriptors and also the size of shared memory (SHMALL).
Here are the values:

>mpich2version
MPICH2 Version:    	1.4.1p1
MPICH2 Release date:	Thu Sep  1 13:53:02 CDT 2011
MPICH2 Device:    	ch3:nemesis
MPICH2 configure: 	--prefix=//scratch/BC/mpich2-1.4
MPICH2 CC: 	/usr/bin/gcc -m64   -O2
MPICH2 CXX: 	c++ -m64  -O2
MPICH2 F77: 	/usr/bin/f77   -O2
MPICH2 FC: 	f95  


 >limit
cputime      unlimited
filesize     unlimited
datasize     unlimited
stacksize    10240 kbytes
coredumpsize unlimited
memoryuse    unlimited
vmemoryuse   unlimited
descriptors  4096 
memorylocked 32 kbytes
maxproc      409600 


 >more  /proc/sys/kernel/shmall
8388608
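
For reference, the same values can be read from inside a C program. The sketch below is only an illustration and not part of basic_test; the function names (shm_pages_to_bytes, read_shmall, print_limits) are hypothetical. Note that /proc/sys/kernel/shmall is counted in pages, not bytes, so with the usual 4 KiB page size on x86_64 the value 8388608 corresponds to 32 GiB.

```c
#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>

/* Convert a shmall value (counted in pages) into bytes. */
unsigned long long shm_pages_to_bytes(unsigned long long pages,
                                      unsigned long long page_size)
{
    return pages * page_size;
}

/* Read /proc/sys/kernel/shmall. Returns 0 on success, -1 on failure. */
int read_shmall(unsigned long long *pages)
{
    FILE *f = fopen("/proc/sys/kernel/shmall", "r");
    int ok;

    if (f == NULL)
        return -1;
    ok = (fscanf(f, "%llu", pages) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Print the descriptor limit and the shared memory limit of this process. */
void print_limits(void)
{
    struct rlimit rl;
    unsigned long long pages;
    long page_size = sysconf(_SC_PAGESIZE);

    /* Same information as the "descriptors" line of the shell's limit command. */
    if (getrlimit(RLIMIT_NOFILE, &rl) == 0)
        printf("descriptors (soft/hard): %llu / %llu\n",
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);

    /* shmall is in pages, so multiply by the page size to get bytes. */
    if (page_size > 0 && read_shmall(&pages) == 0)
        printf("shmall: %llu pages = %llu bytes\n",
               pages, shm_pages_to_bytes(pages, (unsigned long long)page_size));
}
```

Calling print_limits() from a small main() built with the same gcc should report the limits the MPI processes inherit when they are started from that shell.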


Here is the test

 >mpiexec -np 150  bin/basic_test
Running 150 tasks 

 >mpiexec -np 160 bin/basic_test
Assertion failed in file /scratch/BC/mpich2-1.4.1p1/src/util/wrappers/mpiu_shm_wrappers.h at line 889: seg_sz > 0
internal ABORT - process 0
[proxy:0:0@ccwpge0001] send_cmd_downstream (./pm/pmiserv/pmip_pmi_v1.c:80): assert (!closed) failed
[proxy:0:0@ccwpge0001] fn_get (./pm/pmiserv/pmip_pmi_v1.c:349): error sending PMI response
[proxy:0:0@ccwpge0001] pmi_cb (./pm/pmiserv/pmip_cb.c:327): PMI handler returned error
[proxy:0:0@ccwpge0001] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@ccwpge0001] main (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
[mpiexec@ccwpge0001] control_cb (./pm/pmiserv/pmiserv_cb.c:215): assert (!closed) failed
[mpiexec@ccwpge0001] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@ccwpge0001] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:181): error waiting for event
[mpiexec@ccwpge0001] main (./ui/mpich/mpiexec.c:405): process manager error waiting for completion


As Dave pointed out in a more recent mail, perhaps I must increase __FD_SETSIZE and recompile mpich2,
but I have to ask my sysadmin for that!
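
As background on that suggestion: FD_SETSIZE is a compile-time constant that bounds the descriptor numbers usable with select(), independently of the runtime "descriptors" limit shown above. The sketch below only illustrates that bound; the helper names are hypothetical, and whether this limit is really what triggers the seg_sz assertion is not established here.

```c
#include <stdio.h>
#include <sys/select.h>

/* Return 1 if fd can legally be stored in an fd_set, 0 otherwise.
 * Descriptors numbered FD_SETSIZE or higher must not be passed to FD_SET,
 * even if the process is allowed to open that many files. */
int fd_fits_in_fd_set(int fd)
{
    return fd >= 0 && fd < FD_SETSIZE;
}

/* Print the value this program was compiled with. */
void print_fd_setsize(void)
{
    printf("FD_SETSIZE = %d\n", (int)FD_SETSIZE);
}
```

If print_fd_setsize() reports a value below the 4096 descriptors allowed by the shell, any code path relying on select() is capped by the smaller number, which is why raising it requires a recompile rather than a change of limits.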

The OS and CPU are : 

>uname -a 
Linux ccwpge0001 2.6.18-238.12cc.el5 #1 SMP Thu Mar 3 12:19:21 CET 2011 x86_64 x86_64 x86_64 GNU/Linux

grep ... /proc/cpuinfo
model name	: Intel(R) Xeon(R) CPU           E5540  @ 2.53GHz


Best regards

---------------
Bernard CHAMBON
IN2P3 / CNRS
04 72 69 42 18
