[mpich-discuss] Errors related to the increased number of tasks
Bernard Chambon
chambon at cc.in2p3.fr
Sat Dec 17 02:56:01 CST 2011
Hello,
Thank you Gustavo and thank you Dave for your interest in my problem.
On 16 Dec 2011, at 16:35, Gustavo Correa wrote:
> Hi Bernard
>
> Am I mistaken, or does your main routine perhaps call only
> MPI_Init?
> Your main seems to call only 'basicTest', but not 'rank',
> where other MPI routines appear.
>
> The MPICH2 developers may shed some light here,
> but I think MPI_Init alone doesn't compose a minimal MPI program.
> You need at least MPI_Finalize, I guess.
> Or not?
>
No, you probably missed an opening brace "{" (due to my bad indentation, sorry).
In fact my basicTest function includes: MPI_Init + MPI_Comm_rank + MPI_Comm_size + MPI_Finalize
> Also, not related to your C program, but
> since you are in Linux, why did you choose g77 to compile the Fortran-77 bindings,
> and f95 [is this g95?] to compile the Fortran-90 bindings of MPICH2?
> g77 is quite old, I have been luckier using gfortran to compile both
> the Fortran 77 and 90 bindings.
mpich2 1.0.x was the software already installed on our machines, not by me,
so I can't tell you, at the current time, why g77 was chosen.
(The try with mpich2 1.0.x was just a reference to show that I can run more than 512 tasks.)
What I can tell you is that, after compiling the latest mpich2 version (1.4.1p1),
I encountered the failure when the number of tasks reached ~160.
I checked the number of file descriptors and also the size of shared memory (SHMALL).
Here are the values:
>mpich2version
MPICH2 Version: 1.4.1p1
MPICH2 Release date: Thu Sep 1 13:53:02 CDT 2011
MPICH2 Device: ch3:nemesis
MPICH2 configure: --prefix=//scratch/BC/mpich2-1.4
MPICH2 CC: /usr/bin/gcc -m64 -O2
MPICH2 CXX: c++ -m64 -O2
MPICH2 F77: /usr/bin/f77 -O2
MPICH2 FC: f95
>limit
cputime unlimited
filesize unlimited
datasize unlimited
stacksize 10240 kbytes
coredumpsize unlimited
memoryuse unlimited
vmemoryuse unlimited
descriptors 4096
memorylocked 32 kbytes
maxproc 409600
>more /proc/sys/kernel/shmall
8388608
Here is the test
>mpiexec -np 150 bin/basic_test
Running 150 tasks
>mpiexec -np 160 bin/basic_test
Assertion failed in file /scratch/BC/mpich2-1.4.1p1/src/util/wrappers/mpiu_shm_wrappers.h at line 889: seg_sz > 0
internal ABORT - process 0
[proxy:0:0 at ccwpge0001] send_cmd_downstream (./pm/pmiserv/pmip_pmi_v1.c:80): assert (!closed) failed
[proxy:0:0 at ccwpge0001] fn_get (./pm/pmiserv/pmip_pmi_v1.c:349): error sending PMI response
[proxy:0:0 at ccwpge0001] pmi_cb (./pm/pmiserv/pmip_cb.c:327): PMI handler returned error
[proxy:0:0 at ccwpge0001] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0 at ccwpge0001] main (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
[mpiexec at ccwpge0001] control_cb (./pm/pmiserv/pmiserv_cb.c:215): assert (!closed) failed
[mpiexec at ccwpge0001] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec at ccwpge0001] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:181): error waiting for event
[mpiexec at ccwpge0001] main (./ui/mpich/mpiexec.c:405): process manager error waiting for completion
As Dave pointed out in a more recent mail, perhaps I must increase __FD_SETSIZE and recompile mpich2,
but I have to ask my sysadmin for that!
The OS and CPU are:
>uname -a
Linux ccwpge0001 2.6.18-238.12cc.el5 #1 SMP Thu Mar 3 12:19:21 CET 2011 x86_64 x86_64 x86_64 GNU/Linux
>grep ... /proc/cpuinfo
model name : Intel(R) Xeon(R) CPU E5540 @ 2.53GHz
Best regards
---------------
Bernard CHAMBON
IN2P3 / CNRS
04 72 69 42 18