[mpich-discuss] Errors related to the increased number of tasks
Gustavo Correa
gus at ldeo.columbia.edu
Sat Dec 17 16:15:48 CST 2011
Hi Bernard
Again, this may not be the cause of the problem,
but since you say it appears when you increase the number of
processes, it may be.
Have you tried increasing the stack size and the locked memory size?
The values you show (stacksize 10240 kbytes, memorylocked 32 kbytes) may be desktop-like defaults.
For instance, to make them unlimited you could add these lines to
/etc/security/limits.conf of your compute nodes [or ask the system administrator to do it]:
* - memlock -1
* - stack -1
* - nofile 4096
[Your number of file descriptors is already 4096 as above.]
Sooner or later your programs may hit the current limits anyway,
so this may be useful.
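To confirm what limits a process actually gets on a compute node, a small check along these lines may help. This is only a sketch, assuming Python 3 is available on the nodes; the names LIMITS, fmt, current_limits and report are my own, not part of MPICH2. Note that getrlimit reports the stack and memlock values in bytes, while the `limit` output you sent is in kbytes.

```python
import resource

# The three limits discussed above; the keys mirror the limits.conf item names.
LIMITS = {
    "stack": resource.RLIMIT_STACK,      # bytes
    "memlock": resource.RLIMIT_MEMLOCK,  # bytes
    "nofile": resource.RLIMIT_NOFILE,    # number of open file descriptors
}


def fmt(value):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits():
    """Return {name: (soft, hard)} for the limits of this process."""
    return {name: resource.getrlimit(res) for name, res in LIMITS.items()}


def report():
    """Build one printable line per limit."""
    return [
        f"{name:8s} soft={fmt(soft)} hard={fmt(hard)}"
        for name, (soft, hard) in current_limits().items()
    ]


if __name__ == "__main__":
    for line in report():
        print(line)
```

Limits can differ between an interactive login and a remotely started job, so it is worth running the check the same way your MPI jobs are launched, not only from a login shell.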
I hope this helps,
Gus Correa
On Dec 17, 2011, at 3:56 AM, Bernard Chambon wrote:
> Hello,
>
> Thank you Gustavo and thank you Dave for your interest in my problem,
>
>
> On Dec 16, 2011, at 4:35 PM, Gustavo Correa wrote:
>
>> Hi Bernard
>>
>> Am I mistaken, or does your main routine perhaps call only
>> MPI_Init?
>> Your main seems to call only 'basicTest', but not 'rank',
>> where other MPI routines appear.
>>
>> The MPICH2 developers may shed some light here,
>> but I think MPI_Init alone doesn't compose a minimal MPI program.
>> You need at least MPI_Finalize, I guess.
>> Or not?
>>
>
> No, you probably missed an opening brace "{" (due to my bad indentation, sorry).
> In fact my basicTest function includes: MPI_Init + MPI_Comm_rank + MPI_Comm_size + MPI_Finalize
>
>> Also, not related to your C program, but
>> since you are in Linux, why did you choose g77 to compile the Fortran-77 bindings,
>> and f95 [is this g95?] to compile the Fortran-90 bindings of MPICH2?
>> g77 is quite old; I have had better luck using gfortran to compile both
>> the Fortran 77 and 90 bindings.
>
> mpich2 1.0.x was the software installed on our machines, not by me,
> so I can't tell you, at the current time, why g77 was chosen.
> (The run with mpich2 1.0.x was just a reference to show that I can have more than 512 tasks.)
>
>
> What I can tell is that, after compiling the latest mpich2 version (1.4.1p1),
> I encountered the failure when the number of tasks reached ~160.
> I took care to check the number of file descriptors and also the size of shared memory (SHMALL).
> Here are the values :
>
> >mpich2version
> MPICH2 Version: 1.4.1p1
> MPICH2 Release date: Thu Sep 1 13:53:02 CDT 2011
> MPICH2 Device: ch3:nemesis
> MPICH2 configure: --prefix=//scratch/BC/mpich2-1.4
> MPICH2 CC: /usr/bin/gcc -m64 -O2
> MPICH2 CXX: c++ -m64 -O2
> MPICH2 F77: /usr/bin/f77 -O2
> MPICH2 FC: f95
>
>
> >limit
> cputime unlimited
> filesize unlimited
> datasize unlimited
> stacksize 10240 kbytes
> coredumpsize unlimited
> memoryuse unlimited
> vmemoryuse unlimited
> descriptors 4096
> memorylocked 32 kbytes
> maxproc 409600
>
>
> >more /proc/sys/kernel/shmall
> 8388608
>
>
> Here is the test
>
> >mpiexec -np 150 bin/basic_test
> Running 150 tasks
>
> >mpiexec -np 160 bin/basic_test
> Assertion failed in file /scratch/BC/mpich2-1.4.1p1/src/util/wrappers/mpiu_shm_wrappers.h at line 889: seg_sz > 0
> internal ABORT - process 0
> [proxy:0:0 at ccwpge0001] send_cmd_downstream (./pm/pmiserv/pmip_pmi_v1.c:80): assert (!closed) failed
> [proxy:0:0 at ccwpge0001] fn_get (./pm/pmiserv/pmip_pmi_v1.c:349): error sending PMI response
> [proxy:0:0 at ccwpge0001] pmi_cb (./pm/pmiserv/pmip_cb.c:327): PMI handler returned error
> [proxy:0:0 at ccwpge0001] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:0 at ccwpge0001] main (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
> [mpiexec at ccwpge0001] control_cb (./pm/pmiserv/pmiserv_cb.c:215): assert (!closed) failed
> [mpiexec at ccwpge0001] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
> [mpiexec at ccwpge0001] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:181): error waiting for event
> [mpiexec at ccwpge0001] main (./ui/mpich/mpiexec.c:405): process manager error waiting for completion
>
>
> As Dave pointed out in a more recent mail, perhaps I must increase __FD_SETSIZE and recompile mpich2,
> but I have to ask my sysadmin for that!
>
> The OS and CPU are :
>
> >uname -a
> Linux ccwpge0001 2.6.18-238.12cc.el5 #1 SMP Thu Mar 3 12:19:21 CET 2011 x86_64 x86_64 x86_64 GNU/Linux
>
> grep ... /proc/cpuinfo
> model name : Intel(R) Xeon(R) CPU E5540 @ 2.53GHz
>
>
> Best regards
>
> ---------------
> Bernard CHAMBON
> IN2P3 / CNRS
> 04 72 69 42 18
>