[mpich-discuss] MPICH2 (or MPI_Init) limitation | scalability

Darius Buntinas buntinas at mcs.anl.gov
Wed Jan 11 11:49:25 CST 2012


I think I found the problem.  Apply this patch (using "patch -p0 < seg_sz.patch"), then "make clean; make; make install", and try it again.  Make sure to relink your application.

Let us know if this works.

Thanks,
-d
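
For context on the failure this patch targets: the assertion quoted further down ("seg_sz > 0" in mpiu_shm_wrappers.h) indicates that the computed shared-memory segment size came out non-positive once enough processes were launched. One plausible way that happens is a 32-bit size computation wrapping past INT_MAX; the sketch below is purely illustrative, uses a hypothetical per-process size, and is not MPICH2's actual computation.

    /* Purely illustrative: how a shared-memory segment size can go
     * non-positive as the process count grows, if the byte count ends
     * up in a signed 32-bit integer.  Hypothetical numbers; not the
     * MPICH2 code that seg_sz.patch fixes. */
    #include <stdio.h>

    int main(void)
    {
        unsigned per_proc_bytes = 16u * 1024 * 1024;  /* hypothetical 16 MB */
        for (int nprocs = 120; nprocs <= 160; nprocs += 8) {
            /* Unsigned arithmetic wraps modulo 2^32; converting the
             * result to int yields a negative size past INT_MAX. */
            int seg_sz = (int)((unsigned)nprocs * per_proc_bytes);
            printf("nprocs = %3d  seg_sz = %11d  %s\n", nprocs, seg_sz,
                   seg_sz > 0 ? "ok" : "<-- \"seg_sz > 0\" would fail");
        }
        return 0;
    }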

-------------- next part --------------
A non-text attachment was scrubbed...
Name: seg_sz.patch
Type: application/octet-stream
Size: 3474 bytes
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20120111/318d9e9b/attachment.obj>
-------------- next part --------------

On Jan 11, 2012, at 1:36 AM, Bernard Chambon wrote:

> Hi,
> 
> On Jan 10, 2012, at 7:20 PM, Darius Buntinas wrote:
> 
>> I think Dave has the right idea.  You may not have enough shared memory available to support that many processes.  There are two ways MPICH2 can allocate shared memory: System V or mmap.  System V typically has very low limits on the size of shared-memory regions, so we use mmap by default.  To make sure mmap is being used, send us the output of:
>> 
>> grep "shared memory" src/mpid/ch3/channels/nemesis/config.log
>> 
>> Thanks
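
As background for the exchange below: the kernel.shm* sysctls cap only System V segments (shmget checks the requested size against kernel.shmmax), while an mmap of a regular file is not subject to them at all. A minimal sketch contrasting the two allocation styles, with the caveat that this is not MPICH2's actual code:

    /* Sketch of the two shared-memory styles described above; not the
     * MPICH2 source.  The System V request is checked against
     * kernel.shmmax at shmget() time; the file-backed mmap is limited
     * only by the filesystem and address space. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <sys/mman.h>

    #define SEG_SZ (64 * 1024)

    int main(void)
    {
        /* System V: fails with EINVAL if SEG_SZ exceeds kernel.shmmax. */
        int shmid = shmget(IPC_PRIVATE, SEG_SZ, IPC_CREAT | 0600);
        if (shmid >= 0) {
            void *p = shmat(shmid, NULL, 0);
            if (p != (void *)-1)
                shmdt(p);
            shmctl(shmid, IPC_RMID, NULL);      /* release the segment */
        } else {
            perror("shmget");
        }

        /* mmap: file-backed mapping; the shm* sysctls play no role. */
        char path[] = "/tmp/shm_demo_XXXXXX";
        int fd = mkstemp(path);
        if (fd >= 0) {
            if (ftruncate(fd, SEG_SZ) == 0) {
                void *p = mmap(NULL, SEG_SZ, PROT_READ | PROT_WRITE,
                               MAP_SHARED, fd, 0);
                if (p != MAP_FAILED)
                    munmap(p, SEG_SZ);
            }
            close(fd);
            unlink(path);                       /* remove the backing file */
        }
        return 0;
    }

This is also why, as Bernard reports below, raising or lowering the shm* values makes no difference once the mmap path is in use.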
> 
> yes mmap is used
> 
> >grep "shared memory" src/mpid/ch3/channels/nemesis/config.log
> configure:7220: Using a memory-mapped file for shared memory
> 
> The bad news is that the shm* parameters have NO influence:
> I always get a failure at 153 tasks, even after increasing the values by a factor of 8.
> 
>  >sysctl -A | egrep "sem|shm"
> vm.hugetlb_shm_group = 0
> kernel.sem = 250	32000	32	128
> kernel.shmmni = 4096
> kernel.shmall = 2097152
> kernel.shmmax = 33554432
> 
> >mpiexec -genvall -profile -np 152 bin/my_test ; echo $status
> 
> ================================================================================
> [mpiexec@ccwpge0062] Number of PMI calls seen by the server: 306
> ================================================================================
> 
> 0
> 
> >mpiexec -genvall -profile -np 153 bin/my_test
> Assertion failed in file /scratch/BC/mpich2-1.4.1p1/src/util/wrappers/mpiu_shm_wrappers.h at line 889: seg_sz > 0
> internal ABORT - process 0
> [proxy:0:0@ccwpge0062] send_cmd_downstream (./pm/pmiserv/pmip_pmi_v1.c:80): assert (!closed) failed
> [proxy:0:0@ccwpge0062] fn_get (./pm/pmiserv/pmip_pmi_v1.c:349): error sending PMI response
> [proxy:0:0@ccwpge0062] pmi_cb (./pm/pmiserv/pmip_cb.c:327): PMI handler returned error
> ...
> 
> Lowering the shm* values (e.g. by a factor of 16) also has no influence.
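
(bin/my_test itself is not shown in the thread.  Since the assertion fires during startup, a minimal reproducer needs nothing beyond MPI_Init and MPI_Finalize; the following is an assumption about the shape of such a test, not Bernard's actual program.)

    /* Minimal reproducer sketch: the failure occurs inside MPI_Init,
     * so an otherwise empty MPI program should hit it at -np 153 on an
     * unpatched build.  This is an assumed stand-in for bin/my_test. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (rank == 0)
            printf("started %d processes\n", size);

        MPI_Finalize();
        return 0;
    }

Built with mpicc and launched as above, this should succeed at -np 152 and trip the same "seg_sz > 0" assertion at -np 153.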
> 
> Thanks,
> 
> ---------------
> Bernard CHAMBON
> IN2P3 / CNRS
> 04 72 69 42 18
> 
> _______________________________________________
> mpich-discuss mailing list     mpich-discuss at mcs.anl.gov
> To manage subscription options or unsubscribe:
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss


