[mpich-discuss] MPICH2 (or MPI_Init) limitation | scalability
Darius Buntinas
buntinas at mcs.anl.gov
Wed Jan 11 11:49:25 CST 2012
I think I found the problem. Apply this patch (using "patch -p0 < seg_sz.patch"), then "make clean; make; make install", and try it again. Make sure to relink your application.
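In full, assuming the source tree from your error message (/scratch/BC/mpich2-1.4.1p1) and that you save seg_sz.patch at its top level, the steps would be:

cd /scratch/BC/mpich2-1.4.1p1
patch -p0 < seg_sz.patch
make clean
make
make install

then relink bin/my_test against the newly installed library.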
Let us know if this works.
Thanks,
-d
-------------- next part --------------
A non-text attachment was scrubbed...
Name: seg_sz.patch
Type: application/octet-stream
Size: 3474 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20120111/318d9e9b/attachment.obj>
-------------- next part --------------
On Jan 11, 2012, at 1:36 AM, Bernard Chambon wrote:
> Hi,
>
> On Jan 10, 2012, at 19:20, Darius Buntinas wrote:
>
>> I think Dave has the right idea. You may not have enough shared memory available to support that many processes. MPICH2 allocates shared memory in one of two ways: System V shared memory or mmap. System V typically has very low limits on the size of shared memory regions, so we use mmap by default. To make sure mmap is being used, send us the output of:
>>
>> grep "shared memory" src/mpid/ch3/channels/nemesis/config.log
>>
>> Thanks
>
> Yes, mmap is used:
>
> >grep "shared memory" src/mpid/ch3/channels/nemesis/config.log
> configure:7220: Using a memory-mapped file for shared memory
>
> The bad news is that the shm* parameters have NO influence:
> I always get the failure when reaching 153 tasks, even after increasing the values by 8.
>
> >sysctl -A | egrep "sem|shm"
> vm.hugetlb_shm_group = 0
> kernel.sem = 250 32000 32 128
> kernel.shmmni = 4096
> kernel.shmall = 2097152
> kernel.shmmax = 33554432
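That is what I would expect with mmap: the kernel.shm* settings above only limit System V segments. With shmmax at 32 MB, a System V request larger than that would be refused, while an mmap of a temporary file is not affected by these sysctls at all. Here is a minimal standalone illustration of the difference (not the actual nemesis code; the 64 MB size is made up):

/* Minimal illustration (not the actual nemesis code) of the two
 * shared-memory mechanisms.  The kernel.shm* sysctls only bound the
 * System V path; an mmap of a temporary file is governed by file-system
 * space and address space instead, which is why tuning shmmax/shmall
 * changes nothing when MPICH2 is configured to use a memory-mapped file. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/mman.h>

int main(void)
{
    size_t seg_sz = 64 * 1024 * 1024;    /* made-up 64 MB segment */

    /* System V: with shmmax = 33554432 (32 MB) this request is refused. */
    int shmid = shmget(IPC_PRIVATE, seg_sz, IPC_CREAT | 0600);
    if (shmid == -1)
        perror("shmget");                /* expect EINVAL here */
    else
        shmctl(shmid, IPC_RMID, NULL);

    /* mmap of a temporary file: unaffected by the shm* sysctls. */
    char tmpl[] = "/tmp/shm_demo_XXXXXX";
    int fd = mkstemp(tmpl);
    if (fd == -1) { perror("mkstemp"); return 1; }
    unlink(tmpl);                        /* file goes away on close */
    if (ftruncate(fd, (off_t)seg_sz) == -1) { perror("ftruncate"); return 1; }
    void *p = mmap(NULL, seg_sz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    printf("mmap of %zu bytes succeeded\n", seg_sz);
    munmap(p, seg_sz);
    close(fd);
    return 0;
}

On your machine the shmget call should fail with EINVAL because 64 MB exceeds shmmax, while the mmap succeeds as long as /tmp has the space.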
>
> >mpiexec -genvall -profile -np 152 bin/my_test ; echo $status
>
> ================================================================================
> [mpiexec at ccwpge0062] Number of PMI calls seen by the server: 306
> ================================================================================
>
> 0
>
> >mpiexec -genvall -profile -np 153 bin/my_test
> Assertion failed in file /scratch/BC/mpich2-1.4.1p1/src/util/wrappers/mpiu_shm_wrappers.h at line 889: seg_sz > 0
> internal ABORT - process 0
> [proxy:0:0 at ccwpge0062] send_cmd_downstream (./pm/pmiserv/pmip_pmi_v1.c:80): assert (!closed) failed
> [proxy:0:0 at ccwpge0062] fn_get (./pm/pmiserv/pmip_pmi_v1.c:349): error sending PMI response
> [proxy:0:0 at ccwpge0062] pmi_cb (./pm/pmiserv/pmip_cb.c:327): PMI handler returned error
> ...
>
> Lowering the shm* values (e.g. by 16) also has no influence.
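Right, with mmap those sysctls are simply not consulted, so raising or lowering them cannot change anything. What the assertion says is that the segment size nemesis computed came out as zero or negative before it ever tried to create the mapping. Purely to illustrate how a size computation can go non-positive (this is not the code from mpiu_shm_wrappers.h, and the per-process size is invented), here is a sketch of a signed 32-bit sum wrapping once enough local processes are added:

/* Illustration only, not the code from mpiu_shm_wrappers.h.  One way a
 * "seg_sz > 0" assertion can fire: the per-node segment size is a sum of
 * per-process contributions, and if that sum lands in a signed 32-bit
 * integer it typically wraps negative once enough local processes are
 * added.  The per-process size below is made up. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const int64_t per_proc = 14 * 1024 * 1024;    /* hypothetical per-process share */
    int nprocs;

    for (nprocs = 140; nprocs <= 160; nprocs += 4) {
        int64_t seg_sz64 = per_proc * nprocs;     /* wide enough, stays positive */
        int32_t seg_sz32 = (int32_t)seg_sz64;     /* truncated, wraps negative */
        printf("np=%3d  64-bit seg_sz=%12lld  32-bit seg_sz=%12d  %s\n",
               nprocs, (long long)seg_sz64, (int)seg_sz32,
               seg_sz32 > 0 ? "ok" : "would trip assert(seg_sz > 0)");
    }
    return 0;
}

The exact crossover depends on the real per-process size; the point is only that this kind of arithmetic failure produces exactly this assertion no matter what the kernel shm* limits are set to.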
>
> Thanks,
>
> ---------------
> Bernard CHAMBON
> IN2P3 / CNRS
> 04 72 69 42 18
>
> _______________________________________________
> mpich-discuss mailing list mpich-discuss at mcs.anl.gov
> To manage subscription options or unsubscribe:
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss