[mpich-discuss] Errors related to the increased number of tasks

Bernard Chambon bernard.chambon at cc.in2p3.fr
Mon Jan 2 08:29:23 CST 2012


Hello,


On 27 Dec 2011, at 06:52, Pavan Balaji wrote:

> 
> On 12/17/2011 02:56 AM, Bernard Chambon wrote:
>>> mpiexec -np 160 bin/basic_test
>> Assertion failed in file
>> /scratch/BC/mpich2-1.4.1p1/src/util/wrappers/mpiu_shm_wrappers.h at line
>> 889: seg_sz > 0
> 
> Looks like the shared memory is bombing out.  Can you run mpiexec with the -verbose option and also send us the machine file that you are using (or is it all on a single node)?
> 
> -- Pavan
> 
> -- 
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji



I ran my test again (MPI_Init + MPI_Comm_rank + MPI_Comm_size + MPI_Finalize) on a single node,
after:
 1/ increasing __FD_SETSIZE (1024 -> 8192) and recompiling mpich2 1.4

>grep -E "#define\W+__FD_SETSIZE" /usr/include/*.h /usr/include/*/*.h
/usr/include/bits/typesizes.h:#define	__FD_SETSIZE          8192	
/usr/include/linux/posix_types.h:#define __FD_SETSIZE	 8192	

 2/ asking my sysadmin to increase some limits

>limit
cputime      unlimited
filesize     unlimited
datasize     unlimited
stacksize    10240 kbytes
coredumpsize unlimited
memoryuse    unlimited
vmemoryuse   unlimited
descriptors  4096 
memorylocked 32 kbytes
maxproc      409600 

>more /proc/sys/kernel/shmall
8388608
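
To make the numbers above easier to compare, here is a minimal sketch (Python, standard library only) that collects the same limits from inside a process. The helper names are hypothetical and not part of mpich2. Note that kernel.shmall is counted in pages, so 8388608 pages would be 32 GiB assuming 4096-byte pages. Whether any of these limits is what actually triggers the assertion is not established here.

```python
import os
import resource


def read_int(path):
    """Return the integer stored in a /proc file, or None if unreadable."""
    try:
        with open(path) as handle:
            return int(handle.read().split()[0])
    except (OSError, ValueError, IndexError):
        return None


def pages_to_bytes(pages, page_size):
    """Convert a page count (the unit of kernel.shmall) to bytes."""
    return pages * page_size


def collect_limits():
    """Gather the per-process and system-wide limits listed above."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    shmall = read_int("/proc/sys/kernel/shmall")
    return {
        "descriptors (soft, hard)": resource.getrlimit(resource.RLIMIT_NOFILE),
        "memorylocked bytes (soft, hard)": resource.getrlimit(resource.RLIMIT_MEMLOCK),
        "maxproc (soft, hard)": resource.getrlimit(resource.RLIMIT_NPROC),
        "page size": page_size,
        "shmall pages": shmall,
        "shmall bytes": None if shmall is None else pages_to_bytes(shmall, page_size),
        "shmmax bytes": read_int("/proc/sys/kernel/shmmax"),
    }


if __name__ == "__main__":
    for name, value in collect_limits().items():
        print(f"{name}: {value}")
```

Running it under the same account and shell that launches mpiexec shows the limits the MPI processes would inherit.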

I get the same error when reaching a limit of around 160 tasks (it works with, say, 150 tasks).
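
Since the failure appears somewhere between 150 and 160 tasks, the exact threshold could be located by bisection. This is a minimal sketch, assuming the failure is monotonic in the task count and that mpiexec exits with a non-zero status when the assertion fires (both are assumptions, not verified). The function names are hypothetical.

```python
import subprocess


def run_ok(ntasks, program="bin/basic_test"):
    """Return True if mpiexec exits with status 0 for this task count.

    Assumption: a failed assertion makes mpiexec return a non-zero status.
    """
    result = subprocess.run(
        ["mpiexec", "-np", str(ntasks), program],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def largest_working(low, high, works=run_ok):
    """Binary search for the largest task count in [low, high) that works.

    Precondition (assumed, not checked): works(low) is True, works(high) is
    False, and every count below the threshold works.
    """
    while high - low > 1:
        middle = (low + high) // 2
        if works(middle):
            low = middle
        else:
            high = middle
    return low
```

With the bounds reported here, largest_working(150, 160) would launch mpiexec at most four times. Passing a different `works` callable lets the search be tried without MPI installed.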

> mpiexec -verbose -np 160 bin/advance_test

….

[proxy:0:0@ccwpge0001] got pmi command (from 114): get_my_kvsname

[proxy:0:0@ccwpge0001] PMI response: cmd=my_kvsname kvsname=kvs_10405_0
[proxy:0:0@ccwpge0001] got pmi command (from 8): barrier_in
[proxy:0:0@ccwpge0001] got pmi command (from 45): barrier_in
[proxy:0:0@ccwpge0001] got pmi command (from 84): barrier_in
[proxy:0:0@ccwpge0001] got pmi command (from 114): get
kvsname=kvs_10405_0 key=PMI_process_mapping 
[proxy:0:0@ccwpge0001] PMI response: cmd=get_result rc=0 msg=success value=(vector,(0,1,1))
Assertion failed in file /scratch/BC/mpich2-1.4.1p1/src/util/wrappers/mpiu_shm_wrappers.h at line 889: seg_sz > 0
internal ABORT - process 0
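
With 160 tasks the verbose output is long, so a small script can help summarise which clients got how far before the abort. This is a minimal sketch, assuming the log keeps the "got pmi command (from N): CMD" shape shown above; the meaning of N (presumably an identifier the proxy uses for each process) is an assumption, and the function names are hypothetical.

```python
import re

# Accepts both "proxy:0:0@host" and the archived "proxy:0:0 at host" form.
PMI_COMMAND = re.compile(
    r"^\[proxy:[^\]]*\] got pmi command \(from (\d+)\): (\w+)"
)


def commands_by_client(lines):
    """Map each PMI client id to the list of commands it sent, in order."""
    seen = {}
    for line in lines:
        match = PMI_COMMAND.match(line.strip())
        if match:
            seen.setdefault(int(match.group(1)), []).append(match.group(2))
    return seen


def clients_at_barrier(lines):
    """Return the sorted client ids that sent barrier_in."""
    return sorted(
        client
        for client, commands in commands_by_client(lines).items()
        if "barrier_in" in commands
    )


if __name__ == "__main__":
    excerpt = [
        "[proxy:0:0 at ccwpge0001] got pmi command (from 114): get_my_kvsname",
        "[proxy:0:0 at ccwpge0001] PMI response: cmd=my_kvsname kvsname=kvs_10405_0",
        "[proxy:0:0 at ccwpge0001] got pmi command (from 8): barrier_in",
        "[proxy:0:0 at ccwpge0001] got pmi command (from 45): barrier_in",
        "[proxy:0:0 at ccwpge0001] got pmi command (from 84): barrier_in",
        "[proxy:0:0 at ccwpge0001] got pmi command (from 114): get",
    ]
    print(clients_at_barrier(excerpt))
    print(commands_by_client(excerpt)[114])
```

Feeding it the full output of mpiexec -verbose for a working run (150 tasks) and a failing run (160 tasks) would show whether the abort happens before every client has reached the barrier.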

Best regards, and happy new year

PS:
 To be clear, the purpose of this test is to understand why there is such a limit and,
 more precisely, how that limit relates to the machine, user, and software configuration.

---------------
Bernard CHAMBON
IN2P3 / CNRS
04 72 69 42 18
