[mpich-discuss] Errors related to the increased number of tasks
Bernard Chambon
bernard.chambon at cc.in2p3.fr
Mon Jan 2 08:29:23 CST 2012
Hello,
On 27 Dec 2011, at 06:52, Pavan Balaji wrote:
>
> On 12/17/2011 02:56 AM, Bernard Chambon wrote:
>>> mpiexec -np 160 bin/basic_test
>> Assertion failed in file /scratch/BC/mpich2-1.4.1p1/src/util/wrappers/mpiu_shm_wrappers.h at line 889: seg_sz > 0
>
> Looks like the shared memory is bombing out. Can you run mpiexec with the -verbose option and also send us the machine file that you are using (or is it all on a single node)?
>
> -- Pavan
>
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji
I ran my test again (MPI_Init + MPI_Comm_rank + MPI_Comm_size + MPI_Finalize; a minimal sketch of the test is included after the trace below) on a single node, after:
1/ increasing __FD_SETSIZE (1024 -> 8192) and recompiling mpich2 1.4
>grep -E "#define\W+__FD_SETSIZE" /usr/include/*.h /usr/include/*/*.h
/usr/include/bits/typesizes.h:#define __FD_SETSIZE 8192
/usr/include/linux/posix_types.h:#define __FD_SETSIZE 8192
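As a sanity check, here is a trivial program (my own sketch, not part of the test itself) to confirm which FD_SETSIZE the compiler actually sees after patching the headers:

#include <stdio.h>
#include <sys/select.h>   /* pulls FD_SETSIZE in from the (patched) system headers */

int main(void)
{
    /* should print 8192 if the rebuild picked up the new value */
    printf("FD_SETSIZE = %d\n", FD_SETSIZE);
    return 0;
}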
2/ asking my sysadmin to increase some limits
>limit
cputime unlimited
filesize unlimited
datasize unlimited
stacksize 10240 kbytes
coredumpsize unlimited
memoryuse unlimited
vmemoryuse unlimited
descriptors 4096
memorylocked 32 kbytes
maxproc 409600
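To double-check from inside a process that the tasks really inherit the 4096-descriptor limit shown above, a small sketch of mine using the standard getrlimit(2) interface:

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;
    /* RLIMIT_NOFILE corresponds to the "descriptors" line above */
    if (getrlimit(RLIMIT_NOFILE, &rl) == 0)
        printf("descriptors: soft=%ld hard=%ld\n",
               (long)rl.rlim_cur, (long)rl.rlim_max);
    return 0;
}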
>more /proc/sys/kernel/shmall
8388608
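(If I read the kernel documentation correctly, shmall is counted in pages, so with 4 KB pages this allows 8388608 x 4096 bytes = 32 GB of System V shared memory in total; I mention it because the failing assertion below concerns a shared memory segment size.)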
I got the same error when reaching a limit of around 160 tasks (it's OK with, let's say, 150 tasks):
> mpiexec -verbose -np 160 bin/advance_test
….
[proxy:0:0 at ccwpge0001] got pmi command (from 114): get_my_kvsname
[proxy:0:0 at ccwpge0001] PMI response: cmd=my_kvsname kvsname=kvs_10405_0
[proxy:0:0 at ccwpge0001] got pmi command (from 8): barrier_in
[proxy:0:0 at ccwpge0001] got pmi command (from 45): barrier_in
[proxy:0:0 at ccwpge0001] got pmi command (from 84): barrier_in
[proxy:0:0 at ccwpge0001] got pmi command (from 114): get kvsname=kvs_10405_0 key=PMI_process_mapping
[proxy:0:0 at ccwpge0001] PMI response: cmd=get_result rc=0 msg=success value=(vector,(0,1,1))
Assertion failed in file /scratch/BC/mpich2-1.4.1p1/src/util/wrappers/mpiu_shm_wrappers.h at line 889: seg_sz > 0
internal ABORT - process 0
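For reference, a minimal sketch equivalent to my test (my reconstruction; the actual basic_test/advance_test sources may differ slightly):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}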
Best regards, and happy new year
PS:
To be clear, the purpose of this test is to understand why such a limit exists and, more precisely, what the relationship is between that limit and the machine/user/software configuration.
---------------
Bernard CHAMBON
IN2P3 / CNRS
04 72 69 42 18