[mpich-discuss] Errors related to the increased number of tasks
Bernard Chambon
bernard.chambon at cc.in2p3.fr
Thu Jan 5 01:51:01 CST 2012
Hello
On Dec 27, 2011, at 06:52, Pavan Balaji wrote:
>
> Looks like the shared memory is bombing out. Can you run mpiexec with the -verbose option and also send us the machine file that you are using (or is it all on a single node)?
>
> -- Pavan
Here is another test (still showing the same failure).
1/ After removing resource limits on the Linux machine (SL5, Linux 2.6.x):
>limit
cputime unlimited
filesize unlimited
datasize unlimited
stacksize unlimited
coredumpsize unlimited
memoryuse unlimited
vmemoryuse unlimited
descriptors 1000000
memorylocked unlimited
maxproc 409600
>more /proc/sys/kernel/shmall
8388608000
2/ After increasing FD_SETSIZE and recompiling mpich2 1.4.1p1:
>grep -E "#define\W+__FD_SETSIZE" /usr/include/*.h /usr/include/*/*.h
/usr/include/bits/typesizes.h:#define __FD_SETSIZE 8192
/usr/include/linux/posix_types.h:#define __FD_SETSIZE 8192
I still get the same problem when trying to run a basic program with more than ~150 tasks (tested here with 170 tasks):
>mpich2version
MPICH2 Version: 1.4.1p1
MPICH2 Release date: Thu Sep 1 13:53:02 CDT 2011
MPICH2 Device: ch3:nemesis
MPICH2 configure: --prefix=//scratch/BC/mpich2-1.4
MPICH2 CC: /usr/bin/gcc -m64 -O2
MPICH2 CXX: c++ -m64 -O2
MPICH2 F77: /usr/bin/f77 -O2
MPICH2 FC: f95
>mpiexec -np 170 bin/advance_test
Assertion failed in file /scratch/BC/mpich2-1.4.1p1/src/util/wrappers/mpiu_shm_wrappers.h at line 889: seg_sz > 0
internal ABORT - process 0
Another interesting thing: the same basic code, run with an older release of mpich2 (1.0.8p1, using the mpd daemon, the default installation on our machines), runs without any failure:
>mpich2version
MPICH2 Version: 1.0.8p1
MPICH2 Release date: Unknown, built on Tue Apr 21 13:52:10 CEST 2009
MPICH2 Device: ch3:sock
MPICH2 configure: -prefix=/usr/local/mpich2
MPICH2 CC: gcc -O2
MPICH2 CXX: c++ -O2
MPICH2 F77: g77 -O2
MPICH2 F90: f95 -O2
>mpicc -O2 -o bin/advance_test advance_test.c
>mpdboot --ncpus=170
>mpiexec -np 170 bin/advance_test | more
Running 170 tasks
In slave tasks
In slave tasks
In slave tasks
In slave tasks
In slave tasks
In slave tasks
…
mpdallexit
The test code runs without failure.
If you ask why such a test: after installing mpich2 1.4.1p1
and running jobs through GridEngine, everything works fine when jobs request a small number of tasks,
but I get failures as the number of tasks increases
(for example, with 32 tasks 100% of jobs pass, while with 64 tasks 70% of jobs fail).
So at the moment I can't provide MPICH2 to our users.
Thank you for any help
PS: the basic test code, as a complete program
#include <stdio.h>
#include <unistd.h>   /* for sleep() */
#include "mpi.h"

int main(int argc, char **argv)
{
    if (MPI_Init(&argc, &argv) != MPI_SUCCESS) {
        printf("Error calling MPI_Init !!, exiting\n"); fflush(stdout);
        return 1;
    }

    int rank;
    if (MPI_Comm_rank(MPI_COMM_WORLD, &rank) != MPI_SUCCESS) {
        printf("Error calling MPI_Comm_rank !!, exiting\n"); fflush(stdout);
        MPI_Abort(MPI_COMM_WORLD, 1);
        return 1;
    }

    if (rank == 0) {
        int nprocs;
        if (MPI_Comm_size(MPI_COMM_WORLD, &nprocs) != MPI_SUCCESS) {
            printf("Error calling MPI_Comm_size !!, exiting\n"); fflush(stdout);
            MPI_Abort(MPI_COMM_WORLD, 1);
            return 1;
        }
        printf("Running %d tasks\n", nprocs); fflush(stdout);
        MPI_Finalize();
        return 0;
    } else {
        printf("In slave tasks\n"); fflush(stdout);
        sleep(1);
        MPI_Finalize();  /* the MPI standard requires every rank to call this */
        return 0;
    }
}
---------------
Bernard CHAMBON
IN2P3 / CNRS
04 72 69 42 18