[mpich-discuss] Errors related to the increased number of tasks

Bernard Chambon bernard.chambon at cc.in2p3.fr
Thu Jan 5 01:51:01 CST 2012


On 27 Dec 2011 at 06:52, Pavan Balaji wrote:

> Looks like the shared memory is bombing out.  Can you run mpiexec with the -verbose option and also send us the machine file that you are using (or is it all on a single node)?
> -- Pavan

Another test (which still shows the same failure):
 1/ after removing the resource limits on the Linux machine (SL5, Linux 2.6.x)
cputime      unlimited
filesize     unlimited
datasize     unlimited
stacksize    unlimited
coredumpsize unlimited
memoryuse    unlimited
vmemoryuse   unlimited
descriptors  1000000 
memorylocked unlimited
maxproc      409600 

>more /proc/sys/kernel/shmall

 2/ after increasing FD_SETSIZE and recompiling mpich2 1.4.1p1

>grep -E "#define\W+__FD_SETSIZE" /usr/include/*.h /usr/include/*/*.h
/usr/include/bits/typesizes.h:#define	__FD_SETSIZE          8192	
/usr/include/linux/posix_types.h:#define __FD_SETSIZE	 8192	

I still get the same problem when trying to run a basic program with more than ~150 tasks (here, with 170 tasks).

MPICH2 Version:    	1.4.1p1
MPICH2 Release date:	Thu Sep  1 13:53:02 CDT 2011
MPICH2 Device:    	ch3:nemesis
MPICH2 configure: 	--prefix=//scratch/BC/mpich2-1.4
MPICH2 CC: 	/usr/bin/gcc -m64   -O2
MPICH2 CXX: 	c++ -m64  -O2
MPICH2 F77: 	/usr/bin/f77   -O2
MPICH2 FC: 	f95  

>mpiexec -np 170 bin/advance_test
Assertion failed in file /scratch/BC/mpich2-1.4.1p1/src/util/wrappers/mpiu_shm_wrappers.h at line 889: seg_sz > 0
internal ABORT - process 0

Another interesting point: the same basic code, run with an older release of mpich2 (1.0.8p1, using the mpd daemon, the default installation on our machines), runs without any failure.

MPICH2 Version:    	1.0.8p1
MPICH2 Release date:	Unknown, built on Tue Apr 21 13:52:10 CEST 2009
MPICH2 Device:    	ch3:sock
MPICH2 configure: 	-prefix=/usr/local/mpich2
MPICH2 CC: 	gcc  -O2
MPICH2 CXX: 	c++  -O2
MPICH2 F77: 	g77  -O2
MPICH2 F90: 	f95  -O2

>mpicc -O2 -o bin/advance_test advance_test.c
>mpdboot --ncpus=170
>mpiexec -np 170 bin/advance_test | more
Running 170 tasks 
In slave tasks 
In slave tasks 
In slave tasks 
In slave tasks 
In slave tasks 
In slave tasks 

The test code runs without failure.

If you ask why I am running such a test: after installing mpich2 1.4.1p1
and running jobs through GridEngine, everything works fine when jobs request a small number of tasks,
but I get failures as the number of tasks increases
(for example, with 32 tasks 100% of jobs pass; with 64 tasks, 70% of jobs fail).

So at the moment, I can't provide MPICH2 to our users.

Thank you for any help


PS: the basic test code
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
 if (MPI_Init(&argc, &argv) != MPI_SUCCESS ) {
  printf("Error calling MPI_Init !!, exiting \n") ; fflush(stdout);
  return 1;
 }

 int rank;
 if ( MPI_Comm_rank(MPI_COMM_WORLD, &rank)!= MPI_SUCCESS ) {
  printf("Error calling  MPI_Comm_rank !!, exiting \n") ; fflush(stdout);
  MPI_Abort(MPI_COMM_WORLD, 1);
 }
 if (rank == 0) {
  /* Only rank 0 reports the total number of tasks */
  int nprocs;
  if (MPI_Comm_size(MPI_COMM_WORLD, &nprocs)!= MPI_SUCCESS ) {
   printf("Error calling  MPI_Comm_size !!, exiting \n") ; fflush(stdout);
   MPI_Abort(MPI_COMM_WORLD, 1);
  }
  printf("Running %d tasks \n", nprocs) ; fflush(stdout);
 } else {
  printf("In slave tasks \n") ; fflush(stdout);
 }
 // MPI_Finalize();  // mandatory if <= mpich2-1.2 ?
 return 0;
}

