[mpich-discuss] Errors related to the increased number of tasks

Bernard Chambon bernard.chambon at cc.in2p3.fr
Thu Jan 5 01:51:01 CST 2012


Hello

On 27 Dec 2011, at 06:52, Pavan Balaji wrote:

> 
> Looks like the shared memory is bombing out.  Can you run mpiexec with the -verbose option and also send us the machine file that you are using (or is it all on a single node)?
> 
> -- Pavan


Another test (still showing the same failure):
 1/ after removing resource limits on the Linux machine (SL5, Linux 2.6.x)
 >limit
cputime      unlimited
filesize     unlimited
datasize     unlimited
stacksize    unlimited
coredumpsize unlimited
memoryuse    unlimited
vmemoryuse   unlimited
descriptors  1000000 
memorylocked unlimited
maxproc      409600 

>more /proc/sys/kernel/shmall
8388608000
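
For completeness, the same limits can also be read from inside a process with getrlimit(), which is what matters once jobs are launched through GridEngine rather than an interactive shell. A minimal sketch, not part of my original test (the file name limits_check.c is just for illustration):

 /* limits_check.c - print a few per-process limits as the process itself sees them */
 #include <stdio.h>
 #include <sys/resource.h>

 static void show(const char *name, int resource)
 {
     struct rlimit rl;
     if (getrlimit(resource, &rl) == 0)
         printf("%-12s soft=%llu hard=%llu\n", name,
                (unsigned long long)rl.rlim_cur,
                (unsigned long long)rl.rlim_max);
 }

 int main(void)
 {
     show("descriptors", RLIMIT_NOFILE);
     show("vmemoryuse", RLIMIT_AS);
     show("stacksize",  RLIMIT_STACK);
     return 0;
 }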

 2/ after increasing FD_SETSIZE and recompiling mpich2 1.4.1p1

>grep -E "#define\W+__FD_SETSIZE" /usr/include/*.h /usr/include/*/*.h
/usr/include/bits/typesizes.h:#define	__FD_SETSIZE          8192	
/usr/include/linux/posix_types.h:#define __FD_SETSIZE	 8192	
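
To verify that the new value is actually picked up by a compilation unit (the header change only takes effect through sys/select.h), here is a minimal sketch, not part of my original test, compiled with the same gcc -m64 flags as mpich2:

 #include <stdio.h>
 #include <sys/select.h>

 int main(void)
 {
     /* sizeof(fd_set) grows with __FD_SETSIZE, so this confirms
        whether the redefined value is really in effect */
     printf("FD_SETSIZE     = %d\n", (int)FD_SETSIZE);
     printf("sizeof(fd_set) = %zu bytes\n", sizeof(fd_set));
     return 0;
 }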

I still get the same problem when trying to run a basic test code with more than ~150 tasks (here with 170 tasks):

>mpich2version
MPICH2 Version:    	1.4.1p1
MPICH2 Release date:	Thu Sep  1 13:53:02 CDT 2011
MPICH2 Device:    	ch3:nemesis
MPICH2 configure: 	--prefix=//scratch/BC/mpich2-1.4
MPICH2 CC: 	/usr/bin/gcc -m64   -O2
MPICH2 CXX: 	c++ -m64  -O2
MPICH2 F77: 	/usr/bin/f77   -O2
MPICH2 FC: 	f95  

>mpiexec -np 170 bin/advance_test
Assertion failed in file /scratch/BC/mpich2-1.4.1p1/src/util/wrappers/mpiu_shm_wrappers.h at line 889: seg_sz > 0
internal ABORT - process 0
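
If I read the assertion correctly, it checks that the computed shared-memory segment size is strictly positive before the segment is created. Just as an illustration of that kind of guard (this is not MPICH's actual code; the name /bc_demo_seg and the helper create_shm_segment are made up), with POSIX shared memory:

 #include <stdio.h>
 #include <stdlib.h>
 #include <fcntl.h>
 #include <unistd.h>
 #include <sys/mman.h>

 /* refuse to create an empty segment, analogous to the "seg_sz > 0" assertion */
 static void *create_shm_segment(const char *name, size_t seg_sz)
 {
     if (seg_sz == 0) {
         fprintf(stderr, "shared-memory segment size is 0, aborting\n");
         abort();
     }
     int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
     if (fd < 0)
         return NULL;
     void *p = NULL;
     if (ftruncate(fd, (off_t)seg_sz) == 0)
         p = mmap(NULL, seg_sz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
     close(fd);
     return (p == NULL || p == MAP_FAILED) ? NULL : p;
 }

 int main(void)   /* link with -lrt on SL5 */
 {
     void *seg = create_shm_segment("/bc_demo_seg", 4096);
     printf("segment mapped at %p\n", seg);
     shm_unlink("/bc_demo_seg");
     return 0;
 }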


Another interesting thing is that the same basic code, running with an older release of MPICH2 (1.0.8p1, using the mpd daemon, the default installation on our machines), runs without any failure:

>mpich2version 
MPICH2 Version:    	1.0.8p1
MPICH2 Release date:	Unknown, built on Tue Apr 21 13:52:10 CEST 2009
MPICH2 Device:    	ch3:sock
MPICH2 configure: 	-prefix=/usr/local/mpich2
MPICH2 CC: 	gcc  -O2
MPICH2 CXX: 	c++  -O2
MPICH2 F77: 	g77  -O2
MPICH2 F90: 	f95  -O2

>mpicc -O2 -o bin/advance_test advance_test.c
>mpdboot --ncpus=170
>mpiexec -np 170 bin/advance_test | more
Running 170 tasks 
In slave tasks 
In slave tasks 
In slave tasks 
In slave tasks 
In slave tasks 
In slave tasks 
…
mpdallexit

The test code runs without failure.

If you ask why such a test: after installing MPICH2 1.4.1p1
and running jobs through GridEngine, everything works fine when jobs specify a small number of tasks,
then I get failures as the number of tasks increases
(for example, with 32 tasks 100% of jobs pass; with 64 tasks, 70% of jobs fail).

So at the current time, I can't provide MPICH2 to our users.

Thank you for any help

 

PS: the basic test code
 
 #include <stdio.h>
 #include <unistd.h>   /* sleep() */
 #include <mpi.h>

 int main(int argc, char *argv[])
 {
     if (MPI_Init(&argc, &argv) != MPI_SUCCESS) {
         printf("Error calling MPI_Init !!, exiting\n"); fflush(stdout);
         return 1;
     }

     int rank;
     if (MPI_Comm_rank(MPI_COMM_WORLD, &rank) != MPI_SUCCESS) {
         printf("Error calling MPI_Comm_rank !!, exiting\n"); fflush(stdout);
         MPI_Abort(MPI_COMM_WORLD, 1);
         return 1;
     }

     if (rank == 0) {
         /* rank 0 reports how many tasks were started */
         int nprocs;
         if (MPI_Comm_size(MPI_COMM_WORLD, &nprocs) != MPI_SUCCESS) {
             printf("Error calling MPI_Comm_size !!, exiting\n"); fflush(stdout);
             MPI_Abort(MPI_COMM_WORLD, 1);
             return 1;
         }
         printf("Running %d tasks\n", nprocs); fflush(stdout);
     } else {
         /* slave tasks just announce themselves and wait a bit */
         printf("In slave tasks\n"); fflush(stdout);
         sleep(1);
     }

     /* the MPI standard requires every process to call MPI_Finalize before exiting */
     MPI_Finalize();
     return 0;
 }

---------------
Bernard CHAMBON
IN2P3 / CNRS
04 72 69 42 18


