[mpich-discuss] Errors related to the increased number of tasks

Bernard Chambon bernard.chambon at cc.in2p3.fr
Fri Dec 16 08:49:15 CST 2011


Hi,

On 15 Dec 2011, at 17:22, Bernard Chambon wrote:

> I'm still working on the failures I encounter as the number of tasks increases
> (using mpich2-1.4, compiled with gcc 4.1, on Scientific Linux 5, kernel 2.6.18-238.12cc.el5)
> 

Here are further tests on the same machine, with mpich2 1.0, then 1.1, 1.2, etc.

 >mpich2version
MPICH2 Version:    	1.0.8p1
MPICH2 Release date:	Unknown, built on Tue Apr 21 13:52:10 CEST 2009
MPICH2 Device:    	ch3:sock
MPICH2 configure: 	-prefix=/usr/local/mpich2
MPICH2 CC: 	gcc  -O2
MPICH2 CXX: 	c++  -O2
MPICH2 F77: 	g77  -O2
MPICH2 F90: 	f95  -O2

 >mpicc -O2 -I $MPICH_HOME/include -L $MPICH_HOME/lib -o bin/basic_test basic_test.c

 >mpiexec -np 256 bin/basic_test
Running 256 tasks 

 >mpiexec -np 512 bin/basic_test
Running 512 tasks 

 >mpiexec -np 512 bin/basic_test
Running 512 tasks 



With mpich2 1.1 and later, I get an error at around 130 to 150 tasks.
I probably omitted something when compiling those versions, but I don't know where to look.


 >mpich2version 
MPICH2 Version:    	1.1b1
MPICH2 Release date:	Unknown, built on Fri Dec 16 15:30:19 CET 2011
MPICH2 Device:    	ch3:nemesis
MPICH2 configure: 	--prefix=//scratch/BC/mpich2-1.1
MPICH2 CC: 	/usr/bin/gcc -m64 -O2
MPICH2 CXX: 	c++ -m64 -O2
MPICH2 F77: 	/usr/bin/f77  -O2
MPICH2 F90: 	f95  -O2


 >mpicc -O2 -I $MPICH_HOME/include -L $MPICH_HOME/lib -o bin/basic_test basic_test.c
 >mpiexec -np 100 bin/basic_test
Running 100 tasks 

 >mpiexec -np 120 bin/basic_test
Running 120 tasks 

 >mpiexec -np 150 bin/basic_test
Assertion failed in file /scratch/BC/mpich2-1.1b1/src/util/wrappers/mpiu_shm_wrappers.h at line 919: seg_sz > 0
internal ABORT - process 0
rank 0 in job 26  ccwpge0001_56217   caused collective abort of all ranks
  exit status of rank 0: return code 1 


 >mpich2version 
MPICH2 Version:    	1.2.1
MPICH2 Release date:	Unknown, built on Fri Dec 16 13:40:20 CET 2011
MPICH2 Device:    	ch3:nemesis
MPICH2 configure: 	--prefix=//scratch/BC/mpich2-1.2
MPICH2 CC: 	/usr/bin/gcc -m64 -O2
MPICH2 CXX: 	c++ -m64 -O2
MPICH2 F77: 	/usr/bin/f77  -O2
MPICH2 F90: 	f95  -O2

 >mpicc -O2 -I $MPICH_HOME/include -L $MPICH_HOME/lib -o bin/basic_test basic_test.c


 >mpiexec -np 96 bin/basic_test
Running 96 tasks 
 >mpiexec -np 96 bin/basic_test
Running 96 tasks 
 >mpiexec -np 120 bin/basic_test
Running 120 tasks 
 >mpiexec -np 120 bin/basic_test
Running 120 tasks 
 >mpiexec -np 130 bin/basic_test
Assertion failed in file /scratch/BC/mpich2-1.2.1/src/util/wrappers/mpiu_shm_wrappers.h at line 923: seg_sz > 0
internal ABORT - process 0
rank 0 in job 16  ccwpge0001_56217   caused collective abort of all ranks
  exit status of rank 0: return code 1 

Best regards


PS: the test code

#include <stdio.h>   /* printf, fflush */
#include <unistd.h>  /* sleep */
#include <mpi.h>

int basicTest(int argc, char** argv) {
 if (MPI_Init(&argc, &argv) != MPI_SUCCESS ) {
  printf("Error calling MPI_Init !!, exiting \n") ; fflush(stdout);
  return(1);
 }

 int rank;
 if ( MPI_Comm_rank(MPI_COMM_WORLD, &rank)!= MPI_SUCCESS ) {
  printf("Error calling  MPI_Comm_rank !!, exiting \n") ; fflush(stdout);
  MPI_Abort(MPI_COMM_WORLD, 1);
  return(1);
 }
 
 if (rank == 0) {
  int nprocs;
  if (MPI_Comm_size(MPI_COMM_WORLD, &nprocs)!= MPI_SUCCESS ) {
   printf("Error calling  MPI_Comm_size !!, exiting \n") ; fflush(stdout);
   MPI_Abort(MPI_COMM_WORLD, 1);
   return(1);
  }
 
  printf("Running %d tasks \n", nprocs) ; fflush(stdout);
  MPI_Finalize(); 
  return(0); 
 } else {
  sleep(1);
  MPI_Finalize();  // Needed only for mpich2 <= 1.2
  return(0);
 }

}
/******************************/
int main(int argc, char** argv) {
  return basicTest(argc, argv);  /* propagate the test status as the exit code */
}


---------------
Bernard CHAMBON
IN2P3 / CNRS
04 72 69 42 18
