[mpich-discuss] Errors related to the increased number of tasks
Bernard Chambon
bernard.chambon at cc.in2p3.fr
Fri Dec 16 08:49:15 CST 2011
Hi,
On 15 Dec 2011, at 17:22, Bernard Chambon wrote:
> I'm still working on failures encountered as the number of tasks increases
> (Using mpich2-1.4, compiled with gcc 4.1, on Scientific Linux 5 , 2.6.18-238.12cc.el5)
>
I ran further tests on the same machine with MPICH2 1.0, then 1.1, 1.2, etc.
>mpich2version
MPICH2 Version: 1.0.8p1
MPICH2 Release date: Unknown, built on Tue Apr 21 13:52:10 CEST 2009
MPICH2 Device: ch3:sock
MPICH2 configure: -prefix=/usr/local/mpich2
MPICH2 CC: gcc -O2
MPICH2 CXX: c++ -O2
MPICH2 F77: g77 -O2
MPICH2 F90: f95 -O2
>mpicc -O2 -I $MPICH_HOME/include -L $MPICH_HOME/lib -o bin/basic_test basic_test.c
>mpiexec -np 256 bin/basic_test
Running 256 tasks
>mpiexec -np 512 bin/basic_test
Running 512 tasks
>mpiexec -np 512 bin/basic_test
Running 512 tasks
With MPICH2 1.1 and later, I get an error at around 150 tasks (at 150 with 1.1b1 and at 130 with 1.2.1 in the runs below).
I probably omitted something when building those versions, but I don't know where to look.
>mpich2version
MPICH2 Version: 1.1b1
MPICH2 Release date: Unknown, built on Fri Dec 16 15:30:19 CET 2011
MPICH2 Device: ch3:nemesis
MPICH2 configure: --prefix=//scratch/BC/mpich2-1.1
MPICH2 CC: /usr/bin/gcc -m64 -O2
MPICH2 CXX: c++ -m64 -O2
MPICH2 F77: /usr/bin/f77 -O2
MPICH2 F90: f95 -O2
>mpicc -O2 -I $MPICH_HOME/include -L $MPICH_HOME/lib -o bin/basic_test basic_test.c
>mpiexec -np 100 bin/basic_test
Running 100 tasks
>mpiexec -np 120 bin/basic_test
Running 120 tasks
>mpiexec -np 150 bin/basic_test
Assertion failed in file /scratch/BC/mpich2-1.1b1/src/util/wrappers/mpiu_shm_wrappers.h at line 919: seg_sz > 0
internal ABORT - process 0
rank 0 in job 26 ccwpge0001_56217 caused collective abort of all ranks
exit status of rank 0: return code 1
>mpich2version
MPICH2 Version: 1.2.1
MPICH2 Release date: Unknown, built on Fri Dec 16 13:40:20 CET 2011
MPICH2 Device: ch3:nemesis
MPICH2 configure: --prefix=//scratch/BC/mpich2-1.2
MPICH2 CC: /usr/bin/gcc -m64 -O2
MPICH2 CXX: c++ -m64 -O2
MPICH2 F77: /usr/bin/f77 -O2
MPICH2 F90: f95 -O2
>mpicc -O2 -I $MPICH_HOME/include -L $MPICH_HOME/lib -o bin/basic_test basic_test.c
>mpiexec -np 96 bin/basic_test
Running 96 tasks
>mpiexec -np 96 bin/basic_test
Running 96 tasks
>mpiexec -np 120 bin/basic_test
Running 120 tasks
>mpiexec -np 120 bin/basic_test
Running 120 tasks
>mpiexec -np 130 bin/basic_test
Assertion failed in file /scratch/BC/mpich2-1.2.1/src/util/wrappers/mpiu_shm_wrappers.h at line 923: seg_sz > 0
internal ABORT - process 0
rank 0 in job 16 ccwpge0001_56217 caused collective abort of all ranks
exit status of rank 0: return code 1
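The manual runs above could be scripted to narrow down the failing task count. The following is only a minimal sketch: `first_failure`, `MPIEXEC` and `TEST_BIN` are hypothetical names introduced for illustration, and it assumes that a successful run prints the "Running N tasks" line shown in the transcripts.

```shell
#!/bin/sh
# Sweep the task count and print the first count at which the test fails.
# MPIEXEC and TEST_BIN are assumptions about the local setup.
MPIEXEC=${MPIEXEC:-mpiexec}
TEST_BIN=${TEST_BIN:-bin/basic_test}

# first_failure START END STEP
# Prints the first failing task count and returns 0,
# or returns 1 if every run in the range succeeded.
first_failure() {
    np=$1
    while [ "$np" -le "$2" ]; do
        # Capture stdout and stderr. The exit status is ignored on purpose:
        # success is judged only from the output of rank 0.
        out=$("$MPIEXEC" -np "$np" "$TEST_BIN" 2>&1) || true
        case $out in
            *"Running $np tasks"*) ;;      # expected line present: success
            *) echo "$np"; return 0 ;;     # expected line missing: failure
        esac
        np=$((np + $3))
    done
    return 1
}
```

For example, `first_failure 100 160 10` would try 100, 110, ..., 160 tasks in turn, and a second call with a step of 1 over the last interval would pin down the exact threshold.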
Best regards
PS: the test code
#include <stdio.h>
#include <unistd.h>   /* sleep() */
#include <mpi.h>

int basicTest(int argc, char** argv) {
    if (MPI_Init(&argc, &argv) != MPI_SUCCESS) {
        printf("Error calling MPI_Init !!, exiting \n"); fflush(stdout);
        return(1);
    }
    int rank;
    if (MPI_Comm_rank(MPI_COMM_WORLD, &rank) != MPI_SUCCESS) {
        printf("Error calling MPI_Comm_rank !!, exiting \n"); fflush(stdout);
        MPI_Abort(MPI_COMM_WORLD, 1);
        return(1);
    }
    if (rank == 0) {
        /* Only rank 0 reports the number of tasks */
        int nprocs;
        if (MPI_Comm_size(MPI_COMM_WORLD, &nprocs) != MPI_SUCCESS) {
            printf("Error calling MPI_Comm_size !!, exiting \n"); fflush(stdout);
            MPI_Abort(MPI_COMM_WORLD, 1);
            return(1);
        }
        printf("Running %d tasks \n", nprocs); fflush(stdout);
        MPI_Finalize();
        return(0);
    } else {
        /* All other ranks just wait briefly, then finalize */
        sleep(1);
        MPI_Finalize(); // Needed if and only if <= mpich2-1.2
        return(0);
    }
}
/******************************/
int main(int argc, char** argv) {
    /* Propagate the test result as the process exit status */
    return basicTest(argc, argv);
}
---------------
Bernard CHAMBON
IN2P3 / CNRS
04 72 69 42 18