[mpich-discuss] Errors related to the increased number of tasks

Pavan Balaji balaji at mcs.anl.gov
Thu Mar 1 13:11:59 CST 2012


Hello,

Can you try the latest version of MPICH2 (1.5a2) and see if the problem 
still exists?

  -- Pavan

On 01/05/2012 01:51 AM, Bernard Chambon wrote:
> Hello
>
> On 27 Dec 2011, at 06:52, Pavan Balaji wrote:
>
>>
>> Looks like the shared memory is bombing out. Can you run mpiexec with
>> the -verbose option and also send us the machine file that you are
>> using (or is it all on a single node)?
>>
>> -- Pavan
>
> Another test (still showing the same failure):
> 1/ after removing the limits on the Linux machine (SL5, Linux 2.6.x)
> >limit
> cputime       unlimited
> filesize      unlimited
> datasize      unlimited
> stacksize     unlimited
> coredumpsize  unlimited
> memoryuse     unlimited
> vmemoryuse    unlimited
> descriptors   1000000
> memorylocked  unlimited
> maxproc       409600
>
> >more /proc/sys/kernel/shmall
> 8388608000
>
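> (For reference, a small standalone check, not part of the original test,
> that prints the limits a process actually inherits, e.g. when started by
> GridEngine, using getrlimit:)
>
> /* limits_check.c - print the limits this process actually sees */
> #include <stdio.h>
> #include <sys/resource.h>
>
> static void show(const char *name, int resource)
> {
>     struct rlimit rl;
>     if (getrlimit(resource, &rl) == 0)
>         printf("%-12s soft=%lu hard=%lu\n", name,
>                (unsigned long)rl.rlim_cur, (unsigned long)rl.rlim_max);
> }
>
> int main(void)
> {
>     show("NOFILE",  RLIMIT_NOFILE);   /* descriptors   */
>     show("STACK",   RLIMIT_STACK);    /* stacksize     */
>     show("MEMLOCK", RLIMIT_MEMLOCK);  /* memorylocked  */
>     return 0;
> }
>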
> 2/ after increasing __FD_SETSIZE and recompiling mpich2 1.4.1p1
>
> >grep -E "#define\W+__FD_SETSIZE" /usr/include/*.h /usr/include/*/*.h
> /usr/include/bits/typesizes.h:#define __FD_SETSIZE 8192
> /usr/include/linux/posix_types.h:#define __FD_SETSIZE 8192
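>
> (And a one-file sanity check, again just a sketch and not part of the
> original test, that the rebuilt code really sees the enlarged value:)
>
> /* fd_setsize_check.c - print the FD_SETSIZE the compiler actually uses */
> #include <stdio.h>
> #include <sys/select.h>
>
> int main(void)
> {
>     printf("FD_SETSIZE = %d\n", FD_SETSIZE);
>     return 0;
> }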
>
> I still get the same problem when trying to run a basic code with more
> than ~150 tasks (here trying with 170 tasks):
>
> >mpich2version
> MPICH2 Version:        1.4.1p1
> MPICH2 Release date:   Thu Sep 1 13:53:02 CDT 2011
> MPICH2 Device:         ch3:nemesis
> MPICH2 configure:      --prefix=/scratch/BC/mpich2-1.4
> MPICH2 CC:             /usr/bin/gcc -m64 -O2
> MPICH2 CXX:            c++ -m64 -O2
> MPICH2 F77:            /usr/bin/f77 -O2
> MPICH2 FC:             f95
>
> >mpiexec -np 170 bin/advance_test
> Assertion failed in file
> /scratch/BC/mpich2-1.4.1p1/src/util/wrappers/mpiu_shm_wrappers.h at line
> 889: seg_sz > 0
> internal ABORT - process 0
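>
> (To help separate a kernel shared-memory limit from an MPICH2 problem, a
> rough standalone probe, just a sketch with an arbitrary 64 MB size, that
> tries to create a single SysV shared-memory segment:)
>
> /* shm_probe.c - try to create one SysV shared-memory segment */
> #include <stdio.h>
> #include <string.h>
> #include <errno.h>
> #include <sys/ipc.h>
> #include <sys/shm.h>
>
> int main(void)
> {
>     size_t sz = 64UL * 1024 * 1024;            /* arbitrary test size */
>     int id = shmget(IPC_PRIVATE, sz, IPC_CREAT | 0600);
>     if (id == -1) {
>         printf("shmget(%zu) failed: %s\n", sz, strerror(errno));
>         return 1;
>     }
>     printf("shmget(%zu) succeeded, id = %d\n", sz, id);
>     shmctl(id, IPC_RMID, NULL);                /* remove it right away */
>     return 0;
> }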
>
>
> Another interesting point: the same basic code, run with an older release
> of mpich2 (1.0.8p1, using the mpd daemon, the default installation on our
> machines), runs without any failure:
>
> >mpich2version
> MPICH2 Version:        1.0.8p1
> MPICH2 Release date:   Unknown, built on Tue Apr 21 13:52:10 CEST 2009
> MPICH2 Device:         ch3:sock
> MPICH2 configure:      -prefix=/usr/local/mpich2
> MPICH2 CC:             gcc -O2
> MPICH2 CXX:            c++ -O2
> MPICH2 F77:            g77 -O2
> MPICH2 F90:            f95 -O2
>
> >mpicc -O2 -o bin/advance_test advance_test.c
> >mpdboot --ncpus=170
> >mpiexec -np 170 bin/advance_test | more
> Running 170 tasks
> In slave tasks
> In slave tasks
> In slave tasks
> In slave tasks
> In slave tasks
> In slave tasks
> …
> mpdallexit
>
> The test code runs without failure.
>
> If you ask why such a test: after installing mpich2 1.4.1p1 and running
> jobs through GridEngine, everything works fine when jobs request a small
> number of tasks, but I get failures as the number of tasks increases
> (for example, with 32 tasks 100% of jobs pass; with 64 tasks, 70% of
> jobs fail).
>
> So at the current time, I can't provide MPICH2 to our users.
>
> Thank you for any help
>
>
> PS : the basic test code
> #include <stdio.h>
> #include <unistd.h>
> #include <mpi.h>
>
> int main(int argc, char *argv[])
> {
>     if (MPI_Init(&argc, &argv) != MPI_SUCCESS) {
>         printf("Error calling MPI_Init !!, exiting \n"); fflush(stdout);
>         return(1);
>     }
>
>     int rank;
>     if (MPI_Comm_rank(MPI_COMM_WORLD, &rank) != MPI_SUCCESS) {
>         printf("Error calling MPI_Comm_rank !!, exiting \n"); fflush(stdout);
>         MPI_Abort(MPI_COMM_WORLD, 1);
>         return(1);
>     }
>     if (rank == 0) {
>         int nprocs;
>         if (MPI_Comm_size(MPI_COMM_WORLD, &nprocs) != MPI_SUCCESS) {
>             printf("Error calling MPI_Comm_size !!, exiting \n"); fflush(stdout);
>             MPI_Abort(MPI_COMM_WORLD, 1);
>             return(1);
>         }
>         printf("Running %d tasks \n", nprocs); fflush(stdout);
>         MPI_Finalize();
>         return(0);
>     } else {
>         printf("In slave tasks \n"); fflush(stdout);
>         sleep(1);
>         // MPI_Finalize(); // mandatory if <= mpich2-1.2 ?
>         return(0);
>     }
> }
>
> ---------------
> Bernard CHAMBON
> IN2P3 / CNRS
> 04 72 69 42 18
>
>
>
> _______________________________________________
> mpich-discuss mailing list     mpich-discuss at mcs.anl.gov
> To manage subscription options or unsubscribe:
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji

