[mpich-discuss] Errors related to the increased number of tasks

Dave Goodell goodell at mcs.anl.gov
Fri Dec 16 09:53:52 CST 2011


Can you try your software with mpich2-1.4.1p1?  It contains a number of bug fixes over 1.4.  The errors you are seeing with 1.2.1 and earlier are probably other bugs that were fixed years ago.  I don't recommend using those older versions.

After trying that, if you are still encountering a problem, you might want to try increasing the limit on the number of open file descriptors on your system: http://www.cs.uwaterloo.ca/~brecht/servers/openfiles.html
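
As a quick sanity check, something like the following (untested) sketch will print the per-process descriptor limits that your MPI processes actually see; "ulimit -n" in the launching shell should report the same soft limit:

/* untested sketch: report the soft/hard limits on open file
 * descriptors (RLIMIT_NOFILE) for the current process */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    printf("open fds: soft limit = %llu, hard limit = %llu\n",
           (unsigned long long) rl.rlim_cur,
           (unsigned long long) rl.rlim_max);
    return 0;
}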

What sort of processor are you using?  Is it Intel/AMD or something else?

-Dave

On Dec 15, 2011, at 10:22 AM CST, Bernard Chambon wrote:

> Hello,
> 
> I'm still working on failures encountered as the number of tasks increases
> (using mpich2-1.4, compiled with gcc 4.1, on Scientific Linux 5, kernel 2.6.18-238.12cc.el5)
> 
> Here is the smallest MPICH2 program with which I get the failure, above ~150 tasks.
> No communication, only basic calls.
> 
> The code:
> 
> // Compilation with:
> // mpicc -O2 -I $HOME/mpich2-1.4/include -L $HOME/mpich2-1.4/lib -o bin/basic_test basic_test.c
> 
> #include <stdio.h>
> #include <unistd.h>
> #include <mpi.h>
> 
> int main(int argc, char *argv[])
> {
>   if (MPI_Init(&argc, &argv) != MPI_SUCCESS) {
>     printf("Error calling MPI_Init !!, exiting\n"); fflush(stdout);
>     return 1;
>   }
> 
>   int rank;
>   if (MPI_Comm_rank(MPI_COMM_WORLD, &rank) != MPI_SUCCESS) {
>     printf("Error calling MPI_Comm_rank !!, exiting\n"); fflush(stdout);
>     MPI_Abort(MPI_COMM_WORLD, 1);
>     return 1;
>   }
> 
>   if (rank == 0) {
>     // rank 0 reports the task count and shuts down MPI
>     int nprocs;
>     if (MPI_Comm_size(MPI_COMM_WORLD, &nprocs) != MPI_SUCCESS) {
>       printf("Error calling MPI_Comm_size !!, exiting\n"); fflush(stdout);
>       MPI_Abort(MPI_COMM_WORLD, 1);
>       return 1;
>     }
> 
>     printf("Running %d tasks\n", nprocs); fflush(stdout);
>     MPI_Finalize();
>     return 0;
>   } else {
>     // all other ranks sleep briefly and exit without calling MPI_Finalize
>     sleep(1);
>     return 0;
>   }
> }
> 
> 
> Running the code (on Scientific Linux 5, 2.6.18-238.12cc.el5):
> Everything works fine up to around 150 tasks.
>  >mpiexec -np 128 bin/basic_test
> Running 128 tasks 
> 
>  >mpiexec -np 150 bin/basic_test
> Running 150 tasks 
> 
>  >mpiexec -np 160 bin/basic_test
> [proxy:0:0 at ccdvli10] send_cmd_downstream (./pm/pmiserv/pmip_pmi_v1.c:80): assert (!closed) failed
> [proxy:0:0 at ccdvli10] fn_get_maxes (./pm/pmiserv/pmip_pmi_v1.c:205): error sending PMI response
> [proxy:0:0 at ccdvli10] pmi_cb (./pm/pmiserv/pmip_cb.c:327): PMI handler returned error
> [proxy:0:0 at ccdvli10] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:0 at ccdvli10] main (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
> [mpiexec at ccdvli10] control_cb (./pm/pmiserv/pmiserv_cb.c:215): assert (!closed) failed
> [mpiexec at ccdvli10] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
> [mpiexec at ccdvli10] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:181): error waiting for event
> [mpiexec at ccdvli10] main (./ui/mpich/mpiexec.c:405): process manager error waiting for completion
> 
> 
> Does anybody have an idea what my probable error is in this code?
> What is the upper limit on the number of tasks?
> 
> Best regards
> ---------------
> Bernard CHAMBON
> IN2P3 / CNRS
> 04 72 69 42 18
> 


