[mpich-discuss] Errors related to the increased number of tasks

Bernard Chambon bernard.chambon at cc.in2p3.fr
Thu Dec 15 10:22:27 CST 2011


Hello,

I'm still working on the failures I encounter as the number of tasks increases
(using mpich2-1.4, compiled with gcc 4.1, on Scientific Linux 5, kernel 2.6.18-238.12cc.el5).

Here is the smallest mpich2 program with which I get a failure above ~150 tasks.
It does no communication, only basic calls.

The code :

// Compilation with :
// mpicc -O2 -I $HOME/mpich2-1.4/include -L $HOME/mpich2-1.4/lib -o bin/basic_test basic_test.c

#include <stdio.h>
#include <unistd.h>   // sleep()
#include <mpi.h>

int main(int argc, char *argv[]) {

 if (MPI_Init(&argc, &argv) != MPI_SUCCESS) {
  printf("Error calling MPI_Init !!, exiting \n"); fflush(stdout);
  return 1;
 }

 int rank;
 if (MPI_Comm_rank(MPI_COMM_WORLD, &rank) != MPI_SUCCESS) {
  printf("Error calling MPI_Comm_rank !!, exiting \n"); fflush(stdout);
  MPI_Abort(MPI_COMM_WORLD, 1);
  return 1;
 }

 if (rank == 0) {
  // Only rank 0 queries and reports the number of tasks
  int nprocs;
  if (MPI_Comm_size(MPI_COMM_WORLD, &nprocs) != MPI_SUCCESS) {
   printf("Error calling MPI_Comm_size !!, exiting \n"); fflush(stdout);
   MPI_Abort(MPI_COMM_WORLD, 1);
   return 1;
  }

  printf("Running %d tasks \n", nprocs); fflush(stdout);
  MPI_Finalize();
  return 0;
 } else {
  // All other ranks sleep briefly and exit
  // (note: there is no MPI_Finalize call on this path)
  sleep(1);
  return 0;
 }
}


Running the code (on Scientific Linux 5, 2.6.18-238.12cc.el5):
everything works fine up to around 150 tasks.
 >mpiexec -np 128 bin/basic_test
Running 128 tasks 

 >mpiexec -np 150 bin/basic_test
Running 150 tasks 

 >mpiexec -np 160 bin/basic_test
[proxy:0:0 at ccdvli10] send_cmd_downstream (./pm/pmiserv/pmip_pmi_v1.c:80): assert (!closed) failed
[proxy:0:0 at ccdvli10] fn_get_maxes (./pm/pmiserv/pmip_pmi_v1.c:205): error sending PMI response
[proxy:0:0 at ccdvli10] pmi_cb (./pm/pmiserv/pmip_cb.c:327): PMI handler returned error
[proxy:0:0 at ccdvli10] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0 at ccdvli10] main (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
[mpiexec at ccdvli10] control_cb (./pm/pmiserv/pmiserv_cb.c:215): assert (!closed) failed
[mpiexec at ccdvli10] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec at ccdvli10] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:181): error waiting for event
[mpiexec at ccdvli10] main (./ui/mpich/mpiexec.c:405): process manager error waiting for completion


Does anybody have an idea what my error is likely to be?
Is there an upper limit on the number of tasks?
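
On the question of an upper limit: as far as I know, the MPI standard itself does not fix a maximum number of tasks, so any limit would come from the implementation or from the operating system. One candidate worth ruling out (an assumption, not a confirmed cause of the asserts above) is the per-process limit on open file descriptors, since all tasks here appear to be started on a single host and the launcher presumably holds descriptors for each of them. Below is a minimal sketch in C using POSIX getrlimit; the helper names get_fd_limits, print_limit and print_fd_limits are hypothetical and not part of mpich2.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper: fetch the soft and hard limits on open file
   descriptors for the calling process. Returns 0 on success, -1 on error. */
int get_fd_limits(rlim_t *soft, rlim_t *hard)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        return -1;
    }
    *soft = rl.rlim_cur;
    *hard = rl.rlim_max;
    return 0;
}

/* Hypothetical helper: print one limit, handling the "unlimited" value. */
static void print_limit(const char *label, rlim_t value)
{
    if (value == RLIM_INFINITY) {
        printf("%s: unlimited\n", label);
    } else {
        printf("%s: %llu\n", label, (unsigned long long) value);
    }
}

/* Print both limits; returns 0 on success, -1 if getrlimit failed. */
int print_fd_limits(void)
{
    rlim_t soft, hard;
    if (get_fd_limits(&soft, &hard) != 0) {
        perror("getrlimit");
        return -1;
    }
    print_limit("open files (soft)", soft);
    print_limit("open files (hard)", hard);
    fflush(stdout);
    return 0;
}
```

Calling print_fd_limits() from a small main, or running `ulimit -n` in the shell that starts mpiexec, shows the limit in effect on the launching host.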

Best regards
---------------
Bernard CHAMBON
IN2P3 / CNRS
04 72 69 42 18
