[mpich-discuss] Errors related to the increased number of tasks
Bernard Chambon
bernard.chambon at cc.in2p3.fr
Thu Dec 15 10:22:27 CST 2011
Hello,
I'm still working on the failures I encounter as the number of tasks increases
(using mpich2-1.4, compiled with gcc 4.1, on Scientific Linux 5, kernel 2.6.18-238.12cc.el5).
Here is the smallest mpich2 program with which I get a failure above ~150 tasks.
There is no communication, only basic calls.
The code:
// Compilation with :
// mpicc -O2 -I $HOME/mpich2-1.4/include -L $HOME/mpich2-1.4/lib -o bin/basic_test basic_test.c
#include <stdio.h>
#include <unistd.h>   // sleep()
#include <mpi.h>

int main(int argc, char **argv) {
    if (MPI_Init(&argc, &argv) != MPI_SUCCESS) {
        printf("Error calling MPI_Init !!, exiting \n"); fflush(stdout);
        return 1;
    }

    int rank;
    if (MPI_Comm_rank(MPI_COMM_WORLD, &rank) != MPI_SUCCESS) {
        printf("Error calling MPI_Comm_rank !!, exiting \n"); fflush(stdout);
        MPI_Abort(MPI_COMM_WORLD, 1);
        return 1;
    }

    if (rank == 0) {
        // Only rank 0 queries the communicator size and prints it
        int nprocs;
        if (MPI_Comm_size(MPI_COMM_WORLD, &nprocs) != MPI_SUCCESS) {
            printf("Error calling MPI_Comm_size !!, exiting \n"); fflush(stdout);
            MPI_Abort(MPI_COMM_WORLD, 1);
            return 1;
        }
        printf("Running %d tasks \n", nprocs); fflush(stdout);
        MPI_Finalize();
        return 0;
    } else {
        // All other ranks sleep for one second and exit;
        // note that they return without calling MPI_Finalize
        sleep(1);
        return 0;
    }
}
Running the code (on Scientific Linux 5, kernel 2.6.18-238.12cc.el5):
everything works fine up to around 150 tasks.
>mpiexec -np 128 bin/basic_test
Running 128 tasks
>mpiexec -np 150 bin/basic_test
Running 150 tasks
>mpiexec -np 160 bin/basic_test
[proxy:0:0@ccdvli10] send_cmd_downstream (./pm/pmiserv/pmip_pmi_v1.c:80): assert (!closed) failed
[proxy:0:0@ccdvli10] fn_get_maxes (./pm/pmiserv/pmip_pmi_v1.c:205): error sending PMI response
[proxy:0:0@ccdvli10] pmi_cb (./pm/pmiserv/pmip_cb.c:327): PMI handler returned error
[proxy:0:0@ccdvli10] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@ccdvli10] main (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
[mpiexec@ccdvli10] control_cb (./pm/pmiserv/pmiserv_cb.c:215): assert (!closed) failed
[mpiexec@ccdvli10] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@ccdvli10] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:181): error waiting for event
[mpiexec@ccdvli10] main (./ui/mpich/mpiexec.c:405): process manager error waiting for completion
Does anybody have an idea of the probable error in my code?
What is the upper limit on the number of tasks?
Best regards
---------------
Bernard CHAMBON
IN2P3 / CNRS
04 72 69 42 18