[mpich-discuss] Errors related to the increased number of tasks
Dave Goodell
goodell at mcs.anl.gov
Fri Dec 16 09:53:52 CST 2011
Can you try your software with mpich2-1.4.1p1? It contains a number of bug fixes over 1.4. The errors you saw with 1.2.1 and earlier are probably due to other bugs that were fixed years ago; I don't recommend using those older versions.
After trying that, if you are still encountering a problem, you might want to try increasing the limit on the number of open file descriptors on your system: http://www.cs.uwaterloo.ca/~brecht/servers/openfiles.html
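For what it's worth, a process can also inspect (and, up to the hard limit, raise) its own descriptor limit programmatically. This is just a minimal sketch of the getrlimit/setrlimit mechanism behind the settings described at that link; in practice you would usually raise the limit with "ulimit -n" in the shell that launches mpiexec, since the hydra proxies inherit it from there:

    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rlimit rl;

        /* query the current per-process limit on open file descriptors */
        if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
            perror("getrlimit");
            return 1;
        }
        printf("soft limit = %lu, hard limit = %lu\n",
               (unsigned long) rl.rlim_cur, (unsigned long) rl.rlim_max);

        /* raise the soft limit up to the hard limit; raising the hard
           limit itself requires root or an /etc/security/limits.conf
           entry, as described at the link above */
        rl.rlim_cur = rl.rlim_max;
        if (setrlimit(RLIMIT_NOFILE, &rl) != 0) {
            perror("setrlimit");
            return 1;
        }
        return 0;
    }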
What sort of processor are you using? Is it Intel/AMD or something else?
-Dave
On Dec 15, 2011, at 10:22 AM CST, Bernard Chambon wrote:
> Hello,
>
> I'm still working on failures encountered as the number of tasks increases
> (Using mpich2-1.4, compiled with gcc 4.1, on Scientific Linux 5, kernel 2.6.18-238.12cc.el5)
>
> Here is the smallest MPICH2 test case with which I get a failure above ~150 tasks.
> No communication, only basic calls.
>
> The code:
>
> // Compilation with:
> //   mpicc -O2 -I $HOME/mpich2-1.4/include -L $HOME/mpich2-1.4/lib -o bin/basic_test basic_test.c
>
> #include <mpi.h>
> #include <stdio.h>
> #include <unistd.h>  /* for sleep() */
>
> int main(int argc, char *argv[])
> {
>   if (MPI_Init(&argc, &argv) != MPI_SUCCESS) {
>     printf("Error calling MPI_Init !!, exiting\n"); fflush(stdout);
>     return 1;
>   }
>
>   int rank;
>   if (MPI_Comm_rank(MPI_COMM_WORLD, &rank) != MPI_SUCCESS) {
>     printf("Error calling MPI_Comm_rank !!, exiting\n"); fflush(stdout);
>     MPI_Abort(MPI_COMM_WORLD, 1);
>     return 1;
>   }
>
>   if (rank == 0) {
>     int nprocs;
>     if (MPI_Comm_size(MPI_COMM_WORLD, &nprocs) != MPI_SUCCESS) {
>       printf("Error calling MPI_Comm_size !!, exiting\n"); fflush(stdout);
>       MPI_Abort(MPI_COMM_WORLD, 1);
>       return 1;
>     }
>     printf("Running %d tasks\n", nprocs); fflush(stdout);
>   } else {
>     sleep(1);
>   }
>
>   MPI_Finalize();  /* every rank must call MPI_Finalize before exiting */
>   return 0;
> }
>
>
> Running the code (on Scientific Linux 5, 2.6.18-238.12cc.el5):
> Everything works fine up to around 150 tasks:
> >mpiexec -np 128 bin/basic_test
> Running 128 tasks
>
> >mpiexec -np 150 bin/basic_test
> Running 150 tasks
>
> >mpiexec -np 160 bin/basic_test
> [proxy:0:0 at ccdvli10] send_cmd_downstream (./pm/pmiserv/pmip_pmi_v1.c:80): assert (!closed) failed
> [proxy:0:0 at ccdvli10] fn_get_maxes (./pm/pmiserv/pmip_pmi_v1.c:205): error sending PMI response
> [proxy:0:0 at ccdvli10] pmi_cb (./pm/pmiserv/pmip_cb.c:327): PMI handler returned error
> [proxy:0:0 at ccdvli10] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
> [proxy:0:0 at ccdvli10] main (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
> [mpiexec at ccdvli10] control_cb (./pm/pmiserv/pmiserv_cb.c:215): assert (!closed) failed
> [mpiexec at ccdvli10] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
> [mpiexec at ccdvli10] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:181): error waiting for event
> [mpiexec at ccdvli10] main (./ui/mpich/mpiexec.c:405): process manager error waiting for completion
>
>
> Does anybody have an idea of what might be wrong in my code?
> What is the upper limit on the number of tasks?
>
> Best regards
> ---------------
> Bernard CHAMBON
> IN2P3 / CNRS
> 04 72 69 42 18
>