[mpich-discuss] Question/problem with MPI mvapich hydra.

Darius Buntinas buntinas at mcs.anl.gov
Tue Oct 11 14:35:48 CDT 2011


I took a look at your code.  Mpiexec will send a SIGUSR1 signal to each process to notify it of a failed process (Oops, I forgot about that when I responded to your previous email).  If you need a signal for your application, you'll need to choose another one.  The signal handler you installed replaced MPICH's signal handler, so the library wasn't able to detect that the process had failed.
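
If you still need a signal inside the application, picking a different one and leaving SIGUSR1 to MPICH should be enough. A rough, untested sketch (I'm guessing at the handler body and assuming SIGUSR2 is otherwise unused in your setup; only the function name CreateOwnSignalHandler is yours):

    #include <signal.h>
    #include <unistd.h>

    // Hypothetical handler for the application's own signal; it must only do
    // async-signal-safe work (write() is safe, printf() is not).
    extern "C" void app_signal_handler(int)
    {
        const char msg[] = "application got SIGUSR2\n";
        write(2, msg, sizeof(msg) - 1);
    }

    void CreateOwnSignalHandler()
    {
        struct sigaction sa;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = 0;
        sa.sa_handler = app_signal_handler;
        sigaction(SIGUSR2, &sa, NULL);   // leave SIGUSR1 alone; mpiexec uses it
    }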

Another problem is that MPI_Abort() isn't killing all processes, so when I commented out CreateOwnSignalHandler(), the master detected the failure and called MPI_Abort(), but some slave processes were still hanging in MPI_Barrier().  We'll need to fix that.
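
On the first question from your earlier mail: with MPI_ERRORS_RETURN set on the communicator and -disable-auto-cleanup on the mpiexec line, the master can check the return code of each receive and decide what to do about the failed slave. A rough, untested sketch (the names are made up, not from your code):

    #include <mpi.h>
    #include <cstdio>

    // Receive one int from a given slave; return false if the receive failed.
    // Assumes MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN) was already called.
    static bool recv_from_slave(int slave_rank, int *value)
    {
        MPI_Status status;
        int rc = MPI_Recv(value, 1, MPI_INT, slave_rank, MPI_ANY_TAG,
                          MPI_COMM_WORLD, &status);
        if (rc != MPI_SUCCESS) {
            // The receive was posted for this specific rank, so we already know
            // which slave died; either skip it from now on, or give up and abort.
            fprintf(stderr, "lost slave %d\n", slave_rank);
            return false;
        }
        return true;
    }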

-d


On Oct 9, 2011, at 1:45 AM, Anatoly G wrote:

> Thank you for your reply.
> I'm sorry it took me so long to answer; I was ill.
> I used the system abort(); exit() is a more normal program termination, and I wanted to simulate an unexpected program termination.
> The attached .h and .cpp files are an example.
> Execution command:
> mpiexec.hydra -disable-auto-cleanup -launcher rsh -launcher-exec /usr/bin/rsh -f machines2.txt -n 6 mpi_send_sync 100 1000 3 50 1 logs/test
> 
> Anatoly.
> 
> On Tue, Sep 27, 2011 at 9:10 PM, Darius Buntinas <buntinas at mcs.anl.gov> wrote:
> Are you calling MPI_Abort or the system call abort()?  MPI_Abort will abort the entire job.  I'm not sure why mpiexec would be sending sigusr1, though.  Try calling exit() instead and see if that works.
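>
> For instance, something like this in the failing slave (just a sketch; the helper name and parameters are made up):
>
>   #include <cstdlib>
>
>   // Simulate an unexpected, but signal-free, process termination.
>   static void maybe_die(int iteration, int failing_iteration)
>   {
>       if (iteration == failing_iteration)
>           std::exit(1);   // plain process exit; abort() raises SIGABRT instead
>   }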
> 
> Can you send us a short program that demonstrates this?
> 
> Thanks,
> -d
> 
> 
> On Sep 27, 2011, at 4:09 AM, Anatoly G wrote:
> 
> > I am trying to build a master/slave application, which I execute using mpiexec.hydra.
> > The master is always the process with rank 0; all the other processes are slaves (currently ~10 processes).
> > Communication protocol (point to point); a rough code sketch follows the description below:
> > Slave:
> > 1) Sends a single integer to the master 1000 times in a loop using MPI_Send.
> > 2) Waits in MPI_Recv to receive a single integer from the master.
> > 3) Executes MPI_Finalize().
> >
> > Master:
> > 1) The master is initialized with MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN).
> > 2) The master cycles over a buffer of slave ranks and listens to each of them with an MPI_Recv on that slave's rank.
> > The loop is performed 1000 * (number of slaves) times.
> > 3) After the end of the loop, the master sends the integer 0 to each slave using MPI_Send.
> > 4) Executes MPI_Finalize().
> >
> > Purpose of the application: tolerance to process failures.
> > If some of the slaves fail, continue working with the remaining ones.
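> >
> > Roughly, the code does the following (simplified sketch, not the real program; tags and values are made up):
> >
> >   #include <mpi.h>
> >
> >   // Slave: 1000 sends to the master, wait for the stop value, finalize.
> >   void run_slave(int rank)
> >   {
> >       int value = rank;
> >       for (int i = 0; i < 1000; ++i)
> >           MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
> >       int stop;
> >       MPI_Recv(&stop, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
> >       MPI_Finalize();
> >   }
> >
> >   // Master: cyclic blocking receives, then tell every slave to stop, finalize.
> >   void run_master(int nslaves)
> >   {
> >       MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
> >       for (int i = 0; i < 1000 * nslaves; ++i) {
> >           int value, slave = 1 + (i % nslaves);
> >           MPI_Recv(&value, 1, MPI_INT, slave, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
> >       }
> >       int stop = 0;
> >       for (int s = 1; s <= nslaves; ++s)
> >           MPI_Send(&stop, 1, MPI_INT, s, 0, MPI_COMM_WORLD);
> >       MPI_Finalize();
> >   }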
> >
> > Execute command:
> > mpiexec.hydra -disable-auto-cleanup -launcher rsh -launcher-exec /usr/bin/rsh -f machines.txt -n 11 mpi_send_sync
> >
> > machines.txt contains:
> > student1-eth1:1
> > student2-eth1:3
> > student3-eth1:20
> >
> > Execution results:
> > 1) When I run the application as written above, everything works.
> > 2) When I try to simulate the failure of one of the slaves by calling abort() on iteration 10 of the slave loop,
> > the master gets a SIGUSR1 signal and fails.
> >
> > Questions:
> > 1) What should I do in order to get an error status from the MPI_Recv call in the master code?
> > 2) If I use MPI_Irecv + MPI_Waitany in the master code, how can I recognize which slave has died? (A rough sketch of what I mean is below.)
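> >
> > What I mean in question 2 is roughly this (sketch only, not real code):
> >
> >   #include <mpi.h>
> >   #include <vector>
> >
> >   // One outstanding MPI_Irecv per slave, then wait for whichever completes first.
> >   void master_wait_any(int nslaves)
> >   {
> >       std::vector<MPI_Request> requests(nslaves);
> >       std::vector<int>         buffers(nslaves);
> >       for (int s = 0; s < nslaves; ++s)      // slaves are ranks 1..nslaves
> >           MPI_Irecv(&buffers[s], 1, MPI_INT, s + 1, MPI_ANY_TAG,
> >                     MPI_COMM_WORLD, &requests[s]);
> >
> >       int index;
> >       MPI_Status status;
> >       int rc = MPI_Waitany(nslaves, &requests[0], &index, &status);
> >       (void)rc;  // question: when a slave dies, which of rc / index / status
> >                  // tells me that it was slave (index + 1) that failed?
> >   }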
> >
> > Anatoly.
> >
> <machines2.txt><mpi_test_incl.h><mpi_send_sync.cpp>


