[mpich-discuss] Question/problem with MPI mvapich hydra.

Anatoly G anatolyrishon at gmail.com
Sun Oct 9 01:45:01 CDT 2011


Thank you for your reply.
I'm sorry it took me so long to answer; I was ill.
I use the system call abort(); exit() is a more normal program termination,
and I was trying to simulate an unexpected program termination.
The attached .h and .cpp files are an example (a condensed sketch of the
slave side follows the execution command below).
Execution command:
mpiexec.hydra -disable-auto-cleanup -launcher rsh -launcher-exec
/usr/bin/rsh -f machines2.txt -n 6 mpi_send_sync 100 1000 3 50 1 logs/test
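
In outline, the slave side does roughly the following (a condensed,
illustrative sketch; the real code is in the attached mpi_send_sync.cpp, and
the rank check, counts, and tag here are placeholders, not the actual
command-line parameters):

#include <mpi.h>
#include <cstdlib>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank != 0) {                       // slave
        for (int i = 0; i < 1000; ++i) {
            if (rank == 1 && i == 10)      // simulate an unexpected crash
                std::abort();              // in one of the slaves
            MPI_Send(&i, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }
        int stop = 0;
        MPI_Recv(&stop, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else {
        // master side omitted here; see the description in the quoted mail
    }

    MPI_Finalize();
    return 0;
}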

Anatoly.

On Tue, Sep 27, 2011 at 9:10 PM, Darius Buntinas <buntinas at mcs.anl.gov> wrote:

> Are you calling MPI_Abort or the system call abort()?  MPI_Abort will abort
> the entire job.  I'm not sure why mpiexec would be sending SIGUSR1, though.
>  Try calling exit() instead and see if that works.
>
> Can you send us a short program that demonstrates this?
>
> Thanks,
> -d
>
>
> On Sep 27, 2011, at 4:09 AM, Anatoly G wrote:
>
> > I am trying to build a master/slave application, which I execute using
> > mpiexec.hydra.
> > The master is always the process with rank 0. All the rest of the
> > processes are slaves (currently about 10 processes).
> > Communication protocol (point to point):
> > Slave:
> > 1) Sends a single integer to the master 1000 times in a loop using the
> >    MPI_Send operation.
> > 2) Waits in MPI_Recv to receive a single integer from the master.
> > 3) Executes MPI_Finalize().
> >
> > Master (a condensed sketch follows the list below):
> > 1) The master is initialized with MPI_Errhandler_set(MPI_COMM_WORLD,
> >    MPI_ERRORS_RETURN).
> > 2) The master iterates over a cyclic buffer of slave ranks and listens to
> >    each of them with an MPI_Recv on that slave's rank.
> >    The loop is performed 1000 * (number of slaves) times.
> > 3) After the end of the loop, the master sends the integer 0 to each
> >    slave using the MPI_Send operation.
> > 4) Executes MPI_Finalize().
> >
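> > In rough outline, the master loop is something like this (condensed and
> > illustrative; variable names are placeholders, not the real code):
> >
> > void run_master(int nprocs)
> > {
> >     MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
> >
> >     int nslaves = nprocs - 1;
> >     for (int i = 0; i < 1000 * nslaves; ++i) {
> >         int src = 1 + (i % nslaves);    // cycle over the slave ranks
> >         int value = 0;
> >         int rc = MPI_Recv(&value, 1, MPI_INT, src, 0, MPI_COMM_WORLD,
> >                           MPI_STATUS_IGNORE);
> >         if (rc != MPI_SUCCESS) {
> >             // this is where I expect to see the failure reported
> >         }
> >     }
> >     for (int s = 1; s <= nslaves; ++s) {
> >         int stop = 0;                   // 0 tells the slave to finish
> >         MPI_Send(&stop, 1, MPI_INT, s, 0, MPI_COMM_WORLD);
> >     }
> >     MPI_Finalize();
> > }
> >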
> > Purpose of the application: tolerance to process failures.
> > On the failure of some of the slaves, continue working with the rest of
> > them.
> >
> > Execution command:
> > mpiexec.hydra -disable-auto-cleanup -launcher rsh -launcher-exec
> /usr/bin/rsh -f machines.txt -n 11 mpi_send_sync
> >
> > machines.txt contains:
> > student1-eth1:1
> > student2-eth1:3
> > student3-eth1:20
> >
> > Execution results:
> > 1) When I run the application as written above, everything works.
> > 2) Then I try to simulate the failure of one of the slaves by calling the
> >    abort() function on iteration 10 of the slave loop.
> >    As a result, the master gets a SIGUSR1 signal and fails.
> >
> > Questions:
> > 1) What should I do in order to get an error status from the MPI_Recv
> >    call in the master code?
> > 2) If I use MPI_Irecv + MPI_Waitany in the master code, how can I
> >    recognize which slave has died? (A sketch of what I mean follows
> >    below.)
> >
> > Anatoly.
> >
>
> _______________________________________________
> mpich-discuss mailing list     mpich-discuss at mcs.anl.gov
> To manage subscription options or unsubscribe:
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>
-------------- next part --------------
student1-eth1:1
student2-eth1:2
student3-eth1:5
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mpi_test_incl.h
Type: text/x-chdr
Size: 7231 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20111009/3368a652/attachment.h>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mpi_send_sync.cpp
Type: text/x-c++src
Size: 3444 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20111009/3368a652/attachment.cpp>

