[MPICH] MPI_Recv error

Sharon Lin shirokiryu at gmail.com
Tue Aug 28 08:54:01 CDT 2007


Hello,

I'm new to MPICH2 and am trying to write an MPI program to do some volume
rendering. The master process broadcasts the rendering parameters (viewing
matrices, image dimensions, volume dimensions, etc.) to the slaves. Each
slave then renders some rows of the image, chosen by its process ID, and
sends its partially rendered image back to the master. However, when I run
the program with two processes on two computers, I get an error in MPI_Recv
while the master is trying to retrieve the partial images from the slave.
A stripped-down sketch of the pattern is included further down. Here's the
error output:

rank 1 in job 5  pcvaa13_36159   caused collective abort of all ranks
  exit status of rank 1: killed by signal 11
[cli_0]: aborting job:
Fatal error in MPI_Recv: Other MPI error, error stack:
MPI_Recv(186).............................: MPI_Recv(buf=0x7fffffd5fb90,
count=66048, MPI_UNSIGNED_CHAR, src=MPI_ANY_SOURCE, tag=MPI_ANY_TAG,
MPI_COMM_WORLD, status=0x7fffffd80300) failed
MPIDI_CH3_Progress_wait(212)..............: an error occurred while handling
an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(413):
MPIDU_Socki_handle_read(633)..............: connection failure
(set=0,sock=1,errno=104:Connection reset by peer)

The crash happens before the slave has had a chance to finish rendering and
send anything back, so could timing be an issue? I also notice the log says
rank 1 was killed by signal 11 (a segfault), which makes me wonder whether
the slave process itself is crashing and taking down the connection that
the master's MPI_Recv is waiting on. Any thoughts on how to fix this?
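
In case it's useful, here's a stripped-down sketch of the communication
pattern I described (everything concrete here, such as the byte-per-pixel
image, the two ints standing in for the parameters, and the memset in place
of real rendering, is a placeholder, not my actual code):

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {            /* need at least one slave */
        MPI_Finalize();
        return 1;
    }

    /* Master broadcasts the rendering parameters; two ints stand in
     * for the viewing matrices, image dims, volume dims, etc. */
    int params[2] = { 256, 256 };          /* width, height */
    MPI_Bcast(params, 2, MPI_INT, 0, MPI_COMM_WORLD);

    int width = params[0], height = params[1];
    int nslaves = size - 1;
    int rows = height / nslaves;           /* rows per slave (even split assumed) */
    int chunk = rows * width;              /* one byte per pixel for simplicity */

    if (rank != 0) {
        /* Slave: "render" my rows (placeholder fill) and send them
         * back, tagged with my rank. */
        unsigned char *partial = malloc(chunk);
        memset(partial, rank, chunk);
        MPI_Send(partial, chunk, MPI_UNSIGNED_CHAR, 0, rank, MPI_COMM_WORLD);
        free(partial);
    } else {
        /* Master: receive one partial image per slave, in whatever
         * order they arrive, and place each by its source rank. */
        unsigned char *image = malloc((size_t)width * height);
        unsigned char *buf   = malloc(chunk);
        MPI_Status status;
        for (int i = 0; i < nslaves; i++) {
            MPI_Recv(buf, chunk, MPI_UNSIGNED_CHAR, MPI_ANY_SOURCE,
                     MPI_ANY_TAG, MPI_COMM_WORLD, &status);
            int src = status.MPI_SOURCE;   /* which slave sent this */
            memcpy(image + (size_t)(src - 1) * chunk, buf, chunk);
        }
        free(buf);
        free(image);
    }

    MPI_Finalize();
    return 0;
}

The MPI_ANY_SOURCE / MPI_ANY_TAG receive matches what the error stack above
shows my real code doing.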

Thanks,
Sharon