[mpich-discuss] MPI Application Hangs Mysteriously After 5-6 Hours' Run

Arthur Wu r4726 at yahoo.com
Fri Apr 11 08:11:40 CDT 2008


Hi There,

We have an MPI application developed with MPICH2
version 1.0.6 on MS Windows. The application runs
Monte Carlo simulations in parallel using MPI. We need
to run it every night on many different data sets, and
a full run usually takes 7-8 hours to finish. The
issue we are experiencing is as follows.
 
Our job often hangs after running for 5 or 6 hours on
some data set (not always the same one), but if we kill
the job and rerun the application on just the data set
where it hung, everything seems to be OK. We also found
that when the application hangs, every one of its
processes, each running on a different machine, uses
0% CPU (it looks like a deadlock). Our MPI application
is very simple: we basically use MPI_Bcast to
distribute the work and then MPI_Reduce to collect the
results. From our own application logging, we found
that the deadlock happens during MPI_Reduce, but we
don't know why.
 
I wonder if MPICH2 has some other logging capability
so that we can look into this issue further. Any help
and insight will be appreciated.
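
For reference, here is a minimal sketch of the kind of
per-process tracing we have in mind around MPI_Reduce.
The helper names (format_trace, trace_event), the log
format, and the step counter are hypothetical
illustrations, not MPICH2 facilities and not our exact
code. The MPI call appears only in a comment so that the
sketch compiles without mpi.h. Each line is flushed
immediately so that the last entry survives a hang:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper (not part of MPICH2): format one trace record.
   Returns the number of characters written, or a negative value on error. */
int format_trace(char *buf, size_t len, int rank, long step, const char *phase)
{
    return snprintf(buf, len, "rank=%d step=%ld phase=%s", rank, step, phase);
}

/* Hypothetical helper: write a timestamped trace record to the log and
   flush it at once, so the record is on disk even if the process hangs
   in the very next call. Returns 0 on success, nonzero on failure. */
int trace_event(FILE *log, int rank, long step, const char *phase)
{
    char line[128];

    assert(log != NULL && phase != NULL);
    if (format_trace(line, sizeof line, rank, step, phase) < 0)
        return -1;
    if (fprintf(log, "%ld %s\n", (long)time(NULL), line) < 0)
        return -1;
    return fflush(log);
}

/* Intended use inside the MPI program (assumed variable names; shown as
   a comment so this sketch builds without mpi.h):

     trace_event(log, rank, step, "reduce-enter");
     MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
     trace_event(log, rank, step, "reduce-exit");
*/
```

With one log file per process, comparing the last line
of each file would show which ranks entered MPI_Reduce
at a given step and which never reached it.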

Thanks.

Richard & Arthur

