[mpich-discuss] MPI Application Hangs Mysteriously After 5-6 Hours' Run
jayesh at mcs.anl.gov
Fri Apr 11 09:14:54 CDT 2008
Can you provide us with your MPI application (or a test app that
shows the problem)? If not, please provide the following info:
# Which Reduce operation(s) do you perform on the data set?
# What is the datatype on which you perform the Reduce operation?
# Do you see the hang if you run your app on the same data set for 6-7 hrs?
# Do you find the expected results on each worker process before the
hanging MPI_Reduce() call?
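For the last question, here is a minimal sketch in C of a check each worker
could run on its local buffer just before calling MPI_Reduce(). The names
check_reduce_input and reduce_input_status are hypothetical and not part of
MPICH2. The sketch assumes the reduced data is an array of doubles and that
every rank is supposed to contribute the same number of elements.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical status codes for this sketch; not part of MPI or MPICH2. */
enum reduce_input_status {
    REDUCE_INPUT_OK = 0,
    REDUCE_INPUT_NULL = 1,       /* buffer pointer is NULL but count > 0 */
    REDUCE_INPUT_BAD_COUNT = 2,  /* local count differs from the agreed count */
    REDUCE_INPUT_NOT_FINITE = 3  /* buffer contains NaN or +/-Inf */
};

/*
 * Validate the local contribution to a reduction.
 * buf            local result buffer that would be passed to MPI_Reduce()
 * count          number of elements this rank is about to contribute
 * expected_count number of elements every rank is supposed to contribute
 */
enum reduce_input_status check_reduce_input(const double *buf, int count,
                                            int expected_count)
{
    int i;

    /* A negative expected count is a programming error in the caller. */
    assert(expected_count >= 0);

    if (count != expected_count) {
        return REDUCE_INPUT_BAD_COUNT;
    }
    if (count > 0 && buf == NULL) {
        return REDUCE_INPUT_NULL;
    }
    for (i = 0; i < count; i++) {
        if (!isfinite(buf[i])) {
            return REDUCE_INPUT_NOT_FINITE;
        }
    }
    return REDUCE_INPUT_OK;
}
```

Logging the returned status together with the process rank right before
MPI_Reduce() would show whether any worker enters the collective with a
different count or with bad data, either of which can make a collective
misbehave.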
From: owner-mpich-discuss at mcs.anl.gov
[mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of Arthur Wu
Sent: Friday, April 11, 2008 8:12 AM
To: Mpich MPI
Subject: [mpich-discuss] MPI Application Hangs Mysteriously After 5-6 Hours'
We have an MPI application developed using MPICH2 version 1.0.6 on MS
Windows. The application basically runs a Monte Carlo simulation in
parallel using MPI. We need to run the application every night on many
different data sets, and the whole run usually takes 7-8 hours to
finish. The following is the issue we are experiencing.
Our job often hangs after running for 5 or 6 hours on some data set (not
always the same one), but if we kill the job and rerun our application on
just the data set where it hung, everything seems to be OK. We also
found that when our application hangs, each of its processes, running on
different machines, consumes 0 CPU (it looks like a deadlock). Our MPI
application is very simple: we basically use MPI_Bcast to distribute the
work and then use MPI_Reduce to collect the results. From our own
application logging, we found that the deadlock happens during
MPI_Reduce, but we don't know why.
I wonder if MPICH2 has some other logging capability so that we can look
into this issue further. Any help and insight will be appreciated.
Richard & Arthur