[mpich-discuss] mpich2-1.4 works on one host, fails on multiple

Anthony Chan chan at mcs.anl.gov
Fri Dec 9 12:59:21 CST 2011


Since you are getting a segfault, another likely possibility is that
your Fortran code is hitting the thread's stack limit (Fortran code
tends to use a larger stack).  If that is the case, check your Fortran
compiler/runtime documentation for how to enlarge the stack limit, or
use the heap instead (i.e. use allocate()).
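
For example, a large local array can be moved from the stack to the
heap by making it allocatable.  A minimal sketch (the array name and
size here are made up):

    ! before: a large automatic array is placed on the thread's stack
    !   real :: work(4096,4096)

    ! after: an allocatable array lives on the heap
    real, allocatable :: work(:,:)
    integer :: ierr

    allocate(work(4096,4096), stat=ierr)
    if (ierr /= 0) stop 'allocate failed'
    ! ... use work ...
    deallocate(work)

Depending on your compiler, the OMP_STACKSIZE environment variable
(for the OpenMP worker threads) and "ulimit -s" (for the main thread)
may also let you enlarge the limits instead.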

A.Chan

----- Original Message -----
> Another possibility is that there is some problem with the OpenMP
> compiler/runtime on that system.  Try running an OpenMP-only program
> (no MPI at all), for example something like the sketch below.
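> 
> A minimal OpenMP-only test (a sketch; the file name omp_test.f90 is
> made up):
> 
>     program omp_test
>       use omp_lib
>       implicit none
>       integer :: tid
>       !$omp parallel private(tid)
>       tid = omp_get_thread_num()
>       print *, 'hello from thread', tid
>       !$omp end parallel
>     end program omp_test
> 
> Build it with something like "gfortran -fopenmp omp_test.f90" (or your
> compiler's equivalent) and run it on the problem nodes; if this
> segfaults on its own, MPI is not involved at all.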
> 
> Rajeev
> 
> On Dec 9, 2011, at 12:14 PM, Anthony Chan wrote:
> 
> >
> > Is your program making any MPI calls from within OpenMP pragmas?
> > If it does, you need to use MPI_Init_thread with MPI_THREAD_MULTIPLE.
> > If not, a correct MPI program should still use MPI_Init_thread with
> > MPI_THREAD_FUNNELED.  Not sure if this is the cause of the problem,
> > but you should at least use the correct MPI_Init call for your
> > threaded program.
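> >
> > For example, a minimal sketch of the threaded init (assumes the rest
> > of the program already has "use mpi" or include 'mpif.h'):
> >
> >     integer :: provided, ierr
> >
> >     ! MPI_THREAD_FUNNELED: only the main thread makes MPI calls.
> >     ! Request MPI_THREAD_MULTIPLE instead if MPI is called inside
> >     ! OpenMP regions.
> >     call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
> >     if (provided < MPI_THREAD_FUNNELED) then
> >        print *, 'requested thread level not available, got', provided
> >     end if
> >
> > This replaces the existing call to MPI_Init.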
> >
> > A.Chan
> >
> > ----- Original Message -----
> >> Sorry, one further detail that I've just discovered.  The program is
> >> an MPI/OpenMP hybrid program.  As a hunch, I commented out all the
> >> OpenMP calls (the include statement, omp_set_num_threads, !$OMP DO,
> >> etc.) and tried to re-run the problem across several nodes, and it
> >> worked perfectly.  Somehow the inclusion of the OpenMP code, when
> >> running across several nodes, is causing corruption that manifests
> >> itself in the MPI_INIT call, the first real call in the program
> >> after variable initialization and the include statements.  I don't
> >> know whether this error is due to OpenMP, MPICH2, or something else.
> >> Hopefully this helps.
> >>
> >> Thanks,
> >> Jack
> >>
> >> -----Original Message-----
> >> From: mpich-discuss-bounces at mcs.anl.gov
> >> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Rajeev
> >> Thakur
> >> Sent: Friday, December 09, 2011 10:23 AM
> >> To: mpich-discuss at mcs.anl.gov
> >> Subject: Re: [mpich-discuss] mpich2-1.4 works on one host, fails on
> >> multiple
> >>
> >> Can you try running the hellow.f example from examples/f77?
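> >>
> >> For example (assuming the mpich2 compiler wrappers are in your PATH
> >> and "machines" lists both nodes):
> >>
> >>     cd examples/f77
> >>     mpif77 -o hellow hellow.f
> >>     mpiexec -machinefile machines -np 2 ./hellow
> >>
> >> If that also fails across nodes, the problem is in the setup rather
> >> than in your code.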
> >>
> >> Rajeev
> >>
> >> On Dec 9, 2011, at 10:40 AM, Jack D. Galloway wrote:
> >>
> >>> I have an mpich2 error that I have run into a wall trying to debug.
> >>> When I run on a given node, either the headnode or a slave node, as
> >>> long as the machinefile lists only that same node (i.e. ssh to
> >>> "c302" and have the machines file list only "c302"), I get no
> >>> problems at all and the code runs to completion just fine.  If I
> >>> try to run across any nodes, though, I get the crash shown below.
> >>> I think the error may even be some architectural setup issue, but
> >>> I'm fairly stuck regarding continued debugging.  I ran the code
> >>> through the "ddd" debugger and it crashes on the first line of the
> >>> program on the remote node (it's a Fortran program, and crashes on
> >>> the first line simply naming the program, i.e. 'program laplace'),
> >>> and crashes on the first "step" in the ddd window pertaining to the
> >>> instance running on the headnode, which spawned the mpiexec job,
> >>> saying:
> >>>
> >>>
> >>> Program received signal SIGINT, Interrupt.
> >>>
> >>> 0x00002aaaaaee4920 in __read_nocancel () from
> >>> /lib64/libpthread.so.0
> >>>
> >>> (gdb) step
> >>>
> >>>
> >>> I've pretty well exhausted my troubleshooting on this, and any help
> >>> would be greatly appreciated.  We're running Ubuntu 10.04 (Lucid
> >>> Lynx) with mpich2-1.4.1p1.  Feel free to ask any questions or offer
> >>> some troubleshooting tips.  Thanks,
> >>>
> >>> ~Jack
> >>>
> >>> Error when running code:
> >>>
> >>> galloway at tebow:~/Flow3D/hybrid-test$ mpiexec -machinefile machines -np 2 -print-all-exitcodes ./mpich_debug_exec
> >>>
> >>> =====================================================================================
> >>> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> >>> = EXIT CODE: 11
> >>> = CLEANING UP REMAINING PROCESSES
> >>> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> >>> =====================================================================================
> >>> [proxy:0:0 at tebow] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:928): assert (!closed) failed
> >>> [proxy:0:0 at tebow] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
> >>> [proxy:0:0 at tebow] main (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
> >>> [mpiexec at tebow] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:70): one of the processes terminated badly; aborting
> >>> [mpiexec at tebow] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
> >>> [mpiexec at tebow] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:191): launcher returned error waiting for completion
> >>> [mpiexec at tebow] main (./ui/mpich/mpiexec.c:405): process manager error waiting for completion
> >>> galloway at tebow:~/Flow3D/hybrid-test$

