[mpich-discuss] mpich2-1.4 works on one host, fails on multiple

Anthony Chan chan at mcs.anl.gov
Fri Dec 9 14:43:21 CST 2011


I should stress this point: if you don't call MPI_Init_thread in your
thread-funneled OpenMP program, you are relying on MPI implementation
details in your code, even if your code runs fine on all available
platforms. There is no reason not to use MPI_Init_thread unless your
program is intended to run on some old MPI platforms where the
MPI_THREAD_XXX levels are not defined.
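
As a minimal sketch of what that looks like in C (purely illustrative;
the Fortran form just adds the ierr argument, as in the call quoted
further down in this thread):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank;

        /* Tell MPI that only the thread that called MPI_Init_thread
           (the OpenMP master) will make MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

        /* The implementation reports the level it can actually give. */
        if (provided < MPI_THREAD_FUNNELED) {
            fprintf(stderr, "MPI_THREAD_FUNNELED not supported\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("rank %d got thread level %d\n", rank, provided);

        MPI_Finalize();
        return 0;
    }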

A.Chan

----- Original Message -----
> Strictly speaking, for correctness you should call MPI_Init_thread
> if your MPI program is threaded, i.e. you should tell MPI what thread
> level you are using. AFAIK, not many MPI implementations (if any)
> actually distinguish all the levels; most implementations only want
> to know whether you are using MPI_THREAD_MULTIPLE.
> 
> A.Chan
> 
> ----- Original Message -----
> > This is a related but slightly different question:
> >
> > If I have a mixed MPI-OpenMP code in which all MPI calls are outside
> > of the
> > parallel regions created by OpenMP, is it still necessary to call
> > MPI_INIT_THREAD in place of MPI_INIT? Essentially the master thread
> > will
> > be doing all MPI calls.
> >
> > I did not think this was the case, so can someone please confirm
> > that this is required? If yes, why? The last part is just for
> > curiosity's sake.
> >
> > Thanks!
> >
> > --rr
> > Please note: email transcribed by speech recognition; it may have
> > unintended words in some cases, especially when I fail to do a good
> > job of proofreading!
> >
> >
> > --On Friday, December 09, 2011 12:58 PM -0700 "Jack D. Galloway"
> > <jackg at lanl.gov> wrote:
> >
> > > All, thank you for your help and suggestions. I was pinned to using a
> > > pre-compiled proprietary code initially for troubleshooting, but
> > > eventually got a test case, and the suggestion about "call
> > > MPI_INIT_THREAD(MPI_THREAD_FUNNELED, IMPI_prov, ierr)" was a correct
> > > one; no MPI calls within OpenMP pragmas were occurring. Secondly,
> > > through this I found that some of the OMP directives were not
> > > compiling in correctly ("undefined reference to omp_get_num_procs"
> > > and a couple of others), which keyed me off that something wasn't
> > > linking in correctly. I found a suggestion that said:
> > >
> > > the right library to use is libgomp.so.1. So, in order to compile it
> > > you have to use, e.g.: "gcc -lgomp helloworld.c -o helloworld"
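> > >
> > > For reference, a helloworld.c along those lines might look like the
> > > sketch below (purely illustrative, not the file from the quoted
> > > suggestion; note that gcc's -fopenmp flag is the more usual choice,
> > > since it enables the OpenMP pragmas and links libgomp, whereas
> > > -lgomp alone only resolves the omp_* library calls):
> > >
> > >     #include <stdio.h>
> > >     #include <omp.h>
> > >
> > >     int main(void)
> > >     {
> > >         /* Runtime-library calls like these are what produce
> > >            "undefined reference to omp_get_num_procs" when the
> > >            OpenMP runtime is not linked in. */
> > >         printf("procs = %d, max threads = %d\n",
> > >                omp_get_num_procs(), omp_get_max_threads());
> > >
> > >         /* Without -fopenmp this pragma is simply ignored. */
> > >         #pragma omp parallel
> > >         printf("hello from thread %d\n", omp_get_thread_num());
> > >
> > >         return 0;
> > >     }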
> > >
> > > Including both the MPI_INIT_THREAD call and the -lgomp link from that
> > > suggestion solved the problem, and I got the program running just
> > > fine. Sorry for blowing up here and having it not be mpich2 related,
> > > other than a faulty MPI call. Perhaps it will be beneficial to
> > > someone making the same error in the future. Thanks so much for your
> > > suggestions and help.
> > >
> > > ~Jack
> > >
> > > -----Original Message-----
> > > From: mpich-discuss-bounces at mcs.anl.gov
> > > [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Anthony
> > > Chan
> > > Sent: Friday, December 09, 2011 11:15 AM
> > > To: mpich-discuss at mcs.anl.gov
> > > Subject: Re: [mpich-discuss] mpich2-1.4 works on one host, fails
> > > on
> > > multiple
> > >
> > >
> > > Is your program making any MPI calls within OpenMP pragmas?
> > > If it does, you need to use MPI_Init_thread with MPI_THREAD_MULTIPLE.
> > > If not, a correct MPI program should still use MPI_Init_thread
> > > with MPI_THREAD_FUNNELED. Not sure if this is the cause of the
> > > problem, but you should at least use the correct MPI_Init call for
> > > your threaded program.
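> > >
> > > As a sketch of the distinction (illustrative only, not code from this
> > > thread): the MPI_Comm_rank call inside the parallel region below is
> > > what forces MPI_THREAD_MULTIPLE; if every MPI call stays outside the
> > > parallel regions, MPI_THREAD_FUNNELED is enough.
> > >
> > >     #include <mpi.h>
> > >     #include <omp.h>
> > >     #include <stdio.h>
> > >
> > >     int main(int argc, char **argv)
> > >     {
> > >         int provided;
> > >
> > >         /* MULTIPLE is requested only because MPI is called inside
> > >            the parallel region below. */
> > >         MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
> > >
> > >         #pragma omp parallel
> > >         {
> > >             int rank;
> > >             MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* MPI call from every thread */
> > >             printf("thread %d on rank %d\n", omp_get_thread_num(), rank);
> > >         }
> > >
> > >         MPI_Finalize();
> > >         return 0;
> > >     }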
> > >
> > > A.Chan
> > >
> > > ----- Original Message -----
> > >> Sorry, one further detail that I've just discovered. The program is
> > >> an MPI/OpenMP hybrid program. As a hunch, I commented out all the
> > >> OpenMP calls (the include statement, omp_set_num_threads, !$OMP DO,
> > >> etc.), tried to re-run the problem across several nodes, and it
> > >> worked out perfectly.
> > >> Somehow the inclusion of OpenMP stuff, when trying to run across
> > >> several
> > >> nodes, is causing corruption that manifests itself in the
> > >> MPI_INIT
> > >> call, the
> > >> first real call in the program after variable initialization and
> > >> include
> > >> statements. I don't know if this error is due to OpenMP or
> > >> MPICH2,
> > >> or
> > >> something else. Hopefully this helps.
> > >>
> > >> Thanks,
> > >> Jack
> > >>
> > >> -----Original Message-----
> > >> From: mpich-discuss-bounces at mcs.anl.gov
> > >> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Rajeev
> > >> Thakur
> > >> Sent: Friday, December 09, 2011 10:23 AM
> > >> To: mpich-discuss at mcs.anl.gov
> > >> Subject: Re: [mpich-discuss] mpich2-1.4 works on one host, fails
> > >> on
> > >> multiple
> > >>
> > >> Can you try running the hellow.f example from examples/f77?
> > >>
> > >> Rajeev
> > >>
> > >> On Dec 9, 2011, at 10:40 AM, Jack D. Galloway wrote:
> > >>
> > >> > I have an mpich2 error that I have run into a wall trying to debug.
> > >> > When I run on a given node, either the headnode or a slave node, as
> > >> > long as the machinefile only has that same node in it (i.e. ssh to
> > >> > "c302" and have the machines file list only "c302"), I get no
> > >> > problems at all and the code runs to completion just fine. If I try
> > >> > to run across any nodes though, I get a crashing error that is
> > >> > given below. I think the error may even be some architectural setup
> > >> > issue, but I'm fairly stuck regarding continued debugging. I ran
> > >> > the code through the "ddd" debugger and it crashes on the first
> > >> > line of the program on the cross node (it's a Fortran program, and
> > >> > crashes on the first line simply naming the program, 'program
> > >> > laplace'), and crashes on the first "step" in the ddd debugger in
> > >> > the window pertaining to the instance running on the headnode,
> > >> > which spawned the mpiexec job, saying:
> > >> >
> > >> >
> > >> > Program received signal SIGINT, Interrupt.
> > >> >
> > >> > 0x00002aaaaaee4920 in __read_nocancel () from
> > >> > /lib64/libpthread.so.0
> > >> >
> > >> > (gdb) step
> > >> >
> > >> >
> > >> > I've pretty well exhausted my troubleshooting on this, and any help
> > >> > would be greatly appreciated. We're running Ubuntu 10.04 (Lucid
> > >> > Lynx) with mpich2-1.4.1p1. Feel free to ask any questions or offer
> > >> > some troubleshooting tips. Thanks,
> > >> >
> > >> > ~Jack
> > >> >
> > >> > Error when running code:
> > >> >
> > >> > galloway at tebow:~/Flow3D/hybrid-test$ mpiexec -machinefile machines -np 2 -print-all-exitcodes ./mpich_debug_exec
> > >> >
> > >> >
> > >> > =====================================================================================
> > >> > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> > >> > = EXIT CODE: 11
> > >> > = CLEANING UP REMAINING PROCESSES
> > >> > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> > >> >
> > >> > =====================================================================================
> > >> > [proxy:0:0 at tebow] HYD_pmcd_pmip_control_cmd_cb
> > >> (./pm/pmiserv/pmip_cb.c:928): assert (!closed) failed
> > >> > [proxy:0:0 at tebow] HYDT_dmxu_poll_wait_for_event
> > >> (./tools/demux/demux_poll.c:77): callback returned error status
> > >> > [proxy:0:0 at tebow] main (./pm/pmiserv/pmip.c:226): demux engine
> > >> > error
> > >> waiting for event
> > >> > [mpiexec at tebow] HYDT_bscu_wait_for_completion
> > >> (./tools/bootstrap/utils/bscu_wait.c:70): one of the processes
> > >> terminated
> > >> badly; aborting
> > >> > [mpiexec at tebow] HYDT_bsci_wait_for_completion
> > >> (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error
> > >> waiting for
> > >> completion
> > >> > [mpiexec at tebow] HYD_pmci_wait_for_completion
> > >> (./pm/pmiserv/pmiserv_pmci.c:191): launcher returned error
> > >> waiting
> > >> for
> > >> completion
> > >> > [mpiexec at tebow] main (./ui/mpich/mpiexec.c:405): process
> > >> > manager
> > >> > error
> > >> waiting for completion
> > >> > galloway at tebow:~/Flow3D/hybrid-test$


More information about the mpich-discuss mailing list