[mpich-discuss] mpich2-1.4 works on one host, fails on multiple
Anthony Chan
chan at mcs.anl.gov
Fri Dec 9 14:28:39 CST 2011
Strictly speaking, for correctness you should call MPI_Init_thread
if your MPI program is threaded, i.e. you should tell MPI what thread
level you need. AFAIK, few MPI implementations (if any) actually
distinguish all the levels; most only care whether you are using
MPI_THREAD_MULTIPLE.
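A minimal Fortran sketch of that initialization (the program and
variable names are illustrative; the check against the granted level
is the important part):

```fortran
program hybrid_init
  use mpi
  implicit none
  integer :: provided, ierr

  ! Threads exist, but only the main thread will make MPI calls.
  call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)

  ! The library may grant less than was requested; check before relying on it.
  if (provided < MPI_THREAD_FUNNELED) then
     print *, 'MPI library does not provide MPI_THREAD_FUNNELED'
     call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
  end if

  ! ... OpenMP parallel regions here, with MPI calls only outside them ...

  call MPI_Finalize(ierr)
end program hybrid_init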
A.Chan
----- Original Message -----
> This is a related but slightly different question:
>
> If I have a mixed MPI-OpenMP code in which all MPI calls are outside
> the parallel regions created by OpenMP, is it still necessary to call
> MPI_INIT_THREAD in place of MPI_INIT? Essentially the master thread
> will be doing all MPI calls.
>
> I did not think this was the case, so can someone please confirm
> whether this is required? If yes, why? The last part is just for
> curiosity's sake.
>
> Thanks!
>
> --rr
> Please Note: Email transcribed by speech recognition; may have
> unintended words in some cases, especially when I fail to do a good
> job of proofreading!
>
>
> --On Friday, December 09, 2011 12:58 PM -0700 "Jack D. Galloway"
> <jackg at lanl.gov> wrote:
>
> > All, thank you for your help and suggestions. I was pinned to using
> > a pre-compiled proprietary code initially for troubleshooting, but
> > eventually got a test case, and the suggestion to "call
> > MPI_INIT_THREAD(MPI_THREAD_FUNNELED, IMPI_prov, ierr)" was a
> > correct one; no MPI calls were occurring within OpenMP regions.
> > Secondly, through this I found that some of the OMP directives were
> > not compiling correctly ("undefined reference to omp_get_num_procs"
> > and a couple of others), which keyed me off that something wasn't
> > linking in correctly. I found a suggestion that said:
> >
> > the right library to use is libgomp.so.1. So, in order to compile
> > it you have to use, e.g.: "gcc -lgomp helloworld.c -o helloworld"
> >
> > Which, when including both the MPI_INIT_THREAD call and the -lgomp
> > link flag, solved the problem, and I got the program running just
> > fine. Sorry for blowing up here and having it not be mpich2
> > related, other than a faulty MPI call. Perhaps it will be
> > beneficial to someone making the same error in the future. Thanks
> > so much for your suggestions and help.
> >
> > ~Jack
> >
> > -----Original Message-----
> > From: mpich-discuss-bounces at mcs.anl.gov
> > [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Anthony Chan
> > Sent: Friday, December 09, 2011 11:15 AM
> > To: mpich-discuss at mcs.anl.gov
> > Subject: Re: [mpich-discuss] mpich2-1.4 works on one host, fails on
> > multiple
> >
> >
> > Is your program making any MPI calls within OpenMP pragmas?
> > If it does, you need to use MPI_Init_thread with
> > MPI_THREAD_MULTIPLE. If not, a correct MPI program should still use
> > MPI_Init_thread with MPI_THREAD_FUNNELED. Not sure if this is the
> > cause of the problem, but you should at least use the correct
> > MPI_Init call for your threaded program.
> >
> > A.Chan
> >
> > ----- Original Message -----
> >> Sorry, one further detail that I've just discovered. The program
> >> is an MPI/OpenMP hybrid program. As a hunch, I commented out all
> >> the OpenMP calls (the include statement, omp_set_num_threads,
> >> !$OMP DO, etc.) and tried to re-run the problem across several
> >> nodes, and it worked perfectly. Somehow the inclusion of the
> >> OpenMP code, when running across several nodes, is causing
> >> corruption that manifests itself in the MPI_INIT call, the first
> >> real call in the program after variable initialization and include
> >> statements. I don't know if this error is due to OpenMP or MPICH2,
> >> or something else. Hopefully this helps.
> >>
> >> Thanks,
> >> Jack
> >>
> >> -----Original Message-----
> >> From: mpich-discuss-bounces at mcs.anl.gov
> >> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Rajeev
> >> Thakur
> >> Sent: Friday, December 09, 2011 10:23 AM
> >> To: mpich-discuss at mcs.anl.gov
> >> Subject: Re: [mpich-discuss] mpich2-1.4 works on one host, fails on
> >> multiple
> >>
> >> Can you try running the hellow.f example from examples/f77?
> >>
> >> Rajeev
> >>
> >> On Dec 9, 2011, at 10:40 AM, Jack D. Galloway wrote:
> >>
> >> > I have an mpich2 error that I have run into a wall trying to
> >> > debug. When I run on a given node, either the headnode or a
> >> > slave node, as long as the machinefile only has that same node
> >> > in the file (i.e. ssh to "c302" and have the machines file only
> >> > list "c302"), I get no problems at all and the code runs to
> >> > completion just fine. If I try to run across any nodes, though,
> >> > I get the crashing error given below. I think the error may even
> >> > be some architectural setup, but I'm fairly stuck regarding
> >> > continued debugging. I ran the code through the "ddd" debugger
> >> > and it crashes on the first line of the program on the cross
> >> > node (it's a Fortran program, and crashes on the first line
> >> > simply naming the program, 'program laplace'), and crashes on
> >> > the first "step" in the ddd window pertaining to the instance
> >> > running on the headnode, which spawned the mpiexec job, saying:
> >> >
> >> >
> >> > Program received signal SIGINT, Interrupt.
> >> > 0x00002aaaaaee4920 in __read_nocancel () from /lib64/libpthread.so.0
> >> > (gdb) step
> >> >
> >> >
> >> > I've pretty well exhausted my troubleshooting on this, and any
> >> > help would be greatly appreciated. We're running Ubuntu 10.04
> >> > (Lucid Lynx) with mpich2-1.4.1p1. Feel free to ask any questions
> >> > or offer some troubleshooting tips. Thanks,
> >> >
> >> > ~Jack
> >> >
> >> > Error when running code:
> >> >
> >> > galloway at tebow:~/Flow3D/hybrid-test$ mpiexec -machinefile
> >> > machines -np 2 -print-all-exitcodes ./mpich_debug_exec
> >> >
> >> >
> >> > ============================================================================
> >> > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> >> > = EXIT CODE: 11
> >> > = CLEANING UP REMAINING PROCESSES
> >> > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> >> > ============================================================================
> >> > [proxy:0:0 at tebow] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:928): assert (!closed) failed
> >> > [proxy:0:0 at tebow] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
> >> > [proxy:0:0 at tebow] main (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
> >> > [mpiexec at tebow] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:70): one of the processes terminated badly; aborting
> >> > [mpiexec at tebow] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
> >> > [mpiexec at tebow] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:191): launcher returned error waiting for completion
> >> > [mpiexec at tebow] main (./ui/mpich/mpiexec.c:405): process manager error waiting for completion
> >> > galloway at tebow:~/Flow3D/hybrid-test$
> >> > _______________________________________________
> >> > mpich-discuss mailing list mpich-discuss at mcs.anl.gov
> >> > To manage subscription options or unsubscribe:
> >> > https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
> >>