[mpich-discuss] mpich2-1.4 works on one host, fails on multiple
Raghu Reddy
rreddy at psc.edu
Fri Dec 9 14:16:41 CST 2011
This is a related but slightly different question:
If I have a mixed MPI-OpenMP code in which all MPI calls are outside of the
parallel regions created by OpenMP, is it still necessary to call
MPI_INIT_THREAD in place of MPI_INIT? Essentially, the master thread will
be doing all MPI calls.
I did not think this was the case, so can someone please confirm whether this
is required? If yes, why? The last part is just for my curiosity's sake.
Thanks!
--rr
Please Note: Email transcribed by speech recognition; May have unintended
words in some cases, especially when I fail to do a good job of
proofreading!
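
For reference, the pattern being asked about (all MPI calls made by the master thread, outside any OpenMP parallel region) might look like the following Fortran sketch. It uses the MPI_INIT_THREAD call with MPI_THREAD_FUNNELED that is suggested in the quoted thread. The program name, the variable names, the toy summation loop, and the abort-on-insufficient-level check are illustrative assumptions and are not taken from the original code.

```fortran
program hybrid_sketch
  use omp_lib                 ! OpenMP runtime routines (omp_get_max_threads, ...)
  implicit none
  include 'mpif.h'            ! MPI constants such as MPI_THREAD_FUNNELED

  integer :: ierr, provided, rank, nprocs, i
  double precision :: local_sum, global_sum

  ! Request FUNNELED: the process may be multithreaded, but only the
  ! thread that initialized MPI (the master thread) makes MPI calls.
  call MPI_INIT_THREAD(MPI_THREAD_FUNNELED, provided, ierr)

  ! The library reports the level it actually grants. The thread-level
  ! constants are ordered, so a plain comparison is sufficient.
  if (provided < MPI_THREAD_FUNNELED) then
     print *, 'MPI_THREAD_FUNNELED not available, provided =', provided
     call MPI_ABORT(MPI_COMM_WORLD, 1, ierr)
  end if

  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

  ! Threaded compute region: no MPI calls in here.
  local_sum = 0.0d0
  !$omp parallel do reduction(+:local_sum)
  do i = 1, 1000
     local_sum = local_sum + dble(i)
  end do
  !$omp end parallel do

  ! Back on the master thread, outside any parallel region.
  call MPI_REDUCE(local_sum, global_sum, 1, MPI_DOUBLE_PRECISION, MPI_SUM, &
                  0, MPI_COMM_WORLD, ierr)

  if (rank == 0) then
     print *, 'ranks =', nprocs, ' max threads =', omp_get_max_threads(), &
              ' global sum =', global_sum
  end if

  call MPI_FINALIZE(ierr)
end program hybrid_sketch
```

Assuming the MPI compiler wrapper sits on top of gfortran, this could be built with something like "mpif90 -fopenmp hybrid_sketch.f90 -o hybrid_sketch". With the GNU compilers, -fopenmp both enables the !$omp directives and links libgomp, so a separate -lgomp should not be needed.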
--On Friday, December 09, 2011 12:58 PM -0700 "Jack D. Galloway"
<jackg at lanl.gov> wrote:
> All, thank you for your help and suggestions. I was initially pinned to
> using a pre-compiled proprietary code for troubleshooting, but eventually
> got a test case, and the suggestion about "call
> MPI_INIT_THREAD(MPI_THREAD_FUNNELED,IMPI_prov,ierr)" was a correct one; no
> MPI calls were occurring within OpenMP constructs. Secondly, through this I
> found that some of the OMP directives were not compiling correctly (I got
> "undefined reference to omp_get_num_procs" and a couple of others), which
> tipped me off that something wasn't being linked in correctly. I found a
> suggestion that said:
>
> the right library to use is libgomp.so.1. So, in order to compile it you
> have to use, e.g: "gcc -lgomp helloworld.c -o helloworld"
>
> Including both the MPI_INIT_THREAD call and the -lgomp link flag
> solved the problem, and I got the program running just fine. Sorry for
> blowing up the list with something that is not mpich2 related, other than
> a faulty MPI call. Perhaps it will be beneficial to someone making the
> same error in the future. Thanks so much for your suggestions and help.
>
> ~Jack
>
> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov
> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Anthony Chan
> Sent: Friday, December 09, 2011 11:15 AM
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] mpich2-1.4 works on one host, fails on
> multiple
>
>
> Is your program making any MPI calls within OpenMP pragmas?
> If it does, you need to use MPI_Init_thread with MPI_THREAD_MULTIPLE.
> If not, a correct MPI program should still use MPI_Init_thread
> with MPI_THREAD_FUNNELED. Not sure if this is the cause of the problem,
> but you should at least use the correct MPI_Init call for your threaded
> program.
>
> A.Chan
>
> ----- Original Message -----
>> Sorry, one further detail that I've just discovered. The program is a
>> MPI/OpenMP hybrid program. As a hunch, I commented out all the OpenMP
>> calls, include statement, omp_set_num_threads, !$OMP DO etc... and tried
>> to re-run the problem across several nodes and it worked out perfectly.
>> Somehow the inclusion of OpenMP stuff, when trying to run across several
>> nodes, is causing corruption that manifests itself in the MPI_INIT call,
>> the first real call in the program after variable initialization and
>> include statements. I don't know if this error is due to OpenMP or
>> MPICH2, or something else. Hopefully this helps.
>>
>> Thanks,
>> Jack
>>
>> -----Original Message-----
>> From: mpich-discuss-bounces at mcs.anl.gov
>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Rajeev Thakur
>> Sent: Friday, December 09, 2011 10:23 AM
>> To: mpich-discuss at mcs.anl.gov
>> Subject: Re: [mpich-discuss] mpich2-1.4 works on one host, fails on
>> multiple
>>
>> Can you try running the hellow.f example from examples/f77.
>>
>> Rajeev
>>
>> On Dec 9, 2011, at 10:40 AM, Jack D. Galloway wrote:
>>
>> > I have an mpich2 error that I have run into a wall trying to debug.
>> > When I run on a given node, either the headnode or a slave node, as
>> > long as the machinefile only has that same node in the file (i.e. ssh
>> > to "c302" and have the machines file only have "c302" listed), I get no
>> > problems at all and the code runs to completion just fine. If I try to
>> > run across any nodes though, I get a crashing error that is given
>> > below. I think the error may even be some architectural setup, but I'm
>> > fairly stuck regarding continued debugging. I ran the code through the
>> > "ddd" debugger and it crashes on the first line of the program on the
>> > cross node (it's a Fortran program, and it crashes on the first line,
>> > which simply names the program: 'program laplace'), and crashes on the
>> > first "step" in the ddd debugger in the window pertaining to the
>> > instance running on the headnode, which spawned the mpiexec job,
>> > saying:
>> >
>> >
>> > Program received signal SIGINT, Interrupt.
>> >
>> > 0x00002aaaaaee4920 in __read_nocancel () from /lib64/libpthread.so.0
>> >
>> > (gdb) step
>> >
>> >
>> > I've pretty well exhausted my troubleshooting on this, and any help
>> > would be greatly appreciated. We're running Ubuntu 10.04, Lucid Lynx,
>> > running mpich2-1.4.1p1. Feel free to ask any questions or offer some
>> > troubleshooting tips. Thanks,
>> >
>> > ~Jack
>> >
>> > Error when running code:
>> >
>> > galloway at tebow:~/Flow3D/hybrid-test$ mpiexec -machinefile machines -np 2 -print-all-exitcodes ./mpich_debug_exec
>> >
>> >
>> > =====================================================================================
>> > = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>> > = EXIT CODE: 11
>> > = CLEANING UP REMAINING PROCESSES
>> > = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>> > =====================================================================================
>> > [proxy:0:0 at tebow] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:928): assert (!closed) failed
>> > [proxy:0:0 at tebow] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
>> > [proxy:0:0 at tebow] main (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
>> > [mpiexec at tebow] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:70): one of the processes terminated badly; aborting
>> > [mpiexec at tebow] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
>> > [mpiexec at tebow] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:191): launcher returned error waiting for completion
>> > [mpiexec at tebow] main (./ui/mpich/mpiexec.c:405): process manager error waiting for completion
>> > galloway at tebow:~/Flow3D/hybrid-test$
>> > _______________________________________________
>> > mpich-discuss mailing list mpich-discuss at mcs.anl.gov
>> > To manage subscription options or unsubscribe:
>> > https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss