[ExM Users] turbine call does not return when invoking from Galaxy
Michael Wilde
wilde at anl.gov
Thu May 8 21:15:33 CDT 2014
I paste below my email thread with Pavan on this problem. What you are
seeing looks very similar to what I saw, except that in my case mpiexec
returned an error; in your case it's hanging?
- Mike
-------- Original Message --------
Subject: Re: Need help with MPICH2 launch problem
Date: Fri, 30 Nov 2012 02:53:11 -0600
From: Pavan Balaji <balaji at mcs.anl.gov>
To: Michael Wilde <wilde at mcs.anl.gov>
CC: Justin M Wozniak <wozniak at mcs.anl.gov>, Mihael Hategan
<hategan at mcs.anl.gov>
Ok, I figured out a way to do this. I've committed a few changes into
the mpich trunk to cover this case. The test program you sent works
correctly now. Can you try the latest trunk version (>= r10701)?
-- Pavan
On 11/30/2012 02:17 AM US Central Time, Pavan Balaji wrote:
> Hi Mike,
>
> I'm able to reproduce this error.
>
> The issue is that, within mpiexec, I don't have a good way of knowing
> that STDIN has been closed (in which case I should disable stdin
> forwarding). I can check STDIN_FILENO, but once stdin is closed there's
> no guarantee that it still points to stdin: a new socket can end up
> with the same file descriptor.
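> For example, a quick shell illustration of the descriptor reuse:
>
>   # Close stdin, then list the process's open fds: the directory fd that
>   # ls opens typically lands on 0, so fd 0 no longer refers to stdin.
>   $ bash -c 'exec 0<&-; ls -l /proc/self/fd'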
>
> I'll look into this some more tomorrow, but I don't see an easy way to
> do this.
>
> -- Pavan
>
> On 11/29/2012 05:43 PM US Central Time, Michael Wilde wrote:
>>
>>> Hydra should automatically detect a closed fd 0; you shouldn't have to
>>> do anything special in Swift for this.
>>
>> I *thought* it did, but only with the mpich2 on login.mcs.anl.gov. But now my latest test is failing there too (I kept distilling it down to a tinier test, so maybe I perturbed something?).
>>
>> It seems to fail consistently on two local mvapich2 installations (Fusion and Midway) and likely on Eureka (I'm unsure which mpich I used there).
>>
>>> Do you have a test program that
>>> demonstrates this problem?
>>
>> Yes, attached.
>>
>> tar xf mpifd0.tar
>> cd mpifd0.tar
>> ./runme.sh
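>> The harness essentially does this (a rough sketch; the actual script may
>> differ in details):
>>
>>   echo "testing with fd 0 open:"
>>   mpiexec -n 2 ./tinympi
>>
>>   echo "Testing with fd 0 closed:"
>>   mpiexec -n 2 ./tinympi 0<&-   # 0<&- closes stdin for mpiexec, triggering the error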
>>
>> Should give these results (all on login hosts, not cluster nodes):
>>
>> === login.mcs.anl.gov
>>
>> mcs$ which mpiexec
>> /usr/bin/mpiexec
>> mcs$ mpiexec --version | head -3
>> HYDRA build details:
>> Version: 1.4.1
>> Release Date: Wed Aug 24 14:40:04 CDT 2011
>> mcs$ ./runme.sh
>>
>> testing with fd 0 open:
>>
>> ./tinympi: rank is 1
>> ./tinympi: rank is 0
>>
>> Testing with fd 0 closed:
>>
>> [mpiexec at login4] HYDT_dmx_register_fd (./tools/demux/demux.c:98): registering duplicate fd 0
>> [mpiexec at login4] HYD_pmcd_pmiserv_proxy_init_cb (./pm/pmiserv/pmiserv_cb.c:561): unable to register fd
>> [mpiexec at login4] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
>> [mpiexec at login4] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:181): error waiting for event
>> [mpiexec at login4] main (./ui/mpich/mpiexec.c:405): process manager error waiting for completion
>> mcs$
>>
>> === midway.rcc.uchicago.edu:
>>
>> mid$ which mpiexec
>> /software/mvapich2-1.8-el6-x86_64/bin/mpiexec
>> mid$ ./runme.sh
>>
>> testing with fd 0 open:
>>
>> ./tinympi: rank is 0
>> ./tinympi: rank is 1
>>
>> Testing with fd 0 closed:
>>
>> [mpiexec at midway-login1] HYDT_dmx_register_fd (./tools/demux/demux.c:98): registering duplicate fd 0
>> [mpiexec at midway-login1] HYD_pmcd_pmiserv_proxy_init_cb (./pm/pmiserv/pmiserv_cb.c:563): unable to register fd
>> [mpiexec at midway-login1] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
>> [mpiexec at midway-login1] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:181): error waiting for event
>> [mpiexec at midway-login1] main (./ui/mpich/mpiexec.c:405): process manager error waiting for completion
>> mid$
>>
>> === fusion.lcrc.anl.gov:
>>
>> fusion$ which mpiexec
>> /soft/mvapich2/1.4.1-gcc-4.1.2-r2/bin/mpiexec
>> fusion$ ./runme.sh
>>
>> testing with fd 0 open:
>>
>> libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1
>> libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs1
>> ./tinympi: rank is 1
>> ./tinympi: rank is 0
>>
>> Testing with fd 0 closed:
>>
>> [mpiexec at flogin2] HYDT_dmx_register_fd (./tools/demux/demux.c:98): registering duplicate fd 0
>> [mpiexec at flogin2] HYD_pmcd_pmiserv_proxy_init_cb (./pm/pmiserv/pmiserv_cb.c:487): unable to register fd
>> [mpiexec at flogin2] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
>> [mpiexec at flogin2] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:206): error waiting for event
>> [mpiexec at flogin2] main (./ui/mpich/mpiexec.c:404): process manager error waiting for completion
>> fusion$
>>
>>
>>
>>
>>
>>>
>>> -- Pavan
>>>
>>> On 11/29/2012 04:44 PM US Central Time, Michael Wilde wrote:
>>>> Hi Pavan,
>>>>
>>>> I think I've resolved this. It does indeed seem to be due to the fact
>>>> that Swift was closing stdin (fd 0) in one of its execution modes
>>>> ("coasters"). I thought this was the problem early on, but got
>>>> misled because I had commented out the close in the wrong Swift module.
>>>>
>>>> The problem seems similar to this issue:
>>>>
>>>> http://lists.mcs.anl.gov/pipermail/mpich-discuss/2011-February/009018.html
>>>> http://trac.mpich.org/projects/mpich/ticket/1029
>>>>
>>>> When I leave fd 0 open to /dev/null before exec()'ing mpiexec (as
>>>> you suggest in the posting above), mpiexec works fine.
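>>>> In shell terms the workaround is roughly this (just a sketch; the real
>>>> change is in Swift's worker code):
>>>>
>>>>   # Reopen fd 0 on /dev/null so the exec'd mpiexec inherits a valid stdin.
>>>>   exec 0</dev/null
>>>>   exec mpiexec -n 2 ./tinympi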
>>>>
>>>> Without this (i.e., with fd 0 closed) it fails on Eureka and on the
>>>> UChicago and Fusion mvapich2 installations, but works with the mpich2
>>>> running on the MCS login hosts (which looks like Hydra 1.4.1).
>>>>
>>>> I think we're OK now, with the above fix in Swift, but I still need
>>>> to test on Eureka when it comes back up.
>>>>
>>>> - Mike
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: "Pavan Balaji" <balaji at mcs.anl.gov>
>>>>> To: "Michael Wilde" <wilde at mcs.anl.gov>
>>>>> Cc: "Justin M Wozniak" <wozniak at mcs.anl.gov>, "Sheri Mickelson"
>>>>> <mickelso at mcs.anl.gov>, "Jayesh Krishna"
>>>>> <jayesh at mcs.anl.gov>, "Robert Jacob" <jacob at mcs.anl.gov>, "Mihael
>>>>> Hategan" <hategan at mcs.anl.gov>
>>>>> Sent: Monday, November 26, 2012 12:37:00 PM
>>>>> Subject: Re: Need help with MPICH2 launch problem
>>>>> Hi Mike,
>>>>>
>>>>> I'm at Argonne this week. Did you want to sync up on this?
>>>>>
>>>>> -- Pavan
>>>>>
>>>>> On 11/11/2012 09:53 AM US Central Time, Michael Wilde wrote:
>>>>>> OK, thanks, Pavan - this can wait till you get back. I'll try to
>>>>>> turn on some MPICH debugging and isolate it to a simple test case
>>>>>> that you or anyone on the MPI team can reproduce.
>>>>>>
>>>>>> - Mike
>>>>>>
>>>>>>
>>>>>> ----- Original Message -----
>>>>>>> From: "Pavan Balaji" <balaji at mcs.anl.gov>
>>>>>>> To: "Michael Wilde" <wilde at mcs.anl.gov>
>>>>>>> Cc: "Justin M Wozniak" <wozniak at mcs.anl.gov>, "Sheri Mickelson"
>>>>>>> <mickelso at mcs.anl.gov>, "Jayesh Krishna"
>>>>>>> <jayesh at mcs.anl.gov>, "Robert Jacob" <jacob at mcs.anl.gov>, "Mihael
>>>>>>> Hategan" <hategan at mcs.anl.gov>
>>>>>>> Sent: Saturday, November 10, 2012 3:56:32 PM
>>>>>>> Subject: Re: Need help with MPICH2 launch problem
>>>>>>> Hi Mike,
>>>>>>>
>>>>>>> Unfortunately, this is too little information for me to tell what's
>>>>>>> going on. Perhaps it's best to sit together and look at it. I'm at SC.
>>>>>>>
>>>>>>> -- Pavan
>>>>>>>
>>>>>>> On 11/09/2012 02:14 PM, Michael Wilde wrote:
>>>>>>>>
>>>>>>>> Hi Pavan,
>>>>>>>>
>>>>>>>> We are having a problem running the MPI app "Pagoda" (a parallel
>>>>>>>> netcdf processing tool) under Swift for Rob Jacob's ParVis
>>>>>>>> project.
>>>>>>>>
>>>>>>>> It seems that when we do an MPICH2 launch on Eureka under Swift's
>>>>>>>> worker-node agents, we break something that MPICH is expecting in
>>>>>>>> the environment (in the general sense) in which mpiexec is being run.
>>>>>>>>
>>>>>>>> I've checked the env variables and *think* that we are passing
>>>>>>>> everything on to mpiexec without damage. I'm more suspicious of
>>>>>>>> having damaged a file descriptor or done something to break ssh
>>>>>>>> connectivity, etc.
>>>>>>>>
>>>>>>>> What I get from MPICH2 mpiexec is:
>>>>>>>>
>>>>>>>>>> [mpiexec at vs37] HYDT_dmx_register_fd (./tools/demux/demux.c:82): registering duplicate fd 0
>>>>>>>>>> [mpiexec at vs37] HYDT_bscd_external_launch_procs (./tools/bootstrap/external/external_launch.c:295): demux returned error registering fd
>>>>>>>>>> [mpiexec at vs37] HYDT_bsci_launch_procs (./tools/bootstrap/src/bsci_launch.c:21): bootstrap device returned error while launching processes
>>>>>>>>>> [mpiexec at vs37] HYD_pmci_launch_procs (./pm/pmiserv/pmiserv_pmci.c:298): bootstrap server cannot launch processes
>>>>>>>>>> [mpiexec at vs37] main (./ui/mpich/mpiexec.c:298): process manager returned error launching processes
>>>>>>>>
>>>>>>>> I think the failure is related to something that either Perl, or our
>>>>>>>> worker.pl Perl code (which is Swift's worker-node execution agent,
>>>>>>>> sort of like Swift's "Hydra"), is doing when it forks the job. I
>>>>>>>> thought the culprit was that worker.pl closes STDIN, but commenting
>>>>>>>> out that close doesn't correct the problem.
>>>>>>>>
>>>>>>>> I'll continue to isolate the difference between running mpiexec
>>>>>>>> under Swift vs. running it under, say, multiple levels of nested
>>>>>>>> shell on the "lead" worker node, but if you have any suggestions,
>>>>>>>> Pavan, as to what the above error is telling us that MPI doesn't
>>>>>>>> like, that would be a great help!
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> - Mike
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Pavan Balaji
>>>>>>> http://www.mcs.anl.gov/~balaji
>>>>>>
>>>>>
>>>>> --
>>>>> Pavan Balaji
>>>>> http://www.mcs.anl.gov/~balaji
>>>>
>>>
>>> --
>>> Pavan Balaji
>>> http://www.mcs.anl.gov/~balaji
>>
>
--
Pavan Balaji
http://www.mcs.anl.gov/~balaji
On 5/8/14, 8:33 PM, Michael Wilde wrote:
> Perhaps try redirecting stdin on the $turbine call with "< /dev/null"
>
> Some older versions of mpiexec had a bug when stdin was not a tty or
> pty.
>
> Not sure if that's the case here, but it's easy to try.
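> e.g., applied to your invocation:
>
>   $turbine -V -n $n $wdir/script.tcl "${swiftargs}" < /dev/null   # gives mpiexec a valid stdin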
>
> - Mike
>
> On 5/8/14, 5:10 PM, Ketan Maheshwari wrote:
>> Hi,
>>
>> Trying to invoke turbine from Galaxy as follows:
>>
>> $turbine -V -n $n $wdir/script.tcl "${swiftargs}"
>>
>> Each of the variables is set in previous lines.
>>
>> The process tree shows the mpiexec processes as defunct:
>>   PID TTY          TIME CMD
>>   389 ?        00:00:00 sshd
>>   390 pts/39   00:00:00 bash
>>   804 pts/39   00:00:00 sh
>>   990 pts/39   00:00:38 python
>>  1189 pts/39   00:00:00 sh
>>  1190 pts/39   00:00:00 bash
>>  1330 pts/39   00:00:00 turbine
>>  1332 pts/39   00:00:00 mpiexec <defunct>
>>  1547 pts/39   00:00:00 sh
>>  1548 pts/39   00:00:00 bash
>>  1683 pts/39   00:00:00 turbine
>>  1686 pts/39   00:00:00 mpiexec <defunct>
>>  2046 pts/39   00:00:00 sh
>>  2047 pts/39   00:00:00 bash
>>  2299 pts/39   00:00:00 turbine
>>  2302 pts/39   00:00:00 mpiexec <defunct>
>>
>>
>> Not sure how to debug this. The same call works outside of Galaxy.
>>
>> Any suggestions?
>>
>> Thanks,
>> Ketan
>>
>>
>>
>
>
>
--
Michael Wilde
Mathematics and Computer Science, Argonne National Laboratory
Computation Institute, The University of Chicago