[Swift-devel] Need advice on running multi-node coaster MPI jobs

Justin M Wozniak wozniak at mcs.anl.gov
Mon Nov 5 11:09:07 CST 2012


Do you know why they need OpenMPI?  If we are going to launch the
application using JETS, we should be able to use a user-installed MPICH,
regardless of what the scheduler or the system uses by default.
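
To make "use a user MPICH" concrete, here is a minimal Python sketch of
putting a private MPICH install ahead of the system MPI on PATH before
launching. The function names and the install prefix are hypothetical, and
it assumes the launcher finds mpiexec through PATH; this is not how JETS or
worker.pl actually does it.

```python
import os
import shutil


def env_with_user_mpi(mpi_prefix, base_env=None):
    """Return a copy of the environment with <mpi_prefix>/bin first on PATH."""
    env = dict(os.environ if base_env is None else base_env)
    bin_dir = os.path.join(mpi_prefix, "bin")
    old_path = env.get("PATH", "")
    # Prepend so the user's mpiexec shadows any system default.
    env["PATH"] = bin_dir + os.pathsep + old_path if old_path else bin_dir
    return env


def resolve_mpiexec(env):
    """Report which mpiexec would be picked up with this environment, or None."""
    return shutil.which("mpiexec", path=env["PATH"])


if __name__ == "__main__":
    # Hypothetical prefix; substitute wherever the user MPICH is installed.
    env = env_with_user_mpi(os.path.expanduser("~/mpich-install"))
    print("mpiexec resolves to:", resolve_mpiexec(env))
```

Printing the resolved path before a run is a cheap way to confirm which MPI
a job will really use.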

Would getting this working on Gadzooks as a first step still help?

On 11/5/2012 10:31 AM, Michael Wilde wrote:
> Hi All,
>
> My urgent task this week is to get Pagoda MPI runs working for the ParVis project; they are due Friday.
>
> I'm stuck getting MPI multi-node jobs running under coasters.
>
> I'm trying two approaches: David's approach with standard, fixed-size coaster jobs, and Justin's JETS-like mechanism.
>
> From everything I can see, David's approach should work. When I test it in a simple shell, even with nested shells, it works, but when mpiexec is run from a coaster worker, even a local one, it fails.
>
> What I get from MPICH2 mpiexec is:
>
> [mpiexec@vs37] HYDT_dmx_register_fd (./tools/demux/demux.c:82): registering duplicate fd 0
> [mpiexec@vs37] HYDT_bscd_external_launch_procs (./tools/bootstrap/external/external_launch.c:295): demux returned error registering fd
> [mpiexec@vs37] HYDT_bsci_launch_procs (./tools/bootstrap/src/bsci_launch.c:21): bootstrap device returned error while launching processes
> [mpiexec@vs37] HYD_pmci_launch_procs (./pm/pmiserv/pmiserv_pmci.c:298): bootstrap server cannot launch processes
> [mpiexec@vs37] main (./ui/mpich/mpiexec.c:298): process manager returned error launching processes
>
>
> I think the failure is related to something that either Perl itself or the worker.pl code does when it forks the job. I thought the culprit was that worker.pl closes STDIN, but commenting out that close doesn't correct the problem.
>
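
One guess, not verified here: the "registering duplicate fd 0" message
suggests mpiexec ended up with descriptor 0 in use for something other than
stdin, which can happen if fd 0 is closed anywhere along the way before the
exec, not only at the one close you commented out. The Python sketch below
shows the general idea of starting a child with stdin attached to /dev/null
so that fd 0 is always open. The helper name is hypothetical and the probe
command is only a stand-in for mpiexec; whether this cures the Hydra error
is an assumption.

```python
import subprocess
import sys


def run_with_null_stdin(cmd):
    """Run cmd with fd 0 attached to /dev/null rather than closed or inherited."""
    return subprocess.run(
        cmd,
        stdin=subprocess.DEVNULL,   # child always sees a valid, open fd 0
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )


if __name__ == "__main__":
    # Stand-in for mpiexec: os.fstat(0) raises if fd 0 is not open in the child.
    probe = [sys.executable, "-c",
             "import os; os.fstat(0); print('fd 0 is open')"]
    result = run_with_null_stdin(probe)
    print(result.returncode, result.stdout.strip())
```

The shell equivalent would be to run the mpiexec command with "< /dev/null"
appended, which may be quicker to try from the worker's job wrapper.
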
> Any suggestions?
>
> Now, I just discovered that ParVis needs these runs done on an ORNL PBS system called "lens" with OpenMPI. So I'm shifting my tests to PADS OpenMPI for now, which may work or fail differently.
>
> But regardless we should get this mechanism working.
>
> I'll also go back and re-test with Justin's multi-node coaster MPI launching mechanism. But that won't work for OpenMPI, so for now I need to stick with making David's mechanism work multi-node on OpenMPI.
>
> I've requested a 4-node PADS reservation for testing.
>
> - Mike
>



