[Swift-devel] Need advice on running multi-node coaster MPI jobs

Michael Wilde wilde at mcs.anl.gov
Mon Nov 5 12:37:56 CST 2012


They need OpenMPI because they are running on the ORNL "lens" system; that's what is used there, and it's what all their tools are built for. I don't have access to that system yet, so I can't readily explore other options there.

But if David's experience with OpenMPI bears out, that might work in our favor.

- Mike

----- Original Message -----
> From: "Justin M Wozniak" <wozniak at mcs.anl.gov>
> To: swift-devel at ci.uchicago.edu
> Sent: Monday, November 5, 2012 11:09:07 AM
> Subject: Re: [Swift-devel] Need advice on running multi-node coaster MPI jobs
> Do you know why they need OpenMPI? If we are going to launch the
> application using JETS, we should be able to use a user MPICH,
> regardless of what the scheduler or whatever uses by default.
> 
> Would getting this working on Gadzooks as a first step still help?
> 
> On 11/5/2012 10:31 AM, Michael Wilde wrote:
> > Hi All,
> >
> > Urgent for me this week is to get Pagoda MPI runs for the ParVis
> > project, due Friday.
> >
> > I'm stuck getting MPI multi-node jobs running under coasters.
> >
> > I'm trying two approaches: David's approach with standard, fixed-size
> > coaster jobs, and Justin's JETS-like mechanism.
> >
> > From everything I can see, David's approach should work. When I test
> > this in a simple shell, even with nested shells, it works, but when
> > mpiexec is run from a coaster worker, even a local one, it fails.
> >
> > What I get from MPICH2 mpiexec is:
> >
> > [mpiexec at vs37] HYDT_dmx_register_fd (./tools/demux/demux.c:82):
> > registering duplicate fd 0
> > [mpiexec at vs37] HYDT_bscd_external_launch_procs
> > (./tools/bootstrap/external/external_launch.c:295): demux returned
> > error registering fd
> > [mpiexec at vs37] HYDT_bsci_launch_procs
> > (./tools/bootstrap/src/bsci_launch.c:21): bootstrap device returned
> > error while launching processes
> > [mpiexec at vs37] HYD_pmci_launch_procs
> > (./pm/pmiserv/pmiserv_pmci.c:298): bootstrap server cannot launch
> > processes
> > [mpiexec at vs37] main (./ui/mpich/mpiexec.c:298): process manager
> > returned error launching processes
> >
> >
> > I think the failure is related to something that either Perl itself or
> > the worker.pl code is doing when it forks the job. I thought the
> > culprit was that worker.pl closes STDIN, but commenting out that
> > close doesn't correct the problem.
> >
> > Any suggestions?
> >
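
One way to probe the STDIN theory in isolation is sketched below. It relies only on the POSIX rule that a newly created descriptor gets the lowest free number, so a process started with fd 0 closed will have its first pipe or socket land on fd 0. That would be consistent with the "registering duplicate fd 0" message, but it is an assumption about the cause, not a confirmed diagnosis, and the helper name `pipe_fds_in_child` is made up for this illustration.

```python
import subprocess
import sys

# Child program: optionally close fd 0, then create a pipe and report its fds.
CHILD = """
import os, sys
if sys.argv[1] == "close":
    os.close(0)
r, w = os.pipe()
print(r, w)
"""


def pipe_fds_in_child(close_stdin):
    """Return the (read, write) descriptors a child process gets from os.pipe().

    The child always starts with stdin attached to /dev/null; when
    close_stdin is true it closes fd 0 before creating the pipe.
    """
    out = subprocess.run(
        [sys.executable, "-c", CHILD, "close" if close_stdin else "keep"],
        stdin=subprocess.DEVNULL,
        stdout=subprocess.PIPE,
        check=True,
        text=True,
    ).stdout
    r, w = out.split()
    return int(r), int(w)


if __name__ == "__main__":
    # With fd 0 closed, the pipe's read end reuses descriptor 0.
    print("stdin closed:   ", pipe_fds_in_child(True))
    # With fd 0 held open on /dev/null, the pipe gets higher descriptors.
    print("stdin /dev/null:", pipe_fds_in_child(False))
```

If this is what is happening, anything on the launch path that leaves fd 0 closed would have the same effect as the explicit close in worker.pl, so an experiment worth trying is starting mpiexec with stdin redirected from /dev/null instead of closed. That has not been tested here.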
> > Now, I just discovered that ParVis needs these runs done on an ORNL
> > PBS system called "lens" with OpenMPI. So I'm shifting my tests to
> > PADS OpenMPI for now, which may work or fail differently.
> >
> > But regardless, we should get this mechanism working.
> >
> > I'll also go back and re-test with Justin's multi-node coaster MPI
> > launching mechanism as well. But that won't work for OpenMPI, so for
> > now I need to stick with making David's mechanism work multi-node on
> > OpenMPI.
> >
> > I've requested a 4-node PADS reservation for testing.
> >
> > - Mike
> >
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel

-- 
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory



