[Swift-devel] Need advice on running multi-node coaster MPI jobs

David Kelly davidk at ci.uchicago.edu
Mon Nov 5 10:59:12 CST 2012


Mike,

I have seen a very similar error on Fusion when trying to use MVAPICH2. There, I can run the executable directly from an interactive qsub -I session and it works fine, but when the same executable is run by Swift it fails with errors much like the ones you are seeing. The only workaround I have found is to use OpenMPI, which works consistently for me.
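
Concretely, "use OpenMPI" for me just means pointing the app wrapper that Swift invokes at OpenMPI's mpiexec. A minimal sketch of such a wrapper (the module name, process count, and binary path below are placeholders, not Fusion's actual values):

    #!/bin/bash
    # mpi_app.sh - what Swift/coasters runs as the app executable.
    # Set up an OpenMPI environment, then launch the real MPI binary under mpiexec.
    module load openmpi                  # placeholder: however the site provides OpenMPI
    mpiexec -n 8 /path/to/mpi_app "$@"   # placeholder process count and binary path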

----- Original Message -----
> From: "Michael Wilde" <wilde at mcs.anl.gov>
> To: "swift-devel" <swift-devel at ci.uchicago.edu>
> Sent: Monday, November 5, 2012 10:31:23 AM
> Subject: [Swift-devel] Need advice on running multi-node coaster MPI jobs
> Hi All,
> 
> Urgent for me this week is to get Pagoda MPI runs for the ParVis
> project, due Friday.
> 
> I'm stuck getting multi-node MPI jobs running under coasters.
> 
> I'm trying two approaches: David's, with standard fixed-size coaster
> jobs, and Justin's Jets-like mechanism.
> 
> From everything I can see, David's approach should work. When I test
> this in a simple shell, even with nested shells, it works, but when
> mpiexec is run from a coaster worker, even a local one, it fails.
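> 
> (By "a simple shell" I mean running something like this by hand, with
> the nesting there to mimic how the worker forks things; the binary and
> process count are placeholders:
> 
>   bash -c 'bash -c "mpiexec -n 4 /path/to/pagoda_app"'
> 
> That form works; the identical command fails under a coaster worker.)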
> 
> What I get from MPICH2 mpiexec is:
> 
> [mpiexec@vs37] HYDT_dmx_register_fd (./tools/demux/demux.c:82): registering duplicate fd 0
> [mpiexec@vs37] HYDT_bscd_external_launch_procs (./tools/bootstrap/external/external_launch.c:295): demux returned error registering fd
> [mpiexec@vs37] HYDT_bsci_launch_procs (./tools/bootstrap/src/bsci_launch.c:21): bootstrap device returned error while launching processes
> [mpiexec@vs37] HYD_pmci_launch_procs (./pm/pmiserv/pmiserv_pmci.c:298): bootstrap server cannot launch processes
> [mpiexec@vs37] main (./ui/mpich/mpiexec.c:298): process manager returned error launching processes
> 
> 
> I think the failure is related to something that either perl itself, or
> the worker.pl code, is doing when it forks the job. I thought the
> culprit was that worker.pl closes STDIN, but commenting out that close
> doesn't correct the problem.
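> 
> (A sketch of one more variant of that theory, untested, with the count
> and binary as placeholders: have the app wrapper reopen stdin itself
> before it launches MPI, e.g.
> 
>   # at the top of the app wrapper, before mpiexec starts:
>   exec 0</dev/null    # hand mpiexec an open fd 0 again
>   mpiexec -n 4 /path/to/pagoda_app
> 
> since fd 0 is stdin, and the "registering duplicate fd 0" message looks
> like it is about how the launcher sees stdin at startup.)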
> 
> Any suggestions?
> 
> Now, I just discovered that ParVis needs these runs done on an ORNL
> PBS system called "lens" with OpenMPI. So I'm shifting my tests to
> OpenMPI on PADS for now, which may work or fail differently.
> 
> But regardless we should get this mechanism working.
> 
> I'll also go back and re-test with Justin's multi-node coaster MPI
> launching mechanism as well. But that won't work for OpenMPI, so for
> now I need to stick with making David's mechanism work multi-node on
> OpenMPI.
> 
> I've requested a 4-node PADS reservation for testing.
> 
> - Mike
> 
> --
> Michael Wilde
> Computation Institute, University of Chicago
> Mathematics and Computer Science Division
> Argonne National Laboratory
> 
> _______________________________________________
> Swift-devel mailing list
> Swift-devel at ci.uchicago.edu
> https://lists.ci.uchicago.edu/cgi-bin/mailman/listinfo/swift-devel


