[Swift-devel] Need advice on running multi-node coaster MPI jobs
Michael Wilde
wilde at mcs.anl.gov
Mon Nov 5 10:31:23 CST 2012
Hi All,
My urgent task this week is to get Pagoda MPI runs working for the ParVis project; they are due Friday.
I'm stuck getting multi-node MPI jobs running under coasters.
I'm trying two approaches: David's approach with standard, fixed-size coaster jobs, and Justin's Jets-like mechanism.
From everything I can see, David's approach should work. When I test it in a simple shell, even with nested shells, it works, but when mpiexec is run from a coaster worker, even a local one, it fails.
What I get from MPICH2 mpiexec is:
[mpiexec@vs37] HYDT_dmx_register_fd (./tools/demux/demux.c:82): registering duplicate fd 0
[mpiexec@vs37] HYDT_bscd_external_launch_procs (./tools/bootstrap/external/external_launch.c:295): demux returned error registering fd
[mpiexec@vs37] HYDT_bsci_launch_procs (./tools/bootstrap/src/bsci_launch.c:21): bootstrap device returned error while launching processes
[mpiexec@vs37] HYD_pmci_launch_procs (./pm/pmiserv/pmiserv_pmci.c:298): bootstrap server cannot launch processes
[mpiexec@vs37] main (./ui/mpich/mpiexec.c:298): process manager returned error launching processes
I think the failure is related to something that either Perl itself or the worker.pl code is doing when it forks the job. I thought the culprit was that worker.pl closes STDIN, but commenting out that close doesn't correct the problem.
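One way to narrow this down is to check what file descriptors 0, 1 and 2 look like in the process that launches mpiexec, since "registering duplicate fd 0" suggests that descriptor 0 was handed to the demux layer more than once. On POSIX systems a newly opened descriptor takes the lowest free number, so if stdin is closed anywhere in the launch chain, a later pipe or socket can end up as fd 0. The Python sketch below is only illustrative (worker.pl is Perl, and the helper names describe_fd, stdio_report and run_with_null_stdin are made up for this example): it reports the state of the standard descriptors and launches a command with stdin explicitly bound to /dev/null. Whether that actually avoids the Hydra error is an untested assumption.

```python
import os
import stat
import subprocess


def describe_fd(fd):
    """Return a short label for a file descriptor, or 'closed' if it is invalid."""
    try:
        mode = os.fstat(fd).st_mode
    except OSError:
        return "closed"
    if stat.S_ISFIFO(mode):
        return "pipe"
    if stat.S_ISSOCK(mode):
        return "socket"
    if stat.S_ISCHR(mode):
        return "char device"
    if stat.S_ISREG(mode):
        return "regular file"
    return "other"


def stdio_report():
    """Describe stdin, stdout and stderr of the current process."""
    return {fd: describe_fd(fd) for fd in (0, 1, 2)}


def run_with_null_stdin(argv):
    """Run argv with fd 0 bound to /dev/null (hypothetical workaround, untested with Hydra)."""
    return subprocess.run(argv, stdin=subprocess.DEVNULL)
```

For example, run_with_null_stdin(["mpiexec", "-n", "4", "./app"]) would start the job with a valid, empty stdin; the mpiexec arguments here are placeholders. Logging stdio_report() from the launching process just before the fork would show whether fd 0 is closed or has been reused for a pipe or socket.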
Any suggestions?
Now, I just discovered that ParVis needs these runs done on an ORNL PBS system called "lens" with OpenMPI. So I'm shifting my tests to PADS OpenMPI for now, which may work or fail differently.
But regardless we should get this mechanism working.
I'll also go back and re-test with Justin's multi-node coaster MPI launching mechanism. But that won't work for OpenMPI, so for now I need to stick with making David's mechanism work multi-node on OpenMPI.
I've requested a 4-node PADS reservation for testing.
- Mike
--
Michael Wilde
Computation Institute, University of Chicago
Mathematics and Computer Science Division
Argonne National Laboratory