[mpich-discuss] Multiple mpds launched by one user on same machine

Volker Amedick vasc at gmx.net
Tue Oct 28 10:40:43 CDT 2008


Hi,

I have an issue with MPICH2 that I cannot resolve on my own. In our computing environment we have a multi-processor compute grid (SMP). Access to the compute grid is controlled by a queueing system called Condor, which also manages the grid's CPU resources. Our IT department allows access to the compute grid only through Condor.

To run a program that uses MPICH2, I submit a shell script to Condor that (1) launches a ring of mpds on the CPUs provided by Condor and (2) launches the MPICH2 program. When the program finishes, the script finishes as well, and Condor then stops the ring of mpds. This works fine as long as I run only one job on the compute grid. When I submit a second job, however, I cannot launch a second ring of mpds on the same computer and have to use the existing ring of the first job. Unfortunately, when the first job finishes, Condor stops its ring of mpds, and the second job fails because it loses its connection to the mpds.
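
For illustration, the job script described above could look roughly like the following. This is only a minimal sketch, not the actual script: the function name run_mpi_job, the host file, and the process counts are placeholders assumed for the example. Only the standard MPICH2 commands mpdboot and mpiexec are used.

```shell
#!/bin/sh
# Sketch of the two-step job script submitted to Condor.
# Assumption: run_mpi_job and all of its arguments are hypothetical
# placeholders, not names taken from the real setup.
#
# usage: run_mpi_job <num_mpds> <hostfile> <num_procs> <program> [args...]
run_mpi_job() {
    num_mpds=$1     # number of mpds to start in the ring
    hostfile=$2     # file listing the hosts handed out by Condor
    num_procs=$3    # number of MPI processes to launch
    shift 3

    # (1) Launch the ring of mpds.
    mpdboot -n "$num_mpds" -f "$hostfile" || return 1

    # (2) Launch the MPICH2 program through that ring.
    mpiexec -n "$num_procs" "$@"
    rc=$?

    # No explicit shutdown here: Condor stops the mpds once the script exits.
    return "$rc"
}
```

A job would then call something like `run_mpi_job 1 mpd.hosts 4 ./my_program`, where the host file and program name are again placeholders.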

To work around this issue, I suggested to IT that an mpd be run as root, so that other users could access root's ring of mpds. IT declined, as this might allow users to bypass Condor. Is there another way for the same user to launch multiple rings of mpds on the same computer? And if so, how does an MPICH2 program know which ring of mpds it belongs to?

Any help is highly appreciated. Thanks in advance,
Volker


