[mpich-discuss] Error when calling mpiexec from within a process

Reuti reuti at staff.uni-marburg.de
Fri Oct 7 11:14:46 CDT 2011


Hi,

On 07.10.2011, at 02:46, Pramod wrote:

> I am trying to write a simple job scheduler using MPI. It schedules a
> bunch of jobs to run in parallel on different hosts. Each job is an
> MPI application (uses mpiexec) that runs on multiple cores of each
> host. A child scheduling process runs on each host and executes the
> parallel job, given to it by the master process, using system(). The
> executable is given; it runs on multiple cores of the host with some
> affinity settings, and I cannot modify it.  I am sure there are other
> ways to build a job scheduler, but what's wrong with this one?
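
(For reference, the setup described above amounts to something like the
following sketch; the job command and process counts are placeholders of
mine and not taken from the attached main.c.)

/* Hypothetical sketch of the pattern described above: an MPI "scheduler"
 * rank launching an independent MPI job via system().  The job command
 * is a made-up placeholder. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each scheduling rank runs one job on its own host. */
    char cmd[256];
    snprintf(cmd, sizeof(cmd), "mpiexec -n 4 ./job_%d", rank);
    int rc = system(cmd);   /* the nested mpiexec, i.e. the problematic call */
    printf("rank %d: job exited with status %d\n", rank, rc);

    MPI_Finalize();
    return 0;
}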

Will all jobs on all nodes then have the same runtime as a consequence?

I would assume that the second mpiexec inherits some of the already
set environment variables and uses this information by accident. If
you unset them (see the sketch after the listing below), it could work.

When you check /proc/12345/environ of a running MPI process, you will
find something like this:

OMPI_MCA_orte_precondition_transports=510befec2b70bcee-945da280a132e0fb
OMPI_MCA_plm=rsh
OMPI_MCA_orte_hnp_uri=1104674816.0;tcp://192.168.151.101:52964
OMPI_MCA_ess=env
OMPI_MCA_orte_ess_jobid=1104674817
OMPI_MCA_orte_ess_vpid=2
OMPI_MCA_orte_ess_num_procs=4
OMPI_MCA_orte_local_daemon_uri=1104674816.1;tcp://192.168.151.70:50363
OMPI_MCA_mpi_yield_when_idle=1
OMPI_MCA_orte_app_num=0
OMPI_UNIVERSE_SIZE=4
OMPI_COMM_WORLD_SIZE=4
OMPI_COMM_WORLD_LOCAL_SIZE=2
OMPI_COMM_WORLD_RANK=2
OMPI_COMM_WORLD_LOCAL_RANK=0
OPAL_OUTPUT_STDERR_FD=17
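
To strip such variables from the C side before the system() call,
something like the following sketch could do (just an illustration from
my side; the prefixes are assumptions, OMPI_ for Open MPI as above and
PMI_/HYDRA_ for MPICH's Hydra, and may differ between versions):

/* Sketch: remove inherited MPI launcher variables from the environment
 * before the nested system("mpiexec ...") call.  The prefixes passed in
 * are assumptions and may need adjusting for your installation. */
#include <stdlib.h>
#include <string.h>

extern char **environ;

static void unset_by_prefix(const char *prefix)
{
    size_t plen = strlen(prefix);
    int again = 1;
    while (again) {
        again = 0;
        for (char **e = environ; *e != NULL; e++) {
            if (strncmp(*e, prefix, plen) == 0) {
                char name[256];
                const char *eq = strchr(*e, '=');
                size_t nlen = eq ? (size_t)(eq - *e) : strlen(*e);
                if (nlen >= sizeof(name)) nlen = sizeof(name) - 1;
                memcpy(name, *e, nlen);
                name[nlen] = '\0';
                unsetenv(name);     /* modifies environ, so rescan */
                again = 1;
                break;
            }
        }
    }
}

You would then call unset_by_prefix("PMI_") and unset_by_prefix("HYDRA_")
(or "OMPI_" for Open MPI) right before system("mpiexec ...").  Whether
clearing the environment alone is enough depends on the implementation.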

The other point is the directory openmpi-sessions-reuti at foobar_0, where
some state is stored. Maybe you need a separate temporary directory to
keep the two apart and have two orted instances running (as it's done with
queuing systems, where this directory is placed in the job-specific
temporary directory provided by the queuing system).
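
As a sketch of that idea (assuming the launcher honours TMPDIR for its
session directory, as Open MPI does; the directory layout here is just
an example of mine):

/* Sketch: point the nested launch at its own temporary directory so its
 * session files (e.g. openmpi-sessions-*) do not collide with the outer
 * job's. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>

static int run_job_isolated(int job_id, const char *cmd)
{
    char tmpdir[128];
    snprintf(tmpdir, sizeof(tmpdir), "/tmp/sched_job_%d", job_id);
    mkdir(tmpdir, 0700);              /* ignore EEXIST for brevity */
    setenv("TMPDIR", tmpdir, 1);      /* nested mpiexec/orted picks this up */
    return system(cmd);
}

Each nested job then keeps its session files apart from the outer one's.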

-- Reuti


> -Pramod
>
> On Thu, Oct 6, 2011 at 4:28 PM, Reuti <reuti at staff.uni-marburg.de>  
> wrote:
>> Hi,
>>
>> On 06.10.2011, at 22:06, Pramod wrote:
>>
>>> Hi,
>>>
>>> I have an application where I need to call mpiexec from within a  
>>> child
>>> process launched by mpiexec. I am using "system()" to call the  
>>> mpiexec
>>> process from the child process.  I am using mpich2-1.4.1 and the  
>>> hydra
>>> process manager. The errors I see are below. I am attaching the  
>>> source
>>> file main.c. Let me know what I am doing wrong here and if you need
>>> more information.
>>>
>>> To compile:
>>>
>>> /home/install/mpich/mpich2-1.4.1/linux_x86_64//bin/mpicc   main.c
>>> -I/home/install/mpich/mpich2-1.4.1/linux_x86_64/include
>>>
>>> When I run the test on multiple nodes I get the following errors:
>>> mpiexec -n 3 -f hosts.list a.out
>>
>> what do you want to achieve in detail? Would you like to use  
>> another hostlist for this call, so that each child decides on its  
>> own where to start grandchild processes?
>>
>> Spawning additional processes within MPI is not an option?
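
(In case that route is of interest: a minimal sketch of what
MPI_Comm_spawn could look like here; the worker executable and the
process count are placeholders of mine.)

/* Sketch of spawning the per-host job through MPI itself instead of a
 * nested mpiexec; "./worker" and the count of 4 are placeholders. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm children;
    int errcodes[4];

    MPI_Init(&argc, &argv);

    /* Start 4 additional processes running ./worker; they share a new
     * intercommunicator with the spawning processes. */
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, &children, errcodes);

    MPI_Comm_disconnect(&children);
    MPI_Finalize();
    return 0;
}
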
>>
>> -- Reuti
>>
>>
>>> [proxy:0:0 at machine3] HYDU_create_process
>>> (/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/utils/launch/launch.c:36):
>>> dup2 error (Bad file descriptor)
>>> [proxy:0:0 at machine3] launch_procs
>>> (/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/pm/pmiserv/pmip_cb.c:751):
>>> create process returned error
>>> [proxy:0:0 at machine3] HYD_pmcd_pmip_control_cmd_cb
>>> (/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/pm/pmiserv/pmip_cb.c:935):
>>> launch_procs returned error
>>> [proxy:0:0 at machine3] HYDT_dmxu_poll_wait_for_event
>>> (/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/tools/demux/demux_poll.c:77):
>>> callback returned error status
>>> [proxy:0:0 at machine3] main
>>> (/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/pm/pmiserv/pmip.c:226):
>>> demux engine error waiting for event
>>> [mpiexec at machine1.abc.com] control_cb
>>> (/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/pm/pmiserv/pmiserv_cb.c:215):
>>> assert (!closed) failed
>>> [mpiexec at machine1.abc.com] HYDT_dmxu_poll_wait_for_event
>>> (/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/tools/demux/demux_poll.c:77):
>>> callback returned error status
>>> [mpiexec at machine1.abc.com] HYD_pmci_wait_for_completion
>>> (/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:181):
>>> error waiting for event
>>> [mpiexec at machine1.abc.com] main
>>> (/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/ui/mpich/mpiexec.c:405):
>>> process manager error waiting for completion
>>>
>>> ------
>>> On a single node I get the following.
>>> mpiexec -n 3 a.out
>>> [proxy:0:0 at machine1.abc.com] [proxy:0:0 at machine1.abc.com] Killed
>>> <main.c>


