[mpich-discuss] Error when calling mpiexec from within a process

Pramod pramodc at gmail.com
Mon Oct 17 12:29:39 CDT 2011


Done!
https://trac.mcs.anl.gov/projects/mpich2/ticket/1539
-pramod

On Sat, Oct 15, 2011 at 2:51 AM, Pavan Balaji <balaji at mcs.anl.gov> wrote:
>
> You specifically need to reset all the PMI_ variables before calling
> mpiexec again.
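>
> For reference, a minimal sketch of clearing the inherited PMI_* variables in
> the child before it calls the nested mpiexec (untested; assumes POSIX
> environ/unsetenv, and the exact variable names depend on the MPICH version):
>
>     /* Sketch: remove every environment variable starting with "PMI_"
>      * before system("mpiexec ...") is called from the MPI child. */
>     #include <stdlib.h>
>     #include <string.h>
>
>     extern char **environ;
>
>     static void unset_prefixed(const char *prefix)
>     {
>         size_t plen = strlen(prefix);
>         char **e;
>         char name[256];
>         int removed = 1;
>
>         while (removed) {       /* unsetenv() reshuffles environ, so rescan */
>             removed = 0;
>             for (e = environ; *e != NULL; e++) {
>                 if (strncmp(*e, prefix, plen) == 0) {
>                     const char *eq = strchr(*e, '=');
>                     size_t n = eq ? (size_t)(eq - *e) : strlen(*e);
>                     if (n >= sizeof(name)) n = sizeof(name) - 1;
>                     memcpy(name, *e, n);
>                     name[n] = '\0';
>                     unsetenv(name);
>                     removed = 1;
>                     break;
>                 }
>             }
>         }
>     }
>
>     static int run_nested(const char *cmd)
>     {
>         unset_prefixed("PMI_");      /* drop the parent's PMI state        */
>         return system(cmd);          /* e.g. "mpiexec -n 2 ./a.out"        */
>     }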
>
> Embedded mpiexecs were supposed to be working at some point, but maybe we
> introduced a bug that caused them to fail. I can look into it. Would you be
> able to create a ticket for it?
>
> https://trac.mcs.anl.gov/projects/mpich2/newticket
>
>  -- Pavan
>
> On 10/12/2011 03:04 PM, Reuti wrote:
>>
>> Am 12.10.2011 um 21:31 schrieb Pramod:
>>
>>> Unsetting those environment variables in the system() call before calling
>>> mpiexec did not really help (I still see the same error). From your
>>> response I understand that calling mpiexec from within an MPI process is
>>> not a common usage model
>>
>> Correct.
>>
>> BTW: I got confused; it seems I'm subscribed to too many lists.
>>
>> I listed the variables for Open MPI, but this is MPICH2. So looking up the
>> correct variables for MPICH2 and resetting them might work.
>>
>> Sorry for the confusion.
>>
>> -- Reuti
>>
>>
>>> and perhaps not expected to work (?). Maybe I
>>> should think of a different approach to solve my problem.
>>>
>>> Thank you,
>>> Pramod
>>>
>>> On Fri, Oct 7, 2011 at 9:14 AM, Reuti<reuti at staff.uni-marburg.de>  wrote:
>>>>
>>>> Hi,
>>>>
>>>> Am 07.10.2011 um 02:46 schrieb Pramod:
>>>>
>>>>> I am trying to write a simple job scheduler using MPI. It schedules a
>>>>> bunch of jobs to run in parallel on different hosts. Each job is an
>>>>> MPI application (uses mpiexec) that runs on multiple cores of each
>>>>> host. A child scheduling process runs on each host and executes the
>>>>> parallel job, given to it by the master process, using system(). The
>>>>> executable is given to me; it runs on multiple cores of the host with
>>>>> some affinity settings, and I cannot modify it. I am sure there are
>>>>> other ways to build a job scheduler, but what's wrong with this?
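>>>>>
>>>>> Roughly, the pattern looks like this (a simplified sketch, not the
>>>>> actual code; the job command lines are made up):
>>>>>
>>>>>     /* Sketch: master rank hands each child scheduler a command line,
>>>>>      * the child runs it with system() and reports the exit status. */
>>>>>     #include <mpi.h>
>>>>>     #include <stdio.h>
>>>>>     #include <stdlib.h>
>>>>>
>>>>>     #define CMD_LEN 512
>>>>>
>>>>>     int main(int argc, char **argv)
>>>>>     {
>>>>>         int rank, size;
>>>>>         MPI_Init(&argc, &argv);
>>>>>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>         MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>>
>>>>>         if (rank == 0) {                      /* master */
>>>>>             char cmd[CMD_LEN];
>>>>>             for (int r = 1; r < size; r++) {
>>>>>                 snprintf(cmd, sizeof(cmd),
>>>>>                          "mpiexec -n 4 ./solver input.%d", r);
>>>>>                 MPI_Send(cmd, CMD_LEN, MPI_CHAR, r, 0, MPI_COMM_WORLD);
>>>>>             }
>>>>>             for (int r = 1; r < size; r++) {
>>>>>                 int status;
>>>>>                 MPI_Recv(&status, 1, MPI_INT, r, 1, MPI_COMM_WORLD,
>>>>>                          MPI_STATUS_IGNORE);
>>>>>                 printf("job on rank %d exited with %d\n", r, status);
>>>>>             }
>>>>>         } else {                              /* child scheduler */
>>>>>             char cmd[CMD_LEN];
>>>>>             MPI_Recv(cmd, CMD_LEN, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
>>>>>                      MPI_STATUS_IGNORE);
>>>>>             int status = system(cmd);         /* the nested mpiexec */
>>>>>             MPI_Send(&status, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
>>>>>         }
>>>>>
>>>>>         MPI_Finalize();
>>>>>         return 0;
>>>>>     }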
>>>>
>>>> will all jobs on all nodes then have the same runtime as a consequence?
>>>>
>>>> I would assume that the second mpiexec inherits some of the already set
>>>> environment variables and uses this information by accident. If you
>>>> unset them, it could work.
>>>>
>>>> When you check /proc/12345/environ of a running MPI process, you will
>>>> find something like this:
>>>>
>>>> OMPI_MCA_orte_precondition_transports=510befec2b70bcee-945da280a132e0fb
>>>> OMPI_MCA_plm=rsh
>>>> OMPI_MCA_orte_hnp_uri=1104674816.0;tcp://192.168.151.101:52964
>>>> OMPI_MCA_ess=env
>>>> OMPI_MCA_orte_ess_jobid=1104674817
>>>> OMPI_MCA_orte_ess_vpid=2
>>>> OMPI_MCA_orte_ess_num_procs=4
>>>> OMPI_MCA_orte_local_daemon_uri=1104674816.1;tcp://192.168.151.70:50363
>>>> OMPI_MCA_mpi_yield_when_idle=1
>>>> OMPI_MCA_orte_app_num=0
>>>> OMPI_UNIVERSE_SIZE=4
>>>> OMPI_COMM_WORLD_SIZE=4
>>>> OMPI_COMM_WORLD_LOCAL_SIZE=2
>>>> OMPI_COMM_WORLD_RANK=2
>>>> OMPI_COMM_WORLD_LOCAL_RANK=0
>>>> OPAL_OUTPUT_STDERR_FD=17
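>>>>
>>>> (A small sketch for inspecting a process's environment the same way from
>>>> code; Linux only, hypothetical helper. Passing no argument reads the
>>>> current process via /proc/self:)
>>>>
>>>>     /* Print MPI-related variables from /proc/<pid>/environ.  Entries in
>>>>      * that file are separated by NUL bytes. */
>>>>     #include <stdio.h>
>>>>     #include <string.h>
>>>>
>>>>     int main(int argc, char **argv)
>>>>     {
>>>>         char path[64], buf[65536];
>>>>         snprintf(path, sizeof(path), "/proc/%s/environ",
>>>>                  argc > 1 ? argv[1] : "self");
>>>>         FILE *f = fopen(path, "r");
>>>>         if (!f) { perror(path); return 1; }
>>>>         size_t n = fread(buf, 1, sizeof(buf) - 1, f);
>>>>         fclose(f);
>>>>         buf[n] = '\0';
>>>>         for (size_t i = 0; i < n; i += strlen(buf + i) + 1)
>>>>             if (strncmp(buf + i, "OMPI_", 5) == 0 ||
>>>>                 strncmp(buf + i, "PMI_", 4) == 0)
>>>>                 puts(buf + i);
>>>>         return 0;
>>>>     }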
>>>>
>>>> The other point is the directory openmpi-sessions-reuti at foobar_0 where
>>>> some things are stored. Maybe you need a separate temporary directory to
>>>> keep the two apart and have two orteds running (like it's done with
>>>> queuing systems, where this directory is placed in the job-specific
>>>> temporary directory provided by the queuing system).
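>>>>
>>>> (Sketch of that idea, under the assumption that the session directory of
>>>> the nested runtime ends up under TMPDIR, as it does for Open MPI by
>>>> default; the path and the command are made up:)
>>>>
>>>>     /* Give the nested launch its own scratch directory so the parent
>>>>      * and the embedded runtime do not share a session directory. */
>>>>     #include <stdio.h>
>>>>     #include <stdlib.h>
>>>>     #include <unistd.h>
>>>>     #include <sys/stat.h>
>>>>
>>>>     int main(void)
>>>>     {
>>>>         char jobdir[256];
>>>>         snprintf(jobdir, sizeof(jobdir), "/tmp/sched_job_%d",
>>>>                  (int)getpid());               /* hypothetical path */
>>>>         mkdir(jobdir, 0700);
>>>>         setenv("TMPDIR", jobdir, 1);
>>>>         /* the nested mpiexec now creates its session dir under jobdir */
>>>>         return system("mpiexec -n 4 ./solver");
>>>>     }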
>>>>
>>>> -- Reuti
>>>>
>>>>
>>>>> -Pramod
>>>>>
>>>>> On Thu, Oct 6, 2011 at 4:28 PM, Reuti<reuti at staff.uni-marburg.de>
>>>>>  wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Am 06.10.2011 um 22:06 schrieb Pramod:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I have an application where I need to call mpiexec from within a child
>>>>>>> process launched by mpiexec. I am using system() to call the mpiexec
>>>>>>> process from the child process. I am using mpich2-1.4.1 and the hydra
>>>>>>> process manager. The errors I see are below. I am attaching the source
>>>>>>> file main.c. Let me know what I am doing wrong here and if you need
>>>>>>> more information.
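>>>>>>>
>>>>>>> (The attached main.c is not reproduced in the archive; a minimal
>>>>>>> reproducer along the lines described might look like this, with the
>>>>>>> nested command line made up:)
>>>>>>>
>>>>>>>     /* Each rank started by the outer mpiexec shells out to a second
>>>>>>>      * mpiexec via system(). */
>>>>>>>     #include <mpi.h>
>>>>>>>     #include <stdio.h>
>>>>>>>     #include <stdlib.h>
>>>>>>>
>>>>>>>     int main(int argc, char **argv)
>>>>>>>     {
>>>>>>>         int rank, rc;
>>>>>>>         MPI_Init(&argc, &argv);
>>>>>>>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>>>         rc = system("mpiexec -n 2 hostname");   /* nested launch */
>>>>>>>         printf("rank %d: nested mpiexec returned %d\n", rank, rc);
>>>>>>>         MPI_Finalize();
>>>>>>>         return 0;
>>>>>>>     }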
>>>>>>>
>>>>>>> To compile:
>>>>>>>
>>>>>>> /home/install/mpich/mpich2-1.4.1/linux_x86_64//bin/mpicc   main.c
>>>>>>> -I/home/install/mpich/mpich2-1.4.1/linux_x86_64/include
>>>>>>>
>>>>>>> When I run the test on multiple nodes I get the following errors:
>>>>>>> mpiexec -n 3 -f hosts.list a.out
>>>>>>
>>>>>> what do you want to achieve in detail? Would you like to use another
>>>>>> host list for this call, so that each child decides on its own where to
>>>>>> start grandchild processes?
>>>>>>
>>>>>> Is spawning additional processes from within MPI not an option?
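>>>>>>
>>>>>> (For reference, the MPI-native alternative would be MPI_Comm_spawn; a
>>>>>> rough sketch, where "./solver" and the "host" placement value are made
>>>>>> up:)
>>>>>>
>>>>>>     /* Spawn the child job from the scheduler process instead of
>>>>>>      * shelling out to a second mpiexec. */
>>>>>>     #include <mpi.h>
>>>>>>
>>>>>>     int main(int argc, char **argv)
>>>>>>     {
>>>>>>         MPI_Comm children;
>>>>>>         MPI_Info info;
>>>>>>
>>>>>>         MPI_Init(&argc, &argv);
>>>>>>         MPI_Info_create(&info);
>>>>>>         MPI_Info_set(info, "host", "machine3");  /* child placement */
>>>>>>         MPI_Comm_spawn("./solver", MPI_ARGV_NULL, 4, info, 0,
>>>>>>                        MPI_COMM_SELF, &children, MPI_ERRCODES_IGNORE);
>>>>>>         MPI_Info_free(&info);
>>>>>>         /* ... talk to the children over the intercommunicator ... */
>>>>>>         MPI_Finalize();
>>>>>>         return 0;
>>>>>>     }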
>>>>>>
>>>>>> -- Reuti
>>>>>>
>>>>>>
>>>>>>> [proxy:0:0 at machine3] HYDU_create_process (/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/utils/launch/launch.c:36): dup2 error (Bad file descriptor)
>>>>>>> [proxy:0:0 at machine3] launch_procs (/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/pm/pmiserv/pmip_cb.c:751): create process returned error
>>>>>>> [proxy:0:0 at machine3] HYD_pmcd_pmip_control_cmd_cb (/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/pm/pmiserv/pmip_cb.c:935): launch_procs returned error
>>>>>>> [proxy:0:0 at machine3] HYDT_dmxu_poll_wait_for_event (/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/tools/demux/demux_poll.c:77): callback returned error status
>>>>>>> [proxy:0:0 at machine3] main (/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/pm/pmiserv/pmip.c:226): demux engine error waiting for event
>>>>>>> [mpiexec at machine1.abc.com] control_cb (/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/pm/pmiserv/pmiserv_cb.c:215): assert (!closed) failed
>>>>>>> [mpiexec at machine1.abc.com] HYDT_dmxu_poll_wait_for_event (/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/tools/demux/demux_poll.c:77): callback returned error status
>>>>>>> [mpiexec at machine1.abc.com] HYD_pmci_wait_for_completion (/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:181): error waiting for event
>>>>>>> [mpiexec at machine1.abc.com] main (/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/ui/mpich/mpiexec.c:405): process manager error waiting for completion
>>>>>>>
>>>>>>> ------
>>>>>>> On a single node I get the following.
>>>>>>> mpiexec -n 3 a.out
>>>>>>> [proxy:0:0 at machine1.abc.com] [proxy:0:0 at machine1.abc.com] Killed
>>>>>>> <main.c>
>
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji
>


More information about the mpich-discuss mailing list