[mpich-discuss] Error when calling mpiexec from within a process
Pavan Balaji
balaji at mcs.anl.gov
Sat Oct 15 04:51:11 CDT 2011
You specifically need to reset all the PMI_ variables before calling
mpiexec again.
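
For example, something along these lines in the child, before the system()
call, should do it (a rough sketch; the exact set of PMI_ names depends on
the MPICH version and process manager, so matching on the prefix is safest):

#include <stdlib.h>
#include <string.h>

extern char **environ;

/* Unset every environment variable starting with "PMI_" so the nested
 * mpiexec does not think it is already running under a PMI server.
 * Names are collected first, because unsetenv() modifies environ while
 * we walk it. */
static void clear_pmi_env(void)
{
    char *names[256];
    int n = 0;

    for (char **e = environ; *e != NULL && n < 256; e++) {
        if (strncmp(*e, "PMI_", 4) == 0) {
            const char *eq = strchr(*e, '=');
            size_t len = eq ? (size_t)(eq - *e) : strlen(*e);
            if ((names[n] = strndup(*e, len)) != NULL)
                n++;
        }
    }
    for (int i = 0; i < n; i++) {
        unsetenv(names[i]);
        free(names[i]);
    }
}

/* in the child, before launching the nested job:
 *     clear_pmi_env();
 *     system("mpiexec -n 4 ./job");
 */
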
Embedded mpiexecs were supposed to be working at some point, but maybe
we introduced a bug that broke them. I can look into it. Would you be
able to create a ticket for it?
https://trac.mcs.anl.gov/projects/mpich2/newticket
-- Pavan
On 10/12/2011 03:04 PM, Reuti wrote:
> On 12.10.2011 at 21:31, Pramod wrote:
>
>> Unsetting those envs in the system() call before calling mpiexec did
>> not really help (I still see the same error). From your response I
>> understand that calling mpiexec from within an MPI process is not a
>> common usage model
>
> Correct.
>
> BTW: I got confused; it seems I'm subscribed to too many lists.
>
> I listed stuff for Open MPI, but this is MPICH2. So, looking up the correct vars for MPICH2 and resetting them might work.
>
> Sorry for the confusion.
>
> -- Reuti
>
>
>> and perhaps not expected to work (?). Maybe I
>> should think of a different approach to solve my problem.
>>
>> Thank you,
>> Pramod
>>
>> On Fri, Oct 7, 2011 at 9:14 AM, Reuti<reuti at staff.uni-marburg.de> wrote:
>>> Hi,
>>>
>>> On 07.10.2011 at 02:46, Pramod wrote:
>>>
>>>> I am trying to write a simple job scheduler using MPI. It schedules a
>>>> bunch of jobs to run in parallel on different hosts. Each job is an
>>>> MPI application (uses mpiexec) that runs on multiple cores of each
>>>> host. A child scheduling process runs on each host and executes the
>>>> parallel job, given to it by the master process, using system(). The
>>>> executable is given to me; it runs on multiple cores of the host with
>>>> some affinity settings, and I cannot modify it. I am sure there are
>>>> other ways to build a job scheduler, but what's wrong with this one?
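>>>>
>>>> Roughly, the child side does something like this (a simplified sketch, not
>>>> the attached main.c; the job binary and core count are placeholders):
>>>>
>>>> /* child scheduling process (one per host): receive a job name from the
>>>>  * master, then run it with a nested mpiexec via system() */
>>>> char job[256];
>>>> MPI_Recv(job, sizeof(job), MPI_CHAR, 0, 0, MPI_COMM_WORLD,
>>>>          MPI_STATUS_IGNORE);
>>>>
>>>> char cmd[512];
>>>> snprintf(cmd, sizeof(cmd), "mpiexec -n 4 %s", job);
>>>> int rc = system(cmd);   /* this nested mpiexec is what fails */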
>>>
>>> Will all jobs on all nodes then have the same runtime as a consequence?
>>>
>>> I would assume that the second mpiexec inherits some of the already set
>>> environment variables and uses this information by accident. If you unset
>>> them, it could work.
>>>
>>> When you check /proc/12345/environ of a running MPI process, you will
>>> find something like this:
>>>
>>> OMPI_MCA_orte_precondition_transports=510befec2b70bcee-945da280a132e0fb
>>> OMPI_MCA_plm=rsh
>>> OMPI_MCA_orte_hnp_uri=1104674816.0;tcp://192.168.151.101:52964
>>> OMPI_MCA_ess=env
>>> OMPI_MCA_orte_ess_jobid=1104674817
>>> OMPI_MCA_orte_ess_vpid=2
>>> OMPI_MCA_orte_ess_num_procs=4
>>> OMPI_MCA_orte_local_daemon_uri=1104674816.1;tcp://192.168.151.70:50363
>>> OMPI_MCA_mpi_yield_when_idle=1
>>> OMPI_MCA_orte_app_num=0
>>> OMPI_UNIVERSE_SIZE=4
>>> OMPI_COMM_WORLD_SIZE=4
>>> OMPI_COMM_WORLD_LOCAL_SIZE=2
>>> OMPI_COMM_WORLD_RANK=2
>>> OMPI_COMM_WORLD_LOCAL_RANK=0
>>> OPAL_OUTPUT_STDERR_FD=17
>>>
>>> The other point is the directory openmpi-sessions-reuti at foobar_0 where some
>>> things are stored. Maybe you need another temporary directory to separate
>>> the two and have two instances of orted running (like it's done with queuing
>>> systems, where this directory is placed in the job-specific temporary
>>> directory provided by the queuing system).
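>>>
>>> For example, pointing the nested run at its own scratch directory before the
>>> system() call (the path here is just an example):
>>>
>>> /* hypothetical per-job scratch dir, so the second orted gets its own
>>>  * openmpi-sessions-* directory instead of reusing the parent's */
>>> mkdir("/tmp/nested-mpi-job", 0700);         /* needs <sys/stat.h> */
>>> setenv("TMPDIR", "/tmp/nested-mpi-job", 1); /* needs <stdlib.h>   */
>>> system("mpiexec -n 4 ./job");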
>>>
>>> -- Reuti
>>>
>>>
>>>> -Pramod
>>>>
>>>> On Thu, Oct 6, 2011 at 4:28 PM, Reuti<reuti at staff.uni-marburg.de> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> On 06.10.2011 at 22:06, Pramod wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have an application where I need to call mpiexec from within a child
>>>>>> process launched by mpiexec. I am using "system()" to call the mpiexec
>>>>>> process from the child process. I am using mpich2-1.4.1 and the hydra
>>>>>> process manager. The errors I see are below. I am attaching the source
>>>>>> file main.c. Let me know what I am doing wrong here and if you need
>>>>>> more information.
>>>>>>
>>>>>> To compile:
>>>>>>
>>>>>> /home/install/mpich/mpich2-1.4.1/linux_x86_64//bin/mpicc main.c
>>>>>> -I/home/install/mpich/mpich2-1.4.1/linux_x86_64/include
>>>>>>
>>>>>> When I run the test on multiple nodes I get the following errors:
>>>>>> mpiexec -n 3 -f hosts.list a.out
>>>>>
>>>>> What do you want to achieve in detail? Would you like to use another
>>>>> host list for this call, so that each child decides on its own where to
>>>>> start grandchild processes?
>>>>>
>>>>> Spawning additional processes within MPI is not an option?
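>>>>>
>>>>> Something along these lines instead of the system() call (just a sketch;
>>>>> binary name and process count are placeholders):
>>>>>
>>>>> MPI_Comm children;
>>>>> MPI_Comm_spawn("./job", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
>>>>>                0, MPI_COMM_SELF, &children, MPI_ERRCODES_IGNORE);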
>>>>>
>>>>> -- Reuti
>>>>>
>>>>>
>>>>>> [proxy:0:0 at machine3] HYDU_create_process (/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/utils/launch/launch.c:36): dup2 error (Bad file descriptor)
>>>>>> [proxy:0:0 at machine3] launch_procs (/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/pm/pmiserv/pmip_cb.c:751): create process returned error
>>>>>> [proxy:0:0 at machine3] HYD_pmcd_pmip_control_cmd_cb (/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/pm/pmiserv/pmip_cb.c:935): launch_procs returned error
>>>>>> [proxy:0:0 at machine3] HYDT_dmxu_poll_wait_for_event (/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/tools/demux/demux_poll.c:77): callback returned error status
>>>>>> [proxy:0:0 at machine3] main (/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/pm/pmiserv/pmip.c:226): demux engine error waiting for event
>>>>>> [mpiexec at machine1.abc.com] control_cb (/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/pm/pmiserv/pmiserv_cb.c:215): assert (!closed) failed
>>>>>> [mpiexec at machine1.abc.com] HYDT_dmxu_poll_wait_for_event (/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/tools/demux/demux_poll.c:77): callback returned error status
>>>>>> [mpiexec at machine1.abc.com] HYD_pmci_wait_for_completion (/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:181): error waiting for event
>>>>>> [mpiexec at machine1.abc.com] main (/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/ui/mpich/mpiexec.c:405): process manager error waiting for completion
>>>>>>
>>>>>> ------
>>>>>> On a single node I get the following.
>>>>>> mpiexec -n 3 a.out
>>>>>> [proxy:0:0 at machine1.abc.com] [proxy:0:0 at machine1.abc.com] Killed
>>>>>> <main.c>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
--
Pavan Balaji
http://www.mcs.anl.gov/~balaji