[mpich-discuss] Error when calling mpiexec from within a process

Pramod pramodc at gmail.com
Wed Oct 12 14:31:19 CDT 2011


Hi,

Unsetting those environment variables before calling mpiexec via
system() did not really help (I still see the same error); the attempt
looked roughly like the sketch below. From your response I understand
that calling mpiexec from within an MPI process is not a common usage
model and is perhaps not expected to work (?). Maybe I should think of
a different approach to solve my problem.

Thank you,
Pramod

On Fri, Oct 7, 2011 at 9:14 AM, Reuti <reuti at staff.uni-marburg.de> wrote:
> Hi,
>
> On 07.10.2011 at 02:46, Pramod wrote:
>
>> I am trying to write a simple job scheduler using MPI. It schedules a
>> bunch of jobs to run in parallel on different hosts. Each job is an
>> MPI application (launched with mpiexec) that runs on multiple cores of
>> a host. A child scheduling process runs on each host and executes the
>> parallel job given to it by the master process, using system(). The
>> executable is given, it runs on the multiple cores of the host with
>> some affinity settings, and I cannot modify it. I am sure there are
>> other ways to build a job scheduler, but what's wrong with this one?
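>>
>> (A minimal sketch of that child dispatch loop; the tag value, buffer
>> size, and command strings are illustrative, not taken from the
>> attached code.)
>>
>>     /* child: receive a command line from the master (rank 0) and run
>>        it via system(); an empty string means shut down */
>>     char cmd[1024];
>>     for (;;) {
>>         MPI_Recv(cmd, sizeof(cmd), MPI_CHAR, 0, 0,
>>                  MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>>         if (cmd[0] == '\0')
>>             break;
>>         system(cmd);               /* e.g. "mpiexec -n 8 ./app" */
>>     }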
>
> will all jobs on all nodes then have the same runtime as a consequence?
>
> I would assume that the second mpiexec inherits some of the already-set
> environment variables and uses this information by accident. If you unset
> them, it could work.
>
> When you check /proc/12345/environ of a running MPI process you will
> find something like this:
>
> OMPI_MCA_orte_precondition_transports=510befec2b70bcee-945da280a132e0fb
> OMPI_MCA_plm=rsh
> OMPI_MCA_orte_hnp_uri=1104674816.0;tcp://192.168.151.101:52964
> OMPI_MCA_ess=env
> OMPI_MCA_orte_ess_jobid=1104674817
> OMPI_MCA_orte_ess_vpid=2
> OMPI_MCA_orte_ess_num_procs=4
> OMPI_MCA_orte_local_daemon_uri=1104674816.1;tcp://192.168.151.70:50363
> OMPI_MCA_mpi_yield_when_idle=1
> OMPI_MCA_orte_app_num=0
> OMPI_UNIVERSE_SIZE=4
> OMPI_COMM_WORLD_SIZE=4
> OMPI_COMM_WORLD_LOCAL_SIZE=2
> OMPI_COMM_WORLD_RANK=2
> OMPI_COMM_WORLD_LOCAL_RANK=0
> OPAL_OUTPUT_STDERR_FD=17
>
> The other point is the directory openmpi-sessions-reuti at foobar_0 where
> some things are stored. Maybe you need another temporary directory to
> separate the two and have two orted daemons running (like it's done with
> queuing systems, where this directory is placed in the job-specific
> temporary directory provided by the queuing system).
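>
> (A sketch of that separation; whether the inner launcher actually
> honors TMPDIR depends on the MPI implementation, so treat the variable
> name and the path as assumptions.)
>
>     /* give the inner launch its own scratch/session directory */
>     setenv("TMPDIR", "/tmp/inner-job", 1);   /* path is illustrative */
>     system("mpiexec -n 4 ./app");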
>
> -- Reuti
>
>
>> -Pramod
>>
>> On Thu, Oct 6, 2011 at 4:28 PM, Reuti <reuti at staff.uni-marburg.de> wrote:
>>>
>>> Hi,
>>>
>>> On 06.10.2011 at 22:06, Pramod wrote:
>>>
>>>> Hi,
>>>>
>>>> I have an application where I need to call mpiexec from within a child
>>>> process launched by mpiexec. I am using system() to call the mpiexec
>>>> process from the child process. I am using mpich2-1.4.1 and the hydra
>>>> process manager. The errors I see are below. I am attaching the source
>>>> file main.c. Let me know what I am doing wrong here and whether you
>>>> need more information.
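>>>>
>>>> (The pattern in main.c reduces to roughly the following; the inner
>>>> command line here is illustrative.)
>>>>
>>>>     #include <stdlib.h>
>>>>     #include <mpi.h>
>>>>
>>>>     int main(int argc, char **argv)
>>>>     {
>>>>         int rank;
>>>>         MPI_Init(&argc, &argv);
>>>>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>         if (rank == 0)
>>>>             system("mpiexec -n 2 hostname");   /* inner launch */
>>>>         MPI_Finalize();
>>>>         return 0;
>>>>     }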
>>>>
>>>> To compile:
>>>>
>>>> /home/install/mpich/mpich2-1.4.1/linux_x86_64/bin/mpicc main.c
>>>> -I/home/install/mpich/mpich2-1.4.1/linux_x86_64/include
>>>>
>>>> When I run the test on multiple nodes I get the following errors:
>>>> mpiexec -n 3 -f hosts.list a.out
>>>
>>> what do you want to achieve in detail? Would you like to use another
>>> hostlist for this call, so that each child decides on its own where to
>>> start grandchild processes?
>>>
>>> Spawning additional processes within MPI is not an option?
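>>>
>>> (A minimal sketch of that alternative; the worker name and process
>>> count are illustrative.)
>>>
>>>     /* parent side: spawn 4 copies of ./worker instead of shelling out
>>>        to a second mpiexec */
>>>     MPI_Comm children;
>>>     MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
>>>                    0, MPI_COMM_SELF, &children, MPI_ERRCODES_IGNORE);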
>>>
>>> -- Reuti
>>>
>>>
>>>> [proxy:0:0 at machine3] HYDU_create_process (/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/utils/launch/launch.c:36): dup2 error (Bad file descriptor)
>>>> [proxy:0:0 at machine3] launch_procs (/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/pm/pmiserv/pmip_cb.c:751): create process returned error
>>>> [proxy:0:0 at machine3] HYD_pmcd_pmip_control_cmd_cb (/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/pm/pmiserv/pmip_cb.c:935): launch_procs returned error
>>>> [proxy:0:0 at machine3] HYDT_dmxu_poll_wait_for_event (/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/tools/demux/demux_poll.c:77): callback returned error status
>>>> [proxy:0:0 at machine3] main (/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/pm/pmiserv/pmip.c:226): demux engine error waiting for event
>>>> [mpiexec at machine1.abc.com] control_cb (/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/pm/pmiserv/pmiserv_cb.c:215): assert (!closed) failed
>>>> [mpiexec at machine1.abc.com] HYDT_dmxu_poll_wait_for_event (/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/tools/demux/demux_poll.c:77): callback returned error status
>>>> [mpiexec at machine1.abc.com] HYD_pmci_wait_for_completion (/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:181): error waiting for event
>>>> [mpiexec at machine1.abc.com] main (/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/ui/mpich/mpiexec.c:405): process manager error waiting for completion
>>>>
>>>> ------
>>>> On a single node I get the following.
>>>> mpiexec -n 3 a.out
>>>> [proxy:0:0 at machine1.abc.com] [proxy:0:0 at machine1.abc.com] Killed
>>>> <main.c>