[mpich-discuss] Error when calling mpiexec from within a process
Pramod
pramodc at gmail.com
Thu Oct 6 19:46:43 CDT 2011
Hi,
I am trying to write a simple job scheduler using MPI. It schedules a
bunch of jobs to run in parallel on different hosts. Each job is an
MPI application (launched with mpiexec) that runs on multiple cores of
its host. A child scheduling process runs on each host and executes the
parallel job, given to it by the master process, using system(). The
executable is a given: it runs on multiple cores of the host with some
affinity settings, and I cannot modify it. I am sure there are other
ways to write a job scheduler, but what's wrong with this one?
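
To make the system() step concrete, here is a minimal sketch of how a
child scheduling process might run one job command and decode the
result. The helper name run_job is hypothetical and is not taken from
main.c; the sketch assumes a POSIX system and uses only system() and
the <sys/wait.h> status macros. It does not address the Hydra errors,
it only separates "exited with a code" from "killed by a signal".

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

/* Hypothetical helper (not from main.c): run one job command through
 * the shell and return its exit code, or -1 if the command could not
 * be started or was terminated by a signal. */
int run_job(const char *cmd)
{
    assert(cmd != NULL);       /* system(NULL) only probes for a shell */

    int status = system(cmd);  /* blocks until the command finishes */
    if (status == -1) {
        perror("system");      /* the child process could not be created */
        return -1;
    }
    if (WIFSIGNALED(status)) {
        fprintf(stderr, "job killed by signal %d\n", WTERMSIG(status));
        return -1;
    }
    if (WIFEXITED(status))
        return WEXITSTATUS(status);  /* exit code reported by the shell */
    return -1;
}
```

In the scheduler, each child would call something like
run_job("mpiexec -n 4 ./job"), where that command line is only a
placeholder for the real job.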
-Pramod
On Thu, Oct 6, 2011 at 4:28 PM, Reuti <reuti at staff.uni-marburg.de> wrote:
> Hi,
>
> Am 06.10.2011 um 22:06 schrieb Pramod:
>
>> Hi,
>>
>> I have an application where I need to call mpiexec from within a child
>> process launched by mpiexec. I am using "system()" to call the mpiexec
>> process from the child process. I am using mpich2-1.4.1 and the hydra
>> process manager. The errors I see are below. I am attaching the source
>> file main.c. Let me know what I am doing wrong here and if you need
>> more information.
>>
>> To compile:
>>
>> /home/install/mpich/mpich2-1.4.1/linux_x86_64//bin/mpicc main.c
>> -I/home/install/mpich/mpich2-1.4.1/linux_x86_64/include
>>
>> When I run the test on multiple nodes I get the following errors:
>> mpiexec -n 3 -f hosts.list a.out
>
> What do you want to achieve in detail? Would you like to use a different host list for this call, so that each child decides on its own where to start its grandchild processes?
>
> Is spawning additional processes from within MPI not an option?
>
> -- Reuti
>
>
>> [proxy:0:0 at machine3] HYDU_create_process
>> (/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/utils/launch/launch.c:36):
>> dup2 error (Bad file descriptor)
>> [proxy:0:0 at machine3] launch_procs
>> (/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/pm/pmiserv/pmip_cb.c:751):
>> create process returned error
>> [proxy:0:0 at machine3] HYD_pmcd_pmip_control_cmd_cb
>> (/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/pm/pmiserv/pmip_cb.c:935):
>> launch_procs returned error
>> [proxy:0:0 at machine3] HYDT_dmxu_poll_wait_for_event
>> (/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/tools/demux/demux_poll.c:77):
>> callback returned error status
>> [proxy:0:0 at machine3] main
>> (/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/pm/pmiserv/pmip.c:226):
>> demux engine error waiting for event
>> [mpiexec at machine1.abc.com] control_cb
>> (/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/pm/pmiserv/pmiserv_cb.c:215):
>> assert (!closed) failed
>> [mpiexec at machine1.abc.com] HYDT_dmxu_poll_wait_for_event
>> (/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/tools/demux/demux_poll.c:77):
>> callback returned error status
>> [mpiexec at machine1.abc.com] HYD_pmci_wait_for_completion
>> (/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:181):
>> error waiting for event
>> [mpiexec at machine1.abc.com] main
>> (/home/install/mpich/src/mpich2-1.4.1/src/pm/hydra/ui/mpich/mpiexec.c:405):
>> process manager error waiting for completion
>>
>> ------
>> On a single node I get the following.
>> mpiexec -n 3 a.out
>> [proxy:0:0 at machine1.abc.com] [proxy:0:0 at machine1.abc.com] Killed
>> <main.c>
>> _______________________________________________
>> mpich-discuss mailing list mpich-discuss at mcs.anl.gov
>> To manage subscription options or unsubscribe:
>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss