[mpich-discuss] Problem while running example program Cpi with more than 1 task
Thejna Tharammal
ttharammal at marum.de
Thu Sep 2 06:55:47 CDT 2010
An update: cpi works with Hydra for, say, 8 processes, but if I try more than 8
it shows:
"[mpiexec at k1] HYDU_sock_read (./utils/sock/sock.c:277): read errno (Connection reset by peer)
[mpiexec at k1] HYD_pmcd_pmi_cmd_cb (./pm/pmiserv/pmi_serv_cb.c:73): unable to read the length of the command
[mpiexec at k1] HYDT_dmx_wait_for_event (./tools/demux/demux.c:168): callback returned error status
[mpiexec at k1] HYD_pmci_wait_for_completion (./pm/pmiserv/pmi_serv_launch.c:499): error waiting for event
[mpiexec at k1] main (./ui/mpiexec/mpiexec.c:277): process manager error waiting for completion"
Is this a problem with MPI or with the system?
Thank you,
Thejna.
----------------original message-----------------
From: "Thejna Tharammal" ttharammal at marum.de
To: mpich-discuss at mcs.anl.gov
Date: Thu, 2 Sep 2010 12:02:33 +0200
-------------------------------------------------
>
> Hi,
>
> I installed mpich2-1.2.1p1 on a Linux cluster with 6 nodes (Intel
> Xeon, 3 GHz/node, 64-bit, kernel 2.6.18-128.1.6.el5), using the pgf90 and
> pgcc compilers.
>
> When I test the example program cpi with more than 1 task on a host, it
> shows the following error:
>
> ================
>
> mpiexec -l -n 2 -host k4 ./cpi
> 0: Process 0 of 2 is on k4
> 1: Process 1 of 2 is on k4
> 0: pi is approximately 3.1415926544231318, Error is 0.0000000008333387
> 0: wall clock time = 0.000232
> rank 1 in job 10 k1_37752 caused collective abort of all ranks
> exit status of rank 1: killed by signal 11
> rank 0 in job 10 k1_37752 caused collective abort of all ranks
> =================
>
> And when I try with 2 hosts, 2 tasks on each:
>
> mpiexec -l -n 2 -host k6 ./cpi : -n 2 -host k4 ./cpi
> 0: Process 0 of 4 is on k6
> 1: Process 1 of 4 is on k6
> 3: Process 3 of 4 is on k4
> 2: Process 2 of 4 is on k4
> 0: pi is approximately 3.1415926544231239, Error is 0.0000000008333307
> 0: wall clock time = 0.001073
> rank 0 in job 13 k1_37752 caused collective abort of all ranks
> exit status of rank 0: killed by signal 11
> ===================
>
> The same command with 1 task on each host works fine:
>
> mpiexec -l -n 1 -host k6 ./cpi : -n 1 -host k4 ./cpi
> 0: Process 0 of 2 is on k6
> 1: Process 1 of 2 is on k4
> 0: pi is approximately 3.1415926544231318, Error is 0.0000000008333387
> 0: wall clock time = 0.033167
>
>
> What could be the reason for this?
>
> Thank you,
>
> Thejna.
>
>