[mpich-discuss] Problem while running example program Cpi with more than 1 task

Gus Correa gus at ldeo.columbia.edu
Thu Sep 2 09:35:11 CDT 2010


Sounds like a system/connection problem, but it is hard to say.

Test your MPICH2 with a very simple program, not CCSM3,
which is a big monster.
Try cpi with one, two, ... up to all nodes.
The cpi.c program in the MPICH2 "examples" directory is good for this:
it just calculates pi in parallel.
Compile it with (the full path to) mpicc, and run it with (the full path
to) mpirun/mpiexec, for example:
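(assuming MPICH2 is installed under /opt/mpich2 -- adjust to your install):

    /opt/mpich2/bin/mpicc cpi.c -o cpi
    /opt/mpich2/bin/mpiexec -n 4 ./cpi

If it helps, here is a minimal sketch of what a cpi-style test does --
the same idea as examples/cpi.c, not a copy of the actual file:

    #include <mpi.h>
    #include <stdio.h>
    #include <math.h>

    int main(int argc, char *argv[])
    {
        int n = 10000, rank, nprocs, i;
        double PI25DT = 3.141592653589793238462643;
        double h, sum, x, mypi, pi;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* rank 0's interval count is broadcast to every rank */
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

        /* each rank sums its strided share of the midpoint rule
           for the integral of 4/(1+x^2) over [0,1], which is pi */
        h = 1.0 / (double)n;
        sum = 0.0;
        for (i = rank + 1; i <= n; i += nprocs) {
            x = h * ((double)i - 0.5);
            sum += 4.0 / (1.0 + x * x);
        }
        mypi = h * sum;

        /* combine the partial sums on rank 0 and print the result */
        MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("pi is approximately %.16f, Error is %.16f\n",
                   pi, fabs(pi - PI25DT));

        MPI_Finalize();
        return 0;
    }

This exercises the same broadcast/reduce pattern across nodes, so if it
dies with more than one task per node, the problem is below MPI (network,
ssh, or the build), not in your application.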

Also check that you can ssh without a password between every pair of
nodes, for instance with a loop like the one sketched below.
See the MPICH2 User Guide and Install Guide.
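Something like this, run from the head node, checks every pair (the
hostnames k1..k6 are just a guess from the names in your message --
adjust the list):

    # BatchMode makes ssh fail instead of hanging at a password prompt
    for a in k1 k2 k3 k4 k5 k6; do
      for b in k1 k2 k3 k4 k5 k6; do
        ssh -o BatchMode=yes $a \
            ssh -o BatchMode=yes $b true || echo "FAILED: $a -> $b"
      done
    done

Every hop should return silently; any "FAILED" line points at a host
pair whose passwordless ssh needs fixing.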

Gus Correa

Thejna Tharammal wrote:
> An update: cpi works with hydra for, say, 8 processes, but if I try more
> than 8 it shows:
> 
> "[mpiexec at k1] HYDU_sock_read (./utils/sock/sock.c:277): read errno
> (Connection reset by peer)
> [mpiexec at k1] HYD_pmcd_pmi_cmd_cb (./pm/pmiserv/pmi_serv_cb.c:73): unable to
> read the length of the command[mpiexec at k1] HYDT_dmx_wait_for_event
> (./tools/demux/demux.c:168): callback returned error status
> [mpiexec at k1] HYD_pmci_wait_for_completion
> (./pm/pmiserv/pmi_serv_launch.c:499): error waiting for event
> [mpiexec at k1] main (./ui/mpiexec/mpiexec.c:277): process manager error
> waiting for completion"
> Is it a problem with MPI or with the system?
> Thank you,
> Thejna.
> 
> ----------------original message-----------------
> From: "Thejna Tharammal" ttharammal at marum.de
> To: mpich-discuss at mcs.anl.gov
> Date: Thu, 2 Sep 2010 12:02:33 +0200
> -------------------------------------------------
>  
>  
>> Hi,
>>
>> I installed mpich2-1.2.1p1 on a Linux cluster with 6 nodes (Intel
>> Xeon, 3 GHz/node, 64-bit, kernel 2.6.18-128.1.6.el5), with the pgf90 and
>> pgcc compilers.
>>
>> While testing the example program cpi with more than 1 task per host, it
>> shows this error:
>>
>> ================
>>
>> mpiexec -l -n 2 -host k4 ./cpi
>> 0: Process 0 of 2 is on k4
>> 1: Process 1 of 2 is on k4
>> 0: pi is approximately 3.1415926544231318, Error is 0.0000000008333387
>> 0: wall clock time = 0.000232
>> rank 1 in job 10 k1_37752 caused collective abort of all ranks
>> exit status of rank 1: killed by signal 11
>> rank 0 in job 10 k1_37752 caused collective abort of all ranks
>> =================
>>
>> And when I try with 2 hosts,
>>
>> mpiexec -l -n 2 -host k6 ./cpi : -n 2 -host k4 ./cpi
>> 0: Process 0 of 4 is on k6
>> 1: Process 1 of 4 is on k6
>> 3: Process 3 of 4 is on k4
>> 2: Process 2 of 4 is on k4
>> 0: pi is approximately 3.1415926544231239, Error is 0.0000000008333307
>> 0: wall clock time = 0.001073
>> rank 0 in job 13 k1_37752 caused collective abort of all ranks
>> exit status of rank 0: killed by signal 11
>> ===================
>>
>> While the same run with 1 task on each host works fine:
>>
>> mpiexec -l -n 1 -host k6 ./cpi : -n 1 -host k4 ./cpi
>> 0: Process 0 of 2 is on k6
>> 1: Process 1 of 2 is on k4
>> 0: pi is approximately 3.1415926544231318, Error is 0.0000000008333387
>> 0: wall clock time = 0.033167
>>
>>
>> What could be the reason for this?
>>
>> Thank you,
>>
>> Thejna.
>>
>>
> 
> 
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss


