[mpich-discuss] Cannot use the main node to run a process of the programme

Waruna Ranasinghe Waruna.Ranasinghe at uom.lk
Thu Aug 7 11:36:13 CDT 2008


Hi,
I even tried mapping the drive as Jayesh mentioned, but the problem is still
the same.
If I run the programme only on the master node, it runs fine. However, if I
use the other nodes together with the master node to run the programme, the
programme prints its output but does not exit (MPI_Finalize does not complete,
or is never called).
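
If I have understood the -map option correctly, the mapped-drive invocation
looks roughly like this (a sketch only; the Z: drive letter is an arbitrary
choice, and the share is the one from my original command below):

mpiexec -channel ssm -n 3 -exitcodes -machinefile "c:\Program Files\MPICH2\bin\hosts.txt" -map Z:\\10.8.102.27\ClusterShared -wdir Z:\ GBMTest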

Please help me to overcome this issue.
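
One thing I noticed in the error log below: gethostbyname fails for
cse-365237834578 (errno 11004), which looks like the other nodes cannot
resolve the main node's host name. If that is the cause, an entry along these
lines in the Windows hosts file (system32\drivers\etc\hosts) of each node
might be what is missing (a sketch; I am assuming cse-365237834578 is the
name of the main node, 10.8.102.27):

10.8.102.27    cse-365237834578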

Regards,
Waruna Ranasinghe

2008/7/25 Jayesh Krishna <jayesh at mcs.anl.gov>

>  Hi,
>  You should be able to use all the nodes (with MPICH2 installed) for
> running your job (i.e., you should be able to use the main node to run your
> MPI processes).
>  If you are using a shared drive to run your program, you should map the
> drive on all the nodes using the "-map" option of mpiexec (see the Windows
> developer's guide, available at
> http://www.mcs.anl.gov/research/projects/mpich2/documentation/index.php?s=docs,
> for details).
>
> Regards,
> Jayesh
>
>  ------------------------------
> *From:* owner-mpich-discuss at mcs.anl.gov [mailto:owner-mpich-discuss at mcs.anl.gov] *On Behalf Of* Waruna Ranasinghe
> *Sent:* Friday, July 25, 2008 3:06 AM
> *To:* mpich-discuss at mcs.anl.gov
> *Subject:* [mpich-discuss] Cannot use the main node to run a process of the programme
>
>  Hi all,
>
> I'm using MPICH2 on Windows.
> I can run my programme without errors if I don't use the machine on which I
> execute the command (the main node).
>
> mpiexec -channel ssm -n 3 -exitcodes -machinefile "c:\Program Files\MPICH2\bin\hosts.txt" -wdir //10.8.102.27/ClusterShared GBMTest
>
> If I also use the main node to execute one of the 3 processes, it prints the
> output I wanted but then gives the error below.
> I would like to know whether this is an issue with my programme (GBMTest), or
> whether I simply cannot use the main node to run a process.
> In the machinefile I have included three machines.
> 10.8.102.28
> 10.8.102.30
> 10.8.102.27 (main node)
>
> This works fine if I remove the main node and add another node instead.
>
> This is the error:
>
> ////////////////////////////////////////////////////////////////////////////////////
> Fatal error in MPI_Finalize: Other MPI error, error stack:
> MPI_Finalize(255)............: MPI_Finalize failed
> MPI_Finalize(154)............:
> MPID_Finalize(94)............:
> MPI_Barrier(406).............: MPI_Barrier(comm=0x44000002) failed
> MPIR_Barrier(77).............:
> MPIC_Sendrecv(120)...........:
> MPID_Isend(103)..............: failure occurred while attempting to send an eager message
> MPIDI_CH3_iSend(168).........:
> MPIDI_CH3I_Sock_connect(1191): [ch3:sock] rank 1 unable to connect to rank 2 using business card <port=1179 description=cse-365237834578 ifname=10.8.102.27 shm_host=cse-365237834578 shm_queue=376D692D-A683-4917-BF58-13BD35D071E8 shm_pid=2840 >
> MPIDU_Sock_post_connect(1228): unable to connect to cse-365237834578 on port 1179, exhausted all endpoints (errno -1)
> MPIDU_Sock_post_connect(1244): gethostbyname failed, The requested name is valid and was found in the database, but it does not have the correct associated data being resolved for. (errno 11004)
> job aborted:
> rank: node: exit code[: error message]
> 0: 10.8.102.28: 1
> 1: 10.8.102.30: 1: Fatal error in MPI_Finalize: Other MPI error, error stack:
> MPI_Finalize(255)............: MPI_Finalize failed
> MPI_Finalize(154)............:
> MPID_Finalize(94)............:
> MPI_Barrier(406).............: MPI_Barrier(comm=0x44000002) failed
> MPIR_Barrier(77).............:
> MPIC_Sendrecv(120)...........:
> MPID_Isend(103)..............: failure occurred while attempting to send an eager message
> MPIDI_CH3_iSend(168).........:
> MPIDI_CH3I_Sock_connect(1191): [ch3:sock] rank 1 unable to connect to rank 2 using business card <port=1179 description=cse-365237834578 ifname=10.8.102.27 shm_host=cse-365237834578 shm_queue=376D692D-A683-4917-BF58-13BD35D071E8 shm_pid=2840 >
> MPIDU_Sock_post_connect(1228): unable to connect to cse-365237834578 on port 1179, exhausted all endpoints (errno -1)
> MPIDU_Sock_post_connect(1244): gethostbyname failed, The requested name is valid and was found in the database, but it does not have the correct associated data being resolved for. (errno 11004)
> 2: 10.8.102.27: 1
>
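
For what it is worth, the simplest programme I could use to check whether the
hang is specific to GBMTest or happens for any MPI programme on this setup
would be something like the following (a sketch only, assuming a plain C
MPICH2 build):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                /* start MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes */

    printf("Hello from rank %d of %d\n", rank, size);
    fflush(stdout);

    MPI_Barrier(MPI_COMM_WORLD);           /* same collective that fails in the log above */
    MPI_Finalize();                        /* the call that appears to hang */
    return 0;
}

If even this does not exit when the main node is included, the problem is in
the setup rather than in GBMTest.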

