[mpich-discuss] MPID_nem_tcp_connpoll(1826): Communication error with rank 0: Connection refused
Miguel Angel Fernández
mafga74 at hotmail.com
Fri Nov 4 10:42:04 CDT 2011
Dear Pavan,

All machines can communicate with each other using ssh.

The reason I did the configuration as root instead of another user was
the problem I found using the "mpi" user. I thought it could be caused
by a permission error or something similar. I normally use the "mpi"
user.

> Date: Fri, 4 Nov 2011 09:57:12 -0500
> From: balaji at mcs.anl.gov
> To: mafga74 at hotmail.com
> CC: mpich-discuss at mcs.anl.gov
> Subject: Re: FW: [mpich-discuss] MPID_nem_tcp_connpoll(1826): Communication error with rank 0: Connection refused
>
>
> On 11/04/2011 03:57 AM, Miguel Angel Fernández wrote:
> > I thought I answered your email... anyway, I'm doing too many things
> > at the same time ;-)
>
> If it was sent to me directly instead of the mpich-discuss mailing list,
> it was probably ignored. Please don't do that.
>
> > Yes, all machines can communicate with each other.
>
> They can communicate as in ssh to each other, or communicate over any port?
>
> > Attached, you have the output of the commands "configure", "make" and
> > "make install" for users "mpi" and "root".
>
> It doesn't matter which user you are doing this as, i.e., "mpi" or
> "root". Let's just pick one to avoid confusion. The build seems to have
> gone through fine.
>
> So far, as I understand it, the following works correctly:
>
> % mpiexec -f machinefile hostname
>
> But the following does not:
>
> % mpiexec -f machinefile ./mpi_application
>
> Assuming the above is true, my guess is that there is a firewall issue
> between the nodes. Note that many firewalls allow port 22 to pass
> through which is used for ssh. So you won't notice this with ssh.
>
Your assumption is right.

I looked for a firewall on each machine. In fact, I disabled "ufw" on
all machines (one of your colleagues told me about that), and it is the
only firewall that is installed automatically on Ubuntu/Debian.

At this moment ufw is disabled on all machines and the problem is still
there.
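
Since ssh only exercises port 22, one way to check whether other TCP
ports get through between two nodes is a small probe like the one
below. This is only a sketch: the script name tcp_probe.py, the port
50000 and the host name nodeA are arbitrary assumptions for
illustration, not anything that MPICH itself uses.

```python
import socket
import sys


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts and name resolution failures.
        return False


def listen_once(port, bind_addr="0.0.0.0"):
    """Accept a single TCP connection on the given port and return the peer address."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((bind_addr, port))
        srv.listen(1)
        conn, peer = srv.accept()
        conn.close()
        return peer


if __name__ == "__main__":
    # Hypothetical usage (script name, host name and port are assumptions):
    #   on node A:  python tcp_probe.py listen 50000
    #   on node B:  python tcp_probe.py probe nodeA 50000
    if len(sys.argv) >= 3 and sys.argv[1] == "listen":
        print("connection from", listen_once(int(sys.argv[2])))
    elif len(sys.argv) >= 4 and sys.argv[1] == "probe":
        ok = can_connect(sys.argv[2], int(sys.argv[3]))
        print("reachable" if ok else "NOT reachable")
    else:
        print("usage: tcp_probe.py listen PORT | probe HOST PORT")
```

If the probe reports "NOT reachable" while ssh between the same two
nodes works, something between them is still filtering or refusing
connections on ports other than 22.
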
> > I sent one document with the configuration I am using to your
> > personal email, "balaji at mcs.anl.gov". Maybe you can find what I
> > am doing wrong.
>
I don't want to bother you, but it is difficult enough to install and
configure an MPI cluster without incorrect information in the document.
Once the install and configuration procedure is correct and tested, I
will share it with the list. That is why I sent the document to you
directly. I apologise for that.

Please, can you help me find the problem? I have to admit that I have
tried every idea I had, with no better results.

> Please send all emails to the mailing list.
>
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji

Best regards,
Miguel Angel Fernandez (PhD student)
Polytechnic University of Madrid (Spain)