[mpich-discuss] MPI_Recv crashes with mpd ring

Jain, Rohit Rohit_Jain at mentor.com
Thu Mar 3 13:44:52 CST 2011


So, coming back to the original problem: the crash in MPI_Recv.

1) I updated to mpich2-1.3.3rc1
2) configured it with the suggested options: --enable-debuginfo
--enable-error-messages=all --enable-error-checking=all
--enable-g=dbg,mem,meminit
3) switched from mpd to hydra.

- I still see the same crash, with no improvement in the debug message
and no additional information:
	Fatal error in MPI_Recv: Error message texts are not available

- I can still run the same exec on multiple cores of a single machine
without issues. The crash occurs only when using multiple machines.

- The send/recv buffers being communicated across processes are integer
arrays of around 300-2000 elements. I hope there is no limit on these
sizes across machines.

- I also ran valgrind; the output is attached. But I can't figure out
whether it is an mpich2 issue or the exec's.

I would appreciate help in debugging this issue.


- Another related query:
If the exec crashes for some reason, does Hydra kill all processes and
the processes they spawned as well?
When I run my exec (p1) over multiple machines and it crashes (due to
the above issue), I see some leftover processes (p2) on the other
machines. The p2 processes were spawned by p1.
Also, I see some leftover 'hydra_pmi_proxy' process running on the
machine that originally called mpiexec. Is it related to the hanging
spawned processes p2?
If I press Ctrl-C, I see the message "Ctrl-C caught... cleaning up
processes", but some spawned processes are still around.
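
Independent of what Hydra does, p1 could clean up after itself. The
sketch below assumes p2 is started by p1 with fork/exec on a POSIX
system (an assumption for illustration; MPI_Comm_spawn would need a
different approach). `spawn_in_own_group` and `terminate_group` are
hypothetical helper names. Each child is placed in its own process
group, so p1 can signal the whole group from its error path or from a
signal handler.

```c
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: start argv[0] as a child in a new process group.
   Returns the child's pid (also its process group id), or -1 on failure. */
pid_t spawn_in_own_group(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        setpgid(0, 0);           /* child: become leader of a new group */
        execvp(argv[0], argv);
        _exit(127);              /* reached only if exec failed */
    }
    /* Parent makes the same call so the group exists before we return;
       a failure here means the child already did it and exec'd. */
    setpgid(pid, pid);
    return pid;
}

/* Hypothetical helper: send SIGTERM to every process in the group and
   reap the group leader. Returns the signal that ended the leader,
   0 if it exited normally, or -1 on error. */
int terminate_group(pid_t pgid)
{
    int status = 0;
    if (kill(-pgid, SIGTERM) != 0)   /* negative pid addresses the group */
        return -1;
    if (waitpid(pgid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

p1 would keep the pid returned by `spawn_in_own_group` and call
`terminate_group` on it before exiting or aborting. This does not help
if p1 is killed outright (for example by SIGKILL), since no cleanup code
runs in that case.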

Regards,
Rohit


-----Original Message-----
From: Pavan Balaji [mailto:balaji at mcs.anl.gov] 
Sent: Tuesday, February 15, 2011 4:02 PM
To: mpich-discuss at mcs.anl.gov
Cc: Jain, Rohit
Subject: Re: [mpich-discuss] MPI_Recv crashes with mpd ring

Rohit,

On 02/15/2011 05:52 PM, Jain, Rohit wrote:
> I am trying to use hydra (mpiexec.hydra) with 1.2.1.p1, but getting
> some startup errors:
>
> The authenticity of host 'XXX' can't be established.
> RSA key fingerprint is ed:ce:ca:7b:08:b9:49:fd:f6:af:14.
> Are you sure you want to continue connecting (yes/no)?
> The authenticity of host 'XXX2' can't be established.
> RSA key fingerprint is fb:1b:7b:0c:bb:b1:a6:b1:7d:dc:05.

Try ssh'ing to the hosts first, and make sure ssh is correctly set up
for passwordless access.

You can google for "passwordless ssh setup" for more information on
this.

  -- Pavan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: valgrind2
Type: application/octet-stream
Size: 7148 bytes
Desc: valgrind2
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20110303/d1a9c00f/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: valgrind1
Type: application/octet-stream
Size: 5475 bytes
Desc: valgrind1
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20110303/d1a9c00f/attachment-0001.obj>

