[mpich-discuss] MPI_Recv crashes with mpd ring
Jain, Rohit
Rohit_Jain at mentor.com
Wed Feb 16 13:58:50 CST 2011
Hi Pavan,
a.exec doesn't exist in the current directory. I had $path set to the
correct locations for mpich2 and the executable in my working shell.
I can now run the application using hydra, but only after hardcoding
those paths in .cshrc. That is pretty limiting for users who want to
try different versions of their executables.
Is there a way to export the current shell's path settings when using
hydra? mpd does work that way.
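A minimal C sketch of why the shell's path matters here, assuming (as the
HYDU_create_process error quoted further down suggests) that the hydra proxy
starts the executable with execvp. A bare name such as a.exec is looked up in
the PATH of the process that calls execvp, while a name containing a slash,
such as ./a.exec, is used as given. The helper try_exec is hypothetical and
only for illustration; it is not part of MPICH2.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run `file` in a child process whose PATH is `path`.
 * Returns 0 if execvp succeeded, otherwise the errno reported by execvp
 * (or by pipe/fork if those fail). */
static int try_exec(const char *file, const char *path)
{
    int pfd[2];
    if (pipe(pfd) != 0)
        return errno;

    pid_t pid = fork();
    if (pid < 0) {
        int err = errno;
        close(pfd[0]);
        close(pfd[1]);
        return err;
    }

    if (pid == 0) {
        /* Child: the pipe closes by itself if exec succeeds. */
        close(pfd[0]);
        fcntl(pfd[1], F_SETFD, FD_CLOEXEC);
        setenv("PATH", path, 1);

        char *argv[] = { (char *)file, NULL };
        execvp(file, argv);          /* bare names are searched in PATH */

        /* Only reached when execvp failed: report errno to the parent. */
        int err = errno;
        if (write(pfd[1], &err, sizeof err) != (ssize_t)sizeof err)
            _exit(126);
        _exit(127);
    }

    /* Parent: an empty read means the exec went through. */
    close(pfd[1]);
    int err = 0;
    ssize_t n = read(pfd[0], &err, sizeof err);
    close(pfd[0]);
    waitpid(pid, NULL, 0);
    return n == (ssize_t)sizeof err ? err : 0;
}
```

On a typical Linux system, try_exec("true", "/nonexistent-dir") should return
ENOENT, the same "No such file or directory" condition, while
try_exec("true", "/bin:/usr/bin") should return 0.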
Coming back to the original issue, I see the same error in MPI_Recv
that I was seeing with the mpd ring:
Disconnecting: Bad packet length 4270643401.
Disconnecting: Bad packet length 539212695.
Disconnecting: Bad packet length 2392100802.
Fatal error in MPI_Recv: Error message texts are not available**
Are there tools available to debug these situations? The fatal error
message isn't very helpful.
I ran valgrind and see a few errors:
==30213== Syscall param socketcall.getsockopt(optlen) points to uninitialised byte(s)
==30213== at 0xC53E77: getsockopt (in /lib/tls/libc-2.3.4.so)
==30213== by 0x8C2757A: _MPID_nem_init (in a.exec)
==30213== by 0x8C27892: MPID_nem_init (in a.exec)
==30213== by 0x8C2E13B: MPIDI_CH3_Init (in a.exec)
==30213== by 0x8C13EEB: MPID_Init (in a.exec)
==30213== by 0x8C09E49: MPIR_Init_thread (in a.exec)
==30213== by 0x8C09A1E: MPI_Init (in a.exec)
==30213== Syscall param write(buf) points to uninitialised byte(s)
==30213== at 0xDC65F3: __write_nocancel (in /lib/tls/libpthread-2.3.4.so)
==30213== by 0x8C2F658: MPIDI_CH3_iStartMsg (in a.exec)
==30213== by 0x8C12A5B: MPIDI_CH3_EagerContigShortSend (in a.exec)
==30213== by 0x8C15030: MPID_Send (in a.exec)
==30213== by 0x8C0A7EC: MPI_Send (in a.exec)
When I ran the same application with hydra on an SMP machine, it ran
fine with no valgrind errors. I am assuming that the application doesn't
need to be aware of whether it runs across cores or across machines, and
hence doesn't require specific handling. So, the bottom line is: where
could this problem be?
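For what it is worth, both valgrind reports follow well-known patterns. The C
sketch below is not taken from the MPICH2 sources; struct pkt_hdr,
query_sndbuf and send_hdr are hypothetical names used only for illustration.
It shows the usual way each warning is avoided: the optlen argument of
getsockopt is an in/out value that has to be set before the call, and a struct
written to a socket can carry compiler-inserted padding that stays
uninitialised unless the struct is zeroed first. Whether either pattern has
anything to do with the MPI_Recv failure is not established here.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical packet header (not an MPICH2 type). On most ABIs the
 * compiler inserts padding bytes between `type` and `length`. */
struct pkt_hdr {
    char type;
    int  length;
};

/* Read SO_SNDBUF into *value. optlen must hold the size of the value
 * buffer before the call; leaving it unset is what valgrind reports as
 * "getsockopt(optlen) points to uninitialised byte(s)". */
static int query_sndbuf(int fd, int *value)
{
    socklen_t optlen = sizeof(*value);   /* initialise before the call */
    return getsockopt(fd, SOL_SOCKET, SO_SNDBUF, value, &optlen);
}

/* Send one header. The memset zeroes the padding bytes; without it they
 * are uninitialised and valgrind flags them as
 * "write(buf) points to uninitialised byte(s)". */
static ssize_t send_hdr(int fd, char type, int length)
{
    struct pkt_hdr h;
    memset(&h, 0, sizeof(h));
    h.type = type;
    h.length = length;
    return write(fd, &h, sizeof(h));
}
```

Both helpers can be tried on one end of a socketpair(AF_UNIX, SOCK_STREAM, 0,
sv) pair and run under valgrind; removing the optlen initialisation or the
memset should bring back the corresponding warning.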
Regards,
Rohit
-----Original Message-----
From: Pavan Balaji [mailto:balaji at mcs.anl.gov]
Sent: Wednesday, February 16, 2011 11:19 AM
To: mpich-discuss at mcs.anl.gov
Cc: Jain, Rohit
Subject: Re: [mpich-discuss] MPI_Recv crashes with mpd ring
Rohit,
Try this:
% mpiexec.hydra -f hosts -n 1 ./a.exec arg1 : -n 1 ./a.exec arg2 : -n 1 ./a.exec arg3 : -n 1 ./a.exec arg4
-- Pavan
On 02/16/2011 12:37 PM, Jain, Rohit wrote:
> Thanks, everyone, for the responses. I got around the ssh issue.
>
> But it seems some more setup is required to make hydra work:
>
> mpiexec.hydra -f hosts -n 4 a.exec arg1 : a.exec arg2 : a.exec arg3 : a.exec arg4
> [proxy at hansel] HYDU_create_process
> (/mpich/src/mpich2-1.2.1p1/src/pm/hydra/utils/launch/launch.c:72):
> execvp error on file a.exec (No such file or directory)
> [proxy at hansel] HYDU_create_process
> (/mpich/src/mpich2-1.2.1p1/src/pm/hydra/utils/launch/launch.c:72):
> execvp error on file a.exec (No such file or directory)
>
>
> Regards,
> Rohit
>
>
> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov
> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Dave Goodell
> Sent: Tuesday, February 15, 2011 7:21 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] MPI_Recv crashes with mpd ring
>
> On Feb 15, 2011, at 5:52 PM CST, Jain, Rohit wrote:
>
>> I had 1.2.1p1 built locally. So, I tried that. It also gave me same
>> fatal error. I will try newer version, but I am less hopeful.
>
> There's a good chance that there is a bug in your code, since 1.0.6
> was not fundamentally a broken version of MPICH2. However, it is
> important for you to use a fairly recent version so that we can rule
> out the ~3-4 years of bugs that have been fixed since 1.0.6 was
> released. Also, error messages and debugging facilities are typically
> only improved in later versions of MPICH2, which could help you track
> down your problem. You should attempt to debug your program in all of
> the usual ways, such as by enabling core dumps or running valgrind on
> your program.
>
>> I am trying to use hydra (mpiexec.hydra) with 1.2.1p1, but getting
>> some startup errors:
>>
>> The authenticity of host 'XXX' can't be established.
>> RSA key fingerprint is ed:ce:ca:7b:08:b9:49:fd:f6:af:14.
>> Are you sure you want to continue connecting (yes/no)?
>> The authenticity of host 'XXX2' can't be established.
>> RSA key fingerprint is fb:1b:7b:0c:bb:b1:a6:b1:7d:dc:05.
>>
>> Any pointers how to resolve them?
>
> See Pavan's mail for some tips here.
>
> -Dave
>
>
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
--
Pavan Balaji
http://www.mcs.anl.gov/~balaji