[mpich-discuss] MPI_Recv crashes with mpd ring

Dave Goodell goodell at mcs.anl.gov
Wed Feb 16 14:06:55 CST 2011


On Feb 16, 2011, at 1:58 PM CST, Jain, Rohit wrote:

> Hi Pavan,
> 
> a.exec doesn't exist in the current dir. I had $path set to the correct
> locations for mpich2 and the exec path in my working shell.
> I can now run the application using hydra, but only after setting those
> hardcoded paths in .cshrc. That is pretty limiting for users when they
> want to try different versions of their execs.
> Is there a way to export the current shell path settings while using hydra?
> mpd does work that way.

So do it this way instead:

% mpiexec.hydra -f hosts -n 1 `which a.exec` arg1 : -n 1 `which a.exec` arg2 : -n 1 `which a.exec` arg3 : -n 1 `which a.exec` arg4

If you use zsh, there's an even handier shorthand:

% mpiexec.hydra -f hosts -n 1 =a.exec arg1 : -n 1 =a.exec arg2 : -n 1 =a.exec arg3 : -n 1 =a.exec arg4
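
Either way, the shell expands the path on the node where you run mpiexec, so the remote side gets an absolute path and never has to search PATH.  Depending on the hydra build you have, you may also be able to forward your local PATH explicitly with something like the line below ("-genv" here is an assumption about your build; check "mpiexec.hydra -h" to see which environment options your version actually supports):

% mpiexec.hydra -genv PATH "$PATH" -f hosts -n 1 a.exec arg1 : -n 1 a.exec arg2 : -n 1 a.exec arg3 : -n 1 a.exec arg4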

-Dave

> Coming back to the original issue, I did see the same error in MPI_Recv as
> I was seeing with the mpd ring:
> 
> Disconnecting: Bad packet length 4270643401.
> Disconnecting: Bad packet length 539212695.
> Disconnecting: Bad packet length 2392100802.
> Fatal error in MPI_Recv: Error message texts are not available**

Those packet lengths are very large (two of them won't even fit in a signed 32-bit integer), so it wouldn't be surprising if you were hitting some sort of bug here.  Are you intentionally sending very large messages?
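
For reference, MPI_Send and MPI_Recv describe the message with an "int" count, so a single operation on more than INT_MAX elements can't be expressed directly and usually has to be split up (or described with a derived datatype).  A rough sketch of the chunking approach, purely illustrative and not taken from your code:

    #include <mpi.h>
    #include <limits.h>
    #include <stddef.h>

    /* Send an arbitrarily large byte buffer by splitting it into
     * chunks that fit in the "int" count argument of MPI_Send.
     * The receiver must post the matching chunked MPI_Recv calls. */
    static void send_large(char *buf, size_t nbytes, int dest, int tag,
                           MPI_Comm comm)
    {
        size_t offset = 0;
        while (offset < nbytes) {
            size_t remaining = nbytes - offset;
            int chunk = remaining > (size_t)INT_MAX ? INT_MAX
                                                    : (int)remaining;
            MPI_Send(buf + offset, chunk, MPI_CHAR, dest, tag, comm);
            offset += (size_t)chunk;
        }
    }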

> Are there tools available to debug these situations? Fatal error message
> isn't very helpful.

What options are you passing to configure?  If you pass "--enable-error-messages" and "--enable-error-checking", you'll get better error messages.  If you pass "--enable-g=dbg,mem,meminit", the MPI library will be built with debugging symbols and will initialize most buffers, which makes it friendlier to valgrind.  Passing "--enable-fast" typically disables all of these things, so omit it if you are currently passing it and want to debug your code more effectively.  That option should really only be used for production runs and benchmarking of known-good codes.
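
A typical debugging build might look something like this (the prefix is just a placeholder; keep whatever other options your current build needs):

% ./configure --prefix=/path/to/mpich2-debug --enable-error-messages --enable-error-checking --enable-g=dbg,mem,meminit
% make && make install

After that, rebuild and relink your application against the debug installation so the extra checking is actually in effect.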

-Dave

> I ran valgrind and saw a few errors:
> ==30213== Syscall param socketcall.getsockopt(optlen) points to
> uninitialised byte(s)
> ==30213==    at 0xC53E77: getsockopt (in /lib/tls/libc-2.3.4.so)
> ==30213==    by 0x8C2757A: _MPID_nem_init (in a.exec)
> ==30213==    by 0x8C27892: MPID_nem_init (in a.exec)
> ==30213==    by 0x8C2E13B: MPIDI_CH3_Init (in a.exec)
> ==30213==    by 0x8C13EEB: MPID_Init (in a.exec)
> ==30213==    by 0x8C09E49: MPIR_Init_thread (in a.exec)
> ==30213==    by 0x8C09A1E: MPI_Init (in a.exec)
> 
> ==30213== Syscall param write(buf) points to uninitialised byte(s)
> ==30213==    at 0xDC65F3: __write_nocancel (in
> /lib/tls/libpthread-2.3.4.so)
> ==30213==    by 0x8C2F658: MPIDI_CH3_iStartMsg (in a.exec)
> ==30213==    by 0x8C12A5B: MPIDI_CH3_EagerContigShortSend (in a.exec)
> ==30213==    by 0x8C15030: MPID_Send (in a.exec)
> ==30213==    by 0x8C0A7EC: MPI_Send (in a.exec)
> 
> When I ran the same application with hydra on an SMP machine, it ran fine
> with no valgrind errors. I am assuming that the application doesn't need
> to be aware of a multi-core or multi-machine run, and hence doesn't
> require any specific handling. So the bottom line is: where could this
> problem be?
> 
> Regards,
> Rohit
> 
> 
> -----Original Message-----
> From: Pavan Balaji [mailto:balaji at mcs.anl.gov] 
> Sent: Wednesday, February 16, 2011 11:19 AM
> To: mpich-discuss at mcs.anl.gov
> Cc: Jain, Rohit
> Subject: Re: [mpich-discuss] MPI_Recv crashes with mpd ring
> 
> Rohit,
> 
> Try this:
> 
> % mpiexec.hydra -f hosts -n 1 ./a.exec arg1 : -n 1 ./a.exec arg2 : -n 1 
> ./a.exec arg3 : -n 1 ./a.exec arg4
> 
>  -- Pavan
> 
> On 02/16/2011 12:37 PM, Jain, Rohit wrote:
>> Thanks everyone for the responses. I got around the ssh issue.
>> 
>> But, it seems some more setup is required to make hydra work:
>> 
>> mpiexec.hydra -f hosts -n 4 a.exec arg1 : a.exec arg2 : a.exec arg3 :
>> a.exec arg4
>> [proxy at hansel] HYDU_create_process
>> (/mpich/src/mpich2-1.2.1p1/src/pm/hydra/utils/launch/launch.c:72):
>> execvp error on file a.exec (No such file or directory)
>> [proxy at hansel] HYDU_create_process
>> (/mpich/src/mpich2-1.2.1p1/src/pm/hydra/utils/launch/launch.c:72):
>> execvp error on file a.exec (No such file or directory)
>> 
>> 
>> Regards,
>> Rohit
>> 
>> 
>> -----Original Message-----
>> From: mpich-discuss-bounces at mcs.anl.gov
>> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of Dave Goodell
>> Sent: Tuesday, February 15, 2011 7:21 PM
>> To: mpich-discuss at mcs.anl.gov
>> Subject: Re: [mpich-discuss] MPI_Recv crashes with mpd ring
>> 
>> On Feb 15, 2011, at 5:52 PM CST, Jain, Rohit wrote:
>> 
>>> I had 1.2.1p1 built locally, so I tried that. It also gave me the same
>>> fatal error. I will try a newer version, but I am less hopeful.
>> 
>> There's a good chance that there is a bug in your code, since 1.0.6 was
>> not fundamentally a broken version of MPICH2.  However, it is important
>> for you to use a fairly recent version so that we can rule out the ~3-4
>> years of bugs that have been fixed since 1.0.6 was released.  Also,
>> error messages and debugging facilities are typically only improved in
>> later versions of MPICH2, which could help you track down your problem.
>> You should attempt to debug your program in all of the usual ways, such
>> as by enabling core dumps or running valgrind on your program.
>> 
>>> I am trying to use hydra (mpiexec.hydra) with 1.2.1p1, but getting
>>> some startup errors:
>>> 
>>> The authenticity of host 'XXX' can't be established.
>>> RSA key fingerprint is ed:ce:ca:7b:08:b9:49:fd:f6:af:14.
>>> Are you sure you want to continue connecting (yes/no)?
>>> The authenticity of host 'XXX2' can't be established.
>>> RSA key fingerprint is fb:1b:7b:0c:bb:b1:a6:b1:7d:dc:05.
>>> 
>>> Any pointers how to resolve them?
>> 
>> See Pavan's mail for some tips here.
>> 
>> -Dave
>> 
>> 
>> _______________________________________________
>> mpich-discuss mailing list
>> mpich-discuss at mcs.anl.gov
>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
> 
> -- 
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji


