[mpich-discuss] NFSv4 and MPICH2 (was: Program Crash)

Gregory Magoon gmagoon at MIT.EDU
Sat Jul 23 22:53:55 CDT 2011


After some work, I was able to narrow my problem down to an issue between
MPICH2 and NFSv4. When I mount the NFS shares as version 3, the MPICH2 tests
seem to work without any major problems*. Here are the NFS mount details
from "nfsstat -m":


NFSv4 (hangs and times out, with a large quantity of NFS traffic):
/home from 192.168.2.11:/home/
Flags:
rw,relatime,vers=4,rsize=8192,wsize=8192,namlen=255,hard,proto=tcp,port=0,timeo=20,retrans=32,sec=sys,clientaddr=192.168.2.1,minorversion=0,local_lock=none,addr=192.168.2.11

/usr/local from 192.168.2.11:/usr/local/
Flags:
ro,relatime,vers=4,rsize=8192,wsize=8192,namlen=255,hard,proto=tcp,port=0,timeo=20,retrans=32,sec=sys,clientaddr=192.168.2.1,minorversion=0,local_lock=none,addr=192.168.2.11

NFSv3 (works; mounted in fstab the same as above, except with nfs replacing
nfs4 as the filesystem type and an extra option, nfsvers=3; see the
illustrative fstab entries after these mount listings):
/home from 192.168.2.11:/home
Flags:
rw,relatime,vers=3,rsize=8192,wsize=8192,namlen=255,hard,proto=tcp,timeo=20,retrans=32,sec=sys,mountaddr=192.168.2.11,mountvers=3,mountport=55359,mountproto=udp,local_lock=none,addr=192.168.2.11

/usr/local from 192.168.2.11:/usr/local
Flags:
ro,relatime,vers=3,rsize=8192,wsize=8192,namlen=255,hard,proto=tcp,timeo=20,retrans=32,sec=sys,mountaddr=192.168.2.11,mountvers=3,mountport=55359,mountproto=udp,local_lock=none,addr=192.168.2.11
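
For concreteness, the fstab entries look roughly like the sketch below. These
lines are reconstructed from the mount options shown above rather than copied
verbatim from my fstab, so the exact option list may differ slightly:

# NFSv4 mounts (hang):
192.168.2.11:/home       /home       nfs4  rw,hard,proto=tcp,rsize=8192,wsize=8192,timeo=20,retrans=32  0 0
192.168.2.11:/usr/local  /usr/local  nfs4  ro,hard,proto=tcp,rsize=8192,wsize=8192,timeo=20,retrans=32  0 0

# NFSv3 mounts (work): same lines, with nfs as the type and nfsvers=3 appended
192.168.2.11:/home       /home       nfs   rw,hard,proto=tcp,rsize=8192,wsize=8192,timeo=20,retrans=32,nfsvers=3  0 0
192.168.2.11:/usr/local  /usr/local  nfs   ro,hard,proto=tcp,rsize=8192,wsize=8192,timeo=20,retrans=32,nfsvers=3  0 0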


I'm not sure whether my problem is related to the issues mentioned by others.
Even though I can work around it by switching to NFSv3, I would like to get
MPICH2 working with NFSv4 if possible. Does anyone have any suggestions?
Should I open a new bug/ticket, or are there other things I should try first?
I'd be happy to provide more details or answer questions to help narrow down
or fix the issue between MPICH2 and NFSv4.

*An f77/io test failed during one of my partial runs of the test suite, but
I'm not too worried about this, since my "reference" NFSv3 system also failed
at the same spot; I suspect these I/O test failures may be par for the course
when running on an NFS filesystem.

Thanks,
Greg

Quoting Gregory Magoon <gmagoon at MIT.EDU>:

> I have been encountering a similar issue when I run the MPICH2 v.1.4 tests
> on an NFS file system (the tests also cause an abnormally high amount of
> NFS traffic). I don't have the same issues when running on a local
> filesystem. I'm wondering if this might be related to ticket #1422 and/or
> ticket #1483:
> http://trac.mcs.anl.gov/projects/mpich2/ticket/1422
> http://trac.mcs.anl.gov/projects/mpich2/ticket/1483
>
> I'm new to mpich, so if anyone has any tips, it would be very much
> appreciated. Here is the initial output of the failed tests:
>
> user at node01:~/Molpro/src/mpich2-1.4$ make testing
> (cd test && make testing)
> make[1]: Entering directory `/home/user/Molpro/src/mpich2-1.4/test'
> (NOXMLCLOSE=YES && export NOXMLCLOSE && cd mpi && make testing)
> make[2]: Entering directory `/home/user/Molpro/src/mpich2-1.4/test/mpi'
> ./runtests -srcdir=. -tests=testlist \
>                   -mpiexec=/home/user/Molpro/src/mpich2-install/bin/mpiexec \
>                   -xmlfile=summary.xml
> Looking in ./testlist
> Processing directory attr
> Looking in ./attr/testlist
> Processing directory coll
> Looking in ./coll/testlist
> Unexpected output in allred: [mpiexec at node01] APPLICATION TIMED OUT
> Unexpected output in allred: [proxy:0:0 at node01] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
> Unexpected output in allred: [proxy:0:0 at node01] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
> Unexpected output in allred: [proxy:0:0 at node01] main (./pm/pmiserv/pmip.c:226): demux engine error waiting for event
> Unexpected output in allred: [mpiexec at node01] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:70): one of the processes terminated badly; aborting
> Unexpected output in allred: [mpiexec at node01] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
> Unexpected output in allred: [mpiexec at node01] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:189): launcher returned error waiting for completion
> Unexpected output in allred: [mpiexec at node01] main (./ui/mpich/mpiexec.c:397): process manager error waiting for completion
> Program allred exited without No Errors
>
> Thanks,
> Greg
>
>> There could be some thread safety related issue in your code. If your code
>> is simple enough or you can reproduce it with a simple test program, you
>> can post the test here.
>>
>> Rajeev
>>
>> On Jul 1, 2011, at 12:31 AM, jarray52 jarray52 wrote:
>>
>> Hi,
>>
>> My code crashes, and I'm not sure how to debug the problem. I'm new to
>> MPI/mpich programming, and any suggestions on debugging the problem would
>> be appreciated. Here is the error output displayed by mpich:
>>
>> [proxy:0:1 at ComputeNodeIB101] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
>> [proxy:0:1 at ComputeNodeIB101] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
>> [proxy:0:1 at ComputeNodeIB101] main (./pm/pmiserv/pmip.c:222): demux engine error waiting for event
>> [proxy:0:2 at ComputeNodeIB102] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
>> [proxy:0:2 at ComputeNodeIB102] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
>> [proxy:0:2 at ComputeNodeIB102] main (./pm/pmiserv/pmip.c:222): demux engine error waiting for event
>> [proxy:0:3 at ComputeNodeIB103] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
>> [proxy:0:3 at ComputeNodeIB103] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
>> [proxy:0:3 at ComputeNodeIB103] main (./pm/pmiserv/pmip.c:222): demux engine error waiting for event
>>
>> I'm using MPI_THREAD_MULTIPLE over an ib fabric. The problem doesn't occur
>> all the time. I believe it occurs during a recv statement, but I'm not
>> certain.
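
A minimal test program along the lines Rajeev suggests might look something
like the sketch below. This is only an illustration of the pattern described
above (a blocking receive running in its own thread under MPI_THREAD_MULTIPLE
while the other ranks send); it is not the actual code from this thread, and
the names used here (recv_thread, TAG, reproducer.c) are made up for the
example.

/* Sketch of a threaded send/recv reproducer (illustrative only).
 * Rank 0 receives in a helper thread while every other rank sends
 * it one integer. */
#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

#define TAG 42              /* arbitrary message tag for this sketch */

static int nprocs;          /* number of ranks, set in main() */

/* Receiver thread: collect one integer from every other rank. */
static void *recv_thread(void *arg)
{
    int i, val;
    (void)arg;
    for (i = 1; i < nprocs; i++) {
        MPI_Recv(&val, 1, MPI_INT, MPI_ANY_SOURCE, TAG,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int provided, rank;
    pthread_t tid;

    /* Ask for full thread support; bail out if it isn't available. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI_THREAD_MULTIPLE not available (got %d)\n",
                provided);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank == 0) {
        /* Rank 0 receives in a separate thread while main() waits. */
        pthread_create(&tid, NULL, recv_thread, NULL);
        pthread_join(tid, NULL);
    } else {
        MPI_Send(&rank, 1, MPI_INT, 0, TAG, MPI_COMM_WORLD);
    }

    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0)
        printf("No Errors\n");
    MPI_Finalize();
    return 0;
}

Built with something like "mpicc -pthread reproducer.c -o reproducer" and
launched with mpiexec across the nodes, a reproducer of this shape would at
least show whether the hang/assert follows MPI_THREAD_MULTIPLE itself or
something specific to the application.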
>>
>> Thanks,
>> Jay



