[mpich-discuss] NFSv4 and MPICH2 (was: Program Crash)

Rajeev Thakur thakur at mcs.anl.gov
Mon Jul 25 13:59:26 CDT 2011


Not sure why NFS should make a difference in the communication tests, such as allred.

Rajeev

On Jul 23, 2011, at 10:53 PM, Gregory Magoon wrote:

> After some work, I was able to narrow my problem down to some sort of issue
> between MPICH2 and NFSv4. When I mount the NFS as version 3, the MPICH2 tests
> seem to work without any major problems*. Here are the NFS mounting details
> from "nfsstat -m":
> 
> 
> NFSv4 (hangs and times out, with a large quantity of NFS traffic):
> /home from 192.168.2.11:/home/
> Flags:
> rw,relatime,vers=4,rsize=8192,wsize=8192,namlen=255,hard,proto=tcp,port=0,timeo=20,retrans=32,sec=sys,clientaddr=192.168.2.1,minorversion=0,local_lock=none,addr=192.168.2.11
> 
> /usr/local from 192.168.2.11:/usr/local/
> Flags:
> ro,relatime,vers=4,rsize=8192,wsize=8192,namlen=255,hard,proto=tcp,port=0,timeo=20,retrans=32,sec=sys,clientaddr=192.168.2.1,minorversion=0,local_lock=none,addr=192.168.2.11
> 
> NFSv3 (works; mounted in fstab the same as above, except with the filesystem
> type nfs replacing nfs4 and an extra option nfsvers=3):
> /home from 192.168.2.11:/home
> Flags:
> rw,relatime,vers=3,rsize=8192,wsize=8192,namlen=255,hard,proto=tcp,timeo=20,retrans=32,sec=sys,mountaddr=192.168.2.11,mountvers=3,mountport=55359,mountproto=udp,local_lock=none,addr=192.168.2.11
> 
> /usr/local from 192.168.2.11:/usr/local
> Flags:
> ro,relatime,vers=3,rsize=8192,wsize=8192,namlen=255,hard,proto=tcp,timeo=20,retrans=32,sec=sys,mountaddr=192.168.2.11,mountvers=3,mountport=55359,mountproto=udp,local_lock=none,addr=192.168.2.11
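> 
> For reference, the corresponding fstab entries look roughly like the following
> (a sketch reconstructed from the mount options above; the exact paths and
> option order may differ slightly from my actual fstab):
> 
> # NFSv4 (hangs):
> 192.168.2.11:/home/       /home       nfs4  rw,hard,proto=tcp,rsize=8192,wsize=8192,timeo=20,retrans=32  0  0
> 192.168.2.11:/usr/local/  /usr/local  nfs4  ro,hard,proto=tcp,rsize=8192,wsize=8192,timeo=20,retrans=32  0  0
> 
> # NFSv3 (works): same entries, with nfs as the filesystem type and nfsvers=3 added
> 192.168.2.11:/home        /home       nfs   rw,nfsvers=3,hard,proto=tcp,rsize=8192,wsize=8192,timeo=20,retrans=32  0  0
> 192.168.2.11:/usr/local   /usr/local  nfs   ro,nfsvers=3,hard,proto=tcp,rsize=8192,wsize=8192,timeo=20,retrans=32  0  0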
> 
> 
> I'm not sure whether my problem is related to the issues mentioned by others.
> Even though I can get this working by switching to NFSv3, I would like to get
> MPICH2 working with NFSv4, if possible. Does anyone have any suggestions?
> Should I post this as a new bug/ticket or are there some other things I should
> try first? I'd be happy to provide more details or answer questions to try to
> narrow down or fix the issue between MPICH2 and NFSv4.
> 
> *An f77/io test failed one of the times I ran a partial set of the tests, but I'm
> not too worried about that: my "reference" NFSv3 system also failed at the same
> spot, so I suspect these I/O test failures may simply be par for the course when
> running on an NFS filesystem.
> 
> Thanks,
> Greg
> 
> Quoting Gregory Magoon <gmagoon at MIT.EDU>:
> 
>> I have been encountering a similar issue when I run the MPICH2 v.1.4 tests on an
>> NFS file system (the tests also cause an abnormally high amount of NFS traffic).
>> I don't have the same issues when running on a local filesystem. I'm wondering
>> if this might be related to ticket #1422 and/or ticket #1483:
>> http://trac.mcs.anl.gov/projects/mpich2/ticket/1422
>> http://trac.mcs.anl.gov/projects/mpich2/ticket/1483
>> 
>> I'm new to mpich, so if anyone has any tips, it would be very much appreciated.
>> Here is the initial output of the failed tests:
>> 
>> user at node01:~/Molpro/src/mpich2-1.4$ make testing
>> (cd test && make testing)
>> make[1]: Entering directory `/home/user/Molpro/src/mpich2-1.4/test'
>> (NOXMLCLOSE=YES && export NOXMLCLOSE && cd mpi && make testing)
>> make[2]: Entering directory `/home/user/Molpro/src/mpich2-1.4/test/mpi'
>> ./runtests -srcdir=. -tests=testlist \
>>                  -mpiexec=/home/user/Molpro/src/mpich2-install/bin/mpiexec \
>>                  -xmlfile=summary.xml
>> Looking in ./testlist
>> Processing directory attr
>> Looking in ./attr/testlist
>> Processing directory coll
>> Looking in ./coll/testlist
>> Unexpected output in allred: [mpiexec at node01] APPLICATION TIMED OUT
>> Unexpected output in allred: [proxy:0:0 at node01] HYD_pmcd_pmip_control_cmd_cb
>> (./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
>> Unexpected output in allred: [proxy:0:0 at node01] HYDT_dmxu_poll_wait_for_event
>> (./tools/demux/demux_poll.c:77): callback returned error status
>> Unexpected output in allred: [proxy:0:0 at node01] main (./pm/pmiserv/pmip.c:226):
>> demux engine error waiting for event
>> Unexpected output in allred: [mpiexec at node01] HYDT_bscu_wait_for_completion
>> (./tools/bootstrap/utils/bscu_wait.c:70): one of the processes terminated
>> badly; aborting
>> Unexpected output in allred: [mpiexec at node01] HYDT_bsci_wait_for_completion
>> (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for
>> completion
>> Unexpected output in allred: [mpiexec at node01] HYD_pmci_wait_for_completion
>> (./pm/pmiserv/pmiserv_pmci.c:189): launcher returned error waiting for
>> completion
>> Unexpected output in allred: [mpiexec at node01] main (./ui/mpich/mpiexec.c:397):
>> process manager error waiting for completion
>> Program allred exited without No Errors
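>> 
>> For what it's worth, a minimal standalone allreduce program along these lines
>> (just a sketch, not the suite's actual allred test) could be compiled with
>> mpicc and run from the NFS-mounted directory to check whether the hang also
>> occurs outside the test harness:
>> 
>>   /* minimal allreduce check: mpicc allred_min.c -o allred_min
>>      then: mpiexec -n 4 ./allred_min                           */
>>   #include <mpi.h>
>>   #include <stdio.h>
>> 
>>   int main(int argc, char **argv)
>>   {
>>       int rank, in, out;
>>       MPI_Init(&argc, &argv);
>>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>       in = rank;
>>       /* sum the ranks across all processes */
>>       MPI_Allreduce(&in, &out, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
>>       if (rank == 0)
>>           printf("sum of ranks = %d\n", out);
>>       MPI_Finalize();
>>       return 0;
>>   }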
>> 
>> Thanks,
>> Greg
>> 
>>> There could be a thread-safety issue in your code. If your code is simple
>>> enough, or you can reproduce the problem with a simple test program, you can
>>> post the test here.
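>>> 
>>> For example, a stripped-down test of roughly the shape you describe
>>> (MPI_THREAD_MULTIPLE with a blocking receive in a separate thread) would be
>>> enough to post; a sketch:
>>> 
>>>   /* sketch: each rank posts a blocking receive in a helper thread and
>>>      sends one message to the next rank from the main thread.
>>>      Build: mpicc -pthread thread_recv.c -o thread_recv
>>>      Run:   mpiexec -n 4 ./thread_recv                                  */
>>>   #include <mpi.h>
>>>   #include <pthread.h>
>>>   #include <stdio.h>
>>> 
>>>   static void *recv_thread(void *arg)
>>>   {
>>>       int buf;
>>>       MPI_Recv(&buf, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD,
>>>                MPI_STATUS_IGNORE);
>>>       printf("received %d\n", buf);
>>>       return NULL;
>>>   }
>>> 
>>>   int main(int argc, char **argv)
>>>   {
>>>       int provided, rank, size, msg;
>>>       pthread_t t;
>>> 
>>>       MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
>>>       if (provided < MPI_THREAD_MULTIPLE)
>>>           printf("warning: full thread support not provided (got %d)\n",
>>>                  provided);
>>> 
>>>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>       MPI_Comm_size(MPI_COMM_WORLD, &size);
>>> 
>>>       pthread_create(&t, NULL, recv_thread, NULL);
>>> 
>>>       /* ring pattern: each rank sends one message to the next rank */
>>>       msg = rank;
>>>       MPI_Send(&msg, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
>>> 
>>>       pthread_join(t, NULL);   /* wait for the receive to complete */
>>>       MPI_Finalize();
>>>       return 0;
>>>   }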
>>> 
>>> Rajeev
>>> 
>>> On Jul 1, 2011, at 12:31 AM, jarray52 jarray52 wrote:
>>> 
>>> Hi,
>>> 
>>> My code crashes, and I'm not sure how to debug the problem. I'm new to
>>> MPI/mpich programming, and any suggestions on debugging the problem would be
>>> appreciated. Here is the error output displayed by mpich:
>>> 
>>> [proxy:0:1 at ComputeNodeIB101] HYD_pmcd_pmip_control_cmd_cb
>>> (./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
>>> [proxy:0:1 at ComputeNodeIB101] HYDT_dmxu_poll_wait_for_event
>>> (./tools/demux/demux_poll.c:77): callback returned error status
>>> [proxy:0:1 at ComputeNodeIB101] main (./pm/pmiserv/pmip.c:222): demux engine
>>> error waiting for event
>>> [proxy:0:2 at ComputeNodeIB102] HYD_pmcd_pmip_control_cmd_cb
>>> (./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
>>> [proxy:0:2 at ComputeNodeIB102] HYDT_dmxu_poll_wait_for_event
>>> (./tools/demux/demux_poll.c:77): callback returned error status
>>> [proxy:0:2 at ComputeNodeIB102] main (./pm/pmiserv/pmip.c:222): demux engine
>>> error waiting for event
>>> [proxy:0:3 at ComputeNodeIB103] HYD_pmcd_pmip_control_cmd_cb
>>> (./pm/pmiserv/pmip_cb.c:906): assert (!closed) failed
>>> [proxy:0:3 at ComputeNodeIB103] HYDT_dmxu_poll_wait_for_event
>>> (./tools/demux/demux_poll.c:77): callback returned error status
>>> [proxy:0:3 at ComputeNodeIB103] main (./pm/pmiserv/pmip.c:222): demux engine
>>> error waiting for event
>>> 
>>> I'm using MPI_THREAD_MULTIPLE over an ib fabric. The problem doesn't occur all
>>> the time. I believe it occurs during a recv statement, but I'm not certain.
>>> 
>>> Thanks,
>>> Jay


