[mpich2-dev] MPI_COMM_WORLD failed when using pio-bench on PVFS

汪迪 otheryou at yahoo.cn
Thu Mar 26 00:05:04 CDT 2009


Hi Rajeev,

Other MPI programs run correctly, including the cpi example under the examples folder:

gxwangdi at WANGDI ~/D/m/examples> pwd
~/Downloads/mpich2-1.0.8/examples
gxwangdi at WANGDI ~/D/m/examples> mpd &
gxwangdi at WANGDI ~/D/m/examples> mpdtrace
WANGDI
gxwangdi at WANGDI ~/D/m/examples> which mpiexec
/usr/local/bin/mpiexec
gxwangdi at WANGDI ~/D/m/examples> mpiexec -n 10 ./cpi
Process 2 of 10 is on WANGDI
Process 1 of 10 is on WANGDI
Process 0 of 10 is on WANGDI
Process 6 of 10 is on WANGDI
Process 4 of 10 is on WANGDI
Process 9 of 10 is on WANGDI
Process 3 of 10 is on WANGDI
Process 7 of 10 is on WANGDI
Process 8 of 10 is on WANGDI
Process 5 of 10 is on WANGDI
pi is approximately 3.1415926544231256, Error is 0.0000000008333325
wall clock time = 0.012229
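
For reference, the core of the cpi example is roughly the following Bcast/Reduce pattern over MPI_COMM_WORLD (a paraphrase of examples/cpi.c, slightly simplified, not the verbatim source):

// Paraphrase of the MPICH2 cpi example (assumption: simplified from
// examples/cpi.c).  Rank 0 broadcasts the interval count, every rank
// integrates its strided slice of 4/(1+x^2), and MPI_Reduce sums the
// partial results on rank 0.
#include <mpi.h>
#include <cstdio>
#include <cmath>

int main(int argc, char **argv) {
    int rank, size, n = 10000;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);    // rank 0 decides n

    double h = 1.0 / n, sum = 0.0;
    for (int i = rank + 1; i <= n; i += size) {      // strided slice per rank
        double x = h * (i - 0.5);
        sum += 4.0 / (1.0 + x * x);
    }
    double mypi = h * sum, pi = 0.0;
    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("pi is approximately %.16f, Error is %.16f\n",
               pi, std::fabs(pi - 3.141592653589793));
    MPI_Finalize();
    return 0;
}

So MPI_Bcast, MPI_Reduce, and MPI_COMM_WORLD itself all behave normally on this machine when every rank supplies valid buffers.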

Thanks for your suggestion. I ran "make testing" in the top-level mpich2 directory, and one test does not pass:
gxwangdi at WANGDI ~/D/mpich2-1.0.8> mpd &
gxwangdi at WANGDI ~/D/mpich2-1.0.8> make testing
(cd test && make testing)
make[1]: Entering directory `/home/gxwangdi/Downloads/mpich2-1.0.8/test'
(NOXMLCLOSE=YES && export NOXMLCLOSE && cd mpi && make testing)
make[2]: Entering directory `/home/gxwangdi/Downloads/mpich2-1.0.8/test/mpi'
./runtests -srcdir=. -tests=testlist \
           -mpiexec=/usr/local/bin/mpiexec \
           -xmlfile=summary.xml
Looking in ./testlist
Processing directory attr
Looking in ./attr/testlist
Processing directory coll
Looking in ./coll/testlist
Processing directory comm
Looking in ./comm/testlist
Some programs (cmsplit) may still be running:
pids = 2279 
The executable (cmsplit) will not be removed.
Processing directory datatype
Looking in ./datatype/testlist
Processing directory errhan
Looking in ./errhan/testlist
Processing directory group
Looking in ./group/testlist
Processing directory info
Looking in ./info/testlist
Processing directory init
Looking in ./init/testlist
Processing directory pt2pt
Looking in ./pt2pt/testlist
Some programs (sendrecv3) may still be running:
pids = 4049 
The executable (sendrecv3) will not be removed.
Processing directory rma
Looking in ./rma/testlist
Some programs (transpose3) may still be running:
pids = 5713 
The executable (transpose3) will not be removed.
Processing directory spawn
Looking in ./spawn/testlist
Processing directory topo
Looking in ./topo/testlist
Processing directory perf
Looking in ./perf/testlist
Processing directory io
Looking in ./io/testlist
Processing directory cxx
Looking in ./cxx/testlist
Processing directory attr
Looking in ./cxx/attr/testlist
Processing directory pt2pt
Looking in ./cxx/pt2pt/testlist
Failed to build bsend1cxx; make[3]: Entering directory `/home/gxwangdi/Downloads/mpich2-1.0.8/test/mpi/cxx/pt2pt'
/usr/local/bin/mpicxx -DHAVE_CONFIG_H -I. -I. -I../../include -I./../../include -c bsend1cxx.cxx
bsend1cxx.cxx: In function ‘int main(int, char**)’:
bsend1cxx.cxx:81: error: ‘strcmp’ was not declared in this scope
bsend1cxx.cxx:91: error: ‘strcmp’ was not declared in this scope
make[3]: *** [bsend1cxx.o] Error 1
make[3]: Leaving directory `/home/gxwangdi/Downloads/mpich2-1.0.8/test/mpi/cxx/pt2pt'
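
This particular failure is in the test program itself rather than in the MPICH2 build: the GCC 4.3 that ships with Ubuntu 8.10 no longer pulls in the string headers transitively, so strcmp is undeclared in bsend1cxx.cxx. A one-line local fix (assuming nothing else in the file is affected) is to add the missing include at the top of test/mpi/cxx/pt2pt/bsend1cxx.cxx:

// Add alongside the existing includes in bsend1cxx.cxx:
#include <cstring>   // declares strcmp, used around lines 81 and 91

After that the test should at least compile; whether it then passes is a separate question.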

Processing directory comm
Looking in ./cxx/comm/testlist
Processing directory coll
Looking in ./cxx/coll/testlist
Processing directory init
Looking in ./cxx/init/testlist
Processing directory info
Looking in ./cxx/info/testlist
Processing directory datatype
Looking in ./cxx/datatype/testlist
Processing directory io
Looking in ./cxx/io/testlist
Processing directory spawn
Looking in ./cxx/spawn/testlist
Processing directory rma
Looking in ./cxx/rma/testlist
Processing directory errors
Looking in ./errors/testlist
Processing directory attr
Looking in ./errors/attr/testlist
Processing directory coll
Looking in ./errors/coll/testlist
Processing directory comm
Looking in ./errors/comm/testlist
Processing directory group
Looking in ./errors/group/testlist
Processing directory pt2pt
Looking in ./errors/pt2pt/testlist
Processing directory topo
Looking in ./errors/topo/testlist
Processing directory rma
Looking in ./errors/rma/testlist
Processing directory spawn
Looking in ./errors/spawn/testlist
Processing directory io
Looking in ./errors/io/testlist
Processing directory cxx
Looking in ./errors/cxx/testlist
Processing directory errhan
Looking in ./errors/cxx/errhan/testlist
Processing directory io
Looking in ./errors/cxx/io/testlist
Processing directory threads
Looking in ./threads/testlist
Processing directory pt2pt
Looking in ./threads/pt2pt/testlist
Processing directory comm
Looking in ./threads/comm/testlist
1 tests failed out of 385
Details in /home/gxwangdi/Downloads/mpich2-1.0.8/test/mpi/summary.xml
make[2]: Leaving directory `/home/gxwangdi/Downloads/mpich2-1.0.8/test/mpi'
(XMLFILE=../mpi/summary.xml && XMLCONTINUE=YES && \
    export XMLFILE && export XMLCONTINUE && \
    cd commands && make testing)
make[2]: Entering directory `/home/gxwangdi/Downloads/mpich2-1.0.8/test/commands'
make[2]: Nothing to be done for `testing'.
make[2]: Leaving directory `/home/gxwangdi/Downloads/mpich2-1.0.8/test/commands'
make[1]: Leaving directory `/home/gxwangdi/Downloads/mpich2-1.0.8/test'

The attachment is the /home/gxwangdi/Downloads/mpich2-1.0.8/test/mpi/summary.xml file; it is too long to paste its full contents here. I still cannot tell what the problem is.

From: Rajeev Thakur <thakur at mcs.anl.gov>
Subject: RE: [mpich2-dev] MPI_COMM_WORLD failed when using pio-bench on PVFS
To: otheryou at yahoo.cn, mpich2-dev at mcs.anl.gov
Date: Thursday, March 26, 2009, 2:41

Do the other MPICH2 tests run, such as the cpi example in the examples directory? If you run "make testing" in the top-level mpich2 directory it will run the entire test suite in test/mpi (can take more than an hour).

Rajeev


From: mpich2-dev-bounces at mcs.anl.gov [mailto:mpich2-dev-bounces at mcs.anl.gov] On Behalf Of 汪迪
Sent: Wednesday, March 25, 2009 12:11 AM
To: MPICH2-developer mailing-list
Subject: [mpich2-dev] MPI_COMM_WORLD failed when using pio-bench on PVFS

Hi,

I have configured a PVFS server on my own laptop, which uses MPICH2 as its trove storage system implementation, and I now want to use pio-bench to get a trace of the server. It works when I set the number of processes to 1, but whenever I set the number of processes to more than 1 it fails. It reports that MPI_COMM_WORLD failed, yet running the hostname command through MPI works, which is peculiar. I have checked the MPICH2 documentation and found no special configuration required for a single-host MPI deployment. Am I doing something wrong?

By the way, my laptop is an IBM i386 machine with an Intel Centrino 2 vPro CPU running Ubuntu 8.10. MPICH2 1.0.8 is installed in the default location under /usr/local, and pvfs-2.7.1 is under /root/pvfs-install/.

gxwangdi at WANGDI:~/Desktop/pio-bench$ sudo mpiexec -n 1 ./pio-bench
[sudo] password for gxwangdi:
File under test: /mnt/pvfs2/ftpaccess
Number of Processes: 1
Sync: off
Averaging: Off
the nested strided pattern needs to be run with an even amount of processes
file pio-bench.c, line 586: access pattern initialization error: -1
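
(The "even amount of processes" message explains why -n 1 fails here: the nested strided pattern apparently pairs ranks up, so an odd process count is rejected during access pattern initialization. Below is a minimal, self-contained sketch of that kind of guard; it is my own illustration, not pio-bench's actual code.)

// Hypothetical sketch (assumption: not the real pio-bench source) of the
// parity check implied by the message above.
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (nprocs % 2 != 0) {
        if (rank == 0)
            fprintf(stderr, "the nested strided pattern needs to be run "
                            "with an even amount of processes (got %d)\n", nprocs);
        MPI_Finalize();
        return 1;   // mirrors the "-1" initialization error in the run above
    }

    if (rank == 0)
        printf("%d processes: the pattern can pair the ranks up\n", nprocs);
    MPI_Finalize();
    return 0;
}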

gxwangdi at WANGDI:~/Desktop/pio-bench$ sudo mpiexec -n 4 ./pio-bench
Fatal error in MPI_Barrier: Other MPI error, error stack:
MPI_Barrier(406)..........................: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier(77)..........................:
MPIC_Sendrecv(126)........................:
MPIC_Wait(270)............................:
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(420):
MPIDU_Socki_handle_read(637)..............: connection failure (set=0,sock=1,errno=104:Connection reset by peer)
[cli_0]: aborting job:
Fatal error in MPI_Barrier: Other MPI error, error stack:
MPI_Barrier(406)..........................: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier(77)..........................:
MPIC_Sendrecv(126)........................:
MPIC_Wait(270)............................:
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(420):
MPIDU_Socki_handle_readFatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)............................: MPI_Bcast(buf=0x1fd6ca78, count=20, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(198)...........................:
MPIC_Recv(81).............................:
MPIC_Wait(270)............................:
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(456):
adjust_iov(973)...........................: ch3|sock|immedread 0x1e5a0d60 0x1f329978 0x1f3258d0
MPIDU_Sock_readv(455).....................: the supplied buffer contains invalid memory (set=0,sock=1,errno=14:Bad address)
[cli_1]: aborting job:
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)............................: MPI_Bcast(buf=0x1fd6ca78, count=20, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(198)...........................:
MPIC_Recv(81).............................:
MPIC_Wait(270)............................(637)..............: connection failure (set=0,sock=1,errno=104:Connection reset by peer)
:
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(456):
adjust_iov(973)...........................: ch3|sock|immedread 0x1e5a0d60 0x1f329978 0x1f3258d0
MPIDU_Sock_readv(455).....................: the supplied buffer contains invalid memory (set=0,sock=1,errno=14:Bad address)
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)............................: MPI_Bcast(buf=0x1fd6ca78, count=20, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(198)...........................:
MPIC_Recv(81).............................:
MPIC_Wait(270)............................:
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(456):
adjust_iov(973)...........................: ch3|sock|immedread 0x1e5a0d60 0x1eb9f978 0x1eb9b8d0
MPIDU_Sock_readv(455).....................: the supplied buffer contains invalid memory (set=0,sock=1,errno=14:Bad address)
[cli_2]: aborting job:
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)............................: MPI_Bcast(buf=0x1fd6ca78, count=20, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(198)...........................:
MPIC_Recv(81).............................:
MPIC_Wait(270)............................:
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(456):
adjust_iov(973)...........................: ch3|sock|immedread 0x1e5a0d60 0x1eb9f978 0x1eb9b8d0
MPIDU_Sock_readv(455).....................: the supplied buffer contains invalid memory (set=0,sock=1,errno=14:Bad address)
rank 1 in job 9  WANGDI_59039   caused collective abort of all ranks
  exit status of rank 1: return code 1
rank 0 in job 9  WANGDI_59039   caused collective abort of all ranks
  exit status of rank 0: return code 1
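
(A note on the error itself: "errno=14: Bad address" from MPIDU_Sock_readv typically means a receiving rank handed MPI a buffer pointer that is not valid in that process's address space. The sketch below only illustrates the contract every rank has to satisfy in MPI_Bcast; it is not pio-bench code. But if the non-root ranks here broadcast into an unallocated or stale pointer, this is exactly the stack you would expect.)

// Illustration (assumption: not pio-bench code) of the MPI_Bcast buffer
// contract.  Every rank, not just the root, must pass a pointer to at
// least `count` valid bytes; the sock channel readv()s directly into it.
#include <mpi.h>
#include <cstdio>
#include <cstdlib>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 20;                    // same count as in the failing Bcast
    char *buf = (char *) malloc(count);      // must be valid on EVERY rank
    if (rank == 0)
        for (int i = 0; i < count; i++) buf[i] = (char) i;

    // If a non-root rank passed an uninitialized or freed pointer here,
    // MPIDU_Sock_readv would fail with errno 14 (Bad address), as above.
    MPI_Bcast(buf, count, MPI_BYTE, 0, MPI_COMM_WORLD);

    printf("rank %d received %d bytes\n", rank, count);
    free(buf);
    MPI_Finalize();
    return 0;
}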
gxwangdi at WANGDI:~/Desktop/pio-bench$ sudo mpiexec -n 4 hostname
WANGDI
WANGDI
WANGDI
WANGDI


And my pio-bench.conf file looks like this:

Testfile "/mnt/pvfs2/ftpaccess"

OutputToFile "/home/gxwangdi/Desktop/pio-bench/results/result"

<ap_module>
ModuleName "Nested Strided (read)"
ModuleReps 3
ModuleSettleTime 5
</ap_module>

<ap_module>
ModuleName "Nested Strided (write)"
ModuleReps 3
ModuleSettleTime 5
</ap_module>

<ap_module>
ModuleName "Nested Strided (read-modify-write)"
ModuleReps 3
ModuleSettleTime 5
</ap_module>

<ap_module>
ModuleName "Nested Strided (re-read)"
ModuleReps 3
ModuleSettleTime 5
</ap_module>

<ap_module>
ModuleName "Nested Strided (re-write)"
ModuleReps 3
ModuleSettleTime 5
</ap_module>

I also tried another file that is not under /mnt/pvfs2 as the test file, and it reports the following:

gxwangdi at WANGDI:~/Desktop/pio-bench$ sudo mpiexec -n 4 ./pio-bench
Fatal error in MPI_Barrier: Other MPI error, error stack:
MPI_Barrier(406)..........................: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier(77)..........................:
MPIC_Sendrecv(126)........................:
MPIC_Wait(270)............................:
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(420):
MPIDU_Socki_handle_read(637)..............: connection failure (set=0,sock=1,errno=104:Connection reset by peer)
[cli_0]: aborting job:
Fatal error in MPI_Barrier: Other MPI error, error stack:
MPI_Barrier(406)..........................: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier(77)..........................:
MPIC_Sendrecv(126)........................:
MPIC_Wait(270)............................:
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(420):
MPIDU_Socki_handle_readFatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)............................: MPI_Bcast(buf=0x1f60fad0, count=20, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(198)...........................:
MPIC_Recv(81).............................:
MPIC_Wait(270)............................:
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(456):
adjust_iov(973)...........................: ch3|sock|immedread 0x1e5a0d60 0x1f525978 0x1f5218d0
MPIDU_Sock_readv(455).....................: the supplied buffer contains invalid memory (set=0,sock=1,errno=14:Bad address)
[cli_2]: aborting job:
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)............................: MPI_Bcast(buf=0x1f60fad0, count=20, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(198)...........................:
MPIC_Recv(81).............................:
MPIC_Wait(270)............................(637)..............: connection failure (set=0,sock=1,errno=104:Connection reset by peer)
:
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(456):
adjust_iov(973)...........................: ch3|sock|immedread 0x1e5a0d60 0x1f525978 0x1f5218d0
MPIDU_Sock_readv(455).....................: the supplied buffer contains invalid memory (set=0,sock=1,errno=14:Bad address)
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)............................: MPI_Bcast(buf=0x1f60fad0, count=20, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(198)...........................:
MPIC_Recv(81).............................:
MPIC_Wait(270)............................:
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(456):
adjust_iov(973)...........................: ch3|sock|immedread 0x1e5a0d60 0x202d2978 0x202ce8d0
MPIDU_Sock_readv(455).....................: the supplied buffer contains invalid memory (set=0,sock=1,errno=14:Bad address)
[cli_1]: aborting job:
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)............................: MPI_Bcast(buf=0x1f60fad0, count=20, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(198)...........................:
MPIC_Recv(81).............................:
MPIC_Wait(270)............................:
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(456):
adjust_iov(973)...........................: ch3|sock|immedread 0x1e5a0d60 0x202d2978 0x202ce8d0
MPIDU_Sock_readv(455).....................: the supplied buffer contains invalid memory (set=0,sock=1,errno=14:Bad address)
rank 2 in job 11  WANGDI_59039   caused collective abort of all ranks
  exit status of rank 2: return code 1
rank 1 in job 11  WANGDI_59039   caused collective abort of all ranks
  exit status of rank 1: return code 1
rank 0 in job 11  WANGDI_59039   caused collective abort of all ranks
  exit status of rank 0: return code 1


MPI_COMM_WORLD failed again, but this time it ended with a collective abort of all ranks, which is a little different. Since there is no syslog for pio-bench to check, I do not understand what is happening and cannot solve this problem.

I appreciate your responses.



  
-------------- next part --------------
A non-text attachment was scrubbed...
Name: summary.xml
Type: text/xml
Size: 43399 bytes
Desc: not available
URL: <https://lists.mcs.anl.gov/mailman/private/mpich2-dev/attachments/20090326/a454943f/attachment-0001.bin>

