[mpich2-dev] MPI_COMM_WORLD failed when using pio-bench on PVFS

汪迪 otheryou at yahoo.cn
Wed Apr 1 01:23:59 CDT 2009


Hi, Thakur

I installed MPICH2 with the default configuration, then installed PVFS as root on top of that MPI installation, and then reinstalled MPICH2 with the "--with-pvfs=pvfs-route" option. I verified my PVFS installation and it works fine; once I activate PVFS, mpd starts automatically in the background. I have been verifying the installation over the past two days, and it works fine with all MPI programs, including cpi in the examples directory, except pio-bench and the noncontig benchmark. pio-bench seems to work fine when it is not run under MPI and accesses an ordinary file in the Linux file system. Even under mpich2, it works fine when the number of processes is 1, and so does noncontig. The failure occurs only when the number of processes is more than 1 (more than 2 for noncontig).
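
For clarity, the sequence above looks roughly like the following (a rough sketch, not copied from my shell history; <pvfs-install-path> stands for the PVFS installation directory I referred to as "pvfs-route"):

cd ~/Downloads/mpich2-1.0.8
./configure                                   # first build: default configuration
make && sudo make install
# ... build and install PVFS as root, then verify that PVFS works ...
./configure --with-pvfs=<pvfs-install-path>   # rebuild with the PVFS option mentioned above
make && sudo make install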

gxwangdi at WANGDI:~/Desktop/pio-bench$ ./pio-bench
File under test: /home/gxwangdi/Desktop/ftpaccess
Number of Processes: 1
Sync: off
Averaging: Off
Sleeping for 5 seconds
Sleeping for 5 seconds
Averaging: Off
Sleeping for 5 seconds
Sleeping for 5 seconds
Averaging: Off
Sleeping for 5 seconds
Sleeping for 5 seconds
Averaging: Off
Sleeping for 5 seconds
Sleeping for 5 seconds
Averaging: Off
gxwangdi at WANGDI:~/Desktop/pio-bench$ mpd &
[1] 13777
gxwangdi at WANGDI:~/Desktop/pio-bench$ mpiexec -n 4 ./pio-bench
Fatal error in MPI_Barrier: Other MPI error, error stack:
MPI_Barrier(406)..........................: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier(77)..........................: 
MPIC_Sendrecv(126)........................: 
MPIC_Wait(270)............................: 
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(420): 
MPIDU_Socki_handle_read(637)..............: connection failure (set=0,sock=2,errno=104:Connection reset by peer)[cli_0]: aborting job:
Fatal error in MPI_Barrier: Other MPI error, error stack:
MPI_Barrier(406)..........................: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier(77)..........................: 
MPIC_Sendrecv(126)........................: 
MPIC_Wait(270)............................: 
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(420): 
MPIDU_Socki_handle_readFatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)............................: MPI_Bcast(buf=0x1f2efa98, count=20, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(198)...........................: 
MPIC_Recv(81).............................: 
MPIC_Wait(270)............................: 
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(456): 
adjust_iov(973)...........................: ch3|sock|immedread 0x1e5a0d60 0x1fee6978 0x1fee28d0
MPIDU_Sock_readv(455).....................: the supplied buffer contains invalid memory (set=0,sock=1,errno=14:Bad address)[cli_1]: aborting job:
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)............................: MPI_Bcast(buf=0x1f2efa98, count=20, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(198)...........................: 
MPIC_Recv(81).............................: 
MPIC_Wait(270)............................(637)..............: connection failure (set=0,sock=2,errno=104:Connection reset by peer)
: 
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(456): 
adjust_iov(973)...........................: ch3|sock|immedread 0x1e5a0d60 0x1fee6978 0x1fee28d0
MPIDU_Sock_readv(455).....................: the supplied buffer contains invalid memory (set=0,sock=1,errno=14:Bad address)
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)............................: MPI_Bcast(buf=0x1f2efa98, count=20, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(198)...........................: 
MPIC_Recv(81).............................: 
MPIC_Wait(270)............................: 
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(456): 
adjust_iov(973)...........................: ch3|sock|immedread 0x1e5a11a4 0x1fdec978 0x1fde88d0
MPIDU_Sock_readv(455).....................: the supplied buffer contains invalid memory (set=0,sock=1,errno=14:Bad address)[cli_2]: aborting job:
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)............................: MPI_Bcast(buf=0x1f2efa98, count=20, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(198)...........................: 
MPIC_Recv(81).............................: 
MPIC_Wait(270)............................: 
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(456): 
adjust_iov(973)...........................: ch3|sock|immedread 0x1e5a11a4 0x1fdec978 0x1fde88d0
MPIDU_Sock_readv(455).....................: the supplied buffer contains invalid memory (set=0,sock=1,errno=14:Bad address)
rank 2 in job 1  WANGDI_57913   caused collective abort of all ranks
  exit status of rank 2: return code 1 
rank 1 in job 1  WANGDI_57913   caused collective abort of all ranks
  exit status of rank 1: killed by signal 9 
rank 0 in job 1  WANGDI_57913   caused collective abort of all ranks
  exit status of rank 0: return code 1 

gxwangdi at WANGDI:~/Desktop/pio-bench$ sudo mpd &
[1] 22125
gxwangdi at WANGDI:~/Desktop/pio-bench$ sudo mpiexec -n 4 ./pio-bench
[sudo] password for gxwangdi: 
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)............................: MPI_Bcast(buf=0x1eed1ad0, count=20, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(198)...........................: 
MPIC_Recv(81).............................: 
MPIC_Wait(270)............................: 
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(456): 
adjust_iov(973)...........................: ch3|sock|immedread 0x1e5a0d60 0x1ecee978 0x1ecea8d0
MPIDU_Sock_readv(455).....................: the supplied buffer contains invalid memory (set=0,sock=1,errno=14:Bad address)[cli_2]: aborting job:
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)............................: MPI_Bcast(buf=0x1eed1ad0, count=20, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(198)...........................: 
MPIC_Recv(81).............................: 
MPIC_Wait(270)............................Fatal error in MPI_Barrier: Other MPI error, error stack:
MPI_Barrier(406)..........................: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier(77)..........................: 
MPIC_Sendrecv(126)........................: 
MPIC_Wait(270)............................: 
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(420): 
MPIDU_Socki_handle_read(637)..............: connection failure (set=0,sock=1,errno=104:Connection reset by peer)[cli_0]: aborting job:
Fatal error in MPI_Barrier: Other MPI error, error stack:
MPI_Barrier(406)..........................: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier(77)..........................: 
MPIC_Sendrecv(126)........................: 
MPIC_Wait(270)............................: 
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(420): 
MPIDU_Socki_handle_read: 
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(456): 
adjust_iov(973)...........................: ch3|sock|immedread 0x1e5a0d60 0x1ecee978 0x1ecea8d0
MPIDU_Sock_readv(455).....................: the supplied buffer contains invalid memory (set=0,sock=1,errno=14:Bad address)
(637)..............: connection failure (set=0,sock=1,errno=104:Connection reset by peer)
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)............................: MPI_Bcast(buf=0x1eed1ad0, count=20, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(198)...........................: 
MPIC_Recv(81).............................: 
MPIC_Wait(270)............................: 
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(456): 
adjust_iov(973)...........................: ch3|sock|immedread 0x1e5a0d60 0x1f9eb978 0x1f9e78d0
MPIDU_Sock_readv(455).....................: the supplied buffer contains invalid memory (set=0,sock=1,errno=14:Bad address)[cli_1]: aborting job:
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)............................: MPI_Bcast(buf=0x1eed1ad0, count=20, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(198)...........................: 
MPIC_Recv(81).............................: 
MPIC_Wait(270)............................: 
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(456): 
adjust_iov(973)...........................: ch3|sock|immedread 0x1e5a0d60 0x1f9eb978 0x1f9e78d0
MPIDU_Sock_readv(455).....................: the supplied buffer contains invalid memory (set=0,sock=1,errno=14:Bad address)
rank 2 in job 1  WANGDI_49008   caused collective abort of all ranks
  exit status of rank 2: return code 1 
rank 1 in job 1  WANGDI_49008   caused collective abort of all ranks
  exit status of rank 1: return code 1 
rank 0 in job 1  WANGDI_49008   caused collective abort of all ranks
  exit status of rank 0: return code 1 

The pio-bench.conf file is as follows:

#Testfile "/mnt/pvfs2/ftpaccess"
TestFile "/home/gxwangdi/Desktop/ftpaccess"

OutputToFile "/home/gxwangdi/Desktop/pio-bench/results/result"

<ap_module>
    ModuleName "Simple Strided (read)"
    ModuleReps 3
    ModuleSettleTime 5
</ap_module>

<ap_module>
    ModuleName "Nested Strided (write)"
    ModuleReps 3
    ModuleSettleTime 5
</ap_module>

<ap_module>
    ModuleName "Simple Strided (read-modify-write)"
    ModuleReps 3
    ModuleSettleTime 5
</ap_module>

<ap_module>
    ModuleName "Nested Strided (re-read)"
    ModuleReps 3
    ModuleSettleTime 5
</ap_module>

Also, mpich2 does not pass all of its tests. I have tried starting mpd as a normal user and as root, and starting the PVFS server instead; the results are the same.

root at WANGDI:/home/gxwangdi/Downloads/mpich2-1.0.8# make testing
(cd test && make testing)
make[1]: Entering directory `/home/gxwangdi/Downloads/mpich2-1.0.8/test'
(NOXMLCLOSE=YES && export NOXMLCLOSE && cd mpi && make testing)
make[2]: Entering directory `/home/gxwangdi/Downloads/mpich2-1.0.8/test/mpi'
./runtests -srcdir=. -tests=testlist \
           -mpiexec=/usr/local/bin/mpiexec \
           -xmlfile=summary.xml
Looking in ./testlist
Processing directory attr
Looking in ./attr/testlist
Processing directory coll
Looking in ./coll/testlist
Processing directory comm
Looking in ./comm/testlist
Processing directory datatype
Looking in ./datatype/testlist
Processing directory errhan
Looking in ./errhan/testlist
Processing directory group
Looking in ./group/testlist
Processing directory info
Looking in ./info/testlist
Processing directory init
Looking in ./init/testlist
Processing directory pt2pt
Looking in ./pt2pt/testlist
Processing directory rma
Looking in ./rma/testlist
Processing directory spawn
Looking in ./spawn/testlist
Processing directory topo
Looking in ./topo/testlist
Processing directory perf
Looking in ./perf/testlist
Processing directory io
Looking in ./io/testlist
Processing directory cxx
Looking in ./cxx/testlist
Processing directory attr
Looking in ./cxx/attr/testlist
Processing directory pt2pt
Looking in ./cxx/pt2pt/testlist
Failed to build bsend1cxx; make[3]: Entering directory `/home/gxwangdi/Downloads/mpich2-1.0.8/test/mpi/cxx/pt2pt'
/usr/local/bin/mpicxx -DHAVE_CONFIG_H -I. -I. -I../../include -I./../../include -c bsend1cxx.cxx
bsend1cxx.cxx: In function ‘int main(int, char**)’:
bsend1cxx.cxx:81: error: ‘strcmp’ was not declared in this scope
bsend1cxx.cxx:91: error: ‘strcmp’ was not declared in this scope
make[3]: *** [bsend1cxx.o] Error 1
make[3]: Leaving directory `/home/gxwangdi/Downloads/mpich2-1.0.8/test/mpi/cxx/pt2pt'

Processing directory comm
Looking in ./cxx/comm/testlist
Processing directory coll
Looking in ./cxx/coll/testlist
Processing directory init
Looking in ./cxx/init/testlist
Processing directory info
Looking in ./cxx/info/testlist
Processing directory datatype
Looking in ./cxx/datatype/testlist
Processing directory io
Looking in ./cxx/io/testlist
Processing directory spawn
Looking in ./cxx/spawn/testlist
Processing directory rma
Looking in ./cxx/rma/testlist
Processing directory errors
Looking in ./errors/testlist
Processing directory attr
Looking in ./errors/attr/testlist
Processing directory coll
Looking in ./errors/coll/testlist
Processing directory comm
Looking in ./errors/comm/testlist
Processing directory group
Looking in ./errors/group/testlist
Processing directory pt2pt
Looking in ./errors/pt2pt/testlist
Processing directory topo
Looking in ./errors/topo/testlist
Processing directory rma
Looking in ./errors/rma/testlist
Processing directory spawn
Looking in ./errors/spawn/testlist
Processing directory io
Looking in ./errors/io/testlist
Processing directory cxx
Looking in ./errors/cxx/testlist
Processing directory errhan
Looking in ./errors/cxx/errhan/testlist
Processing directory io
Looking in ./errors/cxx/io/testlist
Processing directory threads
Looking in ./threads/testlist
Processing directory pt2pt
Looking in ./threads/pt2pt/testlist
Processing directory comm
Looking in ./threads/comm/testlist
1 tests failed out of 385
Details in /home/gxwangdi/Downloads/mpich2-1.0.8/test/mpi/summary.xml
make[2]: Leaving directory `/home/gxwangdi/Downloads/mpich2-1.0.8/test/mpi'
(XMLFILE=../mpi/summary.xml && XMLCONTINUE=YES && \
    export XMLFILE && export XMLCONTINUE && \
    cd commands && make testing)
make[2]: Entering directory `/home/gxwangdi/Downloads/mpich2-1.0.8/test/commands'
make[2]: Nothing to be done for `testing'.
make[2]: Leaving directory `/home/gxwangdi/Downloads/mpich2-1.0.8/test/commands'
make[1]: Leaving directory `/home/gxwangdi/Downloads/mpich2-1.0.8/test'

I cannot figure out the problem from the summary.xml. My configure.log, config.log, make.log, install.log and summary.xml are attached.
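
From the log above, the single failure seems to be the bsend1cxx build error ('strcmp' was not declared in this scope), which looks like a missing <cstring>/<string.h> include in bsend1cxx.cxx rather than a problem with the MPI installation itself. If that guess is right, something like the following should let that one test build (untested on my side):

cd /home/gxwangdi/Downloads/mpich2-1.0.8/test/mpi/cxx/pt2pt
# prepend the header that declares strcmp (GNU sed: insert before line 1)
sed -i '1i #include <cstring>' bsend1cxx.cxx
cd ../../../..        # back to the mpich2-1.0.8 top-level directory
make testing          # rerun the test suite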



--- On Friday, March 27, 2009, Rajeev Thakur <thakur at mcs.anl.gov> wrote:
From: Rajeev Thakur <thakur at mcs.anl.gov>
Subject: RE: [mpich2-dev] MPI_COMM_WORLD failed when using pio-bench on PVFS
To: "'汪迪'" <otheryou at yahoo.cn>
Cc: "'MPICH2-developer mailling-list'" <mpich2-dev at mcs.anl.gov>
Date: Friday, March 27, 2009, 2:51 AM



 
What do you mean by "I configure PVFS system server on my own laptop, which uses MPICH2 as its trove storage system implementation"? You have installed MPICH2 and PVFS as two separate components, right? Does pio-bench work if you make it access a local file directly via the Linux file system?

Rajeev


  
  
From: 汪迪 [mailto:otheryou at yahoo.cn]
Sent: Thursday, March 26, 2009 12:05 AM
To: Rajeev Thakur
Cc: MPICH2-developer mailling-list
Subject: RE: [mpich2-dev] MPI_COMM_WORLD failed when using pio-bench on PVFS

Hi, Rajeev

Other MPI programs run correctly, including cpi under the examples folder.

gxwangdi at WANGDI ~/D/m/examples> pwd
~/Downloads/mpich2-1.0.8/examples
gxwangdi at WANGDI ~/D/m/examples> mpd &
gxwangdi at WANGDI ~/D/m/examples> mpdtrace
WANGDI
gxwangdi at WANGDI ~/D/m/examples> which mpiexec
/usr/local/bin/mpiexec
gxwangdi at WANGDI ~/D/m/examples> mpiexec -n 10 ./cpi
Process 2 of 10 is on WANGDI
Process 1 of 10 is on WANGDI
Process 0 of 10 is on WANGDI
Process 6 of 10 is on WANGDI
Process 4 of 10 is on WANGDI
Process 9 of 10 is on WANGDI
Process 3 of 10 is on WANGDI
Process 7 of 10 is on WANGDI
Process 8 of 10 is on WANGDI
Process 5 of 10 is on WANGDI
pi is approximately 3.1415926544231256, Error is 0.0000000008333325
wall clock time = 0.012229

Thanks for your suggestion; I ran make testing in the top directory of mpich2. There is one test that does not pass:
gxwangdi at WANGDI ~/D/mpich2-1.0.8> mpd &
gxwangdi at WANGDI ~/D/mpich2-1.0.8> make testing
(cd test && make testing)
make[1]: Entering directory `/home/gxwangdi/Downloads/mpich2-1.0.8/test'
(NOXMLCLOSE=YES && export NOXMLCLOSE && cd mpi && make testing)
make[2]: Entering directory `/home/gxwangdi/Downloads/mpich2-1.0.8/test/mpi'
./runtests -srcdir=. -tests=testlist \
           -mpiexec=/usr/local/bin/mpiexec \
           -xmlfile=summary.xml
Looking in ./testlist
Processing directory attr
Looking in ./attr/testlist
Processing directory coll
Looking in ./coll/testlist
Processing directory comm
Looking in ./comm/testlist
Some programs (cmsplit) may still be running:
pids = 2279 
The executable (cmsplit) will not be removed.
Processing directory datatype
Looking in ./datatype/testlist
Processing directory errhan
Looking in ./errhan/testlist
Processing directory group
Looking in ./group/testlist
Processing directory info
Looking in ./info/testlist
Processing directory init
Looking in ./init/testlist
Processing directory pt2pt
Looking in ./pt2pt/testlist
Some programs (sendrecv3) may still be running:
pids = 4049 
The executable (sendrecv3) will not be removed.
Processing directory rma
Looking in ./rma/testlist
Some programs (transpose3) may still be running:
pids = 5713 
The executable (transpose3) will not be removed.
Processing directory spawn
Looking in ./spawn/testlist
Processing directory topo
Looking in ./topo/testlist
Processing directory perf
Looking in ./perf/testlist
Processing directory io
Looking in ./io/testlist
Processing directory cxx
Looking in ./cxx/testlist
Processing directory attr
Looking in ./cxx/attr/testlist
Processing directory pt2pt
Looking in ./cxx/pt2pt/testlist
Failed to build bsend1cxx; make[3]: Entering directory `/home/gxwangdi/Downloads/mpich2-1.0.8/test/mpi/cxx/pt2pt'
/usr/local/bin/mpicxx -DHAVE_CONFIG_H -I. -I. -I../../include -I./../../include -c bsend1cxx.cxx
bsend1cxx.cxx: In function ‘int main(int, char**)’:
bsend1cxx.cxx:81: error: ‘strcmp’ was not declared in this scope
bsend1cxx.cxx:91: error: ‘strcmp’ was not declared in this scope
make[3]: *** [bsend1cxx.o] Error 1
make[3]: Leaving directory `/home/gxwangdi/Downloads/mpich2-1.0.8/test/mpi/cxx/pt2pt'

Processing directory comm
Looking in ./cxx/comm/testlist
Processing directory coll
Looking in ./cxx/coll/testlist
Processing directory init
Looking in ./cxx/init/testlist
Processing directory info
Looking in ./cxx/info/testlist
Processing directory datatype
Looking in ./cxx/datatype/testlist
Processing directory io
Looking in ./cxx/io/testlist
Processing directory spawn
Looking in ./cxx/spawn/testlist
Processing directory rma
Looking in ./cxx/rma/testlist
Processing directory errors
Looking in ./errors/testlist
Processing directory attr
Looking in ./errors/attr/testlist
Processing directory coll
Looking in ./errors/coll/testlist
Processing directory comm
Looking in ./errors/comm/testlist
Processing directory group
Looking in ./errors/group/testlist
Processing directory pt2pt
Looking in ./errors/pt2pt/testlist
Processing directory topo
Looking in ./errors/topo/testlist
Processing directory rma
Looking in ./errors/rma/testlist
Processing directory spawn
Looking in ./errors/spawn/testlist
Processing directory io
Looking in ./errors/io/testlist
Processing directory cxx
Looking in ./errors/cxx/testlist
Processing directory errhan
Looking in ./errors/cxx/errhan/testlist
Processing directory io
Looking in ./errors/cxx/io/testlist
Processing directory threads
Looking in ./threads/testlist
Processing directory pt2pt
Looking in ./threads/pt2pt/testlist
Processing directory comm
Looking in ./threads/comm/testlist
1 tests failed out of 385
Details in /home/gxwangdi/Downloads/mpich2-1.0.8/test/mpi/summary.xml
make[2]: Leaving directory `/home/gxwangdi/Downloads/mpich2-1.0.8/test/mpi'
(XMLFILE=../mpi/summary.xml && XMLCONTINUE=YES && \
    export XMLFILE && export XMLCONTINUE && \
    cd commands && make testing)
make[2]: Entering directory `/home/gxwangdi/Downloads/mpich2-1.0.8/test/commands'
make[2]: Nothing to be done for `testing'.
make[2]: Leaving directory `/home/gxwangdi/Downloads/mpich2-1.0.8/test/commands'
make[1]: Leaving directory `/home/gxwangdi/Downloads/mpich2-1.0.8/test'

The attachment is the /home/gxwangdi/Downloads/mpich2-1.0.8/test/mpi/summary.xml file; it is too long to paste all of its content here. I cannot understand what the problem is yet.

From: Rajeev Thakur <thakur at mcs.anl.gov>
Subject: RE: [mpich2-dev] MPI_COMM_WORLD failed when using pio-bench on PVFS
To: otheryou at yahoo.cn, mpich2-dev at mcs.anl.gov
Date: Thursday, March 26, 2009, 2:41


          
Do the other MPICH2 tests run, such as the cpi example in the examples directory? If you run "make testing" in the top-level mpich2 directory, it will run the entire test suite in test/mpi (it can take more than an hour).

Rajeev

          
            
            
From: mpich2-dev-bounces at mcs.anl.gov [mailto:mpich2-dev-bounces at mcs.anl.gov] On Behalf Of 汪迪
Sent: Wednesday, March 25, 2009 12:11 AM
To: MPICH2-developer mailling-list
Subject: [mpich2-dev] MPI_COMM_WORLD failed when using pio-bench on PVFS


            
            
              
              
Hi,

I have configured a PVFS system server on my own laptop, which uses MPICH2 as its trove storage system implementation. Now I intend to use pio-bench to get a trace of the server. It works when I set the number of processes to 1, but when I set the number of processes to more than 1, it fails every time. It reports that MPI_COMM_WORLD failed, yet running the hostname command under MPI works, which is peculiar. I checked the MPICH2 documentation and found no particular configuration for a single-host MPI deployment. Am I doing anything wrong?


By the way, my laptop is an IBM i386-architecture machine with an Intel Centrino 2 vPro CPU, running Ubuntu 8.10. MPICH2 1.0.8 is installed by default under /usr/local, and pvfs-2.7.1 is under /root/pvfs-install/.

gxwangdi at WANGDI:~/Desktop/pio-bench$ sudo mpiexec -n 1 ./pio-bench
[sudo] password for gxwangdi:
File under test: /mnt/pvfs2/ftpaccess
Number of Processes: 1
Sync: off
Averaging: Off
the nested strided pattern needs to be run with an even amount of processes
file pio-bench.c, line 586: access pattern initialization error: -1

gxwangdi at WANGDI:~/Desktop/pio-bench$ sudo mpiexec -n 4 ./pio-bench
Fatal error in MPI_Barrier: Other MPI error, error stack:
MPI_Barrier(406)..........................: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier(77)..........................: 
MPIC_Sendrecv(126)........................: 
MPIC_Wait(270)............................: 
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(420): 
MPIDU_Socki_handle_read(637)..............: connection failure (set=0,sock=1,errno=104:Connection reset by peer)[cli_0]: aborting job:
Fatal error in MPI_Barrier: Other MPI error, error stack:
MPI_Barrier(406)..........................: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier(77)..........................: 
MPIC_Sendrecv(126)........................: 
MPIC_Wait(270)............................: 
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(420): 
MPIDU_Socki_handle_readFatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)............................: MPI_Bcast(buf=0x1fd6ca78, count=20, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(198)...........................: 
MPIC_Recv(81).............................: 
MPIC_Wait(270)............................: 
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(456): 
adjust_iov(973)...........................: ch3|sock|immedread 0x1e5a0d60 0x1f329978 0x1f3258d0
MPIDU_Sock_readv(455).....................: the supplied buffer contains invalid memory (set=0,sock=1,errno=14:Bad address)[cli_1]: aborting job:
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)............................: MPI_Bcast(buf=0x1fd6ca78, count=20, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(198)...........................: 
MPIC_Recv(81).............................: 
MPIC_Wait(270)............................(637)..............: connection failure (set=0,sock=1,errno=104:Connection reset by peer)
: 
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(456): 
adjust_iov(973)...........................: ch3|sock|immedread 0x1e5a0d60 0x1f329978 0x1f3258d0
MPIDU_Sock_readv(455).....................: the supplied buffer contains invalid memory (set=0,sock=1,errno=14:Bad address)
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)............................: MPI_Bcast(buf=0x1fd6ca78, count=20, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(198)...........................: 
MPIC_Recv(81).............................: 
MPIC_Wait(270)............................: 
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(456): 
adjust_iov(973)...........................: ch3|sock|immedread 0x1e5a0d60 0x1eb9f978 0x1eb9b8d0
MPIDU_Sock_readv(455).....................: the supplied buffer contains invalid memory (set=0,sock=1,errno=14:Bad address)[cli_2]: aborting job:
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)............................: MPI_Bcast(buf=0x1fd6ca78, count=20, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(198)...........................: 
MPIC_Recv(81).............................: 
MPIC_Wait(270)............................: 
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(456): 
adjust_iov(973)...........................: ch3|sock|immedread 0x1e5a0d60 0x1eb9f978 0x1eb9b8d0
MPIDU_Sock_readv(455).....................: the supplied buffer contains invalid memory (set=0,sock=1,errno=14:Bad address)
rank 1 in job 9  WANGDI_59039   caused collective abort of all ranks
  exit status of rank 1: return code 1
rank 0 in job 9  WANGDI_59039   caused collective abort of all ranks
  exit status of rank 0: return code 1
gxwangdi at WANGDI:~/Desktop/pio-bench$ sudo mpiexec -n 4 hostname
WANGDI
WANGDI
WANGDI
WANGDI


And my pio-bench.conf file is like this:

Testfile "/mnt/pvfs2/ftpaccess"

OutputToFile "/home/gxwangdi/Desktop/pio-bench/results/result"

<ap_module>
ModuleName "Nested Strided (read)"
ModuleReps 3
ModuleSettleTime 5
</ap_module>

<ap_module>
ModuleName "Nested Strided (write)"
ModuleReps 3
ModuleSettleTime 5
</ap_module>

<ap_module>
ModuleName "Nested Strided (read-modify-write)"
ModuleReps 3
ModuleSettleTime 5
</ap_module>

<ap_module>
ModuleName "Nested Strided (re-read)"
ModuleReps 3
ModuleSettleTime 5
</ap_module>

<ap_module>
ModuleName "Nested Strided (re-write)"
ModuleReps 3
ModuleSettleTime 5
</ap_module>

I also tried another test file that is not under /mnt/pvfs2; it reports the following:

gxwangdi at WANGDI:~/Desktop/pio-bench$ sudo mpiexec -n 4 ./pio-bench
Fatal error in MPI_Barrier: Other MPI error, error stack:
MPI_Barrier(406)..........................: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier(77)..........................: 
MPIC_Sendrecv(126)........................: 
MPIC_Wait(270)............................: 
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(420): 
MPIDU_Socki_handle_read(637)..............: connection failure (set=0,sock=1,errno=104:Connection reset by peer)[cli_0]: aborting job:
Fatal error in MPI_Barrier: Other MPI error, error stack:
MPI_Barrier(406)..........................: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier(77)..........................: 
MPIC_Sendrecv(126)........................: 
MPIC_Wait(270)............................: 
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(420): 
MPIDU_Socki_handle_readFatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)............................: MPI_Bcast(buf=0x1f60fad0, count=20, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(198)...........................: 
MPIC_Recv(81).............................: 
MPIC_Wait(270)............................: 
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(456): 
adjust_iov(973)...........................: ch3|sock|immedread 0x1e5a0d60 0x1f525978 0x1f5218d0
MPIDU_Sock_readv(455).....................: the supplied buffer contains invalid memory (set=0,sock=1,errno=14:Bad address)[cli_2]: aborting job:
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)............................: MPI_Bcast(buf=0x1f60fad0, count=20, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(198)...........................: 
MPIC_Recv(81).............................: 
MPIC_Wait(270)............................(637)..............: connection failure (set=0,sock=1,errno=104:Connection reset by peer)
: 
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(456): 
adjust_iov(973)...........................: ch3|sock|immedread 0x1e5a0d60 0x1f525978 0x1f5218d0
MPIDU_Sock_readv(455).....................: the supplied buffer contains invalid memory (set=0,sock=1,errno=14:Bad address)
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)............................: MPI_Bcast(buf=0x1f60fad0, count=20, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(198)...........................: 
MPIC_Recv(81).............................: 
MPIC_Wait(270)............................: 
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(456): 
adjust_iov(973)...........................: ch3|sock|immedread 0x1e5a0d60 0x202d2978 0x202ce8d0
MPIDU_Sock_readv(455).....................: the supplied buffer contains invalid memory (set=0,sock=1,errno=14:Bad address)[cli_1]: aborting job:
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)............................: MPI_Bcast(buf=0x1f60fad0, count=20, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(198)...........................: 
MPIC_Recv(81).............................: 
MPIC_Wait(270)............................: 
MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(456): 
adjust_iov(973)...........................: ch3|sock|immedread 0x1e5a0d60 0x202d2978 0x202ce8d0
MPIDU_Sock_readv(455).....................: the supplied buffer contains invalid memory (set=0,sock=1,errno=14:Bad address)
rank 2 in job 11  WANGDI_59039   caused collective abort of all ranks
  exit status of rank 2: return code 1
rank 1 in job 11  WANGDI_59039   caused collective abort of all ranks
  exit status of rank 1: return code 1
rank 0 in job 11  WANGDI_59039   caused collective abort of all ranks
  exit status of rank 0: return code 1


MPI_COMM_WORLD failed again, but this time it caused a collective abort of all ranks at the end, which is a little different. Since there is no syslog for pio-bench to check, I do not understand what is happening and I cannot solve this problem.

I appreciate your responses.



            
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.mcs.anl.gov/mailman/private/mpich2-dev/attachments/20090401/a9e936fe/attachment-0001.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: config.log
Type: text/x-log
Size: 167848 bytes
Desc: not available
URL: <https://lists.mcs.anl.gov/mailman/private/mpich2-dev/attachments/20090401/a9e936fe/attachment-0005.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: configure.log
Type: text/x-log
Size: 72093 bytes
Desc: not available
URL: <https://lists.mcs.anl.gov/mailman/private/mpich2-dev/attachments/20090401/a9e936fe/attachment-0006.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: install.log
Type: text/x-log
Size: 92450 bytes
Desc: not available
URL: <https://lists.mcs.anl.gov/mailman/private/mpich2-dev/attachments/20090401/a9e936fe/attachment-0007.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: make.log
Type: text/x-log
Size: 568058 bytes
Desc: not available
URL: <https://lists.mcs.anl.gov/mailman/private/mpich2-dev/attachments/20090401/a9e936fe/attachment-0008.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: summary.xml
Type: text/xml
Size: 43399 bytes
Desc: not available
URL: <https://lists.mcs.anl.gov/mailman/private/mpich2-dev/attachments/20090401/a9e936fe/attachment-0009.bin>

