[mpich2-dev] MPI_COMM_WORLD failed when using pio-bench on PVFS

汪迪 otheryou at yahoo.cn
Wed Mar 25 00:11:15 CDT 2009


Hi, 

I have set up a PVFS server on my own laptop, which uses MPICH2 as its trove storage system implementation. Now I want to use pio-bench to get a trace of the server. It works when I set the number of processes to 1, but whenever I run with more than one process it fails, reporting that MPI_COMM_WORLD failed. Yet running the hostname command through MPI works, which is peculiar. I have checked the MPICH2 documentation and found no special configuration required for a single-host MPI deployment. Am I doing something wrong?


By the way, my laptop is an IBM i386 machine with an Intel Centrino 2 vPro CPU, running Ubuntu 8.10. MPICH2 1.0.8 is installed in the default location under /usr/local, and PVFS 2.7.1 is under /root/pvfs-install/.
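In case it helps, the basic sanity checks on this setup would be something like the following (generic commands only, nothing pio-bench specific; /mnt/pvfs2 and the process count are just the values used below):

mount | grep pvfs2        # confirm /mnt/pvfs2 is mounted as a PVFS2 volume
mpdtrace -l               # MPICH2 1.0.8 uses MPD; list the ring members
mpiexec -n 4 hostname     # plain MPI startup check (output shown further down)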

gxwangdi at WANGDI:~/Desktop/pio-bench$ sudo mpiexec -n 1 ./pio-bench
[sudo] password for gxwangdi:
File under test: /mnt/pvfs2/ftpaccess
Number of Processes: 1
Sync: off
Averaging: Off
the nested strided pattern needs to be run with an even amount of processes
file pio-bench.c, line 586: access pattern initialization error: -1

gxwangdi at WANGDI:~/Desktop/pio-bench$ sudo mpiexec -n 4 ./pio-bench
Fatal error in MPI_Barrier: Other MPI error, error stack:
MPI_Barrier(406)..........................: MPI_Barrier(MPI_COMM_WORLD)
failed
MPIR_Barrier(77)..........................:
MPIC_Sendrecv(126)........................:
MPIC_Wait(270)............................:
MPIDI_CH3i_Progress_wait(215).............: an error occurred while
handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(420):
MPIDU_Socki_handle_read(637)..............: connection failure
(set=0,sock=1,errno=104:Connection reset by peer)[cli_0]: aborting job:
Fatal error in MPI_Barrier: Other MPI error, error stack:
MPI_Barrier(406)..........................: MPI_Barrier(MPI_COMM_WORLD)
failed
MPIR_Barrier(77)..........................:
MPIC_Sendrecv(126)........................:
MPIC_Wait(270)............................:
MPIDI_CH3i_Progress_wait(215).............: an error occurred while
handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(420):
MPIDU_Socki_handle_readFatal error in MPI_Bcast: Other MPI error, error
stack:
MPI_Bcast(786)............................: MPI_Bcast(buf=0x1fd6ca78,
count=20, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(198)...........................:
MPIC_Recv(81).............................:
MPIC_Wait(270)............................:
MPIDI_CH3i_Progress_wait(215).............: an error occurred while
handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(456):
adjust_iov(973)...........................: ch3|sock|immedread
0x1e5a0d60 0x1f329978 0x1f3258d0
MPIDU_Sock_readv(455).....................: the supplied buffer contains
invalid memory (set=0,sock=1,errno=14:Bad address)[cli_1]: aborting job:
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)............................: MPI_Bcast(buf=0x1fd6ca78,
count=20, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(198)...........................:
MPIC_Recv(81).............................:
MPIC_Wait(270)............................(637)..............:
connection failure (set=0,sock=1,errno=104:Connection reset by peer)
:
MPIDI_CH3i_Progress_wait(215).............: an error occurred while
handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(456):
adjust_iov(973)...........................: ch3|sock|immedread
0x1e5a0d60 0x1f329978 0x1f3258d0
MPIDU_Sock_readv(455).....................: the supplied buffer contains
invalid memory (set=0,sock=1,errno=14:Bad address)
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)............................: MPI_Bcast(buf=0x1fd6ca78,
count=20, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(198)...........................:
MPIC_Recv(81).............................:
MPIC_Wait(270)............................:
MPIDI_CH3i_Progress_wait(215).............: an error occurred while
handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(456):
adjust_iov(973)...........................: ch3|sock|immedread
0x1e5a0d60 0x1eb9f978 0x1eb9b8d0
MPIDU_Sock_readv(455).....................: the supplied buffer contains
invalid memory (set=0,sock=1,errno=14:Bad address)[cli_2]: aborting job:
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)........................so....: MPI_Bcast(buf=0x1fd6ca78,
count=20, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(198)...........................:
MPIC_Recv(81).............................:
MPIC_Wait(270)............................:
MPIDI_CH3i_Progress_wait(215).............: an error occurred while
handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(456):
adjust_iov(973)...........................: ch3|sock|immedread
0x1e5a0d60 0x1eb9f978 0x1eb9b8d0
MPIDU_Sock_readv(455).....................: the supplied buffer contains
invalid memory (set=0,sock=1,errno=14:Bad address)
rank 1 in job 9  WANGDI_59039   caused collective abort of all ranks
  exit status of rank 1: return code 1
rank 0 in job 9  WANGDI_59039   caused collective abort of all ranks
  exit status of rank 0: return code 1
gxwangdi at WANGDI:~/Desktop/pio-bench$ sudo mpiexec -n 4 hostname
WANGDI
WANGDI
WANGDI
WANGDI
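
Since plain hostname works but the collectives fail, a minimal program that exercises only MPI_Barrier and MPI_Bcast on MPI_COMM_WORLD (outside of pio-bench) might show whether MPICH2 itself is at fault on this host. Something along these lines is what I have in mind (just a sketch, not part of pio-bench; test_coll.c is only an example name):

/* test_coll.c: minimal sketch, not from pio-bench, exercising the same
   collectives that fail above (MPI_Barrier and MPI_Bcast on MPI_COMM_WORLD). */
#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;
    char buf[20];                    /* same count=20 MPI_BYTE as the failing Bcast */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Barrier(MPI_COMM_WORLD);     /* the call that fails first in pio-bench */

    memset(buf, 0, sizeof(buf));
    if (rank == 0)
        strcpy(buf, "hello");
    MPI_Bcast(buf, 20, MPI_BYTE, 0, MPI_COMM_WORLD);

    printf("rank %d of %d got '%s'\n", rank, size, buf);
    MPI_Finalize();
    return 0;
}

Compiled and run with, for example:

mpicc test_coll.c -o test_coll
mpiexec -n 4 ./test_coll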


My pio-bench.conf file looks like this:

Testfile "/mnt/pvfs2/ftpaccess"

OutputToFile "/home/gxwangdi/Desktop/pio-bench/results/result"

<ap_module>
ModuleName "Nested Strided (read)"
ModuleReps 3
ModuleSettleTime 5
</ap_module>

<ap_module>
ModuleName "Nested Strided (write)"
ModuleReps 3
ModuleSettleTime 5
</ap_module>

<ap_module>
ModuleName "Nested Strided (read-modify-write)"
ModuleReps 3
ModuleSettleTime 5
</ap_module>

<ap_module>
ModuleName "Nested Strided (re-read)"
ModuleReps 3
ModuleSettleTime 5
</ap_module>

<ap_module>
ModuleName "Nested Strided (re-write)"
ModuleReps 3
ModuleSettleTime 5
</ap_module>

I also tried another test file that is not under /mnt/pvfs2, and it reports the following:

gxwangdi at WANGDI:~/Desktop/pio-bench$ sudo mpiexec -n 4 ./pio-bench
Fatal error in MPI_Barrier: Other MPI error, error stack:
MPI_Barrier(406)..........................: MPI_Barrier(MPI_COMM_WORLD)
failed
MPIR_Barrier(77)..........................:
MPIC_Sendrecv(126)........................:
MPIC_Wait(270)............................:
MPIDI_CH3i_Progress_wait(215).............: an error occurred while
handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(420):
MPIDU_Socki_handle_read(637)..............: connection failure
(set=0,sock=1,errno=104:Connection reset by peer)[cli_0]: aborting job:
Fatal error in MPI_Barrier: Other MPI error, error stack:
MPI_Barrier(406)..........................: MPI_Barrier(MPI_COMM_WORLD)
failed
MPIR_Barrier(77)..........................:
MPIC_Sendrecv(126)........................:
MPIC_Wait(270)............................:
MPIDI_CH3i_Progress_wait(215).............: an error occurred while
handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(420):
MPIDU_Socki_handle_readFatal error in MPI_Bcast: Other MPI error, error
stack:
MPI_Bcast(786)............................: MPI_Bcast(buf=0x1f60fad0,
count=20, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(198)...........................:
MPIC_Recv(81).............................:
MPIC_Wait(270)............................:
MPIDI_CH3i_Progress_wait(215).............: an error occurred while
handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(456):
adjust_iov(973)...........................: ch3|sock|immedread
0x1e5a0d60 0x1f525978 0x1f5218d0
MPIDU_Sock_readv(455).....................: the supplied buffer contains
invalid memory (set=0,sock=1,errno=14:Bad address)[cli_2]: aborting job:
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)............................: MPI_Bcast(buf=0x1f60fad0,
count=20, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(198)...........................:
MPIC_Recv(81).............................:
MPIC_Wait(270)............................(637)..............:
connection failure (set=0,sock=1,errno=104:Connection reset by peer)
:
MPIDI_CH3i_Progress_wait(215).............: an error occurred while
handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(456):
adjust_iov(973)...........................: ch3|sock|immedread
0x1e5a0d60 0x1f525978 0x1f5218d0
MPIDU_Sock_readv(455).....................: the supplied buffer contains
invalid memory (set=0,sock=1,errno=14:Bad address)
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)............................: MPI_Bcast(buf=0x1f60fad0,
count=20, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(198)...........................:
MPIC_Recv(81).............................:
MPIC_Wait(270)............................:
MPIDI_CH3i_Progress_wait(215).............: an error occurred while
handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(456):
adjust_iov(973)...........................: ch3|sock|immedread
0x1e5a0d60 0x202d2978 0x202ce8d0
MPIDU_Sock_readv(455).....................: the supplied buffer contains
invalid memory (set=0,sock=1,errno=14:Bad address)[cli_1]: aborting job:
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)............................: MPI_Bcast(buf=0x1f60fad0,
count=20, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(198)...........................:
MPIC_Recv(81).............................:
MPIC_Wait(270)............................:
MPIDI_CH3i_Progress_wait(215).............: an error occurred while
handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(456):
adjust_iov(973)...........................: ch3|sock|immedread
0x1e5a0d60 0x202d2978 0x202ce8d0
MPIDU_Sock_readv(455).....................: the supplied buffer contains
invalid memory (set=0,sock=1,errno=14:Bad address)
rank 2 in job 11  WANGDI_59039   caused collective abort of all ranks
  exit status of rank 2: return code 1
rank 1 in job 11  WANGDI_59039   caused collective abort of all ranks
  exit status of rank 1: return code 1
rank 0 in job 11  WANGDI_59039   caused collective abort of all ranks
  exit status of rank 0: return code 1


MPI_COMM_WORLD failed again, and this time it ended with a collective abort of all ranks, which is slightly different from the first run. Since pio-bench leaves no syslog for me to check, I do not understand what is happening and cannot solve this problem.

I would appreciate any responses.






More information about the mpich2-dev mailing list