[mpich2-dev] MPI_COMM_WORLD failed when using pio-bench on PVFS

Rajeev Thakur thakur at mcs.anl.gov
Mon Apr 13 11:33:49 CDT 2009


There could well be a bug in the benchmark. I haven't looked at the
benchmark myself, so I can't help you with the debugging.
 
Rajeev


  _____  

From: mpich2-dev-bounces at mcs.anl.gov [mailto:mpich2-dev-bounces at mcs.anl.gov]
On Behalf Of 汪迪
Sent: Monday, April 13, 2009 10:47 AM
To: otheryou at yahoo.cn
Cc: 'MPICH2-developer mailling-list'; sbyna at iit.edu
Subject: Re: [mpich2-dev] MPI_COMM_WORLD failed when using pio-bench on PVFS


Sorry, I made a mistake in my previous email. The correct address is
http://newsgroups.derkeiler.com/Archive/Comp/comp.parallel.mpi/2006-04/msg00044.html

--- On Mon, 4/13/09, 汪迪 <otheryou at yahoo.cn> wrote:



From: 汪迪 <otheryou at yahoo.cn>
Subject: RE: [mpich2-dev] MPI_COMM_WORLD failed when using pio-bench on PVFS
To: "Rajeev Thakur" <thakur at mcs.anl.gov>
Cc: "'MPICH2-developer mailling-list'" <mpich2-dev at mcs.anl.gov>
Date: Monday, April 13, 2009, 11:20 PM


Hi Rajeev,

It is not a problem with my MPI installation; it is a problem with pio-bench.
In fact I found someone who has the same problem here:
http://newsgroups.derkeiler.com/Archive/Comp/comp.parallel.mpi/2006-04/msg00046.html.
I tried his source code on my own machine, and the other MPI error occurs
when running as root (when not running as root, it did not print the error
detail).

I think it is due to an incompatibility in the semantics of dynamically
allocated addresses. The gcc I use is (Ubuntu 4.3.2-1ubuntu12) 4.3.2. If you
can run pio-bench without any problem, I would like to know which version of
gcc you use. Thanks.

By the way, I checked the pio-bench source code, and the only statement that
causes this problem is:

 MPI_Bcast( qlist_entry(p, ap_module, link), sizeof(ap_module), MPI_BYTE, 0,
MPI_COMM_WORLD);

in the main function of pio-bench.c. Can you give me any suggestion on
modifying this code? I might as well consider modifying the pio-bench source
code if changing the gcc does not work.

--- On Fri, 3/27/09, Rajeev Thakur <thakur at mcs.anl.gov> wrote:



From: Rajeev Thakur <thakur at mcs.anl.gov>
Subject: RE: [mpich2-dev] MPI_COMM_WORLD failed when using pio-bench on PVFS
To: "'汪迪'" <otheryou at yahoo.cn>
Cc: "'MPICH2-developer mailling-list'" <mpich2-dev at mcs.anl.gov>
Date: Friday, March 27, 2009, 2:51 AM


What do you mean by "I configure PVFS system server on my own laptop, which
uses MPICH2 as its trove storage system implementation"? You have installed
MPICH2 and PVFS as two separate components, right? Does pio-bench work if
you make it access a local file directly via the Linux file system?
 
Rajeev 
 
 



  _____  

From: 汪迪 [mailto:otheryou at yahoo.cn] 
Sent: Thursday, March 26, 2009 12:05 AM
To: Rajeev Thakur
Cc: MPICH2-developer mailling-list
Subject: RE: [mpich2-dev] MPI_COMM_WORLD failed when using pio-bench on PVFS




Hi Rajeev,

Other MPI programs run correctly, including cpi under the examples folder.

gxwangdi at WANGDI ~/D/m/examples> pwd
~/Downloads/mpich2-1.0.8/examples
gxwangdi at WANGDI ~/D/m/examples> mpd &
gxwangdi at WANGDI ~/D/m/examples> mpdtrace
WANGDI
gxwangdi at WANGDI ~/D/m/examples> which mpiexec
/usr/local/bin/mpiexec
gxwangdi at WANGDI ~/D/m/examples> mpiexec -n 10 ./cpi
Process 2 of 10 is on WANGDI
Process 1 of 10 is on WANGDI
Process 0 of 10 is on WANGDI
Process 6 of 10 is on WANGDI
Process 4 of 10 is on WANGDI
Process 9 of 10 is on WANGDI
Process 3 of 10 is on WANGDI
Process 7 of 10 is on WANGDI
Process 8 of 10 is on WANGDI
Process 5 of 10 is on WANGDI
pi is approximately 3.1415926544231256, Error is 0.0000000008333325
wall clock time = 0.012229

Thanks for your suggestion; I ran make testing in the top directory of
mpich2. One test does not pass:
gxwangdi at WANGDI ~/D/mpich2-1.0.8> mpd &
gxwangdi at WANGDI ~/D/mpich2-1.0.8> make testing
(cd test && make testing)
make[1]: Entering directory `/home/gxwangdi/Downloads/mpich2-1.0.8/test'
(NOXMLCLOSE=YES && export NOXMLCLOSE && cd mpi && make testing)
make[2]: Entering directory `/home/gxwangdi/Downloads/mpich2-1.0.8/test/mpi'
./runtests -srcdir=. -tests=testlist \
           -mpiexec=/usr/local/bin/mpiexec \
           -xmlfile=summary.xml
Looking in ./testlist
Processing directory attr
Looking in ./attr/testlist
Processing directory coll
Looking in ./coll/testlist
Processing directory comm
Looking in ./comm/testlist
Some programs (cmsplit) may still be running:
pids = 2279 
The executable (cmsplit) will not be removed.
Processing directory datatype
Looking in ./datatype/testlist
Processing directory errhan
Looking in ./errhan/testlist
Processing directory group
Looking in ./group/testlist
Processing directory info
Looking in ./info/testlist
Processing directory init
Looking in ./init/testlist
Processing directory pt2pt
Looking in ./pt2pt/testlist
Some programs (sendrecv3) may still be running:
pids = 4049 
The executable (sendrecv3) will not be removed.
Processing directory rma
Looking in ./rma/testlist
Some programs (transpose3) may still be running:
pids = 5713 
The executable (transpose3) will not be removed.
Processing directory spawn
Looking in ./spawn/testlist
Processing directory topo
Looking in ./topo/testlist
Processing directory perf
Looking in ./perf/testlist
Processing directory io
Looking in ./io/testlist
Processing directory cxx
Looking in ./cxx/testlist
Processing directory attr
Looking in ./cxx/attr/testlist
Processing directory pt2pt
Looking in ./cxx/pt2pt/testlist
Failed to build bsend1cxx; make[3]: Entering directory
`/home/gxwangdi/Downloads/mpich2-1.0.8/test/mpi/cxx/pt2pt'
/usr/local/bin/mpicxx -DHAVE_CONFIG_H -I. -I. -I../../include
-I./../../include -c bsend1cxx.cxx
bsend1cxx.cxx: In function ‘int main(int, char**)’:
bsend1cxx.cxx:81: error: ‘strcmp’ was not declared in this scope
bsend1cxx.cxx:91: error: ‘strcmp’ was not declared in this scope
make[3]: *** [bsend1cxx.o] Error 1
make[3]: Leaving directory
`/home/gxwangdi/Downloads/mpich2-1.0.8/test/mpi/cxx/pt2pt'

Processing directory comm
Looking in ./cxx/comm/testlist
Processing directory coll
Looking in ./cxx/coll/testlist
Processing directory init
Looking in ./cxx/init/testlist
Processing directory info
Looking in ./cxx/info/testlist
Processing directory datatype
Looking in ./cxx/datatype/testlist
Processing directory io
Looking in ./cxx/io/testlist
Processing directory spawn
Looking in ./cxx/spawn/testlist
Processing directory rma
Looking in ./cxx/rma/testlist
Processing directory errors
Looking in ./errors/testlist
Processing directory attr
Looking in ./errors/attr/testlist
Processing directory coll
Looking in ./errors/coll/testlist
Processing directory comm
Looking in ./errors/comm/testlist
Processing directory group
Looking in ./errors/group/testlist
Processing directory pt2pt
Looking in ./errors/pt2pt/testlist
Processing directory topo
Looking in ./errors/topo/testlist
Processing directory rma
Looking in ./errors/rma/testlist
Processing directory spawn
Looking in ./errors/spawn/testlist
Processing directory io
Looking in ./errors/io/testlist
Processing directory cxx
Looking in ./errors/cxx/testlist
Processing directory errhan
Looking in ./errors/cxx/errhan/testlist
Processing directory io
Looking in ./errors/cxx/io/testlist
Processing directory threads
Looking in ./threads/testlist
Processing directory pt2pt
Looking in ./threads/pt2pt/testlist
Processing directory comm
Looking in ./threads/comm/testlist
1 tests failed out of 385
Details in /home/gxwangdi/Downloads/mpich2-1.0.8/test/mpi/summary.xml
make[2]: Leaving directory `/home/gxwangdi/Downloads/mpich2-1.0.8/test/mpi'
(XMLFILE=../mpi/summary.xml && XMLCONTINUE=YES && \
    export XMLFILE && export XMLCONTINUE && \
    cd commands && make testing)
make[2]: Entering directory
`/home/gxwangdi/Downloads/mpich2-1.0.8/test/commands'
make[2]: Nothing to be done for `testing'.
make[2]: Leaving directory
`/home/gxwangdi/Downloads/mpich2-1.0.8/test/commands'
make[1]: Leaving directory `/home/gxwangdi/Downloads/mpich2-1.0.8/test'

Attached is the /home/gxwangdi/Downloads/mpich2-1.0.8/test/mpi/summary.xml
file; it is too long to paste all of its content here. I cannot yet
understand what the problem is.



From: Rajeev Thakur <thakur at mcs.anl.gov>
Subject: RE: [mpich2-dev] MPI_COMM_WORLD failed when using pio-bench on PVFS
To: otheryou at yahoo.cn, mpich2-dev at mcs.anl.gov
Date: Thursday, March 26, 2009, 2:41


Do the other MPICH2 tests run, such as the cpi example in the examples
directory? If you run "make testing" in the top-level mpich2 directory, it
will run the entire test suite in test/mpi (this can take more than an hour).
 
Rajeev
 


  _____  

From: mpich2-dev-bounces at mcs.anl.gov [mailto:mpich2-dev-bounces at mcs.anl.gov]
On Behalf Of 汪迪
Sent: Wednesday, March 25, 2009 12:11 AM
To: MPICH2-developer mailling-list
Subject: [mpich2-dev] MPI_COMM_WORLD failed when using pio-bench on PVFS



Hi, 

I have configured a PVFS system server on my own laptop, which uses MPICH2 as
its trove storage system implementation. Now I intend to use pio-bench to get
a trace of the server. It works when I set the number of processes to 1, but
when I set the number of processes to more than 1, it never works. It reports
that MPI_COMM_WORLD failed, yet when I run the hostname command through MPI
it works, which is peculiar. I checked the MPICH2 documentation and found no
particular configuration for a single-host MPI deployment. Am I doing
anything wrong?


By the way, my laptop is an IBM i386-architecture machine with an Intel
Centrino 2 vPro CPU, the OS is Ubuntu 8.10, MPICH2 1.0.8 is installed by
default under /usr/local, and pvfs2.7.1 is under /root/pvfs-install/.

gxwangdi at WANGDI:~/Desktop/pio-bench$ sudo mpiexec -n 1 ./pio-bench
[sudo] password for gxwangdi:
File under test: /mnt/pvfs2/ftpaccess
Number of Processes: 1
Sync: off
Averaging: Off
the nested strided pattern needs to be run with an even amount of processes
file pio-bench.c, line 586: access pattern initialization error: -1

gxwangdi at WANGDI:~/Desktop/pio-bench$ sudo mpiexec -n 4 ./pio-bench
Fatal error in MPI_Barrier: Other MPI error, error stack:
MPI_Barrier(406)..........................: MPI_Barrier(MPI_COMM_WORLD)
failed
MPIR_Barrier(77)..........................:
MPIC_Sendrecv(126)........................:
MPIC_Wait(270)............................:
MPIDI_CH3i_Progress_wait(215).............: an error occurred while
handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(420):
MPIDU_Socki_handle_read(637)..............: connection failure
(set=0,sock=1,errno=104:Connection reset by peer)[cli_0]: aborting job:
Fatal error in MPI_Barrier: Other MPI error, error stack:
MPI_Barrier(406)..........................: MPI_Barrier(MPI_COMM_WORLD)
failed
MPIR_Barrier(77)..........................:
MPIC_Sendrecv(126)........................:
MPIC_Wait(270)............................:
MPIDI_CH3i_Progress_wait(215).............: an error occurred while
handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(420):
MPIDU_Socki_handle_readFatal error in MPI_Bcast: Other MPI error, error
stack:
MPI_Bcast(786)............................: MPI_Bcast(buf=0x1fd6ca78,
count=20, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(198)...........................:
MPIC_Recv(81).............................:
MPIC_Wait(270)............................:
MPIDI_CH3i_Progress_wait(215).............: an error occurred while
handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(456):
adjust_iov(973)...........................: ch3|sock|immedread
0x1e5a0d60 0x1f329978 0x1f3258d0
MPIDU_Sock_readv(455).....................: the supplied buffer contains
invalid memory (set=0,sock=1,errno=14:Bad address)[cli_1]: aborting job:
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)............................: MPI_Bcast(buf=0x1fd6ca78,
count=20, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(198)...........................:
MPIC_Recv(81).............................:
MPIC_Wait(270)............................(637)..............:
connection failure (set=0,sock=1,errno=104:Connection reset by peer)
:
MPIDI_CH3i_Progress_wait(215).............: an error occurred while
handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(456):
adjust_iov(973)...........................: ch3|sock|immedread
0x1e5a0d60 0x1f329978 0x1f3258d0
MPIDU_Sock_readv(455).....................: the supplied buffer contains
invalid memory (set=0,sock=1,errno=14:Bad address)
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)............................: MPI_Bcast(buf=0x1fd6ca78,
count=20, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(198)...........................:
MPIC_Recv(81).............................:
MPIC_Wait(270)............................:
MPIDI_CH3i_Progress_wait(215).............: an error occurred while
handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(456):
adjust_iov(973)...........................: ch3|sock|immedread
0x1e5a0d60 0x1eb9f978 0x1eb9b8d0
MPIDU_Sock_readv(455).....................: the supplied buffer contains
invalid memory (set=0,sock=1,errno=14:Bad address)[cli_2]: aborting job:
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)............................: MPI_Bcast(buf=0x1fd6ca78,
count=20, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(198)...........................:
MPIC_Recv(81).............................:
MPIC_Wait(270)............................:
MPIDI_CH3i_Progress_wait(215).............: an error occurred while
handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(456):
adjust_iov(973)...........................: ch3|sock|immedread
0x1e5a0d60 0x1eb9f978 0x1eb9b8d0
MPIDU_Sock_readv(455).....................: the supplied buffer contains
invalid memory (set=0,sock=1,errno=14:Bad address)
rank 1 in job 9  WANGDI_59039   caused collective abort of all ranks
  exit status of rank 1: return code 1
rank 0 in job 9  WANGDI_59039   caused collective abort of all ranks
  exit status of rank 0: return code 1
gxwangdi at WANGDI:~/Desktop/pio-bench$ sudo mpiexec -n 4 hostname
WANGDI
WANGDI
WANGDI
WANGDI


and my pio-bench.conf file is like:

Testfile "/mnt/pvfs2/ftpaccess"

OutputToFile "/home/gxwangdi/Desktop/pio-bench/results/result"

<ap_module>
ModuleName "Nested Strided (read)"
ModuleReps 3
ModuleSettleTime 5
</ap_module>

<ap_module>
ModuleName "Nested Strided (write)"
ModuleReps 3
ModuleSettleTime 5
</ap_module>

<ap_module>
ModuleName "Nested Strided (read-modify-write)"
ModuleReps 3
ModuleSettleTime 5
</ap_module>

<ap_module>
ModuleName "Nested Strided (re-read)"
ModuleReps 3
ModuleSettleTime 5
</ap_module>

<ap_module>
ModuleName "Nested Strided (re-write)"
ModuleReps 3
ModuleSettleTime 5
</ap_module>

I also tried another file that is not under /mnt/pvfs2 for the test; it
reports the following:

gxwangdi at WANGDI:~/Desktop/pio-bench$ sudo mpiexec -n 4 ./pio-bench
Fatal error in MPI_Barrier: Other MPI error, error stack:
MPI_Barrier(406)..........................: MPI_Barrier(MPI_COMM_WORLD)
failed
MPIR_Barrier(77)..........................:
MPIC_Sendrecv(126)........................:
MPIC_Wait(270)............................:
MPIDI_CH3i_Progress_wait(215).............: an error occurred while
handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(420):
MPIDU_Socki_handle_read(637)..............: connection failure
(set=0,sock=1,errno=104:Connection reset by peer)[cli_0]: aborting job:
Fatal error in MPI_Barrier: Other MPI error, error stack:
MPI_Barrier(406)..........................: MPI_Barrier(MPI_COMM_WORLD)
failed
MPIR_Barrier(77)..........................:
MPIC_Sendrecv(126)........................:
MPIC_Wait(270)............................:
MPIDI_CH3i_Progress_wait(215).............: an error occurred while
handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(420):
MPIDU_Socki_handle_readFatal error in MPI_Bcast: Other MPI error, error
stack:
MPI_Bcast(786)............................: MPI_Bcast(buf=0x1f60fad0,
count=20, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(198)...........................:
MPIC_Recv(81).............................:
MPIC_Wait(270)............................:
MPIDI_CH3i_Progress_wait(215).............: an error occurred while
handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(456):
adjust_iov(973)...........................: ch3|sock|immedread
0x1e5a0d60 0x1f525978 0x1f5218d0
MPIDU_Sock_readv(455).....................: the supplied buffer contains
invalid memory (set=0,sock=1,errno=14:Bad address)[cli_2]: aborting job:
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)............................: MPI_Bcast(buf=0x1f60fad0,
count=20, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(198)...........................:
MPIC_Recv(81).............................:
MPIC_Wait(270)............................(637)..............:
connection failure (set=0,sock=1,errno=104:Connection reset by peer)
:
MPIDI_CH3i_Progress_wait(215).............: an error occurred while
handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(456):
adjust_iov(973)...........................: ch3|sock|immedread
0x1e5a0d60 0x1f525978 0x1f5218d0
MPIDU_Sock_readv(455).....................: the supplied buffer contains
invalid memory (set=0,sock=1,errno=14:Bad address)
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)............................: MPI_Bcast(buf=0x1f60fad0,
count=20, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(198)...........................:
MPIC_Recv(81).............................:
MPIC_Wait(270)............................:
MPIDI_CH3i_Progress_wait(215).............: an error occurred while
handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(456):
adjust_iov(973)...........................: ch3|sock|immedread
0x1e5a0d60 0x202d2978 0x202ce8d0
MPIDU_Sock_readv(455).....................: the supplied buffer contains
invalid memory (set=0,sock=1,errno=14:Bad address)[cli_1]: aborting job:
Fatal error in MPI_Bcast: Other MPI error, error stack:
MPI_Bcast(786)............................: MPI_Bcast(buf=0x1f60fad0,
count=20, MPI_BYTE, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast(198)...........................:
MPIC_Recv(81).............................:
MPIC_Wait(270)............................:
MPIDI_CH3i_Progress_wait(215).............: an error occurred while
handling an event returned by MPIDU_Sock_Wait()
MPIDI_CH3I_Progress_handle_sock_event(456):
adjust_iov(973)...........................: ch3|sock|immedread
0x1e5a0d60 0x202d2978 0x202ce8d0
MPIDU_Sock_readv(455).....................: the supplied buffer contains
invalid memory (set=0,sock=1,errno=14:Bad address)
rank 2 in job 11  WANGDI_59039   caused collective abort of all ranks
  exit status of rank 2: return code 1
rank 1 in job 11  WANGDI_59039   caused collective abort of all ranks
  exit status of rank 1: return code 1
rank 0 in job 11  WANGDI_59039   caused collective abort of all ranks
  exit status of rank 0: return code 1


MPI_COMM_WORLD failed again, but this time it caused a collective abort of
all ranks at the end, which is slightly different. Since pio-bench leaves no
syslog to check, I do not understand what is happening and I cannot solve
this problem.

Appreciate your responses.








-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.mcs.anl.gov/mailman/private/mpich2-dev/attachments/20090413/403b6aa5/attachment-0001.htm>

