[mpich-discuss] nemesis-local-lmt=knem and osu bibw test
Jerome Soumagne
soumagne at cscs.ch
Tue Oct 12 08:43:43 CDT 2010
Hi,
I've recently compiled and installed mpich2-1.3rc2 and mvapich2 1.5.1p1
with knem support enabled (using the options --with-device=ch3:nemesis
--with-nemesis-local-lmt=knem --with-knem=/usr/local/knem). The version
of knem that I use is 0.9.2.
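For reference, the full configure invocation was essentially the
following (the /usr/local/knem prefix is simply where I installed knem):

    ./configure --with-device=ch3:nemesis \
                --with-nemesis-local-lmt=knem \
                --with-knem=/usr/local/knem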
Doing a cat of /dev/knem gives:
knem 0.9.2
Driver ABI=0xc
Flags: forcing 0x0, ignoring 0x0
DMAEngine: KernelSupported Enabled NoChannelAvailable
Debug: NotBuilt
Requests Submitted : 119406
Requests Processed/DMA : 0
Requests Processed/Thread : 0
Requests Processed/PinLocal : 0
Requests Failed/NoMemory : 0
Requests Failed/ReadCmd : 0
Requests Failed/FindRegion : 6
Requests Failed/Pin : 0
Requests Failed/MemcpyToUser: 0
Requests Failed/MemcpyPinned: 0
Requests Failed/DMACopy : 0
Dmacpy Cleanup Timeout : 0
I ran several tests using the IMB and OSU benchmarks. All tests look
fine (and I get good bandwidth results, comparable to what I could get
with LiMIC2), except the osu_bibw test from the OSU suite, which throws
the following error with mpich2:
# OSU MPI Bi-Directional Bandwidth Test v3.1.2
# Size Bi-Bandwidth (MB/s)
1 3.41
2 7.15
4 12.06
8 39.66
16 73.20
32 156.94
64 266.58
128 370.34
256 977.24
512 2089.85
1024 3498.96
2048 5543.29
4096 7314.23
8192 8381.86
16384 9291.81
32768 5948.53
Fatal error in PMPI_Waitall: Other MPI error, error stack:
PMPI_Waitall(274)...............: MPI_Waitall(count=64,
req_array=0xa11a20, status_array=0xe23960) failed
MPIR_Waitall_impl(121)..........:
MPIDI_CH3I_Progress(393)........:
MPID_nem_handle_pkt(573)........:
pkt_RTS_handler(241)............:
do_cts(518).....................:
MPID_nem_lmt_dma_start_recv(365):
MPID_nem_lmt_send_COOKIE(173)...: ioctl failed errno=22 - Invalid argument
APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)
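For context, the failing MPI_Waitall(count=64) matches the inner loop of
osu_bibw, where both ranks post a window of 64 non-blocking receives and
sends at the same time and then wait on both windows. Roughly (a sketch
only; variable and function names here are approximate, the attached
osu_bibw.c is the actual code):

    /* Sketch of osu_bibw's inner loop (approximate; see the attached
     * osu_bibw.c for the real code). Both ranks run this at once. */
    #include <mpi.h>

    #define WINDOW_SIZE 64

    static void bibw_iteration(char *s_buf, char *r_buf, int size,
                               int peer)
    {
        MPI_Request send_req[WINDOW_SIZE], recv_req[WINDOW_SIZE];
        int j;

        for (j = 0; j < WINDOW_SIZE; j++)
            MPI_Irecv(r_buf, size, MPI_CHAR, peer, 10,
                      MPI_COMM_WORLD, &recv_req[j]);
        for (j = 0; j < WINDOW_SIZE; j++)
            MPI_Isend(s_buf, size, MPI_CHAR, peer, 100,
                      MPI_COMM_WORLD, &send_req[j]);
        /* This is the MPI_Waitall(count=64) in the error stack. */
        MPI_Waitall(WINDOW_SIZE, send_req, MPI_STATUSES_IGNORE);
        MPI_Waitall(WINDOW_SIZE, recv_req, MPI_STATUSES_IGNORE);
    }

So when the error hits, there can be up to 64 transfers per direction in
flight between the two processes at once.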
It seems to come from the nemesis source, specifically the
mpid_nem_lmt_dma.c file which uses knem, but I don't really know what
happens, and I don't see anything special in that test, which simply
measures the bi-directional bandwidth with the pattern sketched above.
On another machine, I get the following error with mvapich2:
# OSU MPI Bi-Directional Bandwidth Test v3.1.2
# Size Bi-Bandwidth (MB/s)
1 1.92
2 3.86
4 7.72
8 15.44
16 30.75
32 61.44
64 122.54
128 232.62
256 416.85
512 718.60
1024 1148.63
2048 1462.37
4096 1659.45
8192 2305.22
16384 3153.85
32768 3355.30
APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)
Attaching gdb gives the following:
Program received signal SIGSEGV, Segmentation fault.
0x00007f12691e4c99 in MPID_nem_lmt_dma_progress ()
    at /project/csvis/soumagne/apps/src/eiger/mvapich2-1.5.1p1/src/mpid/ch3/channels/nemesis/nemesis/src/mpid_nem_lmt_dma.c:484
484         prev->next = cur->next;
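Line 484 looks like a singly-linked-list removal from the lmt_dma
outstanding-request queue. A minimal sketch of that idiom (only the
prev->next = cur->next line is from the actual source; every other name
here is hypothetical), showing why it segfaults when prev is NULL or
points to freed memory:

    /* Hypothetical sketch of a queue-removal idiom like the one at
     * mpid_nem_lmt_dma.c:484; only "prev->next = cur->next" is from
     * the real file, everything else is illustrative. */
    struct node { struct node *next; /* ... request state ... */ };

    static void list_remove(struct node **head, struct node *cur)
    {
        struct node *prev = NULL, *it;
        for (it = *head; it != NULL && it != cur; it = it->next)
            prev = it;
        if (it == NULL)
            return;                  /* cur is not in the list */
        if (prev == NULL)
            *head = cur->next;       /* cur was the head */
        else
            prev->next = cur->next;  /* faults if prev was freed or
                                        the list changed underneath us */
    }

A SIGSEGV there would be consistent with the queue being corrupted or an
entry being freed while still linked, but I can't tell more from the
backtrace alone.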
Is there something wrong in my mpich2/knem configuration, or does anyone
know where this problem comes from? (The osu_bibw.c file is attached.)
Thanks in advance
Jerome
--
Jérôme Soumagne
Scientific Computing Research Group
CSCS, Swiss National Supercomputing Centre
Galleria 2, Via Cantonale | Tel: +41 (0)91 610 8258
CH-6928 Manno, Switzerland | Fax: +41 (0)91 610 8282
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: osu_bibw.c
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20101012/7eba98c4/attachment.diff>