[mpich-discuss] nemesis-local-lmt=knem and osu bibw test

Jerome Soumagne soumagne at cscs.ch
Tue Oct 12 08:43:43 CDT 2010


Hi,

I recently compiled and installed mpich2-1.3rc2 and mvapich2 1.5.1p1 
with knem support enabled (using the configure options 
--with-device=ch3:nemesis --with-nemesis-local-lmt=knem 
--with-knem=/usr/local/knem). The knem version I use is 0.9.2.

Running cat /dev/knem gives:
knem 0.9.2
  Driver ABI=0xc
  Flags: forcing 0x0, ignoring 0x0
  DMAEngine: KernelSupported Enabled NoChannelAvailable
  Debug: NotBuilt
  Requests Submitted          : 119406
  Requests Processed/DMA      : 0
  Requests Processed/Thread   : 0
  Requests Processed/PinLocal : 0
  Requests Failed/NoMemory    : 0
  Requests Failed/ReadCmd     : 0
  Requests Failed/FindRegion  : 6
  Requests Failed/Pin         : 0
  Requests Failed/MemcpyToUser: 0
  Requests Failed/MemcpyPinned: 0
  Requests Failed/DMACopy     : 0
  Dmacpy Cleanup Timeout      : 0

I ran several tests using the IMB and OSU benchmarks. All tests look fine 
(and I get good bandwidth results, comparable to what I could get with 
limic2) except the osu_bibw test from the OSU benchmarks, which fails 
with the following error with mpich2:

# OSU MPI Bi-Directional Bandwidth Test v3.1.2
# Size     Bi-Bandwidth (MB/s)
1                         3.41
2                         7.15
4                        12.06
8                        39.66
16                       73.20
32                      156.94
64                      266.58
128                     370.34
256                     977.24
512                    2089.85
1024                   3498.96
2048                   5543.29
4096                   7314.23
8192                   8381.86
16384                  9291.81
32768                  5948.53
Fatal error in PMPI_Waitall: Other MPI error, error stack:
PMPI_Waitall(274)...............: MPI_Waitall(count=64, req_array=0xa11a20, status_array=0xe23960) failed
MPIR_Waitall_impl(121)..........:
MPIDI_CH3I_Progress(393)........:
MPID_nem_handle_pkt(573)........:
pkt_RTS_handler(241)............:
do_cts(518).....................:
MPID_nem_lmt_dma_start_recv(365):
MPID_nem_lmt_send_COOKIE(173)...: ioctl failed errno=22 - Invalid argument
APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)
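
For reference, the errno=22 in the last line of the error stack is EINVAL, 
i.e. the knem ioctl rejected one of its arguments. The snippet below is a 
minimal sketch, assuming a Linux/glibc system, that reproduces the 
"errno=N - message" text for a given error code. format_errno is a 
hypothetical helper written for illustration; it is not part of MPICH2 or 
knem.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not MPICH2 code): format an errno value in the
 * same "errno=<number> - <message>" style that appears in the error
 * stack. Returns whatever snprintf returns. */
int format_errno(int err, char *buf, size_t len)
{
    return snprintf(buf, len, "errno=%d - %s", err, strerror(err));
}
```

Calling format_errno(EINVAL, buf, sizeof buf) should give the same text as 
in the stack above on Linux, which confirms that the failure is an 
"invalid argument" returned by the ioctl and not, for example, a 
permission or memory problem.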

The error seems to come from the nemesis source, in the mpid_nem_lmt_dma.c 
file, which uses knem, but I don't really know what is happening, and I 
don't see anything special in that test, which measures the bi-directional 
bandwidth. On another machine, I get the following error with mvapich2:

# OSU MPI Bi-Directional Bandwidth Test v3.1.2
# Size     Bi-Bandwidth (MB/s)
1                         1.92
2                         3.86
4                         7.72
8                        15.44
16                       30.75
32                       61.44
64                      122.54
128                     232.62
256                     416.85
512                     718.60
1024                   1148.63
2048                   1462.37
4096                   1659.45
8192                   2305.22
16384                  3153.85
32768                  3355.30
APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)

Attaching gdb gives the following:
Program received signal SIGSEGV, Segmentation fault.
0x00007f12691e4c99 in MPID_nem_lmt_dma_progress ()
     at /project/csvis/soumagne/apps/src/eiger/mvapich2-1.5.1p1/src/mpid/ch3/channels/nemesis/nemesis/src/mpid_nem_lmt_dma.c:484
484                            prev->next = cur->next;
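
A statement like the one at line 484 typically segfaults when prev is NULL 
(the node being removed is the head of the list) or when cur points to a 
node that has already been freed. I have not verified which case applies 
here. The sketch below only illustrates the general pattern of unlinking 
nodes from a singly linked list without dereferencing a NULL prev. The 
node type and the remove_done function are hypothetical and are not taken 
from mpid_nem_lmt_dma.c.

```c
#include <stddef.h>

/* Hypothetical node type for illustration; not the MVAPICH2 structure. */
struct node {
    int done;              /* non-zero when the entry can be removed */
    struct node *next;
};

/* Unlink every node whose 'done' flag is set and return how many were
 * removed. When the first node is removed, *head is updated instead of
 * writing through prev, so prev is never dereferenced while NULL. */
int remove_done(struct node **head)
{
    struct node *prev = NULL;
    struct node *cur = *head;
    int removed = 0;

    while (cur != NULL) {
        struct node *next = cur->next;   /* save before unlinking */
        if (cur->done) {
            if (prev != NULL)
                prev->next = next;       /* middle or tail node */
            else
                *head = next;            /* head node */
            cur->next = NULL;
            ++removed;
        } else {
            prev = cur;                  /* advance prev only past kept nodes */
        }
        cur = next;
    }
    return removed;
}
```

The caller stays responsible for freeing the unlinked nodes; the sketch 
only shows the pointer handling.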

Is there something wrong in my mpich2/knem configuration, or does anyone 
know where this problem comes from? (The osu_bibw.c file is attached.)

Thanks in advance

Jerome

-- 
Jérôme Soumagne
Scientific Computing Research Group
CSCS, Swiss National Supercomputing Centre
Galleria 2, Via Cantonale  | Tel: +41 (0)91 610 8258
CH-6928 Manno, Switzerland | Fax: +41 (0)91 610 8282




-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: osu_bibw.c
URL: <http://lists.mcs.anl.gov/pipermail/mpich-discuss/attachments/20101012/7eba98c4/attachment.diff>

