[mpich-discuss] nemesis-local-lmt=knem and osu bibw test

Dave Goodell goodell at mcs.anl.gov
Tue Oct 12 10:55:04 CDT 2010


I think this is a known bug in MPICH2 that has slipped through the cracks without being fixed or logged in the bug tracker.  The problem is caused by incorrectly allocating the knem cookie on the stack instead of the heap in the nemesis LMT code.  I think Darius has a fix lying around somewhere, but we can cook one up for you soon even if he doesn't.

I can't speak to the MVAPICH2 bug; you'll have to take that up with the folks at OSU once we've got the MPICH2 bug sorted out.

Thanks for bringing this (back) to our attention.

-Dave

On Oct 12, 2010, at 8:43 AM CDT, Jerome Soumagne wrote:

> Hi,
> 
> I've recently compiled and installed mpich2-1.3rc2 and mvapich2 1.5.1p1 with knem support enabled (using the options --with-device=ch3:nemesis --with-nemesis-local-lmt=knem --with-knem=/usr/local/knem). The version of knem that I use is 0.9.2.
> 
> Doing a cat of /dev/knem gives:
> knem 0.9.2
>  Driver ABI=0xc
>  Flags: forcing 0x0, ignoring 0x0
>  DMAEngine: KernelSupported Enabled NoChannelAvailable
>  Debug: NotBuilt
>  Requests Submitted          : 119406
>  Requests Processed/DMA      : 0
>  Requests Processed/Thread   : 0
>  Requests Processed/PinLocal : 0
>  Requests Failed/NoMemory    : 0
>  Requests Failed/ReadCmd     : 0
>  Requests Failed/FindRegion  : 6
>  Requests Failed/Pin         : 0
>  Requests Failed/MemcpyToUser: 0
>  Requests Failed/MemcpyPinned: 0
>  Requests Failed/DMACopy     : 0
>  Dmacpy Cleanup Timeout      : 0
> 
> I ran several tests using IMB and osu benchmarks. All tests look fine (and I get good bandwidth results, comparable to what I could get with limic2) except the osu_bibw test from the osu benchmarks which throws the following error with mpich2:
> 
> # OSU MPI Bi-Directional Bandwidth Test v3.1.2
> # Size     Bi-Bandwidth (MB/s)
> 1                         3.41
> 2                         7.15
> 4                        12.06
> 8                        39.66
> 16                       73.20
> 32                      156.94
> 64                      266.58
> 128                     370.34
> 256                     977.24
> 512                    2089.85
> 1024                   3498.96
> 2048                   5543.29
> 4096                   7314.23
> 8192                   8381.86
> 16384                  9291.81
> 32768                  5948.53
> Fatal error in PMPI_Waitall: Other MPI error, error stack:
> PMPI_Waitall(274)...............: MPI_Waitall(count=64, req_array=0xa11a20, status_array=0xe23960) failed
> MPIR_Waitall_impl(121)..........: 
> MPIDI_CH3I_Progress(393)........: 
> MPID_nem_handle_pkt(573)........: 
> pkt_RTS_handler(241)............: 
> do_cts(518).....................: 
> MPID_nem_lmt_dma_start_recv(365): 
> MPID_nem_lmt_send_COOKIE(173)...: ioctl failed errno=22 - Invalid argument
> APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)
> 
> It seems to come from the nemesis source, specifically the mpid_nem_lmt_dma.c file which uses knem, but I don't really know what is happening, and I don't see anything special in that test, which measures the bi-directional bandwidth. On another machine, I get the following error with mvapich2:
> 
> # OSU MPI Bi-Directional Bandwidth Test v3.1.2
> # Size     Bi-Bandwidth (MB/s)
> 1                         1.92
> 2                         3.86
> 4                         7.72
> 8                        15.44
> 16                       30.75
> 32                       61.44
> 64                      122.54
> 128                     232.62
> 256                     416.85
> 512                     718.60
> 1024                   1148.63
> 2048                   1462.37
> 4096                   1659.45
> 8192                   2305.22
> 16384                  3153.85
> 32768                  3355.30
> APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)
> 
> Attaching gdb gives the following:
> Program received signal SIGSEGV, Segmentation fault.
> 0x00007f12691e4c99 in MPID_nem_lmt_dma_progress ()
>     at /project/csvis/soumagne/apps/src/eiger/mvapich2-1.5.1p1/src/mpid/ch3/channels/nemesis/nemesis/src/mpid_nem_lmt_dma.c:484
> 484                            prev->next = cur->next;
> 
> Is there something wrong in my mpich2/knem configuration, or does anyone know where this problem comes from? (the osu_bibw.c file is attached)
> 
> Thanks in advance
> 
> Jerome
> 
> -- 
> Jérôme Soumagne
> Scientific Computing Research Group
> CSCS, Swiss National Supercomputing Centre 
> Galleria 2, Via Cantonale  | Tel: +41 (0)91 610 8258
> CH-6928 Manno, Switzerland | Fax: +41 (0)91 610 8282
> 
> 
> 
> 
> <osu_bibw.c>
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
