[mpich-discuss] nemesis-local-lmt=knem and osu bibw test
Dave Goodell
goodell at mcs.anl.gov
Tue Oct 12 11:27:45 CDT 2010
Ahh yes, this is the same issue. The ticket just didn't explicitly mention knem, so I didn't find it in my ticket search. Thanks for pointing it out.
-Dave
On Oct 12, 2010, at 10:56 AM CDT, Jayesh Krishna wrote:
> FYI, https://trac.mcs.anl.gov/projects/mpich2/ticket/1039
>
> -Jayesh
> ----- Original Message -----
> From: "Dave Goodell" <goodell at mcs.anl.gov>
> To: mpich-discuss at mcs.anl.gov
> Sent: Tuesday, October 12, 2010 10:55:04 AM GMT -06:00 US/Canada Central
> Subject: Re: [mpich-discuss] nemesis-local-lmt=knem and osu bibw test
>
> I think this is a known bug in MPICH2 that has slipped through the cracks without being fixed or logged in the bug tracker. The problem is caused by incorrectly allocating the knem cookie on the stack instead of the heap in the nemesis LMT code. I think Darius has a fix lying around somewhere, but we can cook one up for you soon even if he doesn't.
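
To illustrate the stack-versus-heap point in the paragraph above, here is a minimal sketch of the general pattern. It is not the nemesis LMT code: `cookie_t`, `struct pending_xfer`, `start_xfer_heap` and `finish_xfer` are hypothetical names invented for this example, and the real knem cookie type and ioctl calls are not shown. The idea is only that a value which a later progress step will read must outlive the function that created it.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a driver-issued handle; NOT the real knem type. */
typedef uint64_t cookie_t;

/* Hypothetical record for a transfer that a later progress call completes. */
struct pending_xfer {
    cookie_t *cookie;   /* must remain valid until the transfer is finished */
};

/*
 * Broken pattern (shown only as a comment, do not use):
 *
 *     void start_xfer_stack(struct pending_xfer *req, cookie_t value)
 *     {
 *         cookie_t local = value;
 *         req->cookie = &local;   // dangles as soon as this function returns
 *     }
 *
 * Any later read through req->cookie is undefined behaviour, and may hand
 * garbage to whatever consumes the cookie.
 */

/* Heap version: the cookie outlives the function that starts the transfer. */
int start_xfer_heap(struct pending_xfer *req, cookie_t value)
{
    cookie_t *c = malloc(sizeof *c);
    if (c == NULL)
        return -1;          /* allocation failed, nothing was stored */
    *c = value;
    req->cookie = c;
    return 0;
}

/* Later progress step: read the cookie, then release its storage. */
cookie_t finish_xfer(struct pending_xfer *req)
{
    assert(req->cookie != NULL);
    cookie_t value = *req->cookie;
    free(req->cookie);
    req->cookie = NULL;
    return value;
}
```

A caller would invoke `start_xfer_heap(&req, value)` when the transfer is posted and `finish_xfer(&req)` from the progress loop once it completes. The cost is one allocation per outstanding transfer, in exchange for a pointer that stays valid across function returns.
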
>
> I can't speak to the MVAPICH2 bug; you'll have to take that up with the folks at OSU once we've got the other MPICH2 bug sorted out.
>
> Thanks for bringing this (back) to our attention.
>
> -Dave
>
> On Oct 12, 2010, at 8:43 AM CDT, Jerome Soumagne wrote:
>
>> Hi,
>>
>> I've recently compiled and installed mpich2-1.3rc2 and mvapich2 1.5.1p1 with knem support enabled (using the options --with-device=ch3:nemesis --with-nemesis-local-lmt=knem --with-knem=/usr/local/knem). The version of knem that I use is 0.9.2.
>>
>> Doing a cat of /dev/knem gives:
>> knem 0.9.2
>> Driver ABI=0xc
>> Flags: forcing 0x0, ignoring 0x0
>> DMAEngine: KernelSupported Enabled NoChannelAvailable
>> Debug: NotBuilt
>> Requests Submitted : 119406
>> Requests Processed/DMA : 0
>> Requests Processed/Thread : 0
>> Requests Processed/PinLocal : 0
>> Requests Failed/NoMemory : 0
>> Requests Failed/ReadCmd : 0
>> Requests Failed/FindRegion : 6
>> Requests Failed/Pin : 0
>> Requests Failed/MemcpyToUser: 0
>> Requests Failed/MemcpyPinned: 0
>> Requests Failed/DMACopy : 0
>> Dmacpy Cleanup Timeout : 0
>>
>> I ran several tests using IMB and the OSU benchmarks. All tests look fine (and I get good bandwidth results, comparable to what I could get with limic2), except the osu_bibw test from the OSU benchmarks, which throws the following error with mpich2:
>>
>> # OSU MPI Bi-Directional Bandwidth Test v3.1.2
>> # Size Bi-Bandwidth (MB/s)
>> 1 3.41
>> 2 7.15
>> 4 12.06
>> 8 39.66
>> 16 73.20
>> 32 156.94
>> 64 266.58
>> 128 370.34
>> 256 977.24
>> 512 2089.85
>> 1024 3498.96
>> 2048 5543.29
>> 4096 7314.23
>> 8192 8381.86
>> 16384 9291.81
>> 32768 5948.53
>> Fatal error in PMPI_Waitall: Other MPI error, error stack:
>> PMPI_Waitall(274)...............: MPI_Waitall(count=64, req_array=0xa11a20, status_array=0xe23960) failed
>> MPIR_Waitall_impl(121)..........:
>> MPIDI_CH3I_Progress(393)........:
>> MPID_nem_handle_pkt(573)........:
>> pkt_RTS_handler(241)............:
>> do_cts(518).....................:
>> MPID_nem_lmt_dma_start_recv(365):
>> MPID_nem_lmt_send_COOKIE(173)...: ioctl failed errno=22 - Invalid argument
>> APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)
>>
>> The error seems to come from the nemesis source, specifically from the mpid_nem_lmt_dma.c file, which uses knem, but I don't really know what is happening, and I don't see anything special in that test, which measures the bi-directional bandwidth. On another machine, I get the following error with mvapich2:
>>
>> # OSU MPI Bi-Directional Bandwidth Test v3.1.2
>> # Size Bi-Bandwidth (MB/s)
>> 1 1.92
>> 2 3.86
>> 4 7.72
>> 8 15.44
>> 16 30.75
>> 32 61.44
>> 64 122.54
>> 128 232.62
>> 256 416.85
>> 512 718.60
>> 1024 1148.63
>> 2048 1462.37
>> 4096 1659.45
>> 8192 2305.22
>> 16384 3153.85
>> 32768 3355.30
>> APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)
>>
>> Attaching gdb gives the following:
>> Program received signal SIGSEGV, Segmentation fault.
>> 0x00007f12691e4c99 in MPID_nem_lmt_dma_progress ()
>> at /project/csvis/soumagne/apps/src/eiger/mvapich2-1.5.1p1/src/mpid/ch3/channels/nemesis/nemesis/src/mpid_nem_lmt_dma.c:484
>> 484 prev->next = cur->next;
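
The faulting statement above is the usual unlink step when removing a node from a singly linked list. As a generic illustration only, here is a minimal sketch of that pattern. `struct node` and `list_remove` are hypothetical names, not taken from mpid_nem_lmt_dma.c, and nothing here establishes why line 484 crashed in this run. The sketch shows one case such code has to handle: when the node being removed is the head, there is no previous node to update.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical list node; not the structure used in mpid_nem_lmt_dma.c. */
struct node {
    int id;
    struct node *next;
};

/*
 * Unlink and return the first node whose id matches, or NULL if none does.
 * The caller keeps ownership of the returned node.
 */
struct node *list_remove(struct node **head, int id)
{
    assert(head != NULL);

    struct node *prev = NULL;
    struct node *cur = *head;

    /* Walk the list, remembering the node before the current one. */
    while (cur != NULL && cur->id != id) {
        prev = cur;
        cur = cur->next;
    }
    if (cur == NULL)
        return NULL;                /* no match, list unchanged */

    if (prev == NULL)
        *head = cur->next;          /* removing the head: no prev to update */
    else
        prev->next = cur->next;     /* the unlink step for interior nodes */

    cur->next = NULL;
    return cur;
}
```

An unlink like this can also fault if `prev` or `cur` points at memory that is no longer valid, so a backtrace alone does not distinguish the two cases.
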
>>
>> Is there something wrong in my mpich2/knem configuration, or does anyone know where this problem comes from? (The osu_bibw.c file is attached.)
>>
>> Thanks in advance
>>
>> Jerome
>>
>> --
>> Jérôme Soumagne
>> Scientific Computing Research Group
>> CSCS, Swiss National Supercomputing Centre
>> Galleria 2, Via Cantonale | Tel: +41 (0)91 610 8258
>> CH-6928 Manno, Switzerland | Fax: +41 (0)91 610 8282
>>
>> <osu_bibw.c>
>> _______________________________________________
>> mpich-discuss mailing list
>> mpich-discuss at mcs.anl.gov
>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss