[mpich-discuss] nemesis-local-lmt=knem and osu bibw test
Jerome Soumagne
soumagne at cscs.ch
Wed Oct 13 06:20:29 CDT 2010
Updating my mvapich2 build with the latest fix and the latest knem ABI changes
(all the #define MPICH_NEW_KNEM_ABI_VERSION (0x0000000c) stuff in mpich2
1.3.x) solved the problem I had as well. That's great, everything
seems to be fixed.
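
For reference, the version guard looks roughly like this (a sketch only,
assuming KNEM_ABI_VERSION is exported by knem's installed knem_io.h header;
the exact conditionals in the mpich2 source may differ):

  /* Sketch, not the actual mpich2 source. 0x0000000c matches the
   * "Driver ABI=0xc" reported by /dev/knem further down this thread. */
  #include <knem_io.h>

  #define MPICH_NEW_KNEM_ABI_VERSION (0x0000000c)

  #if KNEM_ABI_VERSION >= MPICH_NEW_KNEM_ABI_VERSION
  /* build against the new knem 0.9.x ioctl interface */
  #else
  /* fall back to the older knem interface */
  #endif
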
Jerome
On 10/13/2010 09:59 AM, Jerome Soumagne wrote:
>
> Hi Darius,
>
> Thanks a lot, you definitely fixed it. No more errors in my patched
> mpich2-1.3rc2 build either. I'd now like to get this patch working in
> my mvapich2 build; I'll follow up with the OSU folks if needed.
>
> Thanks again.
>
> Jerome
>
> On 10/13/2010 12:54 AM, Darius Buntinas wrote:
>> Hi Jerome,
>>
>> It looks like I was able to fix it. You can get the patch here
>> (there's a link at the bottom of the page).
>>
>> https://trac.mcs.anl.gov/projects/mpich2/changeset/7334
>>
>> Let me know if this works for you.
>>
>> -d
>>
>> On Oct 12, 2010, at 11:43 AM, Jerome Soumagne wrote:
>>
>>> OK, thanks. I would be glad if you could find a patch to fix that; I
>>> don't see one in the open ticket. Did I miss something?
>>>
>>> Since I use the nemesis module in mvapich2 as well, I would expect
>>> this problem to go away there too once it's fixed in mpich2 (even if
>>> things rarely work out that way).
>>>
>>> Jerome
>>>
>>> On 10/12/2010 06:27 PM, Dave Goodell wrote:
>>>> Ah yes, this is the same issue; it just didn't explicitly mention
>>>> knem, so I didn't find it in my ticket search. Thanks for pointing
>>>> it out.
>>>>
>>>> -Dave
>>>>
>>>> On Oct 12, 2010, at 10:56 AM CDT, Jayesh Krishna wrote:
>>>>
>>>>> FYI, https://trac.mcs.anl.gov/projects/mpich2/ticket/1039
>>>>>
>>>>> -Jayesh
>>>>> ----- Original Message -----
>>>>> From: "Dave Goodell"<goodell at mcs.anl.gov>
>>>>> To: mpich-discuss at mcs.anl.gov
>>>>> Sent: Tuesday, October 12, 2010 10:55:04 AM GMT -06:00 US/Canada
>>>>> Central
>>>>> Subject: Re: [mpich-discuss] nemesis-local-lmt=knem and osu bibw test
>>>>>
>>>>> I think this is a known bug in MPICH2 that has slipped through the
>>>>> cracks without being fixed or logged in the bug tracker. The
>>>>> problem is caused by incorrectly allocating the knem cookie on the
>>>>> stack instead of the heap in the nemesis LMT code. I think Darius
>>>>> has a fix lying around somewhere, but we can cook one up for you
>>>>> soon even if he doesn't.
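>>>>>
>>>>> To illustrate, the broken pattern looks roughly like the following (just
>>>>> a sketch with made-up type and function names, not the actual nemesis
>>>>> source):
>>>>>
>>>>>   #include <stdlib.h>
>>>>>
>>>>>   /* Hypothetical stand-ins for the real nemesis LMT structures. */
>>>>>   typedef struct { unsigned long region_id; } lmt_cookie_t;
>>>>>   typedef struct pending_req {
>>>>>       lmt_cookie_t *cookie;
>>>>>       struct pending_req *next;
>>>>>   } pending_req_t;
>>>>>
>>>>>   static pending_req_t *pending_head;
>>>>>
>>>>>   static void enqueue_pending(pending_req_t *req)
>>>>>   {
>>>>>       req->next = pending_head;
>>>>>       pending_head = req;
>>>>>   }
>>>>>
>>>>>   /* Broken: the cookie has automatic storage, but the queued request
>>>>>    * that points at it outlives this call, so a later ioctl on the
>>>>>    * cookie sees garbage -- which can surface as exactly the kind of
>>>>>    * EINVAL you hit. */
>>>>>   static void start_recv_broken(pending_req_t *req, unsigned long id)
>>>>>   {
>>>>>       lmt_cookie_t cookie = { id };
>>>>>       req->cookie = &cookie;      /* dangling once we return */
>>>>>       enqueue_pending(req);
>>>>>   }
>>>>>
>>>>>   /* Fixed: heap-allocate the cookie so it stays valid until the
>>>>>    * transfer completes; it gets freed when the request is retired. */
>>>>>   static void start_recv_fixed(pending_req_t *req, unsigned long id)
>>>>>   {
>>>>>       lmt_cookie_t *cookie = malloc(sizeof *cookie);
>>>>>       cookie->region_id = id;
>>>>>       req->cookie = cookie;
>>>>>       enqueue_pending(req);
>>>>>   }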
>>>>>
>>>>> I can't speak to the MVAPICH2 bug; you'll have to take that up
>>>>> with the folks at OSU once we've got the MPICH2 bug sorted out.
>>>>>
>>>>> Thanks for bringing this (back) to our attention.
>>>>>
>>>>> -Dave
>>>>>
>>>>> On Oct 12, 2010, at 8:43 AM CDT, Jerome Soumagne wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I've recently compiled and installed mpich2-1.3rc2 and mvapich2
>>>>>> 1.5.1p1 with knem support enabled (using the options
>>>>>> --with-device=ch3:nemesis --with-nemesis-local-lmt=knem
>>>>>> --with-knem=/usr/local/knem). The version of knem that I use is
>>>>>> 0.9.2.
>>>>>>
>>>>>> Doing a cat of /dev/knem gives:
>>>>>> knem 0.9.2
>>>>>> Driver ABI=0xc
>>>>>> Flags: forcing 0x0, ignoring 0x0
>>>>>> DMAEngine: KernelSupported Enabled NoChannelAvailable
>>>>>> Debug: NotBuilt
>>>>>> Requests Submitted : 119406
>>>>>> Requests Processed/DMA : 0
>>>>>> Requests Processed/Thread : 0
>>>>>> Requests Processed/PinLocal : 0
>>>>>> Requests Failed/NoMemory : 0
>>>>>> Requests Failed/ReadCmd : 0
>>>>>> Requests Failed/FindRegion : 6
>>>>>> Requests Failed/Pin : 0
>>>>>> Requests Failed/MemcpyToUser: 0
>>>>>> Requests Failed/MemcpyPinned: 0
>>>>>> Requests Failed/DMACopy : 0
>>>>>> Dmacpy Cleanup Timeout : 0
>>>>>>
>>>>>> I ran several tests using the IMB and OSU benchmarks. All tests look
>>>>>> fine (and I get good bandwidth results, comparable to what I could
>>>>>> get with LiMIC2), except the osu_bibw test from the OSU benchmarks,
>>>>>> which throws the following error with mpich2:
>>>>>>
>>>>>> # OSU MPI Bi-Directional Bandwidth Test v3.1.2
>>>>>> # Size Bi-Bandwidth (MB/s)
>>>>>> 1 3.41
>>>>>> 2 7.15
>>>>>> 4 12.06
>>>>>> 8 39.66
>>>>>> 16 73.20
>>>>>> 32 156.94
>>>>>> 64 266.58
>>>>>> 128 370.34
>>>>>> 256 977.24
>>>>>> 512 2089.85
>>>>>> 1024 3498.96
>>>>>> 2048 5543.29
>>>>>> 4096 7314.23
>>>>>> 8192 8381.86
>>>>>> 16384 9291.81
>>>>>> 32768 5948.53
>>>>>> Fatal error in PMPI_Waitall: Other MPI error, error stack:
>>>>>> PMPI_Waitall(274)...............: MPI_Waitall(count=64,
>>>>>> req_array=0xa11a20, status_array=0xe23960) failed
>>>>>> MPIR_Waitall_impl(121)..........:
>>>>>> MPIDI_CH3I_Progress(393)........:
>>>>>> MPID_nem_handle_pkt(573)........:
>>>>>> pkt_RTS_handler(241)............:
>>>>>> do_cts(518).....................:
>>>>>> MPID_nem_lmt_dma_start_recv(365):
>>>>>> MPID_nem_lmt_send_COOKIE(173)...: ioctl failed errno=22 - Invalid
>>>>>> argument
>>>>>> APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)
>>>>>>
>>>>>> It seems to come from the nemesis source, specifically the
>>>>>> mpid_nem_lmt_dma.c file which uses knem, but I don't really know
>>>>>> what is happening, and I don't see anything special in that test,
>>>>>> which simply measures the bi-directional bandwidth. On another
>>>>>> machine, I get the following error with mvapich2:
>>>>>>
>>>>>> # OSU MPI Bi-Directional Bandwidth Test v3.1.2
>>>>>> # Size Bi-Bandwidth (MB/s)
>>>>>> 1 1.92
>>>>>> 2 3.86
>>>>>> 4 7.72
>>>>>> 8 15.44
>>>>>> 16 30.75
>>>>>> 32 61.44
>>>>>> 64 122.54
>>>>>> 128 232.62
>>>>>> 256 416.85
>>>>>> 512 718.60
>>>>>> 1024 1148.63
>>>>>> 2048 1462.37
>>>>>> 4096 1659.45
>>>>>> 8192 2305.22
>>>>>> 16384 3153.85
>>>>>> 32768 3355.30
>>>>>> APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)
>>>>>>
>>>>>> Attaching gdb gives the following:
>>>>>> Program received signal SIGSEGV, Segmentation fault.
>>>>>> 0x00007f12691e4c99 in MPID_nem_lmt_dma_progress ()
>>>>>> at
>>>>>> /project/csvis/soumagne/apps/src/eiger/mvapich2-1.5.1p1/src/mpid/ch3/channels/nemesis/nemesis/src/mpid_nem_lmt_dma.c:484
>>>>>> 484 prev->next = cur->next;
>>>>>>
>>>>>> Is there something wrong in my mpich2/knem configuration, or does
>>>>>> anyone know where this problem comes from? (The osu_bibw.c file
>>>>>> is attached.)
>>>>>>
>>>>>> Thanks in advance
>>>>>>
>>>>>> Jerome
>>>>>>
>>>>>> --
>>>>>> Jérôme Soumagne
>>>>>> Scientific Computing Research Group
>>>>>> CSCS, Swiss National Supercomputing Centre
>>>>>> Galleria 2, Via Cantonale | Tel: +41 (0)91 610 8258
>>>>>> CH-6928 Manno, Switzerland | Fax: +41 (0)91 610 8282
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> <osu_bibw.c>