[mpich-discuss] nemesis-local-lmt=knem and osu bibw test

Jerome Soumagne soumagne at cscs.ch
Wed Oct 13 06:20:29 CDT 2010


  Updating my mvapich2 build with the latest fix and the latest knem ABI 
changes (all the #define MPICH_NEW_KNEM_ABI_VERSION (0x0000000c) stuff in 
mpich2 1.3xxx) solved the problem that I had as well. That's great; 
everything seems to be fixed.
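
For anyone who hits the same thing, the guard that define implies looks 
roughly like the sketch below (not the actual MPICH code; I'm assuming 
knem_io.h is the header that provides KNEM_ABI_VERSION, and the value has 
to match the Driver ABI=0xc that /dev/knem reports):

    /* sketch only: compile-time check that the knem headers we build
     * against expose at least the ABI that the new MPICH code expects */
    #include <stdio.h>
    #include <knem_io.h>   /* assumed header providing KNEM_ABI_VERSION */

    #define MPICH_NEW_KNEM_ABI_VERSION (0x0000000c)

    #if KNEM_ABI_VERSION < MPICH_NEW_KNEM_ABI_VERSION
    #  error "knem headers are older than the ABI this MPICH build expects"
    #endif

    int main(void)
    {
        printf("knem ABI 0x%x, MPICH expects >= 0x%x\n",
               (unsigned) KNEM_ABI_VERSION,
               (unsigned) MPICH_NEW_KNEM_ABI_VERSION);
        return 0;
    }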

Jerome

On 10/13/2010 09:59 AM, Jerome Soumagne wrote:
>
>  Hi Darius,
>
> thanks a lot, you definitely fixed it. No more errors in my patched 
> mpich2-1.3rc2 build either. I'd now like to make this patch work for 
> my mvapich2 build; I'll follow up with the OSU guys eventually.
>
> Thanks again.
>
> Jerome
>
> On 10/13/2010 12:54 AM, Darius Buntinas wrote:
>> Hi Jerome,
>>
>> It looks like I was able to fix it.  You can get the patch here 
>> (there's a link at the bottom of the page).
>>
>> https://trac.mcs.anl.gov/projects/mpich2/changeset/7334
>>
>> Let me know if this works for you.
>>
>> -d
>>
>> On Oct 12, 2010, at 11:43 AM, Jerome Soumagne wrote:
>>
>>> ok, thanks. I would be glad if you could find a patch to fix that; I 
>>> don't see one in the open ticket. Did I miss something?
>>>
>>> Since I use the nemesis module in mvapich2 as well, I would expect 
>>> this problem to be solved there too once it's fixed in mpich2 (even 
>>> if things never quite work out that way).
>>>
>>> Jerome
>>>
>>> On 10/12/2010 06:27 PM, Dave Goodell wrote:
>>>> Ahh yes, this is the same issue; it just didn't explicitly mention 
>>>> knem, so I didn't find it in my ticket search.  Thanks for pointing 
>>>> it out.
>>>>
>>>> -Dave
>>>>
>>>> On Oct 12, 2010, at 10:56 AM CDT, Jayesh Krishna wrote:
>>>>
>>>>> FYI, https://trac.mcs.anl.gov/projects/mpich2/ticket/1039
>>>>>
>>>>> -Jayesh
>>>>> ----- Original Message -----
>>>>> From: "Dave Goodell"<goodell at mcs.anl.gov>
>>>>> To: mpich-discuss at mcs.anl.gov
>>>>> Sent: Tuesday, October 12, 2010 10:55:04 AM GMT -06:00 US/Canada 
>>>>> Central
>>>>> Subject: Re: [mpich-discuss] nemesis-local-lmt=knem and osu bibw test
>>>>>
>>>>> I think this is a known bug in MPICH2 that has slipped through the 
>>>>> cracks without being fixed or logged in the bug tracker.  The 
>>>>> problem is caused by incorrectly allocating the knem cookie on the 
>>>>> stack instead of the heap in the nemesis LMT code.  I think Darius 
>>>>> has a fix lying around somewhere, but we can cook one up for you 
>>>>> soon even if he doesn't.
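>>>>>
>>>>> Roughly, the hazard looks like the sketch below (invented names, 
>>>>> not our actual code): the cookie handed to the asynchronous knem 
>>>>> request has to outlive the function that submits it, so it belongs 
>>>>> on the heap rather than the stack.
>>>>>
>>>>>     #include <stdlib.h>
>>>>>     #include <string.h>
>>>>>
>>>>>     struct cookie { char bytes[64]; };  /* stands in for the knem cookie */
>>>>>     static struct cookie *pending;      /* consumed later, e.g. by an ioctl */
>>>>>
>>>>>     static void start_lmt_broken(void)
>>>>>     {
>>>>>         struct cookie c;                /* BUG: stack storage */
>>>>>         memset(&c, 0xab, sizeof c);
>>>>>         pending = &c;                   /* dangles once this function returns */
>>>>>     }
>>>>>
>>>>>     static void start_lmt_fixed(void)
>>>>>     {
>>>>>         struct cookie *c = malloc(sizeof *c);  /* heap storage survives */
>>>>>         if (!c) return;
>>>>>         memset(c, 0xab, sizeof *c);
>>>>>         pending = c;                    /* free()d when the request completes */
>>>>>     }
>>>>>
>>>>>     int main(void)
>>>>>     {
>>>>>         start_lmt_fixed();              /* pending is still valid here */
>>>>>         free(pending);
>>>>>         (void)start_lmt_broken;         /* kept only to show the broken pattern */
>>>>>         return 0;
>>>>>     }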
>>>>>
>>>>> I can't speak to the MVAPICH2 bug; you'll have to take that up 
>>>>> with the folks at OSU once we've got the other MPICH2 bug sorted out.
>>>>>
>>>>> Thanks for bringing this (back) to our attention.
>>>>>
>>>>> -Dave
>>>>>
>>>>> On Oct 12, 2010, at 8:43 AM CDT, Jerome Soumagne wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I've recently compiled and installed mpich2-1.3rc2 and mvapich2 
>>>>>> 1.5.1p1 with knem support enabled (using the options 
>>>>>> --with-device=ch3:nemesis --with-nemesis-local-lmt=knem 
>>>>>> --with-knem=/usr/local/knem). The version of knem that I use is 
>>>>>> 0.9.2.
>>>>>>
>>>>>> Doing a cat of /dev/knem gives:
>>>>>> knem 0.9.2
>>>>>> Driver ABI=0xc
>>>>>> Flags: forcing 0x0, ignoring 0x0
>>>>>> DMAEngine: KernelSupported Enabled NoChannelAvailable
>>>>>> Debug: NotBuilt
>>>>>> Requests Submitted          : 119406
>>>>>> Requests Processed/DMA      : 0
>>>>>> Requests Processed/Thread   : 0
>>>>>> Requests Processed/PinLocal : 0
>>>>>> Requests Failed/NoMemory    : 0
>>>>>> Requests Failed/ReadCmd     : 0
>>>>>> Requests Failed/FindRegion  : 6
>>>>>> Requests Failed/Pin         : 0
>>>>>> Requests Failed/MemcpyToUser: 0
>>>>>> Requests Failed/MemcpyPinned: 0
>>>>>> Requests Failed/DMACopy     : 0
>>>>>> Dmacpy Cleanup Timeout      : 0
>>>>>>
>>>>>> I ran several tests using IMB and osu benchmarks. All tests look 
>>>>>> fine (and I get good bandwidth results, comparable to what I 
>>>>>> could get with limic2) except the osu_bibw test from the osu 
>>>>>> benchmarks which throws the following error with mpich2:
>>>>>>
>>>>>> # OSU MPI Bi-Directional Bandwidth Test v3.1.2
>>>>>> # Size     Bi-Bandwidth (MB/s)
>>>>>> 1                         3.41
>>>>>> 2                         7.15
>>>>>> 4                        12.06
>>>>>> 8                        39.66
>>>>>> 16                       73.20
>>>>>> 32                      156.94
>>>>>> 64                      266.58
>>>>>> 128                     370.34
>>>>>> 256                     977.24
>>>>>> 512                    2089.85
>>>>>> 1024                   3498.96
>>>>>> 2048                   5543.29
>>>>>> 4096                   7314.23
>>>>>> 8192                   8381.86
>>>>>> 16384                  9291.81
>>>>>> 32768                  5948.53
>>>>>> Fatal error in PMPI_Waitall: Other MPI error, error stack:
>>>>>> PMPI_Waitall(274)...............: MPI_Waitall(count=64, 
>>>>>> req_array=0xa11a20, status_array=0xe23960) failed
>>>>>> MPIR_Waitall_impl(121)..........:
>>>>>> MPIDI_CH3I_Progress(393)........:
>>>>>> MPID_nem_handle_pkt(573)........:
>>>>>> pkt_RTS_handler(241)............:
>>>>>> do_cts(518).....................:
>>>>>> MPID_nem_lmt_dma_start_recv(365):
>>>>>> MPID_nem_lmt_send_COOKIE(173)...: ioctl failed errno=22 - Invalid 
>>>>>> argument
>>>>>> APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)
>>>>>>
>>>>>> It seems to come from the nemesis source, specifically the 
>>>>>> mpid_nem_lmt_dma.c file which uses knem, but I don't really know 
>>>>>> what is happening, and I don't see anything special in that test, 
>>>>>> which simply measures the bi-directional bandwidth. On another 
>>>>>> machine, I get the following error with mvapich2:
>>>>>>
>>>>>> # OSU MPI Bi-Directional Bandwidth Test v3.1.2
>>>>>> # Size     Bi-Bandwidth (MB/s)
>>>>>> 1                         1.92
>>>>>> 2                         3.86
>>>>>> 4                         7.72
>>>>>> 8                        15.44
>>>>>> 16                       30.75
>>>>>> 32                       61.44
>>>>>> 64                      122.54
>>>>>> 128                     232.62
>>>>>> 256                     416.85
>>>>>> 512                     718.60
>>>>>> 1024                   1148.63
>>>>>> 2048                   1462.37
>>>>>> 4096                   1659.45
>>>>>> 8192                   2305.22
>>>>>> 16384                  3153.85
>>>>>> 32768                  3355.30
>>>>>> APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)
>>>>>>
>>>>>> Attaching gdb gives the following:
>>>>>> Program received signal SIGSEGV, Segmentation fault.
>>>>>> 0x00007f12691e4c99 in MPID_nem_lmt_dma_progress ()
>>>>>>     at 
>>>>>> /project/csvis/soumagne/apps/src/eiger/mvapich2-1.5.1p1/src/mpid/ch3/channels/nemesis/nemesis/src/mpid_nem_lmt_dma.c:484
>>>>>> 484                            prev->next = cur->next;
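>>>>>>
>>>>>> For reference, that statement is the usual unlink step when 
>>>>>> removing an element from a singly-linked queue while walking it; 
>>>>>> simplified (with invented names, not the actual nemesis code) the 
>>>>>> pattern is roughly:
>>>>>>
>>>>>>     struct req { struct req *next; int done; };
>>>>>>
>>>>>>     static void remove_completed(struct req **head)
>>>>>>     {
>>>>>>         struct req *prev = NULL, *cur = *head;
>>>>>>         while (cur) {
>>>>>>             struct req *next = cur->next;
>>>>>>             if (cur->done) {
>>>>>>                 if (prev)
>>>>>>                     prev->next = cur->next;  /* the line gdb points at */
>>>>>>                 else
>>>>>>                     *head = cur->next;       /* removing the list head */
>>>>>>             } else {
>>>>>>                 prev = cur;
>>>>>>             }
>>>>>>             cur = next;
>>>>>>         }
>>>>>>     }
>>>>>>
>>>>>> so a crash on that line suggests that prev (or cur) no longer 
>>>>>> points at a valid request element.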
>>>>>>
>>>>>> Is there something wrong in my mpich2/knem configuration, or does 
>>>>>> anyone know where this problem comes from? (The osu_bibw.c file 
>>>>>> is attached.)
>>>>>>
>>>>>> Thanks in advance
>>>>>>
>>>>>> Jerome
>>>>>>
>>>>>> -- 
>>>>>> Jérôme Soumagne
>>>>>> Scientific Computing Research Group
>>>>>> CSCS, Swiss National Supercomputing Centre
>>>>>> Galleria 2, Via Cantonale  | Tel: +41 (0)91 610 8258
>>>>>> CH-6928 Manno, Switzerland | Fax: +41 (0)91 610 8282
>>>>>>
>>>>>>
>>>>>>
>>>>>>


