[mpich-discuss] nemesis-local-lmt=knem and osu bibw test

Jerome Soumagne soumagne at cscs.ch
Wed Oct 13 03:09:25 CDT 2010


  Hi Dave,

On 10/13/2010 12:54 AM, Dave Goodell wrote:
> Darius just fixed this in the trunk: https://trac.mcs.anl.gov/projects/mpich2/changeset/7334
>
> mpich2-1.3 final will contain this fix.  You can also get a fixed version via the nightly snapshots after tonight: http://www.mcs.anl.gov/research/projects/mpich2/downloads/tarballs/nightly/trunk/
Ok, I'm happy to see that this fix will come in mpich2-1.3 and not only 
in mpich2-1.3.1.
Thanks for the quick response!
> -Dave
Jerome
> On Oct 12, 2010, at 11:43 AM CDT, Jerome Soumagne wrote:
>
>> ok thanks. I'd be glad if you could find a patch to fix that; I don't see any in the open ticket. Did I miss something?
>>
>> Since I use the nemesis module in mvapich2 as well, I would expect this problem to be solved there too once it's fixed in mpich2 (even if things rarely work out that way)
>>
>> Jerome
>>
>> On 10/12/2010 06:27 PM, Dave Goodell wrote:
>>> Ahh yes, this is the same issue, it just didn't explicitly mention knem so I didn't find it in my ticket search.  Thanks for pointing it out.
>>>
>>> -Dave
>>>
>>> On Oct 12, 2010, at 10:56 AM CDT, Jayesh Krishna wrote:
>>>
>>>> FYI, https://trac.mcs.anl.gov/projects/mpich2/ticket/1039
>>>>
>>>> -Jayesh
>>>> ----- Original Message -----
>>>> From: "Dave Goodell"<goodell at mcs.anl.gov>
>>>> To: mpich-discuss at mcs.anl.gov
>>>> Sent: Tuesday, October 12, 2010 10:55:04 AM GMT -06:00 US/Canada Central
>>>> Subject: Re: [mpich-discuss] nemesis-local-lmt=knem and osu bibw test
>>>>
>>>> I think this is a known bug in MPICH2 that has slipped through the cracks without being fixed or logged in the bug tracker.  The problem is caused by incorrectly allocating the knem cookie on the stack instead of the heap in the nemesis LMT code.  I think Darius has a fix lying around somewhere, but we can cook one up for you soon even if he doesn't.
>>>>
>>>> I can't speak to the MVAPICH2 bug; you'll have to take that up with the folks at OSU once we've got the MPICH2 bug sorted out.
>>>>
>>>> Thanks for bringing this (back) to our attention.
>>>>
>>>> -Dave
>>>>
>>>> On Oct 12, 2010, at 8:43 AM CDT, Jerome Soumagne wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I've recently compiled and installed mpich2-1.3rc2 and mvapich2 1.5.1p1 with knem support enabled (using the options --with-device=ch3:nemesis --with-nemesis-local-lmt=knem --with-knem=/usr/local/knem). The version of knem that I use is 0.9.2
>>>>>
>>>>> Doing a cat of /dev/knem gives:
>>>>> knem 0.9.2
>>>>> Driver ABI=0xc
>>>>> Flags: forcing 0x0, ignoring 0x0
>>>>> DMAEngine: KernelSupported Enabled NoChannelAvailable
>>>>> Debug: NotBuilt
>>>>> Requests Submitted          : 119406
>>>>> Requests Processed/DMA      : 0
>>>>> Requests Processed/Thread   : 0
>>>>> Requests Processed/PinLocal : 0
>>>>> Requests Failed/NoMemory    : 0
>>>>> Requests Failed/ReadCmd     : 0
>>>>> Requests Failed/FindRegion  : 6
>>>>> Requests Failed/Pin         : 0
>>>>> Requests Failed/MemcpyToUser: 0
>>>>> Requests Failed/MemcpyPinned: 0
>>>>> Requests Failed/DMACopy     : 0
>>>>> Dmacpy Cleanup Timeout      : 0
>>>>>
>>>>> I ran several tests using IMB and osu benchmarks. All tests look fine (and I get good bandwidth results, comparable to what I could get with limic2) except the osu_bibw test from the osu benchmarks which throws the following error with mpich2:
>>>>>
>>>>> # OSU MPI Bi-Directional Bandwidth Test v3.1.2
>>>>> # Size     Bi-Bandwidth (MB/s)
>>>>> 1                         3.41
>>>>> 2                         7.15
>>>>> 4                        12.06
>>>>> 8                        39.66
>>>>> 16                       73.20
>>>>> 32                      156.94
>>>>> 64                      266.58
>>>>> 128                     370.34
>>>>> 256                     977.24
>>>>> 512                    2089.85
>>>>> 1024                   3498.96
>>>>> 2048                   5543.29
>>>>> 4096                   7314.23
>>>>> 8192                   8381.86
>>>>> 16384                  9291.81
>>>>> 32768                  5948.53
>>>>> Fatal error in PMPI_Waitall: Other MPI error, error stack:
>>>>> PMPI_Waitall(274)...............: MPI_Waitall(count=64, req_array=0xa11a20, status_array=0xe23960) failed
>>>>> MPIR_Waitall_impl(121)..........:
>>>>> MPIDI_CH3I_Progress(393)........:
>>>>> MPID_nem_handle_pkt(573)........:
>>>>> pkt_RTS_handler(241)............:
>>>>> do_cts(518).....................:
>>>>> MPID_nem_lmt_dma_start_recv(365):
>>>>> MPID_nem_lmt_send_COOKIE(173)...: ioctl failed errno=22 - Invalid argument
>>>>> APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)
>>>>>
>>>>> It seems to come from the nemesis source, in the mpid_nem_lmt_dma.c file which uses knem, but I don't really know what happens, and I don't see anything special in that test, which measures the bi-directional bandwidth. On another machine, I get the following error with mvapich2:
>>>>>
>>>>> # OSU MPI Bi-Directional Bandwidth Test v3.1.2
>>>>> # Size     Bi-Bandwidth (MB/s)
>>>>> 1                         1.92
>>>>> 2                         3.86
>>>>> 4                         7.72
>>>>> 8                        15.44
>>>>> 16                       30.75
>>>>> 32                       61.44
>>>>> 64                      122.54
>>>>> 128                     232.62
>>>>> 256                     416.85
>>>>> 512                     718.60
>>>>> 1024                   1148.63
>>>>> 2048                   1462.37
>>>>> 4096                   1659.45
>>>>> 8192                   2305.22
>>>>> 16384                  3153.85
>>>>> 32768                  3355.30
>>>>> APPLICATION TERMINATED WITH THE EXIT STRING: Terminated (signal 15)
>>>>>
>>>>> Attaching gdb gives the following:
>>>>> Program received signal SIGSEGV, Segmentation fault.
>>>>> 0x00007f12691e4c99 in MPID_nem_lmt_dma_progress ()
>>>>>     at /project/csvis/soumagne/apps/src/eiger/mvapich2-1.5.1p1/src/mpid/ch3/channels/nemesis/nemesis/src/mpid_nem_lmt_dma.c:484
>>>>> 484                            prev->next = cur->next;
>>>>>
>>>>> Is there something wrong in my mpich2/knem configuration, or does anyone know where this problem comes from? (The osu_bibw.c file is attached.)
>>>>>
>>>>> Thanks in advance
>>>>>
>>>>> Jerome
>>>>>
>>>>> -- 
>>>>> Jérôme Soumagne
>>>>> Scientific Computing Research Group
>>>>> CSCS, Swiss National Supercomputing Centre
>>>>> Galleria 2, Via Cantonale  | Tel: +41 (0)91 610 8258
>>>>> CH-6928 Manno, Switzerland | Fax: +41 (0)91 610 8282
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> <osu_bibw.c>
>>>>> _______________________________________________
>>>>> mpich-discuss mailing list
>>>>> mpich-discuss at mcs.anl.gov
>>>>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss


