[petsc-dev] mpi/cuda issue

Barry Smith bsmith at mcs.anl.gov
Mon Mar 21 22:00:11 CDT 2016


> On Mar 21, 2016, at 9:50 PM, Satish Balay <balay at mcs.anl.gov> wrote:
> 
> BTW: perils of using 'gitcommit=origin/master'
> http://ftp.mcs.anl.gov/pub/petsc/nightlylogs/archive/2016/03/21/master.html
> 
> Perhaps we should switch superlu_dist to use a working snapshot?
>    self.gitcommit        = '35c3b21630d93b3f8392a68e607467c247b5e053'
> 
> balay at asterix /home/balay/petsc (master=)
> $ git grep origin/master config
> config/BuildSystem/config/packages/Chombo.py:    self.gitcommit        = 'origin/master'
> config/BuildSystem/config/packages/SuperLU_DIST.py:    self.gitcommit        = 'origin/master'
> config/BuildSystem/config/packages/amanzi.py:    self.gitcommit        = 'origin/master'
> config/BuildSystem/config/packages/saws.py:    self.gitcommit = 'origin/master'

  Satish and Hong,

Sherry has changed SuperLU_dist so that it no longer has name conflicts with SuperLU, which means we need to update the SuperLU_dist interface to fix these problems. Once things have settled down we can use a release commit instead of master.

  Barry

> 
> Satish
> 
> On Mon, 21 Mar 2016, Satish Balay wrote:
> 
>> Hm - get a gtx 950 [2GB] and replace? [or gtx 970 4GB?]
>> 
>> http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-950/specifications
>> http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-970/specifications
>> 
>> There is a different machine with 2 M2090 cards - so I'll switch the
>> builds to that [es.mcs.anl.gov]. I was previously avoiding builds on
>> it - as it's used as a general-use machine [and perhaps occasionally
>> for benchmark runs].
>> 
>> Satish
>> 
>> On Mon, 21 Mar 2016, Dominic Meiser wrote:
>> 
>>> Hi Jiri,
>>> 
>>> Thanks very much for the fast response.  That's very useful
>>> information.  I had no idea the memory footprint of the contexts
>>> was this large.
>>> 
>>> Satish, Barry, is there any chance we can upgrade the GPU in the
>>> test machine to at least Fermi generation?  That way I can help
>>> much more easily because I'd be able to reproduce your setup
>>> locally.
>>> 
>>> Cheers,
>>> Dominic
>>> 
>>> On Mon, Mar 21, 2016 at 08:01:10PM +0000, Jiri Kraus wrote:
>>>> Hi Dominic,
>>>> 
>>>> I think the error message you get is pretty descriptive regarding the root cause. You are probably running out of GPU memory. Since you are running on a GTX 285 you can't use MPS [1], therefore each MPI process has its own context on the GPU. Each context needs to initialize some data on the GPU (used for local variables and so on). The amount required for this depends on the size of the GPU (it essentially correlates with the maximum number of concurrently active threads) and can easily be 50-100MB. So with only 1GB of GPU memory you are probably using all of the GPU's memory for context data, and nothing is left available for your application. Unfortunately there is no good way to debug this on GeForce cards. On Tesla, nvidia-smi does show you all processes that have a context on a GPU together with their memory consumption.
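
A minimal, hypothetical sketch (not part of the original exchange) of how one could measure this per-context overhead: each MPI rank forces context creation and then queries free device memory with cudaMemGetInfo(). The file name and build line below are assumptions.

    /* Sketch: estimate per-process CUDA context overhead on a shared GPU.
     * Assumed build: nvcc -ccbin mpicc ctx_overhead.c -o ctx_overhead
     * Assumed run:   mpiexec -n <N> ./ctx_overhead                       */
    #include <stdio.h>
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
      int         rank;
      size_t      free_b, total_b;
      cudaError_t err;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* cudaFree(0) forces creation of this process's CUDA context. */
      err = cudaFree(0);
      if (err != cudaSuccess) {
        fprintf(stderr, "[rank %d] context creation failed: %s\n", rank, cudaGetErrorString(err));
        MPI_Abort(MPI_COMM_WORLD, 1);
      }

      /* Free memory drops roughly by the per-context cost for each extra
       * rank sharing the same device. */
      err = cudaMemGetInfo(&free_b, &total_b);
      if (err == cudaSuccess)
        printf("[rank %d] free %zu MiB of %zu MiB\n", rank, free_b/(1024*1024), total_b/(1024*1024));

      MPI_Finalize();
      return 0;
    }
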
>>>> 
>>>> Hope this helps
>>>> 
>>>> Jiri
>>>> 
>>>> 
>>>> [1] https://docs.nvidia.com/deploy/mps/index.html 
>>>> 
>>>>> -----Original Message-----
>>>>> From: Dominic Meiser [mailto:dmeiser at txcorp.com]
>>>>> Sent: Montag, 21. März 2016 19:17
>>>>> To: Jiri Kraus <jkraus at nvidia.com>
>>>>> Cc: Karl Rupp <rupp at iue.tuwien.ac.at>; Barry Smith <bsmith at mcs.anl.gov>;
>>>>> balay at mcs.anl.gov
>>>>> Subject: mpi/cuda issue
>>>>> 
>>>>> Hi Jiri,
>>>>> 
>>>>> Hope things are going well.  We are trying to understand an
>>>>> mpi+cuda issue in the tests of the PETSc library and I was
>>>>> wondering if you could help us out.
>>>>> 
>>>>> The behavior we're seeing is that some of the tests fail intermittently with
>>>>> "out of memory" errors, e.g.
>>>>> 
>>>>> terminate called after throwing an instance of
>>>>> 'thrust::system::detail::bad_alloc'
>>>>>   what():  std::bad_alloc: out of memory
>>>>> 
>>>>> Other tests hang when we oversubscribe the GPU with a largish number of
>>>>> MPI processes (32 in one case).  Satish obtained info on the GPU
>>>>> configuration using nvidia-smi; see below.
>>>>> 
>>>>> Could you remind us what the requirements for MPI+CUDA are, especially
>>>>> regarding oversubscription?
>>>>> 
>>>>> Are there any other tools we can use to debug this problem?  Any
>>>>> suggestions on what we should look at next?
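
One small, assumed debugging aid (not something the thread actually ran): a per-rank probe that mimics the oversubscription pattern, mapping every rank onto the shared GPU and reporting whether a modest cudaMalloc succeeds, so failing ranks can be counted instead of the whole job aborting. File name and allocation size are illustrative choices.

    /* Sketch: per-rank probe for an oversubscribed GPU (e.g. 32 ranks, 1 GB card).
     * Assumed build: nvcc -ccbin mpicc oversub_probe.c -o oversub_probe           */
    #include <stdio.h>
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
      int         rank, ndev = 0;
      void       *buf = NULL;
      cudaError_t err;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      cudaGetDeviceCount(&ndev);
      if (ndev > 0) cudaSetDevice(rank % ndev);   /* every rank shares the GPU(s) */

      err = cudaMalloc(&buf, 16u * 1024 * 1024);  /* 16 MiB per rank */
      printf("[rank %d] cudaMalloc: %s\n", rank, err == cudaSuccess ? "ok" : cudaGetErrorString(err));
      if (err == cudaSuccess) cudaFree(buf);

      MPI_Finalize();
      return 0;
    }
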
>>>>> 
>>>>> Thanks very much in advance.
>>>>> Cheers,
>>>>> Dominic
>>>>> 
>>>>> 
>>>>> 
>>>>> On Mon, Mar 21, 2016 at 01:09:14PM -0500, Satish Balay wrote:
>>>>>> balay at frog ~ $ nvidia-smi
>>>>>> Mon Mar 21 13:07:36 2016
>>>>>> +------------------------------------------------------+
>>>>>> | NVIDIA-SMI 340.93     Driver Version: 340.93         |
>>>>>> |-------------------------------+----------------------+----------------------+
>>>>>> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
>>>>>> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
>>>>>> |===============================+======================+======================|
>>>>>> |   0  GeForce GTX 285     Off  | 0000:03:00.0     N/A |                  N/A |
>>>>>> | 40%   66C    P0    N/A /  N/A |      3MiB /  1023MiB |     N/A      Default |
>>>>>> +-------------------------------+----------------------+----------------------+
>>>>>> 
>>>>>> +-----------------------------------------------------------------------------+
>>>>>> | Compute processes:                                               GPU Memory |
>>>>>> |  GPU       PID  Process name                                     Usage      |
>>>>>> |=============================================================================|
>>>>>> |    0            Not Supported                                               |
>>>>>> +-----------------------------------------------------------------------------+
>>>>>> 
>>>>>> 
>>>>>> balay at frog ~/soft/NVIDIA_CUDA-5.5_Samples/bin/x86_64/linux/release $
>>>>>> ./deviceQuery ./deviceQuery Starting...
>>>>>> 
>>>>>> CUDA Device Query (Runtime API) version (CUDART static linking)
>>>>>> 
>>>>>> Detected 1 CUDA Capable device(s)
>>>>>> 
>>>>>> Device 0: "GeForce GTX 285"
>>>>>>  CUDA Driver Version / Runtime Version          6.5 / 5.5
>>>>>>  CUDA Capability Major/Minor version number:    1.3
>>>>>>  Total amount of global memory:                 1024 MBytes (1073414144 bytes)
>>>>>>  (30) Multiprocessors, (  8) CUDA Cores/MP:     240 CUDA Cores
>>>>>>  GPU Clock rate:                                1476 MHz (1.48 GHz)
>>>>>>  Memory Clock rate:                             1242 Mhz
>>>>>>  Memory Bus Width:                              512-bit
>>>>>>  Maximum Texture Dimension Size (x,y,z)         1D=(8192), 2D=(65536, 32768), 3D=(2048, 2048, 2048)
>>>>>>  Maximum Layered 1D Texture Size, (num) layers  1D=(8192), 512 layers
>>>>>>  Maximum Layered 2D Texture Size, (num) layers  2D=(8192, 8192), 512 layers
>>>>>>  Total amount of constant memory:               65536 bytes
>>>>>>  Total amount of shared memory per block:       16384 bytes
>>>>>>  Total number of registers available per block: 16384
>>>>>>  Warp size:                                     32
>>>>>>  Maximum number of threads per multiprocessor:  1024
>>>>>>  Maximum number of threads per block:           512
>>>>>>  Max dimension size of a thread block (x,y,z): (512, 512, 64)
>>>>>>  Max dimension size of a grid size    (x,y,z): (65535, 65535, 1)
>>>>>>  Maximum memory pitch:                          2147483647 bytes
>>>>>>  Texture alignment:                             256 bytes
>>>>>>  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
>>>>>>  Run time limit on kernels:                     No
>>>>>>  Integrated GPU sharing Host Memory:            No
>>>>>>  Support host page-locked memory mapping:       Yes
>>>>>>  Alignment requirement for Surfaces:            Yes
>>>>>>  Device has ECC support:                        Disabled
>>>>>>  Device supports Unified Addressing (UVA):      No
>>>>>>  Device PCI Bus ID / PCI location ID:           3 / 0
>>>>>>  Compute Mode:
>>>>>>     < Default (multiple host threads can use ::cudaSetDevice() with
>>>>>> device simultaneously) >
>>>>>> 
>>>>>> deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 6.5, CUDA Runtime Version = 5.5, NumDevs = 1, Device0 = GeForce GTX 285
>>>>>> Result = PASS
>>>>>> 
>>>>>> balay at frog ~/soft/NVIDIA_CUDA-5.5_Samples/bin/x86_64/linux/release $
>>>>>> 
>>>>>> On Mon, 21 Mar 2016, Dominic Meiser wrote:
>>>>>> 
>>>>>>> I have used oversubscription of GPUs fairly routinely, but it
>>>>>>> requires driver support (and I think at some point it also required
>>>>>>> a patched mpich, but that requirement is gone AFAIK).  I don't
>>>>>>> remember what driver version is needed.  Can you get the driver
>>>>>>> version on the test machine with nvidia-smi?
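
For reference, the same information nvidia-smi prints can also be read programmatically with the standard CUDA runtime calls; a tiny sketch (hypothetical file, not from the thread), where the versions are encoded as 1000*major + 10*minor:

    /* Sketch: print CUDA driver/runtime versions (e.g. 6050 decodes to 6.5). */
    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
      int driver = 0, runtime = 0;
      cudaDriverGetVersion(&driver);
      cudaRuntimeGetVersion(&runtime);
      printf("CUDA driver %d.%d, runtime %d.%d\n",
             driver/1000, (driver%100)/10, runtime/1000, (runtime%100)/10);
      return 0;
    }
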
>>>>>>> 
>>>>>>> Also, oversubscription by such a large factor could be an issue.
>>>>>>> But given that the example doesn't actually use GPUs, one would hope
>>>>>>> that it shouldn't matter ...
>>>>>>> 
>>>>>>> Karl, have you been able to reproduce this issue on a different
>>>>>>> machine?  Or any idea what's needed to reproduce the failures?
>>>>>>> I can try and hunt down a sm_13 GPU but if there's an easier way to
>>>>>>> reproduce that would be great.
>>>>>>> 
>>>>>>> Cheers,
>>>>>>> Dominic
>>>>>>> 
>>>>>>> 
>>>>>>> On Mon, Mar 21, 2016 at 12:11:08PM -0500, Satish Balay wrote:
>>>>>>>> I attempted to manually run the tests after the reboot - and then
>>>>>>>> they crashed/hanged
>>>>>>>> at:
>>>>>>>> 
>>>>>>>> [14]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
>>>>>>>> [14]PETSC ERROR: Error in external library
>>>>>>>> [14]PETSC ERROR: CUBLAS error 1
>>>>>>>> [14]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
>>>>>>>> [14]PETSC ERROR: Petsc Development GIT revision: pre-tsfc-2225-g6da9565  GIT Date: 2016-03-20 23:47:14 -0500
>>>>>>>> [14]PETSC ERROR: ./ex36 on a arch-cuda-double named frog by balay Mon Mar 21 10:49:24 2016
>>>>>>>> [14]PETSC ERROR: Configure options --with-cuda=1 --with-cusp=1 -with-cusp-dir=/home/balay/soft/cusplibrary-0.4.0 --with-thrust=1 --with-precision=double --with-clanguage=c --with-cuda-arch=sm_13 --with-no-output -PETSC_ARCH=arch-cuda-double -PETSC_DIR=/home/balay/petsc.clone
>>>>>>>> [14]PETSC ERROR: #1 PetscInitialize() line 922 in /home/balay/petsc.clone/src/sys/objects/pinit.c
>>>>>>>> 
>>>>>>>> 
>>>>>>>> This one does: 'mpiexec -n 32 ./ex36'
>>>>>>>> 
>>>>>>>> Is such oversubscription of the GPU supposed to work? BTW: I don't
>>>>>>>> think this example is using CUDA [but there is still cublas
>>>>>>>> initialization?]
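
For context, "CUBLAS error 1" most likely corresponds to CUBLAS_STATUS_NOT_INITIALIZED, which cublasCreate() returns when it cannot set up its GPU resources. A minimal standalone check (hypothetical file name, not PETSc code) that reproduces just that per-process handle creation:

    /* Sketch: standalone cuBLAS handle creation, to be run under mpiexec.
     * Assumed build: nvcc check_cublas.c -lcublas -o check_cublas          */
    #include <stdio.h>
    #include <cublas_v2.h>

    int main(void)
    {
      cublasHandle_t handle;
      cublasStatus_t stat = cublasCreate(&handle);

      if (stat != CUBLAS_STATUS_SUCCESS) {
        /* prints the raw status value, e.g. 1 for CUBLAS_STATUS_NOT_INITIALIZED */
        fprintf(stderr, "cublasCreate failed with status %d\n", (int)stat);
        return 1;
      }
      printf("cublasCreate succeeded\n");
      cublasDestroy(handle);
      return 0;
    }
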
>>>>>>>> 
>>>>>>>> I've rebooted the machine again - and the 'day' builds have just started..
>>>>>>>> 
>>>>>>>> Satish
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Mon, 21 Mar 2016, Karl Rupp wrote:
>>>>>>>> 
>>>>>>>>> Hi,
>>>>>>>>> 
>>>>>>>>> the reboot may help, yes. I've observed such weird test failures
>>>>>>>>> twice over the years. In both cases they were gone after
>>>>>>>>> powering the machine off and powering it on again (at least in
>>>>>>>>> one case it was not sufficient to just reboot).
>>>>>>>>> 
>>>>>>>>> Best regards,
>>>>>>>>> Karli
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On 03/21/2016 04:38 PM, Satish Balay wrote:
>>>>>>>>>> The test mode [on this machine] didn't change in the past few months..
>>>>>>>>>> 
>>>>>>>>>> I've rebooted the box now..
>>>>>>>>>> 
>>>>>>>>>> Satish
>>>>>>>>>> 
>>>>>>>>>> On Mon, 21 Mar 2016, Dominic Meiser wrote:
>>>>>>>>>> 
>>>>>>>>>>> Really odd that these out-of-memory errors are occurring now.
>>>>>>>>>>> AFAIK nothing related to this has changed in the code.  Are
>>>>>>>>>>> the tests run any differently?  Perhaps more tests in
>>>>>>>>>>> parallel?  Is it possible to reset the driver or to reboot the test machine?
>>>>>>>>>>> 
>>>>>>>>>>> Dominic
>>>>>>>>>>> 
>>>>>>>>>>> On Sun, Mar 20, 2016 at 09:12:36PM -0500, Barry Smith wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> ftp://ftp.mcs.anl.gov/pub/petsc/nightlylogs/archive/2016/0
>>>>>>>>>>>> 3/20/examples_master_arch-cuda-double_frog.log
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> --
>>>>> Dominic Meiser
>>>>> Tech-X Corporation - 5621 Arapahoe Avenue - Boulder, CO 80303
>>>> NVIDIA GmbH, Wuerselen, Germany, Amtsgericht Aachen, HRB 8361
>>>> Managing Director: Karen Theresa Burns
>>>> 
>>>> -----------------------------------------------------------------------------------
>>>> This email message is for the sole use of the intended recipient(s) and may contain
>>>> confidential information.  Any unauthorized review, use, disclosure or distribution
>>>> is prohibited.  If you are not the intended recipient, please contact the sender by
>>>> reply email and destroy all copies of the original message.
>>>> -----------------------------------------------------------------------------------
>>> 



