[petsc-users] petsc on Cori Haswell

David Trebotich dptrebotich at lbl.gov
Wed Apr 15 16:26:51 CDT 2020


Matt is correct on his point 2.

And I'll get fresh output to send your way. Stay tuned.

On Wed, Apr 15, 2020, 2:21 PM Junchao Zhang <junchao.zhang at gmail.com> wrote:

> I want to know who called MPI_Init(). PETSc or Chombo?
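>
> One way to tell (a minimal sketch, not from either code base): MPI_Initialized() may legally be called before MPI_Init(), and PetscInitialize() only calls MPI_Init() itself when MPI is not already initialized, so a check at the top of main() answers the question:
>
> #include <petscsys.h>
> #include <stdio.h>
>
> int main(int argc, char **argv)
> {
>   int initialized;
>
>   MPI_Initialized(&initialized);   /* legal before MPI_Init() */
>   printf("before PetscInitialize: MPI initialized = %d\n", initialized);
>
>   if (PetscInitialize(&argc, &argv, NULL, NULL)) return 1;
>   /* ... application ... */
>   return PetscFinalize();
> }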
> --Junchao Zhang
>
>
> On Wed, Apr 15, 2020 at 4:13 PM Matthew Knepley <knepley at gmail.com> wrote:
>
>> On Wed, Apr 15, 2020 at 5:10 PM Junchao Zhang <junchao.zhang at gmail.com>
>> wrote:
>>
>>> Was there a PETSc error stack?
>>>
>>
>> 1) SNES ex5 is a highly scalable problem. Just give it a large enough m
>> and n (see the launch sketch below).
>>
>> 2) Junchao, it looks like MPI_Init() is failing, which I believe comes
>> before we install our signal handler to get us the stack.
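>>
>> For point 1, a hypothetical launch line (option names assumed from the
>> DMDA-based ex5) would be something like
>>
>>   srun -n 65536 ./ex5 -da_grid_x 8192 -da_grid_y 8192 -snes_monitor
>>
>> and -da_refine is another way to grow the problem.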
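>>
>> On point 2: since the crash is inside MPI_Init(), one way to make that
>> unambiguous is to initialize MPI in the application before PETSc. A
>> minimal sketch (an app-level workaround, not something already in the
>> code):
>>
>> #include <petscsys.h>
>>
>> int main(int argc, char **argv)
>> {
>>   /* If this fails, no PETSc code has run yet: the fault is MPI/PMI */
>>   if (MPI_Init(&argc, &argv) != MPI_SUCCESS) return 1;
>>
>>   /* PetscInitialize() sees MPI is up and skips its own MPI_Init() */
>>   if (PetscInitialize(&argc, &argv, NULL, NULL)) return 1;
>>   /* ... */
>>   PetscFinalize();
>>   return MPI_Finalize();  /* ours to call, since we called MPI_Init() */
>> }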
>>
>>   Thanks,
>>
>>      Matt
>>
>>
>>> --Junchao Zhang
>>>
>>>
>>> On Wed, Apr 15, 2020 at 3:41 PM Mark Adams <mfadams at lbl.gov> wrote:
>>>
>>>> Whoops, this is actually Cori-KNL.
>>>>
>>>> On Wed, Apr 15, 2020 at 4:33 PM Mark Adams <mfadams at lbl.gov> wrote:
>>>>
>>>>> We have a problem when going from 32K to 64K cores on Cori-Haswell.
>>>>> Does anyone have any thoughts?
>>>>> Thanks,
>>>>> Mark
>>>>>
>>>>> ---------- Forwarded message ---------
>>>>> From: David Trebotich <dptrebotich at lbl.gov>
>>>>> Date: Wed, Apr 15, 2020 at 4:20 PM
>>>>> Subject: Re: petsc on Cori Haswell
>>>>> To: Mark Adams <mfadams at lbl.gov>
>>>>>
>>>>>
>>>>> Hey Mark-
>>>>> I am running into some issues that I am convinced come from the PETSc
>>>>> build. I can build and run on up to 32K cores, but at 64K I start
>>>>> getting errors like the log below (it looks like two issues: PMI
>>>>> bootstrap failures and MPI_Init). I have been working with Brian
>>>>> Friesen to see whether it's a NERSC problem. When I build without
>>>>> PETSc and run Chombo's native GMG, I have no problems; the errors only
>>>>> appear when building with PETSc, and only at the larger concurrencies.
>>>>> The only thing that has changed is that this is a new PETSc
>>>>> installation. Perhaps something changed in the PETSc version relative
>>>>> to the one you built previously? Thanks for the help.
>>>>> Treb
>>>>>
>>>>> Mon Apr 13 17:49:45 2020: [PE_101955]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=33, pes_this_node=64, timeout=180 secs
>>>>> Mon Apr 13 17:49:45 2020: [PE_101958]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=33, pes_this_node=64, timeout=180 secs
>>>>> Mon Apr 13 17:49:45 2020: [PE_101958]:_pmi_init:_pmi_mmap_init returned -1
>>>>> Mon Apr 13 17:49:45 2020: [PE_101979]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=33, pes_this_node=64, timeout=180 secs
>>>>> Mon Apr 13 17:49:45 2020: [PE_101979]:_pmi_init:_pmi_mmap_init returned -1
>>>>> Mon Apr 13 17:49:45 2020: [PE_82712]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=28, pes_this_node=64, timeout=180 secs
>>>>> Mon Apr 13 17:49:45 2020: [PE_17868]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=32, pes_this_node=64, timeout=180 secs
>>>>> Mon Apr 13 17:49:45 2020: [PE_97918]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=33, pes_this_node=64, timeout=180 secs
>>>>> Mon Apr 13 17:49:45 2020: [PE_17869]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=32, pes_this_node=64, timeout=180 secs
>>>>> Mon Apr 13 17:49:45 2020: [PE_17869]:_pmi_init:_pmi_mmap_init returned -1
>>>>> Mon Apr 13 17:49:45 2020: [PE_110562]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=27, pes_this_node=64, timeout=180 secs
>>>>> Mon Apr 13 17:49:45 2020: [PE_110562]:_pmi_init:_pmi_mmap_init returned -1
>>>>> Mon Apr 13 17:49:45 2020: [PE_110563]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=27, pes_this_node=64, timeout=180 secs
>>>>> Mon Apr 13 17:49:45 2020: [PE_27899]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=38, pes_this_node=64, timeout=180 secs
>>>>> [Mon Apr 13 17:49:45 2020] [c7-4c1s6n0] Fatal error in MPI_Init: Other MPI error, error stack:
>>>>> MPIR_Init_thread(537):
>>>>> MPID_Init(246).......: channel initialization failed
>>>>> MPID_Init(647).......:  PMI2 init failed: 1
>>>>> Attempting to use an MPI routine before initializing MPICH
>>>>> [Mon Apr 13 17:49:45 2020] [c7-4c1s6n0] Fatal error in MPI_Init: Other MPI error, error stack:
>>>>> MPIR_Init_thread(537):
>>>>> MPID_Init(246).......: channel initialization failed
>>>>> MPID_Init(647).......:  PMI2 init failed: 1
>>>>> Attempting to use an MPI routine before initializing MPICH
>>>>> Mon Apr 13 17:49:45 2020: [PE_71961]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=35, pes_this_node=64, timeout=180 secs
>>>>> Mon Apr 13 17:49:45 2020: [PE_71961]:_pmi_init:_pmi_mmap_init returned -1
>>>>> Mon Apr 13 17:49:45 2020: [PE_71962]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=35, pes_this_node=64, timeout=180 secs
>>>>> Mon Apr 13 17:49:45 2020: [PE_64329]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=32, pes_this_node=64, timeout=180 secs
>>>>> Mon Apr 13 17:49:45 2020: [PE_64335]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=32, pes_this_node=64, timeout=180 secs
>>>>> Mon Apr 13 17:49:45 2020: [PE_64335]:_pmi_init:_pmi_mmap_init returned -1
>>>>> [Mon Apr 13 17:49:45 2020] [c6-1c2s5n2] Fatal error in MPI_Init: Other MPI error, error stack:
>>>>> MPIR_Init_thread(537):
>>>>> MPID_Init(246).......: channel initialization failed
>>>>> MPID_Init(647).......:  PMI2 init failed: 1
>>>>> Attempting to use an MPI routine before initializing MPICH
>>>>> [Mon Apr 13 17:49:45 2020] [c9-4c2s13n2] Fatal error in MPI_Init: Other MPI error, error stack:
>>>>> MPIR_Init_thread(537):
>>>>> MPID_Init(246).......: channel initialization failed
>>>>> MPID_Init(647).......:  PMI2 init failed: 1
>>>>> Attempting to use an MPI routine before initializing MPICH
>>>>> Mon Apr 13 17:49:45 2020: [PE_71960]:_pmi_mmap_tmp: Warning bootstrap barrier failed: num_syncd=35, pes_this_node=64, timeout=180 secs
>>>>> Mon Apr 13 17:49:45 2020: [PE_71960]:_pmi_init:_pmi_mmap_init returned -1
>>>>> [Mon Apr 13 17:49:45 2020] [c6-3c2s9n1] Fatal error in MPI_Init: Other MPI error, error stack:
>>>>> MPIR_Init_thread(537):
>>>>> MPID_Init(246).......: channel initialization failed
>>>>>
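>>>>> PS: to take PETSc out of the picture entirely, a bare MPI_Init() test
>>>>> at the same concurrency should reproduce this if it is really a PMI
>>>>> bootstrap problem. A minimal sketch (compiler wrapper and launch
>>>>> options assumed):
>>>>>
>>>>> #include <mpi.h>
>>>>> #include <stdio.h>
>>>>>
>>>>> int main(int argc, char **argv)
>>>>> {
>>>>>   int rank, size;
>>>>>   MPI_Init(&argc, &argv);  /* the call that fails in the log above */
>>>>>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>   MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>>   if (rank == 0) printf("MPI up on %d ranks\n", size);
>>>>>   MPI_Finalize();
>>>>>   return 0;
>>>>> }
>>>>>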
>>>>>
>>
>> --
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which their
>> experiments lead.
>> -- Norbert Wiener
>>
>> https://www.cse.buffalo.edu/~knepley/
>>
>