[petsc-users] [petsc-maint] DMSwarm on multiple processors
Matthew Knepley
knepley at gmail.com
Tue Dec 19 07:29:58 CST 2023
On Tue, Dec 19, 2023 at 5:11 AM Joauma Marichal <
joauma.marichal at uclouvain.be> wrote:
> Hello,
>
>
>
> I have used Address Sanitizer to check for memory errors. On my computer,
> no errors are found. Unfortunately, on the supercomputer that I am using, I
> get lots of errors… I have attached my log files (running on 1 and 70 procs).
>
> Do you have any idea of what I could do?
>
On your own computer, run the same parallel configuration as you do on the
supercomputer. If that is fine, I would suggest Address Sanitizer there.
Something is corrupting the stack, and it appears to be connected to that
machine rather than to the library. Do you have access to a second parallel
machine?
Thanks,
Matt
> Thanks a lot for your help.
>
>
>
> Best regards,
>
>
>
> Joauma
>
>
>
> *From: *Matthew Knepley <knepley at gmail.com>
> *Date: *Monday, 18 December 2023 at 12:00
> *To: *Joauma Marichal <joauma.marichal at uclouvain.be>
> *Cc: *petsc-maint at mcs.anl.gov <petsc-maint at mcs.anl.gov>,
> petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
> *Subject: *Re: [petsc-maint] DMSwarm on multiple processors
>
> On Mon, Dec 18, 2023 at 5:09 AM Joauma Marichal <
> joauma.marichal at uclouvain.be> wrote:
>
> Hello,
>
>
>
> Sorry for the delay. I attach the file that I obtain when running the code
> in debug mode.
>
>
>
> Okay, we can now see where this is happening:
>
>
>
> malloc_consolidate(): invalid chunk size
> [cns263:3265170] *** Process received signal ***
> [cns263:3265170] Signal: Aborted (6)
> [cns263:3265170] Signal code: (-6)
> [cns263:3265170] [ 0] /lib64/libc.so.6(+0x4eb20)[0x7f3bd9148b20]
> [cns263:3265170] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7f3bd9148a9f]
> [cns263:3265170] [ 2] /lib64/libc.so.6(abort+0x127)[0x7f3bd911be05]
> [cns263:3265170] [ 3] /lib64/libc.so.6(+0x91037)[0x7f3bd918b037]
> [cns263:3265170] [ 4] /lib64/libc.so.6(+0x9819c)[0x7f3bd919219c]
> [cns263:3265170] [ 5] /lib64/libc.so.6(+0x98b68)[0x7f3bd9192b68]
> [cns263:3265170] [ 6] /lib64/libc.so.6(+0x9af18)[0x7f3bd9194f18]
> [cns263:3265170] [ 7] /lib64/libc.so.6(__libc_malloc+0x1e2)[0x7f3bd9196822]
> [cns263:3265170] [ 8] /lib64/libc.so.6(posix_memalign+0x3c)[0x7f3bd91980fc]
> [cns263:3265170] [ 9]
> /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(PetscMallocAlign+0x45)[0x7f3bda5f1625]
> [cns263:3265170] [10]
> /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(PetscMallocA+0x297)[0x7f3bda5f1b07]
> [cns263:3265170] [11]
> /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(DMCreate+0x5b)[0x7f3bdaa73c1b]
> [cns263:3265170] [12]
> /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(DMDACreate+0x9)[0x7f3bdab0a2f9]
> [cns263:3265170] [13]
> /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(DMDACreate3d+0x9a)[0x7f3bdab07dea]
> [cns263:3265170] [14] ./cobpor[0x402de8]
> [cns263:3265170] [15]
> /lib64/libc.so.6(__libc_start_main+0xf3)[0x7f3bd9134cf3]
> [cns263:3265170] [16] ./cobpor[0x40304e]
> [cns263:3265170] *** End of error message ***
>
>
>
> However, this is not great. First, the amount of memory being allocated is
> quite small, and this does not appear to be an Out of Memory error. Second,
> the error occurs in libc:
>
>
>
> malloc_consolidate(): invalid chunk size
>
>
>
> which means something is wrong internally. I agree with this analysis (
> https://stackoverflow.com/questions/18760999/sample-example-program-to-get-the-malloc-consolidate-error)
> that says you have probably overwritten memory somewhere in your code. I
> recommend running under valgrind, or using Address Sanitizer from clang.
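
To make the "overwritten memory" diagnosis concrete, here is a minimal sketch in plain C. It is not taken from the code discussed in this thread. The helper names (guarded_alloc, guard_intact, buggy_fill, correct_fill) and the guard size are hypothetical; none of them are PETSc or glibc functions. The sketch places a few guard bytes after a buffer and shows that an off-by-one loop changes them. In a real heap, the bytes just past a buffer typically hold allocator bookkeeping, which is why the abort can show up later, inside an unrelated malloc or free.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define GUARD_BYTES 8    /* hypothetical guard size, chosen for illustration */
#define GUARD_VALUE 0xAB /* hypothetical fill pattern for the guard zone */

/* Allocate n payload bytes followed by GUARD_BYTES guard bytes in ONE block.
   Returns NULL if the underlying malloc fails. */
unsigned char *guarded_alloc(size_t n)
{
  unsigned char *p = malloc(n + GUARD_BYTES);
  if (p == NULL) return NULL;
  memset(p, 0, n);                         /* zero the payload */
  memset(p + n, GUARD_VALUE, GUARD_BYTES); /* mark the guard zone */
  return p;
}

/* Return 1 if every guard byte after the n-byte payload is untouched, else 0. */
int guard_intact(const unsigned char *p, size_t n)
{
  assert(p != NULL);
  for (size_t i = 0; i < GUARD_BYTES; i++) {
    if (p[n + i] != GUARD_VALUE) return 0;
  }
  return 1;
}

/* Off-by-one bug: "<=" writes count + 1 bytes. With a buffer from
   guarded_alloc(count), the extra byte lands in the guard zone, which is
   still inside the allocated block, so this demo itself stays well-defined. */
void buggy_fill(unsigned char *p, size_t count)
{
  for (size_t i = 0; i <= count; i++) p[i] = 0;
}

/* Correct loop: writes exactly count bytes and leaves the guard zone alone. */
void correct_fill(unsigned char *p, size_t count)
{
  for (size_t i = 0; i < count; i++) p[i] = 0;
}
```

Calling buggy_fill(p, 16) on a buffer from guarded_alloc(16) makes guard_intact(p, 16) return 0, whereas correct_fill(p, 16) leaves it at 1. Valgrind and Address Sanitizer catch the same class of bug without hand-written guards, and they report the offending write itself rather than a later crash.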
>
>
>
> Thanks,
>
>
>
> Matt
>
>
>
> Thanks for your help.
>
>
>
> Best regards,
>
>
>
> Joauma
>
>
>
> *From: *Matthew Knepley <knepley at gmail.com>
> *Date: *Thursday, 23 November 2023 at 15:32
> *To: *Joauma Marichal <joauma.marichal at uclouvain.be>
> *Cc: *petsc-maint at mcs.anl.gov <petsc-maint at mcs.anl.gov>,
> petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
> *Subject: *Re: [petsc-maint] DMSwarm on multiple processors
>
> On Thu, Nov 23, 2023 at 9:01 AM Joauma Marichal <
> joauma.marichal at uclouvain.be> wrote:
>
> Hello,
>
>
>
> My problem persists… Is there anything I could try?
>
>
>
> Yes. It appears to be failing from a call inside PetscSFSetUpRanks(). It
> does allocation, and the failure is in libc, and it only happens on larger
> examples, so I suspect some allocation problem. Can you rebuild with
> debugging and run this example? Then we can see if the allocation fails.
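
The distinction matters when reading the libc aborts quoted in this thread: a request that the system cannot satisfy makes malloc return NULL, which a program can report cleanly, whereas an abort raised inside libc's own malloc or free usually points to damaged heap bookkeeping. Below is a minimal sketch in plain C of the first case. The wrapper checked_malloc and the macro CHECKED_MALLOC are hypothetical names for illustration, not PETSc functions, and this is not a description of how a PETSc debug build works.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical wrapper: call malloc and, if the request fails, print the
   requested size and the call site to stderr. Returns malloc's result. */
void *checked_malloc(size_t nbytes, const char *file, int line)
{
  assert(file != NULL);
  void *p = malloc(nbytes);
  if (p == NULL && nbytes > 0) {
    fprintf(stderr, "allocation of %zu bytes failed at %s:%d\n",
            nbytes, file, line);
  }
  return p;
}

/* Convenience macro that records the caller's file and line. */
#define CHECKED_MALLOC(nbytes) checked_malloc((nbytes), __FILE__, __LINE__)
```

For example, CHECKED_MALLOC(64) returns a usable pointer, while a request for (size_t)-1 bytes is expected to return NULL on mainstream platforms and print the file and line of the call. A crash such as "free(): invalid size" never reaches this path, which is one reason to suspect an overwrite rather than a shortage of memory.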
>
>
>
> Thanks,
>
> Matt
>
>
>
> Thanks a lot.
>
>
>
> Best regards,
>
>
>
> Joauma
>
>
>
> *From: *Matthew Knepley <knepley at gmail.com>
> *Date: *Wednesday, 25 October 2023 at 14:45
> *To: *Joauma Marichal <joauma.marichal at uclouvain.be>
> *Cc: *petsc-maint at mcs.anl.gov <petsc-maint at mcs.anl.gov>,
> petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
> *Subject: *Re: [petsc-maint] DMSwarm on multiple processors
>
> On Wed, Oct 25, 2023 at 8:32 AM Joauma Marichal via petsc-maint <
> petsc-maint at mcs.anl.gov> wrote:
>
> Hello,
>
>
>
> I am using the DMSwarm library in an Eulerian-Lagrangian approach to
> model vapor bubbles in water.
>
> I have obtained nice results recently and wanted to perform bigger
> simulations. Unfortunately, when I increase the number of processors used
> to run the simulation, I get the following error:
>
>
>
> free(): invalid size
> [cns136:590327] *** Process received signal ***
> [cns136:590327] Signal: Aborted (6)
> [cns136:590327] Signal code: (-6)
> [cns136:590327] [ 0] /lib64/libc.so.6(+0x4eb20)[0x7f56cd4c9b20]
> [cns136:590327] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7f56cd4c9a9f]
> [cns136:590327] [ 2] /lib64/libc.so.6(abort+0x127)[0x7f56cd49ce05]
> [cns136:590327] [ 3] /lib64/libc.so.6(+0x91037)[0x7f56cd50c037]
> [cns136:590327] [ 4] /lib64/libc.so.6(+0x9819c)[0x7f56cd51319c]
> [cns136:590327] [ 5] /lib64/libc.so.6(+0x99aac)[0x7f56cd514aac]
> [cns136:590327] [ 6]
> /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(PetscSFSetUpRanks+0x4c4)[0x7f56cea71e64]
> [cns136:590327] [ 7]
> /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(+0x841642)[0x7f56cea83642]
> [cns136:590327] [ 8]
> /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(PetscSFSetUp+0x9e)[0x7f56cea7043e]
> [cns136:590327] [ 9]
> /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(VecScatterCreate+0x164e)[0x7f56cea7bbde]
> [cns136:590327] [10]
> /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(DMSetUp_DA_3D+0x3e38)[0x7f56cee84dd8]
> [cns136:590327] [11]
> /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(DMSetUp_DA+0xd8)[0x7f56cee9b448]
> [cns136:590327] [12]
> /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(DMSetUp+0x20)[0x7f56cededa20]
> [cns136:590327] [13] ./cobpor[0x4418dc]
> [cns136:590327] [14] ./cobpor[0x408b63]
> [cns136:590327] [15]
> /lib64/libc.so.6(__libc_start_main+0xf3)[0x7f56cd4b5cf3]
> [cns136:590327] [16] ./cobpor[0x40bdee]
> [cns136:590327] *** End of error message ***
> --------------------------------------------------------------------------
> Primary job terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpiexec noticed that process rank 84 with PID 590327 on node cns136 exited
> on signal 6 (Aborted).
> --------------------------------------------------------------------------
>
>
>
> When I reduce the number of processors, the error disappears, and when I run
> my code without the vapor bubbles, it also works.
>
> The problem seems to occur at this point in the code:
>
>
>
> DMCreate(PETSC_COMM_WORLD,swarm);
> DMSetType(*swarm,DMSWARM);
> DMSetDimension(*swarm,3);
> DMSwarmSetType(*swarm,DMSWARM_PIC);
> DMSwarmSetCellDM(*swarm,*dmcell);
>
>
>
>
>
> Thanks a lot for your help.
>
>
>
> Things that would help us track this down:
>
> 1) The smallest example where it fails
> 2) The smallest number of processes where it fails
> 3) A stack trace of the failure
> 4) A simple example that we can run that also fails
>
>
>
> Thanks,
>
>
>
> Matt
>
>
>
> Best regards,
>
>
>
> Joauma
>
>
>
>
> --
>
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
>
>
> https://www.cse.buffalo.edu/~knepley/
> <http://www.cse.buffalo.edu/~knepley/>
>
--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener
https://www.cse.buffalo.edu/~knepley/ <http://www.cse.buffalo.edu/~knepley/>