[petsc-users] [petsc-maint] DMSwarm on multiple processors
Matthew Knepley
knepley at gmail.com
Wed Dec 20 06:58:43 CST 2023
On Wed, Dec 20, 2023 at 4:12 AM Joauma Marichal <
joauma.marichal at uclouvain.be> wrote:
> Hello,
>
> I used Address Sanitizer on my laptop and it reports no leaks.
>
> I do have access to another machine (managed by the same people as the
> previous one), but I get similar errors there…
>
Let me make sure I understand:
1) You ran the exact same problem on two different parallel machines and
got the same error on both, meaning the second machine also printed
malloc_consolidate(): invalid chunk size
Is this true?
2) You ran the exact same problem, on the same number of processes, on
your own machine under Address Sanitizer and saw no errors?
Thanks,
Matt
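As background for question 2: tools such as Address Sanitizer surround each allocation with guard regions ("red zones") and report writes that land in them. The sketch below imitates that idea by hand in plain C. It is an illustration only, not how Address Sanitizer or PETSc is implemented; `guarded_malloc`, `guarded_check`, `guarded_free`, and the guard value are hypothetical names chosen for this example. Unlike the real tool, it only notices the damage when `guarded_check` is called.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical guard pattern; any value unlikely to occur by accident works. */
#define GUARD_WORD UINT64_C(0xDEADBEEFCAFEF00D)

/* Bookkeeping stored immediately before the user's payload. */
typedef struct {
  uint64_t size;  /* payload size in bytes */
  uint64_t guard; /* front guard word */
} GuardHeader;

/* Allocate n payload bytes with a guard word before and after them. */
void *guarded_malloc(size_t n)
{
  unsigned char *raw = (unsigned char *)malloc(sizeof(GuardHeader) + n + sizeof(uint64_t));
  GuardHeader    h   = {(uint64_t)n, GUARD_WORD};
  uint64_t       tail = GUARD_WORD;

  if (!raw) return NULL;
  memcpy(raw, &h, sizeof h);                      /* front bookkeeping */
  memcpy(raw + sizeof h + n, &tail, sizeof tail); /* trailing guard */
  return raw + sizeof h;
}

/* Return 1 if both guard words are intact, 0 if either was overwritten. */
int guarded_check(const void *p)
{
  const unsigned char *raw = (const unsigned char *)p - sizeof(GuardHeader);
  GuardHeader          h;
  uint64_t             tail;

  memcpy(&h, raw, sizeof h);
  if (h.guard != GUARD_WORD) return 0;
  memcpy(&tail, raw + sizeof h + h.size, sizeof tail);
  return tail == GUARD_WORD;
}

/* Release a block obtained from guarded_malloc. */
void guarded_free(void *p)
{
  if (p) free((unsigned char *)p - sizeof(GuardHeader));
}
```

After `guarded_malloc(8)`, a write to index 8 of the returned buffer lands in the trailing guard, so the next `guarded_check` returns 0. Matching the process count matters for the same reason: an overrun that depends on how the domain is split across ranks may never be triggered in a smaller run.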
> Thanks again for your help.
>
>
>
> Best regards,
>
>
>
> Joauma
>
>
>
> *From: *Matthew Knepley <knepley at gmail.com>
> *Date: *Tuesday, 19 December 2023 at 14:30
> *To: *Joauma Marichal <joauma.marichal at uclouvain.be>
> *Cc: *petsc-maint at mcs.anl.gov <petsc-maint at mcs.anl.gov>,
> petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
> *Subject: *Re: [petsc-maint] DMSwarm on multiple processors
>
> On Tue, Dec 19, 2023 at 5:11 AM Joauma Marichal <
> joauma.marichal at uclouvain.be> wrote:
>
> Hello,
>
>
>
> I have used Address Sanitizer to check for memory errors. On my computer,
> no errors are found. Unfortunately, on the supercomputer that I am using, I
> get lots of errors… I attach my log files (running on 1 and 70 processes).
>
> Do you have any idea of what I could do?
>
>
>
> Run the same parallel configuration as you do on the supercomputer. If
> that is fine, I would suggest Address Sanitizer there. Something is
> corrupting the stack, and it appears that it is connected to that machine,
> rather than the library. Do you have access to a second parallel machine?
>
>
>
> Thanks,
>
>
>
> Matt
>
>
>
> Thanks a lot for your help.
>
>
>
> Best regards,
>
>
>
> Joauma
>
>
>
> *From: *Matthew Knepley <knepley at gmail.com>
> *Date: *Monday, 18 December 2023 at 12:00
> *To: *Joauma Marichal <joauma.marichal at uclouvain.be>
> *Cc: *petsc-maint at mcs.anl.gov <petsc-maint at mcs.anl.gov>,
> petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
> *Subject: *Re: [petsc-maint] DMSwarm on multiple processors
>
> On Mon, Dec 18, 2023 at 5:09 AM Joauma Marichal <
> joauma.marichal at uclouvain.be> wrote:
>
> Hello,
>
>
>
> Sorry for the delay. I attach the output that I get when running the code
> in debug mode.
>
>
>
> Okay, we can now see where this is happening:
>
>
>
> malloc_consolidate(): invalid chunk size
> [cns263:3265170] *** Process received signal ***
> [cns263:3265170] Signal: Aborted (6)
> [cns263:3265170] Signal code: (-6)
> [cns263:3265170] [ 0] /lib64/libc.so.6(+0x4eb20)[0x7f3bd9148b20]
> [cns263:3265170] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7f3bd9148a9f]
> [cns263:3265170] [ 2] /lib64/libc.so.6(abort+0x127)[0x7f3bd911be05]
> [cns263:3265170] [ 3] /lib64/libc.so.6(+0x91037)[0x7f3bd918b037]
> [cns263:3265170] [ 4] /lib64/libc.so.6(+0x9819c)[0x7f3bd919219c]
> [cns263:3265170] [ 5] /lib64/libc.so.6(+0x98b68)[0x7f3bd9192b68]
> [cns263:3265170] [ 6] /lib64/libc.so.6(+0x9af18)[0x7f3bd9194f18]
> [cns263:3265170] [ 7] /lib64/libc.so.6(__libc_malloc+0x1e2)[0x7f3bd9196822]
> [cns263:3265170] [ 8] /lib64/libc.so.6(posix_memalign+0x3c)[0x7f3bd91980fc]
> [cns263:3265170] [ 9] /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(PetscMallocAlign+0x45)[0x7f3bda5f1625]
> [cns263:3265170] [10] /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(PetscMallocA+0x297)[0x7f3bda5f1b07]
> [cns263:3265170] [11] /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(DMCreate+0x5b)[0x7f3bdaa73c1b]
> [cns263:3265170] [12] /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(DMDACreate+0x9)[0x7f3bdab0a2f9]
> [cns263:3265170] [13] /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(DMDACreate3d+0x9a)[0x7f3bdab07dea]
> [cns263:3265170] [14] ./cobpor[0x402de8]
> [cns263:3265170] [15] /lib64/libc.so.6(__libc_start_main+0xf3)[0x7f3bd9134cf3]
> [cns263:3265170] [16] ./cobpor[0x40304e]
> [cns263:3265170] *** End of error message ***
>
>
>
> However, this is not great. First, the amount of memory being allocated is
> quite small, and this does not appear to be an Out of Memory error. Second,
> the error occurs in libc:
>
>
>
> malloc_consolidate(): invalid chunk size
>
>
>
> which means something is wrong internally. I agree with this analysis (
> https://stackoverflow.com/questions/18760999/sample-example-program-to-get-the-malloc-consolidate-error)
> that says you have probably overwritten memory somewhere in your code. I
> recommend running under valgrind, or using Address Sanitizer from clang.
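To illustrate why an error raised inside libc's allocator usually points back at application code, here is a minimal sketch in C. It is a toy model, not glibc's actual chunk layout: the `ToyArena` type, its two fixed-size chunks, and all function names are hypothetical and exist only for this illustration. An off-by-one write into the first chunk silently overwrites the size header of the second chunk, and the damage is only noticed later, when that second chunk is inspected.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: glibc's real chunk layout is more involved.
   Layout: [header A][payload A x3][header B][payload B x3] */
enum {
  PAYLOAD_WORDS = 3,
  SECOND_HEADER = PAYLOAD_WORDS + 1,
  ARENA_WORDS   = 2 * (PAYLOAD_WORDS + 1)
};

typedef struct {
  size_t words[ARENA_WORDS];
} ToyArena;

/* Zero the arena and write a size header in front of each chunk. */
void toy_arena_init(ToyArena *a)
{
  for (size_t i = 0; i < ARENA_WORDS; i++) a->words[i] = 0;
  a->words[0]             = PAYLOAD_WORDS;
  a->words[SECOND_HEADER] = PAYLOAD_WORDS;
}

/* "Application" code: writes count words into the first chunk's payload.
   It never checks count against PAYLOAD_WORDS, which is the bug being
   modeled. Writes are clamped to the arena so the demo stays well defined. */
void toy_fill_first_chunk(ToyArena *a, size_t count, size_t value)
{
  for (size_t i = 0; i < count && 1 + i < ARENA_WORDS; i++) a->words[1 + i] = value;
}

/* "Allocator" code: the kind of sanity check that fails long after the
   bad write happened. */
bool toy_chunk_size_ok(const ToyArena *a, size_t header_index)
{
  return header_index < ARENA_WORDS && a->words[header_index] == PAYLOAD_WORDS;
}
```

Filling four words instead of three leaves the first header intact but makes `toy_chunk_size_ok(&a, SECOND_HEADER)` return false. This loosely mirrors the trace above, where the abort surfaces in an unrelated allocation inside `DMCreate` rather than at the write that caused it.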
>
>
>
> Thanks,
>
>
>
> Matt
>
>
>
> Thanks for your help.
>
>
>
> Best regards,
>
>
>
> Joauma
>
>
>
> *From: *Matthew Knepley <knepley at gmail.com>
> *Date: *Thursday, 23 November 2023 at 15:32
> *To: *Joauma Marichal <joauma.marichal at uclouvain.be>
> *Cc: *petsc-maint at mcs.anl.gov <petsc-maint at mcs.anl.gov>,
> petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
> *Subject: *Re: [petsc-maint] DMSwarm on multiple processors
>
> On Thu, Nov 23, 2023 at 9:01 AM Joauma Marichal <
> joauma.marichal at uclouvain.be> wrote:
>
> Hello,
>
>
>
> My problem persists… Is there anything I could try?
>
>
>
> Yes. It appears to be failing from a call inside PetscSFSetUpRanks(). It
> does allocation, and the failure is in libc, and it only happens on larger
> examples, so I suspect some allocation problem. Can you rebuild with
> debugging and run this example? Then we can see if the allocation fails.
>
>
>
> Thanks,
>
> Matt
>
>
>
> Thanks a lot.
>
>
>
> Best regards,
>
>
>
> Joauma
>
>
>
> *From: *Matthew Knepley <knepley at gmail.com>
> *Date: *Wednesday, 25 October 2023 at 14:45
> *To: *Joauma Marichal <joauma.marichal at uclouvain.be>
> *Cc: *petsc-maint at mcs.anl.gov <petsc-maint at mcs.anl.gov>,
> petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
> *Subject: *Re: [petsc-maint] DMSwarm on multiple processors
>
> On Wed, Oct 25, 2023 at 8:32 AM Joauma Marichal via petsc-maint <
> petsc-maint at mcs.anl.gov> wrote:
>
> Hello,
>
>
>
> I am using the DMSwarm library in an Eulerian-Lagrangian approach to
> model vapor bubbles in water.
>
> I recently obtained good results and wanted to run bigger simulations.
> Unfortunately, when I increase the number of processors used to run the
> simulation, I get the following error:
>
>
>
> free(): invalid size
> [cns136:590327] *** Process received signal ***
> [cns136:590327] Signal: Aborted (6)
> [cns136:590327] Signal code: (-6)
> [cns136:590327] [ 0] /lib64/libc.so.6(+0x4eb20)[0x7f56cd4c9b20]
> [cns136:590327] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7f56cd4c9a9f]
> [cns136:590327] [ 2] /lib64/libc.so.6(abort+0x127)[0x7f56cd49ce05]
> [cns136:590327] [ 3] /lib64/libc.so.6(+0x91037)[0x7f56cd50c037]
> [cns136:590327] [ 4] /lib64/libc.so.6(+0x9819c)[0x7f56cd51319c]
> [cns136:590327] [ 5] /lib64/libc.so.6(+0x99aac)[0x7f56cd514aac]
> [cns136:590327] [ 6] /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(PetscSFSetUpRanks+0x4c4)[0x7f56cea71e64]
> [cns136:590327] [ 7] /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(+0x841642)[0x7f56cea83642]
> [cns136:590327] [ 8] /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(PetscSFSetUp+0x9e)[0x7f56cea7043e]
> [cns136:590327] [ 9] /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(VecScatterCreate+0x164e)[0x7f56cea7bbde]
> [cns136:590327] [10] /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(DMSetUp_DA_3D+0x3e38)[0x7f56cee84dd8]
> [cns136:590327] [11] /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(DMSetUp_DA+0xd8)[0x7f56cee9b448]
> [cns136:590327] [12] /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(DMSetUp+0x20)[0x7f56cededa20]
> [cns136:590327] [13] ./cobpor[0x4418dc]
> [cns136:590327] [14] ./cobpor[0x408b63]
> [cns136:590327] [15] /lib64/libc.so.6(__libc_start_main+0xf3)[0x7f56cd4b5cf3]
> [cns136:590327] [16] ./cobpor[0x40bdee]
> [cns136:590327] *** End of error message ***
> --------------------------------------------------------------------------
> Primary job terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpiexec noticed that process rank 84 with PID 590327 on node cns136 exited
> on signal 6 (Aborted).
> --------------------------------------------------------------------------
>
>
>
> When I reduce the number of processors, the error disappears, and when I run
> my code without the vapor bubbles it also works.
>
> The problem seems to occur at this point:
>
>
>
> DMCreate(PETSC_COMM_WORLD,swarm);
> DMSetType(*swarm,DMSWARM);
> DMSetDimension(*swarm,3);
> DMSwarmSetType(*swarm,DMSWARM_PIC);
> DMSwarmSetCellDM(*swarm,*dmcell);
>
>
>
>
>
> Thanks a lot for your help.
>
>
>
> Things that would help us track this down:
>
> 1) The smallest example where it fails
> 2) The smallest number of processes where it fails
> 3) A stack trace of the failure
> 4) A simple example that we can run that also fails
>
>
>
> Thanks,
>
>
>
> Matt
>
>
>
> Best regards,
>
>
>
> Joauma
>
>
>
>
> --
>
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
>
>
> https://www.cse.buffalo.edu/~knepley/
> <http://www.cse.buffalo.edu/~knepley/>
--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener
https://www.cse.buffalo.edu/~knepley/ <http://www.cse.buffalo.edu/~knepley/>