[petsc-users] [petsc-maint] DMSwarm on multiple processors
Matthew Knepley
knepley at gmail.com
Mon Dec 18 05:00:02 CST 2023
On Mon, Dec 18, 2023 at 5:09 AM Joauma Marichal <
joauma.marichal at uclouvain.be> wrote:
> Hello,
>
>
>
> Sorry for the delay. I attach the file that I obtain when running the code
> with the debug mode.
>
Okay, we can now see where this is happening:
malloc_consolidate(): invalid chunk size
[cns263:3265170] *** Process received signal ***
[cns263:3265170] Signal: Aborted (6)
[cns263:3265170] Signal code: (-6)
[cns263:3265170] [ 0] /lib64/libc.so.6(+0x4eb20)[0x7f3bd9148b20]
[cns263:3265170] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7f3bd9148a9f]
[cns263:3265170] [ 2] /lib64/libc.so.6(abort+0x127)[0x7f3bd911be05]
[cns263:3265170] [ 3] /lib64/libc.so.6(+0x91037)[0x7f3bd918b037]
[cns263:3265170] [ 4] /lib64/libc.so.6(+0x9819c)[0x7f3bd919219c]
[cns263:3265170] [ 5] /lib64/libc.so.6(+0x98b68)[0x7f3bd9192b68]
[cns263:3265170] [ 6] /lib64/libc.so.6(+0x9af18)[0x7f3bd9194f18]
[cns263:3265170] [ 7] /lib64/libc.so.6(__libc_malloc+0x1e2)[0x7f3bd9196822]
[cns263:3265170] [ 8] /lib64/libc.so.6(posix_memalign+0x3c)[0x7f3bd91980fc]
[cns263:3265170] [ 9]
/gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(PetscMallocAlign+0x45)[0x7f3bda5f1625]
[cns263:3265170] [10]
/gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(PetscMallocA+0x297)[0x7f3bda5f1b07]
[cns263:3265170] [11]
/gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(DMCreate+0x5b)[0x7f3bdaa73c1b]
[cns263:3265170] [12]
/gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(DMDACreate+0x9)[0x7f3bdab0a2f9]
[cns263:3265170] [13]
/gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(DMDACreate3d+0x9a)[0x7f3bdab07dea]
[cns263:3265170] [14] ./cobpor[0x402de8]
[cns263:3265170] [15]
/lib64/libc.so.6(__libc_start_main+0xf3)[0x7f3bd9134cf3]
[cns263:3265170] [16] ./cobpor[0x40304e]
[cns263:3265170] *** End of error message ***
However, this is not great. First, the amount of memory being allocated is
quite small, and this does not appear to be an Out of Memory error. Second,
the error occurs in libc:
malloc_consolidate(): invalid chunk size
which means something is wrong internally. I agree with this analysis (
https://stackoverflow.com/questions/18760999/sample-example-program-to-get-the-malloc-consolidate-error)
that says you have probably overwritten memory somewhere in your code. I
recommend running under valgrind, or using Address Sanitizer from clang.
Thanks,
Matt
Thanks for your help.
>
>
>
> Best regards,
>
>
>
> Joauma
>
>
>
> *De : *Matthew Knepley <knepley at gmail.com>
> *Date : *jeudi, 23 novembre 2023 à 15:32
> *À : *Joauma Marichal <joauma.marichal at uclouvain.be>
> *Cc : *petsc-maint at mcs.anl.gov <petsc-maint at mcs.anl.gov>,
> petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
> *Objet : *Re: [petsc-maint] DMSwarm on multiple processors
>
> On Thu, Nov 23, 2023 at 9:01 AM Joauma Marichal <
> joauma.marichal at uclouvain.be> wrote:
>
> Hello,
>
>
>
> My problem persists… Is there anything I could try?
>
>
>
> Yes. It appears to be failing from a call inside PetscSFSetUpRanks(). It
> does allocation, and the failure
>
> is in libc, and it only happens on larger examples, so I suspect some
> allocation problem. Can you rebuild with debugging and run this example?
> Then we can see if the allocation fails.
>
>
>
> Thanks,
>
> Matt
>
>
>
> Thanks a lot.
>
>
>
> Best regards,
>
>
>
> Joauma
>
>
>
> *De : *Matthew Knepley <knepley at gmail.com>
> *Date : *mercredi, 25 octobre 2023 à 14:45
> *À : *Joauma Marichal <joauma.marichal at uclouvain.be>
> *Cc : *petsc-maint at mcs.anl.gov <petsc-maint at mcs.anl.gov>,
> petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
> *Objet : *Re: [petsc-maint] DMSwarm on multiple processors
>
> On Wed, Oct 25, 2023 at 8:32 AM Joauma Marichal via petsc-maint <
> petsc-maint at mcs.anl.gov> wrote:
>
> Hello,
>
>
>
> I am using the DMSwarm library in some Eulerian-Lagrangian approach to
> have vapor bubbles in water.
>
> I have obtained nice results recently and wanted to perform bigger
> simulations. Unfortunately, when I increase the number of processors used
> to run the simulation, I get the following error:
>
>
>
> free(): invalid size
>
> [cns136:590327] *** Process received signal ***
>
> [cns136:590327] Signal: Aborted (6)
>
> [cns136:590327] Signal code: (-6)
>
> [cns136:590327] [ 0] /lib64/libc.so.6(+0x4eb20)[0x7f56cd4c9b20]
>
> [cns136:590327] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7f56cd4c9a9f]
>
> [cns136:590327] [ 2] /lib64/libc.so.6(abort+0x127)[0x7f56cd49ce05]
>
> [cns136:590327] [ 3] /lib64/libc.so.6(+0x91037)[0x7f56cd50c037]
>
> [cns136:590327] [ 4] /lib64/libc.so.6(+0x9819c)[0x7f56cd51319c]
>
> [cns136:590327] [ 5] /lib64/libc.so.6(+0x99aac)[0x7f56cd514aac]
>
> [cns136:590327] [ 6]
> /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(PetscSFSetUpRanks+0x4c4)[0x7f56cea71e64]
>
> [cns136:590327] [ 7]
> /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(+0x841642)[0x7f56cea83642]
>
> [cns136:590327] [ 8]
> /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(PetscSFSetUp+0x9e)[0x7f56cea7043e]
>
> [cns136:590327] [ 9]
> /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(VecScatterCreate+0x164e)[0x7f56cea7bbde]
>
> [cns136:590327] [10]
> /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(DMSetUp_DA_3D+0x3e38)[0x7f56cee84dd8]
>
> [cns136:590327] [11]
> /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(DMSetUp_DA+0xd8)[0x7f56cee9b448]
>
> [cns136:590327] [12]
> /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(DMSetUp+0x20)[0x7f56cededa20]
>
> [cns136:590327] [13] ./cobpor[0x4418dc]
>
> [cns136:590327] [14] ./cobpor[0x408b63]
>
> [cns136:590327] [15]
> /lib64/libc.so.6(__libc_start_main+0xf3)[0x7f56cd4b5cf3]
>
> [cns136:590327] [16] ./cobpor[0x40bdee]
>
> [cns136:590327] *** End of error message ***
>
> --------------------------------------------------------------------------
>
> Primary job terminated normally, but 1 process returned
>
> a non-zero exit code. Per user-direction, the job has been aborted.
>
> --------------------------------------------------------------------------
>
> --------------------------------------------------------------------------
>
> mpiexec noticed that process rank 84 with PID 590327 on node cns136 exited
> on signal 6 (Aborted).
>
> --------------------------------------------------------------------------
>
>
>
> When I reduce the number of processors the error disappears and when I run
> my code without the vapor bubbles it also works.
>
> The problem seems to take place at this moment:
>
>
>
> DMCreate(PETSC_COMM_WORLD,swarm);
>
> DMSetType(*swarm,DMSWARM);
>
> DMSetDimension(*swarm,3);
>
> DMSwarmSetType(*swarm,DMSWARM_PIC);
>
> DMSwarmSetCellDM(*swarm,*dmcell);
>
>
>
>
>
> Thanks a lot for your help.
>
>
>
> Things that would help us track this down:
>
>
>
> 1) The smallest example where it fails
>
>
>
> 2) The smallest number of processes where it fails
>
>
>
> 3) A stack trace of the failure
>
>
>
> 4) A simple example that we can run that also fails
>
>
>
> Thanks,
>
>
>
> Matt
>
>
>
> Best regards,
>
>
>
> Joauma
>
>
>
>
> --
>
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
>
>
> https://www.cse.buffalo.edu/~knepley/
> <http://www.cse.buffalo.edu/~knepley/>
>
>
>
>
> --
>
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
>
>
> https://www.cse.buffalo.edu/~knepley/
> <http://www.cse.buffalo.edu/~knepley/>
>
--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener
https://www.cse.buffalo.edu/~knepley/ <http://www.cse.buffalo.edu/~knepley/>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20231218/089c379b/attachment-0001.html>
More information about the petsc-users
mailing list