[petsc-users] [petsc-maint] DMSwarm on multiple processors

Joauma Marichal joauma.marichal at uclouvain.be
Tue Dec 19 04:10:56 CST 2023


Hello,

I have used Address Sanitizer to check for memory errors. On my computer, no errors are found. Unfortunately, on the supercomputer that I am using, I get lots of errors… I attach my log files (from runs on 1 and 70 processes).
Do you have any idea of what I could do?

Thanks a lot for your help.

Best regards,

Joauma

From: Matthew Knepley <knepley at gmail.com>
Date: Monday, 18 December 2023 at 12:00
To: Joauma Marichal <joauma.marichal at uclouvain.be>
Cc: petsc-maint at mcs.anl.gov <petsc-maint at mcs.anl.gov>, petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
Subject: Re: [petsc-maint] DMSwarm on multiple processors
On Mon, Dec 18, 2023 at 5:09 AM Joauma Marichal <joauma.marichal at uclouvain.be> wrote:
Hello,

Sorry for the delay. I attach the output that I obtain when running the code in debug mode.

Okay, we can now see where this is happening:

malloc_consolidate(): invalid chunk size
[cns263:3265170] *** Process received signal ***
[cns263:3265170] Signal: Aborted (6)
[cns263:3265170] Signal code:  (-6)
[cns263:3265170] [ 0] /lib64/libc.so.6(+0x4eb20)[0x7f3bd9148b20]
[cns263:3265170] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7f3bd9148a9f]
[cns263:3265170] [ 2] /lib64/libc.so.6(abort+0x127)[0x7f3bd911be05]
[cns263:3265170] [ 3] /lib64/libc.so.6(+0x91037)[0x7f3bd918b037]
[cns263:3265170] [ 4] /lib64/libc.so.6(+0x9819c)[0x7f3bd919219c]
[cns263:3265170] [ 5] /lib64/libc.so.6(+0x98b68)[0x7f3bd9192b68]
[cns263:3265170] [ 6] /lib64/libc.so.6(+0x9af18)[0x7f3bd9194f18]
[cns263:3265170] [ 7] /lib64/libc.so.6(__libc_malloc+0x1e2)[0x7f3bd9196822]
[cns263:3265170] [ 8] /lib64/libc.so.6(posix_memalign+0x3c)[0x7f3bd91980fc]
[cns263:3265170] [ 9] /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(PetscMallocAlign+0x45)[0x7f3bda5f1625]
[cns263:3265170] [10] /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(PetscMallocA+0x297)[0x7f3bda5f1b07]
[cns263:3265170] [11] /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(DMCreate+0x5b)[0x7f3bdaa73c1b]
[cns263:3265170] [12] /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(DMDACreate+0x9)[0x7f3bdab0a2f9]
[cns263:3265170] [13] /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(DMDACreate3d+0x9a)[0x7f3bdab07dea]
[cns263:3265170] [14] ./cobpor[0x402de8]
[cns263:3265170] [15] /lib64/libc.so.6(__libc_start_main+0xf3)[0x7f3bd9134cf3]
[cns263:3265170] [16] ./cobpor[0x40304e]
[cns263:3265170] *** End of error message ***

However, this is not great. First, the amount of memory being allocated is quite small, and this does not appear to be an Out of Memory error. Second, the error occurs in libc:

  malloc_consolidate(): invalid chunk size

which means something is wrong internally. I agree with this analysis (https://stackoverflow.com/questions/18760999/sample-example-program-to-get-the-malloc-consolidate-error) that says you have probably overwritten memory somewhere in your code. I recommend running under valgrind, or using Address Sanitizer from clang.
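To illustrate why an overwrite can go unnoticed until a later, unrelated allocation, here is a minimal sketch in plain C. It is not taken from the code discussed in this thread: the helper names (`guarded_alloc`, `guard_intact`, `fill_off_by_one`) and the guard constants are hypothetical. The buffer carries a few extra guard bytes, so the deliberate off-by-one write stays inside the allocation and the damage can be checked by hand. valgrind and Address Sanitizer perform this kind of check automatically; with GCC or Clang, Address Sanitizer is enabled by compiling and linking with `-fsanitize=address -g`.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define GUARD_LEN  8     /* hypothetical size of the guard zone, in bytes */
#define GUARD_BYTE 0xAB  /* hypothetical fill pattern for the guard zone */

/* Allocate n usable bytes followed by GUARD_LEN guard bytes. */
unsigned char *guarded_alloc(size_t n)
{
  unsigned char *p = malloc(n + GUARD_LEN);
  if (!p) return NULL;
  memset(p, 0, n);                       /* usable region */
  memset(p + n, GUARD_BYTE, GUARD_LEN);  /* guard region  */
  return p;
}

/* Return 1 if the guard bytes after the n usable bytes are untouched, 0 otherwise. */
int guard_intact(const unsigned char *p, size_t n)
{
  assert(p != NULL);
  for (size_t i = 0; i < GUARD_LEN; i++)
    if (p[n + i] != GUARD_BYTE) return 0;
  return 1;
}

/* Deliberately buggy: the loop bound should be i < n, not i <= n. */
void fill_off_by_one(unsigned char *p, size_t n)
{
  for (size_t i = 0; i <= n; i++) p[i] = 1;
}
```

Calling `fill_off_by_one(buf, 16)` on a buffer from `guarded_alloc(16)` returns normally, and only a later call to `guard_intact(buf, 16)` reveals the damage. Without guard bytes, the same stray write would land in the allocator's own bookkeeping and surface much later, for example inside an unrelated malloc call.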

  Thanks,

     Matt

Thanks for your help.

Best regards,

Joauma

From: Matthew Knepley <knepley at gmail.com>
Date: Thursday, 23 November 2023 at 15:32
To: Joauma Marichal <joauma.marichal at uclouvain.be>
Cc: petsc-maint at mcs.anl.gov <petsc-maint at mcs.anl.gov>, petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
Subject: Re: [petsc-maint] DMSwarm on multiple processors
On Thu, Nov 23, 2023 at 9:01 AM Joauma Marichal <joauma.marichal at uclouvain.be> wrote:
Hello,

My problem persists… Is there anything I could try?

Yes. It appears to be failing from a call inside PetscSFSetUpRanks(). It does allocation, and the failure is in libc, and it only happens on larger examples, so I suspect some allocation problem. Can you rebuild with debugging and run this example? Then we can see if the allocation fails.
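As a rough illustration of what "seeing if the allocation fails" means, here is a minimal plain-C sketch of an allocation that reports failure through a return code instead of letting the program continue with a bad pointer. This is not PETSc code: `checked_array_alloc` and the `ALLOC_*` codes are hypothetical names used only for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

enum { ALLOC_OK = 0, ALLOC_OVERFLOW = 1, ALLOC_NOMEM = 2 };

/* Allocate count elements of elem_size bytes each.
   On failure, *out is NULL and the return code says why. */
int checked_array_alloc(size_t count, size_t elem_size, void **out)
{
  assert(out != NULL);
  *out = NULL;

  /* Reject requests whose total byte count would wrap around size_t. */
  if (elem_size != 0 && count > SIZE_MAX / elem_size) {
    fprintf(stderr, "allocation size overflow: %zu x %zu\n", count, elem_size);
    return ALLOC_OVERFLOW;
  }

  size_t bytes = count * elem_size;
  void *p = malloc(bytes ? bytes : 1); /* avoid implementation-defined malloc(0) */
  if (!p) {
    fprintf(stderr, "out of memory requesting %zu bytes\n", bytes);
    return ALLOC_NOMEM;
  }

  *out = p;
  return ALLOC_OK;
}
```

A caller checks the return code right at the allocation site, so a genuine out-of-memory condition is reported there, with the requested size, rather than showing up as an abort deep inside libc.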

  Thanks,

     Matt

Thanks a lot.

Best regards,

Joauma

From: Matthew Knepley <knepley at gmail.com>
Date: Wednesday, 25 October 2023 at 14:45
To: Joauma Marichal <joauma.marichal at uclouvain.be>
Cc: petsc-maint at mcs.anl.gov <petsc-maint at mcs.anl.gov>, petsc-users at mcs.anl.gov <petsc-users at mcs.anl.gov>
Subject: Re: [petsc-maint] DMSwarm on multiple processors
On Wed, Oct 25, 2023 at 8:32 AM Joauma Marichal via petsc-maint <petsc-maint at mcs.anl.gov> wrote:
Hello,

I am using DMSwarm in an Eulerian-Lagrangian approach to model vapor bubbles in water.
I have obtained good results recently and wanted to run bigger simulations. Unfortunately, when I increase the number of processes used to run the simulation, I get the following error:


free(): invalid size
[cns136:590327] *** Process received signal ***
[cns136:590327] Signal: Aborted (6)
[cns136:590327] Signal code:  (-6)
[cns136:590327] [ 0] /lib64/libc.so.6(+0x4eb20)[0x7f56cd4c9b20]
[cns136:590327] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7f56cd4c9a9f]
[cns136:590327] [ 2] /lib64/libc.so.6(abort+0x127)[0x7f56cd49ce05]
[cns136:590327] [ 3] /lib64/libc.so.6(+0x91037)[0x7f56cd50c037]
[cns136:590327] [ 4] /lib64/libc.so.6(+0x9819c)[0x7f56cd51319c]
[cns136:590327] [ 5] /lib64/libc.so.6(+0x99aac)[0x7f56cd514aac]
[cns136:590327] [ 6] /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(PetscSFSetUpRanks+0x4c4)[0x7f56cea71e64]
[cns136:590327] [ 7] /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(+0x841642)[0x7f56cea83642]
[cns136:590327] [ 8] /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(PetscSFSetUp+0x9e)[0x7f56cea7043e]
[cns136:590327] [ 9] /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(VecScatterCreate+0x164e)[0x7f56cea7bbde]
[cns136:590327] [10] /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(DMSetUp_DA_3D+0x3e38)[0x7f56cee84dd8]
[cns136:590327] [11] /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(DMSetUp_DA+0xd8)[0x7f56cee9b448]
[cns136:590327] [12] /gpfs/home/acad/ucl-tfl/marichaj/marha/lib_petsc/lib/libpetsc.so.3.019(DMSetUp+0x20)[0x7f56cededa20]
[cns136:590327] [13] ./cobpor[0x4418dc]
[cns136:590327] [14] ./cobpor[0x408b63]
[cns136:590327] [15] /lib64/libc.so.6(__libc_start_main+0xf3)[0x7f56cd4b5cf3]
[cns136:590327] [16] ./cobpor[0x40bdee]
[cns136:590327] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 84 with PID 590327 on node cns136 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

When I reduce the number of processes, the error disappears, and when I run my code without the vapor bubbles it also works.
The problem seems to occur at this point in the code:

    DMCreate(PETSC_COMM_WORLD,swarm);
    DMSetType(*swarm,DMSWARM);
    DMSetDimension(*swarm,3);
    DMSwarmSetType(*swarm,DMSWARM_PIC);
    DMSwarmSetCellDM(*swarm,*dmcell);


Thanks a lot for your help.

Things that would help us track this down:

1) The smallest example where it fails

2) The smallest number of processes where it fails

3) A stack trace of the failure

4) A simple example that we can run that also fails

  Thanks,

     Matt

Best regards,

Joauma


--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/


-------------- next part --------------
A non-text attachment was scrubbed...
Name: log_1proc
Type: application/octet-stream
Size: 14935 bytes
Desc: log_1proc
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20231219/39d3f0ca/attachment-0002.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: log_70proc
Type: application/octet-stream
Size: 174297 bytes
Desc: log_70proc
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20231219/39d3f0ca/attachment-0003.obj>

