[Darshan-users] Darshan crashing compute node
André R. Carneiro
andre.es at gmail.com
Thu Apr 2 15:44:49 CDT 2020
Hi Phil.
First of all, thank you very much for your attention and for the information.
The "export DARSHAN_LOGHINTS=" workaround solved the problem!
The Darshan log file is now written without any problems.
I would like to point out that before setting DARSHAN_LOGHINTS, when I used
darshan-parser to extract some info from the log file, I got the following:
# metadata: lib_ver = 3.1.8
# metadata: h = romio_no_indep_rw=true;cb_nodes=4
With DARSHAN_LOGHINTS set to an empty string, the parser output no longer
contains a metadata entry for the hints used.
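For anyone following along, the hints Darshan recorded in a log show up in darshan-parser's output as a "# metadata: h = ..." line. A minimal sketch of how to check them (the log path is a placeholder; the second command demonstrates the same extraction on the parser output quoted above):

```shell
# Real usage (path is hypothetical):
# darshan-parser /path/to/app.darshan | grep '^# metadata: h'

# Same extraction, simulated on the parser output from this thread:
printf '# metadata: lib_ver = 3.1.8\n# metadata: h = romio_no_indep_rw=true;cb_nodes=4\n' |
  grep '^# metadata: h'
# -> # metadata: h = romio_no_indep_rw=true;cb_nodes=4
```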
Thank you very much for the valuable help!
Best regards.
On Thu, Apr 2, 2020 at 12:27 PM Carns, Philip H. <carns at mcs.anl.gov> wrote:
> Thanks for the bug reports André! You are describing two different
> problems:
>
> 1) incompatibility between the Darshan Lustre instrumentation and recent
> Lustre releases: we'll definitely look into this. As you found in the
> mailing list archives, there has been a known problem here, but the fact
> that it now causes a node crash is a new permutation. That's disappointing,
> to say the least. If anyone happens to have a pointer to how to stand up a
> Lustre 2.11 instance in Docker or a VM they can share, that would be
> greatly appreciated and would help us track compatibility better. The
> Lustre systems we have access to are not new enough. In the meantime you
> are correct to disable that component at build time.
>
> 2) floating point or divide by zero when writing log to Lustre: This one
> is odd because at the point of the problem, Darshan is not instrumenting
> anything: it is simply an application writing data to Lustre. We don't
> generally see problems here. Darshan is doing a collective write here as
> the last step of aggregating instrumentation from all ranks into a single
> log file. The only slightly unusual things that Darshan could be doing are
> a) setting MPI-IO hints or b) using particular datatypes to organize the
> writes. Neither should be a problem, but it looks like they might have
> triggered a bug somewhere.
>
> If you don't mind trying another test case, could you repeat one of the
> experiments that crashes with a floating point or divide by zero bug down
> in MPI_File_write_at_all() with the following environment variable
> set? (please keep disabling the Lustre module at build time, we have no
> other fix for that ready yet)
>
> export DARSHAN_LOGHINTS=
>
> (in other words, set the DARSHAN_LOGHINTS environment variable to an empty
> string)
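A minimal sketch of how this might look in a job script (the mpirun line is a placeholder, not from the thread). Note that the variable must be set *and* empty; if it is left unset entirely, Darshan applies its usual hints, per the explanation in this message:

```shell
# Clear the MPI-IO hints Darshan would normally set when
# writing its log file.
export DARSHAN_LOGHINTS=

# Sanity check: the variable exists but holds an empty string.
if [ -n "${DARSHAN_LOGHINTS+set}" ] && [ -z "$DARSHAN_LOGHINTS" ]; then
  echo "DARSHAN_LOGHINTS cleared"
fi

# mpirun -np 4 ./exec.exe   # placeholder for the failing run
```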
>
> I believe that this will clear the MPI-IO hints that Darshan would
> normally specify when opening the output log file. If that makes your
> example work cleanly, then we can narrow down which hint is the problem and
> maybe get some help outside of Darshan for the root cause. If it still
> fails, then we need to look elsewhere for the problem.
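If clearing the hints does make the run succeed, one way to narrow down the culprit is to re-run the failing case with each of the two hints recorded in the log metadata (romio_no_indep_rw=true and cb_nodes=4) individually. A hypothetical sketch, with the application launch again left as a placeholder:

```shell
# Bisect the two default hints one at a time: whichever single hint
# reproduces the SIGFPE / divide-by-zero is the likely trigger.
for hint in "romio_no_indep_rw=true" "cb_nodes=4"; do
  export DARSHAN_LOGHINTS="$hint"
  echo "running with DARSHAN_LOGHINTS=$DARSHAN_LOGHINTS"
  # mpirun -np 4 ./exec.exe   # placeholder for the failing run
done
```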
>
> thanks,
> -Phil
> ------------------------------
> *From:* Darshan-users <darshan-users-bounces at lists.mcs.anl.gov> on behalf
> of André R. Carneiro <andre.es at gmail.com>
> *Sent:* Wednesday, April 1, 2020 1:39 PM
> *To:* darshan-users at lists.mcs.anl.gov <darshan-users at lists.mcs.anl.gov>
> *Subject:* [Darshan-users] Darshan crashing compute node
>
> Hi,
>
> I'm trying to use Darshan (I tested all 3.1.x versions) on an HPC cluster
> running Red Hat 7.6 (kernel 3.10.0-957.el7.x86_64) and using the Lustre
> filesystem, version 2.11.0.300_cray_102_g3dbace1, but when the application
> starts, the compute node crashes with a kernel panic message on the
> console.
>
> While searching on the Darshan mailing list I came across this message:
> https://lists.mcs.anl.gov/mailman/htdig/darshan-users/2019-October/000542.html
>
> I disabled the Lustre module and recompiled Darshan. The crash isn't
> happening anymore, but now Darshan is unable to write its log file on the
> Lustre filesystem, failing with the following error:
>
> *Using OpenMPI 3.1.5 and GCC 7
>
> Program received signal SIGFPE: Floating-point exception - erroneous
> arithmetic operation.
> Backtrace for this error:
> #0 0x7f4f6759f27f in ???
> #1 0x7f4f687ababe in ???
> #2 0x7f4f687add06 in ???
> #3 0x7f4f687db6c0 in ???
> #4 0x7f4f687dbddb in ???
> #5 0x7f4f6879d6f1 in ???
> #6 0x7f4f6871892b in ???
> #7 0x7f4f691d0ae1 in MPI_File_write_at_all
> at lib/darshan-mpiio.c:536
> #8 0x7f4f691bea7f in darshan_log_append_all
> at lib/darshan-core.c:1800
> #9 0x7f4f691c1907 in darshan_log_write_name_record_hash
> at lib/darshan-core.c:1761
> #10 0x7f4f691c1907 in darshan_core_shutdown
> at lib/darshan-core.c:546
> #11 0x7f4f691be402 in MPI_Finalize
> at lib/darshan-core-init-finalize.c:82
> #12 0x7f4f68b6a798 in ???
> #13 0x4023bb in ???
> #14 0x401ae6 in ???
> #15 0x7f4f6758b3d4 in ???
> #16 0x401b16 in ???
> #17 0xffffffffffffffff in ???
> --------------------------------------------------------------------------
>
> *Using Intel PSXE 2018 with Intel MPI
>
> forrtl: severe (71): integer divide by zero
> Image PC Routine Line Source
>
> exec.exe 000000000045282E Unknown Unknown Unknown
> libpthread-2.17.s 00002B8B5A5FE5D0 Unknown Unknown Unknown
> libmpi_lustre.so. 00002B8B659D4FDF ADIOI_LUSTRE_Get_ Unknown Unknown
> libmpi_lustre.so. 00002B8B659CFFD9 ADIOI_LUSTRE_Writ Unknown Unknown
> libmpi.so.12.0 00002B8B59A4C15C Unknown Unknown Unknown
> libmpi.so.12 00002B8B59A4D1D5 PMPI_File_write_a Unknown Unknown
> libdarshan.so 00002B8B58F90312 MPI_File_write_at Unknown Unknown
> libdarshan.so 00002B8B58F7E63A Unknown Unknown Unknown
> libdarshan.so 00002B8B58F815B0 darshan_core_shut Unknown Unknown
> libdarshan.so 00002B8B58F7DFF3 MPI_Finalize Unknown Unknown
> libmpifort.so.12. 00002B8B592414DA pmpi_finalize__ Unknown Unknown
> exec.exe 00000000004490A5 Unknown Unknown Unknown
> exec.exe 00000000004032DE Unknown Unknown Unknown
> libc-2.17.so 00002B8B5AB2F3D5 __libc_start_main Unknown
> Unknown
> exec.exe 00000000004031E9 Unknown Unknown Unknown
>
>
> If I configure Darshan to write to a different filesystem (the local /tmp
> of the first compute node), it works fine, but then I'm restricted to
> using only one compute node, since the output directory has to be shared
> among all nodes (MPI tasks).
>
> Is there a workaround for this? At the moment, Lustre is the only
> filesystem my cluster has that is shared among all compute nodes.
>
> Best regards.
>
> --
> Abraços³,
> André Ramos Carneiro.
>
--
Abraços³,
André Ramos Carneiro.