[Darshan-users] Darshan crashing compute node
André R. Carneiro
andre.es at gmail.com
Wed Apr 1 12:39:10 CDT 2020
Hi,
I'm trying to use Darshan (I tested all 3.1.X versions) on an HPC cluster
running Red Hat 7.6 (kernel 3.10.0-957.el7.x86_64) with Lustre
filesystem version 2.11.0.300_cray_102_g3dbace1, but when the application
starts, the compute node crashes with a kernel panic message on the
console.
While searching on the Darshan mailing list I came across this message:
https://lists.mcs.anl.gov/mailman/htdig/darshan-users/2019-October/000542.html
Following that advice, I disabled the Lustre module and recompiled Darshan.
The kernel panic no longer happens, but now Darshan is unable to write its
log file to the Lustre filesystem and fails with the following error:
Using OpenMPI 3.1.5 and GCC 7:
Program received signal SIGFPE: Floating-point exception - erroneous
arithmetic operation.
Backtrace for this error:
#0 0x7f4f6759f27f in ???
#1 0x7f4f687ababe in ???
#2 0x7f4f687add06 in ???
#3 0x7f4f687db6c0 in ???
#4 0x7f4f687dbddb in ???
#5 0x7f4f6879d6f1 in ???
#6 0x7f4f6871892b in ???
#7 0x7f4f691d0ae1 in MPI_File_write_at_all
at lib/darshan-mpiio.c:536
#8 0x7f4f691bea7f in darshan_log_append_all
at lib/darshan-core.c:1800
#9 0x7f4f691c1907 in darshan_log_write_name_record_hash
at lib/darshan-core.c:1761
#10 0x7f4f691c1907 in darshan_core_shutdown
at lib/darshan-core.c:546
#11 0x7f4f691be402 in MPI_Finalize
at lib/darshan-core-init-finalize.c:82
#12 0x7f4f68b6a798 in ???
#13 0x4023bb in ???
#14 0x401ae6 in ???
#15 0x7f4f6758b3d4 in ???
#16 0x401b16 in ???
#17 0xffffffffffffffff in ???
--------------------------------------------------------------------------
Using Intel PSXE 2018 with Intel MPI:
forrtl: severe (71): integer divide by zero
Image PC Routine Line Source
exec.exe 000000000045282E Unknown Unknown Unknown
libpthread-2.17.s 00002B8B5A5FE5D0 Unknown Unknown Unknown
libmpi_lustre.so. 00002B8B659D4FDF ADIOI_LUSTRE_Get_ Unknown Unknown
libmpi_lustre.so. 00002B8B659CFFD9 ADIOI_LUSTRE_Writ Unknown Unknown
libmpi.so.12.0 00002B8B59A4C15C Unknown Unknown Unknown
libmpi.so.12 00002B8B59A4D1D5 PMPI_File_write_a Unknown Unknown
libdarshan.so 00002B8B58F90312 MPI_File_write_at Unknown Unknown
libdarshan.so 00002B8B58F7E63A Unknown Unknown Unknown
libdarshan.so 00002B8B58F815B0 darshan_core_shut Unknown Unknown
libdarshan.so 00002B8B58F7DFF3 MPI_Finalize Unknown Unknown
libmpifort.so.12. 00002B8B592414DA pmpi_finalize__ Unknown Unknown
exec.exe 00000000004490A5 Unknown Unknown Unknown
exec.exe 00000000004032DE Unknown Unknown Unknown
libc-2.17.so 00002B8B5AB2F3D5 __libc_start_main Unknown Unknown
exec.exe 00000000004031E9 Unknown Unknown Unknown
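For reference, this is roughly how I disabled the Lustre module and rebuilt
darshan-runtime (only a minimal sketch; the version, install prefix, log
path, and jobid variable are placeholders for my site's values, and I'm
assuming the stock darshan-runtime 3.1.x configure options):

  # Rebuild darshan-runtime without the Lustre instrumentation module
  # (paths, version, and jobid environment variable are placeholders).
  cd darshan-3.1.8/darshan-runtime
  ./configure --prefix=/opt/darshan \
              --with-log-path=/lustre/darshan-logs \
              --with-jobid-env=SLURM_JOB_ID \
              --disable-lustre-mod \
              CC=mpicc
  make && make install

  # Create the year/month/day log directory tree under the log path.
  /opt/darshan/bin/darshan-mk-log-dirs.pl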
If I configure Darshan to write its logs to a different filesystem (the
local /tmp of the first compute node), it works fine, but then I'm
restricted to using only one compute node, since the output directory has
to be shared among all nodes (MPI tasks).
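The /tmp test was redirected roughly like this (again just a sketch; paths
and the rank count are placeholders, and I'm assuming the documented
--with-log-path configure option and DARSHAN_LOGFILE environment variable):

  # Rebuild with the log path on the node-local filesystem; only usable on
  # a single node, since every rank must see the same log directory.
  ./configure --prefix=/opt/darshan --with-log-path=/tmp/darshan-logs \
              --with-jobid-env=SLURM_JOB_ID --disable-lustre-mod CC=mpicc

  # Or, per job, point Darshan at an explicit output file at run time;
  # with OpenMPI, -x forwards the variable to the remote ranks.
  export DARSHAN_LOGFILE=/tmp/darshan-logs/exec.darshan
  mpirun -x DARSHAN_LOGFILE -np 48 ./exec.exe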
Is there a workaround for this? At the moment, Lustre is the only
filesystem shared among all compute nodes on my cluster.
Best regards.
--
Abraços³,
André Ramos Carneiro.