[Darshan-users] Darshan 3.3.1 aborting while generating log file
Snyder, Shane
ssnyder at mcs.anl.gov
Thu Jul 29 11:55:28 CDT 2021
Hi Andre,
Thanks for reporting these issues to us!
We think the 1st and 3rd issues you mention are related to a known bug in older versions of ROMIO's Lustre driver. This bug has since been fixed, but we probably do need to offer some sort of workaround in Darshan so we aren't crashing user codes. I've opened up an issue on our GitHub to track this problem (https://github.com/darshan-hpc/darshan/issues/424) -- our current plan is to offer a configure option that helps work around this issue for affected MPI versions, by avoiding usage of ROMIO's Lustre driver (where the bug is occurring).
I'll have to look into the 2nd issue you reported further to see if I can reproduce it on systems I have access to. I'll keep you posted on whether I'm able to help narrow down what's going wrong there.
--Shane
________________________________
From: Darshan-users <darshan-users-bounces at lists.mcs.anl.gov> on behalf of André R. Carneiro <andre.es at gmail.com>
Sent: Tuesday, July 27, 2021 8:19 AM
To: darshan-users at lists.mcs.anl.gov <darshan-users at lists.mcs.anl.gov>
Subject: [Darshan-users] Darshan 3.3.1 aborting while generating log file
Hi,
I'm testing version 3.3.1 with different MPI implementations. With newer versions of OpenMPI (4.x, with ROMIO v3.2.1) and Intel MPI (Parallel Studio XE 2019 and 2020, with ROMIO from MPICH v3.3) everything runs smoothly. But with older versions I get the errors below while generating the log file on a Lustre filesystem.
I'm only able to generate the log files with those older versions if I set the environment variable DARSHAN_LOGHINTS to an empty string.
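For reference, the workaround above can be applied before launching the job. Clearing DARSHAN_LOGHINTS removes the MPI-IO hints Darshan passes when writing its log, which avoids the hint-handling path in ROMIO's Lustre driver. The launch line is illustrative only (binary name taken from the backtraces below; rank count is a guess):

```shell
# Clear Darshan's MPI-IO log-write hints so the buggy ROMIO Lustre
# hint path is not exercised during darshan_core_shutdown().
export DARSHAN_LOGHINTS=""

# Hypothetical launch line for the NAS BT-IO benchmark:
# mpirun -np 36 ./bt.C.36.mpi_io_full
```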
The application I'm testing is the BT-IO from NAS NPB v3.3.1.
The version of the Lustre FS is 2.12.4.1_cray_139_g0763d21
======================================================
* OpenMPI 3.X with ROMIO v3.1.4
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
Backtrace for this error:
#0 0x2b8da12e03ef in ???
#1 0x2b8da01e8abe in ???
#2 0x2b8da01ead06 in ???
#3 0x2b8da02186c0 in ???
#4 0x2b8da0218ddb in ???
#5 0x2b8da01da6f1 in ???
#6 0x2b8da015592b in ???
#7 0x2b8d9f813c40 in MPI_File_write_at_all
at lib/darshan-mpiio.c:573
#8 0x2b8d9f7f5134 in darshan_log_append
at lib/darshan-core.c:1884
#9 0x2b8d9f7f84bd in darshan_log_write_name_record_hash
at lib/darshan-core.c:1775
#10 0x2b8d9f7f84bd in darshan_core_shutdown
at lib/darshan-core.c:604
#11 0x2b8d9f7f4917 in MPI_Finalize
at lib/darshan-core-init-finalize.c:85
======================================================
* OpenMPI 3.X with OMPIO
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
#0 0x2b6fc617627f in ???
#1 0x2b6fc468bcfd in darshan_core_lookup_record_name
at lib/darshan-core.c:2389
#2 0x2b6fc46a5484 in darshan_stdio_lookup_record_name
at lib/darshan-stdio.c:1288
#3 0x2b6fc4694c87 in fileno
at lib/darshan-posix.c:768
#4 0x2b6fc7b68504 in ???
#5 0x2b6fc7b69920 in ???
#6 0x2b6fc506c817 in ???
#7 0x2b6fc500ef1a in ???
#8 0x2b6fc50b4361 in ???
#9 0x2b6fc506e5c8 in ???
#10 0x2b6fc4fbafeb in ???
#11 0x2b6fc4fe8903 in ???
#12 0x2b6fc46a798b in MPI_File_open
at lib/darshan-mpiio.c:345
#13 0x2b6fc468d4a1 in darshan_log_open
at lib/darshan-core.c:1604
#14 0x2b6fc468d4a1 in darshan_core_shutdown
at lib/darshan-core.c:584
#15 0x2b6fc468a917 in MPI_Finalize
at lib/darshan-core-init-finalize.c:85
#16 0x2b6fc4d3b798 in ???
#17 0x4025cc in ???
#18 0x402f39 in ???
#19 0x2b6fc61623d4 in ???
#20 0x401868 in ???
#21 0xffffffffffffffff in ???
======================================================
* Intel PSXE 2018 with ROMIO from MPICH v3.2
forrtl: severe (71): integer divide by zero
Image PC Routine Line Source
libifcoremt.so.5 00002B4887FFE4CF for__signal_handl Unknown Unknown
libpthread-2.17.s 00002B4887B6F630 Unknown Unknown Unknown
libmpi_lustre.so. 00002B488EF5EFDF ADIOI_LUSTRE_Get_ Unknown Unknown
libmpi_lustre.so. 00002B488EF59FD9 ADIOI_LUSTRE_Writ Unknown Unknown
libmpi.so.12.0 00002B4886FBD15C Unknown Unknown Unknown
libmpi.so.12 00002B4886FBE1D5 PMPI_File_write_a Unknown Unknown
libdarshan.so 00002B4886500B07 MPI_File_write_at Unknown Unknown
libdarshan.so 00002B48864E088D Unknown Unknown Unknown
libdarshan.so 00002B48864E3BC3 darshan_core_shut Unknown Unknown
libdarshan.so 00002B48864E00A8 MPI_Finalize Unknown Unknown
libmpifort.so.12. 00002B48867B24DA pmpi_finalize__ Unknown Unknown
bt.C.36.mpi_io_fu 0000000000402A35 Unknown Unknown Unknown
bt.C.36.mpi_io_fu 0000000000401D92 Unknown Unknown Unknown
libc-2.17.so 00002B488A76F545 __libc_start_main Unknown Unknown
bt.C.36.mpi_io_fu 0000000000401C99 Unknown Unknown Unknown
======================================================
--
Best regards,
André Ramos Carneiro.