[Darshan-users] darshan execl issue
Snyder, Shane
ssnyder at mcs.anl.gov
Tue Jul 12 14:21:38 CDT 2022
Thanks for the details.
That sequence of calls seems like it ought to work with Darshan, so it would be good to see if we can get it working. I'll see if I can test something out locally and report back.
--Shane
________________________________
From: Adrian Jackson <a.jackson at epcc.ed.ac.uk>
Sent: Tuesday, July 12, 2022 1:08 PM
To: Snyder, Shane <ssnyder at mcs.anl.gov>; darshan-users at lists.mcs.anl.gov <darshan-users at lists.mcs.anl.gov>
Subject: Re: [Darshan-users] darshan execl issue
Ah, good question. No, I don't think I am; I hadn't picked up that that was
what execl did.
For full disclosure, I'm not actually calling execl directly; this is a
Fortran program that is calling system(), which according to the man pages
on our Cray works as follows: "The system() library function uses fork(2)
to create a child process that executes the shell command specified in
command using execl(3) as follows:
execl("/bin/sh", "sh", "-c", command, (char *) 0);
"
Hence the execl call.
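So, roughly, what the library is doing underneath is something like the
sketch below (ignoring the signal handling and error checking the real
implementation does; the name my_system is just for illustration). The
point being that the MPI rank itself is never replaced, only a short-lived
child is:

    /* Rough sketch of what the man page describes: fork a child, exec
       /bin/sh in the child, wait for it in the parent. */
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int my_system(const char *command)
    {
        int status = -1;
        pid_t pid = fork();
        if (pid == 0) {
            /* Child: replaces itself with the shell running the command. */
            execl("/bin/sh", "sh", "-c", command, (char *) 0);
            _exit(127);                 /* only reached if execl fails */
        } else if (pid > 0) {
            /* Parent (the original process) waits for the child. */
            waitpid(pid, &status, 0);
        }
        return status;
    }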
I think I'll just tell the code owners not to use system() for now.
cheers
adrianj
On 12/07/2022 18:57, Snyder, Shane wrote:
> Just to clarify, are you doing anything explicit to spawn a new process
> (i.e., fork) ahead of the call to execl? My understanding is that execl
> replaces the calling process, so generally speaking it shouldn't result
> in two processes (the MPI one plus an extra one for the system tasks)?
>
> --Shane
> ------------------------------------------------------------------------
> *From:* Adrian Jackson <a.jackson at epcc.ed.ac.uk>
> *Sent:* Tuesday, July 12, 2022 10:58 AM
> *To:* Snyder, Shane <ssnyder at mcs.anl.gov>;
> darshan-users at lists.mcs.anl.gov <darshan-users at lists.mcs.anl.gov>
> *Subject:* Re: [Darshan-users] darshan execl issue
> Hi Shane,
>
> Thanks for the reply. This is spawning an additional process that runs
> for a bit and then ends; the original MPI processes are all still there,
> there is just an extra one for a while. The spawned process doesn't do
> any MPI, it's just doing some system interaction stuff. If Darshan
> intercepts, or is triggered by, a process ending, I think that would
> explain it.
>
> It's something we can work around anyway; we can just not use Darshan
> for this executable. I was just checking whether it's expected behaviour
> with Darshan or whether we'd stumbled across "an accidental feature" :)
>
> cheers
>
> adrianj
>
> On 12/07/2022 16:53, Snyder, Shane wrote:
>> Hi Adrian,
>>
>> So you are replacing one of the MPI processes in MPI_COMM_WORLD with a
>> new process? In that case, it is probably the case that this new
>> replacement process is not calling MPI_Finalize, which ultimately causes
>> Darshan to hang -- Darshan intercepts the shutdown call and performs
>> some collective operations for MPI applications, and if one of the ranks
>> disappears these calls will likely just hang. If that's the issue, you
>> could probably reproduce it without Darshan by having your MPI
>> processes run a collective on MPI_COMM_WORLD (like a barrier) _after_
>> the execl call.
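>> Something like this untested sketch ought to show it (with /bin/true
>> just standing in for whatever the exec'd command really is):
>>
>>     /* Rank 0 replaces itself via execl and never reaches the barrier
>>        or MPI_Finalize, so the remaining ranks block forever. */
>>     #include <mpi.h>
>>     #include <unistd.h>
>>
>>     int main(int argc, char **argv)
>>     {
>>         int rank;
>>         MPI_Init(&argc, &argv);
>>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>         if (rank == 0)
>>             execl("/bin/true", "true", (char *) 0);   /* rank 0 is gone */
>>         MPI_Barrier(MPI_COMM_WORLD);                  /* hangs here */
>>         MPI_Finalize();
>>         return 0;
>>     }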
>>
>> A couple of different ideas:
>>
>> * If possible, it might be worth trying to fork ahead of the execl
>>   call so that you still have all MPI processes hanging around at
>>   shutdown time (see the sketch after this list)?
>> * You may be able to run Darshan in non-MPI mode at runtime (using
>>   'export DARSHAN_ENABLE_NONMPI=1') to work around this problem. This
>>   would prevent Darshan from running collectives at shutdown time, but
>>   will result in a different log file for each process in your
>>   application.
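>>
>> For the first idea, a minimal sketch (with /bin/hostname just standing
>> in for the real command) would look roughly like:
>>
>>     /* Fork first and exec only in the child, so the original MPI rank
>>        survives and still reaches MPI_Finalize (and Darshan's shutdown
>>        collectives) as normal. */
>>     #include <mpi.h>
>>     #include <sys/types.h>
>>     #include <sys/wait.h>
>>     #include <unistd.h>
>>
>>     int main(int argc, char **argv)
>>     {
>>         int rank;
>>         MPI_Init(&argc, &argv);
>>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>         if (rank == 0) {
>>             pid_t pid = fork();
>>             if (pid == 0) {
>>                 execl("/bin/hostname", "hostname", (char *) 0);
>>                 _exit(127);            /* only reached if execl fails */
>>             }
>>             waitpid(pid, NULL, 0);     /* rank 0 carries on afterwards */
>>         }
>>         MPI_Barrier(MPI_COMM_WORLD);   /* completes on all ranks now */
>>         MPI_Finalize();
>>         return 0;
>>     }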
>>
>> Thanks,
>> --Shane
>> ------------------------------------------------------------------------
>> *From:* Darshan-users <darshan-users-bounces at lists.mcs.anl.gov> on
>> behalf of Adrian Jackson <a.jackson at epcc.ed.ac.uk>
>> *Sent:* Tuesday, July 12, 2022 8:13 AM
>> *To:* darshan-users at lists.mcs.anl.gov <darshan-users at lists.mcs.anl.gov>
>> *Subject:* [Darshan-users] darshan execl issue
>> Hi,
>>
>> I've encountered an issue using Darshan (3.3.1) with a code which calls
>> execl from one MPI process. When run with Darshan, the MPI job just
>> hangs. Is spawning processes from a subset of MPI processes an issue for
>> Darshan? I should say that I can still spawn processes (i.e. using fork)
>> and that seems to work, but using execl doesn't.
>>
>> cheers
>>
>> adrianj
>> --
>> Tel: +44 131 6506470 skype: remoteadrianj
>> The University of Edinburgh is a charitable body, registered in
>> Scotland, with registration number SC005336. Is e buidheann carthannais
>> a th’ ann an Oilthigh Dhùn Èideann, clàraichte an Alba, àireamh clàraidh
>> SC005336.
>> _______________________________________________
>> Darshan-users mailing list
>> Darshan-users at lists.mcs.anl.gov
>> https://lists.mcs.anl.gov/mailman/listinfo/darshan-users
>
> --
> Tel: +44 131 6506470 skype: remoteadrianj
--
Tel: +44 131 6506470 skype: remoteadrianj