[Darshan-users] LD_PRELOAD system() call

Cristian Simarro cristian.simarro at ecmwf.int
Tue May 5 06:09:05 CDT 2015


Hi Phil,

I have tried your test on our Cray but I cannot reproduce the behaviour. The test program actually finishes properly.

Can you add traceback information about the call?
    
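    /* Needs _GNU_SOURCE before <dlfcn.h> (Dl_info, dladdr, RTLD_NEXT),
     * plus <stdio.h>, <string.h> and <unistd.h>. */
    Dl_info dli;

    original_read = dlsym(RTLD_NEXT, "read");
    /* dladdr() returns 0 if the address cannot be matched to an object. */
    if (dladdr(original_read, &dli) != 0 && dli.dli_fname != NULL) {
        const char *slash = strrchr(dli.dli_fname, '/');

        fprintf(stderr, "debug trace [%d]: %s "
                        "called by %p [ %s(%p) %s(%p) ].\n",
                        getpid(), __func__,
                        __builtin_return_address(0),
                        slash ? slash + 1 : dli.dli_fname,
                        dli.dli_fbase,
                        dli.dli_sname ? dli.dli_sname : "?",
                        dli.dli_saddr);
    }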

Best regards,
Cristian

------------------------------------------------------------------
Cristian Simarro
Analyst, User Support Section
European Centre for Medium-Range Weather Forecasts (ECMWF)
Shinfield Park, Reading, RG2 9AX, United Kingdom
Tel:    (+44 118) 9499315                Fax:    (+44 118) 9869450
E-mail: Cristian.Simarro at ecmwf.int            http://www.ecmwf.int
------------------------------------------------------------------

----- Original Message -----
From: "Phil Carns" <carns at mcs.anl.gov>
To: darshan-users at lists.mcs.anl.gov
Sent: Sunday, 3 May, 2015 3:35:46 PM
Subject: Re: [Darshan-users] LD_PRELOAD system() call

Hi Cristian,

This is definitely the same problem that Kalyana and I were looking at 
earlier with fork().  I built a very small example reproducer library to 
try to simplify the problem (see Makefile and read-wrapper.c).  The 
read-wrapper.c isn't doing anything except intercepting the read() 
function, printing some information, then calling the real read() 
function.  I've been building this library with PrgEnv-gnu.
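
For readers without the attachments, a wrapper of the kind described above
might look roughly like the following. This is a minimal sketch, not the
attached read-wrapper.c: the name real_read and the log format are made up
for illustration, and it assumes glibc, where RTLD_NEXT needs _GNU_SOURCE.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          /* for RTLD_NEXT (glibc) */
#endif
#include <dlfcn.h>
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical name: pointer to the next read() in search order,
 * which is normally the C library's. */
static ssize_t (*real_read)(int, void *, size_t);

/* Interposed read(): log the call, then forward it unchanged. */
ssize_t read(int fd, void *buf, size_t count)
{
    if (real_read == NULL) {
        /* Resolve lazily on first use. */
        real_read = (ssize_t (*)(int, void *, size_t))dlsym(RTLD_NEXT, "read");
        if (real_read == NULL) {
            errno = ENOSYS;
            return -1;
        }
    }
    fprintf(stderr, "read-wrapper [%d]: read(fd=%d, count=%zu)\n",
            (int)getpid(), fd, count);
    return real_read(fd, buf, count);
}
```

Built as a shared object (for example with gcc -shared -fPIC, plus -ldl on
older glibc) and listed in LD_PRELOAD, this would log every read() the
program makes. Any file or library names used for that are placeholders.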

The test.c is your example program, and the 
test-preload-read-wrapper.pbs.e* is an example stderr file from trying 
to run it with the example read wrapper library preloaded.

There isn't any Darshan code involved here, but the example still 
hangs.  It looks like you could trigger it with *any* wrapper on the 
read() function in the Cray environment in conjunction with a fork() or 
system() call.  Maybe there is some sort of recursion here?
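
One way to probe the recursion question would be to add a per-thread depth
counter to such a wrapper. The sketch below is an illustration for
Linux/glibc under stated assumptions, not something tested on the Cray:
depth and next_read are hypothetical names, and on re-entry it bypasses the
wrapper through the raw syscall() interface so that a loop cannot continue.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          /* for RTLD_NEXT and syscall() (glibc) */
#endif
#include <dlfcn.h>
#include <errno.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

static ssize_t (*next_read)(int, void *, size_t);
static __thread int depth;   /* nesting level of read() on this thread */

ssize_t read(int fd, void *buf, size_t count)
{
    ssize_t ret;

    if (depth > 0) {
        /* Re-entered from inside the wrapper: report it with a plain
         * write() and go straight to the kernel. */
        static const char msg[] = "read-wrapper: re-entered read()\n";
        ssize_t ignored = write(STDERR_FILENO, msg, sizeof msg - 1);
        (void)ignored;
        return syscall(SYS_read, fd, buf, count);
    }

    depth++;
    if (next_read == NULL)
        next_read = (ssize_t (*)(int, void *, size_t))dlsym(RTLD_NEXT, "read");
    if (next_read == NULL) {
        errno = ENOSYS;
        ret = -1;
    } else {
        fprintf(stderr, "read-wrapper [%d]: read(fd=%d)\n", (int)getpid(), fd);
        ret = next_read(fd, buf, count);
    }
    depth--;
    return ret;
}
```

If the "re-entered" message shows up in the child's stderr before the hang,
that would point towards recursion; if it never appears, the cause is
probably elsewhere.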

I'll keep thinking about this some, but I thought I would share what I'm 
seeing with the list in case anyone else has an idea.

thanks,
-Phil

On 05/01/2015 12:06 PM, Carns, Philip H. wrote:
> Hi Cristian,
>
> I was testing on Edison, an XC30 system at NERSC.  I compiled with
> cray-mpich 7.1.1, and I think it is using Torque as the batch system.
> FYI, to run this example program I have to launch the executable using
> aprun (otherwise MPI won't initialize properly). I think this will be
> reproducible with non-MPI programs as well, though.
>
> thanks,
> -Phil
>
> On 05/01/2015 04:53 AM, Cristian Simarro wrote:
>> Hi Phil,
>>
>> Could you please tell me which batch system you are using on your Cray machine? Is the MPI implementation cray-mpich?
>>
>> Thanks,
>> Cristian
>>
>> ----- Original Message -----
>> From: "Phil Carns" <carns at mcs.anl.gov>
>> To: "Cristian Simarro" <cristian.simarro at ecmwf.int>
>> Cc: darshan-users at lists.mcs.anl.gov
>> Sent: Thursday, 30 April, 2015 8:45:11 PM
>> Subject: Re: [Darshan-users] LD_PRELOAD system() call
>>
>> Thanks for the test program, Cristian. I can confirm that it hangs with
>> LD_PRELOAD on a Cray, but not on a Linux workstation.  I'm not exactly
>> sure what the underlying difference is in this case, but it is
>> definitely 100% reproducible in the Cray environment.
>>
>> Kalyana Chadalavada has actually observed something very similar when
>> using fork() directly; I imagine that it is the underlying fork() within
>> the system() call that is causing the problem.
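
As a rough illustration of why fork() is implicated, system() behaves
conceptually like the fork/exec/waitpid sequence below. This is a simplified
sketch rather than the C library's actual implementation (which also
manipulates signals and may use a different process-creation primitive);
run_shell_command is a hypothetical name.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Conceptual stand-in for system(cmd), without the signal handling.
 * Returns the wait status of the shell, or -1 on error. */
int run_shell_command(const char *cmd)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;                       /* fork failed */
    if (pid == 0) {
        /* Child: inherits the parent's environment, LD_PRELOAD included. */
        execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
        _exit(127);                      /* only reached if exec failed */
    }
    /* Parent: blocks here until the child exits. */
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return status;
}
```

The child inherits LD_PRELOAD, so the preloaded library is loaded again in
/bin/sh. If that child never exits, the parent stays blocked in waitpid().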
>>
>> thanks,
>> -Phil
>>
>> On 04/30/2015 03:07 AM, Cristian Simarro wrote:
>>> Hi Phil,
>>>
>>> Actually, any command run through a system() call triggers the problem. The spawned process does not finish, and the task that issued the call then hangs in waitpid().
>>>
>>> This example hangs if we are using the LD_PRELOAD mechanism:
>>>
>>> #include <stdio.h>
>>> #include <mpi.h>
>>> #include <stdlib.h>
>>>
>>> int main (int argc, char *argv[])
>>> {
>>>      int rank, size;
>>>      int ret;
>>>
>>>      MPI_Init (&argc, &argv);
>>>      MPI_Comm_rank (MPI_COMM_WORLD, &rank);
>>>      MPI_Comm_size (MPI_COMM_WORLD, &size);
>>>      if(rank == 0) {
>>>          ret = system("echo calling system");
>>>      }
>>>      printf( "Hello world from process %d of %d\n", rank, size );
>>>      MPI_Finalize();
>>>      return 0;
>>> }
>>>
>>> Thanks,
>>> Cristian
>>>
>>> ----- Original Message -----
>>> From: "Phil Carns" <carns at mcs.anl.gov>
>>> To: darshan-users at lists.mcs.anl.gov
>>> Sent: Wednesday, 29 April, 2015 10:13:54 PM
>>> Subject: Re: [Darshan-users] LD_PRELOAD system() call
>>>
>>> On 04/29/2015 02:54 PM, Phil Carns wrote:
>>>> On 04/29/2015 12:17 PM, Cristian Simarro wrote:
>>>>> Hello,
>>>>>
>>>>> We have been facing some problems with system() calls inside some
>>>>> C/Fortran codes on our Cray machine.
>>>>>
>>>>> The method used here is to compile dynamically and then use LD_PRELOAD.
>>>>> When the code calls system(command), it hangs the execution if
>>>>> preloaded with Darshan because it is trying to instrument an internal
>>>>> system read() with no initialization.
>>>>>
>>>>> The solution we have designed is to unset LD_PRELOAD (if set before)
>>>>> in the darshan_mpi_initialize function.
>>>>>
>>>>> Has anybody found a similar problem with LD_PRELOAD + system() calls?
>>>> Hi Cristian,
>>>>
>>>> I don't think I've seen this exact combination before, but it seems
>>>> like something we should be able to reproduce and isolate.
>>>>
>>>> If I understand correctly, it sounds like the underlying process
>>>> spawned by system() is inheriting the LD_PRELOAD environment variable
>>>> from the parent program, and it is the underlying process that is
>>>> getting hung?  If so, does it matter what you run in the system() call
>>>> or does it seem like pretty much anything triggers it?
>>>>
>>>> thanks,
>>>> -Phil
>>> The solution you have suggested (unsetting LD_PRELOAD programmatically
>>> during Darshan initialization) might not be a bad long term solution,
>>> maybe with some extra safety logic to make sure we don't accidentally
>>> unset unrelated LD_PRELOAD entries.  I imagine that once the application
>>> has gotten to darshan initialization, then the loader has already
>>> processed the LD_PRELOAD environment variable and we don't need to keep
>>> it set any longer.  That would help keep it from interfering with child
>>> processes.
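
A sketch of that safety logic, assuming the library can be recognised by a
substring of its file name: the function below removes only matching entries
from LD_PRELOAD and leaves the rest in place. strip_from_ld_preload and the
"libdarshan" needle are illustrative assumptions, not existing Darshan code.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* strdup, strtok_r, setenv, unsetenv */
#endif
#include <stdlib.h>
#include <string.h>

/* Remove every LD_PRELOAD entry containing `needle` and keep the others.
 * Returns 0 on success, -1 on error. */
int strip_from_ld_preload(const char *needle)
{
    const char *cur = getenv("LD_PRELOAD");
    char *copy, *kept, *tok, *save = NULL;
    int rc;

    if (cur == NULL || *cur == '\0')
        return 0;                      /* nothing to do */

    copy = strdup(cur);
    kept = malloc(strlen(cur) + 1);    /* result is never longer than input */
    if (copy == NULL || kept == NULL) {
        free(copy);
        free(kept);
        return -1;
    }
    kept[0] = '\0';

    /* The dynamic loader accepts both spaces and colons as separators. */
    for (tok = strtok_r(copy, " :", &save); tok != NULL;
         tok = strtok_r(NULL, " :", &save)) {
        if (strstr(tok, needle) != NULL)
            continue;                  /* drop the matching library */
        if (kept[0] != '\0')
            strcat(kept, ":");
        strcat(kept, tok);
    }

    rc = (kept[0] != '\0') ? setenv("LD_PRELOAD", kept, 1)
                           : unsetenv("LD_PRELOAD");
    free(copy);
    free(kept);
    return rc;
}
```

Calling strip_from_ld_preload("libdarshan") once during initialization would
leave other preloaded libraries in place for child processes. Changing the
variable does not unload anything from the current process; it only affects
what children inherit.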
>>>
>>> We would definitely need to do some testing to confirm, though.
>>>
>>> thanks,
>>> -Phil
>>> _______________________________________________
>>> Darshan-users mailing list
>>> Darshan-users at lists.mcs.anl.gov
>>> https://lists.mcs.anl.gov/mailman/listinfo/darshan-users