[Swift-devel] IO overheads of swift wrapper scripts on BlueGene/P

Ioan Raicu iraicu at cs.uchicago.edu
Sun Oct 18 09:06:47 CDT 2009


Hi Allan,
I don't remember, but your Falkon-only run seemed to run OK, right? 
Didn't that also produce the output files that Swift is producing? Or is 
Swift doing an extra step, copying/moving files from one place to another 
after the computation terminates, and is that what takes so long? 
I'm just trying to understand the difference between the Falkon-only run 
and the Swift run.

Ioan

-- 
=================================================================
Ioan Raicu, Ph.D.
NSF/CRA Computing Innovation Fellow
=================================================================
Center for Ultra-scale Computing and Information Security (CUCIS)
Department of Electrical Engineering and Computer Science
Northwestern University
2145 Sheridan Rd, Tech M384 
Evanston, IL 60208-3118
=================================================================
Cel:   1-847-722-0876
Tel:   1-847-491-8163
Email: iraicu at eecs.northwestern.edu
Web:   http://www.eecs.northwestern.edu/~iraicu/
       https://wiki.cucis.eecs.northwestern.edu/
=================================================================
=================================================================




Allan Espinosa wrote:
> Here I tried one directory per job (Q0000130).  Three output files are
> expected per directory, all produced by a single job:
>
> Progress  2009-10-17 20:53:56.943503000-0500  LOG_START
>
> _____________________________________________________________________________
>
>         Wrapper
> _____________________________________________________________________________
>
> Job directory mode is: link on shared filesystem
> DIR=jobs/7/blastall-715ul5ij
> EXEC=/home/espinosa/workflows/jgi_blastp/blastall_wrapper
> STDIN=
> STDOUT=home/espinosa/workflows/jgi_blastp/oldtests/test3.2.7_3cpn.64ifs.192cpu/output/D0000000/Q0000130/out_Q0000130.sout
> STDERR=home/espinosa/workflows/jgi_blastp/oldtests/test3.2.7_3cpn.64ifs.192cpu/output/D0000000/Q0000130/out_Q0000130.serr
> DIRS=home/espinosa/workflows/jgi_blastp/oldtests/test3.2.7_3cpn.64ifs.192cpu/output/D0000000/Q0000130
> INF=
> OUTF=home/espinosa/workflows/jgi_blastp/oldtests/test3.2.7_3cpn.64ifs.192cpu/output/D0000000/Q0000130/out_Q0000130.out^home/espinosa/workflows/jgi_blastp/oldtests/test3.2.7_3cpn.64ifs.192cpu/output/D0000000/Q0000130/out_Q0000130.serr^home/espinosa/workflows/jgi_blastp/oldtests/test3.2.7_3cpn.64ifs.192cpu/output/D0000000/Q0000130/out_Q0000130.sout
> KICKSTART=
> ARGS=-p blastp -m 8 -e 1.0e-5 -FF -d /dataifs/nr -i
> /intrepid-fs0/users/espinosa/persistent/datasets/nr_bob/queries/mock_2seq/D0000000/SEQ0000130.fasta
> -o home/espinosa/workflows/jgi_blastp/oldtests/test3.2.7_3cpn.64ifs.192cpu/output/D0000000/Q0000130/out_Q0000130.out
> ARGC=13
> Progress  2009-10-17 20:53:58.656335000-0500  CREATE_JOBDIR
> Created job directory: jobs/7/blastall-715ul5ij
> Progress  2009-10-17 20:54:05.204962000-0500  CREATE_INPUTDIR
> Created output directory:
> jobs/7/blastall-715ul5ij/home/espinosa/workflows/jgi_blastp/oldtests/test3.2.7_3cpn.64ifs.192cpu/output/D0000000/Q0000130
> Progress  2009-10-17 20:54:15.498666000-0500  LINK_INPUTS
> Progress  2009-10-17 20:54:19.900786000-0500  EXECUTE
> Moving back to workflow directory
> /fuse/intrepid-fs0/users/espinosa/scratch/jgi-blastp_runs/blastp-test3.2.7_3cpn.64ifs.192cpu
> Progress  2009-10-17 21:20:23.390800000-0500  EXECUTE_DONE
> Job ran successfully
> Progress  2009-10-17 21:31:11.179664000-0500  COPYING_OUTPUTS
> Progress  2009-10-17 21:37:14.539569000-0500  RM_JOBDIR
> Progress  2009-10-17 21:38:24.220130000-0500  END
>
>
> COPYING_OUTPUTS still takes time.
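The wrapper's Progress lines above carry enough information to compute how long each stage took. A small helper for that (a sketch, not part of Swift; the log text is hard-coded here from the run above, and the timezone offset is ignored since every line shares it) might be:

```python
import re
from datetime import datetime, timedelta

# Progress lines copied from the wrapper log above.
LOG = """\
Progress  2009-10-17 20:53:56.943503000-0500  LOG_START
Progress  2009-10-17 20:53:58.656335000-0500  CREATE_JOBDIR
Progress  2009-10-17 20:54:05.204962000-0500  CREATE_INPUTDIR
Progress  2009-10-17 20:54:15.498666000-0500  LINK_INPUTS
Progress  2009-10-17 20:54:19.900786000-0500  EXECUTE
Progress  2009-10-17 21:20:23.390800000-0500  EXECUTE_DONE
Progress  2009-10-17 21:31:11.179664000-0500  COPYING_OUTPUTS
Progress  2009-10-17 21:37:14.539569000-0500  RM_JOBDIR
Progress  2009-10-17 21:38:24.220130000-0500  END
"""

PAT = re.compile(r"Progress\s+(\d{4}-\d\d-\d\d \d\d:\d\d:\d\d)\.(\d+)\S*\s+(\S+)")

def stage_durations(text):
    """Return [(stage, seconds until the next stage), ...] from Progress lines."""
    events = []
    for m in PAT.finditer(text):
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        # The wrapper logs nanoseconds; datetime only holds microseconds,
        # so keep the first six fractional digits.
        ts += timedelta(microseconds=int(m.group(2)[:6]))
        events.append((ts, m.group(3)))
    return [
        (stage, (nxt - ts).total_seconds())
        for (ts, stage), (nxt, _) in zip(events, events[1:])
    ]

for stage, secs in stage_durations(LOG):
    print(f"{stage:16s} {secs:8.1f} s")
```

For this run the EXECUTE stage itself took about 26 minutes, while EXECUTE_DONE-to-COPYING_OUTPUTS and COPYING_OUTPUTS-to-RM_JOBDIR added roughly 11 and 6 more minutes of filesystem work.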
>
> 2009/10/17 Michael Wilde <wilde at mcs.anl.gov>:
>   
>> Remember that any situation in which multiple IONs modify the same file or
>> directory (i.e., by creating files or directories in the same parent
>> directory) will cause severe contention and performance degradation on any
>> GPFS filesystem.
>>
>> In addition to creating many directories, you need to ensure that no single
>> file or directory is ever written to from multiple client nodes (e.g., IONs
>> on the BG/P) concurrently.
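One way to honor that constraint (a sketch only; the helper name and bucket count are invented here, this is not Swift's actual scheme) is to hash each job ID into a bucket subdirectory and give every job its own leaf, so no two jobs ever create files in the same parent:

```python
import hashlib
import os

def output_dir(base, job_id, buckets=256):
    """Map a job ID to a dedicated output subdirectory.

    Two levels: a hash bucket (spreads directory-create traffic across
    `buckets` parent directories) and a per-job leaf (so only one job,
    and hence one ION, ever writes inside it).
    """
    h = hashlib.md5(job_id.encode()).hexdigest()
    bucket = int(h[:8], 16) % buckets
    path = os.path.join(base, f"b{bucket:03d}", job_id)
    os.makedirs(path, exist_ok=True)
    return path
```

Each job would then write its out_*.out/.sout/.serr files under its own leaf; the only operation shared between jobs is creating a bucket directory, which happens at most `buckets` times over the whole workflow.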
>>     
>
> This workload spans just one PSET, so there are no other IONs
> contending for the directories.
>
>   
>> Have you done that in this workload, Allan?
>>
>> - Mike
>>
>>
>> On 10/17/09 2:59 AM, Allan Espinosa wrote:
>>     
>>> I was using 1000 files (or was it 3000?) per directory. It looks like
>>> I need to lower my ratio...
>>>
>>> -Allan
>>>
>>> 2009/10/16 Mihael Hategan <hategan at mcs.anl.gov>:
>>>       
>>>> On Fri, 2009-10-16 at 21:07 -0500, Allan Espinosa wrote:
>>>>         
>>>>> Progress  2009-10-16 18:00:33.756364000-0500  COPYING_OUTPUTS
>>>>> Progress  2009-10-16 18:08:19.970449000-0500  RM_JOBDIR
>>>>>           
>>>> Grr. 8 minutes spent COPYING_OUTPUTS.
>>>>
>>>> What would be useful is to aggregate all the accesses that happened on
>>>> that FS from all the relevant jobs, to see exactly what causes the
>>>> contention. I strongly suspect it's
>>>> home/espinosa/workflows/jgi_blastp/test3.4.7_3cpn.32ifs.192cpu/output/
>>>>
>>>> Pretty much all the outputs seem to go to that directory.
>>>>
>>>> I'm afraid, however, that the information in the logs is insufficient.
>>>> Running strace with the relevant options (tracing filesystem calls only)
>>>> may be useful if you want to try.
>>>>
>>>> Alternatively, you could try to spread your output over multiple
>>>> directories and see what the difference is.
>>>>
>>>> Also, it may be interesting to see how the delay depends on the number
>>>> of contending processes, so that we know how many processes we can let
>>>> compete for a shared resource without causing too much trouble.
>>>>
>>>> Mihael
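Mihael's last suggestion, measuring how delay grows with the number of contending writers, could be prototyped with a small harness like the following (a local sketch: it uses threads creating files in one temporary directory as a stand-in, whereas the real measurement would put one writer on each BG/P compute node against a shared GPFS directory):

```python
import os
import tempfile
import threading
import time

def _creator(directory, n_files, tag, results):
    """Create n_files empty files in `directory`; record elapsed seconds."""
    start = time.perf_counter()
    for i in range(n_files):
        with open(os.path.join(directory, f"f{tag}_{i}"), "w"):
            pass
    results[tag] = time.perf_counter() - start

def contention_time(n_writers, n_files=200):
    """Average wall time per writer when n_writers writers create files
    concurrently in ONE shared directory."""
    with tempfile.TemporaryDirectory() as d:
        results = {}
        threads = [
            threading.Thread(target=_creator, args=(d, n_files, t, results))
            for t in range(n_writers)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return sum(results.values()) / n_writers

if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(f"{n:2d} writers: {contention_time(n):.3f} s average")
```

Plotting the average time against the writer count would show where the shared-directory penalty starts to dominate, which is exactly the limit Mihael is asking about.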
>>>>


