[Swift-devel] IO overheads of swift wrapper scripts on BlueGene/P
Allan Espinosa
aespinosa at cs.uchicago.edu
Sat Oct 17 21:57:45 CDT 2009
Here I tried one directory per job (Q0000130). Three output files are
expected per directory, all produced by a single job:
Progress 2009-10-17 20:53:56.943503000-0500 LOG_START
_____________________________________________________________________________
Wrapper
_____________________________________________________________________________
Job directory mode is: link on shared filesystem
DIR=jobs/7/blastall-715ul5ij
EXEC=/home/espinosa/workflows/jgi_blastp/blastall_wrapper
STDIN=
STDOUT=home/espinosa/workflows/jgi_blastp/oldtests/test3.2.7_3cpn.64ifs.192cpu/output/D0000000/Q0000130/out_Q0000130.sout
STDERR=home/espinosa/workflows/jgi_blastp/oldtests/test3.2.7_3cpn.64ifs.192cpu/output/D0000000/Q0000130/out_Q0000130.serr
DIRS=home/espinosa/workflows/jgi_blastp/oldtests/test3.2.7_3cpn.64ifs.192cpu/output/D0000000/Q0000130
INF=
OUTF=home/espinosa/workflows/jgi_blastp/oldtests/test3.2.7_3cpn.64ifs.192cpu/output/D0000000/Q0000130/out_Q0000130.out^home/espinosa/workflows/jgi_blastp/oldtests/test3.2.7_3cpn.64ifs.192cpu/output/D0000000/Q0000130/out_Q0000130.serr^home/espinosa/workflows/jgi_blastp/oldtests/test3.2.7_3cpn.64ifs.192cpu/output/D0000000/Q0000130/out_Q0000130.sout
KICKSTART=
ARGS=-p blastp -m 8 -e 1.0e-5 -FF -d /dataifs/nr -i
/intrepid-fs0/users/espinosa/persistent/datasets/nr_bob/queries/mock_2seq/D0000000/SEQ0000130.fasta
-o home/espinosa/workflows/jgi_blastp/oldtests/test3.2.7_3cpn.64ifs.192cpu/output/D0000000/Q0000130/out_Q0000130.out
ARGC=13
Progress 2009-10-17 20:53:58.656335000-0500 CREATE_JOBDIR
Created job directory: jobs/7/blastall-715ul5ij
Progress 2009-10-17 20:54:05.204962000-0500 CREATE_INPUTDIR
Created output directory:
jobs/7/blastall-715ul5ij/home/espinosa/workflows/jgi_blastp/oldtests/test3.2.7_3cpn.64ifs.192cpu/output/D0000000/Q0000130
Progress 2009-10-17 20:54:15.498666000-0500 LINK_INPUTS
Progress 2009-10-17 20:54:19.900786000-0500 EXECUTE
Moving back to workflow directory
/fuse/intrepid-fs0/users/espinosa/scratch/jgi-blastp_runs/blastp-test3.2.7_3cpn.64ifs.192cpu
Progress 2009-10-17 21:20:23.390800000-0500 EXECUTE_DONE
Job ran successfully
Progress 2009-10-17 21:31:11.179664000-0500 COPYING_OUTPUTS
Progress 2009-10-17 21:37:14.539569000-0500 RM_JOBDIR
Progress 2009-10-17 21:38:24.220130000-0500 END
COPYING_OUTPUTS still takes time: roughly six minutes elapse between the
COPYING_OUTPUTS and RM_JOBDIR timestamps above, on top of the ~11 minutes
between EXECUTE_DONE and COPYING_OUTPUTS.
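For reference, a quick way to see how long each wrapper phase took is to
diff consecutive Progress timestamps. A small shell sketch (assumes GNU
date; the log file name wrapper.log is just a placeholder):

  grep '^Progress' wrapper.log | while read -r _ d t event; do
    now=$(date -d "$d ${t%.*}" +%s)   # strip fractional seconds and timezone
    [ -n "$prev" ] && echo "$prev_event -> $event: $((now - prev)) s"
    prev=$now; prev_event=$event
  done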
2009/10/17 Michael Wilde <wilde at mcs.anl.gov>:
> Remember that any situation in which multiple IONs modify the same file or
> directory (i.e., by creating files or directories in the same parent
> directory) will cause severe contention and performance degradation on any
> GPFS filesystem.
>
> In addition to creating many directories, you need to ensure that no single
> file or directory is ever likely to be written to from multiple client
> nodes (e.g., IONs on the BG/P) concurrently.
This workload runs over just 1 PSET, so there are no other IONs
contending for the directories.
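(If shared output directories do turn out to be the bottleneck, the
"spread your output over multiple directories" idea Mihael mentions below
could be done by hashing each query id into a small subtree before the
run. Just a sketch with made-up names, not what the current mappers do:

  # e.g. create output/3f/Q0000130 instead of piling every Q dir under one parent
  for q in Q*; do                                   # hypothetical: one name per query
    sub=$(printf '%s' "$q" | md5sum | cut -c1-2)    # 2 hex chars -> up to 256 buckets
    mkdir -p "output/$sub/$q"
  done
)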
>
> Have you done that in this workload, Allan?
>
> - Mike
>
>
> On 10/17/09 2:59 AM, Allan Espinosa wrote:
>>
>> I was using 1000 files (or was it 3000?) per directory. It looks like
>> I need to lower my ratio...
>>
>> -Allan
>>
>> 2009/10/16 Mihael Hategan <hategan at mcs.anl.gov>:
>>>
>>> On Fri, 2009-10-16 at 21:07 -0500, Allan Espinosa wrote:
>>>>
>>>> Progress 2009-10-16 18:00:33.756364000-0500 COPYING_OUTPUTS
>>>> Progress 2009-10-16 18:08:19.970449000-0500 RM_JOBDIR
>>>
>>> Grr. 8 minutes spent COPYING_OUTPUTS.
>>>
>>> What would be useful is to aggregate all the accesses that happened on
>>> that FS from all the relevant jobs, to see exactly what causes the
>>> contention. I strongly suspect it's
>>> home/espinosa/workflows/jgi_blastp/test3.4.7_3cpn.32ifs.192cpu/output/
>>>
>>> Pretty much all the outputs seem to go to that directory.
>>>
>>> I'm afraid however that the information in the logs is insufficient.
>>> Strace with relevant options (for fs calls only) may be useful if you
>>> want to try.
>>>
>>> Alternatively, you could try to spread your output over multiple
>>> directories and see what the difference is.
>>>
>>> Also, it may be interesting to see how the delay depends on the number of
>>> contending processes, so that we know how many processes we can allow to
>>> compete for a shared resource without causing too much trouble.
>>>
>>> Mihael
>>>
>>>
>>>
>>
>>
>>
>
>
--
Allan M. Espinosa <http://allan.88-mph.net/blog>
PhD student, Computer Science
University of Chicago <http://people.cs.uchicago.edu/~aespinosa>