[petsc-dev] testing in parallel

Scott Kruger kruger at txcorp.com
Mon May 6 14:10:49 CDT 2019



@bsmith -- this long message addresses your other messages as well.


Regarding why `make -jXX` gives a different number of tests
depending on the value of XX:

The issue here is how we handle failed tests.
The general paradigm in the shell script is:

----------------
    petsc_testrun "${mpiexec} -n ${nsize} ${exec} ...

    res=$?

    if test $res = 0; then
       petsc_testrun "${diff_exe} ...
    else
       printf "ok ${label} # SKIP Command failed so no diff\n"
    fi
----------------


If a run fails, we skip the diff and don't record that test, since
skipped tests aren't counted (we don't report SKIPs or TODOs by default).
In other words, a successful invocation of a run counts as 2 tests
(running and diffing), but a failed run counts as only 1 test (just the
run, reported as a failure).  For example, if 5 of 100 runs time out,
the summary reports 195 tests instead of 200: 95 runs, 95 diffs, and 5
failed runs.

So the real question is:
   Why does the "-j20" case have more run failures?

These are problems that Barry has been reporting both here and in
private email messages.

For my tests, these are the mat tests that fail with make -j20:
mat_tests-ex23_10
mat_tests-ex23_2
mat_tests-ex23_3
mat_tests-ex23_4
mat_tests-ex23_5
mat_tests-ex23_9
mat_tests-ex23_vscat_default
mat_tests-ex23_vscat_sf
mat_tutorials-ex12_1


All of the mat_tests-ex23* tests fail with timeouts.

mat_tutorials-ex12_1 fails without producing any stderr, so I don't
really know what's going on there.

Barry has reported hard crashes that don't occur when running the script
by hand.  I assume that this is related to a lack of resources when
running in parallel, but that's speculative.


I am surprised at how consistent this seems to be -- the differences are 
pretty reproducible if I don't have much else going on with my laptop.
Perhaps this suggests a solution.

The fact that mat_tests-ex23* is a problem could have been predicted by
just looking at my 'make -j1' run and seeing which tests took the most
time:
# Timing summary (actual test time / total CPU time):
#   mat_tests-ex23_4: 13.29 sec / 15.08 sec
#   mat_tests-ex23_9: 12.20 sec / 14.19 sec
#   mat_tests-ex23_10: 9.72 sec / 11.36 sec
#   mat_tests-ex23_3: 7.22 sec / 8.33 sec
#   mat_tests-ex23_vscat_default: 7.06 sec / 8.22 sec

Unsurprisingly, when I coded up the dependencies, I listed them
sequentially, and I assume that gmake's parallelization just does a
queue-based task distribution.  That is, mat_tests-ex23_4 is invoked
immediately after mat_tests-ex23_3, which means gmake runs these
expensive tests at the same time, as the toy model below illustrates.
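
As a toy model of that queue behavior (this is not gmake's actual
scheduler, just an illustration): with -jXX, gmake starts roughly the
first XX ready targets at once, so adjacent expensive tests land in the
worker slots simultaneously.

----------------
def first_wave(test_names, jflag):
    # Toy model: the first jflag ready targets start together.
    return test_names[:jflag]

tests = ["mat_tests-ex23_%d" % i for i in range(1, 11)] + ["other_1"]
print(first_wave(tests, 4))
# ['mat_tests-ex23_1', 'mat_tests-ex23_2',
#  'mat_tests-ex23_3', 'mat_tests-ex23_4']
----------------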

A crude method of trying to ameliorate these problems would be to
randomize the dependency list.  In this example, the goal would be to
prevent multiple ex23 executables from being invoked at the same time;
a sketch of the idea follows.
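
Something like this, assuming a deterministic shuffle at generation
time (the function and argument names here are hypothetical, not the
actual gmakegentest.py interface):

----------------
import random

def shuffle_test_order(test_names, seed=1):
    # A fixed seed keeps the generated makefile deterministic across
    # regenerations while still separating mat_tests-ex23_3 from
    # mat_tests-ex23_4, etc.
    shuffled = list(test_names)
    random.Random(seed).shuffle(shuffled)
    return shuffled
----------------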

Of course, a better method would be to use some type of our own
round-robin distribution based on an expected "JFLAG" value (the
anticipated make -jXX setting).  That could perhaps be a flag passed to
config/gmakegentest.py; a sketch follows.
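
Again a hypothetical sketch rather than existing gmakegentest.py code:
deal the tests into JFLAG slots and concatenate, so tests that were
adjacent in the original list end up roughly len/JFLAG apart in the
generated dependency order:

----------------
def round_robin_test_order(test_names, jflag=20):
    # Deal tests into jflag slots, then concatenate slot by slot.
    # Originally-adjacent tests (usually variants of the same
    # executable) end up far apart, so the first jflag targets that
    # gmake starts are jflag different tests rather than a cluster
    # of mat_tests-ex23_* variants competing for the same resources.
    slots = [[] for _ in range(jflag)]
    for i, name in enumerate(test_names):
        slots[i % jflag].append(name)
    ordered = []
    for slot in slots:
        ordered.extend(slot)
    return ordered
----------------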

Comments welcome.

Scott








On 4/29/19 5:04 PM, Scott Kruger via petsc-dev wrote:
> 
> 
> FYI -- I have reproduced all the problems but am still looking at it.
> 
> I thought perhaps it would be something about the globsearch's 
> invocation of python, but it's not -- I get the same thing even with 
> gmake's native filter (and in fact, it appears to be worse).
> 
> I'm getting something funny in the counts directory, which is where
> each individual run stores its output, but I need more testing to
> figure out what's going on.
> 
> Scott
> 
> 
> On 4/22/19 11:00 PM, Jed Brown via petsc-dev wrote:
>> I don't know how this would happen and haven't noticed it myself.
>> Perhaps Scott can help investigate.  It would help to know which tests
>> run in each case.  To debug, I would make a dry-run or skip-all mode
>> that skips actually running the tests and just reports success (or
>> skip).
>>
>> Stefano Zampini <stefano.zampini at gmail.com> writes:
>>
>>> The print-test target seems ok wrt race conditions
>>>
>>> [szampini at localhost petsc]$ make -j1 -f gmakefile.test print-test  
>>> globsearch="mat*" | wc
>>>        1     538   11671
>>> [szampini at localhost petsc]$ make -j20 -f gmakefile.test print-test  
>>> globsearch="mat*" | wc
>>>        1     538   11671
>>>
>>> However, if I run the tests, I get two different outputs
>>>
>>> [szampini at localhost petsc]$ make -j20 -f gmakefile.test test 
>>> globsearch="mat*"
>>> [..]
>>> # -------------
>>> #   Summary
>>> # -------------
>>> # success 1226/1312 tests (93.4%)
>>> # failed 0/1312 tests (0.0%)
>>> # todo 6/1312 tests (0.5%)
>>> # skip 80/1312 tests (6.1%)
>>>
>>> [szampini at localhost petsc]$ make -j20 -f gmakefile.test test 
>>> globsearch="mat*"
>>> [..]
>>> # -------------
>>> #   Summary
>>> # -------------
>>> # success 990/1073 tests (92.3%)
>>> # failed 0/1073 tests (0.0%)
>>> # todo 6/1073 tests (0.6%)
>>> # skip 77/1073 tests (7.2%)
>>>
>>>> On Apr 22, 2019, at 8:12 PM, Jed Brown <jed at jedbrown.org> wrote:
>>>>
>>>> Stefano Zampini via petsc-dev <petsc-dev at mcs.anl.gov> writes:
>>>>
>>>>> Scott,
>>>>>
>>>>> I have noticed that make -j20 -f gmakefile.test test 
>>>>> globsearch="mat*" does
>>>>> not always run the same number of tests. How hard is to fix this race
>>>>> condition in the generation of the rules?
>>>>
>>>> Can you reproduce with the print-test target?  These are just running
>>>> Python to create a list of targets, and should all take place before
>>>> executing rules.
> 

-- 
Tech-X Corporation               kruger at txcorp.com
5621 Arapahoe Ave, Suite A       Phone: (720) 974-1841
Boulder, CO 80303                Fax:   (303) 448-7756

