[ExM Users] mkstatic questions

Ketan Maheshwari ketan at mcs.anl.gov
Fri May 30 14:38:52 CDT 2014


Discussed this with Justin and rebuilt Turbine, ADLB, et al. with bgxlcxx.
Now I am getting the following ADLB-related error:

$ cat 272932.output
   0.001 MODE: ENGINE
   0.001 MODE: SERVER
   0.001 MODE: WORKER
   0.001 MODE: WORKER
   0.002 ENGINES: 1 RANKS: 0 - 0
   0.003 WORKERS: 2 RANKS: 1 - 2
   0.003 SERVERS: 1 RANKS: 3 - 3
   0.008 function:swift:constants
   0.009 allocated string: c:s_500=<1>
   0.010 store: <1>="500"
   0.011 allocated string: c:s_database=<2>
   0.011 store: <2>="-database"
   0.012 allocated string: c:s_ex1=<3>
   0.012 store: <3>="-ex1"
   0.013 allocated string: c:s_ex2aro=<4>
   0.013 store: <4>="-ex2aro"
   0.014 allocated string: c:s_hlac97DFAM=<5>
   0.014 store: <5>="hlac-97-D-FAMPNAQTA_complex_0001_swift.sc"
   0.015 allocated string: c:s_homevsachd=<6>
   0.015 store: <6>="/home/vsachde/ROSETTA/new-benchmark/minirosetta_database/"
   0.016 allocated string: c:s_nstruct=<7>
   0.016 store: <7>="-nstruct"
   0.017 allocated string: c:s_overwrite=<8>
   0.017 store: <8>="-overwrite"
   0.018 allocated string: c:s_pep_refine=<9>
   0.018 store: <9>="-pep_refine"
   0.019 allocated string: c:s_projectsEx=<10>
   0.019 store: <10>="/projects/ExM/hlac-97-D/hlac-97-D-FAMPNAQTA_complex_0001.pdb"
   0.020 allocated string: c:s_s=<11>
   0.020 store: <11>="-s"
   0.020 allocated string: c:s_scorefile=<12>
   0.021 store: <12>="-scorefile"
   0.021 allocated string: c:s_use_input_=<13>
   0.022 store: <13>="-use_input_sc"
   0.023 enter function: main
ADLB_DATA_CHECK FAILED: src/handlers.c:827
   0.023 allocated t:0=<14> t:1=<1074790400>
ADLB_CHECK FAILED: src/server.c:xlb_handle_pending():330
   0.024 array_kv_build: <1074790400> 13 elems, write_decr 1
ADLB_CHECK FAILED: src/server.c:serve_several():261
ADLB_CHECK FAILED: src/server.c:ADLB_Server():218
CAUGHT ERROR:

error: adlb::server: SERVER FAILED


    invoked from within
"adlb::server "
    (procedure "enter_mode_unchecked" line 5)
    invoked from within
"enter_mode_unchecked $rules $engine_startup"
    (procedure "enter_mode" line 10)
    invoked from within
"enter_mode $rules $engine_startup "
CALLING adlb::abort
ADLB_Abort(1)
MPI_Abort(1)


Any ideas?
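
Note on reading the trace: the CHECK FAILED lines appear to unwind from the
innermost failing check (src/handlers.c:827) out to ADLB_Server(), interleaved
with ordinary log lines from other ranks. A minimal sketch for pulling that
chain out of a saved output file follows. The helper name failed_checks and
the regular expression are assumptions based only on the lines shown above;
they are not part of Turbine or ADLB.

```python
import re

# Pattern inferred from the output above (an assumption, not an ADLB-defined format):
#   ADLB_DATA_CHECK FAILED: src/handlers.c:827
#   ADLB_CHECK FAILED: src/server.c:serve_several():261
CHECK_RE = re.compile(r"^(ADLB_\w*CHECK) FAILED: (\S+?):(?:(\w+)\(\):)?(\d+)$")


def failed_checks(text):
    """Return (check, file, function_or_None, line) tuples in order of appearance."""
    chain = []
    for raw in text.splitlines():
        m = CHECK_RE.match(raw.strip())
        if m:
            check, path, func, lineno = m.groups()
            chain.append((check, path, func, int(lineno)))
    return chain


# Excerpt copied from the output above.
SAMPLE = """\
ADLB_DATA_CHECK FAILED: src/handlers.c:827
   0.023 allocated t:0=<14> t:1=<1074790400>
ADLB_CHECK FAILED: src/server.c:xlb_handle_pending():330
   0.024 array_kv_build: <1074790400> 13 elems, write_decr 1
ADLB_CHECK FAILED: src/server.c:serve_several():261
ADLB_CHECK FAILED: src/server.c:ADLB_Server():218
"""

if __name__ == "__main__":
    for check, path, func, lineno in failed_checks(SAMPLE):
        where = f"{path}:{lineno}" + (f" in {func}()" if func else "")
        print(f"{check} failed at {where}")
```

Listing the checks in order makes it easier to quote the exact failing
location (here src/handlers.c:827) when asking for help.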


On Fri, May 30, 2014 at 1:18 PM, Maheshwari, Ketan C. <ketan at mcs.anl.gov>
wrote:

>  Thanks! To narrow down and eliminate application-ADLB MPI issues, I am
> now trying to rebuild the application with bgxlc++ instead of bgmpixlcxx,
> with which it was originally built. Will keep you posted on how things
> work out.
>
>
> On Fri, May 30, 2014 at 12:49 PM, Tim Armstrong
> <tim.g.armstrong at gmail.com> wrote:
>
>>  I see.  Based on the two lines of output you sent me, the problem has
>> something to do with message sizes in MPI.  Assuming your app isn't using
>> MPI internally, it's probably some communication that ADLB is doing.  The
>> error message would generally be caused by a mismatch of message size
>> between sender and receiver.  The most likely explanation in the ADLB
>> codebase is that the sender and receiver somehow disagree on the sizes of
>> structs, which doesn't make a whole lot of sense unless something strange
>> was done during the build process, e.g. one file was compiled with
>> different compiler settings, or you somehow linked to different versions of
>> a function.
>>
>> It's possible that it's a bug in the ADLB codebase that has nothing to do
>> with how it was built, but it seems unlikely that something like that would
>> have escaped all the tests.  It might help to look at the Tcl or Swift code
>> that's being run, as well as to make sure that it runs correctly in a
>> different environment.
>>
>> It would also be helpful to have a full log of the program output with
>> debug logging enabled, since that will tell me what ADLB was doing at the
>> time.
>>
>> I'm not sure if I can help with debugging the problem without more info.
>>
>>  - Tim
>>
>>
>> On Fri, May 30, 2014 at 11:33 AM, Ketan Maheshwari <ketan at mcs.anl.gov>
>> wrote:
>>
>>> I rebuilt the application recently without MPI. It seems to be working
>>> outside of Swift on Cetus compute nodes.
>>>
>>>
>>> On Fri, May 30, 2014 at 11:18 AM, Tim Armstrong <
>>> tim.g.armstrong at gmail.com> wrote:
>>>
>>>>  Regarding the MPI error - that seems strange.  There are multiple
>>>> places in the code where it might originate.
>>>>
>>>>  One possible cause is if something funny happened in compiling/linking
>>>> - e.g. multiple compilers or versions of things linked together.  Have you
>>>> tried running the code locally?
>>>>
>>>> I'm a little perplexed because MPI tag 4 shouldn't be used in your
>>>> application - the message type (Iget) is only really used for
>>>> gemtc/coasters applications.  It would be helpful to debug further if I
>>>> could get a log from the run with ADLB debugging enabled at compile time
>>>> (--enable-log-debug for the ADLB configure stage, or setting EXM_DEBUG=1 in
>>>> exm-settings.sh depending on how you built it).
>>>>
>>>>  - Tim
>>>>
>>>
>>>
>>
>
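
Illustration of the struct-size hypothesis raised earlier in the thread: if
two pieces of code lay out the same fields under different alignment rules,
they disagree on how many bytes a message occupies, and a receiver expecting
one size cannot accept the other. The sketch below uses Python's struct
module purely as an analogy. The three-field header is hypothetical and is
not one of ADLB's actual structs, and the helper names are illustrative.

```python
import struct

# Hypothetical header (an assumption for illustration, not an ADLB struct):
# 1-byte tag, 4-byte length, 8-byte id.
FIELDS = "biq"


def layout_sizes(fields=FIELDS):
    """Return (padded, packed) byte sizes for the same field list."""
    padded = struct.calcsize("@" + fields)  # native sizes and alignment padding
    packed = struct.calcsize("=" + fields)  # standard sizes, no padding
    return padded, packed


def try_receive(buf, fields=FIELDS):
    """Unpack with the padded layout; return None when the buffer size does not match."""
    try:
        return struct.unpack("@" + fields, buf)
    except struct.error:
        return None


if __name__ == "__main__":
    padded, packed = layout_sizes()
    print(f"padded layout: {padded} bytes, packed layout: {packed} bytes")
    msg = struct.pack("=" + FIELDS, 4, 500, 14)  # "sender" uses the packed layout
    print("receiver result:", try_receive(msg))  # None where the two sizes differ
```

In a C build the same disagreement could arise from compiling different
files with different packing or alignment options, which is why rebuilding
everything with one compiler and one set of flags is a reasonable first step.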