[ExM Users] mkstatic questions

Tim Armstrong tim.g.armstrong at gmail.com
Fri May 30 15:18:18 CDT 2014


Hmm ok.  Something is seriously messed up with that build.  It doesn't
appear to be doing anything complicated.  I just noticed in the log that it
allocated data ID <1074790400>, which it shouldn't do, since they start
low.  Seems like it's probably reading uninitialized memory, which might be
related to the previous problem.
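
For what it's worth, here is a quick sketch of how to spot that kind of ID in
a log. It is Python, not part of ADLB, and the helper name `suspicious_ids`
and its threshold are made up for illustration. It pulls the IDs out of a line
like the "allocated" line in the quoted log and prints them in hex:

```python
import re

# Matches data IDs printed as <N> in the log lines quoted in this thread.
ID_PATTERN = re.compile(r"<(\d+)>")

def suspicious_ids(line, threshold=1_000_000):
    """Return the IDs in a log line that are larger than threshold.

    The threshold is an arbitrary assumption for illustration; the only
    thing stated in the thread is that IDs should start low.
    """
    return [int(m) for m in ID_PATTERN.findall(line) if int(m) > threshold]

line = "0.023 allocated t:0=<14> t:1=<1074790400>"
for data_id in suspicious_ids(line):
    print(data_id, hex(data_id))
```

The odd ID comes out as 0x40100000, which is a suspiciously round bit pattern
for a value that should come from a counter starting low.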

Justin, have you had any problems with that build of lb/turbine?

The source of the error appears to be in handle_multicreate or one of its
callees, but it's not really clear what would silently return an error in
there.

- Tim


On Fri, May 30, 2014 at 2:56 PM, Ketan Maheshwari <ketan at mcs.anl.gov> wrote:

> Not sure about the version. Justin built it, it is here on
> Cetus/Mira: /home/wozniak/Public/sfw/ppc64/lb/lib
>
>
> On Fri, May 30, 2014 at 2:53 PM, Tim Armstrong <tim.g.armstrong at gmail.com>
> wrote:
>
>>  What version of ADLB are you using, so I can correlate the source line
>> to the code?
>>
>>  - Tim
>>
>>
>> On Fri, May 30, 2014 at 2:38 PM, Ketan Maheshwari <ketan at mcs.anl.gov>
>> wrote:
>>
>>> Discussed this with Justin and rebuilt Turbine, adlb, et al. with
>>> bgxlcxx. Now getting the following adlb related error:
>>>
>>>  $ cat 272932.output
>>>    0.001 MODE: ENGINE
>>>    0.001 MODE: SERVER
>>>    0.001 MODE: WORKER
>>>    0.001 MODE: WORKER
>>>    0.002 ENGINES: 1 RANKS: 0 - 0
>>>    0.003 WORKERS: 2 RANKS: 1 - 2
>>>    0.003 SERVERS: 1 RANKS: 3 - 3
>>>    0.008 function:swift:constants
>>>    0.009 allocated string: c:s_500=<1>
>>>    0.010 store: <1>="500"
>>>    0.011 allocated string: c:s_database=<2>
>>>    0.011 store: <2>="-database"
>>>    0.012 allocated string: c:s_ex1=<3>
>>>    0.012 store: <3>="-ex1"
>>>    0.013 allocated string: c:s_ex2aro=<4>
>>>    0.013 store: <4>="-ex2aro"
>>>    0.014 allocated string: c:s_hlac97DFAM=<5>
>>>    0.014 store: <5>="hlac-97-D-FAMPNAQTA_complex_0001_swift.sc"
>>>    0.015 allocated string: c:s_homevsachd=<6>
>>>    0.015 store:
>>> <6>="/home/vsachde/ROSETTA/new-benchmark/minirosetta_database/"
>>>    0.016 allocated string: c:s_nstruct=<7>
>>>    0.016 store: <7>="-nstruct"
>>>    0.017 allocated string: c:s_overwrite=<8>
>>>    0.017 store: <8>="-overwrite"
>>>    0.018 allocated string: c:s_pep_refine=<9>
>>>    0.018 store: <9>="-pep_refine"
>>>    0.019 allocated string: c:s_projectsEx=<10>
>>>    0.019 store:
>>> <10>="/projects/ExM/hlac-97-D/hlac-97-D-FAMPNAQTA_complex_0001.pdb"
>>>    0.020 allocated string: c:s_s=<11>
>>>    0.020 store: <11>="-s"
>>>    0.020 allocated string: c:s_scorefile=<12>
>>>    0.021 store: <12>="-scorefile"
>>>    0.021 allocated string: c:s_use_input_=<13>
>>>    0.022 store: <13>="-use_input_sc"
>>>    0.023 enter function: main
>>> ADLB_DATA_CHECK FAILED: src/handlers.c:827
>>>    0.023 allocated t:0=<14> t:1=<1074790400>
>>> ADLB_CHECK FAILED: src/server.c:xlb_handle_pending():330
>>>    0.024 array_kv_build: <1074790400> 13 elems, write_decr 1
>>> ADLB_CHECK FAILED: src/server.c:serve_several():261
>>> ADLB_CHECK FAILED: src/server.c:ADLB_Server():218
>>> CAUGHT ERROR:
>>>
>>>  error: adlb::server: SERVER FAILED
>>>
>>>
>>>      invoked from within
>>> "adlb::server "
>>>     (procedure "enter_mode_unchecked" line 5)
>>>     invoked from within
>>> "enter_mode_unchecked $rules $engine_startup"
>>>     (procedure "enter_mode" line 10)
>>>     invoked from within
>>> "enter_mode $rules $engine_startup "
>>> CALLING adlb::abort
>>> ADLB_Abort(1)
>>> MPI_Abort(1)
>>>
>>>
>>>  Any ideas?
>>>
>>>
>>>  On Fri, May 30, 2014 at 1:18 PM, Maheshwari, Ketan C. <
>>> ketan at mcs.anl.gov> wrote:
>>>
>>>>   Thanks! To narrow down and eliminate application-ADLB MPI issues, I
>>>> am now trying to rebuild the application with bgxlc++ instead of
>>>> bgmpixlcxx, which it was originally built with. Will keep you posted on
>>>> how things work out.
>>>>
>>>>
>>>>  On Fri, May 30, 2014 at 12:49 PM, Tim Armstrong <
>>>> tim.g.armstrong at gmail.com> wrote:
>>>>
>>>>>   I see.  Based on the two lines of output you sent me, the problem
>>>>> is something to do with message sizes on MPI.  Assuming your app isn't
>>>>> using MPI internally, it's probably some communication that ADLB is doing.
>>>>> The error message would generally be caused by a mismatch of message size
>>>>> between sender and receiver.  The most likely explanation in the ADLB
>>>>> codebase is that the sender and receiver somehow disagree on sizes of
>>>>> structs, which doesn't make a whole lot of sense unless something strange
>>>>> was done during the build process, e.g. one file was compiled with
>>>>> different compiler settings, or you somehow linked to different versions of
>>>>> the function.
>>>>>
>>>>> It's possible that it's a bug in the ADLB codebase that has nothing to
>>>>> do with how it was built, but it seems unlikely that something like that
>>>>> would have escaped all the tests.  It might help to look at the Tcl code or
>>>>> Swift that's being run, as well as to make sure that it runs correctly in a
>>>>> different environment.
>>>>>
>>>>> It would also be helpful to have a full log of the program output with
>>>>> debug logging enabled, since that will tell me what ADLB was doing at the
>>>>> time.
>>>>>
>>>>> I'm not sure if I can help with debugging the problem without more
>>>>> info.
>>>>>
>>>>>  - Tim
>>>>>
>>>>>
>>>>> On Fri, May 30, 2014 at 11:33 AM, Ketan Maheshwari <ketan at mcs.anl.gov>
>>>>> wrote:
>>>>>
>>>>>> I rebuilt the application recently without MPI. It seems to be
>>>>>> working outside of Swift on Cetus compute nodes.
>>>>>>
>>>>>>
>>>>>> On Fri, May 30, 2014 at 11:18 AM, Tim Armstrong <
>>>>>> tim.g.armstrong at gmail.com> wrote:
>>>>>>
>>>>>>>  Regarding the MPI error - that seems strange.  There are multiple
>>>>>>> places in the code that it might be.
>>>>>>>
>>>>>>>  One possible cause is if something funny happened in
>>>>>>> compiling/linking - e.g. multiple compilers or versions of things linked
>>>>>>> together.  Have you tried running the code locally?
>>>>>>>
>>>>>>> I'm a little perplexed because MPI tag 4 shouldn't be used in your
>>>>>>> application - the message type (Iget) is only really used for
>>>>>>> gemtc/coasters applications.  It would be helpful to debug further if I
>>>>>>> could get a log from the run with ADLB debugging enabled at compile time
>>>>>>> (--enable-log-debug for the ADLB configure stage, or setting EXM_DEBUG=1 in
>>>>>>> exm-settings.sh depending on how you built it).
>>>>>>>
>>>>>>>  - Tim
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
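
To illustrate the struct-size theory from earlier in the thread, here is a
small sketch in Python. It is not ADLB code, and the char-plus-int header
layout is invented for the example. It shows how the same two fields can
occupy a different number of bytes depending on alignment settings. If the
sender were built with one layout and the receiver with the other, the
receiver would expect a message of the wrong size, which is the kind of
mismatch MPI reports.

```python
import struct

# Hypothetical message header: one char tag followed by one int length.
# "@" uses native alignment, so padding is inserted before the int.
# "=" uses native byte order with no padding, like a packed struct.
ALIGNED = struct.Struct("@ci")
PACKED = struct.Struct("=ci")

def sizes_agree(sender, receiver):
    """True if both sides expect the same number of bytes per message."""
    return sender.size == receiver.size

message = PACKED.pack(b"x", 500)    # sender built with the packed layout
print(len(message), ALIGNED.size)   # bytes sent vs. bytes the receiver expects
print(sizes_agree(PACKED, ALIGNED))
```

On a typical platform the packed header is 5 bytes while the aligned one is
larger (usually 8), so the two sides disagree even though the source code
declaring the struct is identical.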