[ExM Users] mkstatic questions
Ketan Maheshwari
ketan at mcs.anl.gov
Fri May 30 14:38:52 CDT 2014
Discussed this with Justin and rebuilt Turbine, ADLB, et al. with bgxlcxx.
Now getting the following ADLB-related error:
$ cat 272932.output
0.001 MODE: ENGINE
0.001 MODE: SERVER
0.001 MODE: WORKER
0.001 MODE: WORKER
0.002 ENGINES: 1 RANKS: 0 - 0
0.003 WORKERS: 2 RANKS: 1 - 2
0.003 SERVERS: 1 RANKS: 3 - 3
0.008 function:swift:constants
0.009 allocated string: c:s_500=<1>
0.010 store: <1>="500"
0.011 allocated string: c:s_database=<2>
0.011 store: <2>="-database"
0.012 allocated string: c:s_ex1=<3>
0.012 store: <3>="-ex1"
0.013 allocated string: c:s_ex2aro=<4>
0.013 store: <4>="-ex2aro"
0.014 allocated string: c:s_hlac97DFAM=<5>
0.014 store: <5>="hlac-97-D-FAMPNAQTA_complex_0001_swift.sc"
0.015 allocated string: c:s_homevsachd=<6>
0.015 store: <6>="/home/vsachde/ROSETTA/new-benchmark/minirosetta_database/"
0.016 allocated string: c:s_nstruct=<7>
0.016 store: <7>="-nstruct"
0.017 allocated string: c:s_overwrite=<8>
0.017 store: <8>="-overwrite"
0.018 allocated string: c:s_pep_refine=<9>
0.018 store: <9>="-pep_refine"
0.019 allocated string: c:s_projectsEx=<10>
0.019 store: <10>="/projects/ExM/hlac-97-D/hlac-97-D-FAMPNAQTA_complex_0001.pdb"
0.020 allocated string: c:s_s=<11>
0.020 store: <11>="-s"
0.020 allocated string: c:s_scorefile=<12>
0.021 store: <12>="-scorefile"
0.021 allocated string: c:s_use_input_=<13>
0.022 store: <13>="-use_input_sc"
0.023 enter function: main
ADLB_DATA_CHECK FAILED: src/handlers.c:827
0.023 allocated t:0=<14> t:1=<1074790400>
ADLB_CHECK FAILED: src/server.c:xlb_handle_pending():330
0.024 array_kv_build: <1074790400> 13 elems, write_decr 1
ADLB_CHECK FAILED: src/server.c:serve_several():261
ADLB_CHECK FAILED: src/server.c:ADLB_Server():218
CAUGHT ERROR:
error: adlb::server: SERVER FAILED
invoked from within
"adlb::server "
(procedure "enter_mode_unchecked" line 5)
invoked from within
"enter_mode_unchecked $rules $engine_startup"
(procedure "enter_mode" line 10)
invoked from within
"enter_mode $rules $engine_startup "
CALLING adlb::abort
ADLB_Abort(1)
MPI_Abort(1)
Any ideas?
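
For anyone triaging a similar run, the failed-check lines in the output above can be pulled out of the interleaved log with a short script. This is only a sketch: the regular expression is inferred from the six lines shown in this one log, not from the ADLB sources, and the helper name failed_checks is made up for illustration.

```python
import re

# Pattern inferred from this log only (an assumption, not an ADLB spec).
# Matches e.g. "ADLB_CHECK FAILED: src/server.c:serve_several():261"
# and "ADLB_DATA_CHECK FAILED: src/handlers.c:827" (no function name).
CHECK_RE = re.compile(
    r"^(ADLB(?:_DATA)?_CHECK) FAILED: ([^:\s]+):(?:(\w+)\(\):)?(\d+)$"
)

def failed_checks(lines):
    """Return (check, file, function or None, line number) per failed check."""
    found = []
    for line in lines:
        m = CHECK_RE.match(line.strip())
        if m:
            check, path, func, lineno = m.groups()
            found.append((check, path, func, int(lineno)))
    return found

# Excerpt copied from the output above.
sample = """\
ADLB_DATA_CHECK FAILED: src/handlers.c:827
0.023 allocated t:0=<14> t:1=<1074790400>
ADLB_CHECK FAILED: src/server.c:xlb_handle_pending():330
0.024 array_kv_build: <1074790400> 13 elems, write_decr 1
ADLB_CHECK FAILED: src/server.c:serve_several():261
ADLB_CHECK FAILED: src/server.c:ADLB_Server():218
""".splitlines()

for check, path, func, lineno in failed_checks(sample):
    print(f"{path}:{lineno} {func or '-'} ({check})")
```

The first entry (src/handlers.c:827) appears to be where the failure starts; the later entries look like the same error propagating up through the server loop.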
On Fri, May 30, 2014 at 1:18 PM, Maheshwari, Ketan C. <ketan at mcs.anl.gov>
wrote:
> Thanks! To narrow this down and rule out application-ADLB MPI issues, I am
> now rebuilding the application with bgxlc++ instead of bgmpixlcxx, with
> which it was originally built. Will keep you posted on how things work out.
>
>
> On Fri, May 30, 2014 at 12:49 PM, Tim Armstrong
> <tim.g.armstrong at gmail.com> wrote:
>
>> I see. Based on the two lines of output you sent me, the problem is
>> something to do with message sizes on MPI. Assuming your app isn't using
>> MPI internally, it's probably some communication that ADLB is doing. The
>> error message would generally be caused by a mismatch of message size
>> between sender and receiver. The most likely explanation in the ADLB
>> codebase is that the sender and receiver somehow disagree on sizes of
>> structs, which doesn't make a whole lot of sense unless something strange
>> was done during the build process, e.g. one file was compiled with
>> different compiler settings, or you somehow linked to different versions of
>> the function.
>>
>> It's possible that it's a bug in the ADLB codebase that's nothing to do
>> with how it was built, but it seems unlikely that something like that would
>> have escaped all the tests. It might help to look at the Tcl code or Swift
>> that's being run, as well as to make sure that it runs correctly in a
>> different environment.
>>
>> It would also be helpful to have a full log of the program output with
>> debug logging enabled, since that will tell me what ADLB was doing at the
>> time.
>>
>> I'm not sure if I can help with debugging the problem without more info.
>>
>> - Tim
>>
>>
>> On Fri, May 30, 2014 at 11:33 AM, Ketan Maheshwari <ketan at mcs.anl.gov>
>> wrote:
>>
>>> I rebuilt the application recently without MPI. It seems to be working
>>> outside of Swift on Cetus compute nodes.
>>>
>>>
>>> On Fri, May 30, 2014 at 11:18 AM, Tim Armstrong <
>>> tim.g.armstrong at gmail.com> wrote:
>>>
>>>> Regarding the MPI error - that seems strange. There are multiple
>>>> places in the code that it might be.
>>>>
>>>> One possible cause is if something funny happened in compiling/linking
>>>> - e.g. multiple compilers or versions of things linked together. Have you
>>>> tried running the code locally?
>>>>
>>>> I'm a little perplexed because MPI tag 4 shouldn't be used in your
>>>> application - the message type (Iget) is only really used for
>>>> gemtc/coasters applications. It would be helpful to debug further if I
>>>> could get a log from the run with ADLB debugging enabled at compile time
>>>> (--enable-log-debug for the ADLB configure stage, or setting EXM_DEBUG=1 in
>>>> exm-settings.sh depending on how you built it).
>>>>
>>>> - Tim
>>>>
>>>
>>>
>>
>
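
Tim's point in the quoted thread about sender and receiver disagreeing on struct sizes can be illustrated outside of MPI. The sketch below is an analogy only: it uses Python's struct module rather than ADLB or MPI, and the two-field header (a 1-byte tag followed by a 64-bit length) is hypothetical, not an ADLB structure. It shows the same fields occupying different numbers of bytes under packed and natively aligned layouts, which is the kind of disagreement that mixed compiler settings could produce.

```python
import struct

# Hypothetical header (not an ADLB type): 1-byte tag + 64-bit length.
PACKED = "=Bq"   # standard sizes, no padding between fields
NATIVE = "@Bq"   # native sizes and alignment; padding may follow the tag

packed_size = struct.calcsize(PACKED)
native_size = struct.calcsize(NATIVE)

def try_receive(payload, layout):
    """Decode payload using the receiver's layout; report a size mismatch."""
    try:
        return struct.unpack(layout, payload)
    except struct.error as exc:
        return f"size mismatch: {exc}"

# The "sender" builds its message with the packed layout.
message = struct.pack(PACKED, 1, 13)

print(packed_size, native_size)        # sizes differ on typical platforms
print(try_receive(message, PACKED))    # receiver agrees with sender
print(try_receive(message, NATIVE))    # receiver expects a larger buffer
```

When both sides use the same layout the message decodes cleanly; when the receiver assumes the padded layout, the byte count no longer matches. That is roughly what a message size mismatch between sender and receiver looks like at the MPI level.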