[codes-ross-users] Straggler event?

Caitlin Ross rossc3 at rpi.edu
Fri Aug 10 09:22:53 CDT 2018


Can you send me your darshan_config.conf file? Also, what is the bb_dragonfly_sim executable? I’m assuming it’s a model you wrote that uses the dragonfly network model. If so, how can I access that code?

Do you know if your model is deterministic at all (or close to it)? For some smaller run (for example, simulating less time and/or using a smaller network), run it with sync=1 and with sync=3. Does the number of Net Events reported at the end of the simulation match? We know that some of the models in CODES are non-deterministic in their net event counts. If a parallel run of your model is pretty far off from the sequential run in the number of net events, there’s probably a reverse computation issue that is somehow causing the errors you get in ROSS.
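
One way to make the sync=1 vs. sync=3 comparison less error-prone is to save the output of each run to a file and compare the counts with a small script. The following is only a minimal sketch: it assumes the end-of-run summary contains a line of the form "Net Events Processed <integer>", and the function names and log file names are hypothetical. Adjust the pattern if your ROSS version prints the statistic differently.

```python
import re
import sys

# Assumed format of the end-of-run summary line, e.g.
#   "Net Events Processed            123456"
NET_EVENTS_RE = re.compile(r"Net Events Processed\s+(\d+)")


def net_events(log_text):
    """Return the net event count found in saved simulator output, or None."""
    match = NET_EVENTS_RE.search(log_text)
    return int(match.group(1)) if match else None


def compare_runs(seq_text, opt_text):
    """Compare net event counts from a sync=1 run and a sync=3 run.

    Returns (sequential_count, optimistic_count, relative_difference).
    """
    seq = net_events(seq_text)
    opt = net_events(opt_text)
    if seq is None or opt is None:
        raise ValueError("could not find a 'Net Events Processed' line")
    if seq == 0:
        # Avoid dividing by zero when the sequential run processed nothing.
        rel = 0.0 if opt == 0 else float("inf")
    else:
        rel = abs(opt - seq) / seq
    return seq, opt, rel


# Hypothetical usage: python compare_net_events.py seq.log opt.log
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as f_seq, open(sys.argv[2]) as f_opt:
        seq, opt, rel = compare_runs(f_seq.read(), f_opt.read())
    print(f"sync=1: {seq}  sync=3: {opt}  relative difference: {rel:.4%}")
```

A small relative difference is consistent with the known non-determinism; a large one points toward a reverse computation problem in the model.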

Caitlin

> On Aug 9, 2018, at 5:39 PM, Jian Peng <jpeng10 at hawk.iit.edu> wrote:
> 
> I tried the newer version, but the error still exists. The MPI I'm using is MPICH2 3.2. The command I'm using is:
> 
> mpirun -f ./hosts -n 33 ./bb_dragonfly_sim --extramem=100000 --nkp=128  --sync=3 --batch=1 --gvt-interval=32  -- /home/cc/share/sim_configs/darshan_config.conf
> 
> Also, another long-standing error sometimes pops up, which I think might be related to GVT:
> 
> node: 27: error: /home/cc/Project/NERSC/ROSS-master/core/gvt/mpi_allreduce.c:180: PE 27 GVT decreased 1.42151 -> 1.36322
> node: 5: error: /home/cc/Project/NERSC/ROSS-master/core/gvt/mpi_allreduce.c:180: PE 5 GVT decreased 1.42151 -> 1.36322
> node: 29: error: /home/cc/Project/NERSC/ROSS-master/core/gvt/mpi_allreduce.c:180: PE 29 GVT decreased 1.42151 -> 1.36322
> node: 28: error: /home/cc/Project/NERSC/ROSS-master/core/gvt/mpi_allreduce.c:180: PE 28 GVT decreased 1.42151 -> 1.36322
> node: 13: error: /home/cc/Project/NERSC/ROSS-master/core/gvt/mpi_allreduce.c:180: PE 13 GVT decreased 1.42151 -> 1.36322
> .......
> 
> My current workaround for the latter issue is changing the "nkp" parameter.
> 
> On Thu, Aug 9, 2018 at 12:49 PM, Caitlin Ross <rossc3 at rpi.edu> wrote:
> It’s saying that a PE received an event that has a time stamp less than the current GVT, which shouldn’t be possible. 
> 
> But your line number for the error in network-mpi.c is off from what it currently is in the master branch of ROSS. There have been some changes in the MPI layer of ROSS relatively recently (in May/June), so my first recommendation is to update your version of ROSS and see if you still get the error. If you do still get the error, could you also send some more details on the simulation run that causes this error?
> 
> Thanks,
> Caitlin
> 
>> On Aug 9, 2018, at 2:20 PM, Jian Peng <jpeng10 at hawk.iit.edu> wrote:
>> 
>> Hi All, 
>> 
>> I just ran into an issue with this error:
>> 
>>  "error: network-mpi.c:388: 1:Received straggler from 7: 2938789.459012 3193751.109728 (0)", which is triggered by the check
>> if(e->recv_ts < me->GVT)
>> in
>> recv_finish(tw_pe *me, tw_event *e, char * buffer)
>> 
>> Any suggestions for fixing this issue? Thanks!
>> 
>> 
> 
> 
