[codes-ross-users] Straggler event?

Jian Peng jpeng10 at hawk.iit.edu
Thu Aug 9 17:39:45 CDT 2018


I tried the newer version, and the error still occurs. The MPI I'm using is
MPICH2 3.2. The command I'm using is:

mpirun -f ./hosts -n 33 ./bb_dragonfly_sim --extramem=100000 --nkp=128 --sync=3 --batch=1 --gvt-interval=32 -- /home/cc/share/sim_configs/darshan_config.conf

Also, another long-standing error sometimes pops up, which I think might be
related to the GVT computation:

node: 27: error: /home/cc/Project/NERSC/ROSS-master/core/gvt/mpi_allreduce.c:180: PE 27 GVT decreased 1.42151 -> 1.36322
node: 5: error: /home/cc/Project/NERSC/ROSS-master/core/gvt/mpi_allreduce.c:180: PE 5 GVT decreased 1.42151 -> 1.36322
node: 29: error: /home/cc/Project/NERSC/ROSS-master/core/gvt/mpi_allreduce.c:180: PE 29 GVT decreased 1.42151 -> 1.36322
node: 28: error: /home/cc/Project/NERSC/ROSS-master/core/gvt/mpi_allreduce.c:180: PE 28 GVT decreased 1.42151 -> 1.36322
node: 13: error: /home/cc/Project/NERSC/ROSS-master/core/gvt/mpi_allreduce.c:180: PE 13 GVT decreased 1.42151 -> 1.36322
.......

My current workaround for the latter issue is changing the "nkp"
parameter.
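
For anyone reading this thread later: both errors are violations of the same
kind of invariant, namely that GVT never moves backwards and that no event
arrives with a timestamp below the current GVT. The following is only a
minimal sketch of those two checks, not the ROSS implementation. The names
sim_time, is_straggler, and gvt_update are hypothetical and exist only for
illustration; the real checks live in network-mpi.c and mpi_allreduce.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a simulation timestamp; not a ROSS type. */
typedef double sim_time;

/* Mirrors the condition quoted in this thread: an incoming event is
 * flagged when its receive timestamp is below the current GVT. */
static bool is_straggler(sim_time recv_ts, sim_time gvt)
{
    return recv_ts < gvt;
}

/* Accepts a newly computed GVT only if it does not decrease.
 * Returns false and leaves *gvt unchanged when the candidate is lower. */
static bool gvt_update(sim_time *gvt, sim_time candidate)
{
    if (candidate < *gvt) {
        fprintf(stderr, "GVT decreased %g -> %g\n", *gvt, candidate);
        return false;
    }
    *gvt = candidate;
    return true;
}
```

With the values from the log above, gvt_update on a GVT of 1.42151 with a
candidate of 1.36322 reports the decrease and returns false. A candidate at
or above the current value is accepted.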

On Thu, Aug 9, 2018 at 12:49 PM, Caitlin Ross <rossc3 at rpi.edu> wrote:

> It’s saying that a PE received an event that has a time stamp less than
> the current GVT, which shouldn’t be possible.
>
> But your line number for the error in network-mpi.c is off from what it is
> currently in the master branch of ROSS. There have been some changes in the
> MPI layer of ROSS relatively recently (in May/June), so my first
> recommendation is to update your version of ROSS and see if you still get
> the error. If you do still get the error, could you also send some more
> details on the simulation run that causes this error?
>
> Thanks,
> Caitlin
>
> On Aug 9, 2018, at 2:20 PM, Jian Peng <jpeng10 at hawk.iit.edu> wrote:
>
> Hi All,
>
> I just ran into this error:
>
>  "error: network-mpi.c:388: 1:Received straggler from 7: 2938789.459012
> 3193751.109728 (0)"
>
> It is raised when the check
> if(e->recv_ts < me->GVT)
> in
> recv_finish(tw_pe *me, tw_event *e, char * buffer)
> evaluates to true.
>
> Any suggestions for fixing this issue? Thanks!
>
>