[codes-ross-users] Replay HPL's dumpi trace on CODES

Maxime Chevalier maxime.chevalier at inria.fr
Thu Jun 8 02:57:46 CDT 2017


Hi Misbah!
Thanks again for your time. What you observed is strange. Could it come from my DUMPI installation? I used MPICH3, and the documentation warns about it. Since I also generate the DUMPI traces myself, I compiled DUMPI as follows:

../configure --enable-libdumpi --enable-test --prefix=/home/chevamax/logiciels CC=mpicc CXX=mpiCC CFLAGS="-DMPICH_SUPPRESS_PROTOTYPES=1 -DHAVE_PRAGMA_HP_SEC_DEF=1 -pthread -I/home/chevamax/logiciels/include/open-trace-format -L/home/chevamax/logiciels/lib" 

I am trying to use dumpi2otf (I know it is unfinished) to visualize DUMPI traces, since I found no graphical tool; that is why I link against libotf. Perhaps the CFLAGS options break something, but I doubt the problem comes from there, since I generate AMG traces and replay them without errors.

When I run HPL, it runs smoothly and terminates correctly (the final tests pass, with no warning or error messages), so I don't think the problem comes from HPL terminating early.

Maxime 

----- Original message -----

> From: "Misbah Mubarak" <mmubarak at anl.gov>
> To: "maxime chevalier" <maxime.chevalier at inria.fr>
> Cc: codes-ross-users at lists.mcs.anl.gov
> Sent: Wednesday, June 7, 2017 22:42:14
> Subject: Re: [codes-ross-users] Replay HPL's dumpi trace on CODES

> Hi Maxime,

> I ran the HPL traces built without MPI data types through the simulation,
> and here are some observations. I disabled all synchronization (waits,
> wait-alls) in the simulation so that it only matches MPI sends with
> receives and does nothing else.

> - Rank 0 expects 192 messages from Rank 1 but instead receives 192
> messages from Rank 2.
> - Rank 1 receives 192 messages from Rank 0, but no corresponding receives
> are posted, so the messages remain unmatched.
> - Rank 2 expects 192 messages from Rank 0, but they don't arrive
> (probably because they arrived at Rank 1).
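
The send/receive matching described in these observations can be illustrated with a minimal Python sketch. It is not the CODES replay code: the function name `match_messages` and the (source, destination) representation are assumptions made for illustration, and real MPI matching also takes tags, communicators, wildcard receives and message order into account.

```python
from collections import Counter


def match_messages(sends, recvs):
    """Pair each send with a posted receive for the same (source, destination).

    sends: iterable of (src, dst) pairs, one per message sent.
    recvs: iterable of (src, dst) pairs, one per receive posted by dst
           that expects a message from src.
    Returns two Counters keyed by (src, dst): sends with no posted receive,
    and posted receives with no send.
    """
    sent = Counter(sends)
    posted = Counter(recvs)
    # Counter subtraction keeps only positive counts, i.e. the leftovers.
    return sent - posted, posted - sent


# Traffic as described in the observations (192 messages per pair).
sends = [(2, 0)] * 192 + [(0, 1)] * 192   # rank 2 -> rank 0, rank 0 -> rank 1
recvs = [(1, 0)] * 192 + [(0, 2)] * 192   # rank 0 expects rank 1, rank 2 expects rank 0

unmatched_sends, unmatched_recvs = match_messages(sends, recvs)
print("unmatched sends:", dict(unmatched_sends))
print("unmatched receives:", dict(unmatched_recvs))
```

With this input nothing matches: every send goes to a rank that posted a receive for a different source, which is the pattern reported for ranks 0, 1 and 2.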

> Is it possible that having no MPI data type resulted in missing messages that
> introduced these discrepancies? Or maybe the application is terminating
> earlier than usual?

> I will try the version with MPI data types and let you know if the results
> are different.

> Thanks,
> Misbah
> From: < codes-ross-users-bounces at lists.mcs.anl.gov > on behalf of Maxime
> Chevalier < maxime.chevalier at inria.fr >
> Date: Sunday, June 4, 2017 at 12:20 PM
> To: " codes-ross-users at lists.mcs.anl.gov " <
> codes-ross-users at lists.mcs.anl.gov >
> Subject: Re: [codes-ross-users] Replay HPL's dumpi trace on CODES

> Hi Misbah,
> Thanks for your help. You can find the DUMPI traces, with and without the
> "UNDEFINED DATA TYPE" errors, via the link below. The codes-workload-dump
> utility is very useful, thanks for that (I was using dumpistat).

> https://1drv.ms/f/s!Ati25f8zqy9lnNFi7EX8u1tmdJ4rfw

> Regards,
> Maxime
> ----- Original message -----

> > From: "Misbah Mubarak" < mmubarak at anl.gov >
> > To: "Maxime Chevalier" < maxime.chevalier at inria.fr >,
> > codes-ross-users at lists.mcs.anl.gov
> > Sent: Friday, June 2, 2017 18:54:13
> > Subject: Re: [codes-ross-users] Replay HPL's dumpi trace on CODES

> > Hi Maxime,
> 

> > There is a codes-workload-dump utility that helps you inspect the traces
> > and provides detailed information on the individual MPI operations, such
> > as the number of bytes transmitted (which is derived from the data type
> > and count). If you could run the utility on one of the traces and send me
> > the output, I can have a look at what's going on. Alternatively, if you
> > could share the traces, I can have a look at those.
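
As a rough illustration of how a byte count follows from the data type and count, here is a minimal Python sketch. The size table, the function name `message_bytes` and the type name `HPL_DERIVED_TYPE` are hypothetical; the sizes are the usual ones on common 64-bit platforms, and this is not how CODES or DUMPI implement the lookup.

```python
# Hypothetical size table (bytes per element); typical values, not taken from CODES.
TYPE_SIZES = {
    "MPI_CHAR": 1,
    "MPI_INT": 4,
    "MPI_DOUBLE": 8,
}


def message_bytes(datatype, count):
    """Return count * sizeof(datatype), or None when the type is unknown."""
    size = TYPE_SIZES.get(datatype)
    if size is None:
        # An unrecognized (for example user-defined) type has no known size,
        # so the number of bytes transmitted cannot be derived.
        return None
    return size * count


print(message_bytes("MPI_DOUBLE", 192))
print(message_bytes("HPL_DERIVED_TYPE", 192))  # hypothetical derived type name
```

The second call returns `None`, which is the kind of situation an "Undefined data type" message would point to: without a known element size, the operation's payload cannot be computed.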
> 

> > Using the utility is simple, here is some documentation on how to run it:
> 

> > https://xgitlab.cels.anl.gov/codes/codes/wikis/codes-dumpi-workload
> 

> > Thanks,
> 
> > Misbah
> 
> > From: < codes-ross-users-bounces at lists.mcs.anl.gov > on behalf of Maxime
> > Chevalier < maxime.chevalier at inria.fr >
> 
> > Date: Friday, June 2, 2017 at 8:52 AM
> 
> > To: " codes-ross-users at lists.mcs.anl.gov " <
> > codes-ross-users at lists.mcs.anl.gov >
> 
> > Subject: Re: [codes-ross-users] Replay HPL's dumpi trace on CODES
> 

> > Hi Misbah,
> 
> > Thanks for your fast response. I looked into the data types, but I don't
> > really understand the problem. I have figured out how to avoid the
> > "UNDEFINED DATA TYPE" errors by compiling HPL with "HPL_NO_MPI_DATATYPE",
> > but the output is much the same (see trace below). I don't know whether
> > that is a step forward or backward...
> 

> > Regards,
> 
> > Maxime
> 

> > Trace :
> 

> > Fri Jun 2 09:15:49 2017
> 

> > ROSS Revision: 4c6a7d8eb9c784797d900edfc76725d62ec25941
> 

> > tw_net_start: Found world size to be 1
> 

> > ROSS Core Configuration:
> 
> > Total Nodes 1
> 
> > Total Processors [Nodes (1) x PE_per_Node (1)] 1
> 
> > Total KPs [Nodes (1) x KPs (16)] 16
> 
> > Total LPs 54
> 
> > Simulation End Time 300000000000.00
> 
> > LP-to-PE Mapping model defined
> 

> > ROSS Event Memory Allocation:
> 
> > Model events 13825
> 
> > Network events 50000
> 
> > Total events 63824
> 

> > *** START SEQUENTIAL SIMULATION ***
> 

> > *** END SIMULATION ***
> 

> > LP 1 unmatched irecvs 1 unmatched sends 0 Total sends 0 receives 2
> > collectives 0 delays 8 wait alls 0 waits 0 send time 0.000000 wait 0.000000
> 
> > LP 3 unmatched irecvs 1 unmatched sends 0 Total sends 1 receives 1
> > collectives 0 delays 10 wait alls 0 waits 0 send time 3.202149 wait
> > 0.000000
> 
> > LP 5 unmatched irecvs 0 unmatched sends 0 Total sends 0 receives 1
> > collectives 0 delays 7 wait alls 0 waits 0 send time 0.000000 wait 0.000000
> 
> > LP 7 unmatched irecvs 1 unmatched sends 0 Total sends 1 receives 1
> > collectives 0 delays 10 wait alls 0 waits 0 send time 3.189207 wait
> > 0.000000
> 
> > : Running Time = 0.0001 seconds
> 

> > TW Library Statistics:
> 
> > Total Events Processed 56
> 
> > Events Aborted (part of RBs) 0
> 
> > Events Rolled Back 0
> 
> > Event Ties Detected in PE Queues 0
> 
> > Efficiency 100.00 %
> 
> > Total Remote (shared mem) Events Processed 0
> 
> > Percent Remote Events 0.00 %
> 
> > Total Remote (network) Events Processed 0
> 
> > Percent Remote Events 0.00 %
> 

> > Total Roll Backs 0
> 
> > Primary Roll Backs 0
> 
> > Secondary Roll Backs 0
> 
> > Fossil Collect Attempts 0
> 
> > Total GVT Computations 0
> 

> > Net Events Processed 56
> 
> > Event Rate (events/sec) 823529.4
> 
> > Total Events Scheduled Past End Time 0
> 

> > TW Memory Statistics:
> 
> > Events Allocated 63825
> 
> > Memory Allocated 62573
> 
> > Memory Wasted 683
> 

> > TW Data Structure sizes in bytes (sizeof):
> 
> > PE struct 608
> 
> > KP struct 144
> 
> > LP struct 128
> 
> > LP Model struct 760
> 
> > LP RNGs 80
> 
> > Total LP 968
> 
> > Event struct 144
> 
> > Event struct with Model 928
> 

> > TW Clock Cycle Statistics (MAX values in secs at 1.0000 GHz):
> 
> > Priority Queue (enq/deq) 0.0000
> 
> > AVL Tree (insert/delete) 0.0000
> 
> > LZ4 (de)compression 0.0000
> 
> > Buddy system 0.0000
> 
> > Event Processing 0.0000
> 
> > Event Cancel 0.0000
> 
> > Event Abort 0.0000
> 

> > GVT 0.0000
> 
> > Fossil Collect 0.0000
> 
> > Primary Rollbacks 0.0000
> 
> > Network Read 0.0000
> 
> > Statistics Computation 0.0000
> 
> > Statistics Write 0.0000
> 
> > Total Time (Note: Using Running Time above for Speedup) 0.0002
> 

> > TW GVT Statistics: MPI AllReduce
> 
> > GVT Interval 16
> 
> > GVT Real Time Interval (cycles) 0
> 
> > GVT Real Time Interval (sec) 0.00000000
> 
> > Batch Size 16
> 

> > Forced GVT 0
> 
> > Total GVT Computations 0
> 
> > Total All Reduce Calls 0
> 
> > Average Reduction / GVT -nan
> 

> > Total bytes sent 8 recvd 20
> 
> > max runtime 0.000000 ns avg runtime 0.000000
> 
> > max comm time 0.000000 avg comm time -69573.000000
> 
> > max send time 3.202149 avg send time 1.597839
> 
> > max recv time 45682.609151 avg recv time 11420.652288
> 
> > max wait time 0.000000 avg wait time 0.000000
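
The per-LP summary lines in the trace above can be tallied with a short script. This is a minimal Python sketch that assumes the raw simulator output (without mail quoting) and the field layout seen in this trace; the helper name `parse_lp_summaries` is hypothetical and not part of CODES.

```python
import re

# Field layout copied from the "LP n unmatched irecvs ..." lines in the trace.
LP_LINE = re.compile(
    r"LP (?P<lp>\d+) unmatched irecvs (?P<unmatched_irecvs>\d+) "
    r"unmatched sends (?P<unmatched_sends>\d+) "
    r"Total sends (?P<sends>\d+) receives (?P<receives>\d+)"
)


def parse_lp_summaries(text):
    """Return one dict of integer counters per LP summary found in text."""
    # Collapse all whitespace so summaries wrapped over several lines still match.
    flat = " ".join(text.split())
    return [
        {key: int(value) for key, value in match.groupdict().items()}
        for match in LP_LINE.finditer(flat)
    ]


sample = """
LP 1 unmatched irecvs 1 unmatched sends 0 Total sends 0 receives 2
collectives 0 delays 8 wait alls 0 waits 0 send time 0.000000 wait 0.000000
LP 3 unmatched irecvs 1 unmatched sends 0 Total sends 1 receives 1
collectives 0 delays 10 wait alls 0 waits 0 send time 3.202149 wait 0.000000
LP 5 unmatched irecvs 0 unmatched sends 0 Total sends 0 receives 1
collectives 0 delays 7 wait alls 0 waits 0 send time 0.000000 wait 0.000000
LP 7 unmatched irecvs 1 unmatched sends 0 Total sends 1 receives 1
collectives 0 delays 10 wait alls 0 waits 0 send time 3.189207 wait 0.000000
"""

summaries = parse_lp_summaries(sample)
total_unmatched_irecvs = sum(s["unmatched_irecvs"] for s in summaries)
total_sends = sum(s["sends"] for s in summaries)
total_receives = sum(s["receives"] for s in summaries)
print(total_unmatched_irecvs, total_sends, total_receives)
```

Summing the counters makes the imbalance easy to see: the four LPs post more receives than there are sends, and three receives are left unmatched.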
> 
> > ----- Original message -----

> > > From: "Misbah Mubarak" < mmubarak at anl.gov >
> > > To: "Maxime Chevalier" < maxime.chevalier at inria.fr >,
> > > codes-ross-users at lists.mcs.anl.gov
> > > Sent: Tuesday, May 30, 2017 18:12:46
> > > Subject: Re: [codes-ross-users] Replay HPL's dumpi trace on CODES

> > > Hi Maxime,
> > 
> 

> > > Thanks for your message. There seems to be a data type that is not
> > > supported by either DUMPI or CODES. Do you know which data types the
> > > HPL trace uses? I will find out whether support for them can be added
> > > to the code.
> > 
> 

> > > Regards,
> > 
> 
> > > Misbah
> > 
> 
> > > From: < codes-ross-users-bounces at lists.mcs.anl.gov > on behalf of Maxime
> > > Chevalier < maxime.chevalier at inria.fr >
> > 
> 
> > > Date: Monday, May 29, 2017 at 3:51 AM
> > 
> 
> > > To: " codes-ross-users at lists.mcs.anl.gov " <
> > > codes-ross-users at lists.mcs.anl.gov >
> > 
> 
> > > Subject: [codes-ross-users] Replay HPL's dumpi trace on CODES
> > 
> 

> > > Hi,
> > 
> 
> > > I'm trying to replay an HPL DUMPI trace, generated on my computer, with
> > > CODES. Unfortunately, I get a lot of "Undefined data type" errors (see
> > > the trace below).

> > > I have already replayed AMG traces (downloaded here) as well as AMG
> > > traces I generated myself, and that worked fine.

> > > So I'm wondering whether I did something wrong or whether it's HPL's fault.
> > 
> 

> > > Best regards,
> > 
> 
> > > Maxime
> > 
> 

> > > Trace :
> > 
> 

> > > > ROSS Revision: 4c6a7d8eb9c784797d900edfc76725d62ec25941
> > > 
> > 
> 

> > > > tw_net_start: Found world size to be 1
> > > 
> > 
> 

> > > > ROSS Core Configuration:
> > > 
> > 
> 
> > > > Total Nodes 1
> > > 
> > 
> 
> > > > Total Processors [Nodes (1) x PE_per_Node (1)] 1
> > > 
> > 
> 
> > > > Total KPs [Nodes (1) x KPs (16)] 16
> > > 
> > 
> 
> > > > Total LPs 5
> > > 
> > 
> 
> > > > Simulation End Time 300000000000.00
> > > 
> > 
> 
> > > > LP-to-PE Mapping model defined
> > > 
> > 
> 

> > > > ROSS Event Memory Allocation:
> > > 
> > 
> 
> > > > Model events 1281
> > > 
> > 
> 
> > > > Network events 50000
> > > 
> > 
> 
> > > > Total events 51280
> > > 
> > 
> 

> > > > Undefined data type
> > > > [the line above appears 106 times in total]
> > > > *** START SEQUENTIAL SIMULATION ***
> > > 
> > 
> 

> > > > *** END SIMULATION ***
> > > 
> > 
> 

> > > > LP 1 unmatched irecvs 1 unmatched sends 0 Total sends 0 receives 1
> > > > collectives 0 delays 7 wait alls 0 waits 0 send time 0.000000 wait
> > > > 0.000000
> > > 
> > 
> 
> > > > : Running Time = 0.0000 seconds
> > > 
> > 
> 

> > > > TW Library Statistics:
> > > 
> > 
> 
> > > > Total Events Processed 8
> > > 
> > 
> 
> > > > Events Aborted (part of RBs) 0
> > > 
> > 
> 
> > > > Events Rolled Back 0
> > > 
> > 
> 
> > > > Event Ties Detected in PE Queues 0
> > > 
> > 
> 
> > > > Efficiency 100.00 %
> > > 
> > 
> 
> > > > Total Remote (shared mem) Events Processed 0
> > > 
> > 
> 
> > > > Percent Remote Events 0.00 %
> > > 
> > 
> 
> > > > Total Remote (network) Events Processed 0
> > > 
> > 
> 
> > > > Percent Remote Events 0.00 %
> > > 
> > 
> 

> > > > Total Roll Backs 0
> > > 
> > 
> 
> > > > Primary Roll Backs 0
> > > 
> > 
> 
> > > > Secondary Roll Backs 0
> > > 
> > 
> 
> > > > Fossil Collect Attempts 0
> > > 
> > 
> 
> > > > Total GVT Computations 0
> > > 
> > 
> 

> > > > Net Events Processed 8
> > > 
> > 
> 
> > > > Event Rate (events/sec) 307692.3
> > > 
> > 
> 
> > > > Total Events Scheduled Past End Time 0
> > > 
> > 
> 

> > > > TW Memory Statistics:
> > > 
> > 
> 
> > > > Events Allocated 51281
> > > 
> > 
> 
> > > > Memory Allocated 51168
> > > 
> > 
> 
> > > > Memory Wasted 720
> > > 
> > 
> 

> > > > TW Data Structure sizes in bytes (sizeof):
> > > 
> > 
> 
> > > > PE struct 608
> > > 
> > 
> 
> > > > KP struct 144
> > > 
> > 
> 
> > > > LP struct 128
> > > 
> > 
> 
> > > > LP Model struct 760
> > > 
> > 
> 
> > > > LP RNGs 80
> > > 
> > 
> 
> > > > Total LP 968
> > > 
> > 
> 
> > > > Event struct 144
> > > 
> > 
> 
> > > > Event struct with Model 928
> > > 
> > 
> 

> > > > TW Clock Cycle Statistics (MAX values in secs at 1.0000 GHz):
> > > 
> > 
> 
> > > > Priority Queue (enq/deq) 0.0000
> > > 
> > 
> 
> > > > AVL Tree (insert/delete) 0.0000
> > > 
> > 
> 
> > > > LZ4 (de)compression 0.0000
> > > 
> > 
> 
> > > > Buddy system 0.0000
> > > 
> > 
> 
> > > > Event Processing 0.0000
> > > 
> > 
> 
> > > > Event Cancel 0.0000
> > > 
> > 
> 
> > > > Event Abort 0.0000
> > > 
> > 
> 

> > > > GVT 0.0000
> > > 
> > 
> 
> > > > Fossil Collect 0.0000
> > > 
> > 
> 
> > > > Primary Rollbacks 0.0000
> > > 
> > 
> 
> > > > Network Read 0.0000
> > > 
> > 
> 
> > > > Statistics Computation 0.0000
> > > 
> > 
> 
> > > > Statistics Write 0.0000
> > > 
> > 
> 
> > > > Total Time (Note: Using Running Time above for Speedup) 0.0001
> > > 
> > 
> 

> > > > TW GVT Statistics: MPI AllReduce
> > > 
> > 
> 
> > > > GVT Interval 16
> > > 
> > 
> 
> > > > GVT Real Time Interval (cycles) 0
> > > 
> > 
> 
> > > > GVT Real Time Interval (sec) 0.00000000
> > > 
> > 
> 
> > > > Batch Size 16
> > > 
> > 
> 

> > > > Forced GVT 0
> > > 
> > 
> 
> > > > Total GVT Computations 0
> > > 
> > 
> 
> > > > Total All Reduce Calls 0
> > > 
> > 
> 
> > > > Average Reduction / GVT -nan
> > > 
> > 
> 

> > > > Total bytes sent 0 recvd 4
> > > 
> > 
> 
> > > > max runtime 0.000000 ns avg runtime 0.000000
> > > 
> > 
> 
> > > > max comm time 0.000000 avg comm time -66232.000000
> > > 
> > 
> 
> > > > max send time 0.000000 avg send time 0.000000
> > > 
> > 
> 
> > > > max recv time 0.000000 avg recv time 0.000000
> > > 
> > 
> 
> > > > max wait time 0.000000 avg wait time 0.000000
> > > 
> > 
> 
> > > > LP-IO: writing output to hpl-trace-25282-1495543803/
> > > 
> > 
> 
> > > > LP-IO: data files:
> > > 
> > 
> 
> > > > hpl-trace-25282-1495543803/mpi-replay-stats
> > > 
> > 
> 
> > > > hpl-trace-25282-1495543803/model-net-category-all
> > > 
> > 
>