DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks
DAMOV is a benchmark suite and a methodical framework targeting the study of data movement bottlenecks in modern applications. It is intended to study new architectures, such as near-data processing.
The DAMOV benchmark suite is the first open-source benchmark suite for main memory data movement-related studies, based on our systematic characterization methodology. This suite consists of 144 functions representing different sources of data movement bottlenecks and can be used as a baseline benchmark set for future data-movement mitigation research. The applications in the DAMOV benchmark suite belong to popular benchmark suites, including BWA, Chai, Darknet, GASE, Hardware Effects, Hashjoin, HPCC, HPCG, Ligra, PARSEC, Parboil, PolyBench, Phoenix, Rodinia, SPLASH-2, and STREAM.
The DAMOV framework is based on two widely-known simulators: ZSim and Ramulator. We consider a computing system that includes host CPU cores and PIM cores. The PIM cores are placed in the logic layer of a 3D-stacked memory (Ramulator's HMC model). With this simulation framework, we can simulate host CPU cores and general-purpose PIM cores to compare both for an application or parts of it.
Citation
Please cite the following preliminary version of our paper if you find this repository useful:
Geraldo F. Oliveira, Juan Gómez-Luna, Lois Orosa, Saugata Ghose, Nandita Vijaykumar, Ivan Fernandez, Mohammad Sadrosadati, Onur Mutlu, "DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks". arXiv:2105.03725 [cs.AR], 2021.
Bibtex entry for citation:
@misc{deoliveira2021damov,
title={{DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks}},
author={Geraldo F. Oliveira and Juan Gómez-Luna and Lois Orosa and Saugata Ghose and Nandita Vijaykumar and Ivan Fernandez and Mohammad Sadrosadati and Onur Mutlu},
year={2021},
eprint={2105.03725},
archivePrefix={arXiv},
primaryClass={cs.AR}
}
Setting up DAMOV
Repository Structure and Installation
Below, we describe the repository structure, pointing out some important folders and files.
.
+-- README.md
+-- get_workloads.sh
+-- simulator/
| +-- command_files/
| +-- ramulator/
| +-- ramulator-configs/
| +-- scripts/
| +-- src/
| +-- templates/
Step 0: Prerequisites
Our framework requires the dependencies of both ZSim and Ramulator.
- Ramulator requires a C++11 compiler (e.g., clang++, g++-5).
- ZSim requires gcc >= 4.6, pin, scons, libconfig, libhdf5, and libelfg0. We provide two scripts, setup.sh and compile.sh, under simulator/scripts to facilitate ZSim's installation. The first one installs all of ZSim's dependencies; the second one compiles ZSim.
- We use lrztar to compress files.
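For reference, a rough sketch of installing these dependencies manually on a Debian/Ubuntu system is shown below (setup.sh automates this; the package names are assumptions and may differ on your distribution, and Intel Pin must be downloaded separately from Intel's website):

# Assumed package names; setup.sh is the supported installation path.
sudo apt-get install g++ scons libconfig-dev libconfig++-dev libhdf5-dev libelfg0-dev lrzip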
Step 1: Installing the Simulator
To install the simulator:
cd simulator
sudo sh ./scripts/setup.sh
sh ./scripts/compile.sh
cd ../
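If the installation succeeds, the ZSim binary used by the run commands later in this README should exist. A quick sanity check (assuming you are back in the repository root after the cd ../ above):

# The binary path matches the ./build/opt/zsim commands used later in this README.
ls simulator/build/opt/zsim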
Step 2: Downloading the Workloads
To download the workloads:
sh get_workloads.sh
The get_workloads.sh script will download all workloads. The script stores the workloads under the ./workloads folder.
Please note that the workloads folder requires around 6 GB of storage.
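You can verify the downloaded size with a standard disk-usage command, e.g.:

du -sh ./workloads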
The ./workloads folder has the following structure:
.
+-- workloads/
| +-- Darknet/
| +-- GASE-master/
| +-- PolyBench-ACC/
| +-- STREAM/
| +-- bwa/
| +-- chai-cpu/
| +-- hardware-effects/
| +-- hpcc/
| +-- hpcg/
| +-- ligra/
| +-- multicore-hashjoins-0.1/
| +-- parboil/
| +-- parsec-3.0/
| +-- phoenix/
| +-- rodinia_3.1/
The DAMOV Benchmark Suite
The DAMOV benchmark suite consists of 144 functions spanning 74 different applications, which belong to 16 different widely-used benchmark suites or frameworks.
Each application is instrumented to delimit one or more functions of interest (i.e., memory-bound functions). We provide a set of scripts that set up each application in the benchmark suite.
Application's Dependencies
Please check each workload's README file for more information regarding its dependencies.
Application’s Compilation
To aid the compilation of the applications, we provide a helper script, called compile.py, inside each application's folder. The script (1) compiles the application, (2) decompresses its dataset, and (3) sets the expected file names as defined in the simulator's command files (see below).
For example, to compile the STREAM applications:
cd workloads/STREAM/
python compile.py
cd ../../
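If you prefer to build all workloads at once, a minimal shell sketch is shown below. This loop is not part of the repository; it assumes it is run from the repository root and simply skips any workload folder that does not ship a compile.py script:

# Compile every workload that provides a compile.py script.
for dir in workloads/*/; do
    if [ -f "$dir/compile.py" ]; then
        (cd "$dir" && python compile.py)
    fi
done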
DAMOV-SIM: The DAMOV Simulation Framework
We build a framework that integrates the ZSim CPU simulator with the Ramulator memory simulator to produce a fast, scalable, and cycle-accurate open-source simulator called DAMOV-SIM. We use ZSim to simulate the core microarchitecture, cache hierarchy, coherence protocol, and prefetchers. We use Ramulator to simulate the DRAM architecture, memory controllers, and memory accesses. To compute spatial and temporal locality, we modify ZSim to generate a single-thread memory trace for each application, which we use as input for the locality analysis algorithm.
(1) Simulator Configuration
Host and PIM Core Format
ZSim can simulate three types of cores, which are used for both the host and PIM systems:
- OOO: An out-of-order core.
- Timing: A simple 1-issue in-order-like core.
- Accelerator: A dataflow accelerator model. At every clock cycle, the model issues all independent arithmetic instructions in the dataflow graph of a given basic block.
ZSim Configuration Files
The user can configure the core model, number of cores, and cache hierarchy structure by creating configuration files. The configuration file will be used as input to ZSim when launching a new simulation.
We provide sample template files under simulator/templates for different host and PIM systems. These template files are:
- template_host_nuca_1_core.cfg: Defines a host system with a single OOO core, private L1/L2 caches, and a shared NUCA L3 cache.
- template_host_nuca.cfg: Defines a host system with multiple OOO cores, private L1/L2 caches, and a shared NUCA L3 cache.
- template_host_nuca_1_core_inorder.cfg: Defines a host system with a single Timing core, private L1/L2 caches, and a shared NUCA L3 cache.
- template_host_nuca_inorder.cfg: Defines a host system with multiple Timing cores, private L1/L2 caches, and a shared NUCA L3 cache.
- template_host_accelerator.cfg: Defines a host system with multiple Accelerator cores, private L1/L2 caches, and a shared L3 cache of fixed size.
- template_host_inorder.cfg: Defines a host system with multiple Timing cores, private L1/L2 caches, and a shared L3 cache of fixed size.
- template_host_ooo.cfg: Defines a host system with multiple OOO cores, private L1/L2 caches, and a shared L3 cache of fixed size.
- template_host_prefetch_accelerator.cfg: Defines a host system with multiple Accelerator cores, private L1/L2 caches, an L2 prefetcher, and a shared L3 cache of fixed size.
- template_host_prefetch_inorder.cfg: Defines a host system with multiple Timing cores, private L1/L2 caches, an L2 prefetcher, and a shared L3 cache of fixed size.
- template_host_prefetch_ooo.cfg: Defines a host system with multiple OOO cores, private L1/L2 caches, an L2 prefetcher, and a shared L3 cache of fixed size.
- template_pim_accelerator.cfg: Defines a PIM system with multiple Accelerator cores and private L1 caches.
- template_pim_inorder.cfg: Defines a PIM system with multiple Timing cores and private L1 caches.
- template_pim_ooo.cfg: Defines a PIM system with multiple OOO cores and private L1 caches.
Generating ZSim Configuration Files
The script under simulator/scripts/generate_config_files.py can automatically generate configuration files for a given command file. Command files are used to specify the path to the application binary of interest and its input commands. A list of command files for the workloads under workloads/ can be found at simulator/command_files. To automatically generate configuration files for a given benchmark (STREAM in the example below), one can execute the following command:
python scripts/generate_config_files.py command_files/stream_cf
The script uses the template files available under simulator/templates/ to generate the appropriate configuration files. The user needs to modify the script to point to the path of the workloads folder (i.e., the PIM_ROOT flag) and the path of the simulator folder (i.e., the ROOT flag). You can also modify the script to generate configuration files for different core models by changing the core type when calling the create_*_configs() function.
The script stores the generated configuration files under simulator/config_files.
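Likewise, to generate configuration files for every provided command file at once, a minimal sketch is shown below (not part of the repository; it assumes it is run from the simulator/ folder and that every file under command_files/ is a valid command file):

# Generate configuration files for all command files.
for cf in command_files/*; do
    python scripts/generate_config_files.py "$cf"
done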
(2) Running an Application from DAMOV
We illustrate how to run an application from our benchmark suite using the STREAM Add application as an example. To execute a host simulation of the STREAM Add application, running on a system with four OOO cores:
./build/opt/zsim config_files/host_ooo/no_prefetch/stream/4/Add_Add.cfg
The output of the simulation will be stored under zsim_stats/host_ooo/no_prefetch/4/stream_Add_Add.*.
To execute a PIM simulation of the STREAM Add application, running on a system with four OOO cores:
./build/opt/zsim config_files/pim_ooo/stream/4/Add_Add.cfg
The output of the simulation will be stored under zsim_stats/pim_ooo/4/stream_Add_Add.*.
The script under simulator/scripts/get_stats_per_app.py can parse some useful statistics from a simulation.
For example, the user can collect the IPC of the host simulation of the STREAM Add application, running on a system with four OOO cores, by executing:
python scripts/get_stats_per_app.py zsim_stats/host_ooo/no_prefetch/4/stream_Add_Add.zsim.out
Output:
------------------ Summary ------------------------
Instructions: 1000002055
Cycles: 450355583
IPC: 2.22047220629
L3 Miss Rate (%): 99.9991935477
L2 Miss Rate (%): 100.0
L1 Miss Rate (%): 73.563163442
L3 MPKI: 23.4357438395
LFMR: 0.999992703522
Likewise, the user can collect the IPC of the PIM simulation of the STREAM Add application, running on a system with four OOO cores, by executing:
python scripts/get_stats_per_app.py zsim_stats/pim_ooo/4/stream_Add_Add.zsim.out
Output:
------------------ Summary ------------------------
Instructions: 1000009100
Cycles: 284225084
IPC: 3.5183703209
L3 Miss Rate (%): 0.0
L2 Miss Rate (%): 0.0
L1 Miss Rate (%): 73.563253602
L3 MPKI: 0.0
LFMR: 0.0
Thus, the speedup that the PIM system provides over the host system for this particular application is 3.5183703209 / 2.22047220629 = 1.58451446.
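The same speedup can also be computed directly from the two stats files. The following is a minimal shell sketch (not part of the repository, run from the simulator/ folder like the commands above) that assumes get_stats_per_app.py prints an "IPC:" line exactly as in the example outputs above and that bc is installed:

# Extract the IPC values from the host and PIM stats files.
host_ipc=$(python scripts/get_stats_per_app.py zsim_stats/host_ooo/no_prefetch/4/stream_Add_Add.zsim.out | awk '/^IPC/ {print $2}')
pim_ipc=$(python scripts/get_stats_per_app.py zsim_stats/pim_ooo/4/stream_Add_Add.zsim.out | awk '/^IPC/ {print $2}')
# Speedup of the PIM system over the host system.
echo "scale=4; $pim_ipc / $host_ipc" | bc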
Please note that the simulation framework does not currently support concurrent execution on host and PIM cores.
(3) Instrumenting and Simulating New Applications
There are three steps to run a simulation with ZSim:
- Instrument the code with the hooks provided in workloads/zsim_hooks.h.
- Create configuration files for ZSim.
- Run.
Next, we describe the three steps in detail:
- First, we identify the application's hotspot. We refer to it as the offload region, i.e., the region of code that will run on the PIM cores. We instrument the application by including the following code:
#include "zsim_hooks.h"
foo(){
/*
* zsim_roi_begin() marks the beginning of the region of interest (ROI).
* It must be included in a serial part of the code.
*/
zsim_roi_begin();
zsim_PIM_function_begin(); // Indicates the beginning of the code to simulate (hotspot).
...
zsim_PIM_function_end(); // Indicates the end of the code to simulate.
/*
* zsim_roi_end() marks the end of the ROI.
* It must be included in a serial part of the code.
*/
zsim_roi_end();
}
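After adding the hooks, the application must be recompiled so that they take effect. As a purely hypothetical example (my_app.cpp is a placeholder for your instrumented source file, and the -I path assumes zsim_hooks.h is included directly from the workloads/ folder):

# Hypothetical compile command for an instrumented application.
g++ -O3 -I workloads/ my_app.cpp -o my_app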
- Second, we create the configuration files to execute the application using ZSim. Sample configuration files are provided under simulator/config_files/. Please check those files to understand how to configure the number of cores, the number of caches and their sizes, and the number of prefetchers. Next, we describe other important knobs that can be changed in the configuration files (a quick way to locate them is shown after this list):
- pimMode=true|false: When set to true, ZSim will simulate a memory model with shorter memory latency and higher memory bandwidth. When set to false, it will simulate a regular memory device.
- max_offload_instrs: Maximum number of offload instructions to execute.
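To see how these knobs are spelled and where they are placed, you can search for them in the provided templates and generated configuration files, for example (run from the simulator/ folder):

# Locate the pimMode and max_offload_instrs knobs in the configuration files.
grep -rnE "pimMode|max_offload_instrs" templates/ config_files/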
- Third, we run ZSim:
./build/opt/zsim configuration_file.cfg
Getting Help
If you have any suggestions for improvement, please contact geraldo dot deoliveira at safari dot ethz dot ch. If you find any bugs or have further questions or requests, please post an issue at the issue page.
Acknowledgments
We acknowledge support from the SAFARI Research Group’s industrial partners, especially ASML, Facebook, Google, Huawei, Intel, Microsoft, VMware, and the Semiconductor Research Corporation.