LineFS: Efficient SmartNIC Offload of a Distributed File System with Pipeline Parallelism

Overview

LineFS repository for the Artifact Evaluation of SOSP 2021

If you are using our testbed for SOSP 2021 Artifact Evaluation, please read the README for Artifact Evaluation Reviewers first; it is provided on the HotCRP submission page.
After reading it, you can proceed directly to Configuring LineFS.

1. System requirements (Tested environment)

1.1. Hardware requirements

1.1.1. Host machine

  • 16 cores per NUMA node
  • 96 GB DRAM
  • 6 NVDIMM persistent memory modules per NUMA node

1.1.2. SmartNIC

  • NVIDIA BlueField DPU 25G (Model number: MBF1M332A-ASCAT)
    • 16 ARM cores
    • 16 GB DRAM

1.2. Software requirements

1.2.1. Host machine

  • Ubuntu 18.04
  • Linux kernel version: 5.3.11
  • Mellanox OFED driver version: 4.7-3.2.9
  • BlueField software version: 2.5.1

1.2.2. SmartNIC

  • Ubuntu 18.04
  • Linux kernel version: 4.18.0
  • Mellanox OFED driver: 4.7-3.2.9

2. Dependent package installation

sudo apt install build-essential make pkg-config autoconf libnuma-dev libaio1 libaio-dev uuid-dev librdmacm-dev ndctl numactl libncurses-dev libssl-dev libelf-dev rsync

3. Hardware setup

3.1. RoCE configuration

You need to configure RoCE to enable RDMA over Ethernet. This document does not describe how to deploy RoCE because the configuration process differs depending on the switch and adapters in your system. Please refer to Recommended Network Configuration Examples for RoCE Deployment, written by NVIDIA, for RoCE setup.
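
Once RoCE is configured, you can roughly verify that the RDMA devices are visible on each host and SmartNIC with the standard verbs utilities installed by MLNX_OFED (device names such as mlx5_0 will differ per system):

ibv_devices     # lists the available RDMA devices
ibv_devinfo     # detailed view; ports should be in state PORT_ACTIVE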

3.2. SmartNIC setup

We assume that Ubuntu and the MLNX_OFED driver are installed on the SmartNIC and that the SmartNIC is accessible via ssh. To set up the SmartNIC, refer to the BlueField DPU Software documentation.
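
As a quick sanity check, confirm that each SmartNIC is reachable over ssh from its host. For example, with the example hostnames used later in this document:

ssh host01-nic hostname     # should print the NIC's hostname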

3.3. Persistent memory configuration

If your system does not have persistent memory, you need to emulate it using DRAM. Refer to How to Emulate Persistent Memory Using Dynamic Random-access Memory (DRAM) for persistent memory emulation.
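
For reference, the usual approach described in that guide reserves a DRAM range with the memmap kernel parameter and reboots. The offset and size below are placeholders; they must match your machine's physical memory layout.

# Add to GRUB_CMDLINE_LINUX in /etc/default/grub, e.g. reserve 132 GiB starting at the 4 GiB offset:
#   memmap=132G!4G
sudo update-grub
sudo reboot
# After reboot, the reserved range appears as a pmem region that ndctl can manage.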

LineFS uses persistent memory as storage, and it needs to be configured in Device-DAX mode. Make sure that the created namespace is large enough: it must be larger than the size reserved by LineFS (dev_size in libfs/src/storage/storage.h). A command for creating a new namespace is shown below.

sudo ndctl create-namespace -m dax --region=region0 --size=132G

Now, you can find the DAX devices under the /dev directory as below.

$ ls /dev/dax*
/dev/dax0.0  /dev/dax0.1

4. Download source code

git clone [email protected]:casys-kaist/LineFS.git
cd LineFS

5. NFS mount for source code management and deployment

With NFS, you can easily share the same source code among the three host machines and three SmartNICs. It is recommended to maintain two separate directories for the same source code: one for the x86 hosts and the other for the ARM SoCs on the SmartNICs.

5.1. NFS server setup

Place the source code on one of the x86 hosts. Let's assume the source code for x86 is stored at ${HOME}/LineFS_x86 and the source code for ARM at ${HOME}/LineFS_ARM.

cd ~
git clone https://github.com/casys-kaist/LineFS.git LineFS_x86
cp -r LineFS_x86 LineFS_ARM

You can skip the following process if the NFS server is already set up.

Install the NFS server on the machine hosting the source code (libra06 in our testbed).

sudo apt update
sudo apt install nfs-kernel-server

Add the two source code directory paths to the /etc/exports file. For example, if the paths are /home/guest/LineFS_x86 and /home/guest/LineFS_ARM, add the following lines to the file.

/home/guest/LineFS_x86 *(rw,no_root_squash,sync,insecure,no_subtree_check,no_wdelay)
/home/guest/LineFS_ARM *(rw,no_root_squash,sync,insecure,no_subtree_check,no_wdelay)

To apply the modification, run the following command.

sudo exportfs -rv
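
You can verify the exports with, for example:

showmount -e localhost      # both the LineFS_x86 and LineFS_ARM paths should be listed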

5.2. NFS client setup

You can skip this process if the NFS clients are already set up.

On the x86 hosts and the NICs, install the NFS client packages.

sudo apt update
sudo apt install nfs-common

On the x86 hosts, mount the x86 source code directory.

cd ~
mkdir LineFS_x86 	# If there is no directory, create one.
sudo mount -t nfs <nfs_server_address>:/home/guest/LineFS_x86 /home/guest/LineFS_x86

Now, you should see the source code in the ${HOME}/LineFS_x86 directory.

Similarly, mount the LineFS_ARM source code on the SmartNICs. Log in to each SmartNIC from its host machine and mount the ARM source code directory. You need to set up all three NICs.

mkdir LineFS_ARM     # If there is no directory, create one.
sudo mount -t nfs <nfs_server_address>:/home/guest/LineFS_ARM /home/guest/LineFS_ARM
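
Optionally, if you want the NFS mounts to persist across reboots, you can add entries like the following to /etc/fstab on the hosts and NICs (adjust the server address and paths to your setup):

# /etc/fstab on an x86 host
<nfs_server_address>:/home/guest/LineFS_x86  /home/guest/LineFS_x86  nfs  defaults  0  0

# /etc/fstab on a SmartNIC
<nfs_server_address>:/home/guest/LineFS_ARM  /home/guest/LineFS_ARM  nfs  defaults  0  0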

6. Configuring LineFS

6.1. Set project root paths

Set project root paths of the host machine and the SmartNIC at scripts/global.sh.
For example, let's assume the source code is located as follows:

  • On x86 host,
    • Source code for x86 host is located in /home/guest/LineFS_x86
    • Source code for ARM SoC (NIC) is located in /home/guest/LineFS_ARM_src
  • On SmartNIC,
    • Source code for the ARM SoC (/home/guest/LineFS_ARM_src on the host) is mounted at /home/guest/LineFS_ARM via NFS.

In this case, set the following in scripts/global.sh:

PROJ_DIR="/home/guest/LineFS_x86"
NIC_PROJ_DIR="/home/guest/LineFS_ARM"

Set the host-side path of the directory that contains the source code for the ARM SoC (NIC).

NIC_SRC_DIR="/home/guest/LineFS_ARM_src"

Set the signal directory paths of the x86 host and the ARM SoC (NIC) in mlfs_config.sh. These paths are required by the automated scripts that run the experiments.

export X86_SIGNAL_PATH='/home/guest/LineFS_x86/scripts/signals' # Signal path in X86 host. It should be the same as $PROJ_DIR(in global.sh)/scripts/signals.
export ARM_SIGNAL_PATH='/home/guest/LineFS_ARM/scripts/signals' # Signal path in ARM SoC. It should be the same as $NIC_PROJ_DIR(in global.sh)/scripts/signals.

Set the source directory paths of the x86 host in scripts/push_src.sh. These paths are required by the automated scripts that run the experiments.

SRC_ROOT_PATH="/home/guest/LineFS_x86/"  # Do not omit the trailing "/".
TARGET_ROOT_PATH="/home/guest/LineFS_ARM_src/"  # Do not omit the trailing "/".

6.2. Set hostnames

Set the hostnames of the three x86 hosts and the three SmartNICs. For example, if you are using three host machines, host01, host02, and host03, and three SmartNICs, host01-nic, host02-nic, and host03-nic, change scripts/global.sh as below.

# Hostnames of X86 hosts.
# You can get these values by running the `hostname` command on each X86 host.
HOST_1="host01"
HOST_2="host02"
HOST_3="host03"

# Hostnames of NICs
# You can get these values by running the `hostname` command on each NIC.
NIC_1="host01-nic"
NIC_2="host02-nic"
NIC_3="host03-nic"

6.3. Set network interfaces

Set the names of the network interfaces of the x86 hosts and SmartNICs. You can use IP addresses instead of names. For the SmartNICs, use the names of their RDMA interfaces. If the hosts' IP addresses are mapped to host01, host02, and host03, and the NICs' to host01-nic-rdma, host02-nic-rdma, and host03-nic-rdma in /etc/hosts, set scripts/global.sh as below.

# Hostname (or IP address) of host machines. You should be able to ssh to each machine with these names.
HOST_1_INF="host01"
HOST_2_INF="host02"
HOST_3_INF="host03"

# Name (or IP address) of RDMA interface of NICs. You should be able to ssh to each NIC with these names.
NIC_1_INF="host01-nic-rdma"
NIC_2_INF="host02-nic-rdma"
NIC_3_INF="host03-nic-rdma"
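
You can quickly check that these names resolve on the machine that runs the scripts, for example:

for h in host01 host02 host03 host01-nic-rdma host02-nic-rdma host03-nic-rdma; do
    getent hosts $h     # prints the resolved IP address for each name
done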

6.4. Enable password-less ssh to the x86 hosts and NICs for root and your user

To use the scripts provided in this repository, root must be able to access all the machines and NICs via ssh without entering a password. In other words, you need to copy the root account's public key to all the machines and NICs. It is optional but recommended to copy your own account's public key too if you don't want to execute all the scripts as root. ssh-copy-id is a useful command for this. If you don't have a key, generate one with ssh-keygen.
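
For example, a minimal setup from the machine that runs the scripts could look like the following, using the example hostnames from Set hostnames (if password login for root is disabled on a target, append the key to its /root/.ssh/authorized_keys manually instead of using ssh-copy-id):

# As root, generate a key pair once if you don't have one.
ssh-keygen -t rsa

# Copy the root public key to every host and NIC.
for h in host01 host02 host03 host01-nic host02-nic host03-nic; do
    ssh-copy-id root@$h
done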

6.5. Compile-time configurations

The locations of the compile-time configurations are as follows.

  • kernfs/Makefile and libfs/Makefile include compile-time configurations. You have to recompile LineFS to apply any changes to them. You can leave them at their defaults.

  • Some constants, such as the private log size and the maximum number of LibFS processes, are defined in libfs/src/global/global.h. You can leave them at their defaults.

  • The IP addresses of the machines and SmartNICs and the order of the replication chain are defined in the variable hot_replicas in libfs/src/distributed/rpc_interface.h. You have to set correct values for this configuration.

  • The device size used by LineFS is defined in the variable dev_size in libfs/src/storage/storage.h. You can leave it at its default.

  • The paths of the pmem devices should be defined in libfs/src/storage/storage.c as below. g_dev_path[1] is the path of the device for the public area and g_dev_path[4] is the path of the device for the log area. You have to set correct values for this configuration. Here is an example:

    char *g_dev_path[] = {
    (char *)"unused",
    (char *)"/dev/dax0.0",      /* Public area */
    (char *)"/backup/mlfs_ssd",
    (char *)"/backup/mlfs_hdd",
    (char *)"/dev/dax0.1",      /* Log area */
    };

6.6. Run-time configurations

mlfs_config.sh contains the run-time configurations. To apply changes to these configurations, you need to restart LineFS.

7. Compiling LineFS

7.1. Build on the host machine

The following command performs all the compilation required on the host machine. It downloads and compiles libraries, compiles the LibFS library, the kernel worker, the RDMA module, and the benchmarks, sets up SPDK, and formats the file system.

make host-init

You can build the components one by one with the following commands. Refer to the Makefile in the project root directory for details.

make host-lib   # Build host libraries.
make rdma       # Build rdma module.
make kernfs-linefs     # Build kernel worker.
make libfs-linefs      # Build LibFS.

7.2. Build on SmartNIC

The following command performs all the compilation required on the SmartNIC. It downloads and compiles libraries and compiles the RDMA module and NICFS.

make snic-init

You can build the components one by one with the following commands. Refer to the Makefile in the project root directory for details.

make snic-lib           # Build libraries.
make rdma               # Build rdma module.
make kernfs-linefs      # Build `NICFS`

8. Formatting devices

Run the following command in the project root directory. Run it only on the host machines (there is no device on the SmartNIC).

make mkfs
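
If the hosts share the source tree over NFS and password-less root ssh is set up, you can format all three hosts from one place, for example (the hostnames and path are the examples used in this document):

for h in host01 host02 host03; do
    ssh root@$h "cd /home/guest/LineFS_x86 && make mkfs"
done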

9. Deploying LineFS

9.1. Deployment scenario

Consider the following deployment scenario: there are three host machines, and each host machine is equipped with a SmartNIC.

Machine                       Hostname     RDMA IP address
Host machine 1                host01       192.168.13.111
Host machine 2                host02       192.168.13.113
Host machine 3                host03       192.168.13.115
SmartNIC of Host machine 1    host01-nic   192.168.13.112
SmartNIC of Host machine 2    host02-nic   192.168.13.114
SmartNIC of Host machine 3    host03-nic   192.168.13.116

We want LineFS to have the replication chain below.

  • Host machine 1 --> Host machine 2 --> Host machine 3

9.2. Setup I/OAT device

The following command should be executed on each host machine.

make spdk-init

You only need to run it once after each reboot.

9.3. Run kernel workers on host machines

The following script runs a kernel worker.

scripts/run_kernfs.sh

You have to execute this script on all three host machines. After the script runs, the kernel workers wait for the SmartNICs to connect.
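
For example, with the example NFS layout used in this document, start one kernel worker per host from the project root directory (one terminal per host):

# In a terminal on each of host01, host02, and host03:
cd /home/guest/LineFS_x86
scripts/run_kernfs.sh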

9.4. Run NICFS on SmartNIC

Execute scripts/run_kernfs.sh on all three SmartNICs, in the reverse order of the replication chain. The replication chain is defined by hot_replicas in libfs/src/distributed/rpc_interface.h. For example, if it is defined as below,

static struct peer_id hot_replicas[g_n_hot_rep] = {
    { .ip = "192.168.13.112", .role = HOT_REPLICA, .type = KERNFS_NIC_PEER},  // SmartNIC of host machine 1
    { .ip = "192.168.13.111", .role = HOT_REPLICA, .type = KERNFS_PEER},      // Host machine 1
    { .ip = "192.168.13.114", .role = HOT_REPLICA, .type = KERNFS_NIC_PEER},  // SmartNIC of host machine 2
    { .ip = "192.168.13.113", .role = HOT_REPLICA, .type = KERNFS_PEER},      // Host machine 2
    { .ip = "192.168.13.116", .role = HOT_REPLICA, .type = KERNFS_NIC_PEER},  // SmartNIC of host machine 3
    { .ip = "192.168.13.115", .role = HOT_REPLICA, .type = KERNFS_PEER}       // Host machine 3
};

run scripts/run_kernfs.sh in host03-nic --> host02-nic --> host01-nic order. You have to wait until the previous SmartNIC finishes establishing its connections before starting the next one.
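
For example, with the example hostnames and NFS paths used in this document, the manual startup sequence could look like the following. Wait until each NICFS reports that its connections are established before moving on to the next one.

# In a terminal on host03-nic:
cd /home/guest/LineFS_ARM && scripts/run_kernfs.sh

# Then, in a terminal on host02-nic:
cd /home/guest/LineFS_ARM && scripts/run_kernfs.sh

# Finally, in a terminal on host01-nic:
cd /home/guest/LineFS_ARM && scripts/run_kernfs.sh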

9.4.1. Running all Kernel workers and NICFSes at once

If you can correctly run the kernel workers and NICFSes manually, you can use a script to run them all with one command. To make the script work, set the signal paths in mlfs_config.sh as below. This script works only if you are sharing the source code using NFS (refer to NFS mount for source code management and deployment).

export X86_SIGNAL_PATH='/home/guest/LineFS_x86/scripts/signals' # Signal path in X86 host. It should be the same as $PROJ_DIR(in global.sh)/scripts/signals.
export ARM_SIGNAL_PATH='/home/guest/LineFS_ARM/scripts/signals' # Signal path in ARM SoC. It should be the same as $NIC_PROJ_DIR(in global.sh)/scripts/signals.

Run the script. It will format the devices of the x86 hosts and run the kernel workers and NICFSes. Make sure that the source code is built as LineFS (make kernfs-linefs).

# At the project root directory of host01 machine,
scripts/run_all_kernfs.sh

9.5. Run applications

We are going to run a simple test application, iotest. Note that all LineFS applications run on the Primary host's CPU (host01).

cd libfs/tests
sudo ./run.sh ./iotest sw 1G 4K 1   # sequential write, 1GB file, 4KB i/o size, 1 thread

10. Run Assise

To run Assise (a DFS without NIC offloading) rather than LineFS, you have to rebuild LibFS and SharedFS (a.k.a. KernFS).

# At the project root directory,
make kernfs-assise
make libfs-assise

You can use the same script, scripts/run_kernfs.sh. However, each SharedFS (KernFS) needs to wait for the next SharedFS in the replication chain to be ready. For example: run Replica 2's SharedFS --> wait for a while --> run Replica 1's SharedFS --> wait for a while --> run the Primary's SharedFS.

You can also use scripts/run_all_kernfs.sh to format the devices and run all the SharedFSes with one command, as described in Running all Kernel workers and NICFSes at once.

11. Running benchmarks

Refer to README-bench.
