Anomaly Detection on Dynamic (Time-Evolving) Graphs in Real Time and in a Streaming Manner

Overview

MIDAS

C++ implementation of the MIDAS algorithm (see the citations below).

The old implementation is in the OldImplementation branch; it should be considered archived and will rarely receive feature updates.

Features

  • Finds Anomalies in Dynamic/Time-Evolving Graphs (intrusion detection, fake ratings, financial fraud)
  • Detects Microcluster Anomalies (suddenly arriving groups of suspiciously similar edges e.g. DoS attack)
  • Theoretical Guarantees on False Positive Probability
  • Constant Memory (independent of graph size)
  • Constant Update Time (real-time anomaly detection to minimize harm)
  • Up to 55% more accurate and 929 times faster than state-of-the-art approaches
  • Experiments are performed on real-world datasets, e.g. DARPA (processed by util/PreprocessData.py)

Demo

If you use Windows:

  1. Open a Visual Studio developer command prompt, since we want its toolchain
  2. cd to the project root MIDAS/
  3. cmake -DCMAKE_BUILD_TYPE=Release -GNinja -S . -B build/release
  4. cmake --build build/release --target Demo
  5. cd to MIDAS/build/release/
  6. .\Demo.exe

If you use Linux/macOS:

  1. Open a terminal
  2. cd to the project root MIDAS/
  3. cmake -DCMAKE_BUILD_TYPE=Release -S . -B build/release
  4. cmake --build build/release --target Demo
  5. cd to MIDAS/build/release/
  6. ./Demo

The demo runs the filtering core (MIDAS-F) on MIDAS/data/DARPA/darpa_processed.csv, which has 4.5M records.

The scores will be exported to MIDAS/temp/Score.txt; a higher score means more anomalous.

All file paths are absolute and "hardcoded" by CMake, but it's still suggested NOT to run the demo by double-clicking the executable file.

Requirements

Core

  • C++11-compliant compiler
  • C++ standard libraries

Demo (if using the experimental ROC-AUC implementation)

  • C++ standard libraries

Demo (if using the sklearn ROC-AUC implementation)

  • Python 3 (MIDAS/util/EvaluateScore.py)
    • pandas: I/O
    • scikit-learn: Compute ROC-AUC

Experiment

  • (Optional) Intel TBB: Parallelization
  • (Optional) OpenMP: Parallelization

Other python utility scripts

  • Python 3
    • pandas
    • scikit-learn

Customization

Switch to sklearn ROC-AUC Implementation

In MIDAS/example/Demo.cpp, comment out the section "Evaluate scores (experimental)", then uncomment the sections "Write output scores" and "Evaluate scores".

Different CMS Size / Decay Factor / Threshold

These are arguments of the cores' constructors, which are at MIDAS/example/Demo.cpp:67-69 (see the sketch after the next subsection).

Switch Cores

Cores are instantiated at MIDAS/example/Demo.cpp:67-69; uncomment the one you want to use.
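
For orientation, here is a minimal sketch of what that choice might look like. The parameter values are made up; only the RelationalCore constructor shown here follows the migration notes quoted in the issues below, so treat the NormalCore and FilteringCore lines as assumptions and check the headers in MIDAS/src/ for the exact signatures.

    #include <cstdio>

    #include "NormalCore.hpp"
    #include "RelationalCore.hpp"
    #include "FilteringCore.hpp"

    int main() {
        // Hypothetical parameter values; tune for your data.
        int numRow = 2;        // CMS depth (number of hash functions)
        int numColumn = 1024;  // CMS width (number of buckets per row)
        float factor = 0.5f;   // decay factor (MIDAS-R / MIDAS-F)

        // Keep exactly one core uncommented:
        // MIDAS::NormalCore midas(numRow, numColumn);                      // assumption
        MIDAS::RelationalCore midas(numRow, numColumn, factor);
        // MIDAS::FilteringCore midas(numRow, numColumn, 1e3f, factor);     // assumption: extra threshold argument

        // Each call scores one (source, destination, timestamp) record.
        std::printf("score = %f\n", midas(1, 2, 1));
        return 0;
    }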

Custom Dataset + Demo.cpp

You need to prepare three files (an illustrative example follows this list):

  • Meta file
    • Only includes an integer N, the number of records in the dataset
    • Use its path for pathMeta
    • E.g. MIDAS/data/DARPA/darpa_shape.txt
  • Data file
    • A headerless CSV file of shape [N,3]
    • Columns are sources, destinations, timestamps
    • Use its path for pathData
    • E.g. MIDAS/data/DARPA/darpa_processed.csv
  • Label file
    • A headerless CSV file of shape [N,1]
    • The corresponding label for each data record
      • 0 means normal record
      • 1 means anomalous record
    • Use its path for pathGroundTruth
    • E.g. MIDAS/data/DARPA/darpa_ground_truth.csv
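
For illustration, here is what a tiny five-record custom dataset could look like; the file names and values below are made up, not taken from DARPA.

    my_shape.txt (pathMeta):

        5

    my_processed.csv (pathData):

        2,3,1
        2,3,1
        5,9,2
        2,3,2
        7,3,3

    my_ground_truth.csv (pathGroundTruth):

        0
        0
        1
        0
        0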

Custom Dataset + Custom Runner

  1. Include the header MIDAS/src/NormalCore.hpp, MIDAS/src/RelationalCore.hpp, or MIDAS/src/FilteringCore.hpp
  2. Instantiate cores with required parameters
  3. Call operator() on individual data records; it returns the anomaly score for the input record (a sketch follows below)
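
A minimal sketch of such a runner, assuming a headerless source,destination,timestamp CSV as described above. The RelationalCore usage follows the migration notes quoted in the issues below, while the file name and the parameter values (2 rows, 1024 columns, factor 0.5) are illustrative only.

    // custom_runner.cpp -- hedged sketch, not part of the repository
    #include <fstream>
    #include <iostream>
    #include <sstream>
    #include <string>

    #include "RelationalCore.hpp"

    int main(int argc, char* argv[]) {
        if (argc < 2) {
            std::cerr << "Usage: custom_runner <data.csv>" << std::endl;
            return 1;
        }

        MIDAS::RelationalCore midas(2, 1024, 0.5f);  // illustrative parameters

        std::ifstream file(argv[1]);
        std::string line;
        while (std::getline(file, line)) {  // each line: source,destination,timestamp
            std::istringstream record(line);
            int source, destination, timestamp;
            char comma;
            record >> source >> comma >> destination >> comma >> timestamp;
            // operator() returns the anomaly score of this record.
            std::cout << midas(source, destination, timestamp) << '\n';
        }
        return 0;
    }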

Other Files

example/

Experiment.cpp

The code we used for experiments.
It will try to use Intel TBB or OpenMP for parallelization.
You should comment out all but one runner function call in main(), as most results are exported to MIDAS/temp/Experiment.csv together with many intermediate files.

Reproducible.cpp

Similar to Demo.cpp, but with all random parameters hardcoded so it always produces the same result.
It lets us and other developers test whether implementations in other languages produce acceptable results.

util/

DeleteTempFile.py, EvaluateScore.py and ReproduceROC.py will show their usage and a short description when executed without any arguments.

AUROC.hpp

Experimental ROC-AUC implementation in C++11. More info at this repo.
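
For readers curious what such a computation involves, below is a standalone ROC-AUC sketch (trapezoidal integration of the ROC curve). It is an illustration only and is not the code in AUROC.hpp.

    // Standalone ROC-AUC sketch; assumes both classes are present in `label`.
    #include <algorithm>
    #include <iostream>
    #include <vector>

    double ComputeAUROC(const std::vector<float>& score, const std::vector<int>& label) {
        const size_t n = score.size();
        std::vector<size_t> order(n);
        for (size_t i = 0; i < n; i++) order[i] = i;
        // Sort indices by descending score.
        std::sort(order.begin(), order.end(),
                  [&](size_t a, size_t b) { return score[a] > score[b]; });

        double positives = 0, negatives = 0;
        for (int y : label) {
            if (y) positives += 1; else negatives += 1;
        }

        double tp = 0, fp = 0, prevTP = 0, prevFP = 0, area = 0;
        for (size_t i = 0; i < n; i++) {
            if (label[order[i]]) tp += 1; else fp += 1;
            // Close a threshold step only when the score changes (tie handling).
            if (i + 1 == n || score[order[i + 1]] != score[order[i]]) {
                area += (fp - prevFP) * (tp + prevTP) / 2;  // trapezoid in raw counts
                prevTP = tp;
                prevFP = fp;
            }
        }
        return area / (positives * negatives);  // normalize to [0, 1]
    }

    int main() {
        const std::vector<float> score = {0.9f, 0.8f, 0.3f, 0.2f};
        const std::vector<int> label = {1, 1, 0, 0};
        std::cout << "ROC-AUC = " << ComputeAUROC(score, label) << std::endl;  // expect 1
        return 0;
    }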

PreprocessData.py

The code to process the raw dataset into an easy-to-read format.
Datasets are always assumed to be in a folder in MIDAS/data/.
It can process the following dataset(s):

  • DARPA/darpa_original.csv -> DARPA/darpa_processed.csv, DARPA/darpa_ground_truth.csv, DARPA/darpa_shape.txt

In Other Languages

  1. Python: Rui Liu's MIDAS.Python, Ritesh Kumar's pyMIDAS
  2. Python (pybind): Wong Mun Hou's MIDAS
  3. Golang: Steve Tan's midas
  4. Ruby: Andrew Kane's midas
  5. Rust: Scott Steele's midas_rs
  6. R: Tobias Heidler's MIDASwrappeR
  7. Java: Joshua Tokle's MIDAS-Java
  8. Julia: Ashrya Agrawal's MIDAS.jl

Online Coverage

  1. ACM TechNews
  2. AIhub
  3. Hacker News
  4. KDnuggets
  5. Microsoft
  6. Towards Data Science

Citation

If you use this code for your research, please consider citing the arXiv preprint

@misc{bhatia2020realtime,
    title={Real-Time Streaming Anomaly Detection in Dynamic Graphs},
    author={Siddharth Bhatia and Rui Liu and Bryan Hooi and Minji Yoon and Kijung Shin and Christos Faloutsos},
    year={2020},
    eprint={2009.08452},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

or the AAAI paper

@inproceedings{bhatia2020midas,
    title="MIDAS: Microcluster-Based Detector of Anomalies in Edge Streams",
    author="Siddharth {Bhatia} and Bryan {Hooi} and Minji {Yoon} and Kijung {Shin} and Christos {Faloutsos}",
    booktitle="AAAI Conference on Artificial Intelligence (AAAI)",
    year="2020"
}
Comments
  • Production implementation

    Hi, first off, this is really cool. I'm a novice coder, and for research I would like to implement this on NetFlow data in real time. The only thing is I'm unsure how this can be integrated into a live environment rather than run on a local dataset. Maybe it's a dumb question, but how should or could this be implemented?

    opened by SyGen899 9
  • 1.0 Changes

    Hey, I tried to summarize the changes in 1.0 that I encountered while upgrading the Ruby gem. It may be worth adding some versioning notes to the readme to make it easier for others to upgrade. Demo.cpp was really helpful. Assuming src, dst, and times are std::vector<int>:

    Version 0.1.0

    #include <vector>
    #include <anom.hpp>

    std::vector<double>* result = midasR(src, dst, times, num_rows, num_buckets, factor);
    

    Version 1.0.0

    #include <RelationalCore.hpp>
    
    size_t n = src.size();
    std::vector<float> result;
    result.reserve(n);
    
    MIDAS::RelationalCore midas(num_rows, num_buckets, factor);
    for (size_t i = 0; i < n; i++) {
      result.push_back(midas(src[i], dst[i], times[i]));
    }
    

    Use NormalCore for the no relations version.

    Other changes:

    • the midas function takes float input and returns a float score (previously took int input and returned a double score)
    • factor is now a float instead of a double
    • there's a new FilteringCore
    opened by ankane 7
  • What's the reason for the `m` value?

    https://github.com/bhatiasiddharth/MIDAS/blob/c1e4e6b316d007165907db44efcefab9fd8b02f8/anom.cpp#L50

    Since we have no guarantee that the first and last values of the inputs are different (or, if they are different, that they differ by any particular magnitude), why not just supply m as an input to the program?

    (I looked through the paper. Apologies if I missed an explanation there.)

    enhancement question 
    opened by scooter-dangle 6
  • Ground truth labels for TwitterWorldCup2014 dataset

    I want to run MIDAS on the TwitterWorldCup2014 dataset, but in the given dataset the ground truth does not include a 0/1 label; instead, it shows the following:

    1 | Arena de Sao Paulo, Sao Paulo, Brazil | Brazil, Croatia | Marcelo | Own Goal | 6-12-2014 20:11:00 | High importance events.

    Please suggest how to generate labels as 0 or 1, i.e. anomalous or not. Have you already prepared ground truth labels for this? If yes, could you please share them?

    In this dataset there are three kinds of events: 1. goal, 2. penalty, 3. injury. What could be the anomaly in these events?

    Thanks.

    opened by victordaniel 3
  • SyntaxError: print(f"ROC-AUC{indexRun} = {auc:.4f}")

    When I run the Demo, I get the following error, which I couldn't resolve after much trying. Why is that so? (I don't think it is a syntax error; I also don't find such syntax anywhere.)

    Seed = 1606470101    // In case of reproduction
    #Records = 4554344   // Dataset is loaded
    Time = 826ms         // Algorithm is finished
                         // Raw anomaly scores are exported to
                         // /home/rohit/MIDAS/MIDAS/temp/Score.txt
      File "/home/rohit/MIDAS/MIDAS/util/EvaluateScore.py", line 33
        print(f"ROC-AUC{indexRun} = {auc:.4f}")
                                              ^
    SyntaxError: invalid syntax

    The output result is nevertheless there in Score.txt.

    opened by rohitnitk 2
  • Segmentation fault: 11

    Hello, I am currently trying to use MIDAS-R on a dataset however I have this error right after running it:

    $ ./midas -i ../Wednesday-14-02-2018_4GRAPH.csv -o ../scores.txt
    Finished Loading Data from ../Wednesday-14-02-2018_4GRAPH.csv
    Segmentation fault: 11
    

    Here is a sample of Wednesday-14-02-2018_4GRAPH.csv file:

    source,destination,time
    1451698946054,901943132206,352877
    1451698946054,901943132206,628353
    1451698946054,901943132206,973076
    1451698946054,901943132206,980110
    1451698946054,901943132206,981852
    103079215137,1460288880642,1518566400
    1322849927169,1047972020228,1518566400
    1322849927169,1047972020228,1518566400
    1322849927169,1047972020228,1518566400
    687194767395,1640677507073,1518566400
    1236950581249,1700807049228,1518566400
    1322849927169,1047972020228,1518566400
    1700807049228,712964571136,1518566400
    1322849927169,1047972020228,1518566400
    1632087572482,1477468749825,1518566400
    1597727834115,94489280524,1518566400
    1236950581249,979252543497,1518566400
    1580547964930,979252543497,1518566400
    1322849927169,1047972020228,1518566400
    1116691496960,1047972020228,1518566401
    1374389534736,163208757249,1518566401
    1116691496960,1047972020228,1518566401
    1520418422807,575525617668,1518566401

    What is wrong?

    Thanks

    question 
    opened by rastafrange 2
  • Threshold Used For Experimental Results

    Hi there, I was attempting to replicate your results on the Darpa dataset, but realized you didn't specify the threshold you used. I understand the threshold is user defined, but would like to know what value was used in the experimental setup. Could you please clarify how you calculate the MIDAS(R) ROC and what threshold you used?

    Thanks!

    question 
    opened by ZeroCool2u 2
  • Ruby Library

    Hey, thanks for this project and research! Just wanted to let you know there are now Ruby bindings for it. If you have any feedback, let me know or feel free to create an issue on the project.

    enhancement 
    opened by ankane 2
  • Implementation question: should I fill in absent data?

    Hi, thank you for implementing this amazing anomaly detection method! I'm wondering whether I should fill in absent data. For example, if the directional IP pair A to B appears at 10:00 but is absent at 11:00 and 12:00, should I fill in a count of 0 for A to B at 11:00 and 12:00?

    Thank you

    opened by chunshou-Liu 1
  • Any recommendation to normalize score?

    Hi,

    Thank you for implementing this wonderful AD method!

    I've read through your paper and seen how the score is calculated.

    We usually use Unix timestamps to represent time, so the scores we get are usually very large. Do you have any recommendations to narrow the value range?

    Thank you!

    opened by munhouiani 1
  • How to decide whether an edge is anomalous?

    In the algorithm, on what basis do you decide whether an edge is anomalous or not, given the anomaly score?
    (I've read the paper but couldn't find it.)

    question 
    opened by rohitnitk 1
  • Unclear Docker volume binds for Demo

    When running the Demo code on Docker, it took me a while before noticing that I needed to bind both $PWD/data and $PWD/temp (if I want the raw scores) when running the container. I would suggest adding a section to the README about executing the Demo on Docker and include something like the following snippet:

    docker run -it \
    	--rm \
    	--name midas \
    	--volume $PWD/data:/MIDAS/data \
    	--volume $PWD/temp:/MIDAS/temp \
    	midas
    

    Any thoughts?

    opened by rcopstein 2
  • Should either Dockerize or better specify dependencies

    I'm running Ubuntu 18.04 and so created the following initial Dockerfile to get around the cmake version requirements that prevent me from following the steps listed in the Demo section of the README:

    FROM ubuntu:20.04
    
    ENV DEBIAN_FRONTEND noninteractive
    
    RUN apt-get update \
        && apt-get install --yes \
          build-essential \
          cmake \
          python-is-python3 \
        && apt-get clean \
        && rm --recursive --force \
          /var/lib/apt/lists/* \
          /tmp/* \
          /var/tmp/*
    
    RUN mkdir /src
    WORKDIR /src
    
    COPY CMakeLists.txt ./
    RUN mkdir --parents build/release \
        && cp CMakeLists.txt build/release/
    
    COPY example ./example
    COPY src ./src
    COPY temp ./temp
    COPY util ./util
    
    RUN cmake -DCMAKE_BUILD_TYPE=Release -S . -B build/release \
        && cmake --build build/release --target Demo
    

    I then build it via

    # Wouldn't need to use `sudo` on macOS
    sudo docker build . --tag midas
    

    and run the compiled Demo app via

    sudo docker run \
      --tty \
      --interactive \
      --rm \
      --volume $PWD/data:/src/data \
      midas \
      build/release/Demo
    

    which, when shelling out to the Python scripts, aborts with the following

    Traceback (most recent call last):
      File "/src/util/EvaluateScore.py", line 20, in <module>
        from pandas import read_csv
    ModuleNotFoundError: No module named 'pandas'
    

    since pandas is not available.


    To better avoid the need for local environment debugging, my personal preference would be for a known-working Dockerfile.

    opened by scooter-dangle 7
Releases(v1.1.2)
  • v1.1.2(Nov 16, 2020)

  • v1.1.1(Sep 29, 2020)

  • v1.1.0(Sep 20, 2020)

    • Partially vectorize MIDAS-F's conditional merge
      • Reduce running time by ~10%
    • Add a reproducible (non-random) demo
      • To test implementations in other languages
    • Add an official Python implementation
      • See README.md
    • Merge EdgeHash.hpp and NodeHash.hpp -> CountMinSketch.hpp
    • Change the method signature of MIDAS::CountMinSketch::Hash()
      • indexOut is now the first parameter, the same as in other methods
      • b has a default value 0
    • Merge src/CMakeLists.txt into CMakeLists.txt
    • Rename variable MIDAS::*Core::timestampCurrent -> MIDAS::*Core::timestamp
      • Use this-> to differentiate
    • Rename macro ParallelProvider_* -> ParallelizationProvider_*
      • Only used in example/Experiment.cpp
    Source code(tar.gz)
    Source code(zip)
  • v1.0.2(Jul 23, 2020)

  • v1.0.1(Jun 25, 2020)

    • Remove useless script file MIDAS/util/PlotAnomalousEvent.py
    • Unify import style in MIDAS/util/PreprocessData.py
    • Add Intel TBB support for MIDAS/example/Experiment.cpp
    Source code(tar.gz)
    Source code(zip)
  • v1.0.0(Jun 16, 2020)

  • v0.1.0(Jun 16, 2020)

Owner
Stream-AD
Anomaly Detection in Data Streams