A system to flag anomalous source code expressions by learning typical expressions from training data

Overview

A friendly request: Thanks for visiting the control-flag GitHub repository! If you find control-flag useful, we would appreciate a note from you (to [email protected] or [email protected]). And, of course, we love testimonials!

-- The ControlFlag Team

ControlFlag: A Self-supervised Idiosyncratic Pattern Detection System for Software Control Structures

ControlFlag is a self-supervised idiosyncratic pattern detection system that learns the typical patterns occurring in the control structures of high-level programming languages, such as C/C++, by mining these patterns from open-source repositories (on GitHub and other version control systems). It then applies the learned patterns to detect anomalous patterns in a user's code.

Brief technical description

ControlFlag's pattern anomaly detection system can be applied to various problems, such as detecting typographical errors or flagging a missing NULL check, to name a few. This PoC demonstrates ControlFlag's application to typographical error detection.

The figure below shows ControlFlag's two main phases: (1) the pattern mining phase, and (2) the scanning phase for anomalous patterns. The pattern mining phase is a "training phase" that mines typical patterns in the user-provided GitHub repositories and then builds a decision tree from the mined patterns. The scanning phase, on the other hand, applies the mined patterns to flag anomalous expressions in the user-specified target repositories.

ControlFlag design

More details can be found in our MAPS paper (https://arxiv.org/abs/2011.03616).
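To make the two phases concrete, here is a toy sketch in Python. It is not ControlFlag's implementation: ControlFlag parses code with Tree-sitter and builds a decision tree, whereas this sketch abstracts conditions with regular expressions and counts them. The function names (abstract, mine, is_anomalous) and the min_count cut-off are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter


def abstract(condition: str) -> str:
    """Crudely abstract a C condition: identifiers become ID, numbers become NUM.

    This is a stand-in for the AST-based abstraction a real parser would produce.
    """
    text = re.sub(r"\b[A-Za-z_]\w*\b", "ID", condition)
    text = re.sub(r"\b\d+(?:\.\d+)?\b", "NUM", text)
    return re.sub(r"\s+", "", text)


def mine(conditions):
    """'Training' phase: count how often each abstracted pattern occurs."""
    return Counter(abstract(c) for c in conditions)


def is_anomalous(condition, counts, min_count=2):
    """'Scanning' phase: flag a condition whose pattern is rare in the mined data.

    min_count is a hypothetical cut-off, not ControlFlag's anomaly_threshold.
    """
    return counts[abstract(condition)] < min_count
```

With patterns mined from conditions such as x == 0, y == 1 and a single p = 3, a new condition k = 7 would be flagged because its pattern ID=NUM is rare, while k == 7 would not.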

Directory structure (evolving)

  • src: Source code for the ControlFlag typographical error detection system
  • scripts: Scripts for pattern mining and scanning for anomalies
  • quick_start: Scripts to run quick start tests
  • github: Scripts and data for downloading GitHub repos. It also contains pre-processed training data containing patterns mined from 6000 GitHub repositories using C as their primary language.
  • tests: unit tests

Install

ControlFlag can be built on Linux and MacOS.

Requirements

  • CMake 3.4.3 or above
  • C++17 compatible compiler
  • Tree-sitter parser (downloaded automatically as part of cmake)
  • GNU parallel (optional, if you want to generate your own training data)

Tested build configuration on Linux-based systems

  • CentOS-7.6/Ubuntu-20.04 with g++-v10.2.0 for x86_64

Tested build configuration on MacOS

  • MacOS Mojave v10.14.6 with clang-1001.0.46.4 (Apple LLVM version 10.0.1) for x86_64 (obtained from The Command Line Tools Package)

Build

$ cd control-flag
$ cmake .
$ make -j
$ make test

All tests in make test should pass, but currently tests for Verilog are failing because of a version mismatch issue. Verilog support is WIP.

Using ControlFlag

Quick start

Using patterns obtained from 6000 GitHub repos to scan a repository of your choice

Download the training data for C language depending on the memory constraints of your device. Note, however, that using smaller datasets may lead to reduced accuracy in the results ControlFlag produces and possibly an increase in the number of false positives it generates.

Dataset name  Size on disk  Memory requirements  Direct link  gdown ID                           MD5 checksum
Small         ~100MB        ~400MB               link         1gvUyRXq1SeZD9g3i__RaamYAMo_QaQIb  2825f209aba0430993f7a21e74d99889
Medium        ~450MB        ~1.3GB               link         1zsCFJAKlZlSAWKPfBcVGcQNlFB5Gtwo3  aab2427edebe9ed4acab75c3c6227f24
Large         ~9GB          ~13GB                link         1-jzs3zrKU541hwChaciXSk8zrnMN1mYc  1ba954d9716765d44917445d3abf8e85

$ python -m pip install gdown && gdown https://drive.google.com/uc?id=<id_from_table>
$ (optional) md5sum <tgz_file>
$ tar -zxf <tgz_file>
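If md5sum is not available on your system, the optional checksum step can also be done with Python's standard hashlib module. The sketch below is one possible way to do it; the helper names md5_of_file and matches_checksum are hypothetical, and the expected value is the MD5 checksum from the table above.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks.

    Chunked reading keeps memory use low even for the ~9GB dataset.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_md5):
    """Compare a file's MD5 digest with the expected value (case-insensitive)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, matches_checksum("<tgz_file>", "2825f209aba0430993f7a21e74d99889") should return True for an intact download of the small dataset.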

To scan C code of your choice, use the command below:

$ scripts/scan_for_anomalies.sh -d <directory_to_be_scanned_for_anomalies> -t <training_data>.ts -o <output_directory_to_store_log_files>

Once the run is complete (which could take some time, depending on your system and the number of C programs in your repository), refer to the section below to understand the scan output.

Mining patterns from a small repo and applying them to another small repo

In this test, we will mine patterns from GitHub's Glb-director project and apply them to flag anomalies in GitHub's brubeck project.

Simply run the command below:

$ cd quick_start && ./test1_c.sh

If everything goes well, the scanner output appears in the test1_scan_output directory. Search it for the "Potential anomaly" label with grep "Potential anomaly" -C 5 *.log, and you should see output like the following:

thread_6.log-Level:TWO Expression:(parenthesized_expression (binary_expression ("==") (identifier) (non_terminal_expression))) found in training dataset:
Source file: brubeck/src/server.c:266:5:(s == sizeof(fdsi))
thread_6.log-Autocorrect search took 0.000 secs
thread_6.log:Potential anomaly
thread_6.log-Did you mean:(parenthesized_expression (binary_expression ("==") (identifier) (non_terminal_expression))) with editing cost:0 and occurrences: 1
thread_6.log-Did you mean:(parenthesized_expression (binary_expression ("==") (identifier) (null))) with editing cost:1 and occurrences: 25
thread_6.log-Did you mean:(parenthesized_expression (binary_expression ("==") (identifier) (identifier))) with editing cost:1 and occurrences: 5
thread_6.log-Did you mean:(parenthesized_expression (binary_expression (">=") (identifier) (non_terminal_expression))) with editing cost:1 and occurrences: 3
thread_6.log-Did you mean:(parenthesized_expression (binary_expression ("==") (non_terminal_expression) (non_terminal_expression))) with editing cost:1 and occurrences: 2

The anomaly is flagged for brubeck/src/server.c at line number 266.
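The "Did you mean" lines in the sample above follow a regular shape, so they can be pulled apart with a small script when you want to rank the suggestions. The sketch below assumes the line format shown in this sample; the Suggestion type and parse_suggestions function are hypothetical names, not part of ControlFlag.

```python
import re
from typing import NamedTuple


class Suggestion(NamedTuple):
    pattern: str       # suggested AST pattern
    cost: int          # editing cost reported by the scanner
    occurrences: int   # how often the pattern was seen in the training data


# Format assumed from the sample output above.
_SUGGESTION = re.compile(
    r"Did you mean:(?P<pattern>.*) with editing cost:\s*(?P<cost>\d+)"
    r" and occurrences:\s*(?P<occ>\d+)"
)


def parse_suggestions(lines):
    """Extract every 'Did you mean' suggestion from an iterable of log lines."""
    found = []
    for line in lines:
        match = _SUGGESTION.search(line)
        if match:
            found.append(
                Suggestion(match["pattern"], int(match["cost"]), int(match["occ"]))
            )
    return found
```

Sorting the result by occurrences, after dropping the cost-0 entry (which is the flagged expression itself), puts the most common alternative first. In the sample above that is the comparison against null, with 25 occurrences.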

Detailed steps

  1. Pattern Mining phase (if you want to generate training data yourself)

If you do not want to generate training data yourself, skip to the Evaluation step below.

In this phase, we mine the idiosyncratic patterns that appear in the control structures of a high-level language such as C. This PoC mines patterns from if statements that appear in C programs.

If you want to use your own repository for mining patterns, jump to Step 1.2.

1.1 Downloading Top-100 GitHub repos for C language

Steps below show how to download Top-100 GitHub repos for C language (c100.txt) and generate training data. training_repo_dir is a directory where the command below will clone all the repos.

$ cd github
$ python download_repos.py -f c100.txt -o <training_repo_dir> -m clone -p 5

1.2 Mining patterns from downloaded repositories

You can use your own repository to mine for expressions by passing it in place of <training_repo_dir>.

The mine_patterns.sh script does this. Its usage is as below:

Usage: ./mine_patterns.sh -d <directory_to_mine_patterns_from> -o <output_file_to_store_training_data>
Optional:
[-n number_of_processes_to_use_for_mining]  (default: num_cpus_on_system)
[-l source_language_number] (default: 1 (C), supported: 1 (C), 2 (Verilog))

We use it as:

$ scripts/mine_patterns.sh -d <training_repo_dir> -o <training_data_file> -l 1

<training_data_file> contains the conditional expressions in C that are found in the specified GitHub repos, together with their AST (abstract syntax tree) representations. It is a plain text file, so you can open it in any text viewer.

Evaluation (or scanning for anomalies in C code from test repo)

We can run the scan_for_anomalies.sh script to scan a target directory of interest. Its usage is as below.

Usage: ./scan_for_anomalies.sh -t <training_data> -d <directory_to_scan_for_anomalous_patterns>
Optional:
 [-c max_cost_for_autocorrect]              (default: 2)
 [-n max_number_of_results_for_autocorrect] (default: 5)
 [-j number_of_scanning_threads]            (default: num_cpus_on_systems)
 [-o output_log_dir]                        (default: /tmp)
 [-l source_language_number]                (default: 1 (C), supported: 1 (C), 2 (Verilog))
 [-a anomaly_threshold]                     (default: 3.0)

We use it as:

$ scripts/scan_for_anomalies.sh -d <test_directory> -t <training_data_file> -o <output_log_dir>

As part of scanning for anomalies, ControlFlag also suggests possible corrections when a conditional expression is flagged as an anomaly. max_cost_for_autocorrect (-c, default: 2) is the maximum editing cost of a suggested correction, that is, how close a suggestion must be to the possibly mistyped expression. Increasing it leads to more suggested corrections. If you feel that the number of reported anomalies is high, consider reducing anomaly_threshold to 1.0 or less.
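To give a feel for what an editing cost measures, the sketch below computes a Levenshtein distance between two AST patterns. The tokenization is an assumption inferred from the sample logs in this document, not taken from ControlFlag's source: each node-type name counts as one symbol and each operator character counts as one symbol. Under that assumption it reproduces costs shown in the samples, such as 1 for "==" versus ">=" and 2 for "+" versus "==". The names tokenize and edit_cost are hypothetical.

```python
import re

# A quoted operator, a node-type name, or a parenthesis.
_TOKEN = re.compile(r'"([^"]*)"|([A-Za-z_]+)|([()])')


def tokenize(pattern):
    """Split an AST pattern into symbols (assumed scheme, see lead-in)."""
    tokens = []
    for operator, name, paren in _TOKEN.findall(pattern):
        if operator:
            tokens.extend(operator)   # one symbol per operator character
        elif name:
            tokens.append(name)       # one symbol per node-type name
        elif paren:
            tokens.append(paren)
    return tokens


def edit_cost(first, second):
    """Levenshtein distance between the symbol sequences of two patterns."""
    source, target = tokenize(first), tokenize(second)
    previous = list(range(len(target) + 1))
    for i, s in enumerate(source, start=1):
        current = [i]
        for j, t in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,             # delete s
                current[j - 1] + 1,          # insert t
                previous[j - 1] + (s != t),  # substitute s with t, or match
            ))
        previous = current
    return previous[-1]
```

A suggestion would then be kept only if its edit_cost to the flagged pattern does not exceed the maximum cost, which is why raising that limit yields more suggestions.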

Understanding scan output

Under output_log_dir you will find multiple log files, one for the scan output of each scanner thread. Potential anomalies are reported with "Potential anomaly" as a label. The command below finds the log files that contain at least one anomaly.

$ grep "Potential anomaly" <output_log_dir>/thread_*.log

A sample anomaly report looks like this:

Level:<ONE or TWO> Expression: <AST_for_anomalous_expression>
Source file and line number: <C code with line number having the anomaly>
Potential anomaly
Did you mean ...

The text after "Did you mean" shows possible corrections to the anomalous expression.
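For larger scans it can help to reduce the logs to a list of source locations. The sketch below walks through log lines, remembers the most recent "Source file:" entry, and records it when a "Potential anomaly" label follows. It assumes the path:line:column:(expression) layout seen in the sample outputs in this document; treating the third field as a column is also an assumption. Anomaly and find_anomalies are hypothetical names.

```python
import re
from typing import NamedTuple


class Anomaly(NamedTuple):
    path: str
    line: int
    column: int        # assumed meaning of the third field
    expression: str


# Layout assumed from the sample output: path:line:column:(expression)
_SOURCE = re.compile(
    r"Source file: (?P<path>.+?):(?P<line>\d+):(?P<col>\d+):(?P<expr>.*)$"
)


def find_anomalies(log_lines):
    """Return the source location of every expression labelled as an anomaly."""
    anomalies = []
    last_seen = None
    for raw in log_lines:
        text = raw.rstrip("\n")
        match = _SOURCE.search(text)
        if match:
            # Remember the expression currently being reported.
            last_seen = Anomaly(
                match["path"], int(match["line"]), int(match["col"]), match["expr"]
            )
        elif "Potential anomaly" in text and last_seen is not None:
            anomalies.append(last_seen)
            last_seen = None
    return anomalies
```

To use it on real output, open each thread_*.log file under output_log_dir and pass the file object to find_anomalies, since a file object iterates over its lines.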

Comments
  • I've tried it with ClickHouse and it did not find anything meaningful.

    I've tested https://github.com/ClickHouse/ClickHouse (sources without submodules) with full training data. It has found only a few false positives (see below).

    It means that either ClickHouse source code is too good (which I don't believe) or control-flag did not do the job well.

    $ grep -B1 -A10 "Potential anomaly" *.log 
    thread_0.log-Level:ONE Expression:(parenthesized_expression (binary_expression ("+") (subscript_expression (field_expression (identifier)(field_identifier))(number_literal))(number_literal))) found in training dataset: Source file: /home/milovidov/work/ClickHouse_clean/base/glibc-compatibility/musl/atomic.h:212:4:(u.r[0]+1)
    thread_0.log:Expression is Potential anomaly
    thread_0.log-Did you mean:(parenthesized_expression (binary_expression ("+") (subscript_expression (field_expression (identifier)(field_identifier))(number_literal))(number_literal))) with editing cost:0 and occurrences: 2
    thread_0.log-Did you mean:(parenthesized_expression (binary_expression ("&") (subscript_expression (field_expression (identifier)(field_identifier))(number_literal))(number_literal))) with editing cost:1 and occurrences: 2559
    thread_0.log-Did you mean:(parenthesized_expression (binary_expression (">") (subscript_expression (field_expression (identifier)(field_identifier))(number_literal))(number_literal))) with editing cost:1 and occurrences: 1340
    thread_0.log-Did you mean:(parenthesized_expression (binary_expression ("<") (subscript_expression (field_expression (identifier)(field_identifier))(number_literal))(number_literal))) with editing cost:1 and occurrences: 1088
    thread_0.log-Did you mean:(parenthesized_expression (binary_expression ("==") (subscript_expression (field_expression (identifier)(field_identifier))(number_literal))(number_literal))) with editing cost:2 and occurrences: 6708
    thread_0.log-
    thread_0.log-Level:TWO Expression:(parenthesized_expression (binary_expression ("+") (non_terminal_expression) (number_literal))) found in training dataset: Source file: /home/milovidov/work/ClickHouse_clean/base/glibc-compatibility/musl/atomic.h:212:4:(u.r[0]+1)
    thread_0.log:Expression is Potential anomaly
    thread_0.log-Did you mean:(parenthesized_expression (binary_expression ("+") (non_terminal_expression) (number_literal))) with editing cost:0 and occurrences: 24
    thread_0.log-Did you mean:(parenthesized_expression (binary_expression ("<") (non_terminal_expression) (number_literal))) with editing cost:1 and occurrences: 466215
    thread_0.log-Did you mean:(parenthesized_expression (binary_expression (">") (non_terminal_expression) (number_literal))) with editing cost:1 and occurrences: 229340
    thread_0.log-Did you mean:(parenthesized_expression (binary_expression ("&") (non_terminal_expression) (number_literal))) with editing cost:1 and occurrences: 60040
    thread_0.log-Did you mean:(parenthesized_expression (binary_expression ("%") (non_terminal_expression) (number_literal))) with editing cost:1 and occurrences: 2847
    thread_0.log-
    thread_0.log-Level:ONE Expression:(parenthesized_expression (binary_expression ("+") (subscript_expression (field_expression (identifier)(field_identifier))(number_literal))(number_literal))) found in training dataset: Source file: /home/milovidov/work/ClickHouse_clean/base/glibc-compatibility/musl/atomic.h:213:4:(u.r[1]+1)
    thread_0.log:Expression is Potential anomaly
    thread_0.log-Did you mean:(parenthesized_expression (binary_expression ("+") (subscript_expression (field_expression (identifier)(field_identifier))(number_literal))(number_literal))) with editing cost:0 and occurrences: 2
    thread_0.log-Did you mean:(parenthesized_expression (binary_expression ("&") (subscript_expression (field_expression (identifier)(field_identifier))(number_literal))(number_literal))) with editing cost:1 and occurrences: 2559
    thread_0.log-Did you mean:(parenthesized_expression (binary_expression (">") (subscript_expression (field_expression (identifier)(field_identifier))(number_literal))(number_literal))) with editing cost:1 and occurrences: 1340
    thread_0.log-Did you mean:(parenthesized_expression (binary_expression ("<") (subscript_expression (field_expression (identifier)(field_identifier))(number_literal))(number_literal))) with editing cost:1 and occurrences: 1088
    thread_0.log-Did you mean:(parenthesized_expression (binary_expression ("==") (subscript_expression (field_expression (identifier)(field_identifier))(number_literal))(number_literal))) with editing cost:2 and occurrences: 6708
    thread_0.log-
    thread_0.log-Level:TWO Expression:(parenthesized_expression (binary_expression ("+") (non_terminal_expression) (number_literal))) found in training dataset: Source file: /home/milovidov/work/ClickHouse_clean/base/glibc-compatibility/musl/atomic.h:213:4:(u.r[1]+1)
    thread_0.log:Expression is Potential anomaly
    thread_0.log-Did you mean:(parenthesized_expression (binary_expression ("+") (non_terminal_expression) (number_literal))) with editing cost:0 and occurrences: 24
    thread_0.log-Did you mean:(parenthesized_expression (binary_expression ("<") (non_terminal_expression) (number_literal))) with editing cost:1 and occurrences: 466215
    thread_0.log-Did you mean:(parenthesized_expression (binary_expression (">") (non_terminal_expression) (number_literal))) with editing cost:1 and occurrences: 229340
    thread_0.log-Did you mean:(parenthesized_expression (binary_expression ("&") (non_terminal_expression) (number_literal))) with editing cost:1 and occurrences: 60040
    thread_0.log-Did you mean:(parenthesized_expression (binary_expression ("%") (non_terminal_expression) (number_literal))) with editing cost:1 and occurrences: 2847
    thread_0.log-
    thread_0.log-Level:ONE Expression:(parenthesized_expression (subscript_expression (field_expression (identifier)(field_identifier))(number_literal))) found in training dataset: Source file: /home/milovidov/work/ClickHouse_clean/base/glibc-compatibility/musl/atomic.h:222:4:(u.r[0])
    thread_0.log-Expression is Okay
    thread_0.log-Level:TWO Expression:(parenthesized_expression (subscript_expression (non_terminal_expression) (number_literal))) found in training dataset: Source file: /home/milovidov/work/ClickHouse_clean/base/glibc-compatibility/musl/atomic.h:222:4:(u.r[0])
    thread_0.log-Expression is Okay
    --
    thread_1.log-Level:ONE Expression:(parenthesized_expression (field_expression (call_expression (field_expression (identifier)(field_identifier))(argument_list))(field_identifier))) found in training dataset: Source file: /home/milovidov/work/ClickHouse_clean/src/Storages/MergeTree/ReplicatedMergeTreeSink.h:49:11:(context->getSettingsRef().deduplicate_blocks_in_dependent_materialized_views)
    thread_1.log:Expression is Potential anomaly
    thread_1.log-Did you mean:(parenthesized_expression (field_expression (call_expression (field_expression (identifier)(field_identifier))(argument_list))(field_identifier))) with editing cost:0 and occurrences: 3
    thread_1.log-Did you mean:(parenthesized_expression (field_expression (field_expression (field_expression (identifier)(field_identifier))(field_identifier))(field_identifier))) with editing cost:2 and occurrences: 41060
    thread_1.log-Did you mean:(parenthesized_expression (field_expression (subscript_expression (field_expression (identifier)(field_identifier))(identifier))(field_identifier))) with editing cost:2 and occurrences: 20258
    thread_1.log-Did you mean:(parenthesized_expression (field_expression (subscript_expression (field_expression (identifier)(field_identifier))(number_literal))(field_identifier))) with editing cost:2 and occurrences: 3340
    thread_1.log-
    thread_1.log-Level:TWO Expression:(parenthesized_expression (field_expression argument: (identifier) field: (field_identifier))) found in training dataset: Source file: /home/milovidov/work/ClickHouse_clean/src/Storages/MergeTree/ReplicatedMergeTreeSink.h:49:11:(context->getSettingsRef().deduplicate_blocks_in_dependent_materialized_views)
    thread_1.log-Expression is Okay
    thread_1.log-[TID=139711595431680] Scanning File: /home/milovidov/work/ClickHouse_clean/src/Storages/IStorage_fwd.h
    thread_1.log-[TID=139711595431680] Scanning File: /home/milovidov/work/ClickHouse_clean/src/Storages/SelectQueryDescription.h
    thread_1.log-[TID=139711595431680] Scanning File: /home/milovidov/work/ClickHouse_clean/src/Storages/RabbitMQ/StorageRabbitMQ.h
    --
    thread_2.log-Level:ONE Expression:(parenthesized_expression (binary_expression ("||") (call_expression (identifier)(argument_list (char_literal)(identifier)))(call_expression (identifier)(argument_list (char_literal)(identifier))))) found in training dataset: Source file: /home/milovidov/work/ClickHouse_clean/src/IO/readFloatText.h:384:7:(checkChar('e', in) || checkChar('E', in))
    thread_2.log:Expression is Potential anomaly
    thread_2.log-Did you mean:(parenthesized_expression (binary_expression ("||") (call_expression (identifier)(argument_list (char_literal)(identifier)))(call_expression (identifier)(argument_list (char_literal)(identifier))))) with editing cost:0 and occurrences: 2
    thread_2.log-Did you mean:(parenthesized_expression (binary_expression ("||") (call_expression (identifier)(argument_list (string_literal)(identifier)))(call_expression (identifier)(argument_list (string_literal)(identifier))))) with editing cost:2 and occurrences: 152
    thread_2.log-
    thread_2.log-Level:TWO Expression:(parenthesized_expression (binary_expression ("||") (non_terminal_expression) (non_terminal_expression))) found in training dataset: Source file: /home/milovidov/work/ClickHouse_clean/src/IO/readFloatText.h:384:7:(checkChar('e', in) || checkChar('E', in))
    thread_2.log-Expression is Okay
    thread_2.log-Level:ONE Expression:(parenthesized_expression (call_expression (field_expression (identifier)(field_identifier))(argument_list))) found in training dataset: Source file: /home/milovidov/work/ClickHouse_clean/src/IO/readFloatText.h:386:11:(in.eof())
    thread_2.log-Expression is Okay
    thread_2.log-Level:TWO Expression:(parenthesized_expression (call_expression)) found in training dataset: Source file: /home/milovidov/work/ClickHouse_clean/src/IO/readFloatText.h:386:11:(in.eof())
    thread_2.log-Expression is Okay
    thread_2.log-Level:ONE Expression:(parenthesized_expression (binary_expression ("==") (pointer_expression (call_expression (field_expression (identifier)(field_identifier))(argument_list)))(char_literal))) not found in training dataset: Source file: /home/milovidov/work/ClickHouse_clean/src/IO/readFloatText.h:395:11:(*in.position() == '-')
    --
    thread_2.log-Level:ONE Expression:(parenthesized_expression (binary_expression ("==") (field_expression (identifier)(field_identifier))(string_literal))) found in training dataset: Source file: /home/milovidov/work/ClickHouse_clean/src/Storages/MergeTree/RPNBuilder.h:96:11:(func->name == "not")
    thread_2.log:Expression is Potential anomaly
    thread_2.log-Did you mean:(parenthesized_expression (binary_expression ("==") (field_expression (identifier)(field_identifier))(string_literal))) with editing cost:0 and occurrences: 5
    thread_2.log-Did you mean:(parenthesized_expression (binary_expression ("==") (field_expression (identifier)(field_identifier))(null))) with editing cost:1 and occurrences: 148401
    thread_2.log-Did you mean:(parenthesized_expression (binary_expression ("==") (field_expression (identifier)(field_identifier))(true))) with editing cost:1 and occurrences: 4627
    thread_2.log-Did you mean:(parenthesized_expression (binary_expression ("==") (field_expression (identifier)(field_identifier))(char_literal))) with editing cost:1 and occurrences: 4157
    thread_2.log-Did you mean:(parenthesized_expression (binary_expression ("==") (field_expression (identifier)(field_identifier))(identifier))) with editing cost:2 and occurrences: 576742
    thread_2.log-
    thread_2.log-Level:TWO Expression:(parenthesized_expression (binary_expression ("==") (non_terminal_expression) (string_literal))) found in training dataset: Source file: /home/milovidov/work/ClickHouse_clean/src/Storages/MergeTree/RPNBuilder.h:96:11:(func->name == "not")
    thread_2.log-Expression is Okay
    thread_2.log-Level:ONE Expression:(parenthesized_expression (binary_expression ("!=") (call_expression (field_expression (identifier)(field_identifier))(argument_list))(number_literal))) found in training dataset: Source file: /home/milovidov/work/ClickHouse_clean/src/Storages/MergeTree/RPNBuilder.h:98:15:(args.size() != 1)
    thread_2.log-Expression is Okay
    --
    thread_2.log-Level:ONE Expression:(parenthesized_expression (binary_expression ("==") (field_expression (identifier)(field_identifier))(string_literal))) found in training dataset: Source file: /home/milovidov/work/ClickHouse_clean/src/Storages/MergeTree/RPNBuilder.h:107:20:(func->name == "or")
    thread_2.log:Expression is Potential anomaly
    thread_2.log-Did you mean:(parenthesized_expression (binary_expression ("==") (field_expression (identifier)(field_identifier))(string_literal))) with editing cost:0 and occurrences: 5
    thread_2.log-Did you mean:(parenthesized_expression (binary_expression ("==") (field_expression (identifier)(field_identifier))(null))) with editing cost:1 and occurrences: 148401
    thread_2.log-Did you mean:(parenthesized_expression (binary_expression ("==") (field_expression (identifier)(field_identifier))(true))) with editing cost:1 and occurrences: 4627
    thread_2.log-Did you mean:(parenthesized_expression (binary_expression ("==") (field_expression (identifier)(field_identifier))(char_literal))) with editing cost:1 and occurrences: 4157
    thread_2.log-Did you mean:(parenthesized_expression (binary_expression ("==") (field_expression (identifier)(field_identifier))(identifier))) with editing cost:2 and occurrences: 576742
    thread_2.log-
    thread_2.log-Level:TWO Expression:(parenthesized_expression (binary_expression ("==") (non_terminal_expression) (string_literal))) found in training dataset: Source file: /home/milovidov/work/ClickHouse_clean/src/Storages/MergeTree/RPNBuilder.h:107:20:(func->name == "or")
    thread_2.log-Expression is Okay
    thread_2.log-[TID=139711587038976] Scanning File: /home/milovidov/work/ClickHouse_clean/src/Storages/MergeTree/MergeTreePartInfo.h
    thread_2.log-[TID=139711587038976] Scanning File: /home/milovidov/work/ClickHouse_clean/src/Storages/MergeTree/MergeTreeDataPartUUID.h
    --
    thread_2.log-Level:ONE Expression:(parenthesized_expression (binary_expression ("==") (field_expression (identifier)(field_identifier))(string_literal))) found in training dataset: Source file: /home/milovidov/work/ClickHouse_clean/src/Interpreters/GlobalSubqueriesVisitor.h:208:19:(func.name == "globalIn")
    thread_2.log:Expression is Potential anomaly
    thread_2.log-Did you mean:(parenthesized_expression (binary_expression ("==") (field_expression (identifier)(field_identifier))(string_literal))) with editing cost:0 and occurrences: 5
    thread_2.log-Did you mean:(parenthesized_expression (binary_expression ("==") (field_expression (identifier)(field_identifier))(null))) with editing cost:1 and occurrences: 148401
    thread_2.log-Did you mean:(parenthesized_expression (binary_expression ("==") (field_expression (identifier)(field_identifier))(true))) with editing cost:1 and occurrences: 4627
    thread_2.log-Did you mean:(parenthesized_expression (binary_expression ("==") (field_expression (identifier)(field_identifier))(char_literal))) with editing cost:1 and occurrences: 4157
    thread_2.log-Did you mean:(parenthesized_expression (binary_expression ("==") (field_expression (identifier)(field_identifier))(identifier))) with editing cost:2 and occurrences: 576742
    thread_2.log-
    thread_2.log-Level:TWO Expression:(parenthesized_expression (binary_expression ("==") (non_terminal_expression) (string_literal))) found in training dataset: Source file: /home/milovidov/work/ClickHouse_clean/src/Interpreters/GlobalSubqueriesVisitor.h:208:19:(func.name == "globalIn")
    thread_2.log-Expression is Okay
    thread_2.log-Level:ONE Expression:(parenthesized_expression (binary_expression ("==") (field_expression (identifier)(field_identifier))(string_literal))) found in training dataset: Source file: /home/milovidov/work/ClickHouse_clean/src/Interpreters/GlobalSubqueriesVisitor.h:210:24:(func.name == "globalNotIn")
    thread_2.log:Expression is Potential anomaly
    thread_2.log-Did you mean:(parenthesized_expression (binary_expression ("==") (field_expression (identifier)(field_identifier))(string_literal))) with editing cost:0 and occurrences: 5
    thread_2.log-Did you mean:(parenthesized_expression (binary_expression ("==") (field_expression (identifier)(field_identifier))(null))) with editing cost:1 and occurrences: 148401
    thread_2.log-Did you mean:(parenthesized_expression (binary_expression ("==") (field_expression (identifier)(field_identifier))(true))) with editing cost:1 and occurrences: 4627
    thread_2.log-Did you mean:(parenthesized_expression (binary_expression ("==") (field_expression (identifier)(field_identifier))(char_literal))) with editing cost:1 and occurrences: 4157
    thread_2.log-Did you mean:(parenthesized_expression (binary_expression ("==") (field_expression (identifier)(field_identifier))(identifier))) with editing cost:2 and occurrences: 576742
    thread_2.log-
    thread_2.log-Level:TWO Expression:(parenthesized_expression (binary_expression ("==") (non_terminal_expression) (string_literal))) found in training dataset: Source file: /home/milovidov/work/ClickHouse_clean/src/Interpreters/GlobalSubqueriesVisitor.h:210:24:(func.name == "globalNotIn")
    thread_2.log-Expression is Okay
    thread_2.log-Level:ONE Expression:(parenthesized_expression (binary_expression ("==") (field_expression (identifier)(field_identifier))(string_literal))) found in training dataset: Source file: /home/milovidov/work/ClickHouse_clean/src/Interpreters/GlobalSubqueriesVisitor.h:212:24:(func.name == "globalNullIn")
    thread_2.log:Expression is Potential anomaly
    thread_2.log-Did you mean:(parenthesized_expression (binary_expression ("==") (field_expression (identifier)(field_identifier))(string_literal))) with editing cost:0 and occurrences: 5
    thread_2.log-Did you mean:(parenthesized_expression (binary_expression ("==") (field_expression (identifier)(field_identifier))(null))) with editing cost:1 and occurrences: 148401
    thread_2.log-Did you mean:(parenthesized_expression (binary_expression ("==") (field_expression (identifier)(field_identifier))(true))) with editing cost:1 and occurrences: 4627
    thread_2.log-Did you mean:(parenthesized_expression (binary_expression ("==") (field_expression (identifier)(field_identifier))(char_literal))) with editing cost:1 and occurrences: 4157
    thread_2.log-Did you mean:(parenthesized_expression (binary_expression ("==") (field_expression (identifier)(field_identifier))(identifier))) with editing cost:2 and occurrences: 576742
    thread_2.log-
    thread_2.log-Level:TWO Expression:(parenthesized_expression (binary_expression ("==") (non_terminal_expression) (string_literal))) found in training dataset: Source file: /home/milovidov/work/ClickHouse_clean/src/Interpreters/GlobalSubqueriesVisitor.h:212:24:(func.name == "globalNullIn")
    thread_2.log-Expression is Okay
    thread_2.log-Level:ONE Expression:(parenthesized_expression (binary_expression ("==") (field_expression (identifier)(field_identifier))(string_literal))) found in training dataset: Source file: /home/milovidov/work/ClickHouse_clean/src/Interpreters/GlobalSubqueriesVisitor.h:214:24:(func.name == "globalNotNullIn")
    thread_2.log:Expression is Potential anomaly
    thread_2.log-Did you mean:(parenthesized_expression (binary_expression ("==") (field_expression (identifier)(field_identifier))(string_literal))) with editing cost:0 and occurrences: 5
    thread_2.log-Did you mean:(parenthesized_expression (binary_expression ("==") (field_expression (identifier)(field_identifier))(null))) with editing cost:1 and occurrences: 148401
    thread_2.log-Did you mean:(parenthesized_expression (binary_expression ("==") (field_expression (identifier)(field_identifier))(true))) with editing cost:1 and occurrences: 4627
    thread_2.log-Did you mean:(parenthesized_expression (binary_expression ("==") (field_expression (identifier)(field_identifier))(char_literal))) with editing cost:1 and occurrences: 4157
    thread_2.log-Did you mean:(parenthesized_expression (binary_expression ("==") (field_expression (identifier)(field_identifier))(identifier))) with editing cost:2 and occurrences: 576742
    thread_2.log-
    thread_2.log-Level:TWO Expression:(parenthesized_expression (binary_expression ("==") (non_terminal_expression) (string_literal))) found in training dataset: Source file: /home/milovidov/work/ClickHouse_clean/src/Interpreters/GlobalSubqueriesVisitor.h:214:24:(func.name == "globalNotNullIn")
    thread_2.log-Expression is Okay
    thread_2.log-[TID=139711587038976] Scanning File: /home/milovidov/work/ClickHouse_clean/src/Interpreters/interpretSubquery.h
    thread_2.log-[TID=139711587038976] Scanning File: /home/milovidov/work/ClickHouse_clean/src/Interpreters/RedundantFunctionsInOrderByVisitor.h
    --
    thread_3.log-Level:ONE Expression:(parenthesized_expression (binary_expression ("||") (binary_expression ("!=") (identifier)(number_literal))(binary_expression ("==") (identifier)(unary_expression ("-") (identifier))))) found in training dataset: Source file: /home/milovidov/work/ClickHouse_clean/base/glibc-compatibility/musl/powl.c:232:5:(x != 0.0 || y == -INFINITY)
    thread_3.log:Expression is Potential anomaly
    thread_3.log-Did you mean:(parenthesized_expression (binary_expression ("||") (binary_expression ("!=") (identifier)(number_literal))(binary_expression ("==") (identifier)(unary_expression ("-") (identifier))))) with editing cost:0 and occurrences: 2
    thread_3.log-Did you mean:(parenthesized_expression (binary_expression ("||") (binary_expression ("==") (identifier)(number_literal))(binary_expression ("==") (identifier)(unary_expression ("-") (identifier))))) with editing cost:1 and occurrences: 280
    thread_3.log-Did you mean:(parenthesized_expression (binary_expression ("||") (binary_expression (">=") (identifier)(number_literal))(binary_expression ("==") (identifier)(unary_expression ("-") (identifier))))) with editing cost:1 and occurrences: 71
    thread_3.log-Did you mean:(parenthesized_expression (binary_expression ("||") (binary_expression (">") (identifier)(number_literal))(binary_expression ("==") (identifier)(unary_expression ("-") (identifier))))) with editing cost:2 and occurrences: 106
    thread_3.log-Did you mean:(parenthesized_expression (binary_expression ("||") (binary_expression ("==") (identifier)(identifier))(binary_expression ("==") (identifier)(unary_expression ("-") (identifier))))) with editing cost:2 and occurrences: 100
    thread_3.log-
    thread_3.log-Level:TWO Expression:(parenthesized_expression (binary_expression ("||") (non_terminal_expression) (non_terminal_expression))) found in training dataset: Source file: /home/milovidov/work/ClickHouse_clean/base/glibc-compatibility/musl/powl.c:232:5:(x != 0.0 || y == -INFINITY)
    thread_3.log-Expression is Okay
    thread_3.log-Level:ONE Expression:(parenthesized_expression (binary_expression (">=") (identifier)(identifier))) found in training dataset: Source file: /home/milovidov/work/ClickHouse_clean/base/glibc-compatibility/musl/powl.c:235:4:(x >= LDBL_MAX)
    thread_3.log-Expression is Okay
    --
    thread_3.log-Level:ONE Expression:(parenthesized_expression (call_expression (field_expression (subscript_expression (field_expression (identifier)(field_identifier))(identifier))(field_identifier))(argument_list))) found in training dataset: Source file: /home/milovidov/work/ClickHouse_clean/src/AggregateFunctions/AggregateFunctionSumMap.h:348:19:(elem.second[col].isNull())
    thread_3.log:Expression is Potential anomaly
    thread_3.log-Did you mean:(parenthesized_expression (call_expression (field_expression (subscript_expression (field_expression (identifier)(field_identifier))(identifier))(field_identifier))(argument_list))) with editing cost:0 and occurrences: 2
    thread_3.log-Did you mean:(parenthesized_expression (field_expression (field_expression (subscript_expression (field_expression (identifier)(field_identifier))(identifier))(field_identifier))(field_identifier))) with editing cost:2 and occurrences: 1494
    thread_3.log-Did you mean:(parenthesized_expression (call_expression (field_expression (field_expression (field_expression (identifier)(field_identifier))(field_identifier))(field_identifier))(argument_list))) with editing cost:2 and occurrences: 257
    thread_3.log-Did you mean:(parenthesized_expression (subscript_expression (field_expression (subscript_expression (field_expression (identifier)(field_identifier))(identifier))(field_identifier))(identifier))) with editing cost:2 and occurrences: 229
    thread_3.log-Did you mean:(parenthesized_expression (subscript_expression (field_expression (subscript_expression (field_expression (identifier)(field_identifier))(identifier))(field_identifier))(number_literal))) with editing cost:2 and occurrences: 146
    thread_3.log-
    thread_3.log-Level:TWO Expression:(parenthesized_expression (call_expression)) found in training dataset: Source file: /home/milovidov/work/ClickHouse_clean/src/AggregateFunctions/AggregateFunctionSumMap.h:348:19:(elem.second[col].isNull())
    thread_3.log-Expression is Okay
    thread_3.log-Level:ONE Expression:(parenthesized_expression (binary_expression (">") (binary_expression ("<") (unary_expression ("!") (field_expression (call_expression (field_expression (identifier)(field_identifier))(argument_list))(field_identifier)))(identifier))(parenthesized_expression (identifier)))) not found in training dataset: Source file: /home/milovidov/work/ClickHouse_clean/src/AggregateFunctions/AggregateFunctionSumMap.h:415:11:(!params_.front().tryGet<Array>(keys_to_keep_))
    thread_3.log-Expression is Okay
    --
    thread_3.log-Level:TWO Expression:(parenthesized_expression (binary_expression ("*") (identifier) (non_terminal_expression))) found in training dataset: Source file: /home/milovidov/work/ClickHouse_clean/src/Columns/ColumnUnique.h:319:7:(auto * nullable = checkAndGetColumn<ColumnNullable>(src))
    thread_3.log:Expression is Potential anomaly
    thread_3.log-Did you mean:(parenthesized_expression (binary_expression ("*") (identifier) (non_terminal_expression))) with editing cost:0 and occurrences: 193
    thread_3.log-Did you mean:(parenthesized_expression (binary_expression (">") (identifier) (non_terminal_expression))) with editing cost:1 and occurrences: 130297
    thread_3.log-Did you mean:(parenthesized_expression (binary_expression ("<") (identifier) (non_terminal_expression))) with editing cost:1 and occurrences: 130156
    thread_3.log-Did you mean:(parenthesized_expression (binary_expression ("&") (identifier) (non_terminal_expression))) with editing cost:1 and occurrences: 96043
    thread_3.log-Did you mean:(parenthesized_expression (binary_expression ("=") (identifier) (non_terminal_expression))) with editing cost:1 and occurrences: 12586
    thread_3.log-
    thread_3.log-Level:ONE Expression:(parenthesized_expression (identifier)) found in training dataset: Source file: /home/milovidov/work/ClickHouse_clean/src/Columns/ColumnUnique.h:342:7:(is_nullable)
    thread_3.log-Expression is Okay
    thread_3.log-Level:TWO Expression:(parenthesized_expression (identifier)) found in training dataset: Source file: /home/milovidov/work/ClickHouse_clean/src/Columns/ColumnUnique.h:342:7:(is_nullable)
    thread_3.log-Expression is Okay
    --
    thread_3.log-Level:ONE Expression:(parenthesized_expression (binary_expression ("!=") (parenthesized_expression (binary_expression (">") (identifier)(number_literal)))(identifier))) found in training dataset: Source file: /home/milovidov/work/ClickHouse_clean/src/Columns/ColumnUnique.h:436:19:((nan_direction_hint > 0) != reverse)
    thread_3.log:Expression is Potential anomaly
    thread_3.log-Did you mean:(parenthesized_expression (binary_expression ("!=") (parenthesized_expression (binary_expression (">") (identifier)(number_literal)))(identifier))) with editing cost:0 and occurrences: 1
    thread_3.log-Did you mean:(parenthesized_expression (binary_expression ("!=") (parenthesized_expression (binary_expression ("&") (identifier)(number_literal)))(identifier))) with editing cost:1 and occurrences: 962
    thread_3.log-Did you mean:(parenthesized_expression (binary_expression ("!=") (parenthesized_expression (binary_expression (">>") (identifier)(number_literal)))(identifier))) with editing cost:1 and occurrences: 375
    thread_3.log-Did you mean:(parenthesized_expression (binary_expression ("!=") (parenthesized_expression (binary_expression ("+") (identifier)(number_literal)))(identifier))) with editing cost:1 and occurrences: 198
    thread_3.log-Did you mean:(parenthesized_expression (binary_expression ("!=") (parenthesized_expression (binary_expression ("^") (identifier)(number_literal)))(identifier))) with editing cost:1 and occurrences: 60
    thread_3.log-
    thread_3.log-Level:TWO Expression:(parenthesized_expression (binary_expression ("!=") (non_terminal_expression) (identifier))) found in training dataset: Source file: /home/milovidov/work/ClickHouse_clean/src/Columns/ColumnUnique.h:436:19:((nan_direction_hint > 0) != reverse)
    thread_3.log-Expression is Okay
    thread_3.log-Level:ONE Expression:(parenthesized_expression (binary_expression ("<=") (binary_expression ("-") (identifier)(identifier))(number_literal))) found in training dataset: Source file: /home/milovidov/work/ClickHouse_clean/src/Columns/ColumnUnique.h:446:19:(last - first <= 1)
    thread_3.log-Expression is Okay
    --
    thread_3.log-Level:TWO Expression:(parenthesized_expression (binary_expression ("*") (identifier) (non_terminal_expression))) found in training dataset: Source file: /home/milovidov/work/ClickHouse_clean/src/Columns/ColumnUnique.h:519:7:(auto * nullable_column = checkAndGetColumn<ColumnNullable>(src))
    thread_3.log:Expression is Potential anomaly
    thread_3.log-Did you mean:(parenthesized_expression (binary_expression ("*") (identifier) (non_terminal_expression))) with editing cost:0 and occurrences: 193
    thread_3.log-Did you mean:(parenthesized_expression (binary_expression (">") (identifier) (non_terminal_expression))) with editing cost:1 and occurrences: 130297
    thread_3.log-Did you mean:(parenthesized_expression (binary_expression ("<") (identifier) (non_terminal_expression))) with editing cost:1 and occurrences: 130156
    thread_3.log-Did you mean:(parenthesized_expression (binary_expression ("&") (identifier) (non_terminal_expression))) with editing cost:1 and occurrences: 96043
    thread_3.log-Did you mean:(parenthesized_expression (binary_expression ("=") (identifier) (non_terminal_expression))) with editing cost:1 and occurrences: 12586
    thread_3.log-
    thread_3.log-Level:ONE Expression:(parenthesized_expression (binary_expression ("==") (identifier)(identifier))) found in training dataset: Source file: /home/milovidov/work/ClickHouse_clean/src/Columns/ColumnUnique.h:527:7:(src_column == nullptr)
    thread_3.log-Expression is Okay
    thread_3.log-Level:TWO Expression:(parenthesized_expression (binary_expression ("==") (identifier) (identifier))) found in training dataset: Source file: /home/milovidov/work/ClickHouse_clean/src/Columns/ColumnUnique.h:527:7:(src_column == nullptr)
    thread_3.log-Expression is Okay
    --
    thread_3.log-Level:TWO Expression:(parenthesized_expression (binary_expression ("*") (identifier) (non_terminal_expression))) found in training dataset: Source file: /home/milovidov/work/ClickHouse_clean/src/DataTypes/DataTypesDecimal.h:64:7:(auto * decimal_type = checkDecimal<Decimal64>(data_type))
    thread_3.log:Expression is Potential anomaly
    thread_3.log-Did you mean:(parenthesized_expression (binary_expression ("*") (identifier) (non_terminal_expression))) with editing cost:0 and occurrences: 193
    thread_3.log-Did you mean:(parenthesized_expression (binary_expression (">") (identifier) (non_terminal_expression))) with editing cost:1 and occurrences: 130297
    thread_3.log-Did you mean:(parenthesized_expression (binary_expression ("<") (identifier) (non_terminal_expression))) with editing cost:1 and occurrences: 130156
    thread_3.log-Did you mean:(parenthesized_expression (binary_expression ("&") (identifier) (non_terminal_expression))) with editing cost:1 and occurrences: 96043
    thread_3.log-Did you mean:(parenthesized_expression (binary_expression ("=") (identifier) (non_terminal_expression))) with editing cost:1 and occurrences: 12586
    thread_3.log-
    thread_3.log-Level:ONE Expression:(parenthesized_expression (binary_expression ("*") (identifier)(binary_expression ("=") (identifier)(binary_expression (">") (binary_expression ("<") (identifier)(identifier))(parenthesized_expression (identifier)))))) found in training dataset: Source file: /home/milovidov/work/ClickHouse_clean/src/DataTypes/DataTypesDecimal.h:66:7:(auto * decimal_type = checkDecimal<Decimal128>(data_type))
    thread_3.log-Expression is Okay
    thread_3.log-Level:TWO Expression:(parenthesized_expression (binary_expression ("*") (identifier) (non_terminal_expression))) found in training dataset: Source file: /home/milovidov/work/ClickHouse_clean/src/DataTypes/DataTypesDecimal.h:66:7:(auto * decimal_type = checkDecimal<Decimal128>(data_type))
    thread_3.log:Expression is Potential anomaly
    thread_3.log-Did you mean:(parenthesized_expression (binary_expression ("*") (identifier) (non_terminal_expression))) with editing cost:0 and occurrences: 193
    thread_3.log-Did you mean:(parenthesized_expression (binary_expression (">") (identifier) (non_terminal_expression))) with editing cost:1 and occurrences: 130297
    thread_3.log-Did you mean:(parenthesized_expression (binary_expression ("<") (identifier) (non_terminal_expression))) with editing cost:1 and occurrences: 130156
    thread_3.log-Did you mean:(parenthesized_expression (binary_expression ("&") (identifier) (non_terminal_expression))) with editing cost:1 and occurrences: 96043
    thread_3.log-Did you mean:(parenthesized_expression (binary_expression ("=") (identifier) (non_terminal_expression))) with editing cost:1 and occurrences: 12586
    thread_3.log-
    thread_3.log-Level:ONE Expression:(parenthesized_expression (binary_expression ("*") (identifier)(binary_expression ("=") (identifier)(binary_expression (">") (binary_expression ("<") (identifier)(identifier))(parenthesized_expression (identifier)))))) found in training dataset: Source file: /home/milovidov/work/ClickHouse_clean/src/DataTypes/DataTypesDecimal.h:68:7:(auto * decimal_type = checkDecimal<Decimal256>(data_type))
    thread_3.log-Expression is Okay
    thread_3.log-Level:TWO Expression:(parenthesized_expression (binary_expression ("*") (identifier) (non_terminal_expression))) found in training dataset: Source file: /home/milovidov/work/ClickHouse_clean/src/DataTypes/DataTypesDecimal.h:68:7:(auto * decimal_type = checkDecimal<Decimal256>(data_type))
    thread_3.log:Expression is Potential anomaly
    thread_3.log-Did you mean:(parenthesized_expression (binary_expression ("*") (identifier) (non_terminal_expression))) with editing cost:0 and occurrences: 193
    thread_3.log-Did you mean:(parenthesized_expression (binary_expression (">") (identifier) (non_terminal_expression))) with editing cost:1 and occurrences: 130297
    
    question 
    opened by alexey-milovidov 11
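Reading a log like the one above by eye is tedious, so a small post-processing script can help. The following is a minimal sketch, not part of ControlFlag, that assumes the line layout visible in the excerpt: a `Level:... Source file: path:line:col:(expr)` line followed by an `Expression is Potential anomaly` line. The names `Anomaly` and `find_anomalies` are hypothetical.

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class Anomaly(NamedTuple):
    level: str    # "ONE" or "TWO", as printed in the log
    path: str     # source file that was scanned
    line: int     # line number exactly as reported by the log
    column: int   # column number exactly as reported by the log
    source: str   # original expression text, e.g. "(x >= LDBL_MAX)"


# Assumed layout, taken from the log excerpt:
#   ...Level:ONE Expression:(...) found in training dataset: Source file: /p/f.h:436:19:(expr)
_EXPR_RE = re.compile(
    r"Level:(?P<level>\w+) Expression:.*?"
    r"Source file: (?P<path>.+?):(?P<line>\d+):(?P<col>\d+):(?P<src>.*)$"
)


def find_anomalies(lines: Iterable[str]) -> List[Anomaly]:
    """Return every expression whose verdict line says 'Potential anomaly'."""
    anomalies: List[Anomaly] = []
    last: Optional[Anomaly] = None  # most recent expression line seen
    for raw in lines:
        line = raw.rstrip("\n")
        match = _EXPR_RE.search(line)
        if match:
            last = Anomaly(
                match["level"],
                match["path"],
                int(match["line"]),
                int(match["col"]),
                match["src"].rstrip(),
            )
        elif "Expression is Potential anomaly" in line and last is not None:
            anomalies.append(last)
            last = None
        elif "Expression is Okay" in line:
            last = None
    return anomalies
```

Because the pattern is applied with `search`, prefixes such as `thread_3.log-` added by grep are ignored. Passing an open log file (for example `find_anomalies(open("thread_3.log"))`) should work the same way as passing a list of strings.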
  • [BUG] Is it training instead of evaluating? Is it working?

    [BUG] Is it training instead of evaluating? Is it working?

    Describe the bug: I launched ../scripts/scan_for_anomalies.sh as described in the README, but I see "Training: start". Per the README, I did not expect this command to have anything to do with training, only to scan the code I provided for anomalies.

    Exact command to reproduce: I used these commands:

    cd control-flag
    cmake .
    make -j  # successful build
    make test  # successful tests
    gdown https://drive.google.com/uc?id=1-jzs3zrKU541hwChaciXSk8zrnMN1mYc
    tar xpvf c_lang_if_stmts_6000_gitr*
    cd quick_start
    git clone .... src_ttsiod
    mkdir test1_scan_output/
    ../scripts/scan_for_anomalies.sh -d src_ttsiod/ -t ../c_lang_if_stmts_6000_gitrepos.ts -o test1_scan_output/
    

    Callstack (if it is a crash bug) or error info: No crash. The program cf_file_scanner is running, consuming a single core, but the only thing it has printed so far is "Training: start".

    Expected behavior: I was under the impression that this command should not lead to training. I have downloaded (and decompressed) the dataset mined from 6000 GitHub repositories, so I expected this to scan, not train.

    Environment (please complete the following information):

    • OS: Debian Buster
    • Compiler: GCC v8.3.0
    • 32-bit or 64-bit? 64bit
    • Build command: (build completed just fine via cmake).

    ControlFlag commit I am at commit 42e068dbe3dd45ef621913a7a573eae6626184d2

    Additional context: I work for the European Space Agency, reviewing mission codebases, and am interested to see what control-flag can add to our array of static analysers. To be clear, I am so far testing with hand-made simple bug reproductions, not mission code.

    question 
    opened by ttsiodras 5
  • Spelling

    Spelling

    This PR corrects misspellings identified by the check-spelling action.

    The misspellings have been reported at https://github.com/jsoref/control-flag/commit/b1907620029908ea31e0df30bc3b42a035291d15#commitcomment-67928069

    The action reports that the changes in this PR would make it happy: https://github.com/jsoref/control-flag/commit/6a55d47cb21eb8ff3e522b74593448b0a4a0db3a

    Note: this PR does not include the action. If you're interested in running a spell check on every PR and push, that can be offered separately.

    opened by jsoref 1
  • [FEATURE] Give Warning About Possible Antivirus Activity Upon Downloading Top 100 Repos

    [FEATURE] Give Warning About Possible Antivirus Activity Upon Downloading Top 100 Repos

    This project may want to warn users that downloading the top 100 repos may trigger the local antivirus, because some of them are malware/pentest-type repositories. For example, after downloading them, I had many quarantined items in my antivirus (screenshot not preserved).

    opened by Danc2050 1
  • Full support for C++ programs

    Full support for C++ programs

    This commit contains integration of TreeSitter's C++ parser into ControlFlag along with support to mine patterns from C++ programs and support to detect anomalous C++ patterns.

    It contains links to datasets containing conditional expressions that are found in GitHub repositories using C++ as their primary language.

    It also contains C++-specific unit tests and a quick start test.

    opened by nhasabni 0
  • Handle git clone of non-existent repos

    Handle git clone of non-existent repos

    This PR fixes issue #32.

    When given a non-existent repo, the git clone command asks for user credentials (not sure why). This creates problems for the automated download/clone of repositories for ControlFlag. Adding the "-c core.askPass=echo" option to git clone bypasses these prompts, and we instead get a verbose error like the one below:

    $ python3 download_repos.py -f failed.c100.txt -o training_repo_dir -m clone -p 1
    Number of repos: 100
     19%|███████████████ | 19/100 [01:08<02:23,  1.77s/it]
    remote: Support for password authentication was
    removed on August 13, 2021. Please use a personal access token instead.
    remote: Please see
    https://github.blog/2020-12-15-token-authentication-requirements-for-git-operations/
    for more information.
    fatal: Authentication failed for 'https://github.com/craSH/socat/'
    
    bug 
    opened by nhasabni 0
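The `core.askPass=echo` workaround described in the PR above can be sketched as follows. This is an illustration only, not the code from download_repos.py: the helper names `build_clone_command` and `clone_repo` are hypothetical, and passing the option as a top-level `git -c` override is an assumption, since the PR text does not show its exact placement.

```python
import subprocess
from typing import List


def build_clone_command(repo_url: str, dest_dir: str) -> List[str]:
    """Build a git clone command that never waits for typed credentials.

    With core.askPass set to "echo", git obtains an empty answer from the
    helper instead of prompting, so a missing repo fails fast.
    """
    return ["git", "-c", "core.askPass=echo", "clone", "--quiet", repo_url, dest_dir]


def clone_repo(repo_url: str, dest_dir: str) -> bool:
    """Clone one repository; return False instead of hanging on failure."""
    result = subprocess.run(
        build_clone_command(repo_url, dest_dir),
        stdin=subprocess.DEVNULL,  # make sure nothing can block on stdin
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # git's own message (e.g. "fatal: Authentication failed ...") is kept
        print(f"clone failed for {repo_url}: {result.stderr.strip()}")
    return result.returncode == 0
```

Returning a boolean rather than raising lets a bulk download loop record the failed URL and continue with the next repository.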
  • Update README.md

    Update README.md

    Minor tweak to the language explaining that the new, smaller models may reduce accuracy and increase the number of false positives CF generates.

    opened by jgottschlich 0
  • Docker image with control-flag already built

    Docker image with control-flag already built

    I suggest providing a Docker image with control-flag already built.

    I pushed an Ubuntu image with g++, cmake, and control-flag already built to Docker Hub:

    https://hub.docker.com/r/erre577/control-flag

    opened by VivianoRiccardo 0
  • [FEATURE]Can I support Golang?

    [FEATURE]Can I support Golang?

    Is your feature request related to a problem? Please describe.

    It does not support Golang

    Describe the feature you'd like

    If it supports Golang, that would be great

    Additional context

    I'm not a $0 shopper. I want to know what I should do.

    I'm a little Golang programmer. I don't know C++, but I know a little PHP.

    It seems that PHP has fallen behind now, so I turned to Golang.

    Golang is a language without historical baggage and with standard syntax, compared with Python.

    If this tool can be used on Golang, all Golang developers will be excited, just like golangci-lint.

    I think the author or programmer of Golang should provide some help for this project.

    We are weak.

    Since I don't know C++, I want to know what help I can provide and what I need to do, such as writing an API that returns a syntax tree?

    If adapting to other languages is an easy process, it would be better to have a language adaptation tutorial.

    opened by Deng-Xian-Sheng 0
  • Add Solidity to programming languages

    Add Solidity to programming languages

    Tested locally:

    $ scripts/scan_for_anomalies.sh -t train_sol -d example_repo  -o results_sol -l 5 
    $ ll train_sol 
    -rw-r--r--  1 admin  staff  1421383 Oct  6 09:38 train_sol
    

    train_sol:

    0,AST_expression_TWO:(condition_clause (binary_expression ("!=") (non_terminal_expression) (non_terminal_expression)))
    //(allowance[from][msg.sender] != uint(-1))
    0,AST_expression_ONE:(condition_clause (binary_expression ("!=") (subscript_expression (subscript_expression (identifier)(identifier))(field_expression (identifier)(field_identifier)))(call_expression (identifier)(argument_list (number_literal)))))
    0,AST_expression_TWO:(condition_clause (binary_expression ("!=") (non_terminal_expression) (non_terminal_expression)))
    //(blockTimestampLast != blockTimestamp)
    0,AST_expression_ONE:(condition_clause (binary_expression ("!=") (identifier)(identifier)))
    0,AST_expression_TWO:(condition_clause (binary_expression ("!=") (identifier) (identifier)))
    //(leftSide < rightSide)
    0,AST_expression_ONE:(condition_clause (binary_expression ("<") (identifier)(identifier)))
    0,AST_expression_TWO:(condition_clause (binary_expression ("<") (identifier) (identifier)))
    //(amountIn == 0)
    0,AST_expression_ONE:(condition_clause (binary_expression ("==") (identifier)(number_literal)))
    0,AST_expression_TWO:(condition_clause (binary_expression ("==") (identifier) (number_literal)))
    .....
    
    opened by issanyo 0
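The train_sol excerpt above suggests a simple line-oriented layout: a `//` line carrying the original source expression, followed by its `AST_expression_ONE` and `AST_expression_TWO` forms. The sketch below groups those lines into records. It is based only on that excerpt; the meaning of the leading integer field is not stated there, so it is ignored, and `parse_training_lines` is a hypothetical helper, not a ControlFlag API.

```python
import re
from typing import Dict, Iterable, List, Optional

# Assumed layout from the excerpt: "<integer>,AST_expression_<LEVEL>:<s-expression>"
_AST_RE = re.compile(r"^\d+,AST_expression_(?P<level>ONE|TWO):(?P<expr>.*)$")


def parse_training_lines(lines: Iterable[str]) -> List[Dict[str, Optional[str]]]:
    """Group each '//' source line with the AST lines that follow it."""
    records: List[Dict[str, Optional[str]]] = []
    current: Optional[Dict[str, Optional[str]]] = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("//"):
            # A new source expression starts a new record.
            current = {"source": line[2:], "ONE": None, "TWO": None}
            records.append(current)
            continue
        match = _AST_RE.match(line)
        if match and current is not None:
            current[match["level"]] = match["expr"]
        # AST lines before the first '//' line and filler such as "....."
        # are skipped, because they cannot be attached to a source line.
    return records
```

Keeping the source text next to both abstraction levels makes it easier to spot-check whether a new language grammar produces sensible patterns before training on it.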
  • [BUG] cf_file_scanner segfaults while scanning files

    [BUG] cf_file_scanner segfaults while scanning files

    Describe the bug: cf_file_scanner segfaults while scanning files.

    Exact command to reproduce: scripts/scan_for_anomalies.sh -d ~/projects/riscv-isa-sim -t c_lang_if_stmts_6000_gitrepos_medium.ts -o spike -l 1

    Unfortunately I could not reproduce it again with exactly the same command and sources.

    Callstack (if it is a crash bug) or error info

    (gdb) bt
    #0  0x000055ae0728d086 in Trie::CalculateEditDistance(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const ()
    #1  0x000055ae0728caa0 in Trie::SearchNearestExpressionsUsingTrieTraversal(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, unsigned long) const::{lambda()#1}::operator()() const ()
    #2  0x000055ae0728d620 in void std::__invoke_impl<void, Trie::SearchNearestExpressionsUsingTrieTraversal(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, unsigned long) const::{lambda()#1}>(std::__invoke_other, Trie::SearchNearestExpressionsUsingTrieTraversal(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, unsigned long) const::{lambda()#1}&&) ()
    #3  0x000055ae0728d5d5 in std::__invoke_result<Trie::SearchNearestExpressionsUsingTrieTraversal(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, unsigned long) const::{lambda()#1}>::type std::__invoke<Trie::SearchNearestExpressionsUsingTrieTraversal(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, unsigned long) const::{lambda()#1}>(Trie::SearchNearestExpressionsUsingTrieTraversal(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, unsigned long) const::{lambda()#1}&&) ()
    #4  0x000055ae0728d582 in void std::thread::_Invoker<std::tuple<Trie::SearchNearestExpressionsUsingTrieTraversal(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, unsigned long) const::{lambda()#1}> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) ()
    #5  0x000055ae0728d556 in std::thread::_Invoker<std::tuple<Trie::SearchNearestExpressionsUsingTrieTraversal(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, unsigned long) const::{lambda()#1}> >::operator()() ()
    #6  0x000055ae0728d53a in std::thread::_State_impl<std::thread::_Invoker<std::tuple<Trie::SearchNearestExpressionsUsingTrieTraversal(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long, unsigned long) const::{lambda()#1}> > >::_M_run() ()
    #7  0x00007ff04e4d62f3 in std::execute_native_thread_routine (__p=0x7ff04807b6c0) at /usr/src/debug/gcc/libstdc++-v3/src/c++11/thread.cc:82
    #8  0x00007ff04e1bc78d in start_thread (arg=<optimized out>) at pthread_create.c:442
    #9  0x00007ff04e23d8e4 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100
    

    Environment (please complete the following information):

    • OS: Arch Linux x86_64
    • Compiler: GCC 12.1.1
    • 32-bit or 64-bit? 64-bit
    • Build command: cmake . && make -j

    ControlFlag commit 91c02b20926d970b8049838f5e8bb1df4815f7be

    opened by felixonmars 0
  • [BUG] Line numbers of potential anomaly are incorrect

    [BUG] Line numbers of potential anomaly are incorrect

    The line numbers printed in the log file for potential anomalies are incorrect.

    For example, this output from the log:

    /var/jenkins_slave/workspace/Preboot/control-flag/thread_1.log-[TID=140515822532352] Scanning File: /var/jenkins_slave/workspace/Preboot/control-flag/sandbox/IntelUndiPkg/IceUndiFullDxe/Ice.c
    /var/jenkins_slave/workspace/Preboot/control-flag/thread_1.log-Level:ONE Expression:(parenthesized_expression (binary_expression ("=") (identifier)(identifier))) found in training dataset: Source file: /var/jenkins_slave/workspace/Preboot/control-flag/sandbox/IntelUndiPkg/IceUndiFullDxe/Ice.c:84:5:(Status = EFI_NOT_FOUND)
    /var/jenkins_slave/workspace/Preboot/control-flag/thread_1.log:Expression is Potential anomaly
    

    The log reports line number 84. In fact, it should be 85.

    See: https://github.com/intel-sandbox/epg.nd_preboot.uefi_undi_cpk/pull/47/files

    opened by jilgiewi 0
  • [BUG] Anomaly report: first "Did you mean" is same as found expression

    [BUG] Anomaly report: first "Did you mean" is same as found expression

    I compiled control-flag from commit e7e0e448f40b631636e50b18b94c15d6f35e1283 and ran scripts/scan_for_anomalies.sh on some PHP code with your dataset. This is the output:

    Level:TWO Expression:(parenthesized_expression (binary_expression left: (unary_op_expression (variable_name (name))) right: (unary_op_expression (scoped_call_expression scope: (relative_scope) name: (name) arguments: (arguments (argument (variable_name (name)))))))) not found in training dataset: Source file: <file>:<line>:<col>:(!$id || !self::is($id))
    Expression is Potential anomaly
    Did you mean:(parenthesized_expression (binary_expression left: (unary_op_expression (variable_name (name))) right: (unary_op_expression (scoped_call_expression scope: (relative_scope) name: (name) arguments: (arguments (argument (variable_name (name)))))))) with editing cost:0 and occurrences: 0
    Did you mean:(parenthesized_expression (binary_expression left: (unary_op_expression (variable_name (name))) right: (unary_op_expression (scoped_call_expression scope: (name) name: (name) arguments: (arguments (argument (variable_name (name)))))))) with editing cost:2 and occurrences: 7
    

    The Level:TWO Expression and the first Did you mean are exactly the same.
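    Until the report itself changes, one reader-side workaround is to drop suggestions whose editing cost is 0, since those repeat the scanned expression. The sketch below assumes the `Did you mean:... with editing cost:N and occurrences: M` layout shown above; `Suggestion`, `parse_suggestion`, and `useful_suggestions` are hypothetical names, not ControlFlag functions.

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class Suggestion(NamedTuple):
    expression: str   # suggested pattern as printed in the log
    cost: int         # "editing cost" as printed in the log
    occurrences: int  # how often the pattern appears in the training data


_SUGGESTION_RE = re.compile(
    r"Did you mean:(?P<expr>.*) with editing cost:(?P<cost>\d+)"
    r" and occurrences: (?P<occ>\d+)\s*$"
)


def parse_suggestion(line: str) -> Optional[Suggestion]:
    """Parse one 'Did you mean' line; return None for any other line."""
    match = _SUGGESTION_RE.search(line)
    if match is None:
        return None
    return Suggestion(match["expr"], int(match["cost"]), int(match["occ"]))


def useful_suggestions(lines: Iterable[str]) -> List[Suggestion]:
    """Keep only suggestions that differ from the scanned expression."""
    parsed = (parse_suggestion(line) for line in lines)
    return [s for s in parsed if s is not None and s.cost > 0]
```

    Note that the cost-0 entry also carries the occurrence count of the scanned expression itself, so a real report may want to print that count separately rather than discard it.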

    opened by rx80 0
  • [FEATURE] Create `requirements.txt` File

    [FEATURE] Create `requirements.txt` File

    Is your feature request related to a problem? Please describe: There is no requirements.txt file. If one existed, a user working on the command line who lacks the required libraries (i.e., those not part of the stdlib) would not have to install them one by one, but could run: pip3 install -r requirements.txt

    Describe the feature you'd like: Add a requirements.txt file. For example, the file could look like:

    wget
    tqdm
    ...
    
    opened by Danc2050 0
Releases(v1.2)
  • v1.2(May 10, 2022)

    This release contains full support for learning typical patterns (training) and detecting anomalous patterns (inference) within if statements of C, C++, and PHP programs.

    It provides support for:

    • Downloading GitHub repositories for C, C++, and PHP languages, mining conditional expressions, and training ControlFlag using them
    • Datasets containing pre-mined conditional expressions from GitHub repositories
    • Support for detecting anomalous conditional expressions in a target repository

    Source code(tar.gz)
    Source code(zip)
  • v1.1(Apr 12, 2022)

    This release contains full support for learning typical patterns (training) and detecting anomalous patterns (inference) within if statements of C and PHP programs.

    It provides support for:

    • Downloading GitHub repositories for C and PHP languages, mining conditional expressions, and training ControlFlag using them
    • Datasets containing pre-mined conditional expressions from GitHub repositories
    • Support for detecting anomalous conditional expressions in a target repository

    Additionally, this release fixes an error in handling inputs (#42, #45) and spelling errors (#43).

    Special thanks to @kotauchisunsun (for PHP support) and @jsoref (for error handling) for their contributions to this release!

    Source code(tar.gz)
    Source code(zip)
  • v1.0(Nov 18, 2021)

    This release contains full support for learning typical patterns (training) and detecting anomalous patterns (inference) within if statements of C programs.

    It provides support for:

    • Downloading GitHub repositories for C language, mining conditional expressions, and training ControlFlag using them
    • Datasets containing pre-mined conditional expressions from GitHub repositories
    • Support for detecting anomalous conditional expressions in a target repository
    Source code(tar.gz)
    Source code(zip)
Owner
Intel Labs
Vowpal Wabbit is a machine learning system which pushes the frontier of machine learning with techniques such as online, hashing, allreduce, reductions, learning2search, active, and interactive learning.

This is the Vowpal Wabbit fast online learning code. Why Vowpal Wabbit? Vowpal Wabbit is a machine learning system which pushes the frontier of machin

Vowpal Wabbit 8.1k Dec 30, 2022
Nvvl - A library that uses hardware acceleration to load sequences of video frames to facilitate machine learning training

NVVL is part of DALI! DALI (Nvidia Data Loading Library) incorporates NVVL functionality and offers much more than that, so it is recommended to switc

NVIDIA Corporation 660 Dec 19, 2022
This is the code of our paper An Efficient Training Approach for Very Large Scale Face Recognition or F²C for simplicity.

Fast Face Classification (F²C) This is the code of our paper An Efficient Training Approach for Very Large Scale Face Recognition or F²C for simplicit

null 33 Jun 27, 2021
OpenEmbedding is an open source framework for Tensorflow distributed training acceleration.

OpenEmbedding English version | 中文版 About OpenEmbedding is an open-source framework for TensorFlow distributed training acceleration. Nowadays, many m

4Paradigm 19 Jul 25, 2022
Training and fine-tuning YOLOv4 Tiny on custom object detection dataset for Taiwanese traffic

Object Detection on Taiwanese Traffic using YOLOv4 Tiny Exploration of YOLOv4 Tiny on custom Taiwanese traffic dataset Trained and tested AlexeyAB's D

Andrew Chen 5 Dec 14, 2022
ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator

ONNX Runtime is a cross-platform inference and training machine-learning accelerator compatible with deep learning frameworks, PyTorch and TensorFlow/Keras, as well as classical machine learning libraries such as scikit-learn, and more.

Microsoft 8k Jan 2, 2023
ResNet Implementation, Training, and Inference Using LibTorch C++ API

LibTorch C++ ResNet CIFAR Example Introduction ResNet implementation, training, and inference using LibTorch C++ API. Because there is no native imple

Lei Mao 23 Oct 29, 2022
Training and Evaluating Facial Classification Keras Models using the Tensorflow C API Implemented into a C++ Codebase.

CFace Training and Evaluating Facial Classification Keras Models using the Tensorflow C API Implemented into a C++ Codebase. Dependencies Tensorflow 2

null 7 Oct 18, 2022
Dorylus: Affordable, Scalable, and Accurate GNN Training

Dorylus: Affordable, Scalable, and Accurate GNN Training with Distributed CPU Servers and Serverless Threads This is Dorylus, a Scalable, Resource-eff

UCLASystem 59 Dec 26, 2022
Reactive Light Training Module used in fitness for developing agility and reaction speed.

Hello to you, thanks for taking an interest in this project. The use case of this project is to help people who want to improve their agility and reactio

null 6 Dec 14, 2022
Efficient training of deep recommenders on cloud.

HybridBackend Introduction HybridBackend is a training framework for deep recommenders which bridges the gap between evolving cloud infrastructure and

Alibaba 114 Jan 5, 2023
Fairring (FAIR + Herring) is a plug-in for PyTorch that provides a process group for distributed training that outperforms NCCL at large scales

Fairring (FAIR + Herring): a faster all-reduce TL;DR: Using a variation on Amazon’s "Herring" technique, which leverages reduction servers, we can per

Meta Research 46 Nov 24, 2022
A library for distributed ML training with PyTorch

moolib moolib - a communications library for distributed ML training moolib offers general purpose RPC with automatic transport selection (shared memo

Meta Research 345 Dec 29, 2022
Weekly competitive programming training for newbies (Codeforces problem set)

Codeforces Basic Problem Set Weekly competitive programming training for newbies based on the Codeforces problem set. Note that this training problem

Nguyen Hoang Hai 4 Apr 22, 2022
HugeCTR is a GPU-accelerated recommender framework designed to distribute training across multiple GPUs and nodes and estimate Click-Through Rates (CTRs)

Merlin: HugeCTR HugeCTR is a GPU-accelerated recommender framework designed to distribute training across multiple GPUs and nodes and estimate Click-T

null 764 Jan 2, 2023
An open-source, low-code machine learning library in Python

An open-source, low-code machine learning library in Python Version 2.3.6 out now! Check out the release notes here. Official • Docs • Install • Tu

PyCaret 6.7k Dec 29, 2022
Deep Scalable Sparse Tensor Network Engine (DSSTNE) is an Amazon developed library for building Deep Learning (DL) machine learning (ML) models

Amazon DSSTNE: Deep Scalable Sparse Tensor Network Engine DSSTNE (pronounced "Destiny") is an open source software library for training and deploying

Amazon Archives 4.4k Dec 30, 2022
A library for creating Artificial Neural Networks, for use in Machine Learning and Deep Learning algorithms.

iNeural A library for creating Artificial Neural Networks, for use in Machine Learning and Deep Learning algorithms. What is a Neural Network? Work on

Fatih Küçükkarakurt 5 Apr 5, 2022