A Binary Genetic Traits Lexer

Overview

BinLex: a Genetic Binary Trait Lexer Library and Utility

The purpose of BinLex is to extract basic blocks and functions as traits from binaries.

Most projects that attempt this generate traits with Python, which is slow. When processing large numbers of malware binaries, a faster compiled language such as C++ is a much better fit.

Installing

sudo apt install -y git libcapstone-dev cmake make
git clone https://github.com/c3rb3ru5d3d53c/binlex.git
cd binlex/
mkdir -p build/
cd build/ && cmake -S ../ -B . && make -j 4
sudo make install
cd ../
binlex -m elf:x86 -i tests/elf/elf.x86

NOTE:

  • ZIP files in the tests/ directory can be extracted using the password infected

Usage

binlex v1.0.0 - A Binary Genetic Traits Lexer
  -i  --input           input file or directory         (required)
  -m  --mode            set mode                        (required)
  -lm --list-modes      list modes
  -h  --help            display help
  -t  --threads         threads
  -o  --output          output file or directory        (optional)
  -v  --version         display version
Author: @c3rb3ru5d3d53c

Currently Supported Modes

  • elf:x86
  • elf:x86_64
  • pe:x86
  • pe:x86_64
  • raw:x86
  • raw:x86_64

NOTE: The raw formats can be used on shellcode.

General Usage Information

Binlex is designed to do one thing and one thing only: extract genetic traits from executable code in files. This means it is up to you, the researcher or data scientist, to determine which traits are good and which are bad. To accomplish this, you need to use your own fitness function. I encourage you to read about genetic programming to gain a better understanding of this in practice. Perhaps watching this introductory video will help your understanding.

Again, it's up to you to implement your own algorithms for detection based on the genetic traits you extract.

Trait Format

Traits contain binary code represented in hexadecimal form and use ?? as wildcards for memory operands or other operands subject to change.

Trait files contain a list of traits ordered by size and use the SHA-256 of the sample as the file name.

# Example Trait File
12 34 56 ?? ?? 11 12 13
14 15 16 17 18 ?? ?? 21 22 23
# ... More traits to follow
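
The trait format above can be consumed with a few lines of Python. The following is a minimal sketch and not part of binlex itself: the helper names load_traits and trait_to_regex are hypothetical, and it assumes each trait line is a space-separated list of two-digit hex bytes in which ?? stands for exactly one byte.

```python
import re


def load_traits(text):
    """Return trait lines, skipping blank lines and # comments."""
    lines = (line.strip() for line in text.splitlines())
    return [line for line in lines if line and not line.startswith("#")]


def trait_to_regex(trait):
    """Compile one trait into a bytes regex; ?? matches any single byte."""
    parts = []
    for token in trait.split():
        if token == "??":
            parts.append(b".")  # wildcard byte
        else:
            parts.append(re.escape(bytes([int(token, 16)])))
    # DOTALL so that the wildcard also matches the byte 0x0a
    return re.compile(b"".join(parts), re.DOTALL)


example = """# Example Trait File
12 34 56 ?? ?? 11 12 13
14 15 16 17 18 ?? ?? 21 22 23
"""

traits = load_traits(example)
pattern = trait_to_regex(traits[0])

# Hypothetical sample bytes: the first trait occurs at offset 1.
data = bytes.fromhex("00123456aabb11121300")
match = pattern.search(data)
print(match.start() if match else None)
```

Compiling each trait once and reusing the pattern keeps scanning many samples cheap; whether a regex is fast enough for a large corpus is a separate question this sketch does not address.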

Tips

  • Don't mix packed and unpacked malware, or you will taint your dataset (this happens in academic work all the time)
  • Verify the samples you are collecting into a group using skilled analysts
  • These traits are best used with a hybrid approach (supervised)

Example Fitness Model

Traits are first compared within their common malware family; any traits not common to all samples are discarded.

Once completed, all remaining traits are compared to traits from a goodware set; any traits that match the goodware set are discarded.

To further differentiate the traits from other malware families, the remaining population is compared to other malware families; any that match are discarded.

The remaining population of traits is unique to the malware family tested and does not appear in legitimate binaries or other malware families.

This fitness model allows for accurate classification of the tested malware family.
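
The steps above amount to set operations over trait populations. Here is a minimal sketch, assuming each sample's traits have already been loaded into a Python set of strings; the function name select_family_traits and the sample data are hypothetical and not part of binlex.

```python
def select_family_traits(samples, goodware, other_families):
    """Apply the three filtering steps of the example fitness model.

    samples:        list of sets, one set of traits per sample of the family
    goodware:       set of traits seen in legitimate binaries
    other_families: iterable of sets of traits from other malware families
    """
    if not samples:
        return set()
    # Step 1: keep only traits common to every sample of the family.
    population = set.intersection(*samples)
    # Step 2: discard traits that also appear in the goodware set.
    population -= goodware
    # Step 3: discard traits shared with any other malware family.
    for other in other_families:
        population -= other
    return population


# Hypothetical toy data for illustration only.
samples = [
    {"aa ?? bb", "cc dd", "ee ff"},
    {"aa ?? bb", "cc dd", "11 22"},
]
goodware = {"cc dd"}
other_families = [{"ee ff"}, {"33 44"}]

print(select_family_traits(samples, goodware, other_families))
```

This compares traits by exact string equality; a real fitness function might instead tolerate near matches, which is left to the researcher.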

Comments
  • Compile on MSVC

    Unfortunately the CMake is a bit hacky so the fixes are also a bit hacky. In my opinion ExternalProject_Add should eventually be replaced with find_package (in combination with vcpkg), but LIEF/tlsh (and older versions of capstone) don't have proper packages so this is rather tricky to do for now. Another issue is that there is no support for multi-config generators (like Visual Studio or Ninja) so you might end up with errors trying to build anything but Release.

    I forked tlsh to https://github.com/mrexodia/tlsh with some changes to the CMakeLists.txt to make it work on Windows. The repo seems rather dead so I don't think trying to get this upstream will be successful. Feel free to fork it to your account.

    Closes #126 Closes #38 Closes #128

    enhancement 
    opened by mrexodia 19
  • Problems with function recognition

    Hello! I wanted to process an OpenSSL library and noticed that the latest version of binlex recognized only a negligible number of functions: 7, whereas IDA recognized 1636. I used the command binlex -m pe:x86_64 -i <lib_name> | jq -r 'select(.type == ("function"))'. Am I doing something wrong, or is this a bug?

    bug 
    opened by nofiv 15
  • Fix for building on macOS.

    Also to build, flags needed to be set:

    CFLAGS='-std=c11' CXXFLAGS='-std=c++11' make THREADS=4
    

    Built on M1:

    binlex % file ./build/binlex 
    ./build/binlex: Mach-O 64-bit executable arm64
    
    binlex % ./build/binlex
    binlex v1.1.1 - A Binary Genetic Traits Lexer
      -i  --input		input file		(required)
      -m  --mode		set mode		(optional)
      -lm --list-modes	list modes		(optional)
      -c  --corpus		corpus name		(optional)
      -g  --tag		add a tag		(optional)
               		(can be specified multiple times)
      -t  --threads		number of threads	(optional)
      -to --timeout		execution timeout in s	(optional)
      -h  --help		display help		(optional)
      -o  --output		output file		(optional)
      -p  --pretty		pretty output		(optional)
      -d  --debug		print debug info	(optional)
      -v  --version		display version		(optional)
    
    enhancement 
    opened by ooPo 10
  • Memory leak in ClearTrait()

    The function ClearTrait overwrites trait->bytes_sha256 with NULL. The memory which may have been allocated is not freed resulting in a memory leak. The same is true for trait->trait.

    bug 
    opened by rpkrawczyk 10
  • Add packaging sources to create .deb-files

    Dear @c3rb3ru5d3d53c,

    this PR adds the necessary packaging sources for building Debian binary packages of the programs/libs

    • libbinlex
    • binlex
    • blyara

    in order to ease installation and package removal.

    As described in the updated README.md file, users can simply run make deb to create those .deb files, which uses dpkg-buildpackage under the hood.

    Please note that for blyara, an unnecessary dependency on libcapstone has been removed. Packaging the development files as libbinlex-dev has not been done yet.

    Thanks already in advance for considering this PR.

    Best regards, jgru

    enhancement 
    opened by jgru 10
  • Segment violations when using the Python binding to the Raw disassembler

    Description:

    When using the Python binding with the Raw disassembler, segmentation faults occur at random.

    To Reproduce:

    Using the following Python script "blfile.py":

    #! /usr/bin/env python3
    
    import sys
    import pybinlex
    
    def main():
        for fnam in sys.argv[1:]:
            raw = pybinlex.Raw()
            #raw = pybinlex.ELF()
            raw.set_architecture(pybinlex.BINARY_ARCH.BINARY_ARCH_X86, pybinlex.BINARY_MODE.BINARY_MODE_64)
            raw.read_file(fnam)
            disasm = pybinlex.Decompiler(raw)
            disasm.decompile()
            print("# file %s:\n\t%s\n" % (fnam, disasm.get_traits()))
    
    if __name__ == "__main__":
        main()
    

    Running the script from the command line on binary files then segfaults, e.g. python3 blfile.py /bin/a*.

    Expected Behavior:

    It should produce some traits; replacing the Raw object with an ELF object works as intended.

    Affected OS/Version:

    Debian Bullseye

    bug 
    opened by rpkrawczyk 9
  • MongoDB Schema, Shards, Replicas, Configs and Routers & RabbitMQ Cluster & Binlex MongoDB / Messaging Queue Workers and HTTP API

    In order to work with frequency analysis on traits, we would need to track the file hashes associated with given traits.

    To do this we would need the equivalent of a stored procedure in mongodb when documents are posted to keep records of hashes for traits.

    This would make the db a little more complex, but the payoff would be pretty great, as we would be able to search traits by sample hash and more.

    enhancement 
    opened by c3rb3ru5d3d53c 9
  • Granularity of the output

    Hello, nice project! I'm wondering whether it would be useful to make the output more granular by additionally generating traits/bytes for individual blocks/instructions/opcodes/operands. I believe this would be useful for subsequent processing in some cases; e.g. the first team in the Microsoft Malware Classification Challenge used the frequency of opcodes and their N-grams, among other things.

    enhancement 
    opened by nofiv 8
  • GitHub Actions (macOS)

    This GitHub Actions script is easily extensible to include other operating systems and/or compilers. But it might be a bit difficult to get the tests running on all platforms properly.

    Related: #133

    opened by mrexodia 6
  • Remove Decompiler.Setup Method

    When we pass a file object, we set the file's BINARY_ARCH and BINARY_MODE, which are abstractions of cs_arch and cs_mode. This will make the API much simpler to use as we only need to pass the file object in the constructor and not call Decompiler.Setup method.

    enhancement 
    opened by c3rb3ru5d3d53c 5
  • TLSH Bytes and Traits

    Is your feature request related to a problem? Please describe. No

    Describe the solution you'd like TLSH Bytes and Traits

    Describe alternatives you've considered N/A

    Additional context

    It looks like computing TLSH over the binary data of our traits and bytes, rather than over the string form of the traits or byte sequences, will yield the best results.

    #!/usr/bin/env python
    
    import tlsh
    from hexdump import hexdump
    
    str_0 = b'15 d9 e0 d3 e3 23 f8 74 ba 97 6d f0 ?? ?? ?? ?? 04 1b 12 85 8d 31 4e 6f 22 77 2b 3b 5c 16 07 8d 41 55 3e ea eb ec 66 9d 5f 60 3e cd fa a0 40 c3 01 d4 df 11 e6 1c 7d 15 bb db 46 9e e7 a4 c0 5a 8d 45 20 40 8d d7 14 4e 9b 86 7f a3 6e df c5 e4 76 ea 30 4a 78 ff fd 9e 2a a3 06 6f 7d fb 16 80 8e 4a 1a e8 e4 45 49 9c 39 0b 00 d0 31 d0 0e 07 62 f0 9c 0a cd e5 ba f7 72 ed c8 a2 9c 6c 36 ab'
    str_1 = b'15 e9 e0 d3 e3 23 f8 74 ba 97 6d f4 ?? ?? ?? ?? 04 1b 12 85 8d 31 4e 6f 22 77 2b 3b 5c 16 07 8d 9b 55 3e ea eb ec 66 9d 5f 60 3e cd fa a0 40 c3 01 d4 df 11 e6 1c 7d 15 bb db 46 9e e7 a4 c0 5a 8d 45 20 40 8d d7 14 4e 9b 86 7f a3 6e df c5 e4 76 ea 30 4a 78 ff fd 9e 2a a3 06 6f 7d fb 16 80 8e 4a 1a e8 e4 45 49 9c 39 0b 00 d0 31 d0 0e 07 62 f0 9c 0a cd e5 ba f7 72 ed c8 a2 9c 6c 36 ac'
    
    bytes_0 = bytes(bytearray.fromhex(str_0.decode('utf-8').replace(' ', '').replace('?', '')))
    bytes_1 = bytes(bytearray.fromhex(str_1.decode('utf-8').replace(' ', '').replace('?', '')))
    
    str_h0 = tlsh.Tlsh()
    str_h0.update(str_0)
    str_h0.final()
    
    bytes_h0 = tlsh.Tlsh()
    bytes_h0.update(bytes_0)
    bytes_h0.final()
    
    str_h1 = tlsh.Tlsh()
    str_h1.update(str_1)
    str_h1.final()
    
    bytes_h1 = tlsh.Tlsh()
    bytes_h1.update(bytes_1)
    bytes_h1.final()
    
    # Labels below match the variable each line actually prints.
    print('str_h0: ' + str_0.decode('utf-8'))
    print('str_tlsh_h0: ' + str_h0.hexdigest())
    
    print('str_h1: ' + str_1.decode('utf-8'))
    print('str_tlsh_h1: ' + str_h1.hexdigest())
    
    print('string_score: ' + str(str_h0.diff(str_h1)))
    
    print('bytes_h0: ' + str_0.decode('utf-8'))
    hexdump(bytes_0)
    print('bytes_tlsh_h0: ' + bytes_h0.hexdigest())
    
    print('bytes_h1: ' + str_1.decode('utf-8'))
    hexdump(bytes_1)
    print('bytes_tlsh_h1: ' + bytes_h1.hexdigest())
    
    print('bytes_score: ' + str(bytes_h0.diff(bytes_h1)))
    
    enhancement 
    opened by c3rb3ru5d3d53c 5
  • [Platform] submit sample in auto mode fail in bldec

    Description: When submitting a sample in "auto" mode, neither binlex.py nor bldec defines the "mode". bldec then raises an error when parsing the architecture.

    To Reproduce: curl -X POST --insecure -H "X-API-Key: " --upload-file file.exe https://127.0.0.1:8443/binlex/api/v1/samples/goodware/auto {'corpus': 'goodware', 'mode': 'auto', 'object_name': 'sha256'}

    data = {'corpus': 'goodware', 'mode': 'auto', 'object_name': 'sha256'}
    file_type = data['mode'].split(':')[0]      # auto
    architecture = data['mode'].split(':')[1]   # IndexError: list index out of range
    

    Expected Behavior:

    • binlex.py should determine the mode or
    • bldec should handle {'mode': 'auto'}
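
    A minimal sketch of the second option, assuming the mode string is either a bare value such as auto or a file_type:architecture pair; the helper name split_mode is hypothetical and this is not the actual bldec code.

```python
def split_mode(mode):
    """Split 'pe:x86' into ('pe', 'x86'); return (mode, None) if there is no ':'."""
    file_type, sep, architecture = mode.partition(":")
    return file_type, (architecture if sep else None)


data = {'corpus': 'goodware', 'mode': 'auto', 'object_name': 'sha256'}
file_type, architecture = split_mode(data['mode'])
print(file_type, architecture)       # auto None
print(split_mode('pe:x86_64'))       # ('pe', 'x86_64')
```

    Using str.partition avoids the IndexError because it always returns three parts, so the caller can decide what to do when the architecture is None.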

    Affected OS/Version: OS: Ubuntu 20.04.4 LTS version: 1.1.1 branch Master

    Additional context: master branch (not release)

    bug 
    opened by m0n4 0
  • CMakeLists.txt - Change ExternalProject_Add to find_package

    Is your feature request related to a problem? Please describe. As mentioned in #127, we should move to using find_package instead, as ExternalProject_Add introduces issues with Windows builds.

    Describe the solution you'd like The solution would be to add git submodules for LIEF, tlsh, and capstone and use find_package instead.

    Describe alternatives you've considered N/A

    Additional context This was suggested by @mrexodia :smile:

    enhancement 
    opened by c3rb3ru5d3d53c 4
  • CIL: Strange error when processing specific obfuscated .NET binary.

    Description: Strange error when processing specific obfuscated .NET binary.

    To Reproduce: Download pe.cil.2.zip

    Run:

    binlex -m auto -i pe.cil.2
    Try to read 0x4 bytes from 0x153e00 (153e04) which is bigger than the binary's size
    Try to read 0x4 bytes from 0x153e00 (153e04) which is bigger than the binary's size
    Try to read 0x4 bytes from 0x153e00 (153e04) which is bigger than the binary's size
    

    Expected Behavior: Output traits

    Affected OS/Version: Linux/v1.1.1-rc1

    bug 
    opened by c3rb3ru5d3d53c 0
Releases(v1.1.1-rc1)
Owner
c3rb3ru5
μηςεηsοяεδ мαℓωαяε яεsεαяςнεя sταηδιηg gμαяδ ατ τнε gατεs οƒ мαℓωαяε нεℓℓ