A Modern C++ Data Sciences Toolkit

Overview

MeTA: ModErn Text Analysis

Please visit our web page for information and tutorials about MeTA!

Build Status (by branch)

  • master: Linux and Windows CI status badges
  • develop: Linux and Windows CI status badges

Intro

MeTA is a modern C++ data sciences toolkit featuring

  • text tokenization, including deep semantic features like parse trees
  • inverted and forward indexes with compression and various caching strategies
  • a collection of ranking functions for searching the indexes
  • topic models
  • classification algorithms
  • graph algorithms
  • language models
  • CRF implementation (POS-tagging, shallow parsing)
  • wrappers for liblinear and libsvm (including libsvm dataset parsers)
  • UTF8 support for analysis on various languages
  • multithreaded algorithms

Documentation

Doxygen documentation can be found on the MeTA website.

Tutorials

We have walkthroughs for a few different parts of MeTA on the MeTA homepage.

Citing

If you used MeTA in your research, we would greatly appreciate a citation for our ACL demo paper:

@InProceedings{meta-toolkit,
  author    = {Massung, Sean and Geigle, Chase and Zhai, Cheng{X}iang},
  title     = {{MeTA: A Unified Toolkit for Text Retrieval and Analysis}},
  booktitle = {Proceedings of ACL-2016 System Demonstrations},
  month     = {August},
  year      = {2016},
  address   = {Berlin, Germany},
  publisher = {Association for Computational Linguistics},
  pages     = {91--96},
  url       = {http://anthology.aclweb.org/P16-4016}
}

Project setup

Mac OS X Build Guide

Mac OS X 10.6 or higher is required. You may have success with 10.5, but this is not tested.

You will need Homebrew installed, as well as the Command Line Tools for Xcode. (Homebrew requires these too and will prompt for them during installation; on recent versions of OS X you can also install them directly with xcode-select --install.)

Once Homebrew is installed, run the following commands to get the dependencies for MeTA:

brew update
brew install cmake jemalloc lzlib icu4c

To get started, run the following commands:

# clone the project
git clone https://github.com/meta-toolkit/meta.git
cd meta/

# set up submodules
git submodule update --init --recursive

# set up a build directory
mkdir build
cd build
cp ../config.toml .

# configure and build the project
CXX=clang++ cmake ../ -DCMAKE_BUILD_TYPE=Release -DICU_ROOT=/usr/local/opt/icu4c
make

You can now test the system by running the following command:

./unit-test --reporter=spec

If everything passes, congratulations! MeTA seems to be working on your system.

Ubuntu Build Guide

The directions here depend greatly on your installed version of Ubuntu. To check what version you are on, run the following command:

cat /etc/issue

Based on what you see, you should proceed with one of the following guides:

If your version is less than 12.04 LTS, your operating system is not supported (even by your vendor!) and you should upgrade to at least 12.04 LTS (or 14.04 LTS, if possible).
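As a sketch of the decision above, a small helper (hypothetical, not part of MeTA) can map the version number printed by /etc/issue to the applicable guide:

```shell
# Hypothetical helper (not part of MeTA): map an Ubuntu version string,
# e.g. the number printed by /etc/issue, to the applicable build guide.
pick_guide() {
  major=${1%%.*}   # integer part before the first dot
  if [ "$major" -lt 12 ]; then
    echo "unsupported -- upgrade first"
  elif [ "$major" -lt 14 ]; then
    echo "follow the 12.04 LTS guide"
  elif [ "$major" -lt 15 ]; then
    echo "follow the 14.04 LTS guide"
  else
    echo "follow the 15.10 guide"
  fi
}

pick_guide 14.04   # prints: follow the 14.04 LTS guide
```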

Ubuntu 12.04 LTS Build Guide

Building on Ubuntu 12.04 LTS requires more work than on its more up-to-date 14.04 sibling, but it can be done relatively easily. You will, however, need to install a newer C++ compiler from a PPA and switch to it in order to build MeTA. We will also need to install a newer CMake version than is natively available.

Start by running the following commands to get the dependencies that we will need for building MeTA.

# this might take a while
sudo apt-get update
sudo apt-get install python-software-properties

# add the ppa that contains an updated g++
sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt-get update

# this will probably take a while
sudo apt-get install g++ g++-4.8 git make wget libjemalloc-dev zlib1g-dev

wget http://www.cmake.org/files/v3.2/cmake-3.2.0-Linux-x86_64.sh
sudo sh cmake-3.2.0-Linux-x86_64.sh --prefix=/usr/local

During CMake installation, you should agree to the license and then say "n" to including the subdirectory. You should be able to run the following commands and see the following output:

g++-4.8 --version

should print

g++-4.8 (Ubuntu 4.8.1-2ubuntu1~12.04) 4.8.1
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

and

/usr/local/bin/cmake --version

should print

cmake version 3.2.0

CMake suite maintained and supported by Kitware (kitware.com/cmake).

Once the dependencies are all installed, you should be ready to build. Run the following commands to get started:

# clone the project
git clone https://github.com/meta-toolkit/meta.git
cd meta/

# set up submodules
git submodule update --init --recursive

# set up a build directory
mkdir build
cd build
cp ../config.toml .

# configure and build the project
CXX=g++-4.8 /usr/local/bin/cmake ../ -DCMAKE_BUILD_TYPE=Release
make

You can now test the system by running the following command:

./unit-test --reporter=spec

If everything passes, congratulations! MeTA seems to be working on your system.

Ubuntu 14.04 LTS Build Guide

Ubuntu 14.04 has a recent enough GCC for building MeTA, but we'll need to add a ppa for a more recent version of CMake.

Start by running the following commands to install the dependencies for MeTA.

# this might take a while
sudo apt-get update
sudo apt-get install software-properties-common

# add the ppa for cmake
sudo add-apt-repository ppa:george-edison55/cmake-3.x
sudo apt-get update

# install dependencies
sudo apt-get install g++ cmake libicu-dev git libjemalloc-dev zlib1g-dev

Once the dependencies are all installed, you should double check your versions by running the following commands.

g++ --version

should output

g++ (Ubuntu 4.8.2-19ubuntu1) 4.8.2
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

and

cmake --version

should output

cmake version 3.2.2

CMake suite maintained and supported by Kitware (kitware.com/cmake).
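If your printed versions differ from the above, a small comparison helper (hypothetical, not part of MeTA) can sanity-check them; these guides assume roughly GCC >= 4.8 and CMake >= 3.2:

```shell
# Hypothetical helper: check that a "major.minor[.patch]" version string is at
# least major $2 / minor $3 (e.g. GCC 4.8, CMake 3.2).
version_at_least() {
  maj=${1%%.*}
  rest=${1#*.}
  min=${rest%%.*}
  [ "$maj" != "$1" ] || min=0   # a bare "5" has no minor part
  [ "$maj" -gt "$2" ] || { [ "$maj" -eq "$2" ] && [ "$min" -ge "$3" ]; }
}

if command -v g++ >/dev/null 2>&1 && version_at_least "$(g++ -dumpversion)" 4 8; then
  echo "g++ is new enough for MeTA"
fi
```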

Once the dependencies are all installed, you should be ready to build. Run the following commands to get started:

# clone the project
git clone https://github.com/meta-toolkit/meta.git
cd meta/

# set up submodules
git submodule update --init --recursive

# set up a build directory
mkdir build
cd build
cp ../config.toml .

# configure and build the project
cmake ../ -DCMAKE_BUILD_TYPE=Release
make

You can now test the system by running the following command:

./unit-test --reporter=spec

If everything passes, congratulations! MeTA seems to be working on your system.

Ubuntu 15.10 Build Guide

Ubuntu's non-LTS desktop offering, 15.10, has recent enough software in its repositories to build MeTA without much trouble. To install the dependencies, run the following commands.

apt update
apt install g++ git cmake make libjemalloc-dev zlib1g-dev

Once the dependencies are all installed, you should be ready to build. Run the following commands to get started:

# clone the project
git clone https://github.com/meta-toolkit/meta.git
cd meta/

# set up submodules
git submodule update --init --recursive

# set up a build directory
mkdir build
cd build
cp ../config.toml .

# configure and build the project
cmake ../ -DCMAKE_BUILD_TYPE=Release
make

You can now test the system by running the following command:

./unit-test --reporter=spec

If everything passes, congratulations! MeTA seems to be working on your system.

Arch Linux Build Guide

Arch Linux consistently has the most up-to-date packages due to its rolling-release model, so it's often the easiest platform to get set up on.

To install the dependencies, run the following commands.

sudo pacman -Sy
sudo pacman -S clang cmake git icu libc++ make jemalloc zlib

Once the dependencies are all installed, you should be ready to build. Run the following commands to get started:

# clone the project
git clone https://github.com/meta-toolkit/meta.git
cd meta/

# set up submodules
git submodule update --init --recursive

# set up a build directory
mkdir build
cd build
cp ../config.toml .

# configure and build the project
CXX=clang++ cmake ../ -DCMAKE_BUILD_TYPE=Release
make

You can now test the system by running the following command:

./unit-test --reporter=spec

If everything passes, congratulations! MeTA seems to be working on your system.

Fedora Build Guide

This has been tested with Fedora 22+ (the oldest currently supported Fedora as of the time of writing). You may have success with earlier versions, but this is not tested. (If you're on an older version of Fedora, use yum instead of dnf for the commands given below.)
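The dnf-vs-yum choice can be automated; here is a small sketch (hypothetical, not part of MeTA):

```shell
# Sketch: pick dnf when available (Fedora 22+), otherwise fall back to yum.
if command -v dnf >/dev/null 2>&1; then pkg=dnf; else pkg=yum; fi
echo "using $pkg"
# sudo "$pkg" install make git wget gcc-c++ jemalloc-devel cmake zlib-devel
```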

To get started, install some dependencies:

# These may be already installed
sudo dnf install make git wget gcc-c++ jemalloc-devel cmake zlib-devel

You should be able to run the following commands and see the following output:

g++ --version

should print

g++ (GCC) 5.3.1 20151207 (Red Hat 5.3.1-2)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

and

cmake --version

should print

cmake version 3.3.2

CMake suite maintained and supported by Kitware (kitware.com/cmake).

Once the dependencies are all installed, you should be ready to build. Run the following commands to get started:

# clone the project
git clone https://github.com/meta-toolkit/meta.git
cd meta/

# set up submodules
git submodule update --init --recursive

# set up a build directory
mkdir build
cd build
cp ../config.toml .

# configure and build the project
cmake ../ -DCMAKE_BUILD_TYPE=Release
make

You can now test the system with the following command:

./unit-test --reporter=spec

CentOS Build Guide

MeTA can be built on CentOS 7 and above. CentOS 7 comes with a recent enough compiler (GCC 4.8.5) but too old a version of CMake, so we'll install the compiler and related libraries from the package manager and a more recent CMake ourselves.

# install build dependencies (this will probably take a while)
sudo yum install gcc gcc-c++ git make wget zlib-devel epel-release
sudo yum install jemalloc-devel

wget http://www.cmake.org/files/v3.2/cmake-3.2.0-Linux-x86_64.sh
sudo sh cmake-3.2.0-Linux-x86_64.sh --prefix=/usr/local --exclude-subdir

You should be able to run the following commands and see the following output:

g++ --version

should print

g++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-4)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

and

/usr/local/bin/cmake --version

should print

cmake version 3.2.0

CMake suite maintained and supported by Kitware (kitware.com/cmake).

Once the dependencies are all installed, you should be ready to build. Run the following commands to get started:

# clone the project
git clone https://github.com/meta-toolkit/meta.git
cd meta/

# set up submodules
git submodule update --init --recursive

# set up a build directory
mkdir build
cd build
cp ../config.toml .

# configure and build the project
/usr/local/bin/cmake ../ -DCMAKE_BUILD_TYPE=Release
make

You can now test the system by running the following command:

./unit-test --reporter=spec

If everything passes, congratulations! MeTA seems to be working on your system.

EWS/EngrIT Build Guide

Note: Please don't do this if you are able to get MeTA working in any other possible way, as the EWS filesystem has a habit of being unbearably slow and increasing compile times by several orders of magnitude. For example, comparing the cmake, make, and unit-test steps on my desktop vs. EWS gives the following:

system       cmake time   make time    unit-test time
my desktop   0m7.523s     2m30.715s    0m36.631s
EWS          1m28s        11m28.473s   1m25.326s

If you are on a machine managed by Engineering IT at UIUC, you should follow this guide. These systems have software that is much too old for building MeTA, but EngrIT has been kind enough to package updated versions of research software as modules. The modules provided for GCC and CMake are recent enough to build MeTA, so it is actually mostly straightforward.

To set up your dependencies (you will need to do this every time you log back in to the system), run the following commands:

module load gcc
module load cmake/3.5.0
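Since these module loads reset when you log out, one convenience (a sketch, assuming bash and the standard ~/.bash_profile; adjust for your shell) is to append them to your profile so they run on every login:

```shell
# Sketch: persist the module loads in ~/.bash_profile (assumes bash; adjust the
# profile path for your shell). grep -qxF avoids duplicate entries on re-runs.
profile="$HOME/.bash_profile"
touch "$profile"
grep -qxF 'module load gcc' "$profile" || echo 'module load gcc' >> "$profile"
grep -qxF 'module load cmake/3.5.0' "$profile" || echo 'module load cmake/3.5.0' >> "$profile"
```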

Once you have done this, double check your versions by running the following commands.

g++ --version

should output

g++ (GCC) 5.3.0
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

and

cmake --version

should output

cmake version 3.5.0

CMake suite maintained and supported by Kitware (kitware.com/cmake).

If your versions are correct, you should be ready to build. To get started, run the following commands:

# clone the project
git clone https://github.com/meta-toolkit/meta.git
cd meta/

# set up submodules
git submodule update --init --recursive

# set up a build directory
mkdir build
cd build
cp ../config.toml .

# configure and build the project
CXX=`which g++` CC=`which gcc` cmake ../ -DCMAKE_BUILD_TYPE=Release
make

You can now test the system by running the following command:

./unit-test --reporter=spec

If everything passes, congratulations! MeTA seems to be working on your system.

Windows Build Guide

MeTA can be built on Windows using the MinGW-w64 toolchain with gcc. We strongly recommend using MSYS2 as this makes fetching the compiler and related libraries significantly easier than it would be otherwise, and it tends to have very up-to-date packages relative to other similar MinGW distributions.

Note: If you find yourself confused or lost by the instructions below, please refer to our visual setup guide for Windows which includes screenshots for every step, including updating MSYS2 and the MinGW-w64 toolchain.

To start, download the installer for MSYS2 from the linked website and follow the instructions on that page. Once you've got it installed, you should use the MinGW shell to start a new terminal, in which you should run the following commands to download dependencies and related software needed for building:

pacman -Syu git make patch mingw-w64-x86_64-{gcc,cmake,icu,jemalloc,zlib} --force

(The --force flag is needed to work around a bug with the latest MSYS2 installer as of the time of writing.)

Then, exit the shell and launch the "MinGW-w64 Win64" shell. You can obtain the toolkit and get started with:

# clone the project
git clone https://github.com/meta-toolkit/meta.git
cd meta

# set up submodules
git submodule update --init --recursive

# set up a build directory
mkdir build
cd build
cp ../config.toml .

# configure and build the project
cmake .. -G "MSYS Makefiles" -DCMAKE_BUILD_TYPE=Release
make

You can now test the system by running the following command:

./unit-test --reporter=spec

If everything passes, congratulations! MeTA seems to be working on your system.

Generic Setup Notes

  • There are make rules for clean, tidy, and doc. After you run the cmake command once, you will be able to just run make as usual when you're developing---it'll detect when the CMakeLists.txt file has changed and rebuild Makefiles if it needs to.

  • To compile in debug mode, just replace Release with Debug in the appropriate cmake command for your OS above and rebuild using make after.

  • Don't hesitate to reach out on the forum if you encounter problems getting set up. We routinely build with a wide variety of compilers and operating systems through our continuous integration setups (travis-ci for Linux and OS X and Appveyor for Windows), so we can be fairly certain that things should build on nearly all major platforms.

Comments
  • test not passed on Mac


    4/15 Test  #4: inverted-index ...................***Failed    1.42 sec
            inverted-index-build-file-corpus:       [ FAIL ] [idx.unique_terms() == 3944ul] => [3938 == 3944] (inverted_index_test.cpp:69)
            inverted-index-read-file-corpus:        [ FAIL ] [idx.unique_terms() == 3944ul] => [3938 == 3944] (inverted_index_test.cpp:69)
            inverted-index-build-line-corpus:       [ FAIL ] [idx.unique_terms() == 3944ul] => [3938 == 3944] (inverted_index_test.cpp:69)
            inverted-index-read-line-corpus:        [ FAIL ] [idx.unique_terms() == 3944ul] => [3938 == 3944] (inverted_index_test.cpp:69)
            inverted-index-build-gz-corpus:         [ FAIL ] [idx.unique_terms() == 3944ul] => [3938 == 3944] (inverted_index_test.cpp:69)
            inverted-index-read-gz-corpus:          [ FAIL ] [idx.unique_terms() == 3944ul] => [3938 == 3944] (inverted_index_test.cpp:69)
    
          Start  5: forward-index
     5/15 Test  #5: forward-index ....................***Failed    1.16 sec
            forward-index-build-file-corpus:        [ FAIL ] [idx.unique_terms() == 3944ul] => [3938 == 3944] (forward_index_test.cpp:49)
            forward-index-read-file-corpus:         [ FAIL ] [idx.unique_terms() == 3944ul] => [3938 == 3944] (forward_index_test.cpp:49)
            forward-index-build-line-corpus:        [ FAIL ] [idx.unique_terms() == 3944ul] => [3938 == 3944] (forward_index_test.cpp:49)
            forward-index-read-line-corpus:         [ FAIL ] [idx.unique_terms() == 3944ul] => [3938 == 3944] (forward_index_test.cpp:49)
    

    My mac version:

    14.3.0 Darwin Kernel Version 14.3.0, root:xnu-2782.20.48~5/RELEASE_X86_64 x86_64
    

    clang version:

    Apple LLVM version 6.1.0 (clang-602.0.53) (based on LLVM 3.6.0svn)
    Target: x86_64-apple-darwin14.3.0
    Thread model: posix
    
    opened by CanoeFZH 51
  • MacOS configure error


    CMake Error at deps/meta-cmake/kludges/AlignedAlloc.cmake:54 (message):
      Failed to find a suitable aligned allocation routine
    Call Stack (most recent call first):
      deps/meta-cmake/CompilerKludges.cmake:18 (include)
      CMakeLists.txt:120 (CompilerKludges)

    opened by ohhmm 18
  • Compilation in Windows (MSVC2015)


    This PR gathers commits in order to compile MeTA in MSVC2015. It is a work in progress, so this message may be edited if necessary.

    It also depends on modifications/PRs to other libraries:

    • pull-request on meta-toolkit/porter2_stemmer: https://github.com/meta-toolkit/porter2_stemmer/pull/1
    opened by jgsogo 16
  • Document metadata


    This branch adds support for generic metadata for documents, meaning we can now store arbitrary uint64_t, int64_t, double, and string typed information on a per-document basis. This makes adding things like boosts for specific documents, PageRank, etc. straightforward, and will support the development of a regression component of the toolkit (since we can now have a "label" that is a double instead of just a class_label). The one piece of document metadata that we are still storing the "old way" is the labels, since we need to do so many different things with those that I think it's OK to special-case them.

    This is a fairly large change to the corpus format and configuration, but nothing too terrible.

    Configuring a corpus now looks like:

    prefix = "/path/to/data/folder"
    dataset = "ceeaus"
    corpus = "line.toml"
    

    where "line.toml" is a file at prefix/ceeaus/line.toml with contents like the following:

    type = "line-corpus"
    encoding = "shift_jis"
    metadata = [{name = "grade-level", type = "double"},
                {name = "institution", type = "string"},
                {name = "student-id", type = "uint"},
                {name = "delta", type = "int"}]
    

    The metadata key above describes the format of the additional metadata available for this corpus. This key is optional. If provided, the corpus will look for a file in the corpus directory called metadata.dat, which is expected to be a TAB separated file where each line denotes the metadata for documents in the corpus (from the first to the last).
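For the metadata key above, a matching (hypothetical) metadata.dat would contain one TAB-separated line per document, with columns in the declared order (double, string, uint, int):

```
3.5	UIUC	1001	-2
2.0	Example University	1002	7
```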

    The unit tests pass locally---in order to run them on Travis-CI, we will need @smassung to update the ceeaus corpus downloaded in the CMake files to include the additional files line.toml, file.toml, and gz.toml with the appropriate configurations:

    # file: line.toml
    type = "line-corpus"
    encoding = "shift_jis"
    
    # file: file.toml
    type = "file-corpus"
    list = "ceeaus"
    encoding = "shift_jis"
    
    # file: gz.toml
    type = "gz-corpus"
    encoding = "shift_jis"
    

    I've also simplified the document representation in the case that we're reading them out from a file_corpus---that special case doesn't really make much sense to have around anymore, so documents now always float around with their full text in them, or have no content. We can revisit this if it's ever a problem (e.g., we have really stupidly huge documents).

    opened by skystrife 12
  • Precision and Recall Issue


    It seems that the precision and recall values reported by the confusion matrix, specifically the stats() method, are being flipped (I tried running it on several datasets and got flipped values).
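For anyone verifying this: for a single class with counts TP, FP, FN, precision = TP / (TP + FP) and recall = TP / (TP + FN). A tiny illustration with made-up counts:

```shell
# Illustrative only (made-up counts): precision vs. recall for a single class.
tp=8; fp=2; fn=4
precision=$(awk -v tp="$tp" -v fp="$fp" 'BEGIN { printf "%.2f", tp / (tp + fp) }')
recall=$(awk -v tp="$tp" -v fn="$fn" 'BEGIN { printf "%.2f", tp / (tp + fn) }')
echo "precision=$precision recall=$recall"   # precision=0.80 recall=0.67
```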

    opened by hazimehh 10
  •  Problem with MeTA Installation on Ubuntu 14.04.2 LTS


    cmake ../ -DCMAKE_BUILD_TYPE=Release

    CMake Error at CMakeLists.txt:19 (find_package):
      By not providing "FindICU.cmake" in CMAKE_MODULE_PATH this project has asked CMake to find a package configuration file provided by "ICU", but CMake did not find one. Could not find a package configuration file provided by "ICU" with any of the following names: ICUConfig.cmake icu-config.cmake

    opened by IbrahimIbrahim 9
  • Demo app


    Create demo app that showcases low-level MeTA functionality like tokenization, POS-tagging, stemming, frequency counting, etc. It should be easily modified by the user to experiment with settings.

    No dataset should be necessary, just use a single text file as input and perform all analysis on that.

    feature-request 
    opened by smassung 9
  • MacOS 10.13.1 Configure Error


    I have an error when executing command

    CXX=clang++ cmake ../ -DCMAKE_BUILD_TYPE=Release -DICU_ROOT=/usr/local/opt/icu4c
    

    The error message is below. I have a Mac running 10.13.1. I have tried the develop branch and cleaned the cache, but it still does not work. Could you please help me debug this problem? Thanks.

    -- The C compiler identification is AppleClang 9.0.0.9000038
    -- The CXX compiler identification is AppleClang 9.0.0.9000038
    -- Check for working C compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc
    -- Check for working C compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc -- works
    -- Detecting C compiler ABI info
    -- Detecting C compiler ABI info - done
    -- Detecting C compile features
    -- Detecting C compile features - done
    -- Check for working CXX compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/clang++
    -- Check for working CXX compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/clang++ -- works
    -- Detecting CXX compiler ABI info
    -- Detecting CXX compiler ABI info - done
    -- Detecting CXX compile features
    -- Detecting CXX compile features - done
    -- Looking for pthread.h
    -- Looking for pthread.h - found
    -- Looking for pthread_create
    -- Looking for pthread_create - found
    -- Found Threads: TRUE  
    -- Found ZLIB: /usr/lib/libz.dylib (found version "1.2.11") 
    -- Looking for lzma_auto_decoder in /usr/lib/liblzma.dylib
    -- Looking for lzma_auto_decoder in /usr/lib/liblzma.dylib - found
    -- Looking for lzma_easy_encoder in /usr/lib/liblzma.dylib
    -- Looking for lzma_easy_encoder in /usr/lib/liblzma.dylib - found
    -- Looking for lzma_lzma_preset in /usr/lib/liblzma.dylib
    -- Looking for lzma_lzma_preset in /usr/lib/liblzma.dylib - found
    -- Could NOT find LibLZMA (missing: LIBLZMA_INCLUDE_DIR) 
    -- Searching for ICU 58.2
    -- Found ICU: /usr/local/opt/icu4c/lib/libicudata.dylib;/usr/local/opt/icu4c/lib/libicui18n.dylib;/usr/local/opt/icu4c/lib/libicuuc.dylib (Required is at least version "58.2") 
    -- ICU version found is 59.1.0, expected 58.2; attempting to build ICU from scratch...
    -- ICU include dirs: /Users/yufan/Documents/workspace/meta/build/deps/icu-58.2/include
    -- ICU libraries: icui18n;icuuc;icudata
    -- Locating libc++...
    -- Located libc++: /usr/lib/libc++.dylib
    -- Located libc++ include path: /Library/Developer/CommandLineTools/usr/include/c++/v1/
    --     Locating libc++'s abi...
    --     Found libc++abi: /usr/lib/libc++abi.dylib
    -- Found LIBCXX: /usr/lib/libc++.dylib  
    -- Using regular malloc; consider installing jemalloc
    -- Performing Test META_HAS_ALIGNED_ALLOC
    -- Performing Test META_HAS_ALIGNED_ALLOC - Failed
    -- Performing Test META_HAS_POSIX_MEMALIGN
    -- Performing Test META_HAS_POSIX_MEMALIGN - Failed
    CMake Error at deps/meta-cmake/kludges/AlignedAlloc.cmake:54 (message):
      Failed to find a suitable aligned allocation routine
    Call Stack (most recent call first):
      deps/meta-cmake/CompilerKludges.cmake:18 (include)
      CMakeLists.txt:124 (CompilerKludges)
    
    
    -- Configuring incomplete, errors occurred!
    See also "/Users/yufan/Documents/workspace/meta/build/CMakeFiles/CMakeOutput.log".
    See also "/Users/yufan/Documents/workspace/meta/build/CMakeFiles/CMakeError.log".
    
    opened by alphafan 8
  • Direct forward index creation


    This PR allows us to create forward_index instances directly, rather than only being able to create them from inverted_index or from libsvm formatted data.

    While seemingly minor, this change will allow us to have classification-specific analyzers that can create real-valued features, which was not possible before without using pre-generated libsvm formatted data.

    This change also includes a lot of refactoring around postings_* and the multiway merge sort in a way that should make the code a bit prettier, and also signify intents a little better. For example, inverted_index::postings_data_type now explicitly states that the feature values are uint64_t, where forward_index::postings_data_type now explicitly states that they are doubles.

    This changes the postings format for forward_index slightly (total_counts can now correctly be a double for forward_index, so we can get things like proper instance-scaling!), so you may need to regenerate indexes that you have on disk if you're using develop anywhere.

    enhancement 
    opened by skystrife 8
  • Add a minimal perfect hashing language model implementation


    This PR contains code that coexists with the existing language model that implements a minimal perfect hashing-based language model using the perfect hashing framework added in a previous merge.

    The LM's query interface is different from the existing language model and more closely models the interface provided by KenLM. It produces (on my limited test data) identical perplexity results, but is not particularly fast.

    A few additional minor things are included:

    • std::vector<T> can now be hashed using the hash_append framework.

    • std::vector<T> can be read/written using io::packed.

    • io::packed has been refactored to make adding packed reading/writing routines for user-defined types easier, and is similar in spirit to hash_append.

      Users can provide packed_read and packed_write functions in the same namespace scope as their type, and io::packed::read and io::packed::write will find them via ADL.

    • All chunk_iterators for util::multiway_merge have been consolidated into a single template class. This reduces quite a bit of code duplication across all of the disk-based merge algorithms we have.

    opened by skystrife 7
  • Feature selection support for binary/multiclass/regression datasets and dataset views


    Updated feature selection so as it can be performed on labeled (i.e, binary or multiclass or regression) datasets and dataset views. Also, added a constructor in multiclass_dataset which takes in a featurizer function as a parameter.

    As shown below, these changes allow us to run feature selection on a portion of the data (such as the training data) and then create a new dataset consisting of only the selected features.

    auto selector = features::make_selector(*config, training_vw);
    
    uint64_t total_features_selected = 20;
    
    classify::multiclass_dataset feature_selected_dset
    (
       training_vw.begin(), training_vw.end(), total_features_selected, 
       [&](const learn::instance & inst)
       {
            learn::feature_vector inst_weights;
            for(const auto& w : inst.weights)
            {
                if(selector->selected(w.first))
                   inst_weights.emplace_back(w.first, w.second);
            }
    
            return inst_weights;
       },
       [&](const learn::instance inst)
       {
           return training_vw.label(inst);
       }
    );
    

    This is in reference to issues #149 and #111.

    opened by siddshuk 7
  • Unable to build on Windows 10 because cmake can't generate MSYS Makefiles.


    # configure and build the project
    cmake .. -G "MSYS Makefiles" -DCMAKE_BUILD_TYPE=Release
    make
    

    Did that, and am getting:

    $ cmake .. -G "MSYS Makefiles" -DCMAKE_BUILD_TYPE=Release
    CMake Error: Could not create named generator MSYS Makefiles

    Generators
      Unix Makefiles                  = Generates standard UNIX makefiles.
      Ninja                           = Generates build.ninja files.
      Ninja Multi-Config              = Generates build-<Config>.ninja files.
      CodeBlocks - Ninja              = Generates CodeBlocks project files.
      CodeBlocks - Unix Makefiles     = Generates CodeBlocks project files.
      CodeLite - Ninja                = Generates CodeLite project files.
      CodeLite - Unix Makefiles       = Generates CodeLite project files.
      Eclipse CDT4 - Ninja            = Generates Eclipse CDT 4.0 project files.
      Eclipse CDT4 - Unix Makefiles   = Generates Eclipse CDT 4.0 project files.
      Kate - Ninja                    = Generates Kate project files.
      Kate - Unix Makefiles           = Generates Kate project files.
      Sublime Text 2 - Ninja          = Generates Sublime Text 2 project files.
      Sublime Text 2 - Unix Makefiles = Generates Sublime Text 2 project files.

    Now I tried "Unix Makefiles", but that led to more errors. So how does one fix this issue when building on Windows 10? I have MSYS, make, and cmake installed (using pacman -S cmake, for example, from within the MSYS command terminal).

    opened by enjoysmath 1
  • Add updated URL for ICU dependency.


    Add updated ICU download URL after ICU dependency's tgz file was migrated from Subversion repo to GitHub repo (source: http://site.icu-project.org/repository).

    Purpose: Metapy's Python wheel install process runs the MeTA repo's cmake build process. The MeTA repo's cmake steps fail to download the ICU dependency because the old URL no longer hosts the tar file. Without this updated URL, no one can build MeTA, and by extension the Metapy repo cannot make new builds for Python 3.8 and Python 3.9 (which are currently needed, but missing because of this bug).

    opened by ccasey645 0
  • Add fix for failing ICU download URL

    Add updated ICU download URL after ICU dependency's tgz file was migrated from Subversion repo to GitHub repo (source: http://site.icu-project.org/repository).

    Purpose: MeTA's cmake steps fail to download the ICU dependency because the old URL no longer hosts the tar file. Without this updated URL, no one can build MeTA.

    opened by ccasey645 1
  • Detecting C compiler ABI info - failed

    I strictly followed the instructions on the installation page, but I get the errors "Detecting C compiler ABI info - failed" and "Check for working C compiler: /Users/difangu/opt/anaconda3/bin/x86_64-apple-darwin13.4.0-clang - broken". Is there a solution to this? Much appreciated!


    opened by difangu 0
Releases(v3.0.2)
  • v3.0.2(Aug 20, 2017)

    Bug fixes

    • Fix issues using MAKE_NUMERIC_IDENTIFIER instead of MAKE_NUMERIC_IDENTIFIER_UDL on GCC 7.1.1.
    • Work around (what we assume is) a bug on MSYS2 where cmake would link in additional exception handling libraries that would cause a crash during indexing by building the mman-win32 library as shared.
    • Silence fallthrough warnings on Clang from murmur_hash.

    Model File Checksums (sha256)

    d29bf8b4cbeef21db087cf8042efe5afe25c7bd3c460997728d58b92c24ec283  beam-search-constituency-parser-4.tar.gz
    ce44c7d96a8339ff4b597f35a35534ccf93ab99b7d45cbbdddffe7e362b9c20e  crf.tar.gz
    672b10c398c1a193ba91dc8c0493d729ad3f73d9192ef33100baeb8afd4f5cde  gigaword-embeddings-50d.tar.gz
    40cd87901eb29b69e57e4bca14bc2539d7d6b4ad5c186d6f3b1532a60c5163b0  greedy-constituency-parser.tar.gz
    a0a3814c1f82780f1296d600eba260f474420aa2d93f000e390c71a0ddac42d9  greedy-perceptron-tagger.tar.gz
    
    Source code(tar.gz)
    Source code(zip)
    beam-search-constituency-parser-4.tar.gz(172.62 MB)
    crf.tar.gz(7.18 MB)
    gigaword-embeddings-50d.tar.gz(312.09 MB)
    greedy-constituency-parser.tar.gz(52.78 MB)
    greedy-perceptron-tagger.tar.gz(6.31 MB)
  • v3.0.1(Mar 13, 2017)

    New features

    • Add an optional xz{i,o}fstream to meta::io if compiled with liblzma available.
    • util::disk_vector<const T> can now be used to specify a read-only view of a disk-backed vector.

    Bug fixes

    • ir_eval::print_stats now takes a num_docs parameter to properly display evaluation metrics at a given cutoff point (previously hard-coded to 5). This fixes a bug in query-runner where the stats were not computed according to the cutoff point specified in the configuration.
    • ir_eval::avg_p now correctly stops computing after num_docs. Before, if you specified num_docs as a smaller value than the size of the result list, it would erroneously keep calculating until the end of the result list instead of stopping after num_docs elements.
    • {inverted,forward}_index can now be loaded from read-only filesystems.

    Model File Checksums (sha256)

    d29bf8b4cbeef21db087cf8042efe5afe25c7bd3c460997728d58b92c24ec283  beam-search-constituency-parser-4.tar.gz
    ce44c7d96a8339ff4b597f35a35534ccf93ab99b7d45cbbdddffe7e362b9c20e  crf.tar.gz
    672b10c398c1a193ba91dc8c0493d729ad3f73d9192ef33100baeb8afd4f5cde  gigaword-embeddings-50d.tar.gz
    40cd87901eb29b69e57e4bca14bc2539d7d6b4ad5c186d6f3b1532a60c5163b0  greedy-constituency-parser.tar.gz
    a0a3814c1f82780f1296d600eba260f474420aa2d93f000e390c71a0ddac42d9  greedy-perceptron-tagger.tar.gz
    
    Source code(tar.gz)
    Source code(zip)
    beam-search-constituency-parser-4.tar.gz(172.62 MB)
    crf.tar.gz(7.18 MB)
    gigaword-embeddings-50d.tar.gz(312.09 MB)
    greedy-constituency-parser.tar.gz(52.78 MB)
    greedy-perceptron-tagger.tar.gz(6.31 MB)
  • v3.0.0(Feb 13, 2017)

    New features

    • Add an embedding_analyzer that represents documents with their averaged word vectors.

    • Add a parallel::reduction algorithm designed for parallelizing complex accumulation operations (like an E step in an EM algorithm)

    • Parallelize feature counting in feature selector using the new parallel::reduction

    • Add a parallel::for_each_block algorithm to run functions on (relatively) equal sub-ranges of an iterator range in parallel

    • Add a parallel merge sort as parallel::sort

    • Add a util/traits.h header for general useful traits

    • Add a Markov model implementation in sequence::markov_model

    • Add a generic unsupervised HMM implementation. This implementation supports HMMs with discrete observations (what is used most often) and sequence observations (useful for log mining applications). The forward-backward algorithm is implemented using both the scaling method and the log-space method. The scaling method is used by default, but the log-space method is useful for HMMs with sequence observations to avoid underflow issues when the output probabilities themselves are very small.

    • Add the KL-divergence retrieval function using pseudo-relevance feedback with the two-component mixture-model approach of Zhai and Lafferty, called kl_divergence_prf. This ranker internally can use any language_model_ranker subclass like dirichlet_prior or jelinek_mercer to perform the ranking of the feedback set and the result documents with respect to the modified query.

      The EM algorithm used for the two-component mixture model is provided as the index::feedback::unigram_mixture free function and returns the feedback model.

    • Add the Rocchio algorithm (rocchio) for pseudo-relevance feedback in the vector space model.

    • Breaking Change. To facilitate the above two changes, we have also broken the ranker hierarchy into one more level. At the top we have ranker, which has a pure virtual function rank() that can be overridden to provide entirely custom ranking behavior. This is the class the KL-divergence and Rocchio methods derive from, as we need to re-define what it means to rank documents (first retrieving a feedback set, then ranking documents with respect to an updated query).

      Most of the time, however, you will want to derive from the second level ranking_function, which is what was called ranker before. This class provides a definition of rank() to perform document-at-a-time ranking, and expects deriving classes to instead provide initial_score() and score_one() implementations to define the scoring function used for each document. Existing code that derived from ranker prior to this version of MeTA likely needs to be changed to instead derive from ranking_function.

    • Add the util::transform_iterator class and util::make_transform_iterator function for providing iterators that transform their output according to a unary function.

    • Breaking Change. whitespace_tokenizer now emits only word tokens by default, suppressing all whitespace tokens. The old default was to emit tokens containing whitespace in addition to actual word tokens. The old behavior can be obtained by passing false to its constructor, or setting suppress-whitespace = false in its configuration group in config.toml. (Note that whitespace tokens are still needed if using a sentence_boundary filter but, in nearly all circumstances, icu_tokenizer should be preferred.)

    • Breaking Change. Co-occurrence counting for embeddings now uses history that crosses sentence boundaries by default. The old behavior (clearing the history when starting a new sentence) can be obtained by ensuring that a tokenizer is being used that emits sentence boundary tags and by setting break-on-tags = true in the [embeddings] table of config.toml.

    • Breaking Change. All references in the embeddings library to "coocur" have changed to "cooccur". This means that some files and binaries have been renamed. Much of the co-occurrence counting part of the embeddings library has also been moved to the public API.

    • Co-occurrence counting now is performed in parallel. Behavior of its merge strategy can be configured with the new [embeddings] config parameter merge-fanout = n, which specifies the maximum number of on-disk chunks to allow before kicking off a multi-way merge (default 8).

    Enhancements

    • Add additional packed_write and packed_read overloads: for std::pair, stats::dirichlet, stats::multinomial, util::dense_matrix, and util::sparse_vector
    • Additional functions have been added to ranker_factory to allow construction/loading of language_model_ranker subclasses (useful for the kl_divergence_prf implementation)
    • Add a util::make_fixed_heap helper function to simplify the declaration of util::fixed_heap classes with lambda function comparators.
    • Add regression tests for rankers MAP and NDCG scores. This adds a new dataset cranfield that contains non-binary relevance judgments to facilitate these new tests.
    • Bump bundled version of ICU to 58.2.

    Bug Fixes

    • Fix bug in NDCG calculation (ideal-DCG was computed using the wrong sorting order for non-binary judgments)
    • Fix bug where the final chunks to be merged in index creation were not being deleted when merging completed
    • Fix bug where GloVe training would allocate the embedding matrix before starting the shuffling process, causing it to exceed the "max-ram" config parameter.
    • Fix bug with consuming MeTA from a build directory with cmake when building a static ICU library. meta-utf is now forced to be a shared library, which (1) should save on binary sizes and (2) ensures that the statically built ICU is linked into the libmeta-utf.so library to avoid undefined references to ICU functions.
    • Fix bug with consuming Release-mode MeTA libraries from another project being built in Debug mode. Before, identifiers.h would change behavior based on the NDEBUG macro's setting. This behavior has been removed, and opaque identifiers are always on.

    Deprecation

    • disk_index::doc_name and disk_index::doc_path have been deprecated in favor of the more general (and less confusing) metadata(). They will be removed in a future major release.
    • Support for 32-bit architectures is provided on a best-effort basis. MeTA makes heavy use of memory mapping, which is best paired with a 64-bit address space. Please move to a 64-bit platform for using MeTA if at all possible (most consumer machines should support 64-bit if they were made in the last 5 years or so).

    Model File Checksums (sha256)

    d29bf8b4cbeef21db087cf8042efe5afe25c7bd3c460997728d58b92c24ec283  beam-search-constituency-parser-4.tar.gz
    ce44c7d96a8339ff4b597f35a35534ccf93ab99b7d45cbbdddffe7e362b9c20e  crf.tar.gz
    672b10c398c1a193ba91dc8c0493d729ad3f73d9192ef33100baeb8afd4f5cde  gigaword-embeddings-50d.tar.gz
    40cd87901eb29b69e57e4bca14bc2539d7d6b4ad5c186d6f3b1532a60c5163b0  greedy-constituency-parser.tar.gz
    a0a3814c1f82780f1296d600eba260f474420aa2d93f000e390c71a0ddac42d9  greedy-perceptron-tagger.tar.gz
    

    Please note that the embeddings model has changed. Please re-download.

    Source code(tar.gz)
    Source code(zip)
    beam-search-constituency-parser-4.tar.gz(172.62 MB)
    crf.tar.gz(7.18 MB)
    gigaword-embeddings-50d.tar.gz(312.09 MB)
    greedy-constituency-parser.tar.gz(52.78 MB)
    greedy-perceptron-tagger.tar.gz(6.31 MB)
  • v2.4.2(Sep 23, 2016)

    Bug Fixes

    • Properly shuffle documents when doing an even-split classification test
    • Make forward indexer listen to indexer-num-threads config option.
    • Use correct number of threads when deciding block sizes for parallel_for
    • Add workaround to filesystem::remove_all for Windows systems to avoid spurious failures caused by virus scanners keeping files open after we deleted them
    • Fix invalid memory access in gzstreambuf::underflow

    Model File Checksums (sha256)

    d29bf8b4cbeef21db087cf8042efe5afe25c7bd3c460997728d58b92c24ec283  beam-search-constituency-parser-4.tar.gz
    ce44c7d96a8339ff4b597f35a35534ccf93ab99b7d45cbbdddffe7e362b9c20e  crf.tar.gz
    2a75ab9750ad2eabfe1b53889b15a31f79bd2315f71c2a4a62f6364586a6042d  gigaword-embeddings-50d.tar.gz
    40cd87901eb29b69e57e4bca14bc2539d7d6b4ad5c186d6f3b1532a60c5163b0  greedy-constituency-parser.tar.gz
    a0a3814c1f82780f1296d600eba260f474420aa2d93f000e390c71a0ddac42d9  greedy-perceptron-tagger.tar.gz
    
    Source code(tar.gz)
    Source code(zip)
    beam-search-constituency-parser-4.tar.gz(172.62 MB)
    crf.tar.gz(7.18 MB)
    gigaword-embeddings-50d.tar.gz(312.10 MB)
    greedy-constituency-parser.tar.gz(52.78 MB)
    greedy-perceptron-tagger.tar.gz(6.31 MB)
  • v2.4.1(Sep 8, 2016)

    Bug fixes

    • Eliminate excess warnings on Darwin about double preprocessor definitions
    • Fix issue finding config.h when used as a sub-project via add_subdirectory()

    Model File Checksums (sha256)

    d29bf8b4cbeef21db087cf8042efe5afe25c7bd3c460997728d58b92c24ec283  beam-search-constituency-parser-4.tar.gz
    ce44c7d96a8339ff4b597f35a35534ccf93ab99b7d45cbbdddffe7e362b9c20e  crf.tar.gz
    2a75ab9750ad2eabfe1b53889b15a31f79bd2315f71c2a4a62f6364586a6042d  gigaword-embeddings-50d.tar.gz
    40cd87901eb29b69e57e4bca14bc2539d7d6b4ad5c186d6f3b1532a60c5163b0  greedy-constituency-parser.tar.gz
    a0a3814c1f82780f1296d600eba260f474420aa2d93f000e390c71a0ddac42d9  greedy-perceptron-tagger.tar.gz
    
    Source code(tar.gz)
    Source code(zip)
    beam-search-constituency-parser-4.tar.gz(172.62 MB)
    crf.tar.gz(7.18 MB)
    gigaword-embeddings-50d.tar.gz(312.10 MB)
    greedy-constituency-parser.tar.gz(52.78 MB)
    greedy-perceptron-tagger.tar.gz(6.31 MB)
  • v2.4.0(Sep 7, 2016)

    New features

    • Add a minimal perfect hashing implementation for language_model, and unify the querying interface with the existing language model.

    • Add a CMake install() command to install MeTA as a library (issue #143). For example, once the library is installed, users can do:

      find_package(MeTA 2.4 REQUIRED)
      
      add_executable(my-program src/my_program.cpp)                                
      target_link_libraries(my-program meta-index) # or whatever libs needed from MeTA
      
    • Feature selection functionality added to multiclass_dataset and binary_dataset and views (issues #111, #149 and PR #150 thanks to @siddshuk).

      auto selector = features::make_selector(*config, training_vw);               
      uint64_t total_features_selected = 20;                                       
      selector->select(total_features_selected);                                   
      auto filtered_dset = features::filter_dataset(dset, *selector);              
      
    • Users can now, similar to hash_append, declare standalone functions in the
      same scope as their type called packed_read and packed_write which will be called by io::packed::read and io::packed::write, respectively, via argument-dependent lookup.

    Bug fixes

    • Fix edge-case bug in the succinct data structures
    • Fix off-by-one error in lm::diff

    Enhancements

    • Added functionality to the meta::hashing library: hash_append overload for std::vector, manually-seeded hash function
    • Further isolate ICU in MeTA to allow CMake to install()
    • Updates to EWS (UIUC) build guide
    • Add std::vector operations to io::packed
    • Consolidated all variants of chunk iterators into one template
    • Add MeTA's citation to the README!

    Model File Checksums (sha256)

    d29bf8b4cbeef21db087cf8042efe5afe25c7bd3c460997728d58b92c24ec283  beam-search-constituency-parser-4.tar.gz
    ce44c7d96a8339ff4b597f35a35534ccf93ab99b7d45cbbdddffe7e362b9c20e  crf.tar.gz
    2a75ab9750ad2eabfe1b53889b15a31f79bd2315f71c2a4a62f6364586a6042d  gigaword-embeddings-50d.tar.gz
    40cd87901eb29b69e57e4bca14bc2539d7d6b4ad5c186d6f3b1532a60c5163b0  greedy-constituency-parser.tar.gz
    a0a3814c1f82780f1296d600eba260f474420aa2d93f000e390c71a0ddac42d9  greedy-perceptron-tagger.tar.gz
    
    Source code(tar.gz)
    Source code(zip)
    beam-search-constituency-parser-4.tar.gz(172.62 MB)
    crf.tar.gz(7.18 MB)
    gigaword-embeddings-50d.tar.gz(312.10 MB)
    greedy-constituency-parser.tar.gz(52.78 MB)
    greedy-perceptron-tagger.tar.gz(6.31 MB)
  • v2.3.0(Aug 2, 2016)

    New features

    • Forward and inverted indexes are now stored in one directory. To make use of your existing indexes, you will need to move their directories. For example, a configuration that used to look like the following

      dataset = "20newsgroups"
      corpus = "line.toml"
      forward-index = "20news-fwd"
      inverted-index = "20news-inv"
      

      will now look like the following

      dataset = "20newsgroups"
      corpus = "line.toml"
      index = "20news-index"
      

      and your folder structure should now look like

      20news-index
      ├── fwd
      └── inv
      

      You can do this by simply moving the old folders around like so:

      mkdir 20news-index
      mv 20news-fwd 20news-index/fwd
      mv 20news-inv 20news-index/inv
      
    • stats::multinomial now can report the number of unique event types counted (unique_events())

    • std::vector can now be hashed via hash_append.

    Bug fixes

    • Fix rounding bug in language model-based rankers. This bug caused severely degraded performance for these rankers with short queries. The unit tests have been improved to prevent such a regression in the future.

    Enhancements

    • The bundled ICU version has been bumped to ICU 57.1.
    • MeTA will now attempt to build its own version of ICU on Windows if it fails to find a suitable ICU installed.
    • CI support for GCC 6.x was added for all three major platforms.
    • CI support also uses a fixed version of LLVM/libc++ instead of trunk.

    Model File Checksums (sha256)

    d29bf8b4cbeef21db087cf8042efe5afe25c7bd3c460997728d58b92c24ec283  beam-search-constituency-parser-4.tar.gz
    ce44c7d96a8339ff4b597f35a35534ccf93ab99b7d45cbbdddffe7e362b9c20e  crf.tar.gz
    2a75ab9750ad2eabfe1b53889b15a31f79bd2315f71c2a4a62f6364586a6042d  gigaword-embeddings-50d.tar.gz
    40cd87901eb29b69e57e4bca14bc2539d7d6b4ad5c186d6f3b1532a60c5163b0  greedy-constituency-parser.tar.gz
    a0a3814c1f82780f1296d600eba260f474420aa2d93f000e390c71a0ddac42d9  greedy-perceptron-tagger.tar.gz
    
    Source code(tar.gz)
    Source code(zip)
    beam-search-constituency-parser-4.tar.gz(172.62 MB)
    crf.tar.gz(7.18 MB)
    gigaword-embeddings-50d.tar.gz(312.10 MB)
    greedy-constituency-parser.tar.gz(52.78 MB)
    greedy-perceptron-tagger.tar.gz(6.31 MB)
  • v2.2.0(Apr 9, 2016)

    New features

    • Parallelized versions of PageRank and Personalized PageRank have been added. A demo is available in wiki-page-rank; see the website for more information on obtaining the required data.
    • Add a disk-based streaming minimal perfect hash function library. A sub-component of this is a small memory-mapped succinct data structure library for answering rank/select queries on bit vectors.
    • Much of our CMake magic has been moved into a separate project included as a submodule: https://github.com/meta-toolkit/meta-cmake, which can now be used in other projects to simplify initial build system configuration.

    Bug fixes

    • Fix parameter settings in language model rankers not being range checked (issue #134).
    • Fix incorrect incoming edge insertion in directed_graph::add_edge().
    • Fix find_first_of and find_last_of in util::string_view.

    Enhancements

    • forward_index now knows how to tokenize a document down to a feature_vector, provided it was generated with a non-LIBSVM analyzer.
    • Allow loading of an existing index where its corpus is no longer available.
    • Data is no longer shuffled in batch_train. Shuffling the data causes horrible access patterns in the postings file, so the data should instead be shuffled before indexing.
    • util::array_views can now be constructed as empty.
    • util::multiway_merge has been made more generic. You can now specify both the comparison function and merging criteria as parameters, which default to operator< and operator==, respectively.
    • Simple utility classes io::mifstream and io::mofstream have been added for places where a moveable ifstream or ofstream is desired, as a workaround for older standard libraries lacking these move constructors.
    • The number of indexing threads can be controlled via the configuration key indexer-num-threads (which defaults to the number of threads on the system), and the number of threads allowed to concurrently write to disk can be controlled via indexer-max-writers (which defaults to 8).

    Model File Checksums (sha256)

    d29bf8b4cbeef21db087cf8042efe5afe25c7bd3c460997728d58b92c24ec283  beam-search-constituency-parser-4.tar.gz
    ce44c7d96a8339ff4b597f35a35534ccf93ab99b7d45cbbdddffe7e362b9c20e  crf.tar.gz
    2a75ab9750ad2eabfe1b53889b15a31f79bd2315f71c2a4a62f6364586a6042d  gigaword-embeddings-50d.tar.gz
    40cd87901eb29b69e57e4bca14bc2539d7d6b4ad5c186d6f3b1532a60c5163b0  greedy-constituency-parser.tar.gz
    a0a3814c1f82780f1296d600eba260f474420aa2d93f000e390c71a0ddac42d9  greedy-perceptron-tagger.tar.gz
    
    Source code(tar.gz)
    Source code(zip)
    beam-search-constituency-parser-4.tar.gz(172.62 MB)
    crf.tar.gz(7.18 MB)
    gigaword-embeddings-50d.tar.gz(312.10 MB)
    greedy-constituency-parser.tar.gz(52.78 MB)
    greedy-perceptron-tagger.tar.gz(6.31 MB)
  • v2.1.0(Feb 13, 2016)

    New features

    • Add the GloVe algorithm for training word embeddings and a library class word_embeddings for loading and querying trained embeddings. To facilitate returning word embeddings, a simple util::array_view class was added.
    • Add simple vector math library (and move fastapprox into the math namespace).

    Bug fixes

    • Fix probe_map::extract() for inline_key_value_storage type; old implementation forgot to delete all sentinel values before returning the vector.
    • Fix incorrect definition of l1norm() in sgd_model.
    • Fix gmap calculation where 0 average precision was ignored
    • Fix progress output in multiway_merge.

    Enhancements

    • Improve performance of printing::progress. Before, progress::operator() in tight loops could dramatically hurt performance, particularly due to frequent calls to std::chrono::steady_clock::now(). Now, progress::operator() simply sets an atomic iteration counter and a background thread periodically wakes to update the progress output.
    • Allow full text storage in index as metadata field. If store-full-text = true (default false) in the corpus config, the string metadata field "content" will be added. This is to simplify the creation of full text metadata: the user doesn't have to duplicate their dataset in metadata.dat, and metadata.dat will still be somewhat human-readable without large strings of full text added.
    • Allow make_index to take a user-supplied corpus object.

    Miscellaneous

    • ZLIB is now a required dependency.
    • Switch to just using the standalone ./unit-test instead of ctest. There aren't many advantages to using CTest at this point with the new unit test framework, so we just use our own unit test executable.
    Source code(tar.gz)
    Source code(zip)
    beam-search-constituency-parser-4.tar.gz(172.62 MB)
    crf.tar.gz(7.18 MB)
    gigaword-embeddings-50d.tar.gz(312.10 MB)
    greedy-constituency-parser.tar.gz(52.78 MB)
    greedy-perceptron-tagger.tar.gz(6.31 MB)
  • v2.0.1(Feb 5, 2016)

    Bug fixes

    • Fix issue where metadata_parser would not consume spaces in string metadata fields. Thanks to hopsalot on the forum for the bug report!
    • Fix build issue on OS X with Xcode 6.4 and clang related to their shipped version of string_view lacking a const to_string() method

    Enhancements

    • The ./profile executable ensures that the file exists before operating on it. Thanks to @domarps for the PR!
    • Add a generic util::multiway_merge algorithm for performing the merge-step of an external memory merge sort.
    • Build with the following Xcode versions on Travis CI:
      • Xcode 6.1 and OS X 10.9 (as before)
      • Xcode 6.4 and OS X 10.10 (new)
      • Xcode 7.1.1 and OS X 10.10 (new)
      • Xcode 7.2 and OS X 10.11 (new)
    Source code(tar.gz)
    Source code(zip)
    beam-search-constituency-parser-4.tar.gz(172.62 MB)
    crf.tar.gz(7.18 MB)
    greedy-constituency-parser.tar.gz(52.78 MB)
    greedy-perceptron-tagger.tar.gz(6.31 MB)
  • v2.0.0(Jan 30, 2016)

    New features and major changes

    Indexing

    • Index format rewrite: both inverted and forward indices now use the same compressed postings format, and intermediate chunks are now also compressed on-the-fly. There is now a built-in tool to dump any forward index to libsvm format (as this is no longer the on-disk format for that type of index).
    • Metadata support: indices can now store arbitrary metadata associated with individual documents with string, integer, unsigned integer, and floating point values
    • Corpus configuration is now stored within the corpus directory itself, allowing for corpora to be distributed with their proper configurations rather than having to bake this into the main configuration file
    • RAM limits can be set for the indexing process via the configuration file. These are approximate and based on heuristics, so you should always set these to lower than available RAM.
    • Forward indices can now be created directly instead of forcing the creation of an inverted index first

    Tokenization and Analysis

    • ICU will be built and statically linked on both OS X and Linux if the system-provided library is too old. MeTA now specifies an exact version of ICU that should be used per release for consistency; that version is 56.1 as of this release.
    • Analyzers have been modified to support both integral and floating point values via the use of the featurizer object passed to tokenize()
    • Documents no longer store any count information during the analysis process

    Ranking

    • Postings lists can now be read in a streaming fashion rather than all at once via postings_stream
    • Ranking is now performed using a document-at-a-time scheme
    • Ranking functions now use fast approximate math from fastapprox
    • Rank correlation measures have been added to the evaluation library

    Language Model

    • Rewrite of the language model library which can load models from the .arpa format
    • SyntacticDiff implementation for comparative text mining, which may include grammatical error correction, summarization, or feature generation

    Machine Learning

    • A feature selection library for selecting features for machine learning using chi square, information gain, correlation coefficient, and odds ratio has been added
    • The API for the machine learning algorithms has been changed to use dataset classes; these are separate from the index classes and represent data that is memory-resident
    • Support for regression has been added (currently only via SGD)
    • The SGD algorithm has been improved to use a normalized adaptive gradient method which should make it less sensitive to feature scaling
    • The SGD algorithm now supports (approximate) L1 regularization via a cumulative penalty approach
    • The libsvm modules are now also built using CMake

    Miscellaneous

    • Packed binary I/O functions allow for writing integers/floating point values in a compressed format that can be efficiently decoded. This should be used for most binary I/O that needs to be performed in the toolkit unless there is a specific reason not to.
    • An interactive demo application has been added for the shift-reduce constituency parser
    • A string_view class is provided in the meta::util namespace to be used for non-owning references to strings. This will use std::experimental::string_view if available and our own implementation if not
    • meta::util::optional will resolve to std::experimental::optional if it is available
    • Support for jemalloc has been added to the build system. We strongly recommend installing and linking against jemalloc for improved indexing performance.
    • A tool has been added to print out the top k terms in a corpus
    • A new library for hashing has been added in namespace meta::hashing. This includes a generic framework for writing hash functions that are randomly keyed as well as (insertion only) probing-based hash sets/maps with configurable resizing and probing strategies
    • A utility class fixed_heap has been added for places where a fixed size set of maximal/minimal values should be maintained in constant space
    • The filesystem management routines have been converted to use STLsoft in the event that the filesystem library in std::experimental::filesystem is not available
    • Building MeTA on Windows is now officially supported via MSYS2 and MinGW-w64, and continuous integration now builds it on every commit in this environment
    • A small support library for things related to random number generation has been added in meta::random
    • Sparse vectors now support operator+ and operator-
    • An STL container compatible allocator aligned_allocator<T, Alignment> has been added that can over-align data (useful for performance in some situations)
    • Bandit is now used for the unit tests, and these have been substantially improved upon
    • io::parser deprecated and removed; most uses simply converted to std::fstream
    • binary_file_{reader,writer} deprecated and removed; io::packed or io::{read,write}_binary should be used instead

    Bug fixes

    • knn classifier now only requests the top k when performing classification
    • An issue where uncompressed model files would not be found if using a zlib-enabled build (#101)

    Enhancements

    • Travis CI integration has been switched to their container infrastructure, and it now builds with OS X with Clang in addition to Linux with Clang and GCC
    • Appveyor CI for Windows builds alongside Travis
    • Indexing speeds are dramatically faster (thanks to many changes both in the in-memory posting chunks as well as optimizations in the tokenization process)
    • If no build type is specified, MeTA will be built in Release mode
    • The cpptoml dependency version has been bumped, allowing the use of things like value_or for cleaner code
    • The identifiers library has been dramatically simplified
    Source code(tar.gz)
    Source code(zip)
    beam-search-constituency-parser-4.tar.gz(172.62 MB)
    crf.tar.gz(7.18 MB)
    greedy-constituency-parser.tar.gz(52.78 MB)
    greedy-perceptron-tagger.tar.gz(6.31 MB)
  • v1.3.8(Sep 1, 2015)

    Bug fixes

    • Fix issue with confusion_matrix where precision and recall values were swapped. Thanks to @husseinhazimeh for finding this!

    Enhancements

    • Better unit tests for confusion_matrix
    • Add functions to confusion_matrix to directly access precision, recall, and F1 score
    • Create a predicted_label opaque identifier to emphasize class_labels that are output from some model (and thus shouldn't be interchangeable)
    Source code(tar.gz)
    Source code(zip)
    beam-search-constituency-parser-4.tar.gz(171.03 MB)
    crf.tar.gz(7.39 MB)
    greedy-constituency-parser.tar.gz(51.60 MB)
    greedy-perceptron-tagger.tar.gz(6.47 MB)
  • v1.3.7(Jun 13, 2015)

    Bug fixes

    • Fix inconsistent behavior of utf::segmenter (and thus icu_tokenizer) for different locales. Thanks @CanoeFZH and @tng-konrad for helping debug this!

    Enhancements

    • Allow for specifying the language and country for locale generation in setting up utf::segmenter (and thus icu_tokenizer)
    • Allow for suppression of <s> and </s> tags within icu_tokenizer, mostly useful for information retrieval experiments with unigram words. Thanks @husseinhazimeh for the suggestion!
    • Add a default-unigram-chain filter chain preset which is suitable for information retrieval experiments using unigram words. Thanks @husseinhazimeh for the suggestion!
    Source code(tar.gz)
    Source code(zip)
    beam-search-constituency-parser-4.tar.gz(171.03 MB)
    crf.tar.gz(7.39 MB)
    greedy-constituency-parser.tar.gz(51.60 MB)
    greedy-perceptron-tagger.tar.gz(6.47 MB)
  • v1.3.6(Jun 3, 2015)

  • v1.3.4(Apr 7, 2015)

    New features

    • Support building with biicode
    • Add Vagrantfile for virtual machine configuration
    • Add Dockerfile for Docker support

    Enhancements

    • Improve ir_eval unit tests

    Bug fixes

    • Fix ir_eval::ndcg incorrect log base and addition instead of subtraction in IDCG calculation
    • Fix ir_eval::avg_p incorrect early termination
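    The IDCG fix follows the standard NDCG definition. A minimal sketch (not ir_eval's actual interface), assuming the common 2^rel − 1 gain with a log2 positional discount:

    ```cpp
    // Illustrative NDCG computation; not MeTA's ir_eval implementation.
    #include <algorithm>
    #include <cassert>
    #include <cmath>
    #include <cstddef>
    #include <functional>
    #include <vector>

    // DCG with 1-based positions: gain (2^rel - 1) discounted by log2(pos + 1)
    double dcg(const std::vector<double>& rels) {
        double sum = 0.0;
        for (std::size_t i = 0; i < rels.size(); ++i)
            sum += (std::pow(2.0, rels[i]) - 1.0) / std::log2(i + 2.0);
        return sum;
    }

    double ndcg(std::vector<double> rels) {
        double actual = dcg(rels);
        // IDCG: the same list re-sorted into the ideal (descending) order
        std::sort(rels.begin(), rels.end(), std::greater<double>());
        return actual / dcg(rels);
    }

    int main() {
        // a perfectly ranked list has NDCG == 1
        assert(std::abs(ndcg({3, 2, 1}) - 1.0) < 1e-9);
        // swapping the top two results lowers the score
        assert(ndcg({2, 3, 1}) < 1.0);
        return 0;
    }
    ```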
    Source code(tar.gz)
    Source code(zip)
    beam-search-constituency-parser-4.tar.gz(171.03 MB)
    crf.tar.gz(7.39 MB)
    greedy-constituency-parser.tar.gz(51.60 MB)
    greedy-perceptron-tagger.tar.gz(6.47 MB)
  • v1.3.3(Mar 24, 2015)

    Bug fixes

    • Fix issues with system-defined integer widths in binary model files (mainly impacted the greedy tagger and parser); please re-download any parser model files you may have had before
    • Fix bug where parser model directory is not created if a non-standard prefix is used (anything other than "parser")

    Enhancements

    • Silence inconsistent missing overrides warning on clang >= 3.6
    Source code(tar.gz)
    Source code(zip)
    beam-search-constituency-parser-4.tar.gz(171.03 MB)
    crf.tar.gz(7.39 MB)
    greedy-constituency-parser.tar.gz(51.60 MB)
    greedy-perceptron-tagger.tar.gz(6.47 MB)
  • v1.3.2(Mar 18, 2015)

  • v1.3.1(Mar 5, 2015)

  • v1.3(Mar 4, 2015)

    New features:

    • additions to the graph library:
      • myopic search
      • BFS
      • preferential attachment graph generation model (supports node attractiveness from different distributions)
      • betweenness centrality
      • eigenvector centrality
    • added a new natural language parsing library:
      • parse tree library (visitor-based)
      • shift-reduce constituency parser for generating phrase structure trees
      • reimplementation of evalb metrics for evaluating parsers
      • new filter for Penn Treebank-style normalization
    • added a greedy averaged Perceptron-based tagger
    • demo application for various basic text processing (profile)
    • basic iostreams that support gzip compression (if compiled with ZLib support)
    • added iteration method for stats::multinomial seen events
    • added expected value and entropy functions to stats namespace
    • added linear_model: a generic multiclass classifier storage class
    • added gz_corpus: a compressed version of line_corpus
    • added macros for generating type safe identifiers with user defined literal suffixes
    • added a persistent stack data structure to meta::util
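    A persistent stack like the one added to meta::util can be sketched as an immutable linked list with shared nodes; this is an illustrative implementation, not MeTA's:

    ```cpp
    // A minimal persistent (immutable) stack sketch. Each push/pop returns a
    // new stack value; older versions remain valid and share node storage.
    #include <cassert>
    #include <memory>
    #include <utility>

    template <class T>
    class persistent_stack {
        struct node {
            T value;
            std::shared_ptr<const node> next;
        };
        std::shared_ptr<const node> head_;

        explicit persistent_stack(std::shared_ptr<const node> h)
            : head_{std::move(h)} {}

      public:
        persistent_stack() = default;

        persistent_stack push(T v) const {
            return persistent_stack{
                std::make_shared<const node>(node{std::move(v), head_})};
        }
        persistent_stack pop() const { return persistent_stack{head_->next}; }
        const T& top() const { return head_->value; }
        bool empty() const { return !head_; }
    };

    int main() {
        persistent_stack<int> empty;
        auto one = empty.push(1);
        auto two = one.push(2);
        assert(two.top() == 2);
        assert(one.top() == 1);       // older version is unaffected
        assert(two.pop().top() == 1); // pop() shares structure with `one`
        return 0;
    }
    ```

    Structural sharing is what makes this useful for search algorithms (e.g. the shift-reduce parser's beam), where many candidate states extend a common prefix.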

    Enhancements:

    • added operator== for util::optional<T>
    • better CMake support for building the libsvm modules
    • better CMake support for downloading unit-test data
    • improved setup guide in README (for OS X, Ubuntu, Arch, and EWS/ENGRIT)
    • tree analyzers refactored to use the new parser library (removes dependency on outside toolkits for generating tree files)
    • analyzers that are not part of the "core" have been moved into their respective folders (so ngram_pos_analyzer is in src/sequence, tree_analyzer is in src/parser)
    • make_index now checks whether the files exist before loading an index and, if they are missing, creates a new one (instead of just throwing an exception on a nonexistent file)
    • cpptoml upgraded to support TOML v0.4.0
    • enable extra warnings (-Wextra) for clang++ and g++

    Bug fixes:

    • fix sequence_analyzer::analyze() const when applied to untagged sequences (was throwing when it shouldn't)
    • ensure that the inverted index object is destroyed before uninverting occurs during the creation of a forward_index
    • fix bug where icu_tokenizer would output spaces as tokens
    • fix bugs where index objects were not destroyed before trying to delete their files in the unit tests
    • fix bug in sparse_vector::find() where it would return a non-end iterator when asked to find an element that does not exist
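    The sparse_vector::find() contract fixed above is easy to get wrong when the implementation sits on a sorted array: std::lower_bound alone can return an iterator to the *next* larger index. A simplified sketch (not MeTA's implementation) of the corrected behavior:

    ```cpp
    // Simplified sparse-vector find(): must return end() for a missing
    // index, never an iterator to a different element.
    #include <algorithm>
    #include <cassert>
    #include <cstdint>
    #include <utility>
    #include <vector>

    struct sparse_vec {
        // (index, value) pairs kept sorted by index
        std::vector<std::pair<std::uint64_t, double>> storage;

        std::vector<std::pair<std::uint64_t, double>>::iterator
        find(std::uint64_t idx) {
            auto it = std::lower_bound(
                storage.begin(), storage.end(), idx,
                [](const std::pair<std::uint64_t, double>& p,
                   std::uint64_t i) { return p.first < i; });
            // lower_bound may land on the next larger index, so the key
            // must be checked before the iterator is returned
            if (it != storage.end() && it->first == idx)
                return it;
            return storage.end();
        }
    };

    int main() {
        sparse_vec v{{{1, 0.5}, {4, 2.0}}};
        assert(v.find(4) != v.storage.end());
        assert(v.find(2) == v.storage.end()); // missing index -> end()
        return 0;
    }
    ```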
    Source code(tar.gz)
    Source code(zip)
    beam-search-constituency-parser-4.tar.gz(174.80 MB)
    crf.tar.gz(7.39 MB)
    greedy-constituency-parser.tar.gz(51.58 MB)
    greedy-perceptron-tagger.tar.gz(6.46 MB)
  • v1.2(Mar 5, 2015)

    New features:

    • demo application for CRF-based POS tagging
    • nearest_centroid classifier
    • basic statistics library for representing relevant probability distributions
    • sparse_vector utility class

    Enhancements:

    • ngram_pos_analyzer now uses the CRF internally (issue #46)
    • knn classifier now supports weighted knn
    • classifier cross validation can now optionally create even class splits
    • filesystem::copy_file() no longer hangs without progress reporting with large files
    • CMake build system now includes INTERFACE targets (better inclusion as subproject in external projects)
    • MeTA can now (optionally) be built with C++14 support

    Bug fixes:

    • language_model_ranker scoring function corrected (issue #50)
    • naive_bayes classifier scoring corrected
    • several incorrect instances of numeric_limits::min() replaced with numeric_limits::lowest()
    • compilation fixed with versions of ICU < 4.4
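    The min()/lowest() replacement is worth spelling out, since it is a classic floating-point gotcha:

    ```cpp
    // For floating-point types, numeric_limits::min() is the smallest
    // *positive* normalized value, not the most negative one. Seeding a
    // running maximum with it silently breaks on all-negative scores;
    // numeric_limits::lowest() is the correct sentinel.
    #include <cassert>
    #include <limits>

    int main() {
        static_assert(std::numeric_limits<double>::min() > 0,
                      "min() is positive for floating point");
        assert(std::numeric_limits<double>::lowest() < 0);

        // finding the max of all-negative scores
        double scores[] = {-3.0, -1.5, -7.0};
        double best = std::numeric_limits<double>::lowest();
        for (double s : scores)
            if (s > best)
                best = s;
        assert(best == -1.5); // min() as the seed would never be replaced
        return 0;
    }
    ```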
    Source code(tar.gz)
    Source code(zip)