An extremely fast FEC filing parser written in C

FastFEC

A C program to stream and parse FEC filings, writing output to CSV. This project is in early stages but works on a wide variety of filings and will benefit from additional rigorous testing.

Usage

Once you've downloaded the latest release or built a binary (see below), you can run it as follows:

Usage: fastfec [flags] <id, file, or url> [output directory=output] [override id]
  • [flags]: optional flags which must come before other args; see below
  • <id, file, or url> is either
    • a numeric ID, in which case the filing is streamed from the FEC website
    • a file, in which case the filing is read from disk at the specified local path
    • a url, in which case the filing is streamed from the specified remote URL
  • [output directory] is the folder in which CSV files will be written. By default, it is output/.
  • [override id] is an ID to use as the filing ID. If not specified, this ID is pulled out of the first parameter as a numeric component that can be found at the end of the path/URL.
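
As a rough illustration of the default described in the last bullet, the sketch below pulls a trailing numeric component out of an ID, a local path, or a URL. The helper name guess_filing_id and the exact matching rule are assumptions made for illustration; FastFEC's own C logic may treat edge cases differently.

```python
import re
from typing import Optional


def guess_filing_id(source: str) -> Optional[str]:
    """Return the trailing numeric component of an ID, path, or URL, if any.

    Hypothetical helper: this is not FastFEC's actual implementation.
    """
    # Keep only the last path segment; treat backslashes as separators too.
    tail = source.replace("\\", "/").rstrip("/").rsplit("/", 1)[-1]
    # Ignore a query string and a single file extension such as ".fec".
    tail = tail.split("?", 1)[0]
    stem = tail.rsplit(".", 1)[0] if "." in tail else tail
    match = re.search(r"\d+$", stem)
    return match.group(0) if match else None


if __name__ == "__main__":
    for source in (
        "13360",
        "filings/13360.fec",
        "https://docquery.fec.gov/dcdev/posted/13360.fec",
    ):
        print(source, "->", guess_filing_id(source))
```

When no numeric component can be found this way, passing [override id] explicitly is the safer choice.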

The CLI will download or read from disk the specified filing and then write output CSVs for each form type in the output directory. The paths of the outputted files are:

  • {output directory}/{filing id}/{form type}.csv
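
The path template can be sketched in Python as follows. The function name output_csv_path is hypothetical, and the slash-to-dash replacement mirrors the filename normalization described in the Issues list (form types such as "SC/10"); treat it as an approximation of the CLI's behavior rather than its exact code.

```python
from pathlib import Path


def output_csv_path(output_dir: str, filing_id: str, form_type: str) -> Path:
    """Build {output directory}/{filing id}/{form type}.csv (illustrative only)."""
    # A slash in a form type would otherwise be read as a directory separator.
    safe_form_type = form_type.replace("/", "-")
    return Path(output_dir) / filing_id / f"{safe_form_type}.csv"


if __name__ == "__main__":
    print(output_csv_path("output", "13360", "F3XN").as_posix())
    print(output_csv_path("output", "34411", "SC/10").as_posix())
```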

You can also pipe the output of another command in by following this usage:

[some command] | fastfec [flags] <id> [output directory=output]

Flags

The CLI supports the following flags:

  • --include-filing-id / -i: if this flag is passed, every generated file will include a first column called filing_id that contains the filing ID. This can be useful for bulk uploading CSVs into a database.
  • --silent / -s : suppress all non-error output messages
  • --warn / -w : show warning messages (e.g. for rows with unexpected numbers of fields or field types that don't match exactly)

The short form of flags can be combined, e.g. -is would include filing IDs and suppress output.
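
For scripting, the argument order and the combined short flags can be assembled programmatically. The sketch below only builds an argument list; the function build_fastfec_command is a hypothetical wrapper, and it assumes a fastfec binary is available on the PATH when the list is eventually executed.

```python
from typing import List, Optional


def build_fastfec_command(
    source: str,
    output_dir: str = "output",
    override_id: Optional[str] = None,
    include_filing_id: bool = False,
    silent: bool = False,
    warn: bool = False,
) -> List[str]:
    """Assemble an argument list: flags first, then positional arguments."""
    # Combine the requested short flags into a single token, e.g. "-is".
    short_flags = ""
    if include_filing_id:
        short_flags += "i"
    if silent:
        short_flags += "s"
    if warn:
        short_flags += "w"

    command = ["fastfec"]  # assumed to be on the PATH
    if short_flags:
        command.append("-" + short_flags)
    command += [source, output_dir]
    if override_id is not None:
        command.append(override_id)
    return command


if __name__ == "__main__":
    print(build_fastfec_command("13360", "fastfec_output/", silent=True))
```

The resulting list can be handed to subprocess.run(command, check=True) to invoke the CLI without going through a shell.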

Examples

fastfec -s 13360 fastfec_output/

  • This will run FastFEC in silent mode, download and parse filing ID 13360, and store the output in CSV files at fastfec_output/13360/.
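
Once a run like this has finished, the per-form CSVs can be inspected with the Python standard library. This is a minimal sketch that assumes each CSV starts with a header row and is UTF-8 encoded; the function name count_rows_per_form is made up for illustration.

```python
import csv
from pathlib import Path
from typing import Dict


def count_rows_per_form(filing_dir: str) -> Dict[str, int]:
    """Map each form type (the CSV file stem) to its number of data rows."""
    counts: Dict[str, int] = {}
    for csv_path in sorted(Path(filing_dir).glob("*.csv")):
        with csv_path.open(newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            next(reader, None)  # skip the header row (assumed to be present)
            counts[csv_path.stem] = sum(1 for _ in reader)
    return counts


if __name__ == "__main__":
    # Directory taken from the example above.
    print(count_rows_per_form("fastfec_output/13360"))
```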

Local development

Build system

Zig is used to build and compile the project. Download and install the latest version of Zig (>=0.9.0) by following the instructions on the website. You can verify that it works by typing zig in the terminal and checking that the help text is printed.

Dependencies

The following libraries are used:

  • curl (needed for the CLI, not the library)
  • pcre (only needed on Windows)

Installing these libraries varies by OS:

Mac OS X

Ensure Homebrew is installed and run the following brew command to install the libraries:

brew install pkg-config curl

Ubuntu

sudo apt install -y libcurl4-openssl-dev

Windows

Install vcpkg and run the following:

vcpkg integrate install
vcpkg install pcre curl --triplet x64-windows-static

Building

From the root directory of the repo, run:

zig build

On Windows, you may have to supply additional arguments to locate vcpkg dependencies and ensure the msvc toolchain is used:

zig build --search-prefix C:/vcpkg/packages/pcre_x64-windows-static --search-prefix C:/vcpkg/packages/curl_x64-windows-static --search-prefix C:/vcpkg/packages/zlib_x64-windows-static -Dtarget=x86_64-windows-msvc

The above commands will output a binary at zig-out/bin/fastfec and a shared library file in the zig-out/lib/ directory. If you want to only build the library, you can pass -Dlib-only=true as a build option following zig build.

Time benchmarks

Parsing the massive 1464847.fec (8.4 GB) on an M1 MacBook Air:

  • 1m 42s
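
To put that figure in perspective, the average throughput can be worked out from the two numbers above. The calculation assumes the file size is given in decimal gigabytes (1 GB = 1000 MB), which the source does not specify.

```python
def throughput_mb_per_s(size_gb: float, minutes: int, seconds: int) -> float:
    """Average throughput in decimal megabytes per second."""
    elapsed_seconds = minutes * 60 + seconds
    return size_gb * 1000 / elapsed_seconds


if __name__ == "__main__":
    # 8.4 GB parsed in 1m 42s (102 seconds).
    print(f"{throughput_mb_per_s(8.4, 1, 42):.1f} MB/s")
```

Under that assumption the run averages a little over 80 MB/s.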

Testing

Currently, there are only C tests for specific parsing/buffer/write functionality, but we hope to expand unit testing soon.

To run the current tests: zig build test

Scripts

python scripts/generate_mappings.py: A Python script to auto-generate C header files containing column header and type mappings

Issues
  • fix: f99 text as a csv field and floats to two decimals

    Instead of populating a separate f99 text file, this change produces the f99 text in the F99 CSV as intended. It also incorporates changes to provide floats to two decimals of precision.

    opened by freedmand 5
  • Cross-compiled Python package/distribution

    Description

    This PR adds a script to cross-compile wheels for the FastFEC Python package, along with a GitHub Actions workflow that generates wheels for all relevant operating systems. Instead of relying on the typical setup.py script, this approach uses a make_wheels.py script, based on https://github.com/ziglang/zig-pypi/blob/main/make_wheels.py, that automatically constructs the wheels for each OS.

    Note that there are many commits in this PR. I tried to get it working with cibuildwheel first, which would be the de facto way to do cross-platform builds, but each build took over 30 minutes and there were OS-specific issues that were hard to debug. This approach mirrors the way the actual FastFEC package is built and should be just as stable.

    Also if for whatever reason the wheel does not work, setup.py will automatically run as backup when pip install fastfec happens in the future.

    To verify this PR works, I launched a Windows VM and confirmed that the Python library could be installed. I also launched AWS Linux x86_64 and aarch64 (Graviton) instances and confirmed both could install the Python wheel.

    Jira ticket

    https://arcpublishing.atlassian.net/browse/ELEX-141

    Test steps

    • Go here https://github.com/washingtonpost/FastFEC/actions/runs/1668364069
    • Click the artifact file to download it
    • Extract the zip contents of the artifact
    • Find the path to the .whl file inside that has your desired architecture
    • Run pip install {path to the desired whl file}
    • Run the attached test.py and it should print JSON-esque output and not error!

    test.py:

    from fastfec import FastFEC
    from io import BytesIO
    
    file = b""""HDR","FEC","5.1","Navision AVF","3.00","^","","000",""
    "F3XN","C00397000","South Dakota Women Vote!","1120 Connecticut Avenue NW","Ste 1100","Washington","DC","20036","","","TER","","","","20041001","20041122","61571.00","0.00","61571.00","61571.00","0.00","0.00","0.00","0.00","0.00","0.00","0.00","0.00","0.00","0.00","0.00","0.00","0.00","0.00","0.00","0.00","0.00","0.00","0.00","0.00","0.00","0.00","61571.00","0.00","0.00","0.00","0.00","0.00","0.00","0.00","0.00","0.00","0.00","61571.00","61571.00","0.00","0.00","0.00","0.00","0.00","0.00","0.00","2004","61775.00","61775.00","61775.00","0.00","61750.00","25.00","61775.00","0.00","0.00","61775.00","0.00","0.00","0.00","0.00","0.00","0.00","0.00","61775.00","61775.00","0.00","0.00","204.00","204.00","61571.00","0.00","0.00","0.00","0.00","0.00","0.00","0.00","0.00","0.00","0.00","61775.00","61775.00","61775.00","0.00","61775.00","204.00","0.00","204.00","Caroline C. Fines","20041201","0.00","0.00","0.00","0.00","0.00","0.00","0.00","0.00","0.00","0.00","0.00","0.00"
    "SB22","C00397000","PTY","","207 East Capitol #103","PO Box 737","Pierre","SD","57501","","Transfer to Affiliate","","","20041019","61144.66","","","","","","","","","","","","","","","","SB22-2533","","","","","","","South Dakota Democratic Party ","","","","",""
    "SB22","C00397000","PAC","","805 15th St., NW #400","","Washington","DC","20005","","Transfer to Affiliate","","","20041019","426.34","","","","","","","","","","","","","","","","SB22-2532","","","","","","","EMILY's List Federal Operating ","","","","",""
    """
    
    with BytesIO(file) as f:
        with FastFEC() as fastfec:
            for form, line in fastfec.parse(f):
                print("GOT", form, line)
    
    :handshake: review in progress 
    opened by freedmand 4
  • feat: refactor python bindings, add line callback

    This is a significant refactor of the Python bindings to be more library-driven, with the purpose of pushing a Python package to the Python package index (PyPI) as soon as possible. In the process, FastFEC was modified to provide custom line callback functionality for the sake of providing a convenient Python API that mimics other popular/inspirational packages such as fecfile.

    The main file that drives this PR is python/fastfec.py. See this file for more detailed comments on how the API works. The top-level setup.py file is provided as a proof-of-concept that the Python package can be built automatically (not as a necessarily viable end result yet).

    Test Steps

    If you haven't already, ensure you can build the fastfec library by running zig build (the steps in the README outline how to do this, but tl;dr brew install zig --head if you don't yet have zig installed). This will ensure the latest shared library file is discoverable by Python (which will allow all the new changes to work).

    You can test the library by running a Python REPL (python) in the root directory of the repo. Also, make sure you have a .fec file in this directory for testing (in the following commands, it is assumed you have 13360.fec downloaded from https://docquery.fec.gov/dcdev/posted/13360.fec in this root directory — but you can sub in whatever .fec file you have handy or want to test).

    To test the line by line parsing, run:

    from python.fastfec import FastFEC
    
    with open('13360.fec', 'rb') as f:
        with FastFEC() as fastfec:
            for form, line in fastfec.parse(f):
                print("GOT", form, line)
    

    This should print line information for each line in the passed in .fec file.

    To test the file to output file parsing, run:

    from python.fastfec import FastFEC
    
    with open('13360.fec', 'rb') as f:
        with FastFEC() as fastfec:
            fastfec.parse_as_files(f,'python_output')
    

    This should return 1 to indicate success and output .csv files in the python_output directory corresponding to a successful parse.

    To test the file to output file custom parsing, run:

    import os
    from pathlib import Path
    from python.fastfec import FastFEC
    
    # Custom open method
    def open_output_file(filename, *args, **kwargs):
        filename = os.path.join('custom_python_output', filename)
        output_file = Path(filename)
        output_file.parent.mkdir(exist_ok=True, parents=True)
        return open(filename, *args, **kwargs)
    
    with open('13360.fec', 'rb') as f:
        with FastFEC() as fastfec:
            fastfec.parse_as_files_custom(f, open_output_file)
    

    This should return 1 to indicate success and output .csv files in the custom_python_output directory corresponding to a successful parse.

    :bow: changes requested 
    opened by freedmand 4
  • Add support for version 8.4

    Description

    This PR brings the mappings.json file up to date with the current version in the fecfile python library. The most noteworthy change is supporting version 8.4 of the .fec file format. Here is the corresponding commit to the fech-source library. It appears as though the only differences between versions 8.3 and 8.4 are the addition of the lobbyist_registrant_pac_3 and lobbyist_registrant_pac_4 fields to the F1, coming right after leadership_pac.

    This PR also includes a fix to version 2 of the schedule A, added by this commit.

    opened by esonderegger 2
  • NE-1284: create python wrapper for FastFEC

    Description

    create python wrapper for FastFEC

    Jira Ticket

    https://arcpublishing.atlassian.net/browse/NE-1284

    Test Steps

    • Checkout branch and set up per README as necessary
    • Go here https://s3.console.aws.amazon.com/s3/buckets/elex-fec-test?region=us-east-1&prefix=test-architecture/test-filings/&showversions=false, find a filing that is more than 0B of data AND for which there isn't a corresponding output folder here https://s3.console.aws.amazon.com/s3/buckets/elex-fec-test?region=us-east-1&prefix=test-architecture/test-fastfec-output/&showversions=false and then use that filing number in the command below, which you should run from the root of the repo:
    python python/fastfec.py -f "[FILING NUMBER GOES HERE]" -i "s3://elex-fec-test/test-architecture/test-filings" -o "s3://elex-fec-test/test-architecture/test-fastfec-output"
    

    After running the command, make sure you get something like this in the console:

    ➜  fastFEC git:(python-ctypes) ✗ python /Users/foremanH/Projects/fastFEC/python/fastfec.py -f "1375137" -i "s3://elex-fec-test/test-architecture/test-filings" -o "s3://elex-fec-test/test-architecture/test-fastfec-output"
    Filing ID is 1375137
    Input file is s3://elex-fec-test/test-architecture/test-filings/1375137
    Output file is s3://elex-fec-test/test-architecture/test-fastfec-output/1375137
    Parsing (py)
    Parsed; status 1
    1.2386590242385864e-07
    4.439614713191986e-06
    7.579103112220764e-06
    

    And make sure the output for that filing appears here: https://s3.console.aws.amazon.com/s3/buckets/elex-fec-test?region=us-east-1&prefix=test-architecture/test-fastfec-output/&showversions=false

    :hand: ready for review multiple :eyes: 
    opened by hs4man21 2
  • fix: add last column support due to off by 1 error

    Another bug identified by @chriszs — the last column isn't read in FEC files due to an off-by-one error. This pretty simple PR fixes that error and adds a bit more logging and a test case (which isn't actually what the issue was but good to have regardless).

    opened by freedmand 1
  • fix: normalize the filename to prevent errors with slashes in form types

    @chriszs discovered a segfault on old form types where the form type has a slash in it, e.g. "SC/10" and "SC1/9" (from filing ID 34411, for reference). This fix adopts a convention he used in his Node parser wherein filenames are normalized to convert slashes in form types to dashes.

    opened by freedmand 1
  • Update apt-get before installing libcurl to fix GitHub Actions failure

    Description

    libcurl4-openssl-dev was failing to install in GitHub Actions with the following error:

    Run sudo apt install -y libcurl4-openssl-dev
    
    [...snip...]
    
    Err:1 http://azure.archive.ubuntu.com/ubuntu focal-updates/main amd64 libcurl4-openssl-dev amd64 7.68.0-1ubuntu2.10
      404  Not Found [IP: 40.81.13.82 80]
    E: Failed to fetch http://azure.archive.ubuntu.com/ubuntu/pool/main/c/curl/libcurl4-openssl-dev_7.68.0-1ubuntu2.10_amd64.deb  404  Not Found [IP: 40.81.13.82 80]
    E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
    Error: Process completed with exit code 100.
    

    According to this thread and the docs, the solution may be to run apt-get update first. That's what this PR does.

    Test Steps

    File PR, wait for build to work. Update: It does!

    :hand: ready for review 
    opened by chriszs 0
  • Sync with main

    Description

    Lots of hotfixes around sporadic GitHub Actions failures. It's working now, but it isn't clear why it failed before (StackOverflow seems to indicate this is a known problem with GitHub Actions).

    opened by freedmand 0
  • feat: fix filing_id bug with header

    Description

    Fixes a filing_id bug where including the filing id causes the header to be corrupted in modern FEC versions

    Link to Jira Ticket

    https://arcpublishing.atlassian.net/browse/ELEX-80?atlOrigin=eyJpIjoiYWFlNzdiYTNiODI5NDAzYWE2ZTUwNjU2ODMyZGZkZTAiLCJwIjoiamlyYS1zbGFjay1pbnQifQ

    Test Steps

    Run with filing IDs 15324 and 13360, with and without --include-filing-id

    opened by freedmand 0
  • feat: add beta release GitHub workflow

    Description

    Added beta release GitHub workflow to release to "latest" whenever a branch is merged into develop

    Link to Jira Ticket

    https://arcpublishing.atlassian.net/browse/NE-1802

    opened by hs4man21 0
  • Move zig to 0.9.1

    Description

    Encountered the following error when setting up zig in the test GitHub Action:

    [Screenshot of the error]

    On examining the relevant action, the obvious network requests were to get a list of zig versions and presumably to request one. The list of zig versions didn't return the 403 forbidden error, so that probably isn't it. But I didn't see the version of zig we were using (0.9.0-dev.1675+3d528161c) in there, so I suspected that was no longer available and possibly causing the error. I pinned the zig setup to 0.9.0 (update: 0.9.1) instead and that seemed to clear the error.

    Test Steps

    Push to branch, wait for Action, Action succeeds.

    :thought_balloon: wip 
    opened by chriszs 0
  • Building from source no longer links with Homebrew PCRE

    Attempting to build from source with brew install --build-from-source fastfec no longer links with Homebrew PCRE, but the system-provided one instead.

    This seems to be due to a change in Zig 0.9.0 (Homebrew/[email protected]).

    ❯ brew install --quiet --build-from-source fastfec
    ==> zig build -Dvendored-pcre=false
    🍺  /usr/local/Cellar/fastfec/0.0.4: 6 files, 982.8KB, built in 11 seconds
    ==> Running `brew cleanup fastfec`...
    Disable this behaviour by setting HOMEBREW_NO_INSTALL_CLEANUP.
    Hide these hints with HOMEBREW_NO_ENV_HINTS (see `man brew`).
    ❯ brew linkage fastfec
    System libraries:
      /usr/lib/libSystem.B.dylib
      /usr/lib/libcurl.4.dylib
      /usr/lib/libpcre.0.dylib
    

    I've poked at this for a bit, but I don't use Zig so I'm unsure how to get this to ignore the system libpcre. Passing --search-prefix doesn't work. I'd appreciate it if you could take a look. Thanks!

    opened by carlocab 4
Releases(0.0.8)
Owner
The Washington Post