Overview

slow5tools

Slow5tools is a simple toolkit for converting (FAST5 <-> SLOW5), compressing, viewing, indexing and manipulating data in SLOW5 format.

About SLOW5 format

SLOW5 is a new file format for signal data from Oxford Nanopore Technologies (ONT) devices. SLOW5 was developed to overcome inherent limitations in the standard FAST5 data format that prevent efficient, scalable analysis and cause many headaches for developers.

SLOW5 is a simple tab-separated values (TSV) file encoding metadata and time-series signal data for one nanopore read per line, with global metadata stored in a file header. Parallel file access is facilitated by an accompanying index file, also in TSV format, that specifies the position of each read (in bytes) within the main SLOW5 file. SLOW5 can be encoded in human-readable ASCII format, or in a more compact and efficient binary format (BLOW5) - this is analogous to the seminal SAM/BAM format for storing DNA sequence alignments. The BLOW5 binary format can be compressed using standard zlib compression, thereby minimising the data storage footprint while still permitting efficient parallel access.
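
To make this concrete, here is a simplified sketch of a SLOW5 ASCII file (field values are invented and minor details may differ between versions; consult the SLOW5 specification for exact details). Header lines carry global metadata, '@' lines carry per-read-group attributes, and each subsequent line holds one read:

#slow5_version	0.2.0
#num_read_groups	1
@asic_id	3574887596
#char*	uint32_t	double	double	double	double	uint64_t	int16_t*
#read_id	read_group	digitisation	offset	range	sampling_rate	len_raw_signal	raw_signal
r1	0	8192.0	23.0	1467.61	4000.0	5	430,472,463,467,454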

Detailed benchmarking experiments have shown that the SLOW5 format is an order of magnitude faster to process and roughly 25% smaller than FAST5.


Full documentation: https://hasindu2008.github.io/slow5tools
Pre-print: https://www.biorxiv.org/content/10.1101/2021.06.29.450255v1

Quick start

If you are a Linux user on x86_64 architecture and want to quickly try it out, download the compiled binaries from the latest release. For example:

VERSION=v0.1.0
wget "https://github.com/hasindu2008/slow5tools/releases/download/$VERSION/slow5tools-$VERSION-x86_64-linux-binaries.tar.gz" && tar xvf slow5tools-$VERSION-x86_64-linux-binaries.tar.gz && cd slow5tools-$VERSION/
./slow5tools

Binaries should work on most Linux distributions, and the only dependency is zlib, which is available by default on most distros.

Building

Building a release

Users are recommended to build from the latest release tarball. A quick example for Ubuntu:

sudo apt-get install libhdf5-dev zlib1g-dev   #install HDF5 and zlib development libraries
VERSION=v0.1.0
wget "https://github.com/hasindu2008/slow5tools/releases/download/$VERSION/slow5tools-$VERSION-release.tar.gz" && tar xvf slow5tools-$VERSION-release.tar.gz && cd slow5tools-$VERSION/
./configure
make

Commands to install HDF5 (and zlib) development libraries on some popular distributions:

On Debian/Ubuntu: sudo apt-get install libhdf5-dev zlib1g-dev
On Fedora/CentOS: sudo dnf/yum install hdf5-devel zlib-devel
On Arch Linux: sudo pacman -S hdf5
On OS X: brew install hdf5

If you skip ./configure, HDF5 will be compiled locally. This is a good option if you cannot install the HDF5 library system-wide; however, building HDF5 takes ages.

Building from GitHub

Building from the GitHub repository additionally requires autoreconf, which can be installed on Ubuntu using sudo apt-get install autoconf automake. To build from GitHub:

sudo apt-get install libhdf5-dev zlib1g-dev autoconf automake  #install HDF5 and zlib development libraries and autotools
git clone --recursive https://github.com/hasindu2008/slow5tools
cd slow5tools
autoreconf
./configure
make

If you want to build HDF5 locally (which takes ages) and build slow5tools against that:

git clone --recursive https://github.com/hasindu2008/slow5tools
cd slow5tools
autoreconf
scripts/install-hdf5.sh         # downloads and compiles HDF5 in the current folder
./configure --enable-localhdf5
make

Usage

Visit the man page for all the commands and options.

Examples

#convert a directory of fast5 files into .blow5 (compression enabled) using 8 I/O processes
slow5tools f2s fast5_dir -d blow5_dir -p 8
#convert a single fast5 file into a blow5 file (compression enabled)
slow5tools f2s file.fast5 -o file.blow5 -p 1
#merge all blow5 files in a directory into a single blow5 file using 8 threads
slow5tools merge blow5_dir -o file.blow5 -t 8

#convert a BLOW5 file into SLOW5 ASCII
slow5tools view file.blow5 --to slow5 -o file.slow5
#convert a SLOW5 file to BLOW5
slow5tools view file.slow5 --to blow5 -o file.blow5

#index a slow5/blow5 file
slow5tools index file.blow5

#extract records from a slow5/blow5 file corresponding to given read ids
slow5tools get file.blow5 readid1 readid2

#split a blow5 file into separate blow5 files based on the read groups
slow5tools split file.blow5 -d blow5_dir -g
#split a blow5 file (single read group) into separate blow5 files such that there are 4000 reads in one file
slow5tools split file.blow5 -d blow5_dir -r 4000

#convert a directory of blow5 files to fast5 using 8 I/O processes
slow5tools s2f blow5_dir -d fast5_dir -p 8
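
Putting these commands together, a typical end-to-end pipeline might look like the following (directory names and read ids are placeholders):

#convert, merge, index, then fetch a record by read id
slow5tools f2s fast5_dir -d blow5_dir -p 8
slow5tools merge blow5_dir -o reads.blow5 -t 8
slow5tools index reads.blow5
slow5tools get reads.blow5 readid1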

Visit here for example workflows.

Acknowledgement

slow5tools uses klib. Some code snippets have been taken from Minimap2 and Samtools.

Issues
  • Header attributes warning

    I got a number of warning messages when running slow5tools merge on a directory of blow5 files. I was wondering if this is a big problem or whether I can ignore it?

    [merge_main::WARNING] In file Calu/Calu_I_2hpi/FAO66427_4fd309c9_127.blow5, read_group 0 has a different number of header attributes than what the processed files had
    

    When I ran slow5tools f2s on the original data, I did see the following warning a number of times:

    [search_and_warn::WARNING] slow5tools-v0.3.0: The attribute 'pore_type' is empty and will be stored in the SLOW5 header.
    [search_and_warn::WARNING] slow5tools-v0.3.0: The attribute 'pore_type' is empty and will be stored in the SLOW5 header. This warning is suppressed now onwards.
    

    Are these two warnings related?

    opened by mbhall88 14
  • data loss

    I converted a test fast5 file to slow5 and then back to fast5. I thought that by using the --lossless flag the final fast5 would be the same as the original; however, it appears that I am losing data during the conversion.

    Is it possible to preserve all of the data from the original fast5?

    Commands:

    # Convert fast5 to slow5
    slow5tools f2s --lossless true -o slow.slow5 original.fast5
    
    # Convert slow5 back to fast5
    slow5tools s2f -o new.fast5 slow.slow5
    

    Resulting file sizes:

    original.fast5  97M
    slow.slow5      116M
    new.fast5       61M

    opened by jrhendrix 11
  • run_id issue in certain files

    Hi there,

    I am testing slow5tools on a few datasets, converting fast5 to blow5 (slow5tools f2s). For most datasets it works well (~37% smaller file sizes), but unfortunately for some datasets I am running into the following warnings and errors:

    [search_and_warn::WARNING] slow5tools-v0.3.0: Weird fast5: Attribute file_version/read_001445a6-1ee9-4e18-90c3-7afee025d1b3 in AKH_1a/AKH_1a_8.fast5 is unexpected. This warning is suppressed now onwards.
    [search_and_warn::WARNING] slow5tools-v0.3.0: Weird fast5: Attribute previous_read_id/PreviousReadInfo in AKH_1a/AKH_1a_8.fast5 is unexpected. This warning is suppressed now onwards.
    [search_and_warn::WARNING] slow5tools-v0.3.0: Weird fast5: Attribute previous_read_number/PreviousReadInfo in AKH_1a/AKH_1a_8.fast5 is unexpected. This warning is suppressed now onwards.
    [fast5_group_itr::ERROR] Bad fast5: run_id is missing in the AKH_1a/AKH_1a_8.fast5 in read_id 001445a6-1ee9-4e18-90c3-7afee025d1b3.
    [read_fast5::ERROR] Bad fast5: Could not iterate over the read groups in the fast5 file AKH_1a/AKH_1a_8.fast5.
    [f2s_child_worker::ERROR] Bad fast5: Could not read contents of the fast5 file 'AKH_1a/AKH_1a_8.fast5'.

    So it appears the first real issue is that the run_id is missing in these files. This only appears to occur in my older fast5 datasets (data from around 2017-2018), where it seems that the fast5 format didn't contain run_id info. Even using the --allow option doesn't help. These files can be manipulated using h5dump and ont_fast5_api, and can be basecalled etc., so it is not that the files are corrupted in some way. Plus, by manually checking the files with h5dump, I can see the difference in fast5 structure between files that worked with slow5tools and those that didn't (including but not limited to the missing run_id information).

    If there is no run_id present, could one just be placed in its absence (e.g. a randomly generated name), considering the --allow option appears to choose the first run_id anyway? Or, in general, could older fast5 files be handled?

    Thanks a lot

    enhancement 
    opened by SAMtoBAM 11
  • Conversion back to fast5 is broken in some cases

    Using slow5tools v0.3.0

    The input file had an unexpected field (end_reason) which triggered a warning on conversion to blow5.

    Trying to recreate the fast5 from the blow5 file resulted in:

    [list_all_items] Looking for '*low5' files in test_blow5/
    [s2f_main] 1 files found - took 0.001s
    [s2f_iop] 1 proceses will be used
    [s2f_child_worker] Converting test_blow5//FAL37440_pass_c4fa58d7_179.blow5 to fast5
    [slow5_get_aux_enum_labels::ERROR] No enum auxiliary type exists. At src/slow5.c:1181
    [slow5_get_aux_enum_labels::ERROR] Exiting on error. At src/slow5.c:1181

    If you can provide an email I can send an example file.

    Data were generated in May 2020 using GridION - 19.12.6 - Guppy Version was 3.2.10

    opened by mattloose 8
  • BLOW5 file size is larger than FAST5 VBZ compressed

    Thanks for the cool tools! I think SLOW5 is a great idea and hope that ONT follows suit. I started to play around a little with slow5tools with the thought of converting my large number of projects from folders of FAST5 files to single BLOW5 format -- I initially thought I could save some space (a ~25% reduction in file size is mentioned).

    This seems to be true for older FAST5 datasets, for example:

    $ du -sh old_project_fast5
    100G    old_project_fast5
    $ du -sh old_project.blow5
    93G     old_project.blow5
    

    So here I saved 7 GB by converting to BLOW5, which isn't 25% but still decent.

    I then noticed that the newer VBZ compressed files are actually quite a bit smaller than BLOW5, presumably due to the compression, for example:

    $ du -sh newer_project_fast5_vbz
    28G     newer_project_fast5_vbz
    $ slow5tools f2s newer_project_fast5_vbz -d newer_project_blow5 -p 12
    $ du -sh newer_project_blow5
    36G     newer_project_blow5
    

    In this case, converting the VBZ FAST5 files to BLOW5 format actually increased the size substantially. So it seems that the VBZ compression method that ONT has rolled out recently does a lot to save space. Is it possible to use VBZ compression for BLOW5 format instead of zlib? (A sketch of newer slow5tools compression options appears after this issues list.)

    opened by nextgenusfs 6
  • how to compare fast5 files

    Hi folks.

    I'm very happy to see that you worked on this project and published it. I hope the folks in Oxford pick up some of what you showed and take advantage of these findings.

    I am running some tests using version 0.3.0.

    I have the following datasets, based on a single .fast5 file containing 4000 genomic ONT PromethION reads.

    dataset | file method                                              | file size (kb)
    --------|----------------------------------------------------------|---------------
    1       | original zlib fast5                                      | 3027586
    2       | input: 1, f2s zlib record and svb-zd signal compression  | 1839624
    3       | input: 1, f2s zstd record and svb-zd signal compression  | 1771464
    4       | input: 2, s2f                                            | 2701116
    5       | input: 3, s2f                                            | 2701116

    The .fast5 -> blow5 -> .fast5 round trips in 4 and 5 resulted in the exact same file independent of the compression params, but both files differ in size from the original.

    I've been poking around with h5diff to verify that everything in my original fast5 is still recovered in the resulting .fast5 after round-tripping through blow5. I can't seem to get it to report differences, or to confirm that the file contents are the same. Can you share the approach you used to compare .fast5 file contents (a hedged h5diff example appears after this issues list)? I also tried basecalling multiple times with guppy, but the non-deterministic manner in which the basecalling is done limits the utility here.

    Also, is the size difference between 1, 4 and 5 possibly just due to a difference in compression? I don't see any information describing whether the .fast5 -> blow5 -> .fast5 resulting file is zlib, vbz, or otherwise compressed.

    thanks Richard

    opened by RichardCorbett 5
  • Missing auxiliary field

    Hello there, I am using pyslow5 for my task and I found there's one column called "end_reason" in my slow5 file, but I couldn't access it.
    It is not in the keys of s5.seq_reads(aux = 'all') or s5_data = s5.seq_reads(aux = ["start_time", "read_number", "start_mux", "median_before", "end_reason", "channel_number"])...

    opened by chilampoon 4
  • Allow to keep multiple run_ids when multi-fast5 contains them

    Hello, I have been learning about the fast5 and slow5 formats recently (thanks a lot for all the tools and info you have provided in your recent papers).

    I have started using slow5tools to convert fast5 to slow5, and as an example dataset I'm using the .fast5 files from: https://github.com/nanopore-wgs-consortium/NA12878 For example, if you follow the RNA links you will come across download links like: http://s3.amazonaws.com/nanopore-human-wgs/rna/Multi_Fast5/Chip137_IVT_NA12878_Data_reads/Chip137_IVT_NA12878_Data_reads_0.fast5

    When I try to convert this file I get an error like:

    $ slow5tools f2s Chip137_IVT_NA12878_Data_reads_0.fast5 -o Chip137_IVT_NA12878_Data_reads_0.slow5
    
    [fast5_attribute_itr::ERROR] Ancient fast5: Different run_ids found in an individual multi-fast5 file. Cannot create a single header slow5/blow5. Consider --allow option.
    

    If I use the --allow option, then only the first run_id is used:

    $ slow5tools f2s Chip137_IVT_NA12878_Data_reads_0.fast5 -o Chip137_IVT_NA12878_Data_reads_0.slow5 -a
    [search_and_warn::WARNING] slow5tools-v0.3.0: Ancient fast5: Different run_ids found in an individual multi-fast5 file. First seen run_id will be set in slow5 header.
    

    From what I understood in FAST5 Demystified, you expect the run_id to be unique across all reads in a multi-fast5 file. And from what I understood from the slow5 description, the slow5 format supports multiple read groups.

    With that in mind, I have some questions that maybe you can help me with:

    • Do you know why or how common is it for multi-fast5 files to have multiple run_ids?
    • In the ERROR and warning you mention "Ancient fast5". Does it mean that multiple run_ids were allowed in old fast5 versions? (The version of my example file is 2.0.)
    • Given that the slow5 format already supports multiple run_ids, would it be possible (or worth it) to add support for these multi-run_id fast5 files, by keeping all the original run_ids instead of just the first one?
    • If you have a small dataset with proper fast5 files that you can share as part of the repo, that could help a lot.

    Thanks for your help.

    opened by waltergallegog 4
  • Converting back to fast5 increases size

    Hello there, I was trying the conversion f2s and it all worked out pretty well, but once I tried s2f it generated a folder bigger than the original one:

    $ du -sh fast5_1st blow5_1st fast5_again
    4.5G    fast5_1st
    2.8G    blow5_1st
    6.0G    fast5_again

    Commands I used:

    slow5tools f2s fast5_1st/ -d blow5_1st/
    slow5tools s2f blow5_1st/ -d fast5_again/

    I don't know if it's an issue or if it has something to do with compression mechanisms I am just not aware of; it just felt right to report back! Thanks for any reply, have a good day and keep up the good work! Simone

    opened by DrownedMala 4
  • Option to retain directory structure

    Hi,

    Would it be possible to add a feature to allow for maintaining directory structure when converting between slow5 and fast5?

    For example, if I have a directory structure

      └── Vero
         ├── Vero_C_2hpi
         │  └── 20200513_Vero_c_2hpi
         │     ├── fast5
         │     │  ├── fast5_fail
         │     │  └── fast5_pass
         ├── Vero_C_24hpi
         │  ├── 20200430_Vero_c_24hpi
         │  │  └── 20200430_Vero_c_24hpi
         │  │     └── 20200430_0856_MN28696_FAN24843_c1ec9e3c
         │  │        ├── fast5
    

    it would be nice if I could have the output directory mirror this.

    For example, if I run

    $ slow5tools s2f -d out/ Vero/
    

    Then all of the fast5 files are dumped in out/. It would be great to have an option to have them dumped in the same nested structure as the input directory.

    Thanks for the great tool!

    enhancement 
    opened by mbhall88 3
  • Are pA values from slow5tools normalized?

    Hi there, I am looking at normalization methods for ONT raw signals. I am not clear on whether the picoamp values obtained from slow5tools seq_reads(pA=True) are already normalized or not. I've also tried the normalization in tombo using their function tombo_stats.normalize_raw_signal, where they scale the values of raw signals to ~0.

    Using either pA values or those normalized raw signal values didn't seem to affect my downstream analysis much, but if I'd like to normalize the squiggles of my dataset globally, which method would you suggest? Thanks.

    opened by chilampoon 2
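
Regarding the BLOW5 vs VBZ file size question above: later slow5tools versions expose record and signal compression options beyond zlib. A hedged sketch (the -c/-s options below are taken from the slow5tools v0.3.0+ documentation; check slow5tools f2s --help on your version):

#zstd record compression + streamvbyte zig-zag delta signal compression (similar in spirit to ONT's VBZ)
slow5tools f2s fast5_dir -d blow5_dir -c zstd -s svb-zd -p 8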
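
Regarding comparing FAST5 contents after a round trip: HDF5 ships the h5diff utility, which compares two HDF5 files object by object. A minimal sketch (standard h5diff flags; note that compression settings can differ between files, so file sizes may differ even when contents match):

#report mode: list all objects and any differences
h5diff -r original.fast5 new.fast5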