
Overview

slow5tools

Slow5tools is a simple toolkit for converting (FAST5 <-> SLOW5), compressing, viewing, indexing and manipulating data in SLOW5 format.

About SLOW5 format

SLOW5 is a new file format for signal data from Oxford Nanopore Technologies (ONT) devices. SLOW5 was developed to overcome inherent limitations in the standard FAST5 data format that prevent efficient, scalable analysis and cause many headaches for developers.

SLOW5 is a simple tab-separated values (TSV) file encoding metadata and time-series signal data for one nanopore read per line, with global metadata stored in a file header. Parallel file access is facilitated by an accompanying index file, also in TSV format, that specifies the position of each read (in Bytes) within the main SLOW5 file. SLOW5 can be encoded in human-readable ASCII format, or a more compact and efficient binary format (BLOW5) - this is analogous to the seminal SAM/BAM format for storing DNA sequence alignments. The BLOW5 binary format can be compressed using standard zlib compression, thereby minimising the data storage footprint while still permitting efficient parallel access.
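As a toy illustration of this layout (the field names below are simplified placeholders, not the exact SLOW5 specification), the following sketch writes a small TSV with a header line and one read per line, then builds a byte-offset index in the same spirit as the accompanying index file:

```shell
# Toy sketch only: a simplified stand-in for SLOW5, not the real spec.
# Header lines start with '#'; each subsequent line is one read.
printf '#read_id\tsignal\nread1\t10,12,11\nread2\t9,8,14\n' > toy.slow5

# Build a byte-offset "index": for each read, record the byte at which
# its line starts, which is what enables parallel random access.
awk 'BEGIN { off = 0 }
     { if (substr($0, 1, 1) != "#") print $1 "\t" off; off += length($0) + 1 }' \
    toy.slow5 > toy.idx

cat toy.idx   # read1 starts at byte 16, read2 at byte 31
```

A reader can then seek directly to a read's recorded offset instead of scanning the file from the start, which is what makes parallel access cheap.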

Detailed benchmarking experiments have shown that SLOW5 format is an order of magnitude faster and 25% smaller than FAST5.


Full documentation: https://hasindu2008.github.io/slow5tools
Pre-print: https://www.biorxiv.org/content/10.1101/2021.06.29.450255v1

Quick start

If you are a Linux user on an x86_64 architecture and want to try slow5tools quickly, download the compiled binaries from the latest release. For example:

VERSION=v0.1.0
wget "https://github.com/hasindu2008/slow5tools/releases/download/$VERSION/slow5tools-$VERSION-x86_64-linux-binaries.tar.gz" && tar xvf slow5tools-$VERSION-x86_64-linux-binaries.tar.gz && cd slow5tools-$VERSION/
./slow5tools

Binaries should work on most Linux distributions; the only dependency is zlib, which is available by default on most distros.

Building

Building a release

Users are recommended to build from the latest release tarball. A quick example for Ubuntu:

sudo apt-get install libhdf5-dev zlib1g-dev   #install HDF5 and zlib development libraries
VERSION=v0.1.0
wget "https://github.com/hasindu2008/slow5tools/releases/download/$VERSION/slow5tools-$VERSION-release.tar.gz" && tar xvf slow5tools-$VERSION-release.tar.gz && cd slow5tools-$VERSION/
./configure
make

The commands to install the HDF5 (and zlib) development libraries on some popular distributions:

On Debian/Ubuntu : sudo apt-get install libhdf5-dev zlib1g-dev
On Fedora/CentOS : sudo dnf/yum install hdf5-devel zlib-devel
On Arch Linux: sudo pacman -S hdf5
On OS X : brew install hdf5

If you skip ./configure, HDF5 will be compiled locally. This is a good option if you cannot install the HDF5 library system-wide; however, building HDF5 locally takes a long time.

Building from GitHub

Building from the GitHub repository additionally requires autoreconf, which can be installed on Ubuntu using sudo apt-get install autoconf automake. To build from GitHub:

sudo apt-get install libhdf5-dev zlib1g-dev autoconf automake  #install HDF5 and zlib development libraries and autotools
git clone --recursive https://github.com/hasindu2008/slow5tools
cd slow5tools
autoreconf
./configure
make

If you want to build HDF5 locally (takes a long time) and build slow5tools against it:

git clone --recursive https://github.com/hasindu2008/slow5tools
cd slow5tools
autoreconf
scripts/install-hdf5.sh         # downloads and compiles HDF5 in the current folder
./configure --enable-localhdf5
make

Usage

Visit the man page for all the commands and options.

Examples

#convert a directory of fast5 files into .blow5 (compression enabled) using 8 I/O processes
slow5tools f2s fast5_dir -d blow5_dir  -p 8
#convert a single fast5 file into a blow5 file (compression enabled)
slow5tools f2s file.fast5 -o file.blow5  -p 1
#merge all blow5 files in a directory into a single blow5 file using 8 threads
slow5tools merge blow5_dir -o file.blow5 -t8

#Convert a BLOW5 file into SLOW5 ASCII
slow5tools view file.blow5 --to slow5 -o file.slow5
#convert a SLOW5 file to BLOW5
slow5tools view file.slow5 --to blow5 -o file.blow5

#index a slow5/blow5 file
slow5tools index file.blow5

#extract records from a slow5/blow5 file corresponding to given read ids
slow5tools get file.blow5 readid1 readid2

#split a blow5 file into separate blow5 files based on the read groups
slow5tools split file.blow5 -d blow5_dir -g
#split a blow5 file (single read group) into separate blow5 files with 4000 reads per file
slow5tools split file.blow5 -d blow5_dir -r 4000

#convert a directory of blow5 files to fast5 using 8 I/O processes
slow5tools s2f blow5_dir -d fast5  -p 8

Visit here for example workflows.

Acknowledgement

slow5tools uses klib. Some code snippets have been taken from Minimap2 and Samtools.

Comments
  • Header attributes warning

    I got a number of warning messages when running slow5tools merge on a directory of blow5 files. I was wondering if this is a big problem or whether I can ignore it?

    [merge_main::WARNING] In file Calu/Calu_I_2hpi/FAO66427_4fd309c9_127.blow5, read_group 0 has a different number of header attributes than what the processed files had
    

    When I ran slow5tools f2s on the original data I did see the following warning a number of times

    [search_and_warn::WARNING] slow5tools-v0.3.0: The attribute 'pore_type' is empty and will be stored in the SLOW5 header.
    [search_and_warn::WARNING] slow5tools-v0.3.0: The attribute 'pore_type' is empty and will be stored in the SLOW5 header. This warning is suppressed now onwards.
    

    are these two warnings related?

    opened by mbhall88 14
  • [f2s_child_worker::ERROR] Bad fast5

    I am running the command as follows:

    slow5tools f2s ${dir_name}/fast5 -d "${outDir}"

    where ${dir_name} is the input directory and ${outDir} is the output directory.

    But shortly, it gives me this error:

    [list_all_items] Looking for '*.fast5' files in dir_name/fast5
    [f2s_main] 3245 fast5 files found - took 0.391s
    [f2s_main] Just before forking, peak RAM = 0.000 GB
    [f2s_iop] 8 proceses will be used.
    [f2s_iop] Spawning 8 I/O processes to circumvent HDF hell.
    [f2s_child_worker::ERROR] Bad fast5: Fast5 file 'dir_name/fast5/P5a_279.fast5' could not be opened or is corrupted.
    [f2s_iop] Child process 1969843 exited with status=1.
    
    

    I appreciate your feedback/suggestions, Medhat

    opened by MeHelmy 12
  • Basecall with Guppy

    Hello @hasindu2008,

    Thank you for developing slow5tools - it's been really useful to compress and store ONT data!

    Apologies if this is a naïve question but I thought it was possible to basecall blow5 files using Guppy. I have successfully converted my fast5 files to slow5, and am trying to basecall, however I get the following error:

    2022-08-01 18:11:09.698375 [guppy/message] ONT Guppy basecalling software version 6.2.1+6588110, minimap2 version 2.22-r1101
    config file:        /gpfs/mrc0/projects/Research_Project-MRC148213/sl693/softwares/ont-guppy-cpu/data/dna_r9.4.1_450bps_modbases_5mc_cg_hac.cfg
    model file:         /gpfs/mrc0/projects/Research_Project-MRC148213/sl693/softwares/ont-guppy-cpu/data/template_r9.4.1_450bps_hac.jsn
    input path:         /gpfs/mrc0/projects/Research_Project-MRC148213/sl693/ONT/Mouse_Whole_Genome/1_raw/20200807_1632_MC-110214_0_add313_506ffc5b/barcode10/blow5
    save path:          /gpfs/mrc0/projects/Research_Project-MRC148213/sl693/ONT/Mouse_Whole_Genome/2_basecalled/20200807_1632_MC-110214_0_add313_506ffc5b/barcode10
    chunk size:         2000
    chunks per runner:  256
    minimum qscore:     9
    records per file:   4000
    num basecallers:    1
    cpu mode:           ON
    threads per caller: 4
    
    alignment file:     /gpfs/mrc0/projects/Research_Project-MRC148213/sl693/reference_2019/mm10.fa
    alignment type:     auto
    
    Use of this software is permitted solely under the terms of the end user license agreement (EULA).By running, copying or accessing this software, you are demonstrating your acceptance of the EULA.
    The EULA may be found in /gpfs/mrc0/projects/Research_Project-MRC148213/sl693/softwares/ont-guppy-cpu/bin
    2022-08-01 18:11:09.698970 [guppy/info] crashpad_handler not supported on this platform.
    2022-08-01 18:13:19.554000 [guppy/message] Full alignment will be performed.
    2022-08-01 18:13:19.595740 [guppy/message] Found 0 fast5 files to process.
    2022-08-01 18:13:19.596981 [guppy/message] Init time: 129876 ms
    2022-08-01 18:13:19.697374 [guppy/message] Caller time: 100 ms, Samples called: 0, samples/s: 0
    2022-08-01 18:13:19.697410 [guppy/message] Finishing up any open output files.
    2022-08-01 18:13:19.779375 [guppy/message] Basecalling completed successfully.
    

    With the commands (slow5tools 0.5.1):

    slow5tools f2s <input_dir> -d <output_dir> -p 8
    guppy_basecaller -i <output_dir> -s <output_basecalled_dir> -c dna_r9.4.1_450bps_modbases_5mc_cg_hac.cfg --bam_out --recursive --align_ref mm10.fasta
    

    Am I missing something/argument or is basecall only possible with the original raw fast5 files?

    Thank you, Szi Kay

    opened by SziKayLeung 11
  • data loss

    I converted a test fast5 file to slow5 then back to fast5. I thought that by using the --lossless flag the final fast5 would be the same as the original; however, it appears that I am losing data during the conversion.

    Is it possible to preserve all of the data from the original fast5?

    Commands:

    # Convert fast5 to slow5
    slow5tools f2s --lossless true -o slow.slow5 original.fast5
    
    # Convert slow5 back to fast5
    slow5tools s2f -o new.fast5 slow.slow5
    

    Resulting file sizes:

    original.fast5 97M
    slow.slow5     116M
    new.fast5      61M

    opened by jrhendrix 11
  • run_id issue in certain files

    Hi there,

    I am testing slow5tools on a few datasets, converting fast5 to blow5 (slow5tools f2s) For most datasets it works well (~37% smaller file sizes) but unfortunately for some datasets I am running into the following warnings etc

    [search_and_warn::WARNING] slow5tools-v0.3.0: Weird fast5: Attribute file_version/read_001445a6-1ee9-4e18-90c3-7afee025d1b3 in AKH_1a/AKH_1a_8.fast5 is unexpected. This warning is suppressed now onwards.
    [search_and_warn::WARNING] slow5tools-v0.3.0: Weird fast5: Attribute previous_read_id/PreviousReadInfo in AKH_1a/AKH_1a_8.fast5 is unexpected. This warning is suppressed now onwards.
    [search_and_warn::WARNING] slow5tools-v0.3.0: Weird fast5: Attribute previous_read_number/PreviousReadInfo in AKH_1a/AKH_1a_8.fast5 is unexpected. This warning is suppressed now onwards.
    [fast5_group_itr::ERROR] Bad fast5: run_id is missing in the AKH_1a/AKH_1a_8.fast5 in read_id 001445a6-1ee9-4e18-90c3-7afee025d1b3.
    [read_fast5::ERROR] Bad fast5: Could not iterate over the read groups in the fast5 file AKH_1a/AKH_1a_8.fast5.
    [f2s_child_worker::ERROR] Bad fast5: Could not read contents of the fast5 file 'AKH_1a/AKH_1a_8.fast5'.

    So it appears the first real issue is that the run_id is missing in these files. This only seems to occur in my older fast5 datasets (data from around 2017-2018), where the fast5 format didn't contain run_id info. Even using the --allow option doesn't help. These files can be manipulated using h5dump and ont_fast5_api, and basecalled etc., so it is not that the files are corrupted in some way. Plus, by manually checking the files with h5dump, I can see the difference in fast5 structure between files that worked with slow5tools and those that didn't (including, but not limited to, the missing run_id information).

    If there is no run_id present, could one just be placed in its absence (e.g. a randomly generated name), considering the --allow option appears to choose the first run_id anyway? Or, in general, could older fast5 files just be handled?
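The fallback suggested in this issue, substituting a randomly generated placeholder when run_id is absent, could be sketched as follows (purely illustrative; this is not an existing slow5tools option):

```shell
# Illustrative only: fabricate a random placeholder run_id (32 hex chars)
# for files where the attribute is missing. Not a slow5tools feature.
run_id=$(od -An -N16 -tx1 /dev/urandom | tr -d ' \n')
echo "placeholder run_id: $run_id"
```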

    Thanks a lot

    enhancement 
    opened by SAMtoBAM 11
  • Bad fast5 attribute pore type

    I got this error when using f2s in v0.6.0.

    Bad fast5: Attribute read_008c8f1f-f32f-4b23-ab6f-1ef4f4531a20/pore_type in /nfs/research/zi/mbhall/tech_wars/data/madagascar/nanopore/raw_data/md_tb_reseq_2019_5/multi_fast5s/BC76_r0b1235_0.fast5 is duplicated and has two different values in two different places. Please report this with an example FAST5 file at 'https://github.com/hasindu2008/slow5tools/issues' for us to investigate.
    

    Here is the fast5 that caused it. BC76_r0b1235_0.fast5.gz

    opened by mbhall88 10
  • memory issue

    I have a few runs that give this error : (converted from fast5 to blow5 then merged into a single file)

    # slow5tools stats merged.blow5
    file path       merged.blow5
    file version    0.2.0
    file format     BLOW5
    record compression method       zlib
    signal compression method       svb-zd
    number of read groups   2
    number of auxiliary fields      6
    auxiliary fields        end_reason,channel_number,median_before,read_number,start_mux,start_time
    [stats_main] counting number of slow5 records...
    [slow5_get_next_mem::ERROR] Failed to allocate memory: Cannot allocate memory At src/slow5.c:3240
    [stats_main::ERROR] Error reading the file.
    

    The machine isn't under any memory pressure (128 GB total), and slow5tools doesn't seem to be using much memory during the run (it looks to be CPU bound). The merged.blow5 is about 387 GB large.

    Is this a bug or somewhat expected with larger blow5 files? I use stats to get the "number of records" as a sanity check; if there is another method I'm open to suggestions :)

    Thanks!

    opened by svennd 9
  • BLOW5 file size is larger than FAST5 VBZ compressed

    Thanks for the cool tools! I think SLOW5 is a great idea and hope that ONT follows suit. I started to play around a little with slow5tools with the thought of converting my large number of projects from folders of FAST5 files to single BLOW5 files -- I initially thought I could save some space (a 25% reduction in file size is mentioned).

    This seems to be true for older FAST5 datasets, for example:

    $ du -sh old_project_fast5
    100G    old_project_fast5
    $ du -sh old_project.blow5
    93G     old_project.blow5
    

    So here I saved 7 GB by converting to BLOW5, which isn't 25% but still decent.

    I then noticed that the newer VBZ compressed files are actually quite a bit smaller than BLOW5, presumably due to the compression, for example:

    $ du -sh newer_project_fast5_vbz
    28G     newer_project_fast5_vbz
    $ slow5tools f2s newer_project_fast5_vbz -d newer_project_blow5 -p 12
    $ du -sh newer_project_blow5
    36G     newer_project_blow5
    

    In this case, converting the VBZ FAST5 files to BLOW5 format actually increased the size substantially. So it seems that the VBZ compression method that ONT has rolled out recently does a lot in saving space. Is it possible to use VBZ compression for BLOW5 format instead of zlib?

    opened by nextgenusfs 6
  • Bad fast5 - [fast5_attribute_itr::ERROR]

    Hello, it's me again! We have a new MinION run using the updated MinKNOW, and when I used slow5tools to convert the fast5 files I got these errors:

    [list_all_items] Looking for '*.fast5' files in FAS16606_pass_13cc46f5_0.fast5
    [f2s_main] 1 fast5 files found - took 0.000s
    [f2s_main] Just before forking, peak RAM = 0.000 GB
    [f2s_iop] 1 proceses will be used.
    [fast5_attribute_itr::ERROR] Bad fast5: Attribute Raw/num_minknow_events in FAS16606_pass_13cc46f5_0.fast5 is unexpected. Please report this with an example FAST5 file at 'https://github.com/hasindu2008/slow5tools/issues' for us to investigate.
    [read_fast5::ERROR] Bad fast5: Could not iterate over the read groups in the fast5 file FAS16606_pass_13cc46f5_0.fast5.
    [f2s_child_worker::ERROR] Could not read contents of the fast5 file 'FAS16606_pass_13cc46f5_0.fast5'.
    

    I don't know if it's caused by the new minknow or what, but I can use h5py to open them. Could you help fix this? Thanks.

    I can also send over our fast5 if you need it.

    opened by chilampoon 5
  • how to compare fast5 files

    Hi folks.

    I'm very happy to see that you worked on this project and published it. I hope the folks in Oxford pick up some of what you showed and take advantage of these findings.

    I am running some tests using version 0.3.0.

    I have the following datasets based on a single .fast5 file containing 4000 reads of genomic ONT promethion reads.

    dataset | file method | file size (kb)
    -- | -- | --
    1 | original zlib fast5 | 3027586
    2 | input: 1, f2s zlib record and svb-zd signal compression | 1839624
    3 | input: 1, f2s zstd record and svb-zd signal compression | 1771464
    4 | input: 2, s2f | 2701116
    5 | input: 3, s2f | 2701116

    The .fast5 -> blow5 -> .fast5 round trips in 4 and 5 resulted in exactly the same file regardless of the compression params, but both files differ in size from the original.

    I've been poking around with h5diff to verify that everything in my original fast5 is still recovered in the resulting .fast5 after roundtripping through blow5. I can't seem to get it to report differences, or to confirm that the file contents are the same. Can you share the approach you used to compare .fast5 file contents? I also tried basecalling multiple times with guppy, but the non-deterministic manner in which the basecalling is done limits the utility here.

    Also, is the size difference between 1, 4 and 5 possibly just due to a difference in compression? I don't see any information describing if the .fast5->blow5->.fast5 resulting file is zlib, vbz, or otherwise compressed.

    thanks Richard

    opened by RichardCorbett 5
  • Missing auxiliary field

    Hello there, I am using pyslow5 for my task and I found there's one column called "end_reason" in my slow5 file, but I couldn't access it.
    It is not in the keys of s5.seq_reads(aux = 'all') or s5_data = s5.seq_reads(aux = ["start_time", "read_number", "start_mux", "median_before", "end_reason", "channel_number"])...

    opened by chilampoon 4
  • Conversion back to fast5 is broken in some cases

    Using slow5tools v0.3.0

    The input file had an unexpected field (end_reason) which triggered a warning on conversion to blow5.

    Trying to recreate the fast5 from the blow5 file resulted in:

    [list_all_items] Looking for '*low5' files in test_blow5/
    [s2f_main] 1 files found - took 0.001s
    [s2f_iop] 1 proceses will be used
    [s2f_child_worker] Converting test_blow5//FAL37440_pass_c4fa58d7_179.blow5 to fast5
    [slow5_get_aux_enum_labels::ERROR] No enum auxiliary type exists. At src/slow5.c:1181
    [slow5_get_aux_enum_labels::ERROR] Exiting on error. At src/slow5.c:1181

    If you can provide an email I can send an example file.

    Data were generated in May 2020 using GridION - 19.12.6 - Guppy Version was 3.2.10

    opened by mattloose 11
Releases: v0.8.0

Owner: Hasindu Gamaarachchi
I am currently working on domain-specific computer systems for ultra-efficient real-time genomic data processing