Aligns short reads using strobemers

Overview

StrobeAlign

Strobealign is a single or paired-end short-read aligner using syncmer-thinned strobemers. Strobealign is multithreaded and implements both alignment (SAM) and mapping (PAF). It is 12-15 times faster than BWA and Bowtie2 with similar accuracy for single-end reads, and about 10 times faster with a loss of 0.1-0.2% accuracy for paired-end reads. See the experiments in the preprint.

The default parameter setting is tailored for Illumina single or paired-end reads of lengths about 150-500nt.

Strobealign is currently not recommended for reads shorter than 150nt as a lower value for parameter -k is needed (e.g. 15-17) and extensive testing in this setting remains to be done.

Strobealign is also currently not recommended for long reads (>500nt), as significant implementation changes are needed to keep its relative speed. For long reads we need a different extension algorithm (chaining of seeds instead of the current approach described in the preprint) and split-mapping functionality.

INSTALLATION

You can acquire precompiled binaries for Linux and macOS from here. For example, on Linux, simply do

wget https://github.com/ksahlin/StrobeAlign/raw/main/bin/Linux/StrobeAlign-v0.0.3.1
mv StrobeAlign-v0.0.3.1 strobealign  # rename to strobealign
chmod +x strobealign # make executable
./strobealign  # test program

If you want to compile from the source, you need to have a newer g++ and zlib installed. Then do the following:

git clone https://github.com/ksahlin/StrobeAlign
cd StrobeAlign
# Needs a newer g++ version. Tested with version 8 and upwards.
g++ -std=c++14 main.cpp source/index.cpp source/ksw2_extz2_sse.c -lz -fopenmp -o StrobeAlign -O3 -mavx2

Common errors when installing from source

If you have zlib installed, and the zlib.h file is in folder /path/to/zlib/include and the libz.so file in /path/to/zlib/lib but you get

main.cpp:12:10: fatal error: zlib.h: No such file or directory
 #include <zlib.h>
          ^~~~~~~~
compilation terminated.

add -I/path/to/zlib/include -L/path/to/zlib/lib to the compilation, that is

g++ -std=c++14 -I/path/to/zlib/include -L/path/to/zlib/lib main.cpp source/index.cpp source/ksw2_extz2_sse.c -lz -fopenmp -o StrobeAlign -O3 -mavx2

USAGE

For alignment to SAM file:

StrobeAlign [-k 22 -s 18 -f 0.0002] -o <output.sam> ref.fa reads.fa 

For mapping to PAF file (option -x):

StrobeAlign [-k 22 -s 18 -f 0.0002] -x -o <output.paf> ref.fa reads.fa
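
The two invocations above differ only in the -x flag. As a minimal sketch, the argument list could be assembled from a script like this. The helper name build_strobealign_command and its defaults are hypothetical; only the flags shown in the usage lines above are used.

```python
from typing import List


def build_strobealign_command(
    ref: str,
    reads: str,
    output: str,
    paf: bool = False,
    k: int = 22,
    s: int = 18,
    f: float = 0.0002,
    binary: str = "StrobeAlign",
) -> List[str]:
    """Assemble the argument list for the usage lines above (hypothetical helper)."""
    cmd = [binary, "-k", str(k), "-s", str(s), "-f", str(f)]
    if paf:
        # -x switches from alignment (SAM) to mapping (PAF) output
        cmd.append("-x")
    cmd += ["-o", output, ref, reads]
    return cmd
```

The returned list can be passed to subprocess.run(cmd, check=True) if the binary is on the PATH.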

TODO

  1. Add option to separate build index and perform alignment in separate steps.

CREDITS

Kristoffer Sahlin. Faster short-read mapping with strobemer seeds in syncmer space. bioRxiv, 2021. doi:10.1101/2021.06.18.449070. Preprint available here.

VERSION INFO

Version 0.0.3.1

  1. Bugfix. Takes care of segmentation fault bug in paired-end mapping mode (-x) when none of the reads have NAMs.

Version 0.0.3

  1. Implements a paired-end alignment mode.
  2. Implements a rescue mode both in SE and PE alignment modes (described in preprint v2).
  3. Changed to symmetrical strobemer hashvalues due to inversions (described in preprint v2).

Version 0.0.2

  1. Implements multi-threading.
  2. Allow reads in fast[a/q] format and gzipped files through kseqpp library.

Version 0.0.1

The aligner used for the experiments presented in the preprint (v1) on bioRxiv. Single-threaded only, and aligns all reads as single-end reads (no PE mapping).

LICENCE

GPL v3.0, see LICENSE.txt.

Comments
  • pipe output to downstream programs


    hi @ksahlin, your approach looks really promising. I have already started working with it and running some benchmarks with respect to downstream analyses. Would it be possible to allow StrobeAlign to print the SAM alignments directly to standard output? That would be very useful for piping into downstream tasks, such as marking duplicates and sorting with samtools. Thanks in advance!

    feature 
    opened by TDDB-limagrain 30
  • Addressing Issue #75 with some initial refactoring, both functions now…

    … call a helper that returns a reference to a vector of alignments; align_PE simply returns the best (by SW score, as it did before), while align_SE_secondary_hits does its usual thing with the vector.

    I am relatively certain that this will do the same thing as before, however since we now have unit tests, adding some unit tests that check that alignments look as they should with test data could be really good.

    opened by PaulPyl 18
  • suggestion to align medium size indel


    I have targeted Illumina paired-end reads of 150 bp. The average coverage is pretty high. There is a known homozygous duplication of 77 bp. It looks very good in IGV, see snapshot.

    With the default parameters, the alignment looks like this. Is there a way I can adjust the settings to get the alignment to properly anchor both ends of the reads?

    [Image: IGV snapshot of the alignment]

    opened by husamia 14
  • ALIGNMENT TO REF LONGER THAN 2000bp


    hi @ksahlin, sorry for coming back with a new possible issue! I used the very latest patch you've made to align a full sample against the public genome we discussed later. The output was correctly piped to samtools for further duplicate marking and sorting, so thank you very much for allowing the redirection!

    Total mapping sites tried: 260811459
    Total calls to ssw: 83238353
    Calls to ksw (rescue mode): 24136154
    Did not fit strobe start site: 452500
    Tried rescue: 8059431
    Total time mapping: 4317.84 s.
    Total time reading read-file(s): 178.746 s.
    Total time creating strobemers: 188.722 s.
    Total time finding NAMs (non-rescue mode): 618.305 s.
    Total time finding NAMs (rescue mode): 341.146 s.
    Total time sorting NAMs (candidate sites): 75.2316 s.
    Total time reverse compl seq: 0 s.
    Total time base level alignment (ssw): 2239.28 s.
    Total time writing alignment to files: 525.32 s.
    

    Everything goes well, but looking at the standard output, I obtained the following message several times: ALIGNMENT TO REF LONGER THAN 2000bp - REPORT TO DEVELOPER.

    For example: ALIGNMENT TO REF LONGER THAN 2000bp - REPORT TO DEVELOPER. Happened for read: TGCCAATCCTGTGGGACGAATATGGGGCTCAAACCTCAGCCAAAACTCAATAGACACAGTGACGAATGTCTGGTAAAAAATTCAGACCAAAATACCAAAGGAGTAAGGCGTAGCAAGTCCCAGACCGAGAGTGAATAAAACCGGTTTTCCG ref len:53438756

    Do you have any idea about such a behavior? I ran strobealign with default parameters.

    opened by TDDB-limagrain 12
  • Feat/save index to file


    I did two things:

    1. Collected the index into a struct.
    2. Added the possibility to read/write the index from/to disk.

    I have not created test cases (I don't know how in this project) but have tested the code, it produces identical output for phix compared to previous runs.

    I tested to run on the human transcriptome:

    • Read of reference: 4.3 s
    • Generation of index: 31.1 s
    • Writing index: 6.8 s - this is a bit slow, it is probably possible to speed it up but I left it for now.
    • Reading index: 3.6 s
    • Index size: 1.5 GB - the kallisto index for the same is 2.4 GB, somewhat the same order of magnitude, the fasta file is 350 MB.

    I developed the code in Visual Studio, worked fine. I have not tested to compile with GCC, but it should work. Would be good if someone with that environment up could test it quickly.

    There are three things we discovered that should be changed later at some point, but it is better to do these as separate actions:

    • idx_to_acc - change this to a vector, no point in having a map with <i,data>, where i is just the index - access will be identical, i.e., index.acc_map[i]
    • Remove the first uint64_t in the mers_vector - according to Kristoffer it should not be needed, which will reduce the memory taken up by the index (and file size).
    • I think we should rename unsigned int etc. to uint32_t - the number of bits in types like int has traditionally been different between machine architectures to optimize speed, so the size is a bit unclear, although it is pretty safe to assume they will be 32 bits.

    I can fix these things once this is approved.

    opened by johan-gson 11
  • bam file header not compatible with GATK


    Hi, using GATK as a variant caller raises an error about the sort order of the chromosomes and indeed they are not sorted although samtools sort was used. The header looks like this

    @HD     VN:1.6  SO:coordinate
    @SQ     SN:StSOLv1.1ch07_RagTag LN:51859799
    @SQ     SN:StSOLv1.1ch01_RagTag LN:89994189
    @SQ     SN:StSOLv1.1ch12_RagTag LN:108042198
    @SQ     SN:Chr0_RagTag  LN:74722543
    @SQ     SN:StSOLv1.1ch08_RagTag LN:122763581
    @SQ     SN:StSOLv1.1ch02_RagTag LN:52958123
    @SQ     SN:StSOLv1.1ch11_RagTag LN:82684344
    @SQ     SN:StSOLv1.1ch09_RagTag LN:110327934
    @SQ     SN:StSOLv1.1ch06_RagTag LN:77416214
    @SQ     SN:StSOLv1.1ch04_RagTag LN:108266590
    @SQ     SN:StSOLv1.1ch05_RagTag LN:91627361
    @SQ     SN:StSOLv1.1ch10_RagTag LN:63742240
    @SQ     SN:StSOLv1.1ch03_RagTag LN:45113244
    @PG     ID:strobealign  PN:strobealign  VN:0.7  CL:strobealign
    @PG     ID:samtools     PN:samtools     PP:strobealign  VN:1.13 CL:samtools fixmate [email protected] -u -m - -
    @PG     ID:samtools.1   PN:samtools     PP:samtools     VN:1.13 CL:samtools view -bhS -
    @PG     ID:samtools.2   PN:samtools     PP:samtools.1   VN:1.13 CL:samtools sort [email protected]
    @PG     ID:samtools.3   PN:samtools     PP:samtools.2   VN:1.13 CL:samtools view -H Altus.ColombaNRGene.ragtag.strobealign.bam
    

    Any idea what is going wrong? Other aligners, such as bwa, produce a properly sorted BAM file.

    opened by danessel 11
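
The header quoted in this issue lists its @SQ lines in an order that differs from the chromosome numbering. A minimal sketch for spotting such a difference follows: it extracts the sequence names from a SAM header and reports the first position where they disagree with an expected order. The function names are hypothetical, and how GATK itself performs its check is not shown here; the expected order would typically come from the reference FASTA or its sequence dictionary.

```python
from typing import List, Optional, Tuple


def sq_names(header_text: str) -> List[str]:
    """Return the SN values of the @SQ header lines, in order of appearance."""
    names = []
    for line in header_text.splitlines():
        fields = line.split()
        if not fields or fields[0] != "@SQ":
            continue
        for field in fields[1:]:
            if field.startswith("SN:"):
                names.append(field[3:])
                break
    return names


def first_order_mismatch(
    header_names: List[str], reference_names: List[str]
) -> Optional[Tuple[int, Optional[str], Optional[str]]]:
    """Return (index, header name, reference name) at the first difference, or None."""
    for i, (h, r) in enumerate(zip(header_names, reference_names)):
        if h != r:
            return i, h, r
    if len(header_names) != len(reference_names):
        i = min(len(header_names), len(reference_names))
        h = header_names[i] if i < len(header_names) else None
        r = reference_names[i] if i < len(reference_names) else None
        return i, h, r
    return None
```

For the header above, sq_names would start with StSOLv1.1ch07_RagTag, so a reference that starts with a different sequence is reported as a mismatch at index 0.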
  • Accuracy regressions since v0.7.1


    This is to document which accuracy changes I found since v0.7.1.

    Test data

    Only paired-end reads.

    • simulated Drosophila
      • Queries: First 1 million 50 bp reads from /proj/snic2022-6-31/nobackup/strobemap_eval/reads_PE/drosophila/
      • Reference: /proj/snic2022-6-31/nobackup/strobemap_eval/genomes/drosophila.fa
    • real Drosophila
    • CHM13
      • Queries: First 100'000 reads of /proj/snic2022-6-31/nobackup/strobemap_eval/reads_PE/CHM13/200_?.fq
      • Reference: /proj/snic2022-6-31/nobackup/strobemap_eval/genomes/CHM13.fa

    Command used: strobealign-${commithash} -t 1 ${reference} ${r1} ${r2} | samtools view --no-PG -o out.bam

    Commits with accuracy or other output changes

    |Commit|Accuracy in CHM13|Description|Fix/discussion|Trigger in real Drosophila|
    |-|-|-|-|-|
    |v0.7.1|94.3725||baseline||
    |cd46f8f|94.363|Use std::{max,min,abs} instead of ternary operators|#136|SRR6055476.524|
    |14d512f|94.3650|Fix a probable logic error due to a typo|||
    |d987274|94.169|Do not convert reference to uppercase in seq_to_randstrobes2|#138|?|
    |13bdc5d|94.169|Factor out rescue_read|#139|SRR6055476.15|
    |6973009|92.875|Invert conditions to allow early break|#127|?|
    |d82ce31|93.997|Bug in only this one commit: filter_cutoff was set incorrectly|||
    |720887d|92.876|Fix incorrect refactor, part 1|||
    |8d512a9|94.169|Fix incorrect refactor, part 2|||
    |e758839|94.1765|Merge pull request #136 from ksahlin/fix-max|||
    |464a852|94.3745|Merge pull request https://github.com/ksahlin/StrobeAlign/pull/138 from ksahlin/uppercase|||

    opened by marcelm 10
  • sam file format error


    Dear StrobeAlign author,

    With default settings, the SAM output from mapping cannot be converted into a sorted BAM file by samtools:

    $ time StrobeAlign -t 24 T4Aer_MAG.fasta T4AerOil_R1.fa T4AerOil_R2.fa
    Using n: 2 k: 22 s: 18 t: 24 R: 2 w_min: 6 w_max: 14
    [w_min, w_max] under thinning w roughly corresponds to sampling from downstream read coordinates (under random minimizer sampling): [30, 70]
    Time reading references: 7.76452 s

    ref vector approximate size: 9290379
    Ref vector actual size: 8664874
    Unique strobemers: 8651483
    Total time generating flat vector: 2.85387 s

    Flat vector size: 8664874
    Total strobemers count: 8664874
    Total strobemers occur once: 8639067
    Total strobemers highly abundant > 100: 0
    Total strobemers mid abundance (between 2-100): 12415
    Total distinct strobemers stored: 8651483
    Ratio distinct to non distinct: 696
    Filtered cutoff index: 1730
    Filtered cutoff count: 2

    Total time generating hash table index: 0.72899 s

    Total time indexing: 11.3475 s

    Using rescue cutoff: 4
    Running PE mode
    Mapping chunk of 1000000 query sequences...
    Estimated diff in start coordinates b/t mates, (mean: 104.07, stddev: 124.369)
    Mapping chunk of 1000000 query sequences...
    Estimated diff in start coordinates b/t mates, (mean: 96.6191, stddev: 175.71)
    Mapping chunk of 1000000 query sequences...
    Estimated diff in start coordinates b/t mates, (mean: 90.6641, stddev: 208.097)
    Mapping chunk of 1000000 query sequences...
    Estimated diff in start coordinates b/t mates, (mean: 117.778, stddev: 245.558)
    Mapping chunk of 1000000 query sequences...
    Estimated diff in start coordinates b/t mates, (mean: 106.226, stddev: 274.345)
    Mapping chunk of 1000000 query sequences...
    Estimated diff in start coordinates b/t mates, (mean: 101.158, stddev: 301.601)
    Mapping chunk of 1000000 query sequences...
    Estimated diff in start coordinates b/t mates, (mean: 97.5571, stddev: 323.28)
    Mapping chunk of 1000000 query sequences...
    Estimated diff in start coordinates b/t mates, (mean: 102.831, stddev: 345.341)
    Mapping chunk of 1000000 query sequences...
    Estimated diff in start coordinates b/t mates, (mean: 109.84, stddev: 371.438)
    Mapping chunk of 1000000 query sequences...
    Estimated diff in start coordinates b/t mates, (mean: 114.511, stddev: 392.006)
    Mapping chunk of 1000000 query sequences...
    Estimated diff in start coordinates b/t mates, (mean: 117.354, stddev: 407.773)
    Mapping chunk of 1000000 query sequences...
    Estimated diff in start coordinates b/t mates, (mean: 138.058, stddev: 426.87)
    Mapping chunk of 1000000 query sequences...
    Estimated diff in start coordinates b/t mates, (mean: 114.614, stddev: 440.569)
    Mapping chunk of 1000000 query sequences...
    Estimated diff in start coordinates b/t mates, (mean: 108.915, stddev: 452.172)
    Mapping chunk of 1000000 query sequences...
    Estimated diff in start coordinates b/t mates, (mean: 132.276, stddev: 467.806)
    Mapping chunk of 1000000 query sequences...
    Estimated diff in start coordinates b/t mates, (mean: 129.727, stddev: 483.589)
    Mapping chunk of 1000000 query sequences...
    Estimated diff in start coordinates b/t mates, (mean: 109.746, stddev: 500.041)
    Mapping chunk of 1000000 query sequences...
    Estimated diff in start coordinates b/t mates, (mean: 107.748, stddev: 512.066)
    Mapping chunk of 1000000 query sequences...
    Estimated diff in start coordinates b/t mates, (mean: 107.68, stddev: 525.201)
    Mapping chunk of 1000000 query sequences...
    Estimated diff in start coordinates b/t mates, (mean: 117.557, stddev: 539.586)
    Mapping chunk of 1000000 query sequences...
    Estimated diff in start coordinates b/t mates, (mean: 107.239, stddev: 553.863)
    Mapping chunk of 1000000 query sequences...
    Estimated diff in start coordinates b/t mates, (mean: 116.547, stddev: 566.07)
    Mapping chunk of 1000000 query sequences...
    Estimated diff in start coordinates b/t mates, (mean: 110.742, stddev: 578.011)
    Mapping chunk of 148502 query sequences...
    Estimated diff in start coordinates b/t mates, (mean: 110.161, stddev: 589.893)
    Total mapping sites tried: 13055987
    Total calls to ksw: 3218137
    Calls to ksw (rescue mode): 1276608
    Did not fit strobe start site: 1776340
    Tried rescue: 33208162
    Total time mapping: 103.476 s.
    Total time reading read-file(s): 25.1783 s.
    Total time creating strobemers: 18.6664 s.
    Total time finding NAMs (non-rescue mode): 3.60437 s.
    Total time finding NAMs (rescue mode): 0.409914 s.
    Total time sorting NAMs (candidate sites): 0.0890606 s.
    Total time reverse compl seq: 0 s.
    Total time extending alignment: 17.081 s.
    Total time writing alignment to files: 5.61499 s.

    real 1m55.191s
    user 17m49.681s
    sys 0m14.051s
    $ ls -lhs mapped.sam
    7.3G -rw-r--r-- 1 jzhao399 p-ktk3 7.3G Nov 21 20:23 mapped.sam
    $ samtools view -bS -@ 24 mapped.sam | samtools sort -@ 24 -O bam -o mapped_sorted.bam
    samtools view: error reading file "mapped.sam": Input/output error
    samtools view: error closing "mapped.sam": -5
    [bam_sort_core] merging from 0 files and 24 in-memory blocks...

    Any idea?

    Thanks,

    Jianshu

    bug 
    opened by jianshu93 10
  • Illegal instruction ?


    strobealign StaphylococcusLT963435.fasta fastpSRR13329724_1.fastq.gz fastpSRR13329724_2.fastq.gz > outputStaphreads.sam
    Using
    k: 20
    s: 16
    w_min: 5
    w_max: 11
    Read length (r): 129
    Maximum seed length: 79
    Threads: 3
    R: 2
    Expected [w_min, w_max] in #syncmers: [5, 11]
    Expected [w_min, w_max] in #nucleotides: [25, 55]
    A: 2
    B: 8
    O: 12
    E: 1
    Time reading references: 0.01117 s

    ref vector approximate size: 555450
    Illegal instruction

    opened by AntonioBaeza 9
  • Create classes that represent SAM files and records


    To solve #31 and #56, we should create an AlignmentRecord or so class (or something like that). The idea is that objects of this class represent the information from one row in a SAM/BAM file, and that there would be a method that writes out the data in SAM format. All the functions that want to write SAM output then would create one instance per row that they want to write and let the class do the formatting. That way, the formatting (serialization) is done in one place only and could, at least in theory, be much more easily switched for something else, for example, if we wanted to support writing BAM or CRAM (not that I suggest this, I think writing SAM is totally fine).

    Possibly some kind of SamFile or so class could be involved in order to store the read group ID that we want to add as part of #31.

    opened by marcelm 8
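
The idea in this issue, one object per SAM row with a single place that does the formatting, can be sketched as follows. This is a Python illustration under stated assumptions, not the project's C++ design: the class layout and the to_sam method name are hypothetical, and only the eleven mandatory SAM columns plus optional tags are modelled.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AlignmentRecord:
    """One row of a SAM file (hypothetical sketch of the class proposed above)."""

    qname: str
    flag: int
    rname: str = "*"
    pos: int = 0
    mapq: int = 0
    cigar: str = "*"
    rnext: str = "*"
    pnext: int = 0
    tlen: int = 0
    seq: str = "*"
    qual: str = "*"
    tags: List[str] = field(default_factory=list)  # e.g. "NM:i:0", "AS:i:96"

    def to_sam(self) -> str:
        """Serialize the record as one tab-separated SAM line (no trailing newline)."""
        columns = [
            self.qname, str(self.flag), self.rname, str(self.pos), str(self.mapq),
            self.cigar, self.rnext, str(self.pnext), str(self.tlen), self.seq, self.qual,
        ]
        return "\t".join(columns + self.tags)
```

Because all formatting lives in to_sam, a different serialization could be added later without touching the code that creates the records.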
  • terminate called after throwing an instance of 'std::out_of_range'


    Hi @ksahlin , I am giving a try to strobealign v0.6. I was able to compile it correctly on my computing nodes. Genome indexing works fine but in the PE mode, it crashes from the beginning of the mapping part:

    ...
    Using rescue cutoff: 208
    @SQ     SN:Chr01        LN:51433939
    @SQ     SN:Chr04        LN:48048378
    @SQ     SN:Chr08        LN:63048260
    @SQ     SN:Chr11        LN:53580169
    @SQ     SN:Chr09        LN:38250102
    @SQ     SN:Chr05        LN:40923498
    @SQ     SN:Chr02        LN:49670989
    @SQ     SN:Chr10        LN:44302882
    @SQ     SN:Chr07        LN:40041001
    @SQ     SN:Chr06        LN:31236378
    @SQ     SN:Chr03        LN:53438756
    @PG     ID:strobealign  PN:strobealign  VN:0.6  CL:strobealign
    Running PE mode
    Mapping chunk of 1000000 query sequences...
    terminate called after throwing an instance of 'std::out_of_range'
      what():  basic_string::substr
    Aborted (core dumped)
    

    This is strange, because running it in SE mode with either the forward or the reverse reads works perfectly. Any idea about this possible issue?

    Thanks!

    opened by TDDB-limagrain 8
  • Picard ValidateSamFile: INVALID_ALIGNMENT_START


    Picard ValidateSamFile has many complaints like this:

    ERROR::INVALID_ALIGNMENT_START:Record 23701, Read name SRR6055476.11851, Mate Alignment start (3496047) must be <= reference sequence length (44411) on reference NW_007931074.1
    
    bug 
    opened by marcelm 3
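
The condition in the Picard message above (mate alignment start must be <= the length of the mate's reference) can be checked with a few lines. The sketch below only mirrors that quoted condition; the function name and the record layout are hypothetical, and the reference name chrA with its length is made up for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def invalid_mate_starts(
    records: Iterable[Tuple[str, str, str, int]],
    ref_lengths: Dict[str, int],
) -> List[str]:
    """Return names of reads whose mate start lies beyond the mate's reference.

    Each record is (qname, rname, rnext, pnext); "=" in rnext means "same as rname".
    """
    bad = []
    for qname, rname, rnext, pnext in records:
        mate_ref = rname if rnext == "=" else rnext
        if mate_ref == "*":
            continue  # mate reference unavailable: nothing to check
        if pnext > ref_lengths[mate_ref]:
            bad.append(qname)
    return bad


# Lengths: NW_007931074.1 is taken from the message above, chrA is hypothetical.
lengths = {"NW_007931074.1": 44411, "chrA": 5000000}
rows = [
    ("SRR6055476.11851", "chrA", "NW_007931074.1", 3496047),
    ("fine_read", "chrA", "=", 100),
]
flagged = invalid_mate_starts(rows, lengths)
```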
  • Different robin_hood::unordered_map behavior when clearing vs. re-creating


    Commit 5b462d6 introduced a change in alignment results. This is the diff of the alignments (- is before, + is after) for the first of two affected read pairs in the simulated Drosophila dataset:

    -simulated.262476/1 97  211000022278483 515 60  2S48=   3L  14468499    14468024    TGGAAATTAAATTTTTTGGCATTATTTTGCAAATTTTGATGACCCCCCTC  HIHHGGIHHDHHFFIGIHIDIIHIEFG>IGEIFDIEHIBIG;I7>[email protected]    NM:i:0  AS:i:96
    +simulated.262476/1 65  211000022278483 515 60  2S48=   211000022278398 443 -122    TGGAAATTAAATTTTTTGGCATTATTTTGCAAATTTTGATGACCCCCCTC  HIHHGGIHHDHHFFIGIHIDIIHIEFG>IGEIFDIEHIBIG;I7>[email protected]    NM:i:0  AS:i:96
    
    -simulated.262476/2 145 3L              14468499  60  9S10=1X30=  211000022278483 515 -14468024   ATATGGAATGTGTTGGAATNTACTATTCAACCTACAAAAATAACGTTAAA  HIII7CIIIFIIIEIFIIGIHIB8IIIIIIBIIFIIFIEGIGHIIIGHIH    NM:i:1  AS:i:72
    +simulated.262476/2 129 211000022278398      443  60  30=1X9=10S  211000022278483 515       122   TTTAACGTTATTTTTGTAGGTTGAATAGTANATTCCAACACATTCCATAT  HIHGIIIHGIGEIFIIFIIBIIIIII8BIHIGIIFIEIIIFIIIC7IIIH    NM:i:1  AS:i:70
    

    I have no idea at the moment why such a change occurs. It appears that the robin_hood::unordered_map behaves differently when it is cleared and then re-filled compared to creating a new one from scratch.

    Originally posted by @marcelm in https://github.com/ksahlin/StrobeAlign/issues/140#issuecomment-1302126477

    And @ksahlin replied:

    [...] the new version chooses an alignment that is worse than the best (given how the 'best' is defined) - as none of the alignment configurations are 'mapped in proper pair'. (of course the second one is better if inversion is more likely than translocation).

    This could be an issue that comes back and haunts us.

    opened by marcelm 0
  • [E::read_ncigar] No CIGAR operations  issue


    Hi, I'm using strobealign to test alignment performance, and I get this error from the samtools view command on a SAM file created by strobealign (see the command and output below). The .fna reference file is Homo sapiens chromosome 18, GRCh38.p14. The SAM creation was successful, aligning a FASTQ that was simulated with pbsim directly from the same NC000018.fna. What am I doing wrong? Thanks in advance.

    The samtools view command was: samtools view -bS -T NC000018.fna strobealign_ps01_NC000018.sam -O BAM -o strobealign_ps01_NC000018.bam -@ 12

    and the output was

    samtools view: error reading file "NC000018.sam"
    [E::read_ncigar] No CIGAR operations
    samtools view: error closing "NC000018.sam": -5
    

    This is the command and output of the SAM creation with strobealign

     /packages/StrobeAlign/build/strobealign NC000018.fna ps01_NC000018.fastq -o strobealign_ps01_NC000018.sam -t 12
    
    Using
    k: 23
    s: 17
    w_min: 5
    w_max: 15
    Read length (r): 10032
    Maximum seed length: 278
    Threads: 12
    R: 2
    Expected [w_min, w_max] in #syncmers: [5, 15]
    Expected [w_min, w_max] in #nucleotides: [35, 105]
    A: 2
    B: 8
    O: 12
    E: 1
    Time reading references: 0.965166 s
    
    ref vector approximate size: 11481897
    Ref vector actual size: 10783788
    Time generating seeds: 2.99098 s
    
    Time sorting seeds: 1.18937 s
    
    Unique strobemers: 9922754
    Total time generating flat vector: 4.18048 s
    
    Flat vector size: 10783788
    Total strobemers count: 10783788
    Total strobemers occur once: 9840601
    Fraction Unique: 0.991721
    Total strobemers highly abundant > 100: 783
    Total strobemers mid abundance (between 2-100): 81369
    Total distinct strobemers stored: 9922754
    Ratio distinct to highly abundant: 12672
    Ratio distinct to non distinct: 120
    Filtered cutoff index: 1984
    Filtered cutoff count: 41
    
    
    Total time generating hash table index: 0.584856 s
    
    Total time indexing: 5.73055 s
    
    Using rescue cutoff: 82
    Running SE mode
    Done!
    Total mapping sites tried: 502929
    Total calls to ssw: 502929
    Calls to ksw (rescue mode): 0
    Did not fit strobe start site: 1585
    Tried rescue: 16813
    Total time mapping: 91.5354 s.
    Total time reading read-file(s): 13.2334 s.
    Total time creating strobemers: 2.68648 s.
    Total time finding NAMs (non-rescue mode): 0.603002 s.
    Total time finding NAMs (rescue mode): 0.0492406 s.
    Total time sorting NAMs (candidate sites): 0.0038492 s.
    Total time reverse compl seq: 5.79428e-311 s.
    Total time base level alignment (ssw): 3.01523 s.
    Total time writing alignment to files: 6.92938e-310 s.
    
    opened by rudigaspardo 1
  • TLEN should be set to zero when reads map to different contigs


    SAM spec:

    TLEN: [...]. It is set as 0 for a single-segment template or when the information is unavailable (e.g., when the first or last segment of a multi-segment template is unmapped or when the two are mapped to different reference sequences).

    opened by marcelm 0
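
The rule quoted from the SAM spec can be written down as a small function. This is a minimal sketch, not strobealign's implementation: the function name and the segment representation are hypothetical, coordinates are assumed to be 1-based with inclusive ends, and giving the plus sign to the first segment when both start at the same position is an assumption.

```python
from typing import Optional, Tuple

# (reference name, 1-based start, 1-based inclusive end); None means unmapped
Segment = Optional[Tuple[str, int, int]]


def template_lengths(first: Segment, second: Segment) -> Tuple[int, int]:
    """Return the TLEN values for the two segments of a pair."""
    if first is None or second is None or first[0] != second[0]:
        # Unmapped segment or different reference sequences: TLEN is 0
        return 0, 0
    leftmost = min(first[1], second[1])
    rightmost = max(first[2], second[2])
    length = rightmost - leftmost + 1
    # The leftmost segment gets the plus sign, the other the minus sign
    if first[1] <= second[1]:
        return length, -length
    return -length, length
```

For the first read pair in the robin_hood diff quoted in an earlier issue, the mates map to 211000022278483 and 3L, so this function would return (0, 0) rather than the large values shown there.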
  • The /1 and /2 suffixes should be stripped from query names


    A FASTQ record with header @readname/1 or @readname/2 should be named readname in the SAM output.

    The SAM spec says:

    QNAME: Query template NAME. Reads/segments having identical QNAME are regarded to come from the same template.

    That is, if we want paired-end reads to be considered coming from the same template, then they need to get the same QNAME.

    BWA-MEM does this.

    opened by marcelm 1
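
A minimal sketch of the requested behavior follows, assuming the read name has already been separated from the leading @ and from any comment after whitespace. The function name is hypothetical.

```python
def strip_mate_suffix(name: str) -> str:
    """Remove a trailing /1 or /2 so both mates share the same QNAME."""
    if name.endswith(("/1", "/2")):
        return name[:-2]
    return name
```

With this, readname/1 and readname/2 both become readname, while names with other suffixes are left untouched.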
Releases(v0.7.1)
  • v0.7.1(Apr 17, 2022)

    Improvements mainly for large repetitive genomes.

    • Introduces maximum limit on repetitive seeds before calling optimized merged match finder (optimized for repetitive reads). This reduces the computational time if the genome is large and repetitive, e.g., maize (2.4Gb), rye (7.8Gb), significantly.
    • Fixes sam header issue https://github.com/ksahlin/StrobeAlign/issues/22
    • Removes dependency on ksw2.
  • v0.7(Apr 1, 2022)

    Major update in the implemented parallelization. The new parallel implementation allows a much more efficient interplay between reading input -> aligning -> writing output. This results in much better CPU usage as the number of threads increases. For example, I observed almost a 2x speedup (30-50% reduced runtime) across four larger datasets when using 16 cores (SIM and GIAB 150bp and 250bp reads, see README benchmarks).

    For reference, previous naive parallelization ran in sequential order: 1. Read batch of reads with one thread 2. Align batch input in parallel with OpenMP 3. Write output with one thread. New parallelization performs 1-3 across threads with mutex on input and output. Such types of parallelization are commonly applied in other tools.
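
The scheme described above, where every thread performs steps 1-3 itself and only input and output are guarded by a mutex, can be sketched as follows. This is an illustration of the structure only, written in Python rather than the project's C++; the function names are hypothetical, align stands in for the real alignment step, and Python threads will not show the speedup reported above.

```python
import threading
from typing import Callable, Iterable, List


def run_pipeline(
    batches: Iterable[List[str]],
    align: Callable[[str], str],
    n_threads: int = 4,
) -> List[str]:
    """Each worker reads a batch, aligns it, and writes the results."""
    batch_iter = iter(batches)
    input_lock = threading.Lock()
    output_lock = threading.Lock()
    output: List[str] = []

    def worker() -> None:
        while True:
            with input_lock:  # step 1: only one thread reads at a time
                batch = next(batch_iter, None)
            if batch is None:
                return
            results = [align(read) for read in batch]  # step 2: no lock held
            with output_lock:  # step 3: only one thread writes at a time
                output.extend(results)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return output
```

Note that batches may be written in a different order than they were read, since whichever thread finishes first takes the output lock.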

    This release also includes:

    • Implemented automatic inference of read length, which removes the need of specifying -r (as reported in https://github.com/ksahlin/StrobeAlign/issues/19)
    • Some minor bugfixes. For example, this bug is fixed.

    This release has identical or near-identical alignments to the previous version v0.6.1 (same accuracy and SV calling stats across tested datasets)

  • v0.6.1(Feb 23, 2022)

  • v0.6(Feb 20, 2022)

    Version 0.6 fixes a crucial bug introduced in v0.5 and has two additional bug fixes that improve accuracy. It is highly recommended to update to this version.

    1. Crucial bugfix to v0.5, which caused rare but occasional alignments to very long reference regions due to a coordinate bug. This was detrimental to speed.
    2. Identifies symmetrical hash collisions and, in those cases, tests the reverse orientation. This leads to a further slight bump in alignment accuracy over previous versions, particularly for shorter read lengths.
    3. Fixes a rare but occasional uninitialized joint alignment score S calculation that would cause suboptimal alignments.
    4. Fixes reporting of the template length field in SAM output if the alignment contains a deletion.
  • v0.5(Feb 16, 2022)

    Added features, some improvements in alignment (accuracy), and minor bugfixes.

    1. Added parameter -N [INT] to output secondary alignments
    2. Base level alignment parameters can now be specified from command line -A -B -E -O
    3. Improved MAPQ calculation: calculating them from alignments (if alignment mode) instead of from seeds.
    4. Update default base-level alignment parameters for better alignments around indels.
    5. Added Quality values, AS:i and NM:i tags to SAM output.

    See INDEL/SNV calling benchmark in README.

  • v0.4(Jan 16, 2022)

  • v0.3(Jan 13, 2022)

  • v0.2.1(Jan 9, 2022)

  • v0.2(Dec 30, 2021)

  • v0.1(Dec 27, 2021)

    Major update of strobealign. This version improves accuracy (and the number of aligned reads) for reads of around 100-125nt, and it is also faster than older versions for these lengths. Most notable changes:

    • Algorithm changes

      • Using xxhash instead of no hash for strobes. Gives a better pseudorandom generation of hashes for linking.
      • Linking strobes using bitcount( (h_1 ^ h_2) ^ q) which creates a skewed seed length distribution towards shorter seeds in the window. This improves mapping candidate read detection particularly for shorter reads (100nt).
    • Parameters

      • Adding the option to customize sampling window of second strobe with -l and -u.
      • Adding a parameter -r [INT] for approximate read length (default 150). This will make strobealign customize parameters -l -u, and -k
    • Also cuts the reference accessions at first space, which fixes issue #4
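
The linking expression mentioned under "Algorithm changes" can be illustrated with a small sketch. It evaluates bitcount((h_1 ^ h_2) ^ q) exactly as written in the note above for each candidate hash in the window. That the candidate with the smallest bit count is chosen, and that ties go to the earliest candidate, are assumptions; the function name is hypothetical and q is treated as an arbitrary integer constant.

```python
from typing import Sequence


def link_second_strobe(h1: int, window_hashes: Sequence[int], q: int) -> int:
    """Return the index of the window hash h2 with minimal bitcount((h1 ^ h2) ^ q)."""
    best_index = 0
    best_score = None
    for i, h2 in enumerate(window_hashes):
        score = bin((h1 ^ h2) ^ q).count("1")  # number of set bits
        if best_score is None or score < best_score:
            best_index, best_score = i, score
    return best_index
```

For example, with h1 = 0b1010 and q = 0b0011, the candidates 0b1010, 0b1001 and 0b0000 score 2, 0 and 2 set bits, so the second candidate is linked.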

  • v0.0.3.2(Nov 30, 2021)

  • v0.0.3(Nov 3, 2021)

    Version 0.0.3

    1. Has paired-end alignment mode
    2. Implements a rescue mode both in SE and PE alignment modes (described in preprint).
    3. Changed to symmetrical strobemer hash values due to inversions (described in preprint).

    Known bugs:

    • Negative SAM coordinate bug in Single-end alignment mode. Observed once in 150M simulated reads
    • Segfault in paired-end mapping mode (never in alignment mode). Observed for the shortest reads (100nt) three times in 150M simulated reads
  • v0.0.2(Sep 27, 2021)

    StrobeAlign is now parallelized with OpenMP and can read fastq and gzipped fastq files with kseqpp.

    TODO

    • PE-alignment mode and joint scoring
    • Separate creation and storage of reference index
  • v0.0.1(Sep 21, 2021)

Owner
Kristoffer

Rudra_deep 1 Nov 13, 2021
In DFS-BFS Implementation In One Program Using Switch Case I am Using an Simple And Efficient Code of DFS-BFS Implementation.

DFS-BFS Implementation-In-One-Program-Using-Switch-Case-in-C Keywords : Depth First Search(DFS), Breadth First Search(BFS) In Depth First Search(DFS),

Rudra_deep 1 Nov 17, 2021
multi-sdr-gps-sim generates a IQ data stream on-the-fly to simulate a GPS L1 baseband signal using a SDR platform like HackRF or ADLAM-Pluto.

multi-sdr-gps-sim generates a GPS L1 baseband signal IQ data stream, which is then transmitted by a software-defined radio (SDR) platform. Supported at the moment are HackRF, ADLAM-Pluto and binary IQ file output. The software interacts with the user through a curses based text user interface (TUI) in terminal.

null 66 Oct 29, 2022
CMSIS-DAP using TinyUSB

Dapper Mime This unearths the name of a weekend project that I did in 2014. Both then and now, this is a port of ARM's CMSIS-DAP code to a platform wi

null 70 Nov 28, 2022
And ESP32 powered VU matrix using the INMP441 I2S microphone

ESP32-INMP441-Matrix-VU This is the repository for a 3D-printed, (optionally) battery-powered, WS2812B LED matrix that produces pretty patterns using

null 55 Nov 18, 2022
Using a RP2040 Pico as a basic logic analyzer, exporting CSV data to read in sigrok / Pulseview

rp2040-logic-analyzer This project modified the PIO logic analyzer example that that was part of the Raspberry Pi Pico examples. The example now allow

Mark 60 Oct 31, 2022
Arduino sample code to help you get started using the Soracom IoT Starter Kit!

Soracom IoT Starter Kit The Soracom IoT Starter Kit includes everything you need to build your first connected device. It includes an Arduino MKR GSM

Soracom Labs 13 Jul 30, 2022
A laser cut Dreamcast Pop'n Music controller and integrated memory card using the Raspberry Pi Pico's Programmable IO

Dreamcast Pop'n Music Controller Using Raspbery Pi Pico (RP2040) Intro This is a homebrew controller for playing the Pop'n Music games on the Sega Dre

null 38 Oct 25, 2022
Web Server based on the Raspberry Pico using an ESP8266 with AT firmware for WiFi

PicoWebServer This program runs on a Raspberry Pico RP2040 to provide a web server when connected to an Espressif ESP8266. This allows the Pico to be

null 50 Nov 27, 2022
Using the LilyGo EPD 4.7" display to show OWM Weather Data

LilyGo-EPD-4-7-OWM-Weather-Display Using the LilyGo EPD 4.7" display to show OWM Weather Data Version 2.72 Improved Icon shapes and positioning Adjust

G6EJD 13 Apr 2, 2021
A Walkie-Talkie based around the ESP32 using UDP broadcast or ESP-NOW

Overview We've made a Walkie-Talkie using the ESP32. Explanatory video Audio data is transmitted over either UDP broadcast or ESP-NOW. So the Walkie-T

atomic14 255 Nov 14, 2022
Control your mouse using razer synapse

rzctl Control your mouse using razer synapse Compile in x64 Not tested for x86 Credits Process Hacker - https://github.com/processhacker/processhacker

null 43 Nov 23, 2022