This repo implements the Landau-Vishkin algorithm to compute the edit distance between two strings. This is a fast method for highly similar strings. The actual implementation follows a simplified WFA formulation rather than the original formulation. It also learns a performance trick from WFA. For a pair of ~5Mb HLA sequences with ~123k edits, the implementation here can find the result in 71 seconds, faster than edlib. Edlib will be faster for more divergent sequences.
C implementation of the Landau-Vishkin algorithm
You write in the readme that the presentation in Landau-Vishkin'89 is closer to modern one than the one from Ukkonen'85, but after some reading I've actually come to the opposite conclusion, so I'm wondering what I'm missing:
- LV finds all occurences of a pattern in a text, while you do global alignment.
- LV 'only' introduces the suffix-tree speedup for constant time extending, which you don't seem to use.
- The recursion presented in Ukkonen and LV otherwise looks the same to me.
// filter out diagonals that are unlikely to reach a better alignment than the current best. Only applicable to global alignment. This is not an approximation. static inline void wf_prune_global(int32_t tl, int32_t ql, int32_t *st, int32_t *en, const wf_diag_t *b)
Looking at this function, this seems to be the approximation/reduction explained in section 4.2 if the WFA paper. However, your comment 'this is not an approximation' sounds like it preserves optimality of the alignment.
What am I missing?
I came across this repo when looking for an implementation of the Landau-Vishkin algorithm. I'd like to compare my own work-in-progress algorithm to it, and since WFA is more general, a dedicated implementation for gap-linear or unit costs would be nice.
Are you aware of any other implementations? Or is the lack of them the reason you started this project? Are you planning to publish a paper based on this code that could be cited at some point?