Snowball compiler and stemming algorithms

Overview

Snowball is a small string processing language for creating stemming algorithms for use in Information Retrieval, plus a collection of stemming algorithms implemented using it.

Snowball was originally designed and built by Martin Porter. Martin retired from development in 2014 and Snowball is now maintained as a community project. Martin originally chose the name Snowball as a tribute to SNOBOL, the excellent string handling language from the 1960s. It now also serves as a metaphor for how the project grows by gathering contributions over time.

The Snowball compiler translates a Snowball program into source code in another language. Currently Ada, ISO C, C#, Go, Java, JavaScript, Object Pascal, Python and Rust are supported.

This repository contains the source code for the Snowball compiler and the stemming algorithms. The Snowball compiler is written in ISO C. You'll need a C compiler that supports C99 to build it (but the C code it generates should work with any ISO C compiler).

See https://snowballstem.org/ for more information about Snowball.

What is Stemming?

Stemming maps different forms of the same word to a common "stem". For example, the English stemmer maps connection, connections, connective, connected, and connecting to connect. So a search for connected would also find documents which only have the other forms.

The stem is often a word itself, but not always: text search systems, which are the intended field of use, do not require it. We also aim to conflate words with the same meaning, rather than all words with a common linguistic root (so awe and awful don't have the same stem). Over-stemming is more problematic than under-stemming, so we tend not to stem in cases that are hard to resolve. If you want to always reduce words to a root form and/or get a root form which is itself a word, then Snowball's stemming algorithms likely aren't the right answer.
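
The search behaviour described above can be sketched with a small inverted index keyed by stem. This is only an illustration: `toy_stem`, `build_index` and `search` are hypothetical names, and `toy_stem` is a crude suffix stripper standing in for a real stemmer, not the Snowball English algorithm.

```python
from collections import defaultdict

# Hypothetical toy stemmer: strips a few English suffixes. It is only a
# stand-in for illustration and is NOT the Snowball English algorithm.
SUFFIXES = ("ions", "ion", "ive", "ing", "ed", "s")


def toy_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three letters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Map each stem to the set of document ids containing a form of it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[toy_stem(word)].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents sharing the query's stem."""
    return sorted(index.get(toy_stem(query), set()))


docs = {
    1: "a stable connection",
    2: "connecting the cables",
    3: "unrelated text",
}
index = build_index(docs)
print(search(index, "connected"))  # documents 1 and 2 share the stem "connect"
```

Because both the indexed words and the query pass through the same stem function, the stem never needs to be a real word; it only needs to be the same for the forms that should match.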

Comments
  • Rust Backend

    Hey,

    For my own needs I hacked together a Rust backend for the Snowball compiler. It's mostly a literal translation of the Python and Java backends.

    Its current state is sufficient for my needs but not enough to merge it upstream. As far as I can see, the following things would need to be done:

    • [x] Add a library similar to the ones residing in java/ and python/
    • [x] Test the implementation against all stemmers (currently only tested on eight)
    • [x] Integrate these tests into travis
    • [x] Yield warning-free rust code
    • [x] Implement missing features

    Before tackling these tasks I just wanted to know if pull requests to this repository are still accepted/merged.

    opened by JDemler 17
  • Optimized snowball code for tamil stemmer

    I have replaced simple or conditions with the among() command as suggested by Martin. I tried to use among() with some test() commands inside, but they didn't work as expected, so I have left those as they are. I tested the modification against the data in https://github.com/rdamodharan/snowball-data/tree/master/tamil and the results were the same.

    opened by rdamodharan 17
  • Add Python generator

    I am not the original author of these generators; @shibukawa is. However I needed to use the Python generator, so I have fixed some compilation errors in the code, rebased it against the current snowball code, and submitted this pull request.

    The original code by Yoshiki can be found at shibukawa/snowball (in snowball directory).

    Python needs no introduction. JSX is a statically typed JavaScript-like language that compiles to JavaScript; the compiler can be found at jsx/JSX (requires node.js).

    Update: I excluded the JSX generator from this pull request

    opened by mitya57 15
  • add Go generator

    Initial implementation, ported from the Rust generator. It is primarily focused on getting the functionality working, not on producing the best or most performant Go code.

    It seems to work, based on make check_go and other ad-hoc testing.

    opened by mschoch 14
  • Problems with Russian letter Ё

    As http://snowballstem.org/algorithms/russian/stemmer.html properly mentions, the Russian alphabet contains the letter Ё [jo], which is quite often replaced with Е, especially in regular, non-academic texts.

    So indeed the best approach is to replace Ё -> Е when stemming.

    Now if you check the existing demo at http://snowballstem.org/demo.html you can see that this doesn't actually happen.

    Take the Russian word for "honey", «мёд», and its form with a different ending, «мёдом». If you paste them along with their "normalized" forms (with Е) you can see that the forms with Ё are not properly stemmed.

    Here's sample input so that you can run the tests yourself: "мёд мёдом мед медом".

    This is a serious problem when searching through a corpus of natural texts. Even if you're a purist (like me in this case) and type all your search terms with properly placed Ё, you won't be able to match the original texts that use Е.
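
As a workaround on the calling side, the Ё -> Е replacement can be applied to text before it reaches the stemmer. The following is a minimal sketch under that assumption; `normalize_yo` and `YO_TO_YE` are hypothetical names and are not part of Snowball.

```python
# Translation table mapping both cases of Cyrillic Ё to Е.
YO_TO_YE = str.maketrans({"ё": "е", "Ё": "Е"})


def normalize_yo(text):
    """Replace Ё/ё with Е/е so both spellings reach the stemmer identically."""
    return text.translate(YO_TO_YE)


print(normalize_yo("мёд мёдом мед медом"))  # мед медом мед медом
```

The same normalization would have to be applied both at indexing time and to search queries for the two spellings to match.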

    opened by emirotin 13
  • Java builds and tests

    Hi @rboulton,

    I would like the Java stemmers to also be included in the Travis CI builds, ideally with stemming tests like their C counterparts (see #5).

    If it's alright with you, I'd like to work on this and submit a pull-request shortly.

    oerd

    opened by oerd 11
  • [Ada] Add support for Ada generator

    This pull request adds support for an Ada code generator.

    The Stemmer library is available at https://github.com/stcarrez/ada-stemmer

    The Ada code generator has been checked with English, Danish, Dutch, French, German, Greek, Italian, Serbian, Spanish, Swedish, Russian.

    opened by stcarrez 10
  • Snowball version of Porter stemmer for Lithuanian language

    Hello,

    I've been working on a Snowball stemmer for the Lithuanian language and I'd like to contribute it to the wider community. I hope the community can benefit from my work.

    Please let me know if you would like to know more about anything in this pull request. If there are any problems with my code, or more things to do before the code can be merged, I'm more than willing to put in the effort to fix them.

    Best wishes, Dainius

    opened by dainiusjocas 10
  • Clarification on if `snowball` (specifically python implementation) is not thread-safe

    Hi, we've been experiencing intermittent inconsistent outputs (i.e. bug) when using snowball with dask multiprocessing. We can stop these bugs occurring by any of a) using a single process/thread, b) removing the stemming, or c) moving the instantiation of the stemmer inside the function which is being applied within threads.

    Could someone with expertise weigh in on whether snowball is thread-safe or not?

    This might be related to #146, which seems to imply that the C# implementation is not thread-safe.
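
Workaround (c) above, giving each worker its own stemmer instance, can be sketched with `threading.local`. This does not settle whether the stemmer objects are thread-safe; it only avoids sharing one instance between threads. `get_stemmer`, `worker` and `FakeStemmer` are hypothetical names, and `FakeStemmer` is a stand-in so the example runs without any third-party package.

```python
import threading

# Per-thread storage: each thread sees its own attributes on this object.
_local = threading.local()


def get_stemmer(factory):
    """Return the calling thread's stemmer, creating it on first use."""
    stemmer = getattr(_local, "stemmer", None)
    if stemmer is None:
        stemmer = factory()
        _local.stemmer = stemmer
    return stemmer


class FakeStemmer:
    """Hypothetical stand-in that keeps mutable state between calls."""

    def __init__(self):
        self.buffer = ""

    def stemWord(self, word):
        self.buffer = word
        return self.buffer.lower()


def worker(results, index):
    # Each thread stores the instance it was handed so they can be compared.
    results[index] = get_stemmer(FakeStemmer)


results = [None] * 4
threads = [threading.Thread(target=worker, args=(results, i)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len({id(s) for s in results}))  # 4 distinct instances, one per thread
```

In real use the factory would be whatever constructs the stemmer, for example a function wrapping `snowballstemmer.stemmer("english")`. With process-based workers each process would need its own instance in the same way.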

    opened by DBCerigo 8
  • python AttributeError snowballstemmer.algorithms()

    Hello,

    I installed the library from PyPI

    pip install snowballstemmer
    

    There is a bug in https://github.com/snowballstem/snowball/blob/master/python/create_init.py#L42

    ----> 1 snowballstemmer.algorithms()
    
         67         return Stemmer.language()
         68     else:
    ---> 69         return list(_languages.key())
         70
         71 def stemmer(lang):
    
    AttributeError: 'dict' object has no attribute 'key'
    

    It should be _languages.keys()

    opened by kkaiser 8
  • UTF-8 ?

    From: algorithms/french/stem_ISO_8859_1.sbl

    stringdef a^   hex 'E2'  // a-circumflex
    stringdef a`   hex 'E0'  // a-grave
    stringdef c,   hex 'E7'  // c-cedilla
    
    stringdef e"   hex 'EB'  // e-diaeresis (rare)
    stringdef e'   hex 'E9'  // e-acute
    stringdef e^   hex 'EA'  // e-circumflex
    stringdef e`   hex 'E8'  // e-grave
    stringdef i"   hex 'EF'  // i-diaeresis
    stringdef i^   hex 'EE'  // i-circumflex
    stringdef o^   hex 'F4'  // o-circumflex
    stringdef u^   hex 'FB'  // u-circumflex
    stringdef u`   hex 'F9'  // u-grave
    

    So far there is no UTF-8 version. Why?
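
For context on why the byte values in these definitions are specific to ISO-8859-1, the sketch below decodes three of them and shows the bytes the same characters need in UTF-8. `LATIN1_BYTES` and `describe` are hypothetical names used only for this illustration.

```python
# Byte values copied from three of the stringdef lines above.
LATIN1_BYTES = {"a^": 0xE2, "e'": 0xE9, "c,": 0xE7}


def describe(byte_value):
    """Decode one ISO-8859-1 byte and return (character, UTF-8 bytes)."""
    char = bytes([byte_value]).decode("iso-8859-1")
    return char, char.encode("utf-8")


for name, value in LATIN1_BYTES.items():
    char, utf8 = describe(value)
    # Each character is one byte in ISO-8859-1 but two bytes in UTF-8.
    print(name, char, utf8.hex())
```

A single byte such as E2 is therefore not a valid encoding of â in UTF-8 input, which needs the two-byte sequence C3 A2.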

    opened by drzraf 8
  • Turkish stemmer has a problem with word "aile"

    Hello,

    I have an issue with the Turkish stemmer. The problem lies in the Turkish stemming algorithm itself, as described at https://snowballstem.org/algorithms/turkish/stemmer.html.

    When I use Snowball to stem the Turkish word "aile", it always cuts off the "le" and leaves only "ai", which has no meaning in Turkish. I think this happens because "le" means "with" in Turkish, so the stemmer splits "aile" into "ai" and "le".

    How do I exclude the word "aile" from stemming using Snowball? Thank you.
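
One option that needs no change to the algorithm is an exception list applied before the stemmer is called. The sketch below assumes this approach; `make_protected_stemmer` is a hypothetical helper, and `fake_turkish_stem` is a stand-in that only mimics the reported behaviour for "aile", not the real Turkish stemmer.

```python
def make_protected_stemmer(stem, protected):
    """Wrap a stem function so that listed words are returned unchanged."""
    protected = frozenset(protected)

    def stem_word(word):
        if word in protected:
            return word
        return stem(word)

    return stem_word


def fake_turkish_stem(word):
    # Hypothetical stand-in: strips a trailing "le" to mimic the report above.
    return word[:-2] if word.endswith("le") else word


stem_word = make_protected_stemmer(fake_turkish_stem, {"aile"})
print(fake_turkish_stem("aile"))  # ai
print(stem_word("aile"))          # aile
```

In real use the first argument would be the stemming call of whichever Snowball binding is in use, and the protected list would be maintained by the application.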

    opened by dwicak 1
  • Is it normal that comparatives and superlatives are not stemmed?

    >>> import Stemmer
    >>> stemmer = Stemmer.Stemmer('english')
    >>> print(stemmer.stemWord('poorer'))
    poorer
    >>> print(stemmer.stemWord('cleaner'))
    cleaner
    >>> print(stemmer.stemWord('cleanest'))
    cleanest
    
    opened by raffaem 1
  • Spelling

    This PR corrects misspellings identified by the check-spelling action.

    The misspellings have been reported at https://github.com/jsoref/snowball/commit/25df83387d7b449b530cdc1a38306cba71d9e714#commitcomment-69675396

    The action reports that the changes in this PR would make it happy: https://github.com/jsoref/snowball/commit/3ed0647bd9596b724133e8273188e38624fef328

    Note: this PR does not include the action. If you're interested in running a spell check on every PR and push, that can be offered separately.

    opened by jsoref 0
  • German stemmer possible improvements

    Hello, Snowball developers team!

    I work on developing translation software. We use Snowball algorithms in our product to find inflected forms of terms in texts. We have gathered feedback from our customers on the German stemming algorithm and developed some changes.

    1. Remove ending -ers

    Example (word - stem by Snowball demo - stem by customized algorithm):
    Förderer - ford - ford
    Förderers - forder - ford
    Förderern - ford - ford

    2. Feminine nouns

    -erinnen is replaced with -erin

    There are already some discussions on feminine endings in German (#153, #85). We have chosen to let our customers decide for themselves how a gendered word in German should be translated into a different language. Our addition to the algorithm simply provides a way to stem plural feminine nouns and singular feminine nouns in the same manner.

    Example (word - stem by Snowball demo - stem by customized algorithm):
    Politikerin - politikerin - politikerin
    Politikerinnen - politikerinn - politikerin

    3. Remove -stern

    Example (word - stem by Snowball demo - stem by customized algorithm):
    morgenstern - morgen - morgen
    morgensterne - morgenstern - morgen

    4. Remove ending -em

    This change does lead to occasional overstemming. However, the word "systems" is often used in CS and engineering terminology, so it is crucial for our customers to find words like "...system" when searching for "...systems".

    Example (word - stem by Snowball demo - stem by customized algorithm):
    system - syst - syst
    systems - system - syst

    5. -ln replaced with -l

    Example (word - stem by Snowball demo - stem by customized algorithm):
    artikel - artikel - artikel
    artikeln - artikeln - artikel

    We have implemented these changes (including updating the word lists), so if after discussion you find the changes (or some of them) useful, I can create a PR.

    Standard suffix algorithm with the changes described above:
    define standard_suffix as (
        do (
            [substring] R1 among(
                'ers'
                (   delete
                )
            )
        )
        do (
            [substring] R1 among(
                'erinnen'
                (   <- 'erin'
                )
                'em' 'ern' 'er'
                (   delete
                )
                'e' 'en' 'es'
                (   delete
                    try (['s'] 'nis' delete)
                )
                's'
                (   s_ending delete
                )
            )
        )
        do (
            [substring] R1 among(
                'stern'
                (   delete
                )
                'en' 'er' 'est' 'em'
                (   delete
                )
                'st'
                (   st_ending hop 3 delete
                )
            )
        )
        do (
            [substring] R2 among(
                'end' 'ung'
                (   delete
                    try (['ig'] not 'e' R2 delete)
                )
                'ig' 'ik' 'isch'
                (   not 'e' delete
                )
                'lich' 'heit'
                (   delete
                    try (
                        ['er' or 'en'] R1 delete
                    )
                )
                'keit'
                (   delete
                    try (
                        [substring] R2 among(
                            'lich' 'ig'
                            (   delete
                            )
                        )
                    )
                )
            )
        )
        do (
            [substring] R1 among(
                'ln'
                (   <- 'l'
                )
            )
        )
    )
    

    Thank you for your time!

    opened by OlgaGuselnikova 0
  • Test for language features

    It would be good to have tests of Snowball language features (especially those not used by any current algorithm implementation) that run for each target language.

    As David Corbett noted in #156, several backends weren't implementing integer division. I've fixed them, but we lack a regression test, and lack automated testing that new backends get this right.

    This is the test code I added at the start of stem in english.sbl locally to check these fixes worked and that other backends weren't affected:

        $p1 = 7
        $p1 /= 4
        $p1 == 1

        $(7 / 4 * 4 == 4)

        $p1 = -7
        $p1 /= -4
        $p1 == 1

        $(-7 / -4 * 4 == 4)

        $p1 = -7
        $p1 /= 4
        $p1 == -1

        $((-7) / 4 * 4 == -4)

        $p1 = 7
        $p1 /= -4
        $p1 == -1

        $(7 / -4 * 4 == -4)
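
The values in the test code above (1, 1, -1, -1) correspond to integer division that rounds toward zero. As an illustration of how a target language can differ, Python's `//` operator rounds toward negative infinity instead, so a backend would need something like the hypothetical `trunc_div` helper below. This is a sketch of the arithmetic only, not the code any Snowball backend actually generates.

```python
def trunc_div(a, b):
    """Integer division that rounds toward zero (b must be non-zero)."""
    q = abs(a) // abs(b)
    # The quotient is negative exactly when the operands' signs differ.
    return q if (a < 0) == (b < 0) else -q


for a, b in [(7, 4), (-7, -4), (-7, 4), (7, -4)]:
    # Print the operands, Python's floored result, and the truncated result.
    print(a, b, a // b, trunc_div(a, b))
```

For (-7, 4) the floored result is -2 while the truncated result is -1, which is the kind of difference a per-backend regression test would catch.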
    
    opened by ojwb 3