Overview

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

Frog - A Tagger-Lemmatizer-Morphological-Analyzer-Dependency-Parser for Dutch

Copyright 2006-2020
Ko van der Sloot, Maarten van Gompel, Antal van den Bosch, Bertjan Busser

Centre for Language and Speech Technology, Radboud University Nijmegen
Induction of Linguistic Knowledge Research Group, Tilburg University

Website: https://languagemachines.github.io/frog

Frog is an integration of memory-based natural language processing (NLP) modules developed for Dutch. All NLP modules are based on Timbl, the Tilburg memory-based learning software package. Most modules were created in the 1990s at the ILK Research Group (Tilburg University, the Netherlands) and the CLiPS Research Centre (University of Antwerp, Belgium). Over the years they have been integrated into a single text processing tool, which is currently maintained and developed by the Language Machines Research Group and the Centre for Language and Speech Technology at Radboud University Nijmegen. A dependency parser, a base phrase chunker, and a named-entity recognizer module were added more recently. Where possible, Frog makes use of multi-processor support to run subtasks in parallel.
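
As a quick illustration (a minimal sketch; the exact columns depend on which modules are enabled in your installation), Frog can be run directly from the command line. Plain text can be piped in or passed as a file with -t (mytext.txt is just a placeholder name), and the tab-separated analysis per token is written to standard output:

$ echo "Dit is een voorbeeldzin." | frog
$ frog -t mytext.txt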

Various (re)programming rounds have been made possible through funding by NWO, the Netherlands Organisation for Scientific Research, particularly under the CGN project, the IMIX programme, the Implicit Linguistics project, the CLARIN-NL programme and the CLARIAH programme.

License

Frog is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version (see the file COPYING).

Frog is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

Comments and bug reports are welcome at our issue tracker at https://github.com/LanguageMachines/frog/issues or by mailing lamasoftware (at) science.ru.nl. Updates and more information can be found at https://languagemachines.github.io/frog .

Installation

To install Frog, first check whether your distribution's package manager provides an up-to-date package. If not, Frog and its many dependencies can be installed easily as part of our software distribution LaMachine: https://proycon.github.io/LaMachine .

To build Frog from source successfully, you need the following dependencies:

The data for Frog is packaged separately and needs to be installed prior to installing Frog:

To compile and install manually from source instead, provided you have all the dependencies installed:

$ bash bootstrap.sh
$ ./configure
$ make
$ make install

and optionally:

$ make check
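
If the dependencies (among others ticcutils, libfolia, ucto, timbl, mbt and the separately packaged frogdata, which are typically built the same way) live in a non-standard location, a sketch of a typical autotools invocation looks as follows; the prefix path is only an example:

$ export PKG_CONFIG_PATH=$HOME/frog-env/lib/pkgconfig:$PKG_CONFIG_PATH
$ ./configure --prefix=$HOME/frog-env
$ make && make install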

This software has been tested on:

  • Intel platforms running several versions of Linux, including Ubuntu, Debian, Arch Linux and Fedora (both 32-bit and 64-bit)
  • Apple platforms running macOS

Contents of this distribution:

  • Sources
  • Licensing information ( COPYING )
  • Installation instructions ( INSTALL )
  • Build system based on GNU Autotools
  • Example data files ( in the demos directory )
  • Documentation ( in the docs directory and on https://frognlp.readthedocs.io )

Documentation

The Frog documentation can be found on https://frognlp.readthedocs.io

Credits

Many thanks go out to the people who made the development of the Frog components possible: Walter Daelemans, Jakub Zavrel, Ko van der Sloot, Sabine Buchholz, Sander Canisius, Gert Durieux, Peter Berck and Maarten van Gompel.

Thanks to Erik Tjong Kim Sang and Lieve Macken for stress-testing the first versions of Tadpole, the predecessor of Frog.

Comments
  • Endless loop of parsing empty sentences when frog server connection is closed

    When I send something to a frog server (frog -S ), I get back results. But when the connection is broken it ends up in an endless loop. See the attached screenshot. I'm using the latest LaMachine distribution, but I've had this issue forever and therefore never used the server parameter.

    image
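
    (For context, not part of the original report: a minimal sketch of the server setup being described. The port number is arbitrary; a plain TCP client such as nc can then connect, send text, and read back the tab-separated reply on the same connection.)

    $ frog -S 7001          # start Frog in server mode on an arbitrary port
    $ nc localhost 7001     # type a sentence; the analysis comes back over the same connection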

    opened by jsteggink 17
  • Frog (through python-frog) accumulates a huge number of temporary files

    User @stergiosmorakis ran a script processing tweets (via python-frog) that ran for a while and produced a lot of /tmp/frog* files with short input sequences. At a certain point a million files had accumulated. I should investigate when these are created (an initial investigation suggested they were only used in server mode, which is not the case here, so I probably missed something). They should then be cleaned up at an earlier stage.

    bug 
    opened by proycon 11
  • Frog breaks while processing large amount of txt data

    Frog is used to analyze 64 different txt files on 64 cores. It is initiated in LaMachine with frog.nf --inputdir chunks --outputdir chunks --inputformat text --sentenceperline --workers 64. I started the process several times; it once ran for a whole day, but another time it broke after only a few hours. The absolute runtime on the data should be around 20 days according to my calculations. Here is an excerpt of the error message.

    executor >  local (64)
    [7f/9f749c] process > frog_text2folia (48) [ 97%] 62 of 64, failed: 62
    WARN: Killing pending tasks (63)
    Error executing process > 'frog_text2folia (37)'
    
    Caused by:
      Process `frog_text2folia (37)` terminated with an error exit status (1)
    
    Command executed:
    
      set +u
            if [ ! -z "/vol/customopt/lamachine.stable" ]; then
                source /vol/customopt/lamachine.stable/bin/activate
            fi
            set -u
      
            opts=""
            if [[ "true" == "true" ]]; then
                opts="$opts -n"
            fi
            if [ ! -z "" ]; then
      frog-mopts="$opts --skip="
      fi
      
            #move input files to separate staging directory
            mkdir input
            mv *.txt input/
      
            #output will be in cwd
            mkdir output
            frog $opts --outputclass "current" --xmldir "output" --nostdout --testdir input/
            cd output
            for f in *.xml; do
                if [[ ${f%.folia.xml} == $f ]]; then
                    newf="${f%.xml}.frogged.folia.xml"
                else
                    newf="${f%.folia.xml}.frogged.folia.xml"
                fi
                mv $f ../$newf
            done
            cd ..
    
    Command exit status:
      1
    
    Command output:
      Now using node v13.3.0 (npm v6.13.4)
    
    Command error:
      frog-mbma-:	o - 0 
      frog-mbma-:	r - 0 
      frog-mbma-:	t - 0 
      frog-mbma-:	m - N  morpheme ='ma'
      frog-mbma-:	a - 0 
      frog-mbma-:	 - /  INFLECTION: de delete='a' morpheme ='t'
      frog-mbma-:	t - 0 
      frog-mbma-:	 - V  delete='jege'
      frog-mbma-:	 - 0 
      frog-mbma-:	 - /  INFLECTION: pv delete='ge'
      frog-mbma-:	 - 0 
      frog-mbma-:	z - 0 
      frog-mbma-:	o - 0 
      frog-mbma-:	c - 0  insert='ek' delete='ch'
      frog-mbma-:	h - 0 
      frog-mbma-:	t - /  INFLECTION: pv
      frog-mbma-:tag: / infl: morhemes: [sport,ma,t] description:  confidence: 0
      frog-mbma-:
      frog-mbma-:Hmm: deleting ' is impossible. (a != ').
      frog-mbma-:Reject rule: MBMA rule (qatar):
      frog-mbma-:	q - N  morpheme ='q'
      frog-mbma-:	a - /  INFLECTION: de delete='''
    executor >  local (64)
    [98/46c588] process > frog_text2folia (31) [100%] 64 of 64, failed: 64
    WARN: Killing pending tasks (63)
    Error executing process > 'frog_text2folia (37)'
    
    Caused by:
      Process `frog_text2folia (37)` terminated with an error exit status (1)
    
    Command executed:
    
      set +u
            if [ ! -z "/vol/customopt/lamachine.stable" ]; then
                source /vol/customopt/lamachine.stable/bin/activate
            fi
            set -u
      
            opts=""
            if [[ "true" == "true" ]]; then
                opts="$opts -n"
            fi
            if [ ! -z "" ]; then
      frog-mopts="$opts --skip="
      fi
      
            #move input files to separate staging directory
            mkdir input
            mv *.txt input/
      
            #output will be in cwd
            mkdir output
            frog $opts --outputclass "current" --xmldir "output" --nostdout --testdir input/
            cd output
            for f in *.xml; do
                if [[ ${f%.folia.xml} == $f ]]; then
                    newf="${f%.xml}.frogged.folia.xml"
                else
                    newf="${f%.folia.xml}.frogged.folia.xml"
                fi
                mv $f ../$newf
            done
            cd ..
    
    Command exit status:
      1
      frog-mbma-:	o - 0 
      frog-mbma-:	r - 0 
      frog-mbma-:	t - 0 
      frog-mbma-:	m - N  morpheme ='ma'
      frog-mbma-:	a - 0 
      frog-mbma-:	 - /  INFLECTION: de delete='a' morpheme ='t'
      frog-mbma-:	t - 0 
      frog-mbma-:	 - V  delete='jege'
      frog-mbma-:	 - 0 
      frog-mbma-:	 - /  INFLECTION: pv delete='ge'
      frog-mbma-:	 - 0 
      frog-mbma-:	z - 0 
      frog-mbma-:	o - 0 
      frog-mbma-:	c - 0  insert='ek' delete='ch'
      frog-mbma-:	h - 0 
      frog-mbma-:	t - /  INFLECTION: pv
      frog-mbma-:tag: / infl: morhemes: [sport,ma,t] description:  confidence: 0
      frog-mbma-:
      frog-mbma-:Hmm: deleting ' is impossible. (a != ').
      frog-mbma-:Reject rule: MBMA rule (qatar):
      frog-mbma-:	q - N  morpheme ='q'
      frog-mbma-:	a - /  INFLECTION: de delete='''
      frog-mbma-:	t - 0 
      frog-mbma-:	a - 0 
      frog-mbma-:	r - 0  INFLECTION: e
      frog-mbma-:tag: / infl: morhemes: [q] description:  confidence: 0
      frog-mbma-:
      frog-mbma-:Hmm: deleting 's is impossible. (t != ').
      frog-mbma-:Reject rule: MBMA rule (ruytse):
      frog-mbma-:	r - N  morpheme ='ruy'
      frog-mbma-:	u - 0 
      frog-mbma-:	y - 0 
      frog-mbma-:	t - /  INFLECTION: m delete=''s'
      frog-mbma-:	s - 0 
      frog-mbma-:	e - /  INFLECTION: E/P
      frog-mbma-:tag: / infl: morhemes: [ruy] description:  confidence: 0
      frog-mbma-:
      frog-mbma-:Hmm: deleting 's is impossible. (t != ').
      frog-mbma-:Reject rule: MBMA rule (duyts):
      frog-mbma-:	d - N  morpheme ='d'
      frog-mbma-:	u - N  morpheme ='uy'
      frog-mbma-:	y - 0 
      frog-mbma-:	t - /  INFLECTION: m delete=''s'
      frog-mbma-:	s - 0  INFLECTION: e
      frog-mbma-:tag: / infl: morhemes: [d,uy] description:  confidence: 0
      frog-mbma-:
      frog-:problem frogging: nlcow14ax_all_clean_martijn_36.txt
      frog-:std::bad_alloc
      frog-:Wed Jan 15 17:16:55 2020 Frog finished
      mv: cannot stat '*.xml': No such file or directory
    
    Work dir:
      /vol/tensusers2/hmueller/LAMACHINE/wd3/work/85/5e0fda647124c40fd8fd4d2846df61
    
    Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`
    
    opened by hannomuller 11
  • Frog gets progressively slower when running for hours, days

    When running frog for a long time, performance decreases significantly.

    For instance, I'm processing a 2.8GB file. These are some pv outputs at several moments:

    frog: 15.8MiB  0:05:10 [39.2KiB/s] [>                                 ]  0% ETA  4:20:44:08
    frog:  545MiB  3:29:02 [62.2KiB/s] [>                                 ]  2% ETA  5:13:26:11     
    frog:  858MiB  5:23:19 [30.6KiB/s] [=>                                ]  4% ETA  5:09:10:47
    frog: 3.05GiB 73:28:54 [    0 B/s] [====>                             ] 14% ETA 17:22:57:01
    

    At first, the expected time is 4 days and 20 hours. After having run for 3 days, the expected time has gone up to almost 18 days.

    This is the script used:

    #!/usr/bin/env bash
    FILE=$1
    FILE_SIZE=`wc -c < "$FILE" | cut -f1`
    let "EXPECTED_FROG_SIZE = 8 * FILE_SIZE"
    BODY="${FILE%.*}"
    
    echo "Processing: $FILE"
    echo "Size: $FILE_SIZE"
    echo "Writing output to: ${BODY}_frog.txt"
    
    pv -s ${FILE_SIZE} -cN in ${FILE} | frog --skip=acmnp 2> ${BODY}_frog.log | pv -s ${EXPECTED_FROG_SIZE} -cN frog > ${BODY}_frog.txt
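
    (A possible workaround, not part of the original thread: split the input into fixed-size chunks and start a fresh frog process per chunk, so any slowdown is bounded by the chunk size. A rough sketch, assuming one sentence per line and the same --skip options as above; chunk size and file names are arbitrary.)

    #!/usr/bin/env bash
    # Process a large corpus chunk by chunk, restarting frog for each chunk.
    split -l 100000 corpus.txt chunk_
    for f in chunk_*; do
        frog --skip=acmnp < "$f" > "${f}_frog.txt" 2> "${f}_frog.log"
    done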
    
    help wanted 
    opened by gmjonker 11
  • Frog can't deal with tokens that contain spaces

    In historical Dutch, certain words may be written apart although they can be considered one token: "vol daen" (voldaan), represented as a single <w> in FoLiA. Would the various Frog modules (mblem, mbpos etc.) be able to deal with spaces in tokens?

    bug 
    opened by proycon 11
  • Error processing pre-tokenised FoLiA with untokenised parts

    I'm running into an error when processing pre-tokenised FoLiA:

    mlp09$ frog --language=nld --skip=tmcpa /scratch/proycon/HuygensING-brieven-correspondenten-1900-1-1_02ba32f1-34da-4c4b-b839-936e03ae1642.folia.xml -X

    Word(class='NUMBER',generate_id='HuygensING-brieven-correspondenten-1900-1-1_02ba32f1-34da-4c4b-b839-936e03ae1642.text.1.body.1.p.6.s.1',set='passthru') creation failed: Set 'passthru' is used but has no declaration for token-annotation

    It seems the error occurs on a paragraph which contains text but no sentences/words (so it is untokenised, unlike the others); when this paragraph is removed, everything processes fine. It might be indicative of a more structural problem though, as the problem also occurs when I do not skip the tokeniser.

    bug 
    opened by proycon 8
  • question on accuracy

    Hello, this is not an issue, just a question about the accuracy of the different NLP tasks. I'm interested in comparing different types of NLP annotators and their accuracy. How well does Frog do regarding accuracy on tokenisation, part-of-speech tagging, lemmatisation, morphological feature annotation and dependency parsing? Are there numbers available which are comparable to the CoNLL17 shared task (for example obtained by training Frog on Dutch data from Universal Dependencies and then evaluating the output with the CoNLL17 evaluation script available at https://github.com/ufal/conll2017/blob/master/evaluation_script/conll17_ud_eval.py)?

    opened by jwijffels 8
  • MWU output when no Parser is selected

    @Irishx suggested:

    The default setting of Frog is to place an MWU on one line as one token, even if you use the skip option to exclude parsing, while this is actually only needed for the parser. Perhaps we should change this default setting?

    @Irishx do you intend to disable MWU detection too, when the Parser is skipped?

    This is easy to implement, but might change outcomes of older scripts. I am not sure if that would be a problem.

    question 
    opened by kosloot 7
  • Frog can't find ucto's configuration file for non-standard rules?

    There's something wrong with the installation of the historical models still. Frog can't seem to find the tokeniser settings:

    $ frog --language dum    
    frog 0.19 (c) CLTS, ILK 1998 - 2019
    CLST  - Centre for Language and Speech Technology,Radboud University
    ILK   - Induction of Linguistic Knowledge Research Group,Tilburg University
    based on [ucto 0.19, libfolia 2.4, timbl 6.4.14, ticcutils 0.23, mbt 3.5]
    removing old debug files using: 'find frog.*.debug -mtime +1 -exec rm {} \;'
    frog-:config read from: /data2/dev/share/frog/dum/frog.cfg
    frog-:Missing [[mbma]] section in config file.
    frog-:Disabled the Morhological analyzer.
    frog-:Missing [[IOB]] section in config file.
    frog-:Disabled the IOB Chunker.
    frog-:Missing [[NER]] section in config file.
    frog-:Disabled the NER.
    frog-:Missing [[mwu]] section in config file.
    frog-:Disabled the Multi Word Unit.
    frog-:Also disabled the parser.
    frog-mblem-frog-mblem-:Initiating lemmatizer...
    ucto: textcat configured from: /data2/dev/share/ucto/textcat.cfg
    frog-tok-:Language List =[dum]
    ucto: No useful settingsfile(s) could be found.
    frog-tagger-tagger-:reading subsets from /data2/dev/share/frog/dum//crmsub.cgn
    frog-tagger-tagger-:reading constraints from /data2/dev/share/frog/dum//crmconstraints.cgn
    frog-:Initialization failed for: [tokenizer] 
    frog-:fatal error: Frog init failed
    
    $ cat /data2/dev/share/frog/dum/frog.cfg | grep tok
    [[tokenizer]]
    rulesFile=tokconfig-nld-historical
    
    $ ls /data2/dev/share/ucto/*hist*
    /data2/dev/share/ucto/tokconfig-nld-historical
    
    bug ready 
    opened by proycon 7
  • Redesign Frog

    @antalvdb and @proycon

    I think it is time for a great overhaul of Frog, to

    • reduce complexity
    • increase maintainability
    • be more flexible
    • speed things up

    The main aspect to consider: at the moment Frog uses FoLiA as its internal data structure. That seemed a good plan once, but with growing data sets this has become a memory hog. It also has some nasty multithreading issues (like different FoLiA files after every run, all valid), and it makes processing line-by-line almost impossible, as the whole input file is stuffed into one FoLiA document. I think that processing smaller chunks would speed up the process and deliver output at a more constant pace (not just one burst of FoLiA at the end). This would mean that we cannot directly use the Ucto 'tokenize-to-FoLiA' facility anymore, but that is a small burden, IMHO. There also remains the problem of FoLiA output at the end of Frog: on large input that would still mean a very large file. But it IS possible to create a 'main' file with a bunch of sub-files, using the FoLiA 'external' mechanism. In the current situation this is almost impossible.

    I am also working on a FoLiA Builder class, which creates a FoLiA file on disk incrementally, without first creating it completely in memory. This reduces the footprint of Frog but may still produce insanely large files (>10 GB or so).

    Caveat: this solution does not work for FoLiA input into Frog. If you read a very large file, it stays large :) What MIGHT be possible is to make a FoLiA Reader class, which reads chunks from a FoLiA file, has them processed by Frog and assembles the results back into another FoLiA file (using the Builder, for instance).

    I think other improvements might also be welcome, but this is the most intrusive one, I guess. In the end, the Frog results will not be changed, just produced faster.

    I think I would need a few months to accomplish this. It would be a nice task, but a bit too large to do 'on the side'. Maybe a CLARIAH+ or a CLARIN-NL project?

    Comments and additions welcome!

    low priority 
    opened by kosloot 7
  • "terminate called without an active exception"

    Forwarding bug report mailed by Alex Bransen:

    I just can't get frog to work with folia as input, but I want to check whether that is due to my installation or whether you have it too. If you have a moment, could you run the attached folia file through frog to see whether you also get an error? I get:

    "terminate called without an active exception" (a really useful error, too)

    cmd is: frog -x BAAC_A-11-0119.xml -X frogged-BAAC.xml

    Oddly enough, when I change the -x flag to -t (so it treats the xml as plain text), it does work..

    According to foliavalidator it is valid XML, by the way.

    Input file is https://download.anaproy.nl/BAAC_A-11-0119.xml , I can reproduce the bug locally on the latest development version.

    bug 
    opened by proycon 7
  • Segfault on FoLiA in to FoLiA out (speech data with events and utterances)

    Frog (libfolia) segfaults on the attached FoLiA input upon FoLiA serialisation.

    <?xml version="1.0" encoding="utf-8"?>
    <FoLiA xmlns="http://ilk.uvt.nl/folia" version="2.5" xml:id="example">
      <metadata>
          <annotations>
              <text-annotation>
                             <annotator processor="p1" />
              </text-annotation>
              <utterance-annotation>
                             <annotator processor="p1" />
              </utterance-annotation>
              <event-annotation set="speech">
                             <annotator processor="p1" />
              </event-annotation>
          </annotations>
          <provenance>
             <processor xml:id="p1" name="proycon" type="manual" />
          </provenance>
      </metadata>
      <text xml:id="example.speech">
          <event xml:id="turn.1" class="turn" src="piet.wav" begintime="00:00:00.720" endtime="00:00:53.230">
            <utt xml:id="example.utt.1" speaker="Piet">
                <t>Het is vandaag 1 januari 2019. Mijn naam is Piet voor het project Diplomatieke Getuigenissen heb ik vandaag een gesprek met Piet. Ook met ons in de kamer is Piet die voor ons het geluid en de video verzorgt. Meneer Piet misschien dat we gewoon kunnen beginnen met dat u iets over uw opleiding vertelt en hoe u bij Buitenlandse Zaken bent komen te werken?</t>
            </utt>
            <utt xml:id="example.utt.2" speaker="Piet">
                <t>Ja ik ben geboren in 1936. Volgens de boeken het heilige jaar voor de Chinezen. 1936. In 2036 is er weer zo'n heilig jaar. Ik ben ... </t>
            </utt>
          </event>
      </text>
    </FoLiA>
    

    Call: frog --skip=pac -x anon_1.folia.xml -X anon_1.out.folia.xml

    All actual processing goes fine; it is the FoLiA serialisation at the end that fails.

    gdb backtrace:

    Thread 1 "frog" received signal SIGSEGV, Segmentation fault.
    0x0000000000000000 in ?? ()
    (gdb) bt
    #0  0x0000000000000000 in ?? ()
    #1  0x00007fa4eae08999 in folia::AbstractElement::append (this=<optimized out>, this@entry=0x7fa4e700a580, child=<optimized out>, child@entry=0x7fa4e659a7f0) at folia_impl.cxx:3129
    #2  0x00007fa4eae98ee2 in folia::AbstractStructureElement::append (this=0x7fa4e700a580, child=0x7fa4e659a7f0) at folia_subclasses.cxx:784
    #3  0x00007fa4eae306fc in folia::AbstractElement::AbstractElement (this=this@entry=0x7fa4e659a7f0, __vtt_parm=__vtt_parm@entry=0x7fa4eb5abfc0 <VTT for folia::Paragraph+16>, p=..., el=el@entry=0x7fa4e700a580, __in_chrg=<optimized out>) at folia_impl.cxx:293
    #4  0x00007fa4eb4cd949 in folia::AbstractStructureElement::AbstractStructureElement (p=0x7fa4e700a580, props=..., __vtt_parm=0x7fa4eb5abfb8 <VTT for folia::Paragraph+8>, this=0x7fa4e659a7f0, __in_chrg=<optimized out>)
        at /usr/local/include/libfolia/folia_subclasses.h:59
    #5  folia::Paragraph::Paragraph (p=0x7fa4e700a580, a=..., this=0x7fa4e659a7f0, __in_chrg=<optimized out>, __vtt_parm=<optimized out>) at /usr/local/include/libfolia/folia_subclasses.h:626
    #6  folia::FoliaElement::add_child<folia::Paragraph> (args=..., this=0x7fa4e700a580) at /usr/local/include/libfolia/folia_impl.h:125
    #7  FrogAPI::handle_one_text_parent (this=0x7ffc1bc9e600, os=..., e=0x7fa4e700a580, sentence_done=<optimized out>) at FrogAPI.cxx:2567
    #8  0x00007fa4eb4ce462 in FrogAPI::run_folia_engine (this=0x7ffc1bc9e600, infilename=..., output_stream=...) at FrogAPI.cxx:2661
    #9  0x00007fa4eb4d0bf1 in FrogAPI::FrogFile (this=0x7ffc1bc9e600, infilename=...) at FrogAPI.cxx:2743
    #10 0x00007fa4eb4d3cbd in FrogAPI::run_on_files (this=0x7ffc1bc9e600) at FrogAPI.cxx:1175
    #11 0x000055c8b0feafd2 in main (argc=<optimized out>, argv=<optimized out>) at Frog.cxx:229
    frog_segfault (END)
    
    bug ready 
    opened by proycon 6
  • Practical questions about large datasets

    I have a corpus of 25 billion words that I want to 'frog'; for that I have a 32-core/128 GB RAM machine. My plan is to run 16 separate instances of frog, with a maximum of 500 words per sentence. It looks like I can then 'frog' about 10k words per second.

    But that naturally raises some questions.

    1: How likely is it that 128 GB turns out to be insufficient somewhere halfway through the process? Is memory usage reasonably constant? Is it wise to split my data into small chunks?

    2: At 10k words/s my whole corpus would require roughly a month of computation, which is OK, but also long enough that I want to invest some time in improving performance (with the idea that this would be useful to others as well). Is there any low-hanging fruit performance-wise? My first instinct is that there is a lot of 'communication' and that the internal representation of tokens and their metadata is too complex. But that is not easy to restructure.

    3: FoLiA is very flexible but simply too much data; even compressed, the corpus would be many terabytes, which is just not practical if it does not fit on a normal SSD. I have drafted a file format that uses a fixed 8 bytes per token. You have to give something up for that (only 2⁸ PoS tags, while theoretically ~320 are possible I believe, and only 2²⁵ distinct token types (excl. MWUs)), but then you can run complex queries quite quickly (I am aiming at e.g. 100 million tokens/s). And with basic compression of the most frequent words I can keep my whole corpus in memory. Has this perhaps been thought about before? I would not want to half-heartedly reimplement something.

    4: Where could I read about how frog was trained and on which data? I would like to estimate what accuracy I can expect on my dataset, and where possible retrain something to match it better.
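
    (Not part of the original question: one common way to keep a fixed number of frog instances busy over pre-split chunks is GNU parallel — a sketch, assuming the chunks are plain-text files in a chunks/ directory.)

    find chunks/ -name '*.txt' | parallel -j 16 'frog < {} > {}.frog.tsv 2> {}.frog.log'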

    question 
    opened by boblucas 6
  • Simplify option and configuration handling

    In frog there is quite a messy way of handling options and configuration details. We have FrogOptions and a TiCC::Configuration part to store information.

    This could be simplified a lot by making the Configuration internal to the configuration file parsing and storing all necessary information in the Options.

    opened by kosloot 0
  • Keep the deep_morph structure intact when resolving MWU's

    When resolving MWUs (in frog_data::resolve_mwus()) the deep_morphs structure is lost; only the deep_morph_string member is resolved. This is disadvantageous, as it makes it impossible to retrieve the separate deep_morphs and inflections from MWUs (without clumsy split() actions). This is especially a problem when creating JSON output. TODO: rework MWU resolving, keeping the parts available.

    enhancement 
    opened by kosloot 0
  • Add JSON output as an alternative to 'tabbed' format

    The 'tabbed' format is quite rigid and sometimes difficult to read (especially when some modules are skipped). It might be handy to create JSON output as an alternative. This could be really useful for the SERVER mode.

    NOTE: consider JSON input for the server too, then.

    testing 
    opened by kosloot 2
  • Use MBMA to split compounds

    In --deep_morph mode, MBMA can detect all kinds of compounds and even outputs them. It would be very useful if we could add some code that gives the logical splitting of the detected compounds.

    E.g. for 'appeltaart' Frog now gives [[appel]noun[taart]noun]noun/singular NN-compound; it seems doable to also give 'appel-taart'.

    In practice this can become very complicated: 'appelgebak' gives: [[appel]noun[[ge][bak]noun]noun/singular]noun NN-compound

    You would like to get 'appel-gebak', NOT 'appel-ge-bak' or 'appelge-bak'. For longer compounds it gets even more difficult.

    verkeersagent [[verkeer]noun[s][[ageer]verb[ent]]noun]noun/singular NN-compound

    But still it seems worth investigating.
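
    (An illustration, not a proposed implementation: naively concatenating the bracketed morphemes from the MBMA analysis immediately produces the unwanted split for 'appelgebak'.)

    $ echo '[[appel]noun[[ge][bak]noun]noun/singular]noun' | grep -o '\[[a-z]*\]' | tr -d '[]' | paste -sd-
    appel-ge-bak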

    enhancement MBMA 
    opened by kosloot 0
Releases (v0.26)
  • v0.26(Jan 2, 2023)

    [Ko van der Sloot]

    • fix for https://github.com/LanguageMachines/frog/issues/96
    • code improvements, readability and fixing CppCheck warnings
    • needs recent ticcutils (>=0.30)
    • needs newest Timbl (6.8) for more Unicode awareness
    • updated GitHub action

    [Maarten van Gompel]

    • added MAINTAINERS file
    • updated codemeta.json
    Source code(tar.gz)
    Source code(zip)
    frog-0.26.tar.gz(614.83 KB)
  • v0.25(Jul 22, 2022)

    [Maarten van Gompel]

    • updated metadata (codemeta.json) following new (proposed) CLARIAH requirements (CLARIAH/clariah-plus#38)
    • added builds-deps.sh for automatically building and installing dependencies
    • added Dockerfile and instructions
    • added support for user-based configuration dirs ($XDG_CONFIG_HOME/frog), takes precedence over global data dirs
    • use frogdata 0.21, ucto 0.25

    [Ko vd Sloot]

    • updated Doxygen config file
    Source code(tar.gz)
    Source code(zip)
    frog-0.25.tar.gz(627.59 KB)
  • v0.24(Dec 15, 2021)

    [Ko vd Sloot]

    • start using the newest UTF8-aware Timbl, Mbt and Ucto
    • use NFC-normalized UnicodeString more generally internally
    • added a fix in the MBMA coding, to get more reproducible results on different OS/compiler combinations
    • lots of small refactoring
    • bumped the library version, because of some API changes

    [Maarten van Gompel]

    • merged a patch suggested by Helmut Grohne [email protected]
      • configure.ac: Bug#993123: frog FTCBFS: hard codes the build architecture pkg-config Source: frog Version: 0.20-2 Tags: patch upstream User: [email protected] Usertags: ftcbfs frog fails to cross build from source, because configure.ac hard codes the build architecture pkg-config in one place (after correctly detecting the host architecture one). Simply using the correct substitution variable makes frog cross buildable. Please consider applying the attached patch. Helmut Signed-off-by: Maarten van Gompel [email protected]
    Source code(tar.gz)
    Source code(zip)
    frog-0.24.tar.gz(602.29 KB)
  • v0.23(Jul 12, 2021)

  • v0.22(Nov 17, 2020)

  • v0.21(Jul 22, 2020)

  • v0.20.1(Apr 15, 2020)

  • v0.20(Apr 15, 2020)

    [Ko vd Sloot]

    • added Doxygen to the build
    • added a lot of comment in Doxygen format
    • adapted to the newest ticcutils version
    • adapted to latest libfolia
    • adapted to latest ucto
    • lots of code refactorings
    • implemented --JSONin option (server only)
    • implemented --JSONout option
    • added a --allow-word-correction option which allows ucto to correct FoLiA Word nodes

    [Iris Hendrix] Documentation updates

    Source code(tar.gz)
    Source code(zip)
    frog-0.20.tar.gz(567.24 KB)
  • v0.19.1(Nov 15, 2019)

  • v0.19(Oct 21, 2019)

    • added code to use a locally installed Alpino parser

    • added code to use a remote Alpino Server

    • added code to use (remote) timblservers and mbtservers for all modules using JSON calls. Still experimental.

    • several code refactorings and small fixes:

      • memory leaks
      • using NER files in non-standard locations
      • bug fixes for some corner cases.
    • frog.*.debug files are cleaned up after 1 day.

    Source code(tar.gz)
    Source code(zip)
    frog-0.19.tar.gz(547.43 KB)
  • v0.18.6(Sep 17, 2019)

  • v0.18.5(Sep 16, 2019)

  • v0.18.3(Jul 22, 2019)

  • v0.18.2(Jul 15, 2019)

  • v0.18.1(Jun 19, 2019)

  • v0.18(Jun 19, 2019)

    Bug fixes and enhancements:

    • provenance uses new 'generate_id' option in libfolia:processor
    • solved problems when frogging partly tokenized FoLiA
    • solved problems when processing with --skip=t
    • small improvement in compound detection (still more to do...)
    Source code(tar.gz)
    Source code(zip)
    frog-0.18.tar.gz(532.83 KB)
  • v0.17(May 29, 2019)

  • v0.16(May 15, 2019)

    This is the last release using pre-FoLiA 2.0. It includes a total rework of the Frog internals, aiming at better maintainability and hoping for a speedup and a smaller memory footprint. This work will continue in the upcoming release for FoLiA 2.0.

    Major Changes:

    • total rework. Not using a FoLiA document as the internal datastructure anymore but a FrogData structure.
    • use folia::engine for all FoLiA processing
    • -Q option is NOT supported anymore. It was unreliable anyway
    • builds on the newest ucto versions only
    • fix for https://github.com/proycon/LaMachine/issues//135 https://github.com/LanguageMachines/frog/issues/66
    • handles some corner cases in FoLiA better
    • lots of code cleanup
    • numerous small fixes ( e.g. in NER and MBMA results)
    • improved working of --languages option
    • avoid invalid FoLiA: https://github.com/LanguageMachines/frog/issues/60
    • fixed memory leaks
    • better handling of weird FoLiA

    [Maarten van Gompel]

    • added skeleton for new Frog documentation
    Source code(tar.gz)
    Source code(zip)
    frog-0.16.tar.gz(528.23 KB)
  • v0.15(May 16, 2018)

    [Ko vd Sloot]

    • ucto_tokenizer_mod: removed call of (useless) ucto:setSentenceDetection(true)
    • fix to close the server when a socket fails
    • when frogging a file, and the docID is NOT specified, use the filename as the docID (filtering out non-NCName characters)
    • fix building the documentation from TeX files
    • a lot of small code improvements

    [Maarten van Gompel]

    • added codemeta.json
    • Fixed python-frog example in documentation (closes #48)
    Source code(tar.gz)
    Source code(zip)
    frog-0.15.tar.gz(513.64 KB)
  • v0.14(Feb 19, 2018)

    • use TiCC::UniFilter now
    • use TiCC::diacritics_filter now
    • configuration modernized. OSX build supported too
    • XML (FoLiA) files are autodetected
    • some more logging and time stamps added
    • added code to NER module to override original tags (e.g. from gazetteer)
    Source code(tar.gz)
    Source code(zip)
    frog-0.14.tar.gz(508.17 KB)
  • v0.13.10(Jan 29, 2018)

  • v0.13.9(Nov 7, 2017)

  • v0.13.8(Oct 26, 2017)

    • added -t / --textredundancy option, which is passed to ucto
    • set textclass attributes on entities (folia 1.5 feature)
    • better textclass handling in general
    • multiple types of entities (setnames) are stored in different layers
    • some small provisions for 'multi word' words added; mblem may use them, other modules just ignore them (treating a multiword as multiple words)
    • added --inputclass and --outputclass options (preferred over textclass)
    • added a --retry option, to redo complete directories, skipping what is done.
    • added a --nostdout option to suppress the tabbed output to stdout
    • refactoring and small fixes
    Source code(tar.gz)
    Source code(zip)
    frog-0.13.8.tar.gz(506.54 KB)
  • v0.13.7(Jan 23, 2017)

  • v0.13.6(Jan 5, 2017)

    • rework done on compounding in MBMA. (still work in progress)
    • lots of improvement in MBMA rule handling. (but still work in progress)
      • support for 'glue' rules added.
      • support for 'hidden' morphemes added.
      • proper CELEX tags are outputted now in the XML
      • some structure labels have better names now
    • removed exit() calls from library modules (issue #17)
    • added a languages option, which is handed over to ucto too.
      • detect multiple languages
      • handle a selected language and ignore the rest
    Source code(tar.gz)
    Source code(zip)
    frog-0.13.6.tar.gz(502.13 KB)
  • v0.13.5(Sep 13, 2016)

  • v0.13.4(Jul 11, 2016)

    • added long options --help and --version
    • interactive use is limited to TTYs only, so pipes from stdin work
    • added a --language='name' option; it tries to read the configuration from a subdirectory named 'name' in the configdir. The default is 'nl'
    • tokenizer timing is fixed at last
    • be robust against a missing clex tag
    • better warning when OpenMP is not present
    • adaptation in mbma
    • added 2 convenience functions to FrogAPI: get_full_morph_analysis() and get_compound_analysis()
    • CompoundType is now in its own namespace
    • some code refactoring, as usual
    Source code(tar.gz)
    Source code(zip)
    frog-0.13.4.tar.gz(491.40 KB)
  • v0.13.3(Mar 10, 2016)

Owner
Language Machines
NLP Research group at Centre for Language Studies, Radboud University Nijmegen