A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.

Overview

libpostal: international street address NLP


libpostal is a C library for parsing/normalizing street addresses around the world using statistical NLP and open data. The goal of this project is to understand location-based strings in every language, everywhere. For a more comprehensive overview of the research behind libpostal, be sure to check out the (lengthy) introductory blog posts:

🇧🇷 🇫🇮 🇳🇬 🇯🇵 🇽🇰 🇧🇩 🇵🇱 🇻🇳 🇧🇪 🇲🇦 🇺🇦 🇯🇲 🇷🇺 🇮🇳 🇱🇻 🇧🇴 🇩🇪 🇸🇳 🇦🇲 🇰🇷 🇳🇴 🇲🇽 🇨🇿 🇹🇷 🇪🇸 🇸🇸 🇪🇪 🇧🇭 🇳🇱 🇨🇳 🇵🇹 🇵🇷 🇬🇧 🇵🇸

Addresses and the locations they represent are essential for any application dealing with maps (place search, transportation, on-demand/delivery services, check-ins, reviews). Yet even the simplest addresses are packed with local conventions, abbreviations and context, making them difficult to index/query effectively with traditional full-text search engines. This library helps convert the free-form addresses that humans use into clean normalized forms suitable for machine comparison and full-text indexing. Though libpostal is not itself a full geocoder, it can be used as a preprocessing step to make any geocoding application smarter, simpler, and more consistent internationally.

🇷🇴 🇬🇭 🇦🇺 🇲🇾 🇭🇷 🇭🇹 🇺🇸 🇿🇦 🇷🇸 🇨🇱 🇮🇹 🇰🇪 🇨🇭 🇨🇺 🇸🇰 🇦🇴 🇩🇰 🇹🇿 🇦🇱 🇨🇴 🇮🇱 🇬🇹 🇫🇷 🇵🇭 🇦🇹 🇱🇨 🇮🇸 🇮🇩 🇦🇪 🇸🇰 🇹🇳 🇰🇭 🇦🇷 🇭🇰

The core library is written in pure C. Language bindings for Python, Ruby, Go, Java, PHP, and NodeJS are officially supported and it's easy to write bindings in other languages.

Sponsors

If your company is using libpostal, consider asking your organization to sponsor the project. Interpreting what humans mean when they refer to locations is far from a solved problem, and sponsorships help us pursue new frontiers in geospatial NLP. As a sponsor, your company logo will appear prominently on the Github repo page along with a link to your site. Sponsorship info

Backers

Individual users can also help support open geo NLP research by making a monthly donation:

Installation (Mac/Linux)

Before you install, make sure you have the following prerequisites:

On Ubuntu/Debian

sudo apt-get install curl autoconf automake libtool pkg-config

On CentOS/RHEL

sudo yum install curl autoconf automake libtool pkgconfig

On Mac OSX

brew install curl autoconf automake libtool pkg-config

Then to install the C library:

git clone https://github.com/openvenues/libpostal
cd libpostal
./bootstrap.sh
./configure --datadir=[...some dir with a few GB of space...]
make -j4
sudo make install

# On Linux it's probably a good idea to run
sudo ldconfig

libpostal has support for pkg-config, so you can use pkg-config to print the flags needed to link your program against it:

pkg-config --cflags libpostal         # print compiler flags
pkg-config --libs libpostal           # print linker flags
pkg-config --cflags --libs libpostal  # print both

For example, if you write a program called app.c, you can compile it like this:

gcc app.c `pkg-config --cflags --libs libpostal`

Installation (Windows)

MSys2/MinGW

For Windows the build procedure currently requires MSys2 and MinGW. This can be downloaded from http://msys2.org. Please follow the instructions on the MSys2 website for installation.

Please ensure MSys2 is up to date by running:

pacman -Syu

Install the following prerequisites:

pacman -S autoconf automake curl git make libtool gcc mingw-w64-x86_64-gcc

Then to build the C library:

git clone https://github.com/openvenues/libpostal
cd libpostal
cp -rf windows/* ./
./bootstrap.sh
./configure --datadir=[...some dir with a few GB of space...]
make -j4
make install

Note: when setting the datadir, the C: drive is entered as /c. The libpostal build script automatically appends libpostal to the end of the path, so /c becomes C:\libpostal\ on Windows.

The compiled .dll will be in the src/.libs/ directory and should be called libpostal-1.dll.

If you require a .lib import library to link libpostal into your application, you can generate one using the Visual Studio lib.exe tool and the libpostal.def definition file:

lib.exe /def:libpostal.def /out:libpostal.lib /machine:x64

Examples of parsing

libpostal's international address parser uses machine learning (Conditional Random Fields) and is trained on over 1 billion addresses in every inhabited country on Earth. We use OpenStreetMap and OpenAddresses as sources of structured addresses, and the OpenCage address format templates at: https://github.com/OpenCageData/address-formatting to construct the training data, supplementing with containing polygons, and generating sub-building components like apartment/floor numbers and PO boxes. We also add abbreviations, drop out components at random, etc. to make the parser as robust as possible to messy real-world input.

These example parse results are taken from the interactive address_parser program that builds with libpostal when you run make. Note that the parser can handle commas vs. no commas as well as various casings and permutations of components (if the input is e.g. just city or just city/postcode).

parser

The parser achieves very high accuracy on held-out data, currently 99.45% correct full parses (a parse counts as correct only if every token in the address is labeled correctly).

Usage (parser)

Here's an example of the parser API using the Python bindings:

from postal.parser import parse_address
parse_address('The Book Club 100-106 Leonard St Shoreditch London EC2A 4RH, United Kingdom')

And an example with the C API:

#include <stdio.h>
#include <stdlib.h>
#include <libpostal/libpostal.h>

int main(int argc, char **argv) {
    // Setup (only called once at the beginning of your program)
    if (!libpostal_setup() || !libpostal_setup_parser()) {
        exit(EXIT_FAILURE);
    }

    libpostal_address_parser_options_t options = libpostal_get_address_parser_default_options();
    libpostal_address_parser_response_t *parsed = libpostal_parse_address("781 Franklin Ave Crown Heights Brooklyn NYC NY 11216 USA", options);

    for (size_t i = 0; i < parsed->num_components; i++) {
        printf("%s: %s\n", parsed->labels[i], parsed->components[i]);
    }

    // Free parse result
    libpostal_address_parser_response_destroy(parsed);

    // Teardown (only called once at the end of your program)
    libpostal_teardown();
    libpostal_teardown_parser();
}

Parser labels

The address parser can technically use any string labels that are defined in the training data, but these are the ones currently defined, based on the fields defined in OpenCage's address-formatting library, as well as a few added by libpostal to handle specific patterns:

  • house: venue name e.g. "Brooklyn Academy of Music", and building names e.g. "Empire State Building"
  • category: for category queries like "restaurants", etc.
  • near: phrases like "in", "near", etc. used after a category phrase to help with parsing queries like "restaurants in Brooklyn"
  • house_number: usually refers to the external (street-facing) building number. In some countries this may be a compound, hyphenated number which also includes an apartment number, or a block number (a la Japan), but libpostal will just call it the house_number for simplicity.
  • road: street name(s)
  • unit: an apartment, unit, office, lot, or other secondary unit designator
  • level: expressions indicating a floor number e.g. "3rd Floor", "Ground Floor", etc.
  • staircase: numbered/lettered staircase
  • entrance: numbered/lettered entrance
  • po_box: post office box: typically found in non-physical (mail-only) addresses
  • postcode: postal codes used for mail sorting
  • suburb: usually an unofficial neighborhood name like "Harlem", "South Bronx", or "Crown Heights"
  • city_district: these are usually boroughs or districts within a city that serve some official purpose e.g. "Brooklyn" or "Hackney" or "Bratislava IV"
  • city: any human settlement including cities, towns, villages, hamlets, localities, etc.
  • island: named islands e.g. "Maui"
  • state_district: usually a second-level administrative division or county.
  • state: a first-level administrative division. Scotland, Northern Ireland, Wales, and England in the UK are mapped to "state" as well (convention used in OSM, GeoPlanet, etc.)
  • country_region: informal subdivision of a country without any political status
  • country: sovereign nations and their dependent territories, anything with an ISO-3166 code.
  • world_region: currently only used for appending “West Indies” after the country name, a pattern frequently used in the English-speaking Caribbean e.g. “Jamaica, West Indies”
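
In the Python bindings, parse_address returns the parse as (value, label) pairs using the labels above. As a sketch, here is a hypothetical helper that groups those pairs by label, applied to a plausible (not guaranteed) parse of the Shoreditch example from earlier; both the helper name and the sample parse are illustrative, not libpostal output:

```python
def components_by_label(parsed):
    """Group parsed (value, label) pairs into a dict, collecting repeated labels in lists."""
    out = {}
    for value, label in parsed:
        out.setdefault(label, []).append(value)
    return out

# A plausible parse of "The Book Club 100-106 Leonard St Shoreditch London EC2A 4RH, United Kingdom"
parsed = [
    ("the book club", "house"),
    ("100-106", "house_number"),
    ("leonard st", "road"),
    ("shoreditch", "suburb"),
    ("london", "city"),
    ("ec2a 4rh", "postcode"),
    ("united kingdom", "country"),
]

components = components_by_label(parsed)
```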

Examples of normalization

The expand_address API converts messy real-world addresses into normalized equivalents suitable for search indexing, hashing, etc.

Here's an interactive example using the Python binding:

expand

libpostal contains an OSM-trained language classifier to detect which language(s) are used in a given address so it can apply the appropriate normalizations. The only input needed is the raw address string. Here's a short list of some less straightforward normalizations in various languages.

Input → Output (may be multiple in libpostal)

  • One-hundred twenty E 96th St → 120 east 96th street
  • C/ Ocho, P.I. 4 → calle 8 polígono industrial 4
  • V XX Settembre, 20 → via 20 settembre 20
  • Quatre vingt douze R. de l'Église → 92 rue de l eglise
  • ул Каретный Ряд, д 4, строение 7 → улица каретныи ряд дом 4 строение 7
  • ул Каретный Ряд, д 4, строение 7 → ulitsa karetnyy ryad dom 4 stroyeniye 7
  • Marktstraße 14 → markt strasse 14

libpostal currently supports these types of normalizations in 60+ languages, and you can add more (without having to write any C).

For further reading and some bizarre address edge-cases, see: Falsehoods Programmers Believe About Addresses.

Usage (normalization)

Here's an example using the Python bindings for succinctness (most of the higher-level language bindings are similar):

from postal.expand import expand_address
expansions = expand_address('Quatre-vingt-douze Ave des Champs-Élysées')

assert '92 avenue des champs-elysees' in set(expansions)

The C API equivalent is a few more lines, but still fairly simple:

#include <stdio.h>
#include <stdlib.h>
#include <libpostal/libpostal.h>

int main(int argc, char **argv) {
    // Setup (only called once at the beginning of your program)
    if (!libpostal_setup() || !libpostal_setup_language_classifier()) {
        exit(EXIT_FAILURE);
    }

    size_t num_expansions;
    libpostal_normalize_options_t options = libpostal_get_default_options();
    char **expansions = libpostal_expand_address("Quatre-vingt-douze Ave des Champs-Élysées", options, &num_expansions);

    for (size_t i = 0; i < num_expansions; i++) {
        printf("%s\n", expansions[i]);
    }

    // Free expansions
    libpostal_expansion_array_destroy(expansions, num_expansions);

    // Teardown (only called once at the end of your program)
    libpostal_teardown();
    libpostal_teardown_language_classifier();
}

Command-line usage (expand)

After building libpostal:

cd src/

./libpostal "Quatre vingt douze Ave des Champs-Élysées"

If you have a text file or stream with one address per line, the command-line interface also accepts input from stdin:

cat some_file | ./libpostal --json

Command-line usage (parser)

After building libpostal:

cd src/

./address_parser

address_parser is an interactive shell. Just type addresses and libpostal will parse them and print the result.

Bindings

Libpostal is designed to be used by higher-level languages. If you don't see your language of choice, or if you're writing a language binding, please let us know!

Officially supported language bindings

Unofficial language bindings

Database extensions

Unofficial REST API

Libpostal REST Docker

Libpostal ZeroMQ Docker

Tests

libpostal uses greatest for automated testing. To run the tests, use:

make check

Adding test cases is easy, even if your C is rusty/non-existent, and we'd love contributions. We use mostly functional tests checking string input against string output.

libpostal also gets periodically battle-tested on millions of addresses from OSM (clean) as well as anonymized queries from a production geocoder (not so clean). During this process we use valgrind to check for memory leaks and other errors.

Data files

libpostal needs to download some data files from S3. The basic files are on-disk representations of the data structures necessary to perform expansion. For address parsing, since model training takes a few days, we publish the fully trained model to S3 and will update it automatically as new addresses get added to OSM, OpenAddresses, etc. Same goes for the language classifier model.

Data files are automatically downloaded when you run make. To check for and download any new data files, you can either run make, or run:

libpostal_data download all $YOUR_DATA_DIR/libpostal

And replace $YOUR_DATA_DIR with whatever you passed to configure during install.

Language dictionaries

libpostal contains a number of per-language dictionaries that influence expansion, the language classifier, and the parser. To explore the dictionaries or contribute abbreviations/phrases in your language, see resources/dictionaries.

Training data

In machine learning, large amounts of training data are often essential for getting good results. Many open-source machine learning projects either release only the model code (results reproducible if and only if you're Google), or a pre-baked model where the training conditions are unknown.

Libpostal is a bit different because it's trained on open data that's available to everyone, so we've released the entire training pipeline (the geodata package in this repo), as well as the resulting training data itself on the Internet Archive. It's over 100GB unzipped.

Training data are stored on archive.org by the date they were created. There's also a file stored in the main directory of this repo called current_parser_training_set which stores the date of the most recently created training set. To always point to the latest data, try something like: latest=$(cat current_parser_training_set) and use that variable in place of the date.
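
For example, a hypothetical helper (the function name is illustrative) that reads the date stamp from current_parser_training_set and builds the corresponding archive.org download URL:

```python
import os

def latest_training_set_url(repo_dir, filename="formatted_addresses_tagged.random.tsv.gz"):
    """Read the date stamp from current_parser_training_set in the repo
    directory and build the archive.org URL for one training file."""
    with open(os.path.join(repo_dir, "current_parser_training_set")) as f:
        date = f.read().strip()
    return ("https://archive.org/download/"
            f"libpostal-parser-training-data-{date}/{filename}")
```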

Parser training sets

All files can be found at https://archive.org/download/libpostal-parser-training-data-YYYYMMDD/$FILE as gzip'd tab-separated values (TSV) files formatted like: language\tcountry\taddress.

  • formatted_addresses_tagged.random.tsv.gz (ODBL): OSM addresses. Apartments, PO boxes, categories, etc. are added primarily to these examples
  • formatted_places_tagged.random.tsv.gz (ODBL): every toponym in OSM (even cities represented as points, etc.), reverse-geocoded to its parent admins, possibly including postal codes if they're listed on the point/polygon. Every place gets a base level of representation and places with higher populations get proportionally more.
  • formatted_ways_tagged.random.tsv.gz (ODBL): every street in OSM (ways with highway=*, with a few conditions), reverse-geocoded to its admins
  • geoplanet_formatted_addresses_tagged.random.tsv.gz (CC-BY): every postal code in Yahoo GeoPlanet (includes almost every postcode in the UK, Canada, etc.) and their parent admins. The GeoPlanet admins have been cleaned up and mapped to libpostal's tagset
  • openaddresses_formatted_addresses_tagged.random.tsv.gz (various licenses, mostly CC-BY): most of the address data sets from OpenAddresses, which in turn come directly from government sources
  • uk_openaddresses_formatted_addresses_tagged.random.tsv.gz (CC-BY): addresses from OpenAddresses UK

If the parser doesn't perform as well as you'd hoped on a particular type of address, the best recourse is to use grep/awk to look through the training data and try to determine if there's some pattern/style of address that's not being captured.
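
Since the files are plain gzip'd TSVs, that kind of inspection needs nothing beyond the standard library. A sketch, with a hypothetical helper name, that pulls out rows for one language and/or country:

```python
import gzip

def filter_training_rows(path, language=None, country=None):
    """Yield (language, country, address) rows from a gzip'd training TSV,
    optionally restricted to one language and/or country code."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            lang, ctry, address = line.rstrip("\n").split("\t", 2)
            if language and lang != language:
                continue
            if country and ctry != country:
                continue
            yield lang, ctry, address
```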

Features

  • Abbreviation expansion: e.g. expanding "rd" => "road" but for almost any language. libpostal supports > 50 languages and it's easy to add new languages or expand the current dictionaries. Ideographic languages (not separated by whitespace e.g. Chinese) are supported, as are Germanic languages where thoroughfare types are concatenated onto the end of the string, and may optionally be separated so Rosenstraße and Rosen Straße are equivalent.

  • International address parsing: Conditional Random Field which parses "123 Main Street New York New York" into {"house_number": 123, "road": "Main Street", "city": "New York", "state": "New York"}. The parser works for a wide variety of countries and languages, not just US/English. The model is trained on over 1 billion addresses and address-like strings, using the templates in the OpenCage address formatting repo to construct formatted, tagged training examples for every inhabited country in the world. Many types of normalizations are performed to make the training data resemble real messy geocoder input as closely as possible.

  • Language classification: multinomial logistic regression trained (using the FTRL-Proximal method to induce sparsity) on all of OpenStreetMap ways, addr:* tags, toponyms and formatted addresses. Labels are derived using point-in-polygon tests for both OSM countries and official/regional languages for countries and admin 1 boundaries respectively. So, for example, Spanish is the default language in Spain but in different regions e.g. Catalunya, Galicia, the Basque region, the respective regional languages are the default. Dictionary-based disambiguation is employed in cases where the regional language is non-default e.g. Welsh, Breton, Occitan. The dictionaries are also used to abbreviate canonical phrases like "Calle" => "C/" (performed on both the language classifier and the address parser training sets)

  • Numeric expression parsing ("twenty first" => 21st, "quatre-vingt-douze" => 92, again using data provided in CLDR), supports > 30 languages. Handles languages with concatenated expressions e.g. milleottocento => 1800. Optionally normalizes Roman numerals regardless of the language (IX => 9) which occur in the names of many monarchs, popes, etc.

  • Fast, accurate tokenization/lexing: clocked at > 1M tokens / sec, implements the TR-29 spec for UTF8 word segmentation, tokenizes East Asian languages character by character instead of on whitespace.

  • UTF8 normalization: optionally decomposes UTF8 to NFD normalization form, strips accent marks e.g. à => a, and/or applies Latin-ASCII transliteration.

  • Transliteration: e.g. улица => ulica or ulitsa. Uses all CLDR transforms, the exact same source data as used by ICU, though libpostal doesn't require pulling in all of ICU (might conflict with your system's version). Note: some languages, particularly Hebrew, Arabic and Thai may not include vowels and thus will not often match a transliteration done by a human. It may be possible to implement statistical transliterators for some of these languages.

  • Script detection: Detects which script a given string uses (can be multiple e.g. a free-form Hong Kong or Macau address may use both Han and Latin scripts in the same address). In transliteration we can use all applicable transliterators for a given Unicode script (Greek can for instance be transliterated with Greek-Latin, Greek-Latin-BGN and Greek-Latin-UNGEGN).
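
Two of the normalizations above (accent stripping and Roman numeral parsing) can be sketched with nothing but the standard library. These are illustrative sketches, not libpostal's actual implementation:

```python
import unicodedata

def strip_accents(s):
    """Decompose to NFD, then drop combining marks, e.g. 'à' -> 'a'."""
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if not unicodedata.combining(c))

ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def roman_to_int(s):
    """Parse a Roman numeral, handling subtractive forms (IX -> 9)."""
    total = 0
    s = s.upper()
    for i, ch in enumerate(s):
        value = ROMAN[ch]
        # Subtract when a smaller numeral precedes a larger one (IX, XC, CM).
        if i + 1 < len(s) and ROMAN[s[i + 1]] > value:
            total -= value
        else:
            total += value
    return total
```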

Non-goals

  • Verifying that a location is a valid address
  • Actually geocoding addresses to a lat/lon (that requires a database/search index)

Raison d'être

libpostal was originally created as part of the OpenVenues project to solve the problem of venue deduping. In OpenVenues, we have a data set of millions of places derived from terabytes of web pages from the Common Crawl. The Common Crawl is published monthly, and so even merging the results of two crawls produces significant duplicates.

Deduping is a relatively well-studied field, and for text documents like web pages, academic papers, etc. there exist pretty decent approximate similarity methods such as MinHash.

However, for physical addresses, the frequent use of conventional abbreviations such as Road == Rd, California == CA, or New York City == NYC complicates matters a bit. Even using a technique like MinHash, which is well suited for approximate matches and is equivalent to the Jaccard similarity of two sets, we have to work with very short texts and it's often the case that two equivalent addresses, one abbreviated and one fully specified, will not match very closely in terms of n-gram set overlap. In non-Latin scripts, say a Russian address and its transliterated equivalent, it's conceivable that two addresses referring to the same place may not match even a single character.
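
The n-gram overlap problem can be made concrete with a few lines of Python (Jaccard similarity over character trigrams; char_ngrams and jaccard are illustrative helpers, not libpostal APIs):

```python
def char_ngrams(s, n=3):
    """Set of lowercased character n-grams of a string."""
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

# Two equivalent addresses, one abbreviated and one fully spelled out,
# share only a small fraction of their trigrams.
abbreviated = char_ngrams("30 W 26th St Fl #7")
verbose = char_ngrams("30 West Twenty-sixth Street Floor Number 7")
similarity = jaccard(abbreviated, verbose)
```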

As a motivating example, consider the following two equivalent ways to write a particular Manhattan street address with varying conventions and degrees of verbosity:

  • 30 W 26th St Fl #7
  • 30 West Twenty-sixth Street Floor Number 7

Obviously '30 W 26th St Fl #7' != '30 West Twenty-sixth Street Floor Number 7' in a string comparison sense, but a human can grok that these two addresses refer to the same physical location.

libpostal aims to create normalized geographic strings, parsed into components, such that we can more effectively reason about how well two addresses actually match and make automated server-side decisions about dupes.

So it's not a geocoder?

If the above sounds a lot like geocoding, that's because it is in a way, only in the OpenVenues case, we have to geocode without a UI or a user to select the correct address in an autocomplete dropdown. Given a database of source addresses such as OpenAddresses or OpenStreetMap (or all of the above), libpostal can be used to implement things like address deduping and server-side batch geocoding in settings like MapReduce or stream processing.

Now, instead of trying to bake address-specific conventions into traditional document search engines like Elasticsearch using giant synonyms files, scripting, custom analyzers, tokenizers, and the like, geocoding can look like this:

  1. Run the addresses in your database through libpostal's expand_address
  2. Store the normalized string(s) in your favorite search engine, DB, hashtable, etc.
  3. Run your user queries or fresh imports through libpostal and search the existing database using those strings

In this way, libpostal can perform fuzzy address matching in constant time relative to the size of the data set.
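
The three steps above can be sketched with an in-memory hashtable. To keep the sketch self-contained, expand_address is stubbed out here with a fixed expansion table (fake_expand); a real pipeline would call postal.expand.expand_address instead:

```python
from collections import defaultdict

def fake_expand(address):
    """Stand-in for libpostal's expand_address: returns the normalized
    form(s) of an address. Hard-coded here for illustration only."""
    table = {
        "Quatre-vingt-douze Ave des Champs-Élysées":
            ["92 avenue des champs-elysees"],
        "92 Avenue des Champs-Élysées":
            ["92 avenue des champs-elysees"],
    }
    return table.get(address, [address.lower()])

def build_index(records):
    """Steps 1-2: expand every address and index record ids by normalized form."""
    index = defaultdict(set)
    for record_id, address in records:
        for form in fake_expand(address):
            index[form].add(record_id)
    return index

def lookup(index, query):
    """Step 3: expand the query and collect candidate record ids,
    one O(1) hashtable probe per normalized form."""
    hits = set()
    for form in fake_expand(query):
        hits |= index.get(form, set())
    return hits
```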

Why C?

libpostal is written in C for three reasons (in order of importance):

  1. Portability/ubiquity: libpostal targets higher-level languages that people actually use day-to-day: Python, Go, Ruby, NodeJS, etc. The beauty of C is that just about any programming language can bind to it and C compilers are everywhere, so pick your favorite, write a binding, and you can use libpostal directly in your application without having to stand up a separate server. We support Mac/Linux (Windows is not a priority but happy to accept patches), have a standard autotools build and an endianness-agnostic file format for the data files. The Python bindings are maintained as part of this repo since they're needed to construct the training data.

  2. Memory-efficiency: libpostal is designed to run in a MapReduce setting where we may be limited to < 1GB of RAM per process depending on the machine configuration. As much as possible libpostal uses contiguous arrays, tries (built on contiguous arrays), bloom filters and compressed sparse matrices to keep memory usage low. It's possible to use libpostal on a mobile device with models trained on a single country or a handful of countries.

  3. Performance: this is last on the list for a reason. Most of the optimizations in libpostal are for memory usage rather than performance. libpostal is quite fast given the amount of work it does. It can process 10-30k addresses / second in a single thread/process on the platforms we've tested (that means processing every address in OSM planet in a little over an hour). Check out the simple benchmark program to test on your environment and various types of input. In the MapReduce setting, per-core performance isn't as important because everything's being done in parallel, but there are some streaming ingestion applications at Mapzen where this needs to run in-process.

C conventions

libpostal is written in modern, legible C99 and uses the following conventions:

  • Roughly object-oriented, as much as allowed by C
  • Almost no pointer-based data structures, arrays all the way down
  • Uses dynamic character arrays (inspired by sds) for safer string handling
  • Confines almost all mallocs to name_new and all frees to name_destroy
  • Efficient existing implementations for simple things like hashtables
  • Generic containers (via klib) whenever possible
  • Data structures take advantage of sparsity as much as possible
  • Efficient double-array trie implementation for most string dictionaries
  • Cross-platform as much as possible, particularly for *nix

Preprocessing (Python)

The geodata Python package in the libpostal repo contains the pipeline for preprocessing the various geo data sets and building training data for the C models to use. This package shouldn't be needed for most users, but for those interested in generating new types of addresses or improving libpostal's training data, this is where to look.

Address parser accuracy

On held-out test data (meaning labeled parses that the model has not seen before), the address parser achieves 99.45% full parse accuracy.

For some tasks like named entity recognition it's preferable to use something like an F1 score or variants, mostly because there's a class bias problem (most words are non-entities, and a system that simply predicted non-entity for every token would actually do fairly well in terms of accuracy). That is not the case for address parsing. Every token has a label and there are millions of examples of each class in the training data, so accuracy is preferable as it's a clean, simple and intuitive measure of performance.

Here we use full parse accuracy, meaning we only give the parser one "point" in the numerator if it gets every single token in the address correct. That should be a better measure than simply looking at whether each token was correct.
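
The difference between the two measures can be sketched in a few lines (hypothetical helper names; gold and predicted are parallel lists of per-address label sequences):

```python
def full_parse_accuracy(gold, predicted):
    """Fraction of addresses for which every token label matches the gold parse."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)

def token_accuracy(gold, predicted):
    """Fraction of individual token labels predicted correctly."""
    pairs = [(gl, pl) for g, p in zip(gold, predicted) for gl, pl in zip(g, p)]
    return sum(1 for gl, pl in pairs if gl == pl) / len(pairs)
```

A parser that mislabels one token in half the addresses scores well on token accuracy but much lower on full-parse accuracy, which is why the stricter measure is reported.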

Improving the address parser

Though the current parser works quite well for most standard addresses, there is still room for improvement, particularly in making sure the training data we use is as close as possible to addresses in the wild. There are two primary ways the address parser can be improved even further (in order of difficulty):

  1. Contribute addresses to OSM. Anything with an addr:housenumber tag will be incorporated automatically into the parser next time it's trained.
  2. If the address parser isn't working well for a particular country, language or style of address, chances are that some name variations or places are being missed/mislabeled during training data creation. Sometimes the fix is to update the formats at: https://github.com/OpenCageData/address-formatting, and in many other cases there are relatively simple tweaks we can make when creating the training data that will ensure the model is trained to handle your use case without you having to do any manual data entry. If you see a pattern of obviously bad address parses, the best thing to do is post an issue to Github.

Contributing

Bug reports, issues and pull requests are welcome. Please read the contributing guide before submitting your issue, bug report, or pull request.

Submit issues at: https://github.com/openvenues/libpostal/issues.

Shoutouts

Special thanks to @BenK10 for the initial Windows build and @AeroXuk for integrating it seamlessly into the project and setting up an Appveyor build.

License

The software is available as open source under the terms of the MIT License.

Comments
  • Windows support via AppVeyor

    Windows support via AppVeyor

    This pull request is based on the 'BenK10/libpostal_windows' patch and expanded apon.

    I have gone through all the files reviewing the changes and build scripts and making minor changes that will allow this library to be built on Windows via MSYS2 & MingW64. I have setup AppVeyor which from other Issue logs I saw was a requirement for the libpostal dev team to take on Windows support.

    AppVeyor has been configured to build and package the resulting binary into a .zip file along with the relevant linking files libpostal.dll, libpostal.lib, libpostal.exp & libpostal.def.

    The package built by AppVeyor can be uploaded as a Release or linked to via the following URL: https://ci.appveyor.com/api/projects/<account>/libpostal/artifacts/libpostal.zip

    The changes I have made build successfully on Travis CI & AppVeyor.

    The only program not working on this build is the example console app address_parser. This is due to not having a termios.h equivalent. So this step of the build is commented out for the Windows build (it is still compiled for the linux build).

    opened by AeroXuk 31
  • Suite/Apartment parsing is not correct

    Suite/Apartment parsing is not correct

    @daguar and I have been experimenting with @straup’s new libpostal API and finding some weird stuff with unit numbers in U.S. addresses. In most cases, libpostal misinterprets unit numbers as house numbers, and groups terms like "suite" with the road name.

    Here are some odd examples:

    • https://libpostal.mapzen.com/parse?address=123+main+street+apt+456+oakland+ca+94789&format=keys
    {
        "city": [
            "oakland"
        ],
        "house_number": [
            "123",
            "456"
        ],
        "postcode": [
            "94789"
        ],
        "road": [
            "main street apt"
        ],
        "state": [
            "ca"
        ]
    }
    
    • https://libpostal.mapzen.com/parse?address=123+main+street+suite+456+oakland+ca+94789&format=keys
    {
        "city": [
            "oakland"
        ],
        "house_number": [
            "123"
        ],
        "postcode": [
            "94789"
        ],
        "road": [
            "main street suite 456"
        ],
        "state": [
            "ca"
        ]
    }
    
    • https://libpostal.mapzen.com/parse?address=123+main+street+%23456+oakland+ca+94789&format=keys
    {
        "city": [
            "oakland"
        ],
        "house_number": [
            "123"
        ],
        "postcode": [
            "94789"
        ],
        "road": [
            "main street # 456"
        ],
        "state": [
            "ca"
        ]
    }
    
    parsing 
    opened by migurski 17
  • Parser setup fails with Docker on Windows

    Parser setup fails with Docker on Windows

    Since I'm running a Windows machine, I've been using libpostal inside a Docker container with a Debian image for the last few weeks. This worked fine until I decided to rebuild the image with the latest libpostal release on Friday April 7th, 2017. Now when I run the address parser command line tool, I get the following error: could not find parser model file of known type at address_parser_load (address_parser.c:208) errno: no such file or directory

    it does not say which file is missing.

    opened by BenK10 16
  • Installation error on RedHat 7.3

    Installation error on RedHat 7.3

    I'm working on Red Hat EL 7.3, trying to install libpostal. when I run the bootstrap.sh, I get:

    libtoolize: putting auxiliary files in '.'. libtoolize: copying file './ltmain.sh' libtoolize: putting macros in AC_CONFIG_MACRO_DIRS, 'm4'. libtoolize: copying file 'm4/libtool.m4' libtoolize: copying file 'm4/ltoptions.m4' libtoolize: copying file 'm4/ltsugar.m4' libtoolize: copying file 'm4/ltversion.m4' libtoolize: copying file 'm4/lt~obsolete.m4' libtoolize: Consider adding '-I m4' to ACLOCAL_AMFLAGS in Makefile.am. configure.ac:12: installing './missing' src/Makefile.am: installing './depcomp'

    Then I run configure, and I get the following error when I run make:

    libpostal data file up to date
    make[1]: *** [all-local] Error 7
    make[1]: Leaving directory `[my_dir]/libpostal/src'
    make: *** [install-recursive] Error 1

    My package versions: autoconf 2.69, automake 1.13, libtool 2.4.2, pkgconfig 0.27.

    (when I do yum --showduplicates list, there is no pkgconfig 0.29 or automake 1.15)

    Any idea what's going wrong?

    opened by sepidehsaran 15
  • libpostal installation failed

    libpostal installation failed

    Hi Al,

    I just tried to install the latest libpostal and run address_parser. The installation went through, but address_parser failed to start. Here are the details:

    I installed libpostal on macOS. I tried two Macs and both have the same problem.

    git clone https://github.com/openvenues/libpostal
    cd libpostal
    ./bootstrap.sh
    ./configure --datadir=$PWD
    make    (tried make -j4 on another Mac, same problem; guess -j4 doesn't matter much)
    sudo make install

    cd libpostal/src
    ./address_parser
    Loading models...
    ERR   Error loading transliteration module, dir=(null)
    at libpostal_setup_datadir (libpostal.c:266) errno: No such file or directory

    I also tried a Linux environment and was not able to start jpostal's address parser there. I didn't try libpostal/src/address_parser on that Linux machine because my account doesn't have the necessary privileges, but I guess the root cause is the same. I also ran sudo ldconfig.

    Could you help take a look?

    Thanks a lot!

    Tracy

    opened by liufang94 14
  • memory leak: expand_address hangs indefinitely on CYRILLIC SMALL LETTER YERU

    memory leak: expand_address hangs indefinitely on CYRILLIC SMALL LETTER YERU

    Hi Al!

    Using the node-postal npm module, the following command hangs indefinitely:

    $ node
    > var libpostal = require('node-postal');
    undefined
    
    > libpostal.expand.expand_address('улица 40 лет Победы');
    

    I'm not sure if it's a bug in libpostal core or in the node wrapper.

    Based on the master branch, current at the time of writing.

    opened by missinglink 13
  • What are the possible labels?

    What are the possible labels?

    As someone trying to fit the output into a tabular data structure, it would be good to know the range of labels for parsed addresses (I've tried to dig through the code, but... there's kind of a lot of it!)

    opened by Ironholds 13
  • Error loading geodb module

    Error loading geodb module

    I just installed on a Debian system, following the step-by-step instructions in the readme. Then I tried running the parser from the command line, and got this error:

    # ./address_parser
    Loading models...
    ERR   Error loading geodb module
       at libpostal_setup_parser (libpostal.c:1071) errno: None
    

    I made sure the data directory is world-readable. What am I missing?

    opened by eater 12
  • Doesn't appear to handle PO Box numbers.

    Doesn't appear to handle PO Box numbers.

    postal.parser.parse_address('PO Box 1, Seattle, WA 98103');
    [ { value: 'po', component: 'house_number' },
      { value: 'box', component: 'road' },
      { value: '1', component: 'house_number' },
      { value: 'seattle', component: 'city' },
      { value: 'wa', component: 'state' },
      { value: '98103', component: 'postcode' } ]
    
    parsing 
    opened by sporkmonger 11
  • Configurable data library directory to enable hadoop deployment

    Configurable data library directory to enable hadoop deployment

    Hi there,

    We are trying to use libpostal (+ jpostal) as part of a wider data processing pipeline based on Spark running on YARN. We have got this working in our development environment by performing a libpostal install on each of the Hadoop nodes, and we are able to parse addresses in parallel across the Hadoop cluster.

    Unfortunately we have hit a bit of an issue when trying to deploy this within our customer environments because they do not allow for the install to occur on the individual nodes on the cluster.

    From an installation perspective there are four components as we understand it:

    1. LibPostal Data Library (all the reference data load to the configured datadir).
    2. The libpostal C libraries which need to be on the LD_LIBRARY_PATH
    3. JPostal’s JNI C libraries which need to be on the java.library.path
    4. JPostal’s java library (jpostal.jar)

    The main issue we are having is that the location of the data library/directory is hard-coded into the libpostal C library at compile time, which makes the libraries non-portable.

    Question: Would it be possible to make the data library/directory configurable either as an argument in the call to the library or as an environment variable? Obviously for backward compatibility, libpostal could use the "defaults" (set at compile time) but if the relevant argument / environment variable was set, could use this.

    Thanks, Jamie

    opened by jamiehutton 10
  • Error loading address parser module errno:Not enough space windows

    Error loading address parser module errno:Not enough space windows

    I compiled the libpostal library following the instructions for Windows, and everything looked good. The data file containing address_expansions, address_parser, etc. downloaded completely, with a size of 1.84 GB. Then I used the sample code for libpostal, after linking the library in the Qt IDE using

    LIBS += D:\cook\libpostal\libpostal\src\.libs\libpostal.a
    LIBS += D:\cook\libpostal\libpostal\src\.libs\libpostal.dll.a
    LIBS += D:\cook\libpostal\libpostal\src\.libs\libscanner.a
    INCLUDEPATH += D:\cook\libpostal\libpostal\src\
    

    Then I use this code to set the datadir and parser datadir, followed by a simple example of parsing an address:

    libpostal_setup_datadir("D:\\cook\\libpostal\\Data\\libpostal");
    libpostal_setup_parser_datadir("D:\\cook\\libpostal\\Data\\libpostal");
    
    
    // Setup (only called once at the beginning of your program)
    if (!libpostal_setup() || !libpostal_setup_parser()) {
        exit(EXIT_FAILURE);
    }
    
    libpostal_address_parser_options_t options = libpostal_get_address_parser_default_options();
    libpostal_address_parser_response_t *parsed = libpostal_parse_address("781 Franklin Ave Crown Heights Brooklyn NYC NY 11216 USA", options);
    
    for (size_t i = 0; i < parsed->num_components; i++) {
        printf("%s: %s\n", parsed->labels[i], parsed->components[i]);
    }
    
    // Free parse result
    libpostal_address_parser_response_destroy(parsed);
    
    // Teardown (only called once at the end of your program)
    libpostal_teardown();
    libpostal_teardown_parser();
    
    return 0;
    
    

    but I got this error

    ERR   Error loading address parser module, dir=D:\cook\libpostal\Data\libpostal\address_parser
     at libpostal_setup_parser_datadir (libpostal.c:434) errno:Not enough space
    ERR   parser is not setup, call libpostal_setup_address_parser()
     at address_parser_parse (address_parser.c:1666) errno:Not enough space
    ERR   Parser returned NULL
     at libpostal_parse_address (libpostal.c:267) errno:Not enough space
    
    

    So what did I do wrong? I definitely have more than 100 GB of free space on my disk, and I still get this error.

    opened by amrkamal2025 9
  • Can't run the example in README on Ubuntu 22.04.1 LTS

    Can't run the example in README on Ubuntu 22.04.1 LTS

    Hi! 👋

    What problem we had

    We actually wanted to use libpostal through NIFs in Erlang/Elixir. There's a library created for that, SweetIQ/expostal, which basically compiles some C functions using libpostal so we can use them in Elixir.

    However, on Alpine we had some problems using that library, even though it was working fine locally on macOS. The error we got from the library is:

    13:18:25.439 [error] Process #PID<0.202.0> raised an exception
    ** (MatchError) no match of right hand side value: {:error, {:load_failed, 'Failed to load NIF library: \'Error relocating /foobar/_build/dev/lib/expostal/priv/expostal.so: libpostal_setup_parser: symbol not found\''}}
        (expostal 0.2.0) lib/expostal.ex:13: Expostal.init/0
        (kernel 8.5) code_server.erl:1317: anonymous fn/1 in :code_server.handle_on_load/5
    

    What I've done so far

    After trying to find out what could be wrong, and just to see if I could run the example, I compiled and installed libpostal from source on an Ubuntu 22.04.1 LTS container and created a file with the example code from the README. After compiling the file, when I tried to run it, I got

    [email protected]:~# gcc -I/usr/local/include/ -shared foo.c
    [email protected]:~# ./a.out
    Segmentation fault
    

    I might've compiled it wrong, as it's been quite a while since I touched anything in C, but I wanted to ask for help here.

    [email protected]:~# gcc -v
    Using built-in specs.
    COLLECT_GCC=gcc
    COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/11/lto-wrapper
    OFFLOAD_TARGET_NAMES=nvptx-none:amdgcn-amdhsa
    OFFLOAD_TARGET_DEFAULT=1
    Target: x86_64-linux-gnu
    Configured with: ../src/configure -v --with-pkgversion='Ubuntu 11.2.0-19ubuntu1' --with-bugurl=file:///usr/share/doc/gcc-11/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,m2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-11 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib --enable-libphobos-checking=release --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --disable-werror --enable-cet --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none=/build/gcc-11-gBFGDP/gcc-11-11.2.0/debian/tmp-nvptx/usr,amdgcn-amdhsa=/build/gcc-11-gBFGDP/gcc-11-11.2.0/debian/tmp-gcn/usr --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu --with-build-config=bootstrap-lto-lean --enable-link-serialization=2
    Thread model: posix
    Supported LTO compression algorithms: zlib zstd
    gcc version 11.2.0 (Ubuntu 11.2.0-19ubuntu1)
    
    opened by andreyuhai 0
  • Difficulty Installing libpostal

    Difficulty Installing libpostal

    Hello!

    I am having a lot of trouble following the instructions and installing the R libpostal library.

    I was wondering - perhaps in the future, this package could be made into a "traditional R package" and installed directly (like most other R packages)?

    Thanks!

    opened by swaheera 0
  • parse_address not working well for Japanese addresses

    parse_address not working well for Japanese addresses

    Hi!

    I was checking out libpostal and have a question: I read that this library supports parsing Japanese addresses; however, when I tried it, it doesn't seem to work well. So I would like to get some feedback from the awesome contributors (I tried other countries, and it works really great!)


    My country is

    US, but I'm using it for parsing Japanese addresses


    Here's how I'm using libpostal

    To help extract address from small business owner's website


    Here's what I did

    text = '〒100-8994 東京都千代田区丸ノ内2-7-2'
    parse_address(text)


    Here's what I got

    [('〒100-8994', 'postcode'), ('東', 'city'), ('京都千代田', 'city_district'), ('区', 'city'), ('丸ノ内', 'road'), ('2-7-2', 'house_number')]


    Here's what I was expecting

    The postcode is correct, but "東京都" (Tokyo Metropolis) is supposed to be city, and "千代田区" (Chiyoda Ward) is supposed to be city_district.

    Here are a few other examples

    Example 1 input:

    text = '〒550-0002 大阪府大阪市西区江戸堀1丁目18番21号'
    parse_address(text)

    output: [('〒550-0002', 'postcode'), ('大', 'state'), ('阪', 'city'), ('府大阪市西', 'city_district'), ('区', 'city'), ('江戸堀', 'house'), ('1丁目', 'suburb'), ('18番', 'house_number'), ('21号', 'city_district')]

    expected/correct parsing: 〒550-0002 大阪府 / 大阪市 / 西区 / 江戸堀 / 1丁目18番21号

    Example 2 input:

    text = '〒064-0809 北海道札幌市中央区南9条西3丁目2−5'
    parse_address(text)

    output: [('〒064-0809', 'postcode'), ('北', 'state'), ('海', 'city'), ('道札幌市中央区南9条西', 'road'), ('3丁目', 'suburb'), ('2-5', 'house_number')]

    expected/correct parsing: 〒064-0809 北海道 / 札幌市 / 中央区 / 南9条西 / 3丁目2−5

    Example 3 input:

    text = '〒604-8064 京都府京都市中京区骨屋之町560 離れ'
    parse_address(text)

    output: [('〒604-8064', 'postcode'), ('京', 'state'), ('都', 'city'), ('府京都市中京区', 'city_district'), ('骨屋之町', 'road'), ('560', 'house_number'), ('離れ', 'road')]

    expected/correct parsing: 〒604-8064 京都府 / 京都市 / 中京区 / 骨屋之町 / 560 離れ

    Example 4 input:

    text = '〒460-0031 愛知県名古屋市中区本丸1−1'
    parse_address(text)

    output: [('〒460-0031', 'postcode'), ('愛', 'state'), ('知県名古屋市中', 'city'), ('区', 'city_district'), ('本丸', 'suburb'), ('1-1', 'house_number')]

    expected/correct parsing: 〒460-0031 愛知県 / 名古屋市 / 中区 / 本丸 / 1−1


    For parsing issues, please answer "yes" or "no" to all that apply.

    • Does the input address exist in OpenStreetMap?
    • Do all the toponyms exist in OSM (city, state, region names, etc.)?
    • If the address uses a rare/uncommon format, does changing the order of the fields yield the correct result?
    • If the address does not contain city, region, etc., does adding those fields to the input improve the result?
    • If the address contains apartment/floor/sub-building information or uncommon formatting, does removing that help? Is there any minimum form of the address that gets the right parse?

    Here's what I think could be improved

    opened by XiushuangLi 0
  • Enabling SSE creates memory write access violations

    Enabling SSE creates memory write access violations

    There are two main issues:

    The remez9_0_log2_sse function assumes that the buffer it is handed has a size that is a multiple of 4 doubles. This isn't ensured anywhere, and make check will cause an access violation in the crf_context test because it allocates a buffer of 9 doubles.

    The posix_memalign call "almost" handles this by aligning to 16 bytes (2 doubles), since the allocated buffer will always be a multiple of the alignment. Changing the alignment from 16 to 32 resolves this problem.

    There is no such thing as a realloc for aligned memory, but vector.h tries to implement one. It is undefined whether realloc on a posix_memalign allocation even works... though from searching Google it sounds like it does. But the problem is that realloc doesn't take into account that the size needs to be a multiple of the alignment. So when the unit test asks for 72, it gets 72, and the call to remez9_0_log2_sse gives an access violation.

    The safe thing here would be to not use realloc. But then you have the issue that the _aligned_realloc function doesn't know the existing size of the buffer in order to do the copy. So you have to align the size to realloc yourself and hope the C library doesn't corrupt the heap.

    There is something else going on too that I haven't figured out, but my recommendation at the moment is to simply disable SSE by default.

    opened by brianmacy 0
  • make: *** No rule to make target 'install'.  Stop.

    make: *** No rule to make target 'install'. Stop.

    Hi team. When trying to install libpostal, it failed and told me "make: *** No rule to make target 'install'. Stop." I am installing in a Dockerfile:

    RUN git clone https://github.com/openvenues/libpostal && \
        cd libpostal && \
        make distclean && \
        /bin/bash bootstrap.sh && \
        ./configure --disable-data-download --datadir=/data/ && \
        make -j4 & \
        make install && \
        libpostal_data download all /data/ && \
        ldconfig

    opened by zffocussss 3
  • Option to disable street name expansion for near dupes

    Option to disable street name expansion for near dupes

    We found that for near dupes the street name root expansion generated a lot of extra hashes but didn't give us a lot of benefits in terms of blocking, so we ended up disabling it.

    opened by koertkuipers 0
Releases(v1.1)
  • v1.1(May 9, 2018)

    This release adds three important groups of functions to libpostal's C API to support the lieu address/venue deduping project. The APIs are somewhat low-level at this point, but should still be useful in a wide range of geo applications, particularly for batch geocoding large data sets. This is the realization of some of the earlier work on address expansion.

    Near-duplicate detection

    Near-dupe hashing builds on the expand_address functionality to allow hashing a parsed address into strings suitable for direct comparison and automatic clustering. The hash keys are used to group similar records together prior to pairwise deduping so that we don't need to compare every record to every other record (i.e. N² comparisons). Instead, if we have a function that can generate the same hash key for records that are possible dupes (like "100 Main" and "100 E Main St"), while also being highly selective, we can ensure that most duplicates will be captured for further comparison downstream, and that dissimilar records can be safely considered non-dupes. In a MapReduce context, near-dupe hashes can be used as keys to ensure that possible dupes will be grouped together on the same shard for pairwise checks, and in a search/database context, they can be used as an index for quick lookups of candidate dupes before running more thorough comparisons with the few records that match the hash. This is the first step in the deduping process to identify candidate dupes, and can be thought of as the blocking function in record linkage (a highly selective one, in this case) or as a form of locality-sensitive hashing in the near-duplicate detection literature. Libpostal's near-dupe hashes use a combination of several new features of the library:
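The blocking idea can be sketched in a few lines of Python. This is a conceptual illustration, not libpostal's C API; `toy_hash` is a made-up stand-in for the real near-dupe hash function.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(records, hash_fn):
    """Group records by shared hash keys so only records in the same
    bucket are compared pairwise, instead of all N^2 pairs."""
    buckets = defaultdict(list)
    for i, rec in enumerate(records):
        for key in hash_fn(rec):
            buckets[key].append(i)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(ids, 2))
    return pairs

def toy_hash(rec):
    # Made-up stand-in: key on house number + first street token.
    tokens = rec.lower().split()
    return [tokens[0] + "|" + tokens[1]]

records = ["100 Main St", "100 main street", "200 Elm Ave"]
print(candidate_pairs(records, toy_hash))  # {(0, 1)}
```

Only the two "100 Main" records share a key, so only that pair moves on to the more thorough pairwise comparison.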

    1. Address root expansions: removes tokens that are ignorable such as "Ave", "Pl", "Road", etc. in street names so that something like "West 125th St" can potentially match "W 125". This also allows for exact comparison of apartment numbers where "Apt 2" and "#2" mean the same thing. Every address component uses certain dictionaries in libpostal to determine what is ignorable or not, and although the method is rule-based and deterministic, it can also identify the correct root tokens in many complex cases like "Avenue Rd", "Avenue E", "E St SE", "E Ctr St", etc. While many of the test cases used so far are for English, libpostal's dictionary structure also allows it to work relatively well around the world, e.g. matching Spanish street names where "Calle" might be included in a government data set but is rarely used colloquially or in self-reported addresses. For street names, we additionally strip whitespace from the root expansion so "Sea Grape Ln" and "Seagrape Ln" will both normalize to "seagrape".
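A minimal sketch of the root-expansion idea, using a tiny hypothetical dictionary in place of libpostal's per-component dictionaries (the real method is considerably more careful about cases like "Avenue Rd"):

```python
# Hypothetical, tiny stand-in for libpostal's ignorable-token dictionaries.
IGNORABLE_STREET = {"st", "street", "ave", "avenue", "rd", "road",
                    "ln", "lane", "pl", "place", "w", "west", "e", "east"}

def street_root(name):
    tokens = [t.strip(".").lower() for t in name.split()]
    root = [t for t in tokens if t not in IGNORABLE_STREET]
    # If everything was ignorable (e.g. "Avenue Rd"), keep the original
    # tokens; the real rule-based method handles these cases properly.
    if not root:
        root = tokens
    return "".join(root)  # whitespace stripped, as for street names

print(street_root("West 125th St"))  # 125th
print(street_root("Sea Grape Ln"))   # seagrape
print(street_root("Seagrape Ln"))    # seagrape
```

Note how "Sea Grape Ln" and "Seagrape Ln" collapse to the same root once whitespace is stripped.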

    2. Phonetic matching for names: the near-dupe hashes for venue/place/company names written in Latin script include a modified version of the double metaphone algorithm which can be useful for comparing misspelled human names, as well as comparing machine transliterations against human ones in languages where names might written in multiple scripts in different data sets e.g. Arabic or Japanese.
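To give a feel for phonetic keys, here is a deliberately crude toy: keep the first letter, drop later vowels, collapse repeats. Real double metaphone handles digraphs, silent letters, and produces two alternative encodings; this is only a sketch of the idea that spelling variants can share a key.

```python
def crude_phonetic_key(word):
    """Very crude stand-in for double metaphone: first letter kept,
    subsequent vowels dropped, repeated letters collapsed."""
    word = word.upper()
    out = [word[0]]
    for c in word[1:]:
        if c in "AEIOU":
            continue
        if c != out[-1]:
            out.append(c)
    return "".join(out)

print(crude_phonetic_key("sonar"))  # SNR
print(crude_phonetic_key("Meyer") == crude_phonetic_key("Myer"))  # True
```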

    3. Geo qualifiers: for address data sets with lat/lons, geohash tiles (with a precision of 6 characters by default) and their 8 neighbors (to avoid faultlines) are used to narrow down the comparisons to addresses/places in a similar location. If there's no lat/lon, and the data are known to be from a single country, the postal code or the city name can optionally be used as the geo qualifier. Future improvements include disambiguating toponyms and mapping them to IDs in a hierarchy, such that multiple names for cities, etc. can resolve to one or more IDs, and e.g. an NYC address that uses a neighborhood name in place of the city e.g. "Harlem, NY" could match "New York, NY" by traversing the hierarchy and outputting the city's ID instead.
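The tile-plus-neighbors trick can be illustrated with a toy fixed-size lat/lon grid standing in for 6-character geohashes (libpostal uses actual geohash encoding; the grid here is only for illustration):

```python
def grid_tile(lat, lon, cell=0.05):
    """Toy stand-in for a geohash tile: bucket coordinates into a
    fixed-size grid cell."""
    return (int(lat // cell), int(lon // cell))

def tile_with_neighbors(lat, lon, cell=0.05):
    """The tile itself plus its 8 neighbors, so near-dupes sitting just
    across a cell boundary (a 'faultline') still share a key."""
    y, x = grid_tile(lat, lon, cell)
    return {(y + dy, x + dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)}

# Two nearby points on opposite sides of a cell edge still share a key:
b = tile_with_neighbors(40.7501, -73.9876)
print(grid_tile(40.7499, -73.9876) in b)  # True
```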

    4. Acronym generation: when generating hash keys for names, we also try to guess acronyms when there are two or more tokens. While the acronym alignments downstream can be more exact, there are cases like "Brooklyn Academy of Music" vs "BAM" which would not match a single token using the previous methods, but are the same place nonetheless. Acronym generation attempts to solve this problem for venue names, human initials, etc. It is stopword-aware in various languages, but generates multiple potential forms of the acronym for cases like "Museum of Modern Art" vs. "MoMA" where the stopword is included in the acronym. The method also includes rudimentary support for sub-acronyms i.e. generating new tokens at every other non-contiguous stopword (so "University of Texas at Austin" will produce "UT"), as well as at punctuation marks like commas or colons (e.g. "University of Texas, Austin"). Acronym generation also leaves intact any known dictionary phrases that resolve to multi-word phrases so e.g. "Foo High School" and "Foo HS" can share a hash key, and also keeps any acronyms identified as part of the tokenizer from the internal period structure like "A.B.C." All of this also works across scripts for any non-ideographic languages. Though this method will not cover all valid acronyms directly, we use a double metaphone on the output in Latin scripts so with certain types of more complex acronyms like "sonar" vs. "Sound Navigation and Ranging", where more than the initial character of each word is used, the generated acronym may still be phonetically the same as the real acronym (in this case, both resolve to "SNR"), especially if the other letters used are vowels.
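A toy sketch of stopword-aware acronym generation, producing both the with-stopword and without-stopword forms (hypothetical stopword list; the real method also handles sub-acronyms, punctuation boundaries, and known dictionary phrases):

```python
STOPWORDS = {"of", "the", "at", "and"}  # toy English stopword list

def acronyms(name):
    """Generate candidate acronyms: one keeping stopword initials
    ('moma') and one skipping them ('mma'), both lowercased."""
    tokens = [t.lower().strip(".,:") for t in name.split()]
    if len(tokens) < 2:
        return set()
    with_stops = "".join(t[0] for t in tokens)
    without_stops = "".join(t[0] for t in tokens if t not in STOPWORDS)
    return {with_stops, without_stops}

print(acronyms("Museum of Modern Art"))  # {'moma', 'mma'}
```

Generating both forms is what lets "Museum of Modern Art" share a hash key with either "MoMA" or "MMA"-style abbreviations.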

    5. Name quad-grams: in order to handle a variety of spelling differences, as well as languages with longer concatenated words as in German, all of the name hashes mentioned above are converted into unique sequences of 4 characters. Simple numeric tokens consisting of digits, hyphens, etc. are included as-is, without the quadgrams or double metaphone encoding. For the words that do use quadgrams, there is no token boundary disambiguation, i.e. the double metaphone for "Nationalgalerie" would be "NXNLKLR", which would then generate strings ["NXNL", "XNLK", "NLKL", "LKLR"] (as opposed to the behavior of our language classifier features, which would generate ["NXNL_", "_XNLK_", "_NLKL_", "_LKLR"], using the underscore to indicate whether the 4-gram is at the beginning, middle, or end of the word). Since the fully concatenated name without whitespace is also double metaphone encoded and hashed to 4-grams in this fashion, it should also account for variance in the phonetic output resulting from spacing/concatenation differences (vowels at the beginning of a word are usually converted to "A" in double metaphone, whereas they're ignored in the middle of a word). Overall, though this method slightly reduces the precision of near-dupe hashing (it causes us to identify more potential dupes and thus make more pairwise comparisons in the next stage), it also increases the recall of the process (we don't miss as many true dupes).
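The two 4-gram schemes described above are easy to sketch:

```python
def quadgrams(s):
    """Unbounded 4-character windows, as used for near-dupe name hashes."""
    if len(s) <= 4:
        return [s]
    return [s[i:i + 4] for i in range(len(s) - 3)]

def positional_quadgrams(s):
    """Variant with boundary markers, as in the language classifier
    features: underscores mark middle-of-word 4-grams."""
    grams = quadgrams(s)
    if len(grams) == 1:
        return grams
    return ([grams[0] + "_"]
            + ["_" + g + "_" for g in grams[1:-1]]
            + ["_" + grams[-1]])

print(quadgrams("NXNLKLR"))             # ['NXNL', 'XNLK', 'NLKL', 'LKLR']
print(positional_quadgrams("NXNLKLR"))  # ['NXNL_', '_XNLK_', '_NLKL_', '_LKLR']
```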

    Component-wise deduping

    Once we have potential candidate dupe pairs, we provide per-component methods for comparing address/name pairs and determining if they're duplicates. Each relevant address component has its own function with certain logic for each, including which libpostal dictionaries to use and whether a root expansion match counts as an exact duplicate or not. For instance, in a secondary unit, "# 2", "Apt 2", and "Apt # 2" can be considered an exact match in English, whereas we wouldn't want to make that kind of assumption for street names, e.g. "Park Ave" and "Park Pl". In the latter case, we can still classify the street names as needing to be reviewed by a human.

    The duplicate functions return one of the following values:

    • LIBPOSTAL_NULL_DUPLICATE_STATUS
    • LIBPOSTAL_NON_DUPLICATE
    • LIBPOSTAL_POSSIBLE_DUPLICATE_NEEDS_REVIEW
    • LIBPOSTAL_LIKELY_DUPLICATE
    • LIBPOSTAL_EXACT_DUPLICATE

    The likely and exact classifications can be considered duplicates and merged automatically, whereas the needs_review response is for flagging possible duplicates.

    Having special functions for each component can also be useful down the line e.g. for deduping with compound house numbers/ranges (though this is not implemented yet).
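A conceptual sketch of the per-component logic: the `DupeStatus` values mirror the return values listed above, but `is_unit_duplicate` and `is_street_duplicate` here are toy illustrations, not the library's actual functions.

```python
from enum import Enum

class DupeStatus(Enum):  # mirrors the libpostal duplicate status values
    NULL_DUPLICATE_STATUS = 0
    NON_DUPLICATE = 1
    POSSIBLE_DUPLICATE_NEEDS_REVIEW = 2
    LIKELY_DUPLICATE = 3
    EXACT_DUPLICATE = 4

def is_unit_duplicate(u1, u2):
    """Toy unit comparator: drop unit designators, then compare.
    For units, a root match counts as exact ('Apt 2' == '# 2')."""
    strip = {"apt", "apartment", "unit", "#", "no"}
    root = lambda s: [t for t in s.lower().split() if t not in strip]
    if root(u1) == root(u2):
        return DupeStatus.EXACT_DUPLICATE
    return DupeStatus.NON_DUPLICATE

def is_street_duplicate(s1, s2):
    """Toy street comparator: exact string match is exact; same root with
    a different suffix ('Park Ave' vs 'Park Pl') needs human review."""
    if s1.lower() == s2.lower():
        return DupeStatus.EXACT_DUPLICATE
    if s1.lower().split()[:-1] == s2.lower().split()[:-1]:
        return DupeStatus.POSSIBLE_DUPLICATE_NEEDS_REVIEW
    return DupeStatus.NON_DUPLICATE

print(is_unit_duplicate("Apt 2", "# 2"))           # DupeStatus.EXACT_DUPLICATE
print(is_street_duplicate("Park Ave", "Park Pl"))  # DupeStatus.POSSIBLE_DUPLICATE_NEEDS_REVIEW
```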

    Since identifying the correct language is crucial to effective matching, and individual components like house_number and unit may not provide any useful information about the language, we also provide a function that returns the language(s) for an entire parsed/labeled address using all of its textual components. The returned language codes can be reused for subsequent calls.

    Fuzzy deduping for names

    For venue/street names, we also want to be able to handle inexact name matches, minor spelling differences, words out of order (we see this often with human names, which can sometimes be listed as Last, First Middle), and removal of tokens that may not be ignorable in terms of libpostal's dictionaries but are very common, or very common in a particular geography.

    In this release, we build on the idea of Soft-TFIDF, which blends a local similarity function (usually Jaro-Winkler in the literature, though we use a hybrid method) with global corpus statistics: TFIDF weights or other similar term weights, supplied by the user in our case. See the lieu project for details on constructing indices based on TFIDF or information gain (which performs better in the venue deduping setting), and their per-geo variants.

    Here's how it works:

    1. for strings s1 and s2, each token in s1 is aligned with its most similar token in s2 in terms of a user-specified local similarity metric, provided that it meets a specified similarity threshold. This allows for small spelling mistakes in the individual words and also makes the method invariant to word-order.

    2. given a vector of scores for each string, the final similarity is, for each token t1 in s1 and its closest match t2 in s2 (if local_sim >= theta): local_sim * scores1[t1] * scores2[t2]. In our case, the final fuzzy dot product is then normalized by the product of the L2 norms of each score vector, so it can be thought of as a form of soft cosine similarity. The two main weighting schemes are information gain and TFIDF. Using TFIDF means that rare words are given more weight in the similarity metric than very common words like "Salon" or "Barbershop". Using information gain means terms are weighted by relevance i.e. how much the given word tells us about other words, and can be thought of as running a mini feature selection algorithm or organizing the branches of a small decision tree for each string. Overall information gain tends to widen the gap between high-information, core name words and simple low-information words which can be safely ignored if the rest of the tokens match. Finally, it's possible to give all words equal weight using a uniform distribution (give each token a weight of 1 / # of tokens), and this is the approach we take for fuzzy street name comparisons.

    3. Assuming the chosen token scores add up to 1, the total similarity score for the string is between 0 and 1, and there are user-specified thresholds for when to consider the records various classes of dupes. The default threshold is 0.9 for likely dupes and 0.7 for needs_review, but these may be changed depending on tolerance for false positives.
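The steps above can be sketched as a soft cosine similarity. Here `difflib.SequenceMatcher` stands in for the hybrid local similarity metric described below, and uniform weights (1 / number of tokens) are used, as for fuzzy street name comparisons:

```python
from difflib import SequenceMatcher
from math import sqrt

def soft_cosine(tokens1, tokens2, scores1, scores2, theta=0.9):
    """Align each token in s1 with its most similar token in s2, keep
    alignments with local similarity >= theta, and normalize the fuzzy
    dot product by the L2 norms of the two score vectors."""
    dot = 0.0
    for t1 in tokens1:
        best = max(tokens2, key=lambda t2: SequenceMatcher(None, t1, t2).ratio())
        sim = SequenceMatcher(None, t1, best).ratio()
        if sim >= theta:
            dot += sim * scores1[t1] * scores2[best]
    norm1 = sqrt(sum(v * v for v in scores1.values()))
    norm2 = sqrt(sum(v * v for v in scores2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

# Uniform weights: a small misspelling barely lowers the similarity.
t1, t2 = ["main", "street"], ["main", "streets"]
w1 = {t: 1 / len(t1) for t in t1}
w2 = {t: 1 / len(t2) for t in t2}
print(soft_cosine(t1, t2, w1, w2) > 0.9)  # True
```

Swapping in TFIDF or information-gain weights for `w1`/`w2` changes how much each token contributes, which is the whole point of the weighting schemes discussed above.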

    Note: for the lieu project we use a linear combination of information gain (or TFIDF) and a geo-specific variant score where the index is computed for a specific, roughly city-sized geohash tile, and smaller tiles are lumped in with their larger neighbors. The geo-specific scores mean that something like "San Francisco Department of Building Inspection" and "Department of Building Inspection" can match because the words "San Francisco" are very common in the region. This approach was inspired by some of the research in https://research.fb.com/publications/deduplicating-a-places-database/.

    Unique to this implementation, we use a number of different local similarity metrics to qualify a given string for inclusion in the final similarity score:

    1. Jaro-Winkler similarity: this is a string similarity metric developed for comparing names in the U.S. Census. It detects small spelling differences in words based on the number of matches and transpositions relative to the lengths of the two strings. The Winkler variant gives a more favorable score to words with a shared common prefix. This is the local similarity metric used in most of the Soft-TFIDF literature, and we use the commonly-cited value of 0.9 for the inclusion threshold, which works reasonably well in practice. Note: all of our string similarity methods use unicode characters rather than bytes in their calculations.
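For reference, a plain Python implementation of Jaro-Winkler (Python strings are already sequences of Unicode characters, matching the character-based calculation described above):

```python
def jaro(s1, s2):
    """Jaro similarity: matches within a sliding window, penalized by
    transpositions among the matched characters."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(0, max(len(s1), len(s2)) // 2 - 1)
    m1, m2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count transpositions between the two matched-character sequences.
    t, j = 0, 0
    for i in range(len(s1)):
        if m1[i]:
            while not m2[j]:
                j += 1
            if s1[i] != s2[j]:
                t += 1
            j += 1
    t //= 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Winkler variant: boosts scores for strings with a shared prefix."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return sim + prefix * p * (1.0 - sim)

print(round(jaro_winkler("MARTHA", "MARHTA"), 3))  # 0.961
```

The classic "MARTHA"/"MARHTA" pair from the Census literature scores 0.961, above the 0.9 inclusion threshold mentioned above.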

    2. Damerau-Levenshtein distance: the traditional edit distance metric, where a transposition of two characters counts as a single edit. If a string does not meet the Jaro-Winkler threshold, but has a maximum edit distance of 1 (it could be that the first character was transposed) and a minimum length of 4 (many short strings are within edit distance 1 of each other, so we don't want to generate too many false positives), it still qualifies for inclusion. Note: since distances and similarities are not on the same scale, we use the Damerau-Levenshtein only as a qualifying threshold, and use the Jaro-Winkler similarity value (even though it did not meet the threshold) for the qualifying pair in the final similarity calculation.
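A sketch of the qualifying rule, using the optimal-string-alignment variant of Damerau-Levenshtein (an adjacent transposition counts as one edit):

```python
def damerau_levenshtein(s1, s2):
    """Optimal string alignment distance: insert, delete, substitute,
    plus adjacent transposition counted as a single edit."""
    d = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i in range(len(s1) + 1):
        d[i][0] = i
    for j in range(len(s2) + 1):
        d[0][j] = j
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
            if i > 1 and j > 1 and s1[i - 1] == s2[j - 2] and s1[i - 2] == s2[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(s1)][len(s2)]

def qualifies(s1, s2):
    """The qualifying rule described above: edit distance at most 1 and
    both strings at least 4 characters long."""
    return min(len(s1), len(s2)) >= 4 and damerau_levenshtein(s1, s2) <= 1

print(damerau_levenshtein("Main", "Mian"))  # 1 (one transposition)
print(qualifies("Main", "Mian"))            # True
print(qualifies("St", "Ts"))                # False (too short)
```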

    3. Sequence alignment with affine gap penalty and edit operation subtotals: a new, efficient method for sequence alignment and abbreviation detection. This builds on the Smith-Waterman-Gotoh algorithm with affine gap penalties, which was originally used for alignment of DNA sequences, but works well for other types of text. When we see a rare abbreviation that's not in the libpostal dictionaries, say "Service" and "Svc", the alignment would be "S--v-c-". In other words, we match "S", open a gap, extend that gap for two characters, then match "v", open another gap, extend it one character, match "c", open a gap, and extend it one more character at the end. In the original Smith-Waterman, O(mn) time and space was required to compute this alignment (where m is the length of the first string and n is the length of the second). Gotoh's improvement still needs O(mn) time and O(m) space (where m is the length of the longer string), but it does not store the sequence of operations, only a single cost where each type of edit pays a particular penalty, where the affine gap penalty is the idea that we should pay more for opening a gap than extending it. The problem with the single cost is it's not always clear what to make of that single combined score. The new method we use in libpostal stores and returns a breakdown of the counts and specific types of edits it makes (matches, mismatches, gap opens, gap extensions, and transpositions) rather than rolling them up into a single cost, and without needing to return or compute the full alignment as in Needleman-Wunsch or Hirschberg's variant. Using this method we know that for "Service" and "Svc", the number of matches is equal to the length of the shorter string, regardless of how many gaps were opened, and the two share a common prefix of "S", so "Svc" can be considered a possible abbreviation for "Service". 
When we find one of these possible abbreviations, and none of the other thresholds are met (which can easily happen with abbreviations), it qualifies both tokens for inclusion in the final similarity, again using their Jaro-Winkler similarity as the weight in the final calculation. For strict abbreviations (match the criteria for possible abbreviations and also share a common suffix e.g. "Festival" and "Fstvl") that are greater than a certain length, an optional static score is used (if it's higher than the Jaro-Winkler score), so that cases that really look like abbreviations can be weighted as more similar than the Jaro-Winkler score might indicate. We also use the square of the larger of the two term scores in place of their product, since we're effectively considering these words an exact match, and we calculate an appropriate offset for the lower-scoring vector's L2 norm to compensate.
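The abbreviation criterion above (matches equal to the length of the shorter string, plus a shared first character) can be approximated with a simple in-order subsequence check; the real method uses the affine-gap alignment described above, which also tracks gap opens, extensions, and transpositions:

```python
def is_possible_abbreviation(short, full):
    """Simplified stand-in for the alignment criterion: every character
    of the shorter string appears in order in the longer one, and the
    two strings share a first character."""
    short, full = short.lower(), full.lower()
    if not short or not full or short[0] != full[0]:
        return False
    it = iter(full)
    return all(c in it for c in short)  # consumes the iterator in order

print(is_possible_abbreviation("Svc", "Service"))    # True
print(is_possible_abbreviation("Fstvl", "Festival")) # True
print(is_possible_abbreviation("Ave", "Road"))       # False
```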

    4. Acronym alignments: especially prevalent in universities, non-profits, museums, government agencies, etc. We provide a language-based, stopword-aware acronym alignment method which can match "Museum of Modern Art" to "moma" (no capitalization needed), "University of California Berkeley" to "UC Berkeley", etc. If tokens in the shorter string are an acronym for tokens in the longer string, those tokens are included in the similarity score with a 1.0 local similarity (so their scores count as evidence for a match, not against it). For cases like "AB" vs. "A B", where the scores may differ significantly between the combined version and the separate single-letter version, we take the greater of a) the acronym score in the shorter string or b) the L2 norm of the individual words in the longer string, and again use an offset for the norm of the lower-scoring vector.

    5. Known-phrase alignments: similar to acronym alignment, libpostal's dictionaries are used at deduping time as well, and it two strings contain known phrases with the same canonical, those phrase spans are given the maximum similarity (product of L2 norms of both).

    6. Multi-word phrase alignments for spacing variations: again similar to acronym alignment, but for handling spacing variations like "de la Cruz" vs. "dela Cruz" or "Sea Grape Dr" vs. "Seagrape Dr" as exact matches. In a token-based approach, the first word in a phrase might match on Jaro-Winkler similarity alone, but the subsequent words in the string with spaces would count against the similarity value. Having multiple words align to their concatenated form pays no penalty for the whitespace.

    7. Match demotion for differing sets of initials: depending on the term weighting scheme used and the type of names in the corpus, the weights for single letters (found in human initials) may be low enough that they don't affect the similarity much. This is especially true for information gain, where a single letter like "A" or "B" may cooccur with many different names and end up conveying no more information than a common stopword like "the" or "of". When this is true, "A & B Jewelry" vs. "B & C Jewelry" might generate a false positive because the scores for "A" and "C" are so low compared to "Jewelry" that they are ignored. So given an otherwise likely dupe, with S1 and S2 the sets of single-letter tokens in the two strings respectively, we demote the match to needs_review if the symmetric difference is non-empty on both sides (i.e. both S1 - S2 and S2 - S1 are non-empty). Note: the typical case of human names with/without a middle initial will still match under this heuristic, i.e. "Yvette Clarke" matches "Yvette D Clarke" because the initial only exists in one string and there is no conflict. However, something like "J Dilla" vs. "K Dilla" would be classified as needing human review.


    The above assumes non-ideographic strings. In Chinese, Japanese, Korean, etc. we currently use the Jaccard similarity of the set of individual ideograms instead. In future versions it might be useful to weight the Jaccard similarity by TFIDF/information gain scores as well, and if we ever add a statistical word segmentation model for CJK languages, the word boundaries from that model could be used instead of ideograms.

    The fuzzy duplicate methods are currently implemented for venue names and street names, which seemed to make the most sense. The output for these methods is a struct containing the dupe classification as well as the similarity value itself.

    For fuzzy deduplication of street names, we implement a special case of "soft set containment". Here, the Soft-TFIDF/Soft-Information-Gain method reports the number of tokens that were matched (where "matched" means the soft notion of equality under the local similarity functions, excluding acronym alignments and the sequence alignment method, i.e. only Jaro-Winkler and Levenshtein). When one set of tokens is contained within the other, i.e. the number of matched tokens equals the number of tokens in the shorter of the two strings, we classify the pair as a likely duplicate regardless of the similarity value. This allows e.g. "Park" and "Park Ave" to match without allowing "Park Ave" and "Park St" to match (though the latter pair would be classified as needing review under the exact dupe method since they share a root expansion).

    Source code(tar.gz)
    Source code(zip)
  • v1.0.0(Apr 7, 2017)

    The great parser-data merge is complete. Libpostal 1.0 features a better-than-ever international address parser which achieves 99.45% full-parse accuracy on held-out addresses. The release title is a reference to the TV show (libpostal was also created in Brooklyn and this was the first version of the model to surpass 99% accuracy). Check out the blog post for the details. Here's a sample of what it can do in a GIF:

    parser

    Breaking API Changes

    • Every function, struct, constant, etc. defined in the public header (libpostal.h) now uses a "libpostal_" prefix. This affects all bindings that call the C API. The bindings that are part of this Github org all have 1.0 branches.

    New tags

    Sub-building tags

    • unit: an apartment, unit, office, lot, or other secondary unit designator
    • level: expressions indicating a floor number e.g. "3rd Floor", "Ground Floor", etc.
    • staircase: numbered/lettered staircase
    • entrance: numbered/lettered entrance
    • po_box: post office box; typically found in non-physical (mail-only) addresses

    Category tags

    • category: for category queries like "restaurants", etc.
    • near: phrases like "in", "near", etc. used after a category phrase to help with parsing queries like "restaurants in Brooklyn"

    New admin tags

    • island: named islands e.g. "Maui"
    • country_region: informal subdivision of a country without any political status
    • world_region: currently only used for appending “West Indies” after the country name, a pattern frequently used in the English-speaking Caribbean e.g. “Jamaica, West Indies”

    No more accent-stripping/transliteration of input

    There's a new transliterator which makes only simple modifications to the input (HTML entity normalization, NFC unicode normalization). Latin-ASCII transliteration is no longer used at runtime. Instead, addresses are transliterated to multiple forms during training so the parser has to deal with all the variants, rather than normalizing to a single variant at both training time and runtime (which previously was not even correct in cases like Finnish, Turkish, etc.).

    Trained on > 1 billion examples in every inhabited country on Earth

    The training data for libpostal's parser has been greatly expanded to include every country and dependency in OpenStreetMap. We also train on a places-only data set where every city name from OSM gets some representation even if there are no addresses (higher-population cities get examples proportional to their population). A similar training set is constructed for streets, so even places which have very few addresses but do have a road network in OSM can be included.

    1.0 also moves beyond OSM, training on most of the data sets in OpenAddresses, and postal codes + associated admins from Yahoo's GeoPlanet, which includes virtually every postcode in the UK, Canada, etc.

    Almost 100GB of public training data

    All files can be found under s3://libpostal/training_data/YYYY-MM-DD/parser/ as gzip'd tab-separated values (TSV) files formatted like: language\tcountry\taddress.

    • formatted_addresses_tagged.random.tsv.gz (ODBL): OSM addresses. Apartments, PO boxes, categories, etc. are added primarily to these examples
    • formatted_places_tagged.random.tsv.gz (ODBL): every toponym in OSM (even cities represented as points, etc.), reverse-geocoded to its parent admins, possibly including postal codes if they're listed on the point/polygon. Every place gets a base level of representation and places with higher populations get proportionally more.
    • formatted_ways_tagged.random.tsv.gz (ODBL): every street in OSM (ways with highway=*, with a few conditions), reverse-geocoded to its admins
    • geoplanet_formatted_addresses_tagged.random.tsv.gz (CC-BY): every postal code in Yahoo GeoPlanet (includes almost every postcode in the UK, Canada, etc.) and their parent admins. The GeoPlanet admins have been cleaned up and mapped to libpostal's tagset
    • openaddresses_formatted_addresses_tagged.random.tsv.gz (various licenses, mostly CC-BY): most of the address data sets from OpenAddresses, which in turn come directly from government sources
    • uk_openaddresses_formatted_addresses_tagged.random.tsv.gz (CC-BY): addresses from OpenAddresses UK

    If the parser doesn't perform as well as you'd hoped on a particular type of address, the best recourse is to use grep/awk to look through the training data and try to determine if there's some pattern/style of address that's not being captured.

    Better feature extraction

    • n-grams for the "unknown" words (occurred fewer than n times in the training set)
    • for unknown words that are hyphenated, each of the individual subwords if frequent enough, and their n-grams otherwise
    • an index of postcodes and their admin contexts built from the training data (the intuition is that something like "10001" could be a postcode or a house number, but if words like "New York", "NY", "United States", etc. are to its right or left, it's more likely to be a postcode).
    • for first words that are unknown (could be part of a venue/business name, could be a rare/misspelled street), a feature which finds the relative position of the next number and the next address phrase if present. Usually if the parser gets the first word in the string correct it will get the entire string correct.

    More powerful machine learning model (CRF)

    libpostal 1.0 uses a Conditional Random Field (CRF) instead of the greedy averaged perceptron. This more powerful machine learning method scores sequences rather than individual decisions, and can revise its previous decision if that would help a subsequent token score higher (Viterbi inference).

    Improves upon the CRFsuite implementation in terms of:

    1. performance: Viterbi inference sped up by 2x
    2. scalability: training set doesn't need to fit in memory
    3. model expressiveness: libpostal's CRF adds state-transition features which can make use of both the state of the current token and the previous tag. These act just like normal features except their weights are LxL matrices (tags we could have transitioned from by tags we could transition to) instead of L vectors.

    FTRL-Proximal optimization for the language classifier

    The language classifier now uses a multinomial version of Google's FTRL-Proximal method, which uses a combination of L1 and L2 regularization, inducing sparsity while maintaining high accuracy. This results in a model that is more accurate than the previous classifier while being 1/10th the size. The runtime classifier is now able to load either sparse or dense weights depending on the file header.

    Source code(tar.gz)
    Source code(zip)
    language_classifier.tar.gz(48.00 MB)
    libpostal_data.tar.gz(9.71 MB)
    parser.tar.gz(717.62 MB)
  • v0.3.4(Feb 9, 2017)

  • v0.3.3(Jan 9, 2017)

    This release adds functions for configuring the datadir at runtime.

    • libpostal_setup_datadir(char *dir)
    • libpostal_setup_language_classifier_datadir(char *dir)
    • libpostal_setup_parser_datadir(char *dir)
    Source code(tar.gz)
    Source code(zip)
  • v0.3.2(Dec 20, 2016)

    Merged a few of the commits from the parser-data branch of libpostal into master to fix address parser training from the master branch.

    Coincides with the release of some of the parser training data generated in the parser-data branch:

    1. OSM training addresses (27GB, ODBL) This is (a much-improved version of) the original data set used to train libpostal.
    2. OSM formatted place names/admins (4GB, ODBL) Helpful for making sure all the place names (cities, suburbs, etc.) in a country are part of the training set for libpostal, even if there are no addresses for that place.
    3. GeoPlanet postal codes with admins (11GB, CC-BY) Contains many postal codes from around the world, including the 1M+ postcodes in the UK, and their associated admins. If training on master, this may or may not help because it still relies pretty heavily on GeoNames for postcodes.
    4. OpenAddresses training addresses (30GB, various licenses) By far the largest data set. It's not every source from OpenAddresses, just the ones that are suitable for ingestion into libpostal. It's heavy on North America but also contains many of the EU countries. Most of the sources only require attribution, some have share-alike clauses. See openaddresses.io for more details.

    Users are encouraged to QA the data for problematic patterns, etc. Note: while it's possible now to train per-country/language parsers using slices of the data, there will be no support offered for custom parsers.

    Release is named after the largest train station in the world: https://en.wikipedia.org/wiki/Nagoya_Station.

    Source code(tar.gz)
    Source code(zip)
  • v0.3.1(Aug 29, 2016)

    • Data download script separates S3 files into 64MB chunks (as used in awscli) and uses a process pool
    • Various dictionary updates submitted by users
    Source code(tar.gz)
    Source code(zip)
Owner
openvenues
An open source project sponsored by Mapzen.