A modular, open-source search engine for our world.

Pelias is a geocoder powered completely by open data, available freely to everyone.

Local Installation · Cloud Webservice · Documentation · Community Chat

What is Pelias?
Pelias is a search engine for places worldwide, powered by open data. It turns addresses and place names into geographic coordinates, and turns geographic coordinates into places and addresses. With Pelias, you’re able to turn your users’ place searches into actionable geodata and transform your geodata into real places.

We think open data, open source, and open strategy win over proprietary solutions at any part of the stack and we want to ensure the services we offer are in line with that vision. We believe that an open geocoder improves over the long-term only if the community can incorporate truly representative local knowledge.

Pelias coarse geocoder

This repository provides all the code & geographic data you'll need to run your own coarse geocoder.

Read our 'An (almost) one line coarse geocoder with Docker' blog post for a quick start guide and check out our demo.

This service is intended to be run as part of the Pelias Geocoder but can just as easily be run independently as it has no external dependencies.

Natural language parser for geographic text

The engine takes unstructured input text, such as 'Neutral Bay North Sydney New South Wales' and attempts to deduce the geographic area the user is referring to.

Human beings (familiar with Australian geography) are able to quickly scan the text and establish that there are 3 distinct token groups: 'Neutral Bay', 'North Sydney' & 'New South Wales'.

The engine uses a similar technique to our brains, scanning across the text, cycling through a dictionary of learned terms and then trying to establish logical token groups.
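
The scanning step can be pictured as a greedy longest-match against a dictionary of known phrases. The sketch below is illustrative only: the `groupTokens` function and the tiny `dictionary` are hypothetical and are not Placeholder's actual implementation, which works against its database.

```javascript
// Hypothetical dictionary of known place-name phrases
// (in the real engine the learned terms come from the database).
const dictionary = new Set([
  'neutral bay', 'north sydney', 'new south wales', 'sydney', 'new',
]);

// Greedily group words into the longest phrases found in the dictionary.
function groupTokens(text, dict) {
  const words = text.toLowerCase().split(/\s+/).filter(Boolean);
  const groups = [];
  let i = 0;
  while (i < words.length) {
    let matched = false;
    // try the longest candidate phrase first, then shrink it word by word
    for (let j = words.length; j > i; j--) {
      const phrase = words.slice(i, j).join(' ');
      if (dict.has(phrase)) {
        groups.push(phrase);
        i = j;
        matched = true;
        break;
      }
    }
    if (!matched) i++; // skip a word that matches nothing
  }
  return groups;
}

console.log(groupTokens('Neutral Bay North Sydney New South Wales', dictionary));
// [ 'neutral bay', 'north sydney', 'new south wales' ]
```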

Once token groups have been established, a reductive algorithm is used to ensure that the token groups are logical in a geographic context. We don't want to return New York City for a term such as 'nyc france', so we need to only return things called 'nyc' inside places called 'france'.

The engine starts from the rightmost group, and works to the left, ensuring token groups represent geographic entities contained within those which came before. This process is repeated until it either runs out of groups, or would return 0 results.
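
A minimal sketch of this right-to-left reduction, assuming a toy in-memory gazetteer. The `places` table, its ids and the `reduce` function are hypothetical and exist only to illustrate the containment check; they are not the engine's real data model.

```javascript
// Hypothetical toy gazetteer: id -> { name, ancestors },
// where ancestors lists the ids of every place containing this one.
const places = {
  1: { name: 'france', ancestors: [] },
  2: { name: 'usa', ancestors: [] },
  3: { name: 'paris', ancestors: [1] }, // a Paris inside France
  4: { name: 'paris', ancestors: [2] }, // a Paris inside the USA
  5: { name: 'nyc', ancestors: [2] },
};

// Return the ids of every place whose name equals the token group.
function idsForName(name) {
  return Object.keys(places).map(Number).filter((id) => places[id].name === name);
}

// Work from the rightmost group to the left, keeping only candidates that are
// contained in one of the candidates found for the groups to their right.
function reduce(groups) {
  let current = null; // best estimate so far
  for (let i = groups.length - 1; i >= 0; i--) {
    let candidates = idsForName(groups[i]);
    if (current !== null) {
      candidates = candidates.filter((id) =>
        places[id].ancestors.some((a) => current.includes(a)));
    }
    if (candidates.length === 0) break; // would return 0 results, so stop here
    current = candidates;
  }
  return current || [];
}

console.log(reduce(['paris', 'france'])); // [ 3 ]
console.log(reduce(['nyc', 'france']));   // [ 1 ]
```

In the second call nothing named 'nyc' lies inside 'france', so the sketch stops and keeps the estimate from the previous step rather than returning a place outside France.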

The best estimation is then returned, either as a set of integers representing the ids of those regions, or as a JSON structure which also contains additional information such as population counts etc.

The data is sourced from the whosonfirst project, which also includes translations of place names in different languages.

Placeholder supports searching on and retrieving tokens in different languages and also offers support for synonyms and abbreviations.

The engine includes a rudimentary language detection algorithm which attempts to detect right-to-left languages and languages which write their addresses in major-to-minor format. It then reverses the tokens to put them into minor-to-major order.


Requirements

Placeholder requires Node.js and SQLite

See Pelias software requirements for required and recommended versions.

Install

$ git clone git@github.com:pelias/placeholder.git && cd placeholder
$ npm install

Download the required database files

Data hosting is provided by Geocode Earth. Other Pelias related downloads are available at https://geocode.earth/data.

$ mkdir data
$ curl -s https://data.geocode.earth/placeholder/store.sqlite3.gz | gunzip > data/store.sqlite3;

Confirm the build was successful

$ npm test
$ npm run cli -- san fran

> [email protected] cli
> node cmd/cli.js "san" "fran"

san fran

took: 3ms
 - 85922583	locality 	San Francisco

Run server

$ PORT=6100 npm start;

Configuration via Environment Variables

The service supports additional environment variables that affect its operation:

| Environment Variable | Default | Description |
| --- | --- | --- |
| HOST | undefined | The network address that the placeholder service will bind to. Defaults to whatever the current Node.js default is, which is currently to listen on 0.0.0.0 (all interfaces). See the Node.js Net documentation for more information. |
| PORT | 3000 | The TCP port that the placeholder service will use for incoming network connections. |
| PLACEHOLDER_DATA | ../data/ | Path to the directory where the placeholder service will find the store.sqlite3 database file. |
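
As a rough illustration of how these variables and their defaults combine, the sketch below reads them from an environment object. The `readConfig` helper is hypothetical and is not part of Placeholder; the default path is treated as an opaque string because the text above does not say what it is resolved against.

```javascript
const path = require('path');

// Hypothetical helper: apply the documented defaults to an environment object.
function readConfig(env = process.env) {
  const port = parseInt(env.PORT, 10);
  const dataDir = env.PLACEHOLDER_DATA || '../data/';
  return {
    host: env.HOST,                          // undefined lets Node.js choose its default
    port: Number.isNaN(port) ? 3000 : port,  // documented default is 3000
    dataDir,
    databaseFile: path.join(dataDir, 'store.sqlite3'),
  };
}

console.log(readConfig({ PORT: '6100' }));
```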

Open browser

the server should now be running and you should be able to access the HTTP API:

http://localhost:6100/

try the following paths:

/demo
/parser/search?text=london
/parser/findbyid?ids=101748479
/parser/query?text=london
/parser/tokenize?text=sydney new south wales
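
The paths above can also be assembled programmatically. This is a small sketch using the standard `URL` class; the `placeholderUrl` helper is a hypothetical name, and the base URL assumes the server was started with PORT=6100 as shown earlier.

```javascript
// Hypothetical helper: build a request URL for one of the /parser endpoints.
function placeholderUrl(endpoint, params, base = 'http://localhost:6100') {
  const url = new URL(`/parser/${endpoint}`, base);
  for (const [key, value] of Object.entries(params)) {
    url.searchParams.set(key, value); // values are URL-encoded automatically
  }
  return url.toString();
}

console.log(placeholderUrl('search', { text: 'london' }));
// http://localhost:6100/parser/search?text=london
console.log(placeholderUrl('tokenize', { text: 'sydney new south wales' }));
// http://localhost:6100/parser/tokenize?text=sydney+new+south+wales
```

The resulting string can be passed to any HTTP client.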

Changing languages

the /parser/search endpoint accepts a ?lang=xxx parameter which can be used to vary the language of data returned.

for example, the following urls will return strings in Japanese / Russian where available:

/parser/search?text=germany&lang=jpn
/parser/search?text=germany&lang=rus

documents returned by /parser/search contain a boolean property named languageDefaulted which indicates if the service was able to find a translation in the language you request (false) or whether it returned the default language (true).
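
A client might use this flag to tell translated results apart from fallbacks. The sketch below uses hand-written sample documents: only the languageDefaulted property comes from the description above, while the ids, the `name` field and the `partitionByTranslation` helper are hypothetical.

```javascript
// Hand-written sample documents; ids and names are placeholders for illustration.
const sample = [
  { id: 1, name: 'ドイツ', languageDefaulted: false }, // translation was found
  { id: 2, name: 'Germany', languageDefaulted: true }, // fell back to the default language
];

// Split results into those translated into the requested language
// and those that fell back to the default language.
function partitionByTranslation(docs) {
  return {
    translated: docs.filter((d) => d.languageDefaulted === false),
    defaulted: docs.filter((d) => d.languageDefaulted === true),
  };
}

const { translated, defaulted } = partitionByTranslation(sample);
console.log(translated.length, defaulted.length); // 1 1
```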

The /parser/findbyid endpoint also accepts a ?lang=xxx parameter, which returns only the selected language if a translation exists, and all translations otherwise.

for example, the following urls will return strings in French / Korean where available:

/parser/findbyid?ids=85633147,102191581,85862899&lang=fra
/parser/findbyid?ids=85633147,102191581,85862899&lang=kor

the demo is also able to serve responses in different languages by providing the language code in the URL anchor:

/demo#jpn
/demo#chi
/demo#eng
/demo#fra
... etc.

Filtering by placetype

the /parser/search endpoint accepts a ?placetype=xxx parameter which can be used to control the placetype of records which are returned.

the placetype filter does not provide any performance benefits; it is simply a convenience for filtering results against a whitelist.

you may specify multiple placetypes by separating them with commas, such as ?placetype=xxx,yyy. these are matched as OR conditions, eg: (xxx OR yyy)

for example:

the query search?text=luxemburg will return results for the country, region, locality etc.

you can use the placetype filter to control which records are returned:

# all matching results
search?text=luxemburg

# only return matching country records
search?text=luxemburg&placetype=country

# return matching country or region records
search?text=luxemburg&placetype=country,region
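
The whitelist behaviour described above amounts to an OR match over the listed placetypes. A minimal sketch, assuming each record carries a `placetype` property; the `filterByPlacetype` helper and the sample records are hypothetical.

```javascript
// Hypothetical helper: keep only records whose placetype is in the comma-separated whitelist.
// A missing or empty filter returns every record, like the unfiltered query.
function filterByPlacetype(records, placetypeParam) {
  if (!placetypeParam) return records;
  const allowed = new Set(
    placetypeParam.split(',').map((p) => p.trim()).filter(Boolean)
  );
  return records.filter((r) => allowed.has(r.placetype)); // OR across listed placetypes
}

// Hand-written sample records for illustration.
const records = [
  { name: 'Luxembourg', placetype: 'country' },
  { name: 'Luxembourg', placetype: 'region' },
  { name: 'Luxembourg', placetype: 'locality' },
];

console.log(filterByPlacetype(records, 'country,region').map((r) => r.placetype));
// [ 'country', 'region' ]
```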

Live mode (BETA)

the /parser/search endpoint accepts a ?mode=live parameter pair which can be used to enable an autocomplete-style API.

in this mode the final token of each input text is considered as 'incomplete', meaning that the user has potentially only typed part of a token.

this mode is currently in BETA; the interface and behaviour may change over time.
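
One way to picture the 'incomplete' final token is as a prefix match, while earlier tokens must match exactly. The sketch below is an assumption about the general idea, not the service's implementation; `matchTokens` and `vocabulary` are hypothetical.

```javascript
// Hypothetical vocabulary of known tokens.
const vocabulary = ['london', 'londonderry', 'longview', 'paris'];

// In live mode the final word may be partially typed, so it is matched as a prefix;
// every earlier word must match a known token exactly.
function matchTokens(text, vocab, live) {
  const words = text.toLowerCase().split(/\s+/).filter(Boolean);
  return words.map((word, i) => {
    const isLast = i === words.length - 1;
    return vocab.filter((t) => (live && isLast ? t.startsWith(word) : t === word));
  });
}

console.log(matchTokens('paris lon', vocabulary, true));
// [ [ 'paris' ], [ 'london', 'londonderry', 'longview' ] ]
console.log(matchTokens('paris lon', vocabulary, false));
// [ [ 'paris' ], [] ]
```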

Configuring the rtree threshold

the default matching strategy uses the lineage table to ensure that token pairs represent a valid child->parent relationship. this ensures that queries like 'London France' do not match, because there is no entry in the lineage table linking those two places together.

in some cases it's preferable to fall back to a matching strategy which considers geographically nearby places with a matching name, even if that relationship does not explicitly exist in the lineage table.

for example, 'Basel France' will return 'Basel Switzerland'. this is useful for handling user input errors, as well as errors and omissions in the lineage table.

in the example above, 'Basel France' only matches because the bounding box of 'Basel' overlaps the bounding box of 'France' and no other valid entry for 'Basel France' exists.

the definition of 'nearby' is configurable: the bbox for the minor term (left token) is expanded by a threshold, which is subtracted from the minimum coordinates and added to the maximum coordinates of the bbox.

by default the threshold is set to 0.2 (degrees). any float value between 0 and 1 may be specified via the environment variable RTREE_THRESHOLD.

a setting of less than 0 will disable the rtree functionality completely. disabling the rtree will result in nearby queries such as 'Basel France' returning 'France' instead of 'Basel Switzerland'.
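
The bounding box check can be sketched as follows. This is an illustration under stated assumptions: the `{ minLon, minLat, maxLon, maxLat }` shape, the helper names and the sample coordinates are hypothetical, and the real service uses an rtree index rather than pairwise comparison.

```javascript
// Grow a bbox by `threshold` degrees on every side.
function expand(bbox, threshold) {
  return {
    minLon: bbox.minLon - threshold,
    minLat: bbox.minLat - threshold,
    maxLon: bbox.maxLon + threshold,
    maxLat: bbox.maxLat + threshold,
  };
}

// True when two axis-aligned boxes intersect or touch.
function overlaps(a, b) {
  return a.minLon <= b.maxLon && a.maxLon >= b.minLon &&
         a.minLat <= b.maxLat && a.maxLat >= b.minLat;
}

// The minor (left) term is 'nearby' the major term when its expanded bbox overlaps.
function isNearby(minor, major, threshold = 0.2) {
  if (threshold < 0) return false; // a negative threshold disables the fallback
  return overlaps(expand(minor, threshold), major);
}

// Made-up boxes: the town sits 0.15 degrees west of the country's border.
const town = { minLon: 10.0, minLat: 10.0, maxLon: 10.1, maxLat: 10.1 };
const country = { minLon: 10.25, minLat: 9.0, maxLon: 12.0, maxLat: 11.0 };

console.log(isNearby(town, country));      // true  (default threshold of 0.2)
console.log(isNearby(town, country, 0));   // false (no expansion, boxes do not touch)
console.log(isNearby(town, country, -1));  // false (fallback disabled)
```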


Run the interactive shell

$ npm run repl

> [email protected] repl
> node cmd/repl.js

placeholder >

try the following commands:

placeholder > london on
 - 101735809	locality 	London

placeholder > search london on
 - 101735809	locality 	London

placeholder > tokenize sydney new south wales
 [ [ 'sydney', 'new south wales' ] ]

placeholder > token kelburn
 [ 1729339019 ]

placeholder > id 1729339019
 { name: 'Kelburn',
   placetype: 'neighbourhood',
   lineage:
    { continent_id: 102191583,
      country_id: 85633345,
      county_id: 102079339,
      locality_id: 101915529,
      neighbourhood_id: 1729339019,
      region_id: 85687233 },
   names: { eng: [ 'Kelburn' ] } }

Configuration for pelias API

While Placeholder can be used as a stand-alone application or included with other geographic software / search engines, it is designed for the Pelias geocoder.

To connect the Placeholder service to the Pelias API, configure the pelias config file with the port that placeholder is running on.
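
The fragment below prints a config snippet of the kind this step refers to. It is a sketch based on an assumption: that the Pelias API reads service URLs from an api.services section of pelias.json. Check the Pelias API documentation for the authoritative key names. The port matches the 'Run server' example above.

```javascript
// Assumed layout of the relevant pelias.json section (verify against the Pelias API docs).
const peliasConfig = {
  api: {
    services: {
      placeholder: {
        url: 'http://localhost:6100', // host and port the placeholder service listens on
      },
    },
  },
};

// Print the fragment as JSON, ready to merge into an existing pelias.json.
console.log(JSON.stringify(peliasConfig, null, 2));
```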


Tests

run the test suite

$ npm test

Run the functional cases

there are more exhaustive test cases included in test/cases/.

to run all the test cases:

$ npm run funcs

Generate a ~500,000 line test file

this command requires the data/wof.extract file mentioned below in the 'building the database' section.

$ npm run gentests

once complete you can find the generated test cases in test/cases/generated.txt.


Docker

Build the service image

$ docker-compose build

Run the service in the background

$ docker-compose up -d

Building the database

Prerequisites

  • jq 1.5+ must be installed
    • on ubuntu: sudo apt-get install jq
    • on mac: brew install jq
  • Who's on First data download

Steps

the database is created from geographic data sourced from the whosonfirst project.

the whosonfirst project is distributed as geojson files, so in order to speed up development we first extract the relevant data into a file: data/wof.extract.

the following command will iterate over all the geojson files under the WOF_DIR path, extracting the relevant properties into the file data/wof.extract.

this process can take 30-60 minutes to run and consumes ~350MB of disk space. you will only need to run this command once, or again when your local whosonfirst-data files are updated.

$ WOF_DIR=/data/whosonfirst-data/data npm run extract

now you can rebuild the data/store.sqlite3 database file with the following command, which should take 2-3 minutes to run:

$ npm run build

Using the Docker image

Rebuild the image

you can rebuild the image on any system with the following command:

$ docker build -t pelias/placeholder .

Download pre-built image

Up-to-date Docker images are built and automatically pushed to Docker Hub from our continuous integration pipeline.

You can pull the latest stable image with:

$ docker pull pelias/placeholder

Download custom image tags

We publish each commit and the latest of each branch to separate tags.

A list of all available tags to download can be found at https://hub.docker.com/r/pelias/placeholder/tags/