Showing 5 changed files with 259 additions and 20 deletions.
`_analyzers/character-filters/pattern-replace-character-filter.md` (238 additions, 0 deletions)

---
layout: default
title: Pattern replace
parent: Character filters
nav_order: 130
---

# Pattern replace character filter

The `pattern_replace` character filter allows you to use regular expressions to define patterns for matching and replacing characters in the input text. It is a flexible tool for advanced text transformations, especially when dealing with complex string patterns.

This filter replaces all instances of a pattern with a specified replacement string, allowing for easy substitutions, deletions, or complex modifications of the input text. You can use it to normalize the input before tokenization.

## Example

To standardize phone numbers, you'll use the regular expression `[\\s()-]+`:

- `[ ]`: Defines a **character class**, meaning it will match **any one** of the characters inside the brackets.
- `\\s`: Matches any **white space** character, such as a space, tab, or newline.
- `()`: Matches literal **parentheses** (`(` or `)`).
- `-`: Matches a literal **hyphen** (`-`).
- `+`: Specifies that the pattern should match **one or more** occurrences of the preceding characters.

The pattern `[\\s()-]+` will match any sequence of one or more white space characters, parentheses, or hyphens and remove it from the input text. This ensures that the phone numbers are normalized and contain only digits.

The following request standardizes phone numbers by removing spaces, dashes, and parentheses:

```json
GET /_analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "[\\s()-]+",
      "replacement": ""
    }
  ],
  "text": "(555) 123-4567"
}
```
{% include copy-curl.html %}

The response contains the generated token:

```json
{
  "tokens": [
    {
      "token": "5551234567",
      "start_offset": 1,
      "end_offset": 14,
      "type": "<NUM>",
      "position": 0
    }
  ]
}
```

## Parameters

The `pattern_replace` character filter supports the following parameters.

| Parameter | Required/Optional | Data type | Description |
| :--- | :--- | :--- | :--- |
| `pattern` | Required | String | A regular expression used to match parts of the input text. The filter identifies and matches this pattern to perform replacement. |
| `replacement` | Optional | String | The string that replaces pattern matches. Use an empty string (`""`) to remove the matched text. Default is an empty string (`""`). |
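
Because `replacement` defaults to an empty string, you can omit it when you only want to delete the matched text. The following request is a minimal sketch that relies on that default; the `SKU-1234-AB` sample text and the `\\D+` (any run of non-digit characters) pattern are illustrative, and the request should return a single `1234` token:

```json
GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "\\D+"
    }
  ],
  "text": "SKU-1234-AB"
}
```
{% include copy-curl.html %}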

## Creating a custom analyzer

The following request creates an index with a custom analyzer configured with a `pattern_replace` character filter. The filter removes the currency symbols (`$`, `€`) and the thousands and decimal separators (`,` and `.`) from numbers:

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "pattern_char_filter"
          ]
        }
      },
      "char_filter": {
        "pattern_char_filter": {
          "type": "pattern_replace",
          "pattern": "[$€,.]",
          "replacement": ""
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

Use the following request to examine the tokens generated using the analyzer:

```json
POST /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Total: $ 1,200.50 and € 1.100,75"
}
```
{% include copy-curl.html %}

The response contains the generated tokens:

```json
{
  "tokens": [
    {
      "token": "Total",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "120050",
      "start_offset": 9,
      "end_offset": 17,
      "type": "<NUM>",
      "position": 1
    },
    {
      "token": "and",
      "start_offset": 18,
      "end_offset": 21,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "110075",
      "start_offset": 24,
      "end_offset": 32,
      "type": "<NUM>",
      "position": 3
    }
  ]
}
```

## Using capturing groups

You can use capturing groups in the `replacement` parameter. For example, the following request creates a custom analyzer that uses a `pattern_replace` character filter to replace hyphens with dots in phone numbers. The pattern `(\\d+)-(?=\\d)` captures one or more digits before a hyphen into group 1 and uses a lookahead (`(?=\\d)`) to require, without consuming, a digit after the hyphen; the replacement `$1.` writes back the captured digits followed by a period:

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "pattern_char_filter"
          ]
        }
      },
      "char_filter": {
        "pattern_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d+)-(?=\\d)",
          "replacement": "$1."
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

Use the following request to examine the tokens generated using the analyzer:

```json
POST /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Call me at 555-123-4567 or 555-987-6543"
}
```
{% include copy-curl.html %}

The response contains the generated tokens:

```json
{
  "tokens": [
    {
      "token": "Call",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "me",
      "start_offset": 5,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "at",
      "start_offset": 8,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "555.123.4567",
      "start_offset": 11,
      "end_offset": 23,
      "type": "<NUM>",
      "position": 3
    },
    {
      "token": "or",
      "start_offset": 24,
      "end_offset": 26,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "555.987.6543",
      "start_offset": 27,
      "end_offset": 39,
      "type": "<NUM>",
      "position": 5
    }
  ]
}
```

@@ -30,34 +30,34 @@ Word tokenizers parse full text into words.

Tokenizer | Description | Example
:--- | :--- | :---
[`standard`]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/standard/) | - Parses strings into tokens at word boundaries <br> - Removes most punctuation | `It’s fun to contribute a brand-new PR or 2 to OpenSearch!` <br>becomes<br> [`It’s`, `fun`, `to`, `contribute`, `a`, `brand`, `new`, `PR`, `or`, `2`, `to`, `OpenSearch`]
[`letter`]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/letter/) | - Parses strings into tokens on any non-letter character <br> - Removes non-letter characters | `It’s fun to contribute a brand-new PR or 2 to OpenSearch!` <br>becomes<br> [`It`, `s`, `fun`, `to`, `contribute`, `a`, `brand`, `new`, `PR`, `or`, `to`, `OpenSearch`]
[`lowercase`]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/lowercase/) | - Parses strings into tokens on any non-letter character <br> - Removes non-letter characters <br> - Converts terms to lowercase | `It’s fun to contribute a brand-new PR or 2 to OpenSearch!` <br>becomes<br> [`it`, `s`, `fun`, `to`, `contribute`, `a`, `brand`, `new`, `pr`, `or`, `to`, `opensearch`]
[`whitespace`]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/whitespace/) | - Parses strings into tokens at white space characters | `It’s fun to contribute a brand-new PR or 2 to OpenSearch!` <br>becomes<br> [`It’s`, `fun`, `to`, `contribute`, `a`, `brand-new`, `PR`, `or`, `2`, `to`, `OpenSearch!`]
[`uax_url_email`]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/uax-url-email/) | - Similar to the standard tokenizer <br> - Unlike the standard tokenizer, leaves URLs and email addresses as single terms | `It’s fun to contribute a brand-new PR or 2 to OpenSearch [email protected]!` <br>becomes<br> [`It’s`, `fun`, `to`, `contribute`, `a`, `brand`, `new`, `PR`, `or`, `2`, `to`, `OpenSearch`, `[email protected]`]
[`classic`]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/classic/) | - Parses strings into tokens on: <br>   - Punctuation characters that are followed by a white space character <br>   - Hyphens if the term does not contain numbers <br> - Removes punctuation <br> - Leaves URLs and email addresses as single terms | `Part number PA-35234, single-use product (128.32)` <br>becomes<br> [`Part`, `number`, `PA-35234`, `single`, `use`, `product`, `128.32`]
[`thai`]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/thai/) | - Parses Thai text into terms | `สวัสดีและยินดีต` <br>becomes<br> [`สวัสด`, `และ`, `ยินดี`, `ต`]
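
You can reproduce any of the examples in this table by running the sample text through the `_analyze` API with the corresponding tokenizer. The following request is a quick sketch that applies the `standard` tokenizer to the sentence used above:

```json
GET /_analyze
{
  "tokenizer": "standard",
  "text": "It’s fun to contribute a brand-new PR or 2 to OpenSearch!"
}
```
{% include copy-curl.html %}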

### Partial word tokenizers

Partial word tokenizers parse text into words and generate fragments of those words for partial word matching.

Tokenizer | Description | Example
:--- | :--- | :---
[`ngram`]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/ngram/) | - Parses strings into words on specified characters (for example, punctuation or white space characters) and generates n-grams of each word | `My repo` <br>becomes<br> [`M`, `My`, `y`, `y `, <code> </code>, <code> r</code>, `r`, `re`, `e`, `ep`, `p`, `po`, `o`] <br> because the default n-gram length is 1--2 characters
[`edge_ngram`]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/edge-n-gram/) | - Parses strings into words on specified characters (for example, punctuation or white space characters) and generates edge n-grams of each word (n-grams that start at the beginning of the word) | `My repo` <br>becomes<br> [`M`, `My`] <br> because the default n-gram length is 1--2 characters
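
As a rough check of the default 1--2 character n-gram length noted in the table, you can run the `ngram` tokenizer with its default settings through the `_analyze` API:

```json
GET /_analyze
{
  "tokenizer": "ngram",
  "text": "My repo"
}
```
{% include copy-curl.html %}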

### Structured text tokenizers

Structured text tokenizers parse structured text, such as identifiers, email addresses, paths, or ZIP Codes.

Tokenizer | Description | Example
:--- | :--- | :---
[`keyword`]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/keyword/) | - No-op tokenizer <br> - Outputs the entire string unchanged <br> - Can be combined with token filters, like lowercase, to normalize terms | `My repo` <br>becomes<br> `My repo`
[`pattern`]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/pattern/) | - Uses a regular expression pattern to parse text into terms on a word separator or to capture matching text as terms <br> - Uses [Java regular expressions](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html) | `https://opensearch.org/forum` <br>becomes<br> [`https`, `opensearch`, `org`, `forum`] because by default the tokenizer splits terms at word boundaries (`\W+`)<br> Can be configured with a regex pattern
[`simple_pattern`]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/simple-pattern/) | - Uses a regular expression pattern to return matching text as terms <br> - Uses [Lucene regular expressions](https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/util/automaton/RegExp.html) <br> - Faster than the `pattern` tokenizer because it uses a subset of the `pattern` tokenizer regular expressions | Returns an empty array by default <br> Must be configured with a pattern because the pattern defaults to an empty string
[`simple_pattern_split`]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/simple-pattern-split/) | - Uses a regular expression pattern to split the text on matches rather than returning the matches as terms <br> - Uses [Lucene regular expressions](https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/util/automaton/RegExp.html) <br> - Faster than the `pattern` tokenizer because it uses a subset of the `pattern` tokenizer regular expressions | No-op by default<br> Must be configured with a pattern
[`char_group`]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/character-group/) | - Parses on a set of configurable characters <br> - Faster than tokenizers that run regular expressions | No-op by default<br> Must be configured with a list of characters
[`path_hierarchy`]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/path-hierarchy/) | - Parses text on the path separator (by default, `/`) and returns a full path to each component in the tree hierarchy | `one/two/three` <br>becomes<br> [`one`, `one/two`, `one/two/three`]
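
Similarly, you can confirm the `path_hierarchy` behavior shown in the table with a quick `_analyze` request:

```json
GET /_analyze
{
  "tokenizer": "path_hierarchy",
  "text": "one/two/three"
}
```
{% include copy-curl.html %}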