Commit
Merge branch 'main' into main
Naarcha-AWS authored Jan 6, 2025
2 parents d0c6022 + a66d54e commit 94c7448
Showing 5 changed files with 259 additions and 20 deletions.
2 changes: 1 addition & 1 deletion _analyzers/character-filters/html-character-filter.md
@@ -9,7 +9,7 @@ nav_order: 100

The `html_strip` character filter removes HTML tags, such as `<div>`, `<p>`, and `<a>`, from the input text and renders plain text. The filter can be configured to preserve certain tags or decode specific HTML entities, such as `&nbsp;`, into spaces.

## Example: HTML analyzer
## Example

The following request applies an `html_strip` character filter to the provided text:
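That request body is collapsed in this diff view. As a minimal sketch, a request of this kind (with illustrative sample text) takes the following form:

```json
GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [ "html_strip" ],
  "text": "<p>Hello <b>OpenSearch</b>!</p>"
}
```

The filter strips the HTML tags so that only the plain text is tokenized.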

6 changes: 3 additions & 3 deletions _analyzers/character-filters/index.md
@@ -14,6 +14,6 @@ Unlike token filters, which operate on tokens (words or terms), character filter

Use cases for character filters include:

- **HTML stripping:** Removes HTML tags from content so that only the plain text is indexed.
- **Pattern replacement:** Replaces or removes unwanted characters or patterns in text, for example, converting hyphens to spaces.
- **Custom mappings:** Substitutes specific characters or sequences with other values, for example, to convert currency symbols into their textual equivalents.
- **HTML stripping**: The [`html_strip`]({{site.url}}{{site.baseurl}}/analyzers/character-filters/html-character-filter/) character filter removes HTML tags from content so that only the plain text is indexed.
- **Pattern replacement**: The [`pattern_replace`]({{site.url}}{{site.baseurl}}/analyzers/character-filters/pattern-replace-character-filter/) character filter replaces or removes unwanted characters or patterns in text, for example, converting hyphens to spaces.
- **Custom mappings**: The [`mapping`]({{site.url}}{{site.baseurl}}/analyzers/character-filters/mapping-character-filter/) character filter substitutes specific characters or sequences with other values, for example, to convert currency symbols into their textual equivalents.
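As a quick illustration of the last use case, the following request is a minimal sketch that tests a `mapping` character filter directly with the `_analyze` API (the specific currency mappings and sample text are illustrative):

```json
GET /_analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        "$ => dollar",
        "€ => euro"
      ]
    }
  ],
  "text": "The total is $100"
}
```

Here, `$100` becomes `dollar100` before tokenization.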
3 changes: 2 additions & 1 deletion _analyzers/character-filters/mapping-character-filter.md
@@ -36,6 +36,7 @@ GET /_analyze
"text": "I have III apples and IV oranges"
}
```
{% include copy-curl.html %}

The response contains a token where Roman numerals have been replaced with Arabic numerals:

@@ -52,7 +53,6 @@ The response contains a token where Roman numerals have been replaced with Arabi
]
}
```
{% include copy-curl.html %}

## Parameters

@@ -106,6 +106,7 @@ GET /text-index/_analyze
"text": "FYI, updates to the workout schedule are posted. IDK when it takes effect, but we have some details. BTW, the finalized schedule will be released Monday."
}
```
{% include copy-curl.html %}

The response shows that the abbreviations were replaced:

238 changes: 238 additions & 0 deletions _analyzers/character-filters/pattern-replace-character-filter.md
@@ -0,0 +1,238 @@
---
layout: default
title: Pattern replace
parent: Character filters
nav_order: 130
---

# Pattern replace character filter

The `pattern_replace` character filter allows you to use regular expressions to define patterns for matching and replacing characters in the input text. It is a flexible tool for advanced text transformations, especially when dealing with complex string patterns.

This filter replaces all instances of a pattern with a specified replacement string, allowing for easy substitutions, deletions, or complex modifications of the input text. You can use it to normalize the input before tokenization.

## Example

To standardize phone numbers, you'll use the regular expression `[\\s()-]+`:

- `[ ]`: Defines a **character class**, meaning it will match **any one** of the characters inside the brackets.
- `\\s`: Matches any **white space** character, such as a space, tab, or newline.
- `()`: Matches literal **parentheses** (`(` or `)`).
- `-`: Matches a literal **hyphen** (`-`).
- `+`: Specifies that the pattern should match **one or more** occurrences of the preceding character class.

The pattern `[\\s()-]+` will match any sequence of one or more white space characters, parentheses, or hyphens and remove it from the input text. This ensures that the phone numbers are normalized and contain only digits.

The following request standardizes phone numbers by removing spaces, dashes, and parentheses:

```json
GET /_analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "[\\s()-]+",
      "replacement": ""
    }
  ],
  "text": "(555) 123-4567"
}
```
{% include copy-curl.html %}

The response contains the generated token:

```json
{
  "tokens": [
    {
      "token": "5551234567",
      "start_offset": 1,
      "end_offset": 14,
      "type": "<NUM>",
      "position": 0
    }
  ]
}
```

## Parameters

The `pattern_replace` character filter can be configured with the following parameters.

| Parameter | Required/Optional | Data type | Description |
|:---|:---|:---|:---|
| `pattern` | Required | String | A regular expression used to match parts of the input text. The filter identifies and matches this pattern to perform replacement. |
| `replacement` | Optional | String | The string that replaces pattern matches. Use an empty string (`""`) to remove the matched text. Default is an empty string (`""`). |

## Creating a custom analyzer

The following request creates an index with a custom analyzer configured with a `pattern_replace` character filter. The filter removes the `$` and `€` currency symbols as well as any `,` or `.` separators from numbers:

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "pattern_char_filter"
          ]
        }
      },
      "char_filter": {
        "pattern_char_filter": {
          "type": "pattern_replace",
          "pattern": "[$€,.]",
          "replacement": ""
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

Use the following request to examine the tokens generated using the analyzer:

```json
POST /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Total: $ 1,200.50 and € 1.100,75"
}
```
{% include copy-curl.html %}

The response contains the generated tokens:

```json
{
  "tokens": [
    {
      "token": "Total",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "120050",
      "start_offset": 9,
      "end_offset": 17,
      "type": "<NUM>",
      "position": 1
    },
    {
      "token": "and",
      "start_offset": 18,
      "end_offset": 21,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "110075",
      "start_offset": 24,
      "end_offset": 32,
      "type": "<NUM>",
      "position": 3
    }
  ]
}
```

## Using capturing groups

You can use capturing groups in the `replacement` parameter. For example, the following request creates a custom analyzer that uses a `pattern_replace` character filter to replace hyphens with dots in phone numbers:

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "pattern_char_filter"
          ]
        }
      },
      "char_filter": {
        "pattern_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d+)-(?=\\d)",
          "replacement": "$1."
        }
      }
    }
  }
}
```
{% include copy-curl.html %}

Use the following request to examine the tokens generated using the analyzer:

```json
POST /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Call me at 555-123-4567 or 555-987-6543"
}
```
{% include copy-curl.html %}

The response contains the generated tokens:

```json
{
  "tokens": [
    {
      "token": "Call",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "me",
      "start_offset": 5,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "at",
      "start_offset": 8,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "555.123.4567",
      "start_offset": 11,
      "end_offset": 23,
      "type": "<NUM>",
      "position": 3
    },
    {
      "token": "or",
      "start_offset": 24,
      "end_offset": 26,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "555.987.6543",
      "start_offset": 27,
      "end_offset": 39,
      "type": "<NUM>",
      "position": 5
    }
  ]
}
```
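
To apply this analyzer at index time, reference it in a field mapping. The following request is a minimal sketch (the `phone` field name is hypothetical):

```json
PUT /my_index/_mapping
{
  "properties": {
    "phone": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}
```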
30 changes: 15 additions & 15 deletions _analyzers/tokenizers/index.md
@@ -30,34 +30,34 @@ Word tokenizers parse full text into words.

Tokenizer | Description | Example
:--- | :--- | :---
`standard` | - Parses strings into tokens at word boundaries <br> - Removes most punctuation | `It’s fun to contribute a brand-new PR or 2 to OpenSearch!` <br>becomes<br> [`It’s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `PR`, `or`, `2`, `to`, `OpenSearch`]
`letter` | - Parses strings into tokens on any non-letter character <br> - Removes non-letter characters | `It’s fun to contribute a brand-new PR or 2 to OpenSearch!` <br>becomes<br> [`It`, `s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `PR`, `or`, `to`, `OpenSearch`]
`lowercase` | - Parses strings into tokens on any non-letter character <br> - Removes non-letter characters <br> - Converts terms to lowercase | `It’s fun to contribute a brand-new PR or 2 to OpenSearch!` <br>becomes<br> [`it`, `s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `pr`, `or`, `to`, `opensearch`]
`whitespace` | - Parses strings into tokens at white space characters | `It’s fun to contribute a brand-new PR or 2 to OpenSearch!` <br>becomes<br> [`It’s`, `fun`, `to`, `contribute`, `a`,`brand-new`, `PR`, `or`, `2`, `to`, `OpenSearch!`]
`uax_url_email` | - Similar to the standard tokenizer <br> - Unlike the standard tokenizer, leaves URLs and email addresses as single terms | `It’s fun to contribute a brand-new PR or 2 to OpenSearch [email protected]!` <br>becomes<br> [`It’s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `PR`, `or`, `2`, `to`, `OpenSearch`, `[email protected]`]
`classic` | - Parses strings into tokens on: <br> &emsp; - Punctuation characters that are followed by a white space character <br> &emsp; - Hyphens if the term does not contain numbers <br> - Removes punctuation <br> - Leaves URLs and email addresses as single terms | `Part number PA-35234, single-use product (128.32)` <br>becomes<br> [`Part`, `number`, `PA-35234`, `single`, `use`, `product`, `128.32`]
`thai` | - Parses Thai text into terms | `สวัสดีและยินดีต` <br>becomes<br> [`สวัสด`, `และ`, `ยินดี`, ``]
[`standard`]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/standard/) | - Parses strings into tokens at word boundaries <br> - Removes most punctuation | `It’s fun to contribute a brand-new PR or 2 to OpenSearch!` <br>becomes<br> [`It’s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `PR`, `or`, `2`, `to`, `OpenSearch`]
[`letter`]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/letter/) | - Parses strings into tokens on any non-letter character <br> - Removes non-letter characters | `It’s fun to contribute a brand-new PR or 2 to OpenSearch!` <br>becomes<br> [`It`, `s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `PR`, `or`, `to`, `OpenSearch`]
[`lowercase`]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/lowercase/) | - Parses strings into tokens on any non-letter character <br> - Removes non-letter characters <br> - Converts terms to lowercase | `It’s fun to contribute a brand-new PR or 2 to OpenSearch!` <br>becomes<br> [`it`, `s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `pr`, `or`, `to`, `opensearch`]
[`whitespace`]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/whitespace/) | - Parses strings into tokens at white space characters | `It’s fun to contribute a brand-new PR or 2 to OpenSearch!` <br>becomes<br> [`It’s`, `fun`, `to`, `contribute`, `a`,`brand-new`, `PR`, `or`, `2`, `to`, `OpenSearch!`]
[`uax_url_email`]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/uax-url-email/) | - Similar to the standard tokenizer <br> - Unlike the standard tokenizer, leaves URLs and email addresses as single terms | `It’s fun to contribute a brand-new PR or 2 to OpenSearch [email protected]!` <br>becomes<br> [`It’s`, `fun`, `to`, `contribute`, `a`,`brand`, `new`, `PR`, `or`, `2`, `to`, `OpenSearch`, `[email protected]`]
[`classic`]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/classic/) | - Parses strings into tokens on: <br> &emsp; - Punctuation characters that are followed by a white space character <br> &emsp; - Hyphens if the term does not contain numbers <br> - Removes punctuation <br> - Leaves URLs and email addresses as single terms | `Part number PA-35234, single-use product (128.32)` <br>becomes<br> [`Part`, `number`, `PA-35234`, `single`, `use`, `product`, `128.32`]
[`thai`]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/thai/) | - Parses Thai text into terms | `สวัสดีและยินดีต` <br>becomes<br> [`สวัสด`, `และ`, `ยินดี`, ``]
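
You can try any of these tokenizers using the `_analyze` API. The following request is a minimal sketch that applies the built-in `standard` tokenizer to sample text:

```json
POST /_analyze
{
  "tokenizer": "standard",
  "text": "It's fun to contribute a brand-new PR or 2 to OpenSearch!"
}
```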

### Partial word tokenizers

Partial word tokenizers parse text into words and generate fragments of those words for partial word matching.

Tokenizer | Description | Example
:--- | :--- | :---
`ngram`| - Parses strings into words on specified characters (for example, punctuation or white space characters) and generates n-grams of each word | `My repo` <br>becomes<br> [`M`, `My`, `y`, `y `, <code>&nbsp;</code>, <code>&nbsp;r</code>, `r`, `re`, `e`, `ep`, `p`, `po`, `o`] <br> because the default n-gram length is 1--2 characters
`edge_ngram` | - Parses strings into words on specified characters (for example, punctuation or white space characters) and generates edge n-grams of each word (n-grams that start at the beginning of the word) | `My repo` <br>becomes<br> [`M`, `My`] <br> because the default n-gram length is 1--2 characters
[`ngram`]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/ngram/)| - Parses strings into words on specified characters (for example, punctuation or white space characters) and generates n-grams of each word | `My repo` <br>becomes<br> [`M`, `My`, `y`, `y `, <code>&nbsp;</code>, <code>&nbsp;r</code>, `r`, `re`, `e`, `ep`, `p`, `po`, `o`] <br> because the default n-gram length is 1--2 characters
[`edge_ngram`]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/edge-n-gram/) | - Parses strings into words on specified characters (for example, punctuation or white space characters) and generates edge n-grams of each word (n-grams that start at the beginning of the word) | `My repo` <br>becomes<br> [`M`, `My`] <br> because the default n-gram length is 1--2 characters
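
Both tokenizers accept `min_gram` and `max_gram` settings that control the generated fragment lengths. The following request is a minimal sketch that tests an `edge_ngram` tokenizer defined inline (the gram lengths shown are illustrative):

```json
POST /_analyze
{
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": 2,
    "max_gram": 4
  },
  "text": "repo"
}
```

With these settings, `repo` produces the edge n-grams [`re`, `rep`, `repo`].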

### Structured text tokenizers

Structured text tokenizers parse structured text, such as identifiers, email addresses, paths, or ZIP Codes.

Tokenizer | Description | Example
:--- | :--- | :---
`keyword` | - No-op tokenizer <br> - Outputs the entire string unchanged <br> - Can be combined with token filters, like lowercase, to normalize terms | `My repo` <br>becomes<br> `My repo`
`pattern` | - Uses a regular expression pattern to parse text into terms on a word separator or to capture matching text as terms <br> - Uses [Java regular expressions](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html) | `https://opensearch.org/forum` <br>becomes<br> [`https`, `opensearch`, `org`, `forum`] because by default the tokenizer splits terms at word boundaries (`\W+`)<br> Can be configured with a regex pattern
`simple_pattern` | - Uses a regular expression pattern to return matching text as terms <br> - Uses [Lucene regular expressions](https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/util/automaton/RegExp.html) <br> - Faster than the `pattern` tokenizer because it uses a subset of the `pattern` tokenizer regular expressions | Returns an empty array by default <br> Must be configured with a pattern because the pattern defaults to an empty string
`simple_pattern_split` | - Uses a regular expression pattern to split the text on matches rather than returning the matches as terms <br> - Uses [Lucene regular expressions](https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/util/automaton/RegExp.html) <br> - Faster than the `pattern` tokenizer because it uses a subset of the `pattern` tokenizer regular expressions | No-op by default<br> Must be configured with a pattern
`char_group` | - Parses on a set of configurable characters <br> - Faster than tokenizers that run regular expressions | No-op by default<br> Must be configured with a list of characters
`path_hierarchy` | - Parses text on the path separator (by default, `/`) and returns a full path to each component in the tree hierarchy | `one/two/three` <br>becomes<br> [`one`, `one/two`, `one/two/three`]
[`keyword`]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/keyword/) | - No-op tokenizer <br> - Outputs the entire string unchanged <br> - Can be combined with token filters, like lowercase, to normalize terms | `My repo` <br>becomes<br> `My repo`
[`pattern`]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/pattern/) | - Uses a regular expression pattern to parse text into terms on a word separator or to capture matching text as terms <br> - Uses [Java regular expressions](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html) | `https://opensearch.org/forum` <br>becomes<br> [`https`, `opensearch`, `org`, `forum`] because by default the tokenizer splits terms at word boundaries (`\W+`)<br> Can be configured with a regex pattern
[`simple_pattern`]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/simple-pattern/) | - Uses a regular expression pattern to return matching text as terms <br> - Uses [Lucene regular expressions](https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/util/automaton/RegExp.html) <br> - Faster than the `pattern` tokenizer because it uses a subset of the `pattern` tokenizer regular expressions | Returns an empty array by default <br> Must be configured with a pattern because the pattern defaults to an empty string
[`simple_pattern_split`]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/simple-pattern-split/) | - Uses a regular expression pattern to split the text on matches rather than returning the matches as terms <br> - Uses [Lucene regular expressions](https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/util/automaton/RegExp.html) <br> - Faster than the `pattern` tokenizer because it uses a subset of the `pattern` tokenizer regular expressions | No-op by default<br> Must be configured with a pattern
[`char_group`]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/character-group/) | - Parses on a set of configurable characters <br> - Faster than tokenizers that run regular expressions | No-op by default<br> Must be configured with a list of characters
[`path_hierarchy`]({{site.url}}{{site.baseurl}}/analyzers/tokenizers/path-hierarchy/) | - Parses text on the path separator (by default, `/`) and returns a full path to each component in the tree hierarchy | `one/two/three` <br>becomes<br> [`one`, `one/two`, `one/two/three`]
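
For example, the following request is a minimal sketch showing how the `path_hierarchy` tokenizer expands a path into its ancestor paths:

```json
POST /_analyze
{
  "tokenizer": "path_hierarchy",
  "text": "one/two/three"
}
```

The response contains the tokens `one`, `one/two`, and `one/two/three`.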

