Split Search and Query (#4279)
Co-authored-by: Dr. Ernie Prabhakar <[email protected]>
drernie and drernie authored Jan 8, 2025
1 parent 0321d2a commit 030ef7a
Showing 8 changed files with 323 additions and 184 deletions.
8 changes: 4 additions & 4 deletions docs/CONTRIBUTING.md
@@ -28,7 +28,7 @@ git checkout -B new-branch-name

## Local package development

### Environment
### Python Environment

Use `pip` to install `quilt` locally (including development dependencies):

@@ -42,7 +42,7 @@ install](https://pip.pypa.io/en/stable/reference/pip_install/#editable-installs)
of `quilt`, allowing you to modify the code and test your changes
right away.

### Testing
### Python Testing

All new code contributions are expected to have complete unit test
coverage, and to pass all preexisting tests.
@@ -62,7 +62,7 @@ catalog if you already have a catalog deployed to AWS, because the
catalog relies on certain services (namely, AWS Lambda and the AWS
Elasticsearch Service) which cannot be run locally.

### Environment
### Catalog Environment

Use `npm` to install the catalog (`quilt-navigator`) dependencies locally:

@@ -152,7 +152,7 @@ Make sure that any images you check into the repository are
[optimized](https://kinsta.com/blog/optimize-images-for-web/) at
check-in time.

### Testing
### Catalog Testing

To run the catalog unit tests:

59 changes: 59 additions & 0 deletions docs/Catalog/Query.md
@@ -0,0 +1,59 @@
<!-- markdownlint-disable-next-line first-line-h1 -->
[Amazon Athena](https://aws.amazon.com/athena/) is an interactive query service
that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is
serverless, so there is no infrastructure to manage, and you pay only for the
queries that you run.

The Catalog's Queries tab allows you to run Athena queries against your S3
buckets, and any other data sources your users have access to. There are
prebuilt tables for packages and objects, and you can create your own tables and
views. See, for example, [Tabulator](advanced-features/tabulator.md).

NOTE: This page describes how to use Athena for precise querying of specific
tables and fields. For full-text searching using Elasticsearch, see the
[Search](Search.md) page.

## Basics

"Run query" executes the selected query and waits for the result.

![ui](../imgs/athena-ui.png)

Individual users can also see their past queries and easily re-run them.

![history](../imgs/athena-history.png)

## Example: query package-level metadata

Suppose we wish to find all packages produced by algorithm version 1.3 with a
cell index of 5.

```sql
SELECT * FROM "YOUR-BUCKET_packages-view"
-- extract and query package-level metadata
WHERE json_extract_scalar(meta,
'$.user_meta.nucmembsegmentationalgorithmversion') LIKE '1.3%'
AND json_array_contains(json_extract(meta, '$.user_meta.cellindex'), '5');
```

## Example: query object-level metadata

Suppose we wish to find all .tiff files produced by algorithm version 1.3
with a cell index of 5.

```sql
SELECT * FROM "YOUR-BUCKET_objects-view"
WHERE substr(logical_key, -5) = '.tiff'
-- extract and query object-level metadata
AND json_extract_scalar(meta,
'$.user_meta.nucmembsegmentationalgorithmversion') LIKE '1.3%'
AND json_array_contains(json_extract(meta, '$.user_meta.cellindex'), '5');
```
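
The query above can also be generated for different buckets and filter values.
The following is a minimal sketch, not part of Quilt: the helper name
`build_objects_query` and its parameters are hypothetical, and it assumes the
`<bucket>_objects-view` naming and metadata keys shown in the example above.
It only builds the SQL text, which you can then paste into the Queries tab.

```python
import re


def build_objects_query(bucket: str, extension: str,
                        version_prefix: str, cell_index: str) -> str:
    """Return Athena SQL filtering the objects view by extension and metadata.

    Hypothetical helper for illustration; the view name and metadata keys
    are copied from the example query in this page.
    """
    # Allow only simple characters so a value cannot break out of the
    # quoted identifier or the string literals in the generated SQL.
    for value in (bucket, extension, version_prefix, cell_index):
        if not re.fullmatch(r"[A-Za-z0-9._-]+", value):
            raise ValueError(f"unsafe value: {value!r}")
    return (
        f'SELECT * FROM "{bucket}_objects-view"\n'
        # A negative start position counts from the end of logical_key.
        f"WHERE substr(logical_key, -{len(extension)}) = '{extension}'\n"
        "AND json_extract_scalar(meta,\n"
        f"'$.user_meta.nucmembsegmentationalgorithmversion') LIKE '{version_prefix}%'\n"
        f"AND json_array_contains(json_extract(meta, '$.user_meta.cellindex'), '{cell_index}');"
    )


query = build_objects_query("YOUR-BUCKET", ".tiff", "1.3", "5")
print(query)
```

Deriving the `substr` offset from the extension length keeps the suffix check
correct when the extension changes (for example `.tif` instead of `.tiff`).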

## Configuration

Athena queries saved from the AWS Console for a given workgroup will be
available in the Quilt Catalog for all users to run.

Administrators can hide the "Queries" tab by setting `ui > nav > queries: false`
([learn more](./Preferences.md)).
94 changes: 28 additions & 66 deletions docs/Catalog/SearchQuery.md → docs/Catalog/Search.md
@@ -1,22 +1,21 @@
<!-- markdownlint-disable -->
Quilt provides support for queries in the Elasticsearch DSL, as
well as SQL queries in Athena.
<!-- markdownlint-disable MD013 -->
<!-- markdownlint-disable-next-line first-line-h1 -->
Each Quilt stack includes an Elasticsearch cluster that indexes objects and
packages as documents. The objects in Amazon S3 buckets connected to Quilt are
synchronized to an Elasticsearch cluster, which provides Quilt's search and
package listing features.

## Elasticsearch
NOTE: This page is about full-text searching using Elasticsearch. For precise querying of specific fields, see the [Queries](Query.md) page.

The objects in Amazon S3 buckets connected to Quilt are synchronized to
an Elasticsearch cluster, which provides Quilt's search features.
## Indexing

Quilt uses Elasticsearch 6.7
([docs](https://www.elastic.co/guide/en/elasticsearch/reference/6.7/index.html)).

### Indexing
Quilt maintains a near-realtime index of the objects in your S3
bucket in Elasticsearch. Each bucket corresponds to one or more
Elasticsearch indexes. As objects are mutated in S3, Quilt uses an
event-driven system (via SNS and SQS) to update Elasticsearch.

There are two types of indexing in Quilt:

* *shallow* indexing includes object metadata (such as the file name and size)
* *deep* indexing includes object contents. Quilt supports deep
indexing for the following file extensions:
@@ -28,24 +27,18 @@ indexing for the following file extensions:
* .pptx
* .xls, .xlsx

> By default, Quilt indexes a limited number of bytes per document for specified file
formats (100KB). Both the max number of bytes per document and which file formats
to deep index can be customized per Bucket in the Catalog Admin settings.

![Example of Admin Bucket indexing options](../imgs/elastic-search-indexing-options.png)

### Search Bar

The search bar on every page in the catalog provides a convenient
shortcut for searching objects and packages in an Amazon S3
bucket.

> Quilt uses Elasticsearch 6.7 [query string
> syntax](https://www.elastic.co/guide/en/elasticsearch/reference/6.7/query-dsl-query-string-query.html#query-string-syntax).
NOTE: Quilt uses Elasticsearch 6.7 [query string
syntax](https://www.elastic.co/guide/en/elasticsearch/reference/6.7/query-dsl-query-string-query.html#query-string-syntax).

The following are all valid search parameters:

**Fields**
#### Fields

| Syntax | Description | Example |
|- | - | - |
@@ -65,7 +58,7 @@ The following are all valid search parameters:
| `package_stats.total_bytes` | Package total bytes | `package_stats.total_bytes:<100` |
| `workflow.id` | Package workflow ID | `workflow.id:verify-metadata` |

**Logical operators and grouping**
#### Logical operators and grouping

| Syntax | Description | Example |
|- | - | - |
@@ -75,66 +68,35 @@ The following are all valid search parameters:
| `_exists_` | Matches any non-null value for the given field | `_exists_: content` |
| `()` | Group terms | `(a AND b) NOT c` |

**Wildcard and regular expressions**
#### Wildcard and regular expressions

| Syntax | Description | Example |
|- | - | - |
| `*` | Zero or more characters; avoid a leading `*` (slows performance) | `ext:config.y*ml` |
| `?` | Exactly one character | `ext:React.?sx` |
| `//` | Regular expression (slows performance) | `content:/lmnb[12]/` |
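
To get a feel for how `*` and `?` behave, the sketch below approximates the
two wildcards locally with Python's standard `fnmatch` module. This is only an
analogy and an assumption on our part: it does not call Elasticsearch, and
`fnmatch` additionally treats `[...]` as a character set, which the query
string syntax does not. The helper name `wildcard_match` is hypothetical.

```python
from fnmatch import fnmatchcase


def wildcard_match(pattern: str, value: str) -> bool:
    """Approximate `*` (zero or more chars) and `?` (exactly one char)."""
    # fnmatchcase is case-sensitive and does not normalize paths.
    return fnmatchcase(value, pattern)


# `*` matches zero or more characters.
print(wildcard_match("config.y*ml", "config.yaml"))
print(wildcard_match("config.y*ml", "config.yml"))

# `?` matches exactly one character, so a missing character fails.
print(wildcard_match("React.?sx", "React.jsx"))
print(wildcard_match("React.?sx", "React.sx"))
```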

### QUERIES > ELASTICSEARCH tab
### ELASTICSEARCH tab

![](../imgs/catalog-es-queries-default.png)
When you click into a specific bucket, you can access the Elasticsearch tab to
run more complex queries. The Elasticsearch tab provides a more powerful search
interface than the search bar, allowing you to specify the Elasticsearch index
and query parameters.

![catalog-es-queries-default](../imgs/catalog-es-queries-default.png)

Quilt Elasticsearch queries support the following keys:
- `index` — comma-separated list of indexes to search ([learn

* `index` — comma-separated list of indexes to search ([learn
more](https://www.elastic.co/guide/en/elasticsearch/reference/6.8/multi-index.html))
- `filter_path` — reduces response nesting ([learn
* `filter_path` — reduces response nesting ([learn
more](https://www.elastic.co/guide/en/elasticsearch/reference/6.8/common-options.html#common-options-response-filtering))
- `_source` — boolean that adds or removes the `_source` field, or
* `_source` — boolean that adds or removes the `_source` field, or
a list of fields to return ([learn
more](https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-request-source-filtering.html))
- `size` — limits the number of hits ([learn
* `size` — limits the number of hits ([learn
more](https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-uri-request.html))
- `from` — starting offset for pagination ([learn
* `from` — starting offset for pagination ([learn
more](https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-uri-request.html))
- `body` — the search query body as a JSON dictionary ([learn
* `body` — the search query body as a JSON dictionary ([learn
more](https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-request-body.html))
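
As an illustration of how the keys listed above fit together, here is a
minimal sketch that assembles them into one JSON document. The helper name
`make_search_params`, the placeholder index name `YOUR-BUCKET`, and the exact
shape the tab expects are assumptions; the field names and the query string
are taken from the search tables on this page, and the `body` uses the
standard Elasticsearch `query_string` query.

```python
import json


def make_search_params(indexes, query_string, size=10, offset=0,
                       source_fields=None):
    """Assemble the supported keys into a single dictionary.

    Hypothetical helper for illustration only.
    """
    params = {
        # `index` is a comma-separated list of indexes to search.
        "index": ",".join(indexes),
        # `size` limits the number of hits; `from` is the pagination offset.
        "size": size,
        "from": offset,
        # `body` carries the query itself as a JSON dictionary.
        "body": {"query": {"query_string": {"query": query_string}}},
    }
    if source_fields is not None:
        # `_source` may be a boolean or a list of fields to return.
        params["_source"] = source_fields
    return params


params = make_search_params(
    ["YOUR-BUCKET"],  # placeholder index name (assumption)
    "workflow.id:verify-metadata",
    size=5,
    source_fields=["workflow.id", "package_stats.total_bytes"],
)
print(json.dumps(params, indent=2))
```

Keeping `size` small and restricting `_source` to the fields you need reduces
the size of the response while you iterate on a query.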

#### Saved queries
You can provide pre-canned queries for your users by providing a configuration file
at `s3://YOUR_BUCKET/.quilt/queries/config.yaml`:

```yaml
version: "1"
queries:
query-1:
name: My first query
description: Optional description
url: s3://BUCKET/.quilt/queries/query-1.json
query-2:
name: Second query
url: s3://BUCKET/.quilt/queries/query-2.json
```
The Quilt catalog displays your saved queries in a drop-down for your users to
select, edit, and execute.
## Athena
You can park reusable Athena Queries in the Quilt catalog so that your users can
run them. You must first set up an Athena workgroup and saved queries per
[AWS's Athena documentation](https://docs.aws.amazon.com/athena/latest/ug/getting-started.html).
### Configuration
You can hide the "Queries" tab by setting `ui > nav > queries: false`.
It is also possible to set the default workgroup in `ui > athena > defaultWorkgroup: 'your-default-workgroup'`.
[Learn more](./Preferences.md).
The tab will remember the last workgroup, catalog name, and database that were selected.
### Basics
"Run query" executes the selected query and waits for the result.
![Athena page](../imgs/athena-ui.png)