TEST Only, do not merge #1431

Closed
wants to merge 6 commits into from
34 changes: 34 additions & 0 deletions .github/scripts/release.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,34 @@
## This script versions the docs for a major/minor release.
## The input version must omit the patch number, for example: 1.0 or 0.12
VERSION=$1

echo "Processing variables"
cp variables/variables-nightly.ts variables/variables-$VERSION.ts
sed -i "s/greptimedbVersion: 'v[^']*'/greptimedbVersion: 'v$VERSION.0'/" variables/variables-$VERSION.ts
sed -i "s/greptimedbVersion: 'v[^']*'/greptimedbVersion: 'v$VERSION.0'/" variables/variables-nightly.ts

echo "Processing localized sidebars"
cp i18n/zh/docusaurus-plugin-content-docs/current.json i18n/zh/docusaurus-plugin-content-docs/version-$VERSION.json
jq 'del(.["version.label"])' i18n/zh/docusaurus-plugin-content-docs/version-$VERSION.json > temp.json && mv temp.json i18n/zh/docusaurus-plugin-content-docs/version-$VERSION.json

echo "Removing greptimecloud content from current version"
CURRENT_VERSION=$(ls -1 versioned_docs | sort -V | tail -n 1)
rm -rf versioned_docs/$CURRENT_VERSION/greptimecloud
rm -rf i18n/zh/docusaurus-plugin-content-docs/$CURRENT_VERSION/greptimecloud
jq 'del(.docs[] | select(.label == "GreptimeCloud"))' versioned_sidebars/$CURRENT_VERSION-sidebars.json > temp.json && mv temp.json versioned_sidebars/$CURRENT_VERSION-sidebars.json
sed -i '/^- \[GreptimeCloud\]/d' versioned_docs/$CURRENT_VERSION/index.md
sed -i '/^- \[GreptimeCloud\]/d' i18n/zh/docusaurus-plugin-content-docs/$CURRENT_VERSION/index.md

echo "Generating new version"
npm run docusaurus docs:version $VERSION

echo "Removing oldest version"
OLDEST_VERSION=$(ls -1 versioned_docs | sort -V | head -n 1)
rm -rf versioned_docs/$OLDEST_VERSION
rm -rf i18n/zh/docusaurus-plugin-content-docs/$OLDEST_VERSION/
rm i18n/zh/docusaurus-plugin-content-docs/$OLDEST_VERSION.json
rm versioned_sidebars/$OLDEST_VERSION-sidebars.json
jq '.[:-1]' versions.json > temp.json && mv temp.json versions.json

# echo "Set new default"
# npm run docusaurus docs:use-version $VERSION
39 changes: 39 additions & 0 deletions .github/workflows/bump-version.yml
@@ -0,0 +1,39 @@
name: Version Docs

on:
  workflow_dispatch:
    inputs:
      version:
        description: 'Version number without patch (e.g., 1.0 or 0.12)'
        required: true
        type: string

jobs:
  update-docs:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Node.js
        uses: actions/setup-node@v3
        with:
          node-version: 22

      - name: Install dependencies
        run: npm install

      - name: Call update scripts
        run: |
          VERSION=${{ github.event.inputs.version }}
          .github/scripts/release.sh $VERSION

      - name: Create Pull Request
        uses: peter-evans/create-pull-request@v5
        with:
          commit-message: "Version docs to ${{ github.event.inputs.version }}"
          title: "Version docs to ${{ github.event.inputs.version }}"
          body: "This PR updates the docs to version ${{ github.event.inputs.version }}."
          branch: "version-docs-${{ github.event.inputs.version }}"
          base: main
          delete-branch: true
6 changes: 0 additions & 6 deletions docusaurus.config.ts
@@ -135,12 +135,6 @@ const config: Config = {
current: {
label: 'nightly',
path: 'nightly',
},
'0.8': {
path: 'v0.8'
},
'0.7': {
path: 'v0.7'
}
},
remarkPlugins: [
194 changes: 194 additions & 0 deletions i18n/zh/docusaurus-plugin-content-docs/version-0.12.json
@@ -0,0 +1,194 @@
{
"version.label": {
"message": "nightly",
"description": "The label for version current"
},
"sidebar.docs.category.Getting Started": {
"message": "立即开始",
"description": "The label for category Getting Started in sidebar docs"
},
"sidebar.docs.category.Installation": {
"message": "安装",
"description": "The label for category Installation in sidebar docs"
},
"sidebar.docs.category.User Guide": {
"message": "用户指南",
"description": "The label for category User Guide in sidebar docs"
},
"sidebar.docs.category.Concepts": {
"message": "概念",
"description": "The label for category Concepts in sidebar docs"
},
"sidebar.docs.category.Migrate to GreptimeDB": {
"message": "迁移到 GreptimeDB",
"description": "The label for category Migrate to GreptimeDB in sidebar docs"
},
"sidebar.docs.category.Write Data": {
"message": "写入数据",
"description": "The label for category Write Data in sidebar docs"
},
"sidebar.docs.category.Query Data": {
"message": "读取数据",
"description": "The label for category Query Data in sidebar docs"
},
"sidebar.docs.category.Flow Computation": {
"message": "流计算",
"description": "The label for category Flow Computation in sidebar docs"
},
"sidebar.docs.category.Logs": {
"message": "日志",
"description": "The label for category Logs in sidebar docs"
},
"sidebar.docs.category.Client Libraries": {
"message": "客户端库",
"description": "The label for category Client Libraries in sidebar docs"
},
"sidebar.docs.category.Administration": {
"message": "管理",
"description": "The label for category Administration in sidebar docs"
},
"sidebar.docs.category.Authentication": {
"message": "鉴权",
"description": "The label for category Authentication in sidebar docs"
},
"sidebar.docs.category.Deployments": {
"message": "部署",
"description": "The label for category Deployments in sidebar docs"
},
"sidebar.docs.category.Deploy on Kubernetes": {
"message": "部署到 Kubernetes",
"description": "The label for category Deploy on Kubernetes in sidebar docs"
},
"sidebar.docs.category.Manage GreptimeDB Operator": {
"message": "管理 GreptimeDB Operator",
"description": "The label for category Manage GreptimeDB Operator in sidebar docs"
},
"sidebar.docs.category.Disaster Recovery": {
"message": "灾难恢复",
"description": "The label for category Disaster Recovery in sidebar docs"
},
"sidebar.docs.category.Remote WAL": {
"message": "Remote WAL",
"description": "The label for category Remote WAL in sidebar docs"
},
"sidebar.docs.category.GreptimeCloud": {
"message": "GreptimeCloud",
"description": "The label for category GreptimeCloud in sidebar docs"
},
"sidebar.docs.category.Integrations": {
"message": "集成",
"description": "The label for category Integrations in sidebar docs"
},
"sidebar.docs.category.Prometheus": {
"message": "Prometheus",
"description": "The label for category Prometheus in sidebar docs"
},
"sidebar.docs.category.SDK Libraries": {
"message": "SDK Libraries",
"description": "The label for category SDK Libraries in sidebar docs"
},
"sidebar.docs.category.Migrate to GreptimeCloud": {
"message": "迁移到 GreptimeCloud",
"description": "The label for category Migrate to GreptimeCloud in sidebar docs"
},
"sidebar.docs.category.Usage & Billing": {
"message": "用量及费用",
"description": "The label for category Usage & Billing in sidebar docs"
},
"sidebar.docs.category.Tutorials": {
"message": "教程",
"description": "The label for category Tutorials in sidebar docs"
},
"sidebar.docs.category.Monitor Host Metrics": {
"message": "监控 Host Metrics",
"description": "The label for category Monitor Host Metrics in sidebar docs"
},
"sidebar.docs.category.GreptimeDB Enterprise": {
"message": "GreptimeDB 企业版",
"description": "The label for category GreptimeDB Enterprise in sidebar docs"
},
"sidebar.docs.category.Reference": {
"message": "Reference",
"description": "The label for category Reference in sidebar docs"
},
"sidebar.docs.category.SQL": {
"message": "SQL",
"description": "The label for category SQL in sidebar docs"
},
"sidebar.docs.category.Functions": {
"message": "Functions",
"description": "The label for category Functions in sidebar docs"
},
"sidebar.docs.category.Information Schema": {
"message": "Information Schema",
"description": "The label for category Information Schema in sidebar docs"
},
"sidebar.docs.category.Contributor Guide": {
"message": "贡献者指南",
"description": "The label for category Contributor Guide in sidebar docs"
},
"sidebar.docs.category.Frontend": {
"message": "Frontend",
"description": "The label for category Frontend in sidebar docs"
},
"sidebar.docs.category.Datanode": {
"message": "Datanode",
"description": "The label for category Datanode in sidebar docs"
},
"sidebar.docs.category.Metasrv": {
"message": "Metasrv",
"description": "The label for category Metasrv in sidebar docs"
},
"sidebar.docs.category.Flownode": {
"message": "Flownode",
"description": "The label for category Flownode in sidebar docs"
},
"sidebar.docs.category.Tests": {
"message": "测试",
"description": "The label for category Tests in sidebar docs"
},
"sidebar.docs.category.How To": {
"message": "指南",
"description": "The label for category How To in sidebar docs"
},
"sidebar.docs.category.FAQ and Others": {
"message": "常见问题及其他",
"description": "The label for category FAQ and Others in sidebar docs"
},
"sidebar.docs.link.Release Notes": {
"message": "Release Notes",
"description": "The label for link Release Notes in sidebar docs, linking to /release-notes"
},
"sidebar.docs.category.Ingest Data": {
"message": "写入数据",
"description": "The label for category Ingest Data in sidebar docs"
},
"sidebar.docs.category.For Observerbility": {
"message": "可观测场景",
"description": "The label for category For Observerbility in sidebar docs"
},
"sidebar.docs.category.For IoT": {
"message": "物联网(IoT)场景",
"description": "The label for category For IoT in sidebar docs"
},
"sidebar.docs.category.gRPC SDKs": {
"message": "gRPC SDKs",
"description": "The label for category gRPC SDKs in sidebar docs"
},
"sidebar.docs.category.Manage Data": {
"message": "管理数据",
"description": "The label for category Manage Data in sidebar docs"
},
"sidebar.docs.category.Protocols": {
"message": "协议",
"description": "The label for category Protocols in sidebar docs"
},
"sidebar.docs.category.Monitoring": {
"message": "监控",
"description": "The label for category Monitoring in sidebar docs"
},
"sidebar.docs.category.Vector Storage": {
"message": "向量存储",
"description": "The label for category Vector Storage in sidebar docs"
}
}
@@ -0,0 +1,87 @@
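---
keywords: [data persistence, indexing, SST files, inverted index]
description: Introduces GreptimeDB's data persistence and indexing mechanisms, including the SST file format, the data persistence process, and the implementation of the inverted index.
---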

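# Data Persistence and Indexing

Like all LSMT-like storage engines, GreptimeDB persists the data in MemTables to durable storage, such as the local disk file system or an object storage service. GreptimeDB adopts [Apache Parquet][1] as its persistent file format.

## SST File Format

Parquet is an open-source columnar storage format that provides fast data querying and has been adopted by many projects, such as Delta Lake.

Parquet has a hierarchical structure of "row group, column, data page". Data in a Parquet file is horizontally partitioned into row groups, within which all values of the same column are stored together to form data pages. The data page is the smallest storage unit. This structure improves performance in two ways.

First, data is clustered by column, which makes file scans more efficient, especially when a query touches only a few columns, as is very common in analytical systems.

Second, data in the same column tends to be homogeneous (for example, the values are similar), which helps compression with techniques such as dictionary encoding and Run-Length Encoding (RLE).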
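
To make the compression point concrete, here is a minimal sketch of dictionary encoding followed by run-length encoding on one column. It is an illustration only, not Parquet's actual on-disk encoding; the function names and the sample `host-*` values are hypothetical.

```python
from itertools import groupby


def dictionary_encode(values):
    """Replace each value with a small integer code; return (dictionary, codes)."""
    dictionary = []   # code -> original value
    index = {}        # original value -> code
    codes = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def run_length_encode(codes):
    """Collapse consecutive repeats into (code, run_length) pairs."""
    return [(code, sum(1 for _ in run)) for code, run in groupby(codes)]


# Hypothetical column: similar values sit next to each other.
column = ["host-a", "host-a", "host-a", "host-b", "host-b", "host-a"]
dictionary, codes = dictionary_encode(column)
runs = run_length_encode(codes)
```

Six strings shrink to a two-entry dictionary plus three `(code, run_length)` pairs. The more homogeneous the column, the longer the runs and the better the ratio.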

<img src="/parquet-file-format.png" alt="Parquet file format" width="500"/>

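## Data Persistence

GreptimeDB provides the `storage.flush.global_write_buffer_size` configuration option to set a global MemTable size threshold. When the total size of the data across all MemTables in the database reaches the threshold, a flush is triggered automatically and the MemTable data is written to SST files.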
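
The threshold check can be sketched as follows. This is a simplified model under stated assumptions, not GreptimeDB's implementation: the `MemTable` class and `maybe_flush` function are hypothetical, and flushing every non-empty MemTable once the threshold is reached is an assumption made for brevity.

```python
class MemTable:
    """Hypothetical in-memory buffer that tracks its own size in bytes."""

    def __init__(self, name):
        self.name = name
        self.rows = []
        self.bytes_used = 0

    def write(self, row, size):
        self.rows.append(row)
        self.bytes_used += size


def maybe_flush(memtables, global_write_buffer_size):
    """Flush all non-empty memtables once their combined size reaches the threshold."""
    total = sum(m.bytes_used for m in memtables)
    if total < global_write_buffer_size:
        return []
    flushed = []
    for m in memtables:
        if m.rows:
            # Stand-in for writing an SST file: record what would be flushed.
            flushed.append((m.name, len(m.rows)))
            m.rows, m.bytes_used = [], 0
    return flushed


a, b = MemTable("region-a"), MemTable("region-b")
a.write({"v": 1}, 600)
first = maybe_flush([a, b], 1000)    # 600 bytes in total: below the threshold
b.write({"v": 2}, 500)
second = maybe_flush([a, b], 1000)   # 1100 bytes in total: flush is triggered
```

The key point is that the threshold applies to the sum across all MemTables, so a single small MemTable can still be flushed when the others fill the global budget.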


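## Index Data in SST Files

The Apache Parquet file format provides built-in statistics in the headers of column chunks and data pages, which are used for pruning and skipping.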

<img src="/column-chunk-header.png" alt="Column chunk header" width="350"/>

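For example, in the Parquet file above, if you want to filter rows where `name` equals `Emily`, you can skip row group 0 because the maximum value of the `name` field there is `Charlie`. These statistics reduce IO operations.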
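
The pruning step can be sketched as a range check against per-row-group min/max statistics. The `RowGroupStats` type is hypothetical, and every value other than the `Charlie` maximum of row group 0 is made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Hypothetical min/max statistics for the `name` column of one row group."""
    row_group: int
    min_value: str
    max_value: str


def row_groups_to_scan(stats, target):
    """Keep only the row groups whose [min, max] range could contain target."""
    return [s.row_group for s in stats if s.min_value <= target <= s.max_value]


stats = [
    RowGroupStats(0, "Alice", "Charlie"),  # max is Charlie, as in the example above
    RowGroupStats(1, "David", "Frank"),    # assumed values
]
candidates = row_groups_to_scan(stats, "Emily")
```

Here only row group 1 remains a candidate. Statistics can rule a row group out, but they cannot confirm a match, so the remaining row groups still have to be read and filtered.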


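## Index Files

For each SST file, GreptimeDB not only maintains the index inside the SST file but also generates a separate file that stores index structures for that SST file.

Index files use the [Puffin][3] format, which is flexible enough to store additional metadata and to support more index structures.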

![Puffin](/puffin.png)

Currently, the inverted index is the first standalone index structure that GreptimeDB supports. It is stored in the index file as a Blob.


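## Inverted Index

GreptimeDB introduced the inverted index in v0.7 to accelerate queries.

The inverted index is a common index structure for full-text search. It maps each word in the documents to the list of documents that contain the word. GreptimeDB applies this technique, which originated in search engines, to a time-series database.

Although search engines and time-series databases operate in different domains, the principle behind the inverted index is similar. Applying it requires some conceptual mapping:
1. Word: in GreptimeDB, a column value of a time series.
2. Document: in GreptimeDB, a data segment that contains multiple time series.

The inverted index lets GreptimeDB skip data segments that do not satisfy the query conditions, which improves scan efficiency.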

![Inverted index searching](/inverted-index-searching.png)

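For example, the query above uses the inverted index to locate the data segments that satisfy three conditions: `job` equals `apiserver`, `handler` matches the regex `.*users`, and `status` matches the regex `4..`. It then scans only those segments to produce the final result that satisfies all conditions, which significantly reduces the number of IO operations.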

### Inverted Index Format

![Inverted index format](/inverted-index-format.png)

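GreptimeDB builds inverted indexes per column. Each inverted index consists of an FST and multiple Bitmaps.

The FST (Finite State Transducer) lets GreptimeDB store the mapping from column values to Bitmap positions in a compact format. It also provides excellent search performance and supports complex searches such as regular expression matching. The Bitmaps maintain the list of data segment IDs, with each bit representing one data segment.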
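
The value-to-bitmap mapping can be sketched as follows. This is a toy model, not GreptimeDB's format: a sorted Python `dict` stands in for the FST, a plain `int` stands in for each Bitmap, and the function names and sample data are hypothetical.

```python
import re
from collections import defaultdict


def build_inverted_index(values, segment_row_count):
    """Map each distinct column value to a bitmap (an int) of segments containing it."""
    index = defaultdict(int)
    for row, value in enumerate(values):
        segment = row // segment_row_count
        index[value] |= 1 << segment   # set the bit for this segment
    return dict(sorted(index.items()))  # sorted dict: a stand-in for the FST


def search_regex(index, pattern):
    """OR together the bitmaps of every value that fully matches the pattern."""
    regex = re.compile(pattern)
    bitmap = 0
    for value, bits in index.items():
        if regex.fullmatch(value):
            bitmap |= bits
    return bitmap


def segments_of(bitmap):
    """List the segment IDs whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Six rows, two rows per segment, so segments 0, 1, and 2.
job = ["apiserver", "apiserver", "etcd", "etcd", "apiserver", "etcd"]
status = ["200", "404", "200", "500", "403", "200"]
job_index = build_inverted_index(job, segment_row_count=2)
status_index = build_inverted_index(status, segment_row_count=2)

# AND the per-column bitmaps to combine the conditions.
hits = search_regex(job_index, "apiserver") & search_regex(status_index, "4..")
matching_segments = segments_of(hits)
```

Each condition yields one bitmap, and combining conditions is a bitwise AND. A set bit only means the segment may contain matching rows, so the selected segments are still scanned and filtered row by row.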


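### Index Data Segments

GreptimeDB splits an SST file into multiple index data segments, each containing the same number of rows. The purpose of this segmentation is to optimize query performance by scanning only the segments that match the query conditions.

For example, if a segment has 1024 rows and applying the inverted index to the query conditions yields the segment list `[0, 2]`, only segments 0 and 2 of the SST file (rows 0 to 1023 and rows 2048 to 3071) need to be scanned.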
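
The arithmetic in this example can be written out as a short sketch. The function name `segment_row_ranges` is hypothetical.

```python
def segment_row_ranges(segment_ids, segment_row_count=1024):
    """Return inclusive (first_row, last_row) ranges for the given segment IDs."""
    return [
        (seg * segment_row_count, (seg + 1) * segment_row_count - 1)
        for seg in segment_ids
    ]


ranges = segment_row_ranges([0, 2])
```

With 1024 rows per segment, `ranges` is `[(0, 1023), (2048, 3071)]`, which matches the row ranges given above.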

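The number of rows per segment is controlled by the engine option `index.inverted_index.segment_row_count`, which defaults to `1024`. A smaller value means a more precise index and often better query performance, but it increases the index storage cost. Adjusting this option trades off storage cost against query performance.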


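## Unified Data Access Layer: OpenDAL

GreptimeDB uses [OpenDAL][2] to provide a unified data access layer, so the storage engine does not need to interact with different storage APIs, and data can be migrated seamlessly to cloud-based storage such as AWS S3.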

[1]: https://parquet.apache.org
[2]: https://github.com/datafuselabs/opendal
[3]: https://iceberg.apache.org/puffin-spec