v0.1.0-20240802 release #140

Merged 69 commits on Aug 5, 2024
Commits
b492801
Bugfix: a case that files' encodings can not be detected by chardet (…
Ceceliachenen Jun 12, 2024
ef7090b
Bugfix: connection error for longtime upload tasks (#62)
wwxxzz Jun 13, 2024
ba1132a
Add file: file_utils.py (#63)
wwxxzz Jun 13, 2024
daba1f5
Remove local storage and enable Elasticsearch hybrid query mode (#60)
moria97 Jun 13, 2024
0b136e4
Modify async upload to sync (#64)
wwxxzz Jun 14, 2024
6c5eee4
Fix faiss_path not effective in retrieval (#65)
moria97 Jun 14, 2024
435a8b5
Add API to support upload local files (#67)
wwxxzz Jun 19, 2024
1ca838f
add docker image timezone for China (#68)
paradiseHIT Jun 19, 2024
5e4b667
load data pipeline supports read config (#70)
paradiseHIT Jun 19, 2024
cdeb9e4
Add gpu docker image timezone for China (#74)
paradiseHIT Jun 20, 2024
c674f2b
Add fast bm25 (#66)
moria97 Jun 20, 2024
7e98945
Update readme and configuration (#77)
paradiseHIT Jun 21, 2024
7c2467e
Update docker.yml
moria97 Jun 21, 2024
9e68ac6
Enable multiple workers to improve perf (#75)
moria97 Jun 24, 2024
d779dd6
Add guides for env and docker (#81)
wwxxzz Jun 27, 2024
8c3118b
Add config guide cn&en (#82)
aero-xi Jun 27, 2024
905ccd4
Add doc reference for rag query (#84)
wwxxzz Jul 1, 2024
1e7ba21
Support evaluation for generated and open datasets (#83)
wwxxzz Jul 2, 2024
75010f0
Fix oss url for miracl dataset (#86)
wwxxzz Jul 2, 2024
eb5cad9
fix ui es upload (#85)
aero-xi Jul 2, 2024
7d0e1f6
Fix eas LLM (#88)
wwxxzz Jul 2, 2024
5cb96d9
Milvus support sparse search (#87)
moria97 Jul 3, 2024
e0ed40f
Upload multiple files in single API call (#89)
moria97 Jul 3, 2024
93a4e67
Add client default timeout limitation and support UI interactive (#90)
wwxxzz Jul 3, 2024
1f64f40
Fix ui issue (#91)
moria97 Jul 4, 2024
e4ce133
Fix deps and add gpu ci tests (#92)
moria97 Jul 4, 2024
2424ee6
Fix empty response for empty knowledge base (#93)
wwxxzz Jul 4, 2024
7795ab1
Fix dup nodes (#94)
moria97 Jul 4, 2024
9a9840f
Add error handling (#96)
moria97 Jul 5, 2024
0d6368a
fix data_loader (#95)
Ceceliachenen Jul 5, 2024
d54ff9e
Set proper log levels (#98)
moria97 Jul 5, 2024
477fd03
Adjust config instruction and add es instruction (#99)
aero-xi Jul 9, 2024
3864e72
Log stacktrace for failed requests (#100)
moria97 Jul 9, 2024
20ca2bd
Load milvus collection by default (#101)
moria97 Jul 9, 2024
9d4bf3c
Rename & Relocate figures in md (#102)
aero-xi Jul 10, 2024
6d1456e
Modify the docker startup command for the Windows platform (#104)
CharlieKoo Jul 12, 2024
b2b4fa5
download models from oss automatically (#97)
Ceceliachenen Jul 13, 2024
9af0f06
Fix bug in downloading models (#106)
moria97 Jul 15, 2024
4c3cf7d
Add markdown reader (#105)
moria97 Jul 15, 2024
c401d4d
fix pdf reader (#107)
Ceceliachenen Jul 15, 2024
723b4a3
Personal/ranxia/pdf table summary fix (#109)
Ceceliachenen Jul 15, 2024
e14077e
Add page number to file_name (#110)
wwxxzz Jul 15, 2024
06e3af9
Support stream response for LLM (PaiEAS && DashScope) (#112)
wwxxzz Jul 17, 2024
5c0f26b
Add image node processor (#114)
moria97 Jul 18, 2024
db3df96
Fix bug (#115)
moria97 Jul 18, 2024
18e001f
Fix bugs for chinese escaped string in API header (#117)
wwxxzz Jul 19, 2024
98384d2
Fix bidi version (#119)
moria97 Jul 22, 2024
101e436
Update streaming response to body field use server sent events (#120)
moria97 Jul 22, 2024
e75bef2
Support simple-weighted-reranker and similarity-threshold (#116)
wwxxzz Jul 22, 2024
57cffd8
jsonl reader (#124)
Ceceliachenen Jul 25, 2024
b4864c4
Support function_calling with booking demo tools (#122)
wwxxzz Jul 25, 2024
fa271d8
Add nodes enhancement by raptor (#111)
aero-xi Jul 26, 2024
8f192cd
Add weather tool (#125)
zhangdingchu Jul 26, 2024
da268b8
Don't use parallel when data size is big (#108)
moria97 Jul 26, 2024
cd95b4e
Add opensearch (#127)
moria97 Jul 26, 2024
4742dd9
update docker's readme (#126)
paradiseHIT Jul 27, 2024
054587e
Create ci.yml (#131)
moria97 Jul 30, 2024
bf30354
Update CI & PR pipelines (#132)
moria97 Jul 30, 2024
f0fc85c
Fix a few ui bugs (#133)
moria97 Jul 30, 2024
43b1c17
Support RDS postgres vector store (#134)
zt2645802240 Jul 31, 2024
cd4c0b8
Fix minor bugs (#135)
moria97 Jul 31, 2024
179d6b2
Fix empty response for score_threshold (#136)
wwxxzz Jul 31, 2024
714bc03
fix table_reader in pdf_reader (#128)
Ceceliachenen Jul 31, 2024
b313fe8
add "enable_ocr" and "enable_table_summary" (#138)
Ceceliachenen Aug 1, 2024
25e2ffb
Add release pipeline and fix some bugs (#137)
moria97 Aug 1, 2024
b37a44d
Fix llm config (#139)
moria97 Aug 1, 2024
b6d5b4e
Fix toml merge bug (#142)
moria97 Aug 2, 2024
5ad9579
Fix configuration conflict (#143)
moria97 Aug 5, 2024
7b3eea7
Fix space outage in github runner (#144)
moria97 Aug 5, 2024
55 changes: 55 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,55 @@
name: PAI-RAG CI Build

on:
  push:
    # Sequence of patterns matched against refs/heads
    branches:
      - main
      - feature
      - "releases/**"

permissions:
  contents: read
  pull-requests: write

jobs:
  build:
    name: Build and Test
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4
      - name: Set up Python 3.10
        # This is the version of the action for setting up Python, not the Python version.
        uses: actions/setup-python@v5
        with:
          # Semantic version range syntax or exact version of a Python version
          python-version: "3.10"
          # Optional - x64 or x86 architecture, defaults to x64
          architecture: "x64"

      - name: Install Dependencies
        run: |
          python -m pip install --upgrade pip setuptools wheel
          pip install poetry
          poetry install
        env:
          POETRY_VIRTUALENVS_CREATE: false

      - name: Install pre-commit
        shell: bash
        run: poetry run pip install pre-commit

      - name: Run Linter
        shell: bash
        run: poetry run make lint

      - name: Run Tests
        run: |
          make coveragetest
        env:
          DASHSCOPE_API_KEY: ${{ secrets.TESTDASHSCOPEKEY }}
          IS_PAI_RAG_CI_TEST: true
          PAIRAG_RAG__embedding__source: "DashScope"
          PAIRAG_RAG__llm__source: "DashScope"
          PAIRAG_RAG__llm__name: "qwen-turbo"
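Note on the `Run Tests` environment: names such as `PAIRAG_RAG__llm__source` look like prefixed overrides of nested configuration keys, with `__` separating the nesting levels. The loader that consumes them is not part of this diff, so the mapping below is an assumption. It is a minimal shell sketch in which `env_to_key` is a hypothetical helper that strips the `PAIRAG_` prefix and turns each `__` into a dot.

```shell
# Hypothetical helper (not from the PR): map PREFIX_A__b__c to the dotted key "a.b.c".
# Assumes "__" separates nesting levels and that keys are matched case-insensitively.
env_to_key() {
  name=$1
  prefix=$2
  rest=${name#"${prefix}_"}   # drop the leading "PREFIX_"
  printf '%s\n' "$rest" | sed 's/__/./g' | tr '[:upper:]' '[:lower:]'
}

env_to_key PAIRAG_RAG__llm__source PAIRAG
env_to_key PAIRAG_RAG__embedding__source PAIRAG
```

If the real loader follows this convention, the three variables in the step would select DashScope for both the embedding and LLM sources and `qwen-turbo` as the LLM name during CI.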
51 changes: 35 additions & 16 deletions .github/workflows/docker.yml
@@ -1,15 +1,14 @@
 #
-name: Create and publish a Docker image
+name: Publish Docker image

 # Configures this workflow to run every time a change is pushed to the branch called `release`.
 on:
   workflow_dispatch:
   push:
     branches: ["feature"]

 # Defines two custom environment variables for the workflow. These are used for the Container registry domain, and a name for the Docker image that this workflow builds.
 env:
   REGISTRY: registry.cn-beijing.aliyuncs.com
-  REGISTRY_HZ: registry.cn-hangzhou.aliyuncs.com

 # There is a single job in this workflow. It's configured to run on the latest available version of Ubuntu.
 jobs:
@@ -19,6 +18,23 @@ jobs:
       - name: Checkout repository
         uses: actions/checkout@v4

+      - name: Check disk space
+        run: df . -h
+
+      - name: Free disk space
+        run: |
+          sudo docker rmi $(docker image ls -aq) >/dev/null 2>&1 || true
+          sudo rm -rf \
+            /usr/share/dotnet /usr/local/lib/android /opt/ghc \
+            /usr/local/share/powershell /usr/share/swift /usr/local/.ghcup \
+            /usr/lib/jvm || true
+
+      - name: Extract version
+        run: |
+          pip install poetry
+          VERSION_TAG=$(poetry version --short)
+          echo "VERSION_TAG=$VERSION_TAG" >> $GITHUB_ENV
+
       # Uses the `docker/login-action` action to log in to the Container registry using the account and password that will publish the packages. Once published, the packages are scoped to the account defined here.
       - name: Login to ACR Beijing region
         uses: docker/login-action@v1
@@ -27,27 +43,30 @@ jobs:
           username: ${{ secrets.ACR_USER }}
           password: ${{ secrets.ACR_PASSWORD }}

-      - name: Login to ACR Hangzhou region
-        uses: docker/login-action@v1
-        with:
-          registry: ${{ env.REGISTRY_HZ }}
-          username: ${{ secrets.ACR_USER }}
-          password: ${{ secrets.ACR_PASSWORD }}
-
       - name: Build and push base image
         env:
-          IMAGE_TAG: 0.0.2
+          IMAGE_TAG: ${{env.VERSION_TAG}}
         run: |
           docker build -t ${{ env.REGISTRY }}/mybigpai/pairag:$IMAGE_TAG .
-          docker tag ${{ env.REGISTRY }}/mybigpai/pairag:$IMAGE_TAG ${{ env.REGISTRY_HZ }}/mybigpai/pairag:$IMAGE_TAG
           docker push ${{ env.REGISTRY }}/mybigpai/pairag:$IMAGE_TAG
-          docker push ${{ env.REGISTRY_HZ }}/mybigpai/pairag:$IMAGE_TAG

+      - name: Build and push UI image
+        env:
+          IMAGE_TAG: ${{env.VERSION_TAG}}-ui
+        run: |
+          docker build -t ${{ env.REGISTRY }}/mybigpai/pairag:$IMAGE_TAG -f Dockerfile_ui .
+          docker push ${{ env.REGISTRY }}/mybigpai/pairag:$IMAGE_TAG
+
+      - name: Build and push nginx image
+        env:
+          IMAGE_TAG: ${{env.VERSION_TAG}}-nginx
+        run: |
+          docker build -t ${{ env.REGISTRY }}/mybigpai/pairag:$IMAGE_TAG -f Dockerfile_nginx .
+          docker push ${{ env.REGISTRY }}/mybigpai/pairag:$IMAGE_TAG
+
       - name: Build and push GPU image
         env:
-          IMAGE_TAG: 0.0.2_gpu
+          IMAGE_TAG: ${{env.VERSION_TAG}}-gpu
         run: |
           docker build -t ${{ env.REGISTRY }}/mybigpai/pairag:$IMAGE_TAG -f Dockerfile_gpu .
-          docker tag ${{ env.REGISTRY }}/mybigpai/pairag:$IMAGE_TAG ${{ env.REGISTRY_HZ }}/mybigpai/pairag:$IMAGE_TAG
           docker push ${{ env.REGISTRY }}/mybigpai/pairag:$IMAGE_TAG
-          docker push ${{ env.REGISTRY_HZ }}/mybigpai/pairag:$IMAGE_TAG
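Note on tagging in this workflow: the hard-coded `0.0.2` tags give way to a tag taken from `poetry version --short`, with one suffix per image (none for the base image, then `-ui`, `-nginx`, `-gpu`). The naming scheme can be sketched as below; `image_refs` is a hypothetical helper, and `0.1.0` is only an illustrative version, not a value read from this repository's `pyproject.toml`.

```shell
# Hypothetical helper (not from the PR): list the image references that the
# four build steps would push for a given registry and version.
image_refs() {
  registry=$1
  version=$2
  repo="$registry/mybigpai/pairag"
  for suffix in "" -ui -nginx -gpu; do
    printf '%s:%s%s\n' "$repo" "$version" "$suffix"
  done
}

# Illustrative call; in the workflow the version comes from `poetry version --short`.
image_refs registry.cn-beijing.aliyuncs.com 0.1.0
```

Deriving every tag from one version string keeps the four images in step with each other, at the cost of overwriting the previous image whenever the version in `pyproject.toml` is not bumped.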
2 changes: 1 addition & 1 deletion .github/workflows/main.yml
@@ -1,4 +1,4 @@
-name: PAI-RAG CI
+name: PR Build

 on:
   pull_request:
56 changes: 56 additions & 0 deletions .github/workflows/main_gpu.yml
@@ -0,0 +1,56 @@
name: PR Build (GPU)

on:
  pull_request:
    # Sequence of patterns matched against refs/heads
    branches:
      - main
      - feature
      - "releases/**"

permissions:
  contents: read
  pull-requests: write

jobs:
  build:
    name: Build and Test GPU version
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4
      - name: Set up Python 3.10
        # This is the version of the action for setting up Python, not the Python version.
        uses: actions/setup-python@v5
        with:
          # Semantic version range syntax or exact version of a Python version
          python-version: "3.10"
          # Optional - x64 or x86 architecture, defaults to x64
          architecture: "x64"

      - name: Install Dependencies
        run: |
          mv pyproject_gpu.toml pyproject.toml && rm poetry.lock
          python -m pip install --upgrade pip setuptools wheel
          pip install poetry
          poetry install
        env:
          POETRY_VIRTUALENVS_CREATE: false

      - name: Install pre-commit
        shell: bash
        run: poetry run pip install pre-commit

      - name: Run Linter
        shell: bash
        run: poetry run make lint

      - name: Run Tests
        run: |
          make coveragetest
        env:
          DASHSCOPE_API_KEY: ${{ secrets.TESTDASHSCOPEKEY }}
          IS_PAI_RAG_CI_TEST: true
          PAIRAG_RAG__embedding__source: "DashScope"
          PAIRAG_RAG__llm__source: "DashScope"
          PAIRAG_RAG__llm__name: "qwen-turbo"
91 changes: 91 additions & 0 deletions .github/workflows/release.yml
@@ -0,0 +1,91 @@
name: Release image

# Runs when triggered manually (workflow_dispatch) or when a change is pushed to the `main` or `release_test` branch.
on:
  workflow_dispatch:
  push:
    branches: ["main", "release_test"]

# Defines one custom environment variable for the workflow: the Container registry domain that the images are pushed to.
env:
  REGISTRY: mybigpai-public-registry.cn-beijing.cr.aliyuncs.com

# There is a single job in this workflow. It's configured to run on the latest available version of Ubuntu.
jobs:
  build-and-push-image:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - uses: actions/setup-python@v4
        with:
          python-version: "3.11"

      - name: Check disk space
        run: df . -h

      - name: Free disk space
        run: |
          sudo docker rmi $(docker image ls -aq) >/dev/null 2>&1 || true
          sudo rm -rf \
            /usr/share/dotnet /usr/local/lib/android /opt/ghc \
            /usr/local/share/powershell /usr/share/swift /usr/local/.ghcup \
            /usr/lib/jvm || true

      - name: Extract version
        run: |
          pip install poetry
          VERSION_TAG=$(poetry version --short)
          SPECIFIC_VERSION_TAG="$VERSION_TAG-$(date +'%Y%m%d')"
          echo "VERSION_TAG=$VERSION_TAG" >> $GITHUB_ENV
          echo "SPECIFIC_VERSION_TAG=$SPECIFIC_VERSION_TAG" >> $GITHUB_ENV
          echo "version:$SPECIFIC_VERSION_TAG\ncommit_id:$(git rev-parse HEAD)" > __build_version.cfg

      # Uses the `docker/login-action` action to log in to the Container registry using the account and password that will publish the packages. Once published, the packages are scoped to the account defined here.
      - name: Login to ACR region
        uses: docker/login-action@v1
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ secrets.ACR_USER }}
          password: ${{ secrets.ACR_PUBLIC_PASSWORD }}

      - name: Build and push base image
        env:
          IMAGE_TAG: ${{env.VERSION_TAG}}
          SPECIFIC_IMAGE_TAG: ${{env.SPECIFIC_VERSION_TAG}}
        run: |
          docker build -t ${{ env.REGISTRY }}/mybigpai/pairag:${{ env.IMAGE_TAG }} .
          docker push ${{ env.REGISTRY }}/mybigpai/pairag:${{ env.IMAGE_TAG }}
          docker tag ${{ env.REGISTRY }}/mybigpai/pairag:${{ env.IMAGE_TAG }} ${{ env.REGISTRY }}/mybigpai/pairag:${{ env.SPECIFIC_IMAGE_TAG }}
          docker push ${{ env.REGISTRY }}/mybigpai/pairag:${{ env.SPECIFIC_IMAGE_TAG }}

      - name: Build and push UI image
        env:
          IMAGE_TAG: ${{env.VERSION_TAG}}-ui
          SPECIFIC_IMAGE_TAG: ${{env.SPECIFIC_VERSION_TAG}}-ui
        run: |
          docker build -t ${{ env.REGISTRY }}/mybigpai/pairag:${{ env.IMAGE_TAG }} -f Dockerfile_ui .
          docker push ${{ env.REGISTRY }}/mybigpai/pairag:${{ env.IMAGE_TAG }}
          docker tag ${{ env.REGISTRY }}/mybigpai/pairag:${{ env.IMAGE_TAG }} ${{ env.REGISTRY }}/mybigpai/pairag:${{ env.SPECIFIC_IMAGE_TAG }}
          docker push ${{ env.REGISTRY }}/mybigpai/pairag:${{ env.SPECIFIC_IMAGE_TAG }}

      - name: Build and push nginx image
        env:
          IMAGE_TAG: ${{env.VERSION_TAG}}-nginx
          SPECIFIC_IMAGE_TAG: ${{env.SPECIFIC_VERSION_TAG}}-nginx
        run: |
          docker build -t ${{ env.REGISTRY }}/mybigpai/pairag:${{ env.IMAGE_TAG }} -f Dockerfile_nginx .
          docker push ${{ env.REGISTRY }}/mybigpai/pairag:${{ env.IMAGE_TAG }}
          docker tag ${{ env.REGISTRY }}/mybigpai/pairag:${{ env.IMAGE_TAG }} ${{ env.REGISTRY }}/mybigpai/pairag:${{ env.SPECIFIC_IMAGE_TAG }}
          docker push ${{ env.REGISTRY }}/mybigpai/pairag:${{ env.SPECIFIC_IMAGE_TAG }}

      - name: Build and push GPU image
        env:
          IMAGE_TAG: ${{env.VERSION_TAG}}-gpu
          SPECIFIC_IMAGE_TAG: ${{env.SPECIFIC_VERSION_TAG}}-gpu
        run: |
          docker build -t ${{ env.REGISTRY }}/mybigpai/pairag:${{ env.IMAGE_TAG }} -f Dockerfile_gpu .
          docker push ${{ env.REGISTRY }}/mybigpai/pairag:${{ env.IMAGE_TAG }}
          docker tag ${{ env.REGISTRY }}/mybigpai/pairag:${{ env.IMAGE_TAG }} ${{ env.REGISTRY }}/mybigpai/pairag:${{ env.SPECIFIC_IMAGE_TAG }}
          docker push ${{ env.REGISTRY }}/mybigpai/pairag:${{ env.SPECIFIC_IMAGE_TAG }}
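Note on the `Extract version` step: it builds a date-stamped tag and records it, together with the commit id, in `__build_version.cfg`. One detail may be worth checking. The file is written with `echo "...\n..."`, and bash's builtin `echo` does not expand `\n` unless `-e` (or the `xpg_echo` option) is set, so if the step runs under bash, the usual default on Linux runners, the file would hold a single line containing a literal `\n`. Below is a minimal sketch of a `printf`-based alternative. `write_build_version` is a hypothetical helper, and the version, date, and commit arguments are placeholders for the output of `poetry version --short`, `date +'%Y%m%d'`, and `git rev-parse HEAD`.

```shell
# Hypothetical helper (not from the PR): write the two-line build metadata file.
# printf interprets "\n" in its format string in every POSIX shell, unlike echo.
write_build_version() {
  version=$1
  stamp=$2
  commit=$3
  dest=$4
  printf 'version:%s-%s\ncommit_id:%s\n' "$version" "$stamp" "$commit" > "$dest"
}

# Placeholder values for illustration only.
tmp=$(mktemp)
write_build_version 0.1.0 20240802 abc1234 "$tmp"
cat "$tmp"
```

Whether the literal `\n` matters depends on how `__build_version.cfg` is read later, which this diff does not show.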
3 changes: 3 additions & 0 deletions .gitignore
@@ -222,3 +222,6 @@ output
 *.local.toml

 localdata/
+model_repository/
+*.tmp
+__*
2 changes: 2 additions & 0 deletions .pre-commit-config.yaml
@@ -85,6 +85,8 @@ repos:
(\/.*?\.[\w:]+)/poetry.lock
args:
[
"--exclude",
"./data/tokenization/qwen.tiktoken",
"--ignore-words-list",
"astroid,gallary,momento,narl,ot,rouge,nin,gere",
]
5 changes: 4 additions & 1 deletion Dockerfile
@@ -13,6 +13,9 @@ COPY . .
 RUN poetry install && rm -rf $POETRY_CACHE_DIR

 FROM python:3.10-slim AS prod
+
+RUN rm -rf /etc/localtime && ln -s /usr/share/zoneinfo/Asia/Harbin /etc/localtime
+
 ENV VIRTUAL_ENV=/app/.venv \
     PATH="/app/.venv/bin:$PATH"

@@ -21,4 +24,4 @@ RUN apt-get update && apt-get install -y libgl1 libglib2.0-0
 WORKDIR /app
 COPY . .
 COPY --from=builder ${VIRTUAL_ENV} ${VIRTUAL_ENV}
-ENTRYPOINT ["pai_rag", "run"]
+CMD ["pai_rag", "run"]
5 changes: 4 additions & 1 deletion Dockerfile_gpu
@@ -15,6 +15,9 @@ RUN mv pyproject_gpu.toml pyproject.toml \
 RUN poetry install && rm -rf $POETRY_CACHE_DIR

 FROM python:3.10-slim AS prod
+
+RUN rm -rf /etc/localtime && ln -s /usr/share/zoneinfo/Asia/Harbin /etc/localtime
+
 ENV VIRTUAL_ENV=/app/.venv \
     PATH="/app/.venv/bin:$PATH"

@@ -23,4 +26,4 @@ RUN apt-get update && apt-get install -y libgl1 libglib2.0-0
 WORKDIR /app
 COPY . .
 COPY --from=builder ${VIRTUAL_ENV} ${VIRTUAL_ENV}
-ENTRYPOINT ["pai_rag", "run"]
+CMD ["pai_rag", "run"]
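Note on the timezone line added to both Dockerfiles: it replaces `/etc/localtime` with a symlink into `/usr/share/zoneinfo`. `ln -s` succeeds even when the target file is missing, so a zone name that the base image does not ship would leave a dangling link without failing the build. As far as I know, `Asia/Harbin` is kept in the tz database only as a backward-compatibility alias of `Asia/Shanghai`, so whether it is present depends on how the image packages tzdata. A minimal sketch of a guarded version follows; `set_timezone` and its `root` parameter are hypothetical, added so the logic can be tried outside a container.

```shell
# Hypothetical helper (not from the PR): link <root>/etc/localtime to a zone,
# but only if the zone file actually exists under <root>/usr/share/zoneinfo.
set_timezone() {
  root=$1
  zone=$2
  if [ ! -e "$root/usr/share/zoneinfo/$zone" ]; then
    echo "zone not found: $zone" >&2
    return 1
  fi
  mkdir -p "$root/etc"
  # The link target is the absolute path as seen from inside the image.
  ln -sfn "/usr/share/zoneinfo/$zone" "$root/etc/localtime"
}

# Try it against a throwaway directory tree that stands in for the image root.
root=$(mktemp -d)
mkdir -p "$root/usr/share/zoneinfo/Asia"
: > "$root/usr/share/zoneinfo/Asia/Shanghai"
set_timezone "$root" Asia/Shanghai && readlink "$root/etc/localtime"
```

In a Dockerfile the same check could run in the `RUN` step with an empty `root`, so that a missing zone fails the build instead of silently falling back to UTC.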
3 changes: 3 additions & 0 deletions Dockerfile_nginx
@@ -0,0 +1,3 @@
FROM nginx:latest
COPY ./nginx/default.conf etc/nginx/conf.d/default.conf
COPY ./nginx/nginx.conf etc/nginx/nginx.conf
27 changes: 27 additions & 0 deletions Dockerfile_ui
@@ -0,0 +1,27 @@
FROM python:3.10-slim AS builder

RUN pip3 install poetry

ENV POETRY_NO_INTERACTION=1 \
    POETRY_VIRTUALENVS_IN_PROJECT=1 \
    POETRY_VIRTUALENVS_CREATE=1 \
    POETRY_CACHE_DIR=/tmp/poetry_cache

WORKDIR /app
COPY . .

RUN poetry install && rm -rf $POETRY_CACHE_DIR

FROM python:3.10-slim AS prod

RUN rm -rf /etc/localtime && ln -s /usr/share/zoneinfo/Asia/Harbin /etc/localtime

ENV VIRTUAL_ENV=/app/.venv \
    PATH="/app/.venv/bin:$PATH"

RUN apt-get update && apt-get install -y libgl1 libglib2.0-0

WORKDIR /app
COPY . .
COPY --from=builder ${VIRTUAL_ENV} ${VIRTUAL_ENV}
CMD ["pai_rag", "ui"]