LLM Finetune with PEFT #99

Merged · 21 commits · May 23, 2024
Binary file added llm-litgpt-finetuning/.assets/model.png
9 changes: 9 additions & 0 deletions llm-litgpt-finetuning/.dockerignore
@@ -0,0 +1,9 @@
*
!/pipelines/**
!/steps/**
!/materializers/**
!/evaluate/**
!/finetune/**
!/generate/**
!/lit_gpt/**
!/scripts/**
15 changes: 15 additions & 0 deletions llm-litgpt-finetuning/LICENSE
@@ -0,0 +1,15 @@
Apache Software License 2.0

Copyright (c) ZenML GmbH 2024. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
128 changes: 128 additions & 0 deletions llm-litgpt-finetuning/README.md
@@ -0,0 +1,128 @@
# ☮️ Fine-tuning open source LLMs using MLOps pipelines with LitGPT

Welcome to your newly generated "ZenML LLM LitGPT Finetuning" project! This is
a great way to get hands-on with ZenML using a production-like template.
The project contains a collection of ZenML steps, pipelines, and other artifacts
and resources that can serve as a solid starting point for finetuning open-source LLMs using ZenML.

Using these pipelines, we can run the data-preparation and model finetuning with a single command while using YAML files for [configuration](https://docs.zenml.io/user-guide/production-guide/configure-pipeline) and letting ZenML take care of tracking our metadata and [containerizing our pipelines](https://docs.zenml.io/user-guide/advanced-guide/infrastructure-management/containerize-your-pipeline).

<div align="center">
<br/>
<a href="https://cloud.zenml.io">
<img alt="Model version metadata" src=".assets/model.png">
</a>
<br/>
</div>
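
As a quick illustration of how these YAML configs are wired in (mirroring what `run.py` in this project does), a pipeline can be given a config file via `with_options` and then invoked. This is only a sketch; the config file name below is one of the files shipped in `configs/`:

```python
import os

from pipelines import llm_lora_finetuning

# Point ZenML at one of the YAML configs in this repository and run the
# finetuning pipeline with caching enabled.
config_path = os.path.join("configs", "finetune-alpaca.yaml")
llm_lora_finetuning.with_options(config_path=config_path, enable_cache=True)()
```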

## :earth_americas: Inspiration and Credit

This project heavily relies on the [Lit-GPT project](https://github.com/Lightning-AI/litgpt) by the amazing people at Lightning AI. We used [this blog post](https://lightning.ai/pages/community/lora-insights/#toc14) to get started with LoRA and QLoRA and modified the commands they recommend to make them work with ZenML.

## 🏃 How to run

In this project we provide a few predefined configuration files for finetuning models on the [Alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca) dataset. Before we're able to run any pipeline, we need to set up our environment as follows:

```bash
# Set up a Python virtual environment, if you haven't already
python3 -m venv .venv
source .venv/bin/activate

# Install requirements
pip install -r requirements.txt
```

### Combined feature engineering and finetuning pipeline

The easiest way to get started is to run the finetuning pipeline with the `finetune-alpaca.yaml` configuration file, which performs both feature engineering and finetuning with a single command:

```shell
python run.py --finetuning-pipeline --config finetune-alpaca.yaml
```

When running the pipeline like this, the trained adapter will be stored in the ZenML artifact store. You can optionally upload the adapter, the merged model, or both by specifying the `adapter_output_repo` and `merged_output_repo` parameters in the configuration file.


### Evaluation pipeline

Before running this pipeline, you will need to fill in the `adapter_repo` parameter in the `eval.yaml` configuration file. This should point to a Hugging Face repository that contains the finetuned adapter produced by the finetuning pipeline.

```shell
python run.py --eval-pipeline --config eval.yaml
```

### Merging pipeline

If you have trained an adapter using the finetuning pipeline, you can merge it with the base model by filling in the `adapter_repo` and `output_repo` parameters in the `merge.yaml` file and then running:

```shell
python run.py --merging-pipeline --config merge.yaml
```

### Feature Engineering followed by Finetuning

If you want to finetune your model on a different dataset, you can do so by running the feature engineering pipeline followed by the finetuning pipeline. To define your dataset, take a look at the `scripts/prepare_*` scripts and set the dataset name in the `feature-alpaca.yaml` config file.

```shell
python run.py --feature-pipeline --config feature-alpaca.yaml
python run.py --finetuning-pipeline --config finetune-from-dataset.yaml
```

## ☁️ Running with a remote stack

To finetune an LLM on remote infrastructure, you can either use a remote orchestrator or a remote step operator. Follow these steps to set up a complete remote stack:
- Register the [orchestrator](https://docs.zenml.io/stacks-and-components/component-guide/orchestrators) (or [step operator](https://docs.zenml.io/stacks-and-components/component-guide/step-operators)) and configure it so that the finetuning step has access to a GPU with at least 24 GB of VRAM. Check out our docs for more [details](https://docs.zenml.io/stacks-and-components/component-guide).
- To access GPUs with this amount of VRAM, you might need to increase your GPU quota ([AWS](https://docs.aws.amazon.com/servicequotas/latest/userguide/request-quota-increase.html), [GCP](https://console.cloud.google.com/iam-admin/quotas), [Azure](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-quotas?view=azureml-api-2#request-quota-and-limit-increases)).
- The GPU instance that your finetuning runs on will have CUDA drivers of a specific version installed. If that CUDA version is not compatible with the one provided by the default Docker image of the finetuning pipeline, you will need to modify it in the configuration file; a sketch of pinning the image is shown after this list. See [here](https://hub.docker.com/r/pytorch/pytorch/tags) for a list of available PyTorch images.
- If you're running out of memory, you can experiment with quantized LoRA (QLoRA) by setting a different value for the `quantize` parameter in the configuration, or reduce the `global_batch_size`/`micro_batch_size`.
- Register a remote [artifact store](https://docs.zenml.io/stacks-and-components/component-guide/artifact-stores) and [container registry](https://docs.zenml.io/stacks-and-components/component-guide/container-registries).
- Register a stack with all these components:
```shell
zenml stack register llm-finetuning-stack -o <ORCHESTRATOR_NAME> \
-a <ARTIFACT_STORE_NAME> \
-c <CONTAINER_REGISTRY_NAME> \
[-s <STEP_OPERATOR_NAME>]
```
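
As referenced in the list above, here is a minimal, hedged sketch of pinning the pipeline's parent Docker image from Python; the same can be achieved directly in the YAML configuration file. The image tag is only an example and should be chosen to match the CUDA drivers on your GPU instance, and the stub pipeline body is illustrative.

```python
from zenml import pipeline
from zenml.config import DockerSettings

# Example only: pick a PyTorch parent image whose CUDA version matches the
# drivers installed on the GPU instance that will run the finetuning step.
docker_settings = DockerSettings(
    parent_image="pytorch/pytorch:2.2.2-cuda11.8-cudnn8-runtime",
    requirements="requirements.txt",
)


@pipeline(settings={"docker": docker_settings})
def llm_lora_finetuning():
    # ... call the feature engineering / finetuning steps here ...
    ...
```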

## 💾 Running with custom data

To finetune a model with your custom data, you will need to convert it to a CSV file with the columns described
[here](https://github.com/Lightning-AI/litgpt/blob/main/tutorials/prepare_dataset.md#preparing-custom-datasets-from-a-csv-file).
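
For example, here is a minimal sketch of assembling such a CSV with pandas; the `instruction`/`input`/`output` column names follow the LitGPT custom-dataset tutorial linked above, and the rows are purely illustrative:

```python
import pandas as pd

# Assemble a tiny instruction-tuning dataset. Each row holds an instruction,
# optional context in `input`, and the expected model response in `output`.
rows = [
    {
        "instruction": "Summarize the following sentence.",
        "input": "ZenML pipelines make ML workflows reproducible.",
        "output": "ZenML makes ML workflows reproducible.",
    },
    {
        "instruction": "Translate to German: Good morning!",
        "input": "",
        "output": "Guten Morgen!",
    },
]

pd.DataFrame(rows).to_csv("my_custom_data.csv", index=False)
```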

Next, update the `configs/feature-custom.yaml` file and set the value of the `csv_path` parameter to that CSV file.
With all that in place, you can now run the feature engineering pipeline to convert your CSV into the correct format for training and then run the finetuning pipeline as follows:
```shell
python run.py --feature-pipeline --config feature-custom.yaml
python run.py --finetuning-pipeline --config finetune-from-dataset.yaml
```

## 📜 Project Structure

The project loosely follows [the recommended ZenML project structure](https://docs.zenml.io/user-guide/starter-guide/follow-best-practices):

```
.
├── configs # pipeline configuration files
│ ├── eval.yaml # configuration for the evaluation pipeline
│ ├── feature-alpaca.yaml # configuration for the feature engineering pipeline
│ ├── feature-custom.yaml # configuration for the feature engineering pipeline
│ ├── finetune-alpaca.yaml # configuration for the finetuning pipeline
│ ├── finetune-from-dataset.yaml # configuration for the finetuning pipeline
│ └── merge.yaml # configuration for the merging pipeline
├── pipelines # `zenml.pipeline` implementations
│ ├── evaluate.py # Evaluation pipeline
│ ├── feature_engineering.py # Feature engineering pipeline
│ ├── finetuning.py # Finetuning pipeline
│ └── merge.py # Merging pipeline
├── steps # logically grouped `zenml.steps` implementations
│ ├── evaluate.py # evaluate model performance
│ ├── feature_engineering.py # preprocess data
│ ├── finetune.py # finetune a model
│ ├── merge.py # merge model and adapter
│ ├── params.py # shared parameters for steps
│ └── utils.py # utility functions
├── .dockerignore
├── README.md # this file
├── requirements.txt # extra Python dependencies
└── run.py # CLI tool to run pipelines on ZenML Stack
```
File renamed without changes.
16 changes: 16 additions & 0 deletions llm-litgpt-finetuning/materializers/__init__.py
@@ -0,0 +1,16 @@
# Apache Software License 2.0
#
# Copyright (c) ZenML GmbH 2024. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
71 changes: 71 additions & 0 deletions llm-litgpt-finetuning/materializers/directory_materializer.py
@@ -0,0 +1,71 @@
# Apache Software License 2.0
#
# Copyright (c) ZenML GmbH 2024. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import os
from pathlib import Path
from tempfile import mkdtemp
from typing import Any, ClassVar, Tuple, Type

from zenml.enums import ArtifactType
from zenml.io import fileio
from zenml.materializers.base_materializer import BaseMaterializer


class DirectoryMaterializer(BaseMaterializer):
"""Materializer to store local directories in the artifact store."""

ASSOCIATED_TYPES: ClassVar[Tuple[Type[Any], ...]] = (Path,)
ASSOCIATED_ARTIFACT_TYPE: ClassVar[ArtifactType] = ArtifactType.DATA

def load(self, data_type: Type[Any]) -> Any:
"""Copy the artifact files to a local temp directory.

Args:
data_type: Unused.

Returns:
Path to the local directory that contains the artifact files.
"""
directory = mkdtemp(prefix="zenml-artifact")
self._copy_directory(src=self.uri, dst=directory)
return Path(directory)

def save(self, data: Any) -> None:
"""Store the directory in the artifact store.

Args:
data: Path to a local directory to store.
"""
assert isinstance(data, Path)
self._copy_directory(src=str(data), dst=self.uri)

@staticmethod
def _copy_directory(src: str, dst: str) -> None:
"""Recursively copy a directory.

Args:
src: The directory to copy.
dst: Where to copy the directory to.
"""
for src_dir, _, files in fileio.walk(src):
dst_dir = os.path.join(dst, os.path.relpath(src_dir, src))
fileio.makedirs(dst_dir)

for file in files:
src_file = os.path.join(src_dir, file)
dst_file = os.path.join(dst_dir, file)
fileio.copy(src_file, dst_file)
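
A hedged usage sketch (not part of this diff): a ZenML step that returns a `Path` can opt into this materializer via `output_materializers`, so the entire directory is persisted to the artifact store. The step name and the file written below are illustrative.

```python
from pathlib import Path
from tempfile import mkdtemp

from zenml import step

from materializers.directory_materializer import DirectoryMaterializer


@step(output_materializers=DirectoryMaterializer)
def export_checkpoint_dir() -> Path:
    """Write example files into a temp directory and return it as an artifact."""
    out_dir = Path(mkdtemp(prefix="checkpoint-"))
    (out_dir / "adapter_config.json").write_text("{}")
    return out_dir
```
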
17 changes: 17 additions & 0 deletions llm-litgpt-finetuning/requirements.txt
@@ -0,0 +1,17 @@
zenml
torch>=2.2.0
lightning @ git+https://github.com/Lightning-AI/lightning@ed367ca675861cdf40dbad2e4d66f7eee2ec50af
jsonargparse[signatures] # CLI
bitsandbytes==0.41.0 # quantization
scipy # required by bitsandbytes
sentencepiece # llama-based models
tokenizers # pythia, falcon, redpajama
datasets # eval
requests # scripts/prepare_*
zstandard # scripts/prepare_redpajama.py, scripts/prepare_starcoder.py
pandas # scripts/prepare_csv.py, scripts/prepare_starcoder.py
pyarrow # scripts/prepare_starcoder.py
# eval
git+https://github.com/EleutherAI/lm-evaluation-harness.git@115206dc89dad67b8beaa90051fb52db77f0a529
# scripts/prepare_slimpajama.py, scripts/prepare_starcoder.py, pretrain/tinyllama.py
lightning[data] @ git+https://github.com/Lightning-AI/lightning@ed367ca675861cdf40dbad2e4d66f7eee2ec50af
131 changes: 131 additions & 0 deletions llm-litgpt-finetuning/run.py
@@ -0,0 +1,131 @@
# Apache Software License 2.0
#
# Copyright (c) ZenML GmbH 2024. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import os
from typing import Optional

import click
from pipelines import (
llm_lora_evaluation,
llm_lora_feature_engineering,
llm_lora_finetuning,
llm_lora_merging,
)
from zenml.logger import get_logger

logger = get_logger(__name__)


@click.command(
help="""
ZenML LLM Finetuning project CLI v0.1.0.

Run the LLM LoRA finetuning pipelines of the ZenML LLM Finetuning project.

Examples:

\b
# Run the feature engineering pipeline
python run.py --feature-pipeline

\b
# Run the finetuning pipeline
python run.py --finetuning-pipeline

\b
# Run the merging pipeline
python run.py --merging-pipeline

\b
# Run the evaluation pipeline
python run.py --eval-pipeline
"""
)
@click.option(
"--config",
type=str,
default=None,
help="Path to the YAML config file.",
)
@click.option(
"--feature-pipeline",
is_flag=True,
default=False,
help="Whether to run the pipeline that creates the dataset.",
)
@click.option(
"--finetuning-pipeline",
is_flag=True,
default=False,
help="Whether to run the pipeline that finetunes the model.",
)
@click.option(
"--merging-pipeline",
is_flag=True,
default=False,
help="Whether to run the pipeline that merges the model and adapter.",
)
@click.option(
"--eval-pipeline",
is_flag=True,
default=False,
help="Whether to run the pipeline that evaluates the model.",
)
@click.option(
"--no-cache",
is_flag=True,
default=False,
help="Disable caching for the pipeline run.",
)
def main(
config: Optional[str] = None,
feature_pipeline: bool = False,
finetuning_pipeline: bool = False,
merging_pipeline: bool = False,
eval_pipeline: bool = False,
no_cache: bool = False,
):
"""Main entry point for the pipeline execution.

Args:
no_cache: If `True` cache will be disabled.
"""
config_folder = os.path.join(
os.path.dirname(os.path.realpath(__file__)),
"configs",
)
pipeline_args = {"enable_cache": not no_cache}
if not config:
raise RuntimeError("Config file is required to run a pipeline.")

pipeline_args["config_path"] = os.path.join(config_folder, config)

if feature_pipeline:
llm_lora_feature_engineering.with_options(**pipeline_args)()

if finetuning_pipeline:
llm_lora_finetuning.with_options(**pipeline_args)()

if merging_pipeline:
llm_lora_merging.with_options(**pipeline_args)()

if eval_pipeline:
llm_lora_evaluation.with_options(**pipeline_args)()


if __name__ == "__main__":
main()