diff --git a/.github/actions/nlp_template_test/action.yml b/.github/actions/nlp_template_test/action.yml index 8b78d56..600e7a5 100644 --- a/.github/actions/nlp_template_test/action.yml +++ b/.github/actions/nlp_template_test/action.yml @@ -69,13 +69,14 @@ runs: - name: Concatenate requirements shell: bash run: | - zenml integration export-requirements -o ./local_checkout/integration-requirements.txt sklearn mlflow s3 kubernetes kubeflow slack evidently + zenml integration export-requirements -o ./local_checkout/integration-requirements.txt mlflow s3 kubernetes kubeflow discord aws huggingface pytorch cat ./local_checkout/requirements.txt ./local_checkout/test-requirements.txt ./local_checkout/integration-requirements.txt >> ./local_checkout/all-requirements.txt - name: Install requirements shell: bash run: | pip install -r ./local_checkout/all-requirements.txt + pip install accelerate torchvision - name: Run pytests shell: bash @@ -83,3 +84,8 @@ runs: ZENML_STACK_NAME: ${{ inputs.stack-name }} run: | pytest ./local_checkout/tests + + - name: Clean-up + shell: bash + run: | + rm -rf ./local_checkout diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index 8b0da42..29fa847 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -4,7 +4,7 @@ on: workflow_dispatch: workflow_call: push: - branches: ["main", "develop"] + branches: ["main"] paths-ignore: ["README.md"] pull_request: paths-ignore: ["README.md"] @@ -35,3 +35,5 @@ jobs: with: stack-name: ${{ matrix.stack-name }} python-version: ${{ matrix.python-version }} + ref-zenml: develop + ref-template: ${{ github.ref }} diff --git a/.github/workflows/image-optimizer.yml b/.github/workflows/image-optimizer.yml new file mode 100644 index 0000000..dddbd1e --- /dev/null +++ b/.github/workflows/image-optimizer.yml @@ -0,0 +1,26 @@ +name: Compress Images +on: + pull_request: + # Run Image Actions when JPG, JPEG, PNG or WebP files are added or changed. 
+ # See https://help.github.com/en/actions/automating-your-workflow-with-github-actions/workflow-syntax-for-github-actions#onpushpull_requestpaths for reference. + paths: + - '**.jpg' + - '**.jpeg' + - '**.png' + - '**.webp' +jobs: + build: + # Only run on non-draft PRs within the same repository. + if: github.event.pull_request.head.repo.full_name == github.repository && github.event.pull_request.draft == false + name: calibreapp/image-actions + runs-on: ubuntu-latest + steps: + - name: Checkout Repo + uses: actions/checkout@v3 + + - name: Compress Images + uses: calibreapp/image-actions@main + with: + # The `GITHUB_TOKEN` is automatically generated by GitHub and scoped only to the repository that is currently running the action. By default, the action can’t update Pull Requests initiated from forked repositories. + # See https://docs.github.com/en/actions/reference/authentication-in-a-workflow and https://help.github.com/en/articles/virtual-environments-for-github-actions#token-permissions + githubToken: ${{ secrets.GITHUB_TOKEN }} diff --git a/README.md b/README.md index bc47aca..b27591c 100644 --- a/README.md +++ b/README.md @@ -43,7 +43,7 @@ The template can be configured using the following parameters: | Deploy to HuggingFace | Whether to deploy to HuggingFace Hub | False | | Deploy to SkyPilot | Whether to deploy to SkyPilot | False | | Dataset | The dataset to use from HuggingFace Datasets | airline_reviews | -| Model | The model to use from HuggingFace Models | roberta-base | +| Model | The model to use from HuggingFace Models | distilbert-base-uncased | | Cloud Provider | The cloud provider to use (AWS or GCP) | aws | | Metric-Based Promotion | Whether to promote models based on metrics | True | | Notifications on Failure | Whether to notify about pipeline failures | True | @@ -66,6 +66,10 @@ For more details, check the `README.md` file in the generated project directory. This NLP project template includes three main pipelines: +
+ +
+ ### Training Pipeline The training pipeline is designed to handle the end-to-end process of training an NLP model. It includes steps for data loading, tokenization, model training, and model registration. The pipeline is parameterized to allow for customization of the training process, such as sequence length, batch size, and learning rate. @@ -113,24 +117,17 @@ The training pipeline is the heart of the NLP project. It is responsible for pre The training pipeline is configured using the `{{product_name}}_training_pipeline` function, which includes steps for data loading, tokenization, model training, and model registration. The pipeline can be customized with parameters such as `lower_case`, `padding`, `max_seq_length`, and others to tailor the tokenization and training process to your specific NLP use case. -### Training Pipeline: Data and Tokenization +### Training Pipeline -[📁 Code folder](template/steps/data_tokenization/) +[📁 Code folder](template/steps/model_training/)-
+ The first stage of the training pipeline involves loading the dataset and preparing it for the model. The `data_loader` step fetches the dataset, which is then passed to the `tokenizer_loader` and `tokenization_step` to convert the raw text data into a format suitable for the NLP model. Tokenization is a critical step in NLP pipelines, as it converts text into tokens that the model can understand. The tokenizer can be configured to handle case sensitivity, padding strategies, and sequence lengths, ensuring that the input data is consistent and optimized for training. -### Training Pipeline: Model Training - -[📁 Code folder](template/steps/model_training/) --
- Once the data is tokenized, the `model_trainer` step takes over to train the NLP model. This step utilizes the tokenized dataset and the tokenizer itself to fine-tune the model on the specific task, such as sentiment analysis, text classification, or named entity recognition. The model training step can be configured with parameters like `train_batch_size`, `eval_batch_size`, `num_epochs`, `learning_rate`, and `weight_decay` to control the training process. After training, the model is evaluated, and if it meets the quality criteria, it is registered in the model registry with a unique name. @@ -139,7 +136,7 @@ The model training step can be configured with parameters like `train_batch_size [📁 Code folder](template/steps/promotion/)-
+ The promotion pipeline is responsible for promoting the best model to the chosen stage, such as Production or Staging. The pipeline can be configured to promote models based on metric comparison or simply promote the latest model version. @@ -150,7 +147,7 @@ The `{{product_name}}_promote_pipeline` function orchestrates the promotion proc [📁 Code folder](template/steps/deployment/)-
The deployment pipeline handles the deployment of the model to various environments. It can be configured to deploy locally, to HuggingFace Hub, or to SkyPilot, depending on the project's needs. diff --git a/assets/deploy_pipeline.png b/assets/deploy_pipeline.png new file mode 100644 index 0000000..4bd3a6e Binary files /dev/null and b/assets/deploy_pipeline.png differ diff --git a/assets/full_template.png b/assets/full_template.png new file mode 100644 index 0000000..fe685f7 Binary files /dev/null and b/assets/full_template.png differ diff --git a/assets/promote_pipeline.png b/assets/promote_pipeline.png new file mode 100644 index 0000000..7621aa0 Binary files /dev/null and b/assets/promote_pipeline.png differ diff --git a/assets/training_pipeline.png b/assets/training_pipeline.png new file mode 100644 index 0000000..eb7ec06 Binary files /dev/null and b/assets/training_pipeline.png differ diff --git a/copier.yml b/copier.yml index e073e48..975b345 100644 --- a/copier.yml +++ b/copier.yml @@ -64,7 +64,11 @@ accelerator: choices: - gpu - cpu - default: gpu + default: cpu +sample_rate: + type: bool + help: "Whether to use a sample of the dataset for quick iteration" + default: False deploy_locally: type: bool help: "Whether to deploy locally" @@ -91,8 +95,8 @@ model: choices: - bert-base-uncased - roberta-base - - distilbert-base-cased - default: roberta-base + - distilbert-base-uncased + default: distilbert-base-uncased cloud_of_choice: type: str help: "Whether to use AWS cloud provider or GCP" diff --git a/template/config.yaml b/template/config.yaml index 9de1bb5..bde6901 100644 --- a/template/config.yaml +++ b/template/config.yaml @@ -26,7 +26,7 @@ settings: - zenml[server] extra: - mlflow_model_name: nlp_use_case_model + mlflow_model_name: sentiment_analysis {%- if target_environment == 'production' %} target_env: production {%- else %} diff --git a/template/pipelines/training.py b/template/pipelines/training.py index 5406fdc..ab7ca70 100644 --- 
a/template/pipelines/training.py +++ b/template/pipelines/training.py @@ -87,7 +87,7 @@ def {{product_name}}_training_pipeline( register_model( model=model, tokenizer=tokenizer, - mlflow_model_name="{{product_name}}_model", + mlflow_model_name="sentiment_analysis", ) notify_on_success(after=["register_model"]) diff --git a/template/run.py b/template/run.py index 537db3a..2ded9ce 100644 --- a/template/run.py +++ b/template/run.py @@ -186,7 +186,6 @@ def main( name=zenml_model_name, license="{{open_source_license}}", description="Show case Model Control Plane.", - create_new_model_version=True, delete_new_version_on_failure=True, tags=["sentiment_analysis", "huggingface"], ) @@ -202,7 +201,7 @@ def main( # Execute Promoting Pipeline if promoting_pipeline: run_args_promoting = {} - model_config = ModelConfig(name=zenml_model_name) + model_config = ModelConfig(name=zenml_model_name, version=ModelStages.LATEST) pipeline_args["model_config"] = model_config pipeline_args[ "run_name" diff --git a/template/steps/dataset_loader/data_loader.py b/template/steps/dataset_loader/data_loader.py index 9c6cda7..529da76 100644 --- a/template/steps/dataset_loader/data_loader.py +++ b/template/steps/dataset_loader/data_loader.py @@ -4,6 +4,9 @@ from datasets import load_dataset, DatasetDict from zenml import step from zenml.logger import get_logger +{%- if sample_rate %} +import numpy as np +{%- endif %} logger = get_logger(__name__) @@ -41,6 +44,19 @@ def data_loader( dataset = dataset.remove_columns(["airline_sentiment_confidence","negativereason_confidence"]) {%- endif %} + {%- if sample_rate %} + # Sample 20% of the data randomly for the demo + def sample_dataset(dataset, sample_rate=0.2): + sampled_dataset = DatasetDict() + for split in dataset.keys(): + split_size = len(dataset[split]) + indices = np.random.choice(split_size, int(split_size * sample_rate), replace=False) + sampled_dataset[split] = dataset[split].select(indices) + return sampled_dataset + + dataset = 
sample_dataset(dataset) + {%- endif %} + # Log the dataset and sample examples logger.info(dataset) logger.info(f"Sample Example 1 : {dataset['train'][0]['text']} with label {dataset['train'][0]['label']}") diff --git a/template/steps/deploying/save_model.py b/template/steps/deploying/save_model.py index 49c525a..2970a06 100644 --- a/template/steps/deploying/save_model.py +++ b/template/steps/deploying/save_model.py @@ -3,7 +3,6 @@ from zenml import get_step_context, step from zenml.client import Client -from zenml.enums import ModelStages from zenml.logger import get_logger # Initialize logger diff --git a/template/steps/deploying/{% if deploy_locally %}local_deployment.py{% endif %} b/template/steps/deploying/{% if deploy_locally %}local_deployment.py{% endif %} index 82c095a..9fb332d 100644 --- a/template/steps/deploying/{% if deploy_locally %}local_deployment.py{% endif %} +++ b/template/steps/deploying/{% if deploy_locally %}local_deployment.py{% endif %} @@ -40,7 +40,7 @@ def deploy_locally( The process ID of the Gradio app. """ ### ADD YOUR OWN CODE HERE - THIS IS JUST AN EXAMPLE ### - def start_gradio_app(command: list[str]) -> int: + def start_gradio_app(command: List[str]) -> int: """ Start the Gradio app in a separate process. 
diff --git a/template/steps/promotion/{% if not metric_compare_promotion %}promote_current.py{% endif %} b/template/steps/promotion/{% if not metric_compare_promotion %}promote_current.py{% endif %} index 5d36808..79a2349 100644 --- a/template/steps/promotion/{% if not metric_compare_promotion %}promote_current.py{% endif %} +++ b/template/steps/promotion/{% if not metric_compare_promotion %}promote_current.py{% endif %} @@ -20,6 +20,7 @@ def promote_current(): """ ### ADD YOUR OWN CODE HERE - THIS IS JUST AN EXAMPLE ### + pipeline_extra = get_step_context().pipeline_run.config.extra logger.info(f"Promoting current model version") model_config = get_step_context().model_config model_version = model_config._get_model_version() diff --git a/template/steps/registrer/model_log_register.py b/template/steps/registrer/model_log_register.py index eba013c..c8e22c1 100644 --- a/template/steps/registrer/model_log_register.py +++ b/template/steps/registrer/model_log_register.py @@ -32,7 +32,7 @@ def register_model( model: PreTrainedModel, tokenizer: PreTrainedTokenizerBase, - mlflow_model_name: Optional[str] = "model", + mlflow_model_name: Optional[str] = "sentiment_analysis", ): """ Register model to MLFlow. diff --git a/template/steps/training/model_trainer.py b/template/steps/training/model_trainer.py index d5b5143..fd9142e 100644 --- a/template/steps/training/model_trainer.py +++ b/template/steps/training/model_trainer.py @@ -46,7 +46,7 @@ def model_trainer( load_best_model_at_end: Optional[bool] = True, eval_batch_size: Optional[int] = 16, weight_decay: Optional[float] = 0.01, - mlflow_model_name: Optional[str] = "model", + mlflow_model_name: Optional[str] = "sentiment_analysis", ) -> Tuple[Annotated[PreTrainedModel, "model", ModelArtifactConfig(overwrite=True)], Annotated[PreTrainedTokenizerBase, "tokenizer", ModelArtifactConfig(overwrite=True)]]: """ Configure and train a model on the training dataset. 
@@ -105,7 +105,7 @@ def model_trainer( evaluation_strategy='steps', save_strategy='steps', save_steps=1000, - eval_steps=200, + eval_steps=100, logging_steps=logging_steps, save_total_limit=5, report_to="mlflow", diff --git a/template/utils/misc.py b/template/utils/misc.py index 98c8b5f..7125f99 100644 --- a/template/utils/misc.py +++ b/template/utils/misc.py @@ -1,12 +1,12 @@ # {% include 'template/license_header' %} -from typing import Dict +from typing import Dict, Tuple, List import numpy as np from datasets import load_metric -def compute_metrics(eval_pred: tuple[np.ndarray, np.ndarray]) -> Dict[str, float]: +def compute_metrics(eval_pred: Tuple[np.ndarray, np.ndarray]) -> Dict[str, float]: """Compute the metrics for the model. Args: @@ -34,7 +34,7 @@ def compute_metrics(eval_pred: tuple[np.ndarray, np.ndarray]) -> Dict[str, float } -def find_max_length(dataset: list[str]) -> int: +def find_max_length(dataset: List[str]) -> int: """Find the maximum length of the dataset. The dataset is a list of strings which are the text samples. diff --git a/tests/conftest.py b/tests/conftest.py new file mode 100644 index 0000000..aaec33c --- /dev/null +++ b/tests/conftest.py @@ -0,0 +1,102 @@ +# Copyright (c) ZenML GmbH 2023. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at: +# +# https://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express +# or implied. See the License for the specific language governing +# permissions and limitations under the License. 
+ + +import os +import shutil +from typing import Generator + +import pytest +from zenml.client import Client +from zenml.config.global_config import GlobalConfiguration +from zenml.constants import ENV_ZENML_CONFIG_PATH +from zenml.enums import StackComponentType + + +def configure_stack(): + stack_name = os.environ.get("ZENML_STACK_NAME", "local") + zenml_client = Client() + + if stack_name == "local": + components = {} + for component in [ + ("mlflow_local", "mlflow", StackComponentType.EXPERIMENT_TRACKER), + ("mlflow_local", "mlflow", StackComponentType.MODEL_REGISTRY), + ("local", "local", StackComponentType.ORCHESTRATOR), + ("local", "local", StackComponentType.ARTIFACT_STORE), + ]: + zenml_client.create_stack_component(*component, {}) + components[component[2]] = component[0] + zenml_client.create_stack("local", components=components) + zenml_client.activate_stack("local") + else: + raise RuntimeError(f"Stack {stack_name} not supported") + + +@pytest.fixture(scope="module") +def clean_zenml_client( + tmp_path_factory: pytest.TempPathFactory, +) -> Generator[Client, None, None]: + """Context manager to initialize and use a clean local default ZenML client. + + This context manager creates a clean ZenML client with its own global + configuration and local database. + + Args: + tmp_path_factory: A pytest fixture that provides a temporary directory. + + Yields: + A clean ZenML client. 
+ """ + # save the current global configuration and client singleton instances + # to restore them later, then reset them + orig_cwd = os.getcwd() + original_config = GlobalConfiguration.get_instance() + original_client = Client.get_instance() + orig_config_path = os.getenv("ZENML_CONFIG_PATH") + + GlobalConfiguration._reset_instance() + Client._reset_instance() + + # change the working directory to a fresh temp path + tmp_path = tmp_path_factory.mktemp("pytest-clean-client") + os.chdir(tmp_path) + + os.environ[ENV_ZENML_CONFIG_PATH] = str(tmp_path / "zenml") + os.environ["ZENML_ANALYTICS_OPT_IN"] = "false" + + # initialize the global config client and store at the new path + gc = GlobalConfiguration() + gc.analytics_opt_in = False + client = Client() + _ = client.zen_store + + # prepare stack configuration + configure_stack() + + yield client + + # restore the global configuration path + if orig_config_path: + os.environ[ENV_ZENML_CONFIG_PATH] = orig_config_path + else: + del os.environ[ENV_ZENML_CONFIG_PATH] + + # restore the global configuration and the client + GlobalConfiguration._reset_instance(original_config) + Client._reset_instance(original_client) + + # remove all traces, and change working directory back to base path + os.chdir(orig_cwd) + shutil.rmtree(str(tmp_path)) diff --git a/tests/test_template.py b/tests/test_template.py new file mode 100644 index 0000000..43574d7 --- /dev/null +++ b/tests/test_template.py @@ -0,0 +1,210 @@ +# Copyright (c) ZenML GmbH 2023. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at: +# +# https://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express +# or implied. 
See the License for the specific language governing +# permissions and limitations under the License. + + +import os +import pathlib +import platform +import shutil +import subprocess +import sys +from typing import Optional + +import pytest +from copier import Worker +from zenml.client import Client +from zenml.enums import ExecutionStatus + +TEMPLATE_DIRECTORY = str(pathlib.Path.joinpath(pathlib.Path(__file__).parent.parent)) + + +def generate_and_run_project( + tmp_path_factory: pytest.TempPathFactory, + open_source_license: Optional[str] = "apache", + product_name: str = "nlp_case_pytest", + metric_compare_promotion: bool = True, + target_environment: str = "staging", + notify_on_failures: bool = True, + notify_on_successes: bool = False, + sample_rate: bool = True, + model: str = "distilbert-base-uncased", + zenml_server_url: str = "", + accelerator: str = "cpu", + deploy_locally: bool = True, + deploy_to_huggingface: bool = False, + deploy_to_skypilot: bool = False, + cloud_of_choice: str = "gcp", + dataset: str = "airline_reviews", + zenml_model_name: str = "sentiment_analysis", + +): + """Generate and run the starter project with different options.""" + + answers = { + "project_name": "Pytest Templated Project", + "version": "0.0.1", + "open_source_license": str(open_source_license).lower(), + "product_name": product_name, + "metric_compare_promotion": metric_compare_promotion, + "target_environment": target_environment, + "notify_on_failures": notify_on_failures, + "notify_on_successes": notify_on_successes, + "zenml_server_url": zenml_server_url, + "sample_rate": sample_rate, + "model": model, + "accelerator": accelerator, + "deploy_locally": deploy_locally, + "deploy_to_huggingface": deploy_to_huggingface, + "deploy_to_skypilot": deploy_to_skypilot, + "cloud_of_choice": cloud_of_choice, + "dataset": dataset, + } + if open_source_license: + answers["email"] = "pytest@zenml.io" + answers["full_name"] = "Pytest" + + # generate the template in a temp path + 
current_dir = os.getcwd() + dst_path = tmp_path_factory.mktemp("pytest-template") + print("TEMPLATE_DIR:", TEMPLATE_DIRECTORY) + print("dst_path:", dst_path) + print("current_dir:", current_dir) + os.chdir(str(dst_path)) + with Worker( + src_path=TEMPLATE_DIRECTORY, + dst_path=str(dst_path), + data=answers, + unsafe=True, + vcs_ref="HEAD", + ) as worker: + worker.run_copy() + + # MLFlow Deployer not supported on Windows + # MLFlow `service daemon is not running` error on MacOS + if platform.system().lower() not in ["windows"]: + # run the project + call = [sys.executable, "run.py"] + + try: + subprocess.check_output( + call, + cwd=str(dst_path), + env=os.environ.copy(), + stderr=subprocess.STDOUT, + ) + except subprocess.CalledProcessError as e: + raise RuntimeError( + f"Failed to run project generated with parameters: {answers}\n" + f"{e.output.decode()}" + ) from e + + # check the pipeline run is successful + for pipeline_suffix in ["_training_pipeline", "_promote_pipeline"]: + pipeline = Client().get_pipeline(product_name + pipeline_suffix) + assert pipeline + runs = pipeline.runs + assert len(runs) == 1 + assert runs[0].status == ExecutionStatus.COMPLETED + + # clean up + Client().delete_pipeline(product_name + pipeline_suffix) + Client().delete_model(zenml_model_name) + Client().active_stack.model_registry.delete_model(zenml_model_name) + + os.chdir(current_dir) + shutil.rmtree(dst_path) + + +@pytest.mark.parametrize("open_source_license", ["mit", None], ids=["oss", "css"]) +def test_generate_license( + clean_zenml_client, + tmp_path_factory: pytest.TempPathFactory, + open_source_license: Optional[str], +): + """Test generating licenses.""" + + generate_and_run_project( + tmp_path_factory=tmp_path_factory, + open_source_license=open_source_license, + ) + + +def test_custom_product_name( + clean_zenml_client, + tmp_path_factory: pytest.TempPathFactory, +): + """Test using custom pipeline name.""" + + generate_and_run_project( + 
+ tmp_path_factory=tmp_path_factory, + product_name="custom_product_name", + ) + + +def test_latest_promotion( + clean_zenml_client, + tmp_path_factory: pytest.TempPathFactory, +): + """Test using latest promotion.""" + + generate_and_run_project( + tmp_path_factory=tmp_path_factory, metric_compare_promotion=False + ) + +def test_production_environment( + clean_zenml_client, + tmp_path_factory: pytest.TempPathFactory, +): + """Test deploying to production stage.""" + + generate_and_run_project( + tmp_path_factory=tmp_path_factory, + target_environment="production", + ) + + +def test_no_notify_on_failure( + clean_zenml_client, + tmp_path_factory: pytest.TempPathFactory, +): + """Test skipping notification on failure.""" + + generate_and_run_project( + tmp_path_factory=tmp_path_factory, + notify_on_failures=False, + ) + + +def test_notify_on_success( + clean_zenml_client, + tmp_path_factory: pytest.TempPathFactory, +): + """Test sending notification on success.""" + + generate_and_run_project( + tmp_path_factory=tmp_path_factory, + notify_on_successes=True, + ) + + +def test_custom_zenml_server_url( + clean_zenml_client, + tmp_path_factory: pytest.TempPathFactory, +): + """Test using a custom ZenML server URL.""" + + generate_and_run_project( + tmp_path_factory=tmp_path_factory, + zenml_server_url="foo", + )
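
The `sample_dataset` helper added in the `data_loader.py` hunk keeps a fixed fraction of every split, drawn without replacement. A minimal standalone sketch of the same idea follows. It is not the template's code: it assumes plain Python lists in place of a `DatasetDict`, uses the standard library's `random` module in place of NumPy, and the name `sample_splits` and the `seed` parameter are hypothetical additions for illustration.

```python
import random
from typing import Dict, List, Optional


def sample_splits(
    splits: Dict[str, List[str]],
    sample_rate: float = 0.2,
    seed: Optional[int] = None,
) -> Dict[str, List[str]]:
    """Keep a random `sample_rate` fraction of every split, without replacement."""
    # `seed` is a hypothetical extra (not in the diff) that makes runs repeatable.
    rng = random.Random(seed)
    sampled: Dict[str, List[str]] = {}
    for name, rows in splits.items():
        # int() truncates, mirroring int(split_size * sample_rate) in the diff.
        keep = int(len(rows) * sample_rate)
        # rng.sample never repeats an index, so no row is picked twice.
        indices = rng.sample(range(len(rows)), keep)
        sampled[name] = [rows[i] for i in indices]
    return sampled


splits = {
    "train": [f"train review {i}" for i in range(50)],
    "test": [f"test review {i}" for i in range(10)],
}
subset = sample_splits(splits, sample_rate=0.2, seed=0)
print({name: len(rows) for name, rows in subset.items()})
```

Because the count is truncated, a split with fewer than `1 / sample_rate` rows ends up empty, which may matter for very small evaluation splits.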