Merge pull request #86 from dudeperf3ct/feature/zencoder-huggingface-model-deployer

HuggingFace Endpoint Inference Model Deployer
htahir1 authored Jan 30, 2024
2 parents 7d1fa76 + ec4f16a commit c254ccc
Showing 11 changed files with 914 additions and 185 deletions.
47 changes: 43 additions & 4 deletions llm-finetuning/README.md
@@ -78,13 +78,51 @@ python run.py --training-pipeline --config finetune_gcp.yaml

# Deployment
python run.py --deployment-pipeline --config <NAME_OF_CONFIG_IN_CONFIGS_FOLDER>
python run.py --deployment-pipeline --config deployment_a100.yaml
```

The `feature_engineering` and `deployment` pipelines can be run using the `default` stack, but the [stack](https://docs.zenml.io/user-guide/production-guide/understand-stacks) used by the training pipeline will depend on the config.

The `deployment` pipeline relies on the `training_pipeline` having been run first.

## :cloud: Deployment

We have created a custom ZenML model deployer for deploying models to Hugging Face Inference Endpoints. The code for the custom deployer lives in the [huggingface](./huggingface/) folder.

To run the deployment pipeline, we create a custom ZenML stack. Because we are using a custom model deployer, we have to register its flavor as well as the model deployer itself, and then update the stack to use them:

```bash
zenml init
zenml stack register zencoder_hf_stack -o default -a default
zenml stack set zencoder_hf_stack
export HUGGINGFACE_USERNAME=<YOUR_HF_USERNAME>
export HUGGINGFACE_TOKEN=<YOUR_HF_TOKEN>
export NAMESPACE=<YOUR_HF_NAMESPACE>
zenml secret create huggingface_creds --username=$HUGGINGFACE_USERNAME --token=$HUGGINGFACE_TOKEN
zenml model-deployer flavor register huggingface.hf_model_deployer_flavor.HuggingFaceModelDeployerFlavor
```

Afterward, you should see the new flavor in the list of available flavors:

```bash
zenml model-deployer flavor list
```
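
The flavor path registered above points at the implementation shipped in the [huggingface](./huggingface/) folder. For orientation, a minimal model deployer flavor in ZenML typically looks like the sketch below. This is an illustration of the pattern, assuming ZenML's standard custom-flavor interface, not the exact code from this repository (in particular, the `hf_model_deployer` module name is an assumption):

```python
from typing import Type

from zenml.model_deployers.base_model_deployer import (
    BaseModelDeployer,
    BaseModelDeployerConfig,
    BaseModelDeployerFlavor,
)


class HuggingFaceModelDeployerConfig(BaseModelDeployerConfig):
    """Deployer settings; `token` and `namespace` mirror the CLI flags used below."""

    token: str
    namespace: str


class HuggingFaceModelDeployerFlavor(BaseModelDeployerFlavor):
    @property
    def name(self) -> str:
        # The flavor name passed to `zenml model-deployer register --flavor=...`.
        return "hfendpoint"

    @property
    def config_class(self) -> Type[HuggingFaceModelDeployerConfig]:
        return HuggingFaceModelDeployerConfig

    @property
    def implementation_class(self) -> Type[BaseModelDeployer]:
        # Imported lazily so registering the flavor does not pull in heavy deps.
        from huggingface.hf_model_deployer import HuggingFaceModelDeployer

        return HuggingFaceModelDeployer
```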

Register the model deployer component and point the current stack at it:

```bash
zenml model-deployer register hfendpoint --flavor=hfendpoint --token=$HUGGINGFACE_TOKEN --namespace=$NAMESPACE
zenml stack update zencoder_hf_stack -d hfendpoint
```
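
As a quick sanity check, you can confirm from Python that the active stack now carries the custom deployer. A small snippet, assuming ZenML's client API exposes the active stack this way:

```python
from zenml.client import Client

# Inspect the active stack; the model-deployer slot should now hold the
# "hfendpoint" component registered above.
stack = Client().active_stack
print(stack.model_deployer)
```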

Run the deployment pipeline using the CLI:

```shell
# Deployment
python run.py --deployment-pipeline --config <NAME_OF_CONFIG_IN_CONFIGS_FOLDER>
python run.py --deployment-pipeline --config deployment_a100.yaml
```
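
Under the hood, the deployment step hands the `hf_endpoint_cfg` parameters from the config file to the deployer, which provisions a Hugging Face Inference Endpoint. A rough sketch of that provisioning call via `huggingface_hub`'s public API follows; the endpoint name and model repository are illustrative placeholders, and the exact wiring in this repository may differ:

```python
from huggingface_hub import create_inference_endpoint

# Values mirror configs/deployment_a100.yaml; the "zencoder" names are placeholders.
endpoint = create_inference_endpoint(
    "zencoder-endpoint",
    repository="zenml/zencoder-peft-model",  # assumed fine-tuned model repo
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    instance_size="xlarge",
    instance_type="p4de",
    namespace="zenml",
    min_replica=0,
    max_replica=1,
    type="public",
    token="<HUGGINGFACE_TOKEN>",
)
endpoint.wait()   # block until the endpoint reports "running"
print(endpoint.url)
```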

## 🥇Recent developments

A working prototype was trained and deployed as of Jan 19, 2024. The model was finetuned on minimal data using QLoRA and PEFT, and trained on a single A100 GPU in the cloud:
@@ -114,16 +152,17 @@ This project recently did a [call of volunteers](https://www.linkedin.com/feed/u
- [x] Create a functioning training pipeline.
- [ ] Curate a set of 5–10 repositories that use the latest ZenML syntax, and use the data generation pipeline to push a dataset to Hugging Face.
- [ ] Create a Dockerfile for the training pipeline with all requirements installed, including ZenML, torch, CUDA, etc. Currently I am having trouble creating this in this [config file](configs/finetune_local.yaml). It probably makes sense to create a Docker image with the right CUDA version and requirements, including ZenML. See here: https://sdkdocs.zenml.io/0.54.0/integration_code_docs/integrations-aws/#zenml.integrations.aws.flavors.sagemaker_step_operator_flavor.SagemakerStepOperatorSettings

- [ ] Test the trained model on various metrics.
- [ ] Create a custom [model deployer](https://docs.zenml.io/stacks-and-components/component-guide/model-deployers) that deploys a Hugging Face model from the Hub to a Hugging Face Inference Endpoint. This involves creating a [custom model deployer](https://docs.zenml.io/stacks-and-components/component-guide/model-deployers/custom) and editing the [deployment pipeline](pipelines/deployment.py) accordingly.

## :bulb: More Applications

While the work here is solely based on the task of finetuning the model for the ZenML library, the pipeline can be changed with minimal effort to point to any set of repositories on GitHub. Theoretically, one could extend this work to point to proprietary codebases to learn from them for any use-case.

For example, see how [VMWare fine-tuned StarCoder to learn their style](https://octo.vmware.com/fine-tuning-starcoder-to-learn-vmwares-coding-style/).

Also, make sure to join our <a href="https://zenml.io/slack" target="_blank">
<img width="15" src="https://cdn3.iconfinder.com/data/icons/logos-and-brands-adobe/512/306_Slack-512.png" alt="Slack"/>
<b>Slack Community</b>
</a> to become part of the ZenML family!
37 changes: 19 additions & 18 deletions llm-finetuning/configs/deployment_a10.yaml
@@ -10,21 +10,22 @@ model:
 steps:
   deploy_model_to_hf_hub:
     parameters:
-      framework: pytorch
-      task: text-generation
-      accelerator: gpu
-      vendor: aws
-      region: us-east-1
-      max_replica: 1
-      instance_size: xxlarge
-      instance_type: g5.12xlarge
-      namespace: zenml
-      custom_image:
-        health_route: /health
-        env:
-          MAX_BATCH_PREFILL_TOKENS: "2048"
-          MAX_INPUT_LENGTH: "1024"
-          MAX_TOTAL_TOKENS: "1512"
-          QUANTIZE: bitsandbytes
-          MODEL_ID: /repository
-        url: registry.internal.huggingface.tech/api-inference/community/text-generation-inference:sha-564f2a3
+      hf_endpoint_cfg:
+        framework: pytorch
+        task: text-generation
+        accelerator: gpu
+        vendor: aws
+        region: us-east-1
+        max_replica: 1
+        instance_size: xxlarge
+        instance_type: g5.12xlarge
+        namespace: zenml
+        custom_image:
+          health_route: /health
+          env:
+            MAX_BATCH_PREFILL_TOKENS: "2048"
+            MAX_INPUT_LENGTH: "1024"
+            MAX_TOTAL_TOKENS: "1512"
+            QUANTIZE: bitsandbytes
+            MODEL_ID: /repository
+          url: registry.internal.huggingface.tech/api-inference/community/text-generation-inference:sha-564f2a3
37 changes: 19 additions & 18 deletions llm-finetuning/configs/deployment_a100.yaml
@@ -10,21 +10,22 @@ model:
 steps:
   deploy_model_to_hf_hub:
     parameters:
-      framework: pytorch
-      task: text-generation
-      accelerator: gpu
-      vendor: aws
-      region: us-east-1
-      max_replica: 1
-      instance_size: xlarge
-      instance_type: p4de
-      namespace: zenml
-      custom_image:
-        health_route: /health
-        env:
-          MAX_BATCH_PREFILL_TOKENS: "2048"
-          MAX_INPUT_LENGTH: "1024"
-          MAX_TOTAL_TOKENS: "1512"
-          QUANTIZE: bitsandbytes
-          MODEL_ID: /repository
-        url: registry.internal.huggingface.tech/api-inference/community/text-generation-inference:sha-564f2a3
+      hf_endpoint_cfg:
+        framework: pytorch
+        task: text-generation
+        accelerator: gpu
+        vendor: aws
+        region: us-east-1
+        max_replica: 1
+        instance_size: xlarge
+        instance_type: p4de
+        namespace: zenml
+        custom_image:
+          health_route: /health
+          env:
+            MAX_BATCH_PREFILL_TOKENS: "2048"
+            MAX_INPUT_LENGTH: "1024"
+            MAX_TOTAL_TOKENS: "1512"
+            QUANTIZE: bitsandbytes
+            MODEL_ID: /repository
+          url: registry.internal.huggingface.tech/api-inference/community/text-generation-inference:sha-564f2a3
37 changes: 19 additions & 18 deletions llm-finetuning/configs/deployment_t4.yaml
@@ -10,21 +10,22 @@ model:
 steps:
   deploy_model_to_hf_hub:
     parameters:
-      framework: pytorch
-      task: text-generation
-      accelerator: gpu
-      vendor: aws
-      region: us-east-1
-      max_replica: 1
-      instance_size: large
-      instance_type: g4dn.12xlarge
-      namespace: zenml
-      custom_image:
-        health_route: /health
-        env:
-          MAX_BATCH_PREFILL_TOKENS: "2048"
-          MAX_INPUT_LENGTH: "1024"
-          MAX_TOTAL_TOKENS: "1512"
-          QUANTIZE: bitsandbytes
-          MODEL_ID: /repository
-        url: registry.internal.huggingface.tech/api-inference/community/text-generation-inference:sha-564f2a3
+      hf_endpoint_cfg:
+        framework: pytorch
+        task: text-generation
+        accelerator: gpu
+        vendor: aws
+        region: us-east-1
+        max_replica: 1
+        instance_size: large
+        instance_type: g4dn.12xlarge
+        namespace: zenml
+        custom_image:
+          health_route: /health
+          env:
+            MAX_BATCH_PREFILL_TOKENS: "2048"
+            MAX_INPUT_LENGTH: "1024"
+            MAX_TOTAL_TOKENS: "1512"
+            QUANTIZE: bitsandbytes
+            MODEL_ID: /repository
+          url: registry.internal.huggingface.tech/api-inference/community/text-generation-inference:sha-564f2a3
25 changes: 25 additions & 0 deletions llm-finetuning/huggingface/hf_deployment_base_config.py
@@ -0,0 +1,25 @@
from pydantic import BaseModel
from typing import Optional, Dict
from zenml.utils.secret_utils import SecretField


class HuggingFaceBaseConfig(BaseModel):
"""Huggingface Inference Endpoint configuration."""

endpoint_name: Optional[str] = ""
repository: Optional[str] = None
framework: Optional[str] = None
accelerator: Optional[str] = None
instance_size: Optional[str] = None
instance_type: Optional[str] = None
region: Optional[str] = None
vendor: Optional[str] = None
    token: Optional[str] = SecretField(default=None)  # marked as a secret so it is not logged in plain text
account_id: Optional[str] = None
min_replica: Optional[int] = 0
max_replica: Optional[int] = 1
revision: Optional[str] = None
task: Optional[str] = None
custom_image: Optional[Dict] = None
namespace: Optional[str] = None
endpoint_type: str = "public"
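
As a usage illustration, the `hf_endpoint_cfg` block from `configs/deployment_a100.yaml` maps directly onto this model. The snippet below is a sketch; the import path assumes you run it from the `llm-finetuning` directory:

```python
from huggingface.hf_deployment_base_config import HuggingFaceBaseConfig

# Mirrors the hf_endpoint_cfg block in configs/deployment_a100.yaml.
cfg = HuggingFaceBaseConfig(
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    max_replica=1,
    instance_size="xlarge",
    instance_type="p4de",
    namespace="zenml",
    custom_image={
        "health_route": "/health",
        "env": {
            "MAX_BATCH_PREFILL_TOKENS": "2048",
            "MAX_INPUT_LENGTH": "1024",
            "MAX_TOTAL_TOKENS": "1512",
            "QUANTIZE": "bitsandbytes",
            "MODEL_ID": "/repository",
        },
        "url": "registry.internal.huggingface.tech/api-inference/community/text-generation-inference:sha-564f2a3",
    },
)
print(cfg.endpoint_type)  # "public" by default
```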
