microsoft · priyal1508 · Sep 18, 2024 · Sep 19, 2024 · Sep 19, 2024 · Sep 23, 2024
@@ -1,3 +1,70 @@
+## Contributing to this Repository
+Welcome! We greatly appreciate your interest in contributing to our project. Please follow the guidelines below to ensure a smooth and successful contribution process.
+
+## Fork the Repository
+To get started, fork the main dstoolkit-text2sql-and-imageprocessing repository
+to your own GitHub account by clicking the "Fork" button at the top right corner of the repository page. This creates a copy of the repository under your account, which you can freely change.
+
+## Clone the Repository
+Next, clone the forked repository to your local machine using the following command:
+
+```
+git clone https://github.com/[your-github-username]/[your-repository-name].git
+```
+
+Make sure to replace [your-github-username] with your actual GitHub username and [your-repository-name] with the name of your forked repository.
+
+## Set Up Access to Relevant Services
+Please ensure that you have the appropriate permissions and credentials for the relevant services to avoid issues during the contribution process. This includes the Azure DevOps project, repository, and pipelines, as well as the Azure subscription.
+
+## Install Dependencies and Validate Environment
+Before making changes, ensure that you have installed all the dependencies required for the project. This includes Conda, Python 3.8 (ideal), and related tools. Validate that your development environment is set up correctly and meets the project's requirements.
+
+## Create a Branch
+Create a new branch for your contribution. It's important to create a new branch for each contribution to keep the main branch clean and stable. You can create a new branch using the following command:
+```
+git checkout -b [branch-name]
+```
+
+Replace [branch-name] with a descriptive name for your branch that indicates the purpose of your contribution.
+
+## Make Changes
+Now it's time to make your changes! Follow the coding style and guidelines of the project, and thoroughly test your changes in your local environment. Ensure that your changes do not introduce any errors or break the existing functionality. Be sure to add appropriate comments and documentation as needed.
+
+## Validate Code Changes
+Before submitting your contribution, validate your changes by building and testing the project in your environment. This includes running code quality checks, linting, unit tests (including those for training scripts), and any other validation processes. Make sure that your changes do not cause any build failures or test errors.
+
+## Commit and Push Changes
+Once you're confident with your changes, commit your changes and push them to your forked repository using the following commands:
+
+```
+git add .
+git commit -m "Your commit message here"
+git push origin [branch-name]
+```
+Replace [branch-name] with the name of your branch.
+
+## Create a Pull Request
+Go to the original [Your Repository Name] repository on GitHub and click the "New Pull Request" button. Select the target branch in the base drop-down menu and your branch in the compare drop-down menu. Review your changes and provide a descriptive title and detailed description for your pull request. Include relevant information, such as the purpose of your contribution, the changes made, and any necessary context. Click the "Create Pull Request" button to submit your contribution.
+
+## Validate Builds and Tests
+After the PR is created, build validation must pass before the code can be merged into the target develop branch. Any feedback from build validation must be addressed; otherwise, the PR will not be merged.
+
+## Review and Address Feedback
+Your pull request will be reviewed by the repository maintainers, and they may provide feedback or request changes. Be sure to monitor your pull request and address any feedback in a timely manner. This may involve making additional changes, providing clarification, or addressing any issues raised during the review process.
+
+## Follow Code of Conduct
+As a contributor, it's important to adhere to the project's code of conduct. Make sure to follow the project's guidelines, respect the contributions of others, and avoid any inappropriate behavior. Additionally, ensure that your contribution does not violate any copyright or intellectual property rights.
+
+## Merge and Close
+Once your contribution has been approved and all feedback has been addressed, you should merge your changes into the develop branch. After the changes have been merged, your contribution will be credited and acknowledged in the project's documentation or contributors list. Your pull request will then be closed, and your contribution will become part of the project's codebase.
+
+Congratulations! You have successfully contributed to dstoolkit-text2sql-and-imageprocessing. Thank you for your valuable contribution and for following the contribution guidelines.
+
+If you have any questions or need further assistance, feel free to reach out to the repository maintainers or the project's team channel for support.
+
+Happy contributing!
+
 ## Contributing
 
 This project welcomes contributions and suggestions.  Most contributions require you to agree to a

@@ -8,6 +8,7 @@
 from adi_2_ai_search import process_adi_2_ai_search
 from pre_embedding_cleaner import process_pre_embedding_cleaner
 from key_phrase_extraction import process_key_phrase_extraction
+from ocr import process_ocr
 
 logging.basicConfig(level=logging.DEBUG)
 app = func.FunctionApp(http_auth_level=func.AuthLevel.FUNCTION)
@@ -124,3 +125,41 @@ async def key_phrase_extractor(req: func.HttpRequest) -> func.HttpResponse:
             status_code=200,
             mimetype="application/json",
         )
+
+@app.route(route="ocr", methods=[func.HttpMethod.POST])
+async def ocr(req: func.HttpRequest) -> func.HttpResponse:
+    """HTTP trigger for the OCR function.
+
+    Args:
+        req (func.HttpRequest): The HTTP request object.
+
+    Returns:
+        func.HttpResponse: The HTTP response object."""
+    logging.info("Python HTTP trigger OCR function processed a request.")
+
+    try:
+        req_body = req.get_json()
+        values = req_body.get("values")
+        logging.info(req_body)
+    except ValueError:
+        return func.HttpResponse(
+            "Please provide a valid Custom Skill payload in the request body", status_code=400
+        )
+    else:
+        logging.debug("Input Values: %s", values)
+
+        record_tasks = []
+
+        for value in values:
+            record_tasks.append(
+                asyncio.create_task(process_ocr(value))
+            )
+
+        results = await asyncio.gather(*record_tasks)
+        logging.debug("Results: %s", results)
+
+        return func.HttpResponse(
+            json.dumps({"values": results}),
+            status_code=200,
+            mimetype="application/json",
+        )
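
The handler above expects a custom skill payload with a top-level `values` list and fans each record out to its own task. A minimal sketch of that pattern follows, assuming a hypothetical `fake_process_ocr` stand-in for `process_ocr`; the record fields mirror those read in this PR, and the URLs are placeholders.

```python
import asyncio
import json


async def fake_process_ocr(record: dict) -> dict:
    """Hypothetical stand-in for process_ocr: echoes the image URL as text."""
    url = record["data"]["image"]["url"]
    return {"recordId": record["recordId"], "data": {"text": f"text from {url}"}}


async def handle(payload: dict) -> str:
    """Fan each record out to its own task and return the JSON response body."""
    tasks = [asyncio.create_task(fake_process_ocr(v)) for v in payload["values"]]
    # asyncio.gather returns results in the same order as the awaitables passed in.
    results = await asyncio.gather(*tasks)
    return json.dumps({"values": results})


# Placeholder payload with two records, shaped like the fields read in this PR.
payload = {
    "values": [
        {"recordId": "0", "data": {"image": {"url": "https://example.com/a.png"}}},
        {"recordId": "1", "data": {"image": {"url": "https://example.com/b.png"}}},
    ]
}
response_body = asyncio.run(handle(payload))
```

Because `asyncio.gather` preserves input order, each result lines up with the `recordId` it was created from. Note that in the handler above, a JSON body without a `values` key leaves `values` as `None`, so the `for` loop would raise `TypeError` rather than return a 400.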
@@ -0,0 +1,82 @@
+import logging
+import os
+from azure.ai.vision.imageanalysis.aio import ImageAnalysisClient
+from azure.ai.vision.imageanalysis.models import VisualFeatures
+from azure.core.credentials import AzureKeyCredential
+
+
+async def process_ocr(record: dict) -> dict:
+    logging.info("Python HTTP trigger function processed a request.")
+
+    try:
+        url = record["data"]["image"]["url"]
+        logging.info(f"Request Body: {record}")
+    except KeyError:
+        return {
+            "recordId": record["recordId"],
+            "data": {},
+            "errors": [
+                {
+                    "message": "Failed to extract data with ocr. Pass a valid source in the request body.",
+                }
+            ],
+            "warnings": None,
+        }
+    else:
+        logging.info(f"image url: {url}")
+
+        if url is not None:
+            try:
+                # keyvault_helper = KeyVaultHelper()
+                client = ImageAnalysisClient(
+                    endpoint=os.environ["AIService__Services__Endpoint"],
+                    credential=AzureKeyCredential(os.environ["AIService__Services__Key"])
+                )
+                result = await client.analyze_from_url(
+                    image_url=url, visual_features=[VisualFeatures.READ]
+                )
+                logging.info("logging output")
+
+                # Extract text from OCR results
+                text = " ".join([line.text for line in result.read.blocks[0].lines])
+                logging.info(text)
+
+            except KeyError as e:
+                logging.error(e)
+                logging.error(f"Failed to authenticate with ocr: {e}")
+                return {
+                    "recordId": record["recordId"],
+                    "data": {},
+                    "errors": [
+                        {
+                            "message": f"Failed to authenticate with Ocr. Check the service credentials exist. {e}",
+                        }
+                    ],
+                    "warnings": None,
+                }
+            except Exception as e:
+                logging.error(e)
+                logging.error(
+                    f"Failed to analyze the image with Azure AI Vision: {e}"
+                )
+                logging.error(getattr(e, "InnerError", None))
+                return {
+                    "recordId": record["recordId"],
+                    "data": {},
+                    "errors": [
+                        {
+                            "message": f"Failed to analyze the document with ocr. Check the source and try again. {e}",
+                        }
+                    ],
+                    "warnings": None,
+                }
+        else:
+            return {
+                "recordId": record["recordId"],
+                "data": {"text": ""},
+            }
+
+        return {
+            "recordId": record["recordId"],
+            "data": {"text": text},
+        }
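
The text extraction above reads only `result.read.blocks[0]`, so it would raise `IndexError` if `blocks` is empty and would skip any later blocks. A minimal sketch of a more defensive join follows. `Line` and `Block` are hypothetical stand-in dataclasses that only mimic the `blocks[i].lines[j].text` shape used in this PR; they are not the SDK's own models.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Line:
    """Hypothetical stand-in for one recognized line of text."""
    text: str


@dataclass
class Block:
    """Hypothetical stand-in for one block of recognized lines."""
    lines: List[Line] = field(default_factory=list)


def join_read_text(blocks: Optional[List[Block]]) -> str:
    """Join the text of every line in every block; return "" when there are none."""
    if not blocks:
        return ""
    return " ".join(line.text for block in blocks for line in block.lines)


blocks = [Block([Line("Hello"), Line("world")]), Block([Line("again")])]
merged = join_read_text(blocks)
empty = join_read_text([])
```

Returning an empty string for missing or empty blocks matches the `{"text": ""}` shape the function already returns when `url` is `None`.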
@@ -27,7 +27,7 @@ async def get_client(self):
             return BlobServiceClient(account_url=endpoint, credential=credential)
         else:
             endpoint = os.environ.get("StorageAccount__ConnectionString")
-            return BlobServiceClient(account_url=endpoint)
+            return BlobServiceClient.from_connection_string(endpoint)
 
     async def add_metadata_to_blob(
         self, source: str, container: str, metadata: dict

@@ -24,6 +24,9 @@
     InputFieldMappingEntry,
     SynonymMap,
     SplitSkill,
+    DocumentExtractionSkill,
+    OcrSkill,
+    MergeSkill,
     SearchIndexerIndexProjection,
     BlobIndexerParsingMode,
 )
@@ -420,6 +423,108 @@ def get_key_phrase_extraction_skill(self, context, source) -> WebApiSkill:
 
         return key_phrase_extraction_skill
 
+    def get_document_extraction_skill(self, context, source) -> DocumentExtractionSkill:
+        """Get the document extraction utility skill.
+
+        Args:
+        -----
+            context (str): The context of the skill
+            source (str): The source of the skill
+
+        Returns:
+        --------
+            DocumentExtractionSkill: The document extraction utility skill"""
+
+        doc_extraction_skill = DocumentExtractionSkill(
+            description="Extraction skill to extract content from office docs like excel, ppt, doc etc",
+            context=context,
+            inputs=[InputFieldMappingEntry(name="file_data", source=source)],
+            outputs=[
+                OutputFieldMappingEntry(
+                    name="content", target_name="extracted_content"
+                ),
+                OutputFieldMappingEntry(
+                    name="normalized_images", target_name="extracted_normalized_images"
+                ),
+            ],
+        )
+
+        return doc_extraction_skill
+
+    def get_ocr_skill(self, context, source) -> WebApiSkill:
+        """Get the OCR custom skill, which calls the "ocr" function app route.
+        Args:
+        -----
+            context (str): The context of the skill
+            source (str): The source of the skill
+
+        Returns:
+        --------
+            WebApiSkill: The OCR skill"""
+
+        if self.test:
+            batch_size = 2
+            degree_of_parallelism = 2
+        else:
+            batch_size = 2
+            degree_of_parallelism = 2
+
+        ocr_skill_inputs = [
+            InputFieldMappingEntry(name="image", source=source),
+        ]
+        ocr_skill_outputs = [OutputFieldMappingEntry(name="text", target_name="text")]
+        ocr_skill = WebApiSkill(
+            name="ocr API",
+            description="Skill to extract text from images",
+            context=context,
+            uri=self.environment.get_custom_skill_function_url("ocr"),
+            timeout="PT230S",
+            batch_size=batch_size,
+            degree_of_parallelism=degree_of_parallelism,
+            http_method="POST",
+            inputs=ocr_skill_inputs,
+            outputs=ocr_skill_outputs,
+        )
+
+        if self.environment.identity_type != IdentityType.KEY:
+            ocr_skill.auth_identity = (
+                self.environment.function_app_app_registration_resource_id
+            )
+
+        if self.environment.identity_type == IdentityType.USER_ASSIGNED:
+            ocr_skill.auth_identity = (
+                self.environment.ai_search_user_assigned_identity
+            )
+
+        return ocr_skill
+
+    def get_merge_skill(self, context, source) -> MergeSkill:
+        """Get the merge skill.
+        Args:
+        -----
+            context (str): The context of the skill
+            source (list): The sources for the text, itemsToInsert, and offsets inputs, in that order
+
+        Returns:
+        --------
+            MergeSkill: The merge skill"""
+
+        merge_skill = MergeSkill(
+            description="Merge skill for combining OCR'd and regular text",
+            context=context,
+            inputs=[
+                InputFieldMappingEntry(name="text", source=source[0]),
+                InputFieldMappingEntry(name="itemsToInsert", source=source[1]),
+                InputFieldMappingEntry(name="offsets", source=source[2]),
+            ],
+            outputs=[
+                OutputFieldMappingEntry(name="mergedText", target_name="merged_content")
+            ],
+        )
+
+        return merge_skill
+
+
     def get_vector_search(self) -> VectorSearch:
         """Get the vector search configuration for compass.
 

@@ -1,7 +1,8 @@
 # Copyright (c) Microsoft Corporation.
 # Licensed under the MIT License.
 import argparse
-from rag_documents import RagDocumentsAISearch
+# from rag_documents import RagDocumentsAISearch
+from rag_documents_old import RagDocumentsAISearch
 from text_2_sql import Text2SqlAISearch
 from text_2_sql_query_cache import Text2SqlQueryCacheAISearch
 import logging

@@ -217,6 +217,13 @@ def function_app_key_phrase_extractor_route(self) -> str:
         This function returns function app keyphrase extractor name
         """
         return os.environ.get("FunctionApp__KeyPhraseExtractor__FunctionName")
+
+    @property
+    def function_app_key_ocr_route(self) -> str:
+        """
+        This function returns the function app OCR function name
+        """
+        return os.environ.get("FunctionApp__Ocr__FunctionName")
 
     @property
     def open_ai_embedding_dimensions(self) -> str:
@@ -249,6 +256,8 @@ def get_custom_skill_function_url(self, skill_type: str):
             route = self.function_app_adi_route
         elif skill_type == "key_phrase_extraction":
             route = self.function_app_key_phrase_extractor_route
+        elif skill_type == "ocr":
+            route = self.function_app_key_ocr_route
         else:
             raise ValueError(f"Invalid skill type: {skill_type}")
 

@@ -164,6 +164,7 @@ def get_skills(self) -> list:
 
         Returns:
             list: The skillsets  used in the indexer"""
+
 
         adi_skill = self.get_adi_skill(self.enable_page_by_chunking)