Merge pull request #273 from JohnSnowLabs/release/540

Release/540
JohnSnowLabs · Jul 13, 2024 · 6cda85f · 6cda85f
2 parents 6b0fc85 + 2e0c130
commit 6cda85f
Show file tree

Hide file tree

Showing 45 changed files with 1,631 additions and 82 deletions.
diff --git a/examples/colab/component_examples/classifiers/NLU_MPNetForSequenceClassification.ipynb b/examples/colab/component_examples/classifiers/NLU_MPNetForSequenceClassification.ipynb
diff --git a/examples/colab/component_examples/sequence2sequence/NLU_M2M100Transformer.ipynb b/examples/colab/component_examples/sequence2sequence/NLU_M2M100Transformer.ipynb
diff --git a/examples/colab/ocr/ocr_visual_document_deid.ipynb b/examples/colab/ocr/ocr_visual_document_deid.ipynb
@@ -0,0 +1,153 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "source": [
+    "![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)\n",
+    "\n",
+    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/nlu/blob/master/examples/colab/ocr/ocr_visual_document_deid.ipynb)\n",
+    "\n",
+    "\n",
+    "## De-Identification\n",
+    "\n",
+    "Introducing our advanced healthcare deidentification model, effortlessly deployable with a single line of code. This powerful solution integrates state-of-the-art algorithms like ner_deid_subentity_augmented, ContextualParser, RegexMatcher, and TextMatcher, alongside a streamlined Deidentification stage. It efficiently masks sensitive entities such as names, locations, and medical records, ensuring compliance and data security in medical texts. Utilizing OCR capabilities, it also redacts detected information before saving the processed file to the specified location."
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "#### Installing the libraries"
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "📋 Loading license number 0 from C:\\Users\\gadde/.johnsnowlabs\\licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json\n",
+      "👌 Launched \u001B[92mcpu optimized\u001B[39m session with with: 🚀Spark-NLP==5.3.2, 💊Spark-Healthcare==5.3.2, 🕶Spark-OCR==5.3.2, running on ⚡ PySpark==3.1.2\n"
+     ]
+    }
+   ],
+   "source": [
+    "!pip install johnsnowlabs\n",
+    "from johnsnowlabs import nlp\n",
+    "nlp.install(visual=True)\n",
+    "nlp.start(visual=True)"
+   ],
+   "metadata": {
+    "collapsed": false,
+    "ExecuteTime": {
+     "end_time": "2024-06-24T10:27:47.436477Z",
+     "start_time": "2024-06-24T10:27:21.668104700Z"
+    }
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Warning::Spark Session already created, some configs may not take.\n",
+      "Warning::Spark Session already created, some configs may not take.\n",
+      "pdf_deid_pdf_output download started this may take some time.\n",
+      "Approx size to download 1.6 GB\n",
+      "[OK!]\n"
+     ]
+    }
+   ],
+   "source": [
+    "#loading the model\n",
+    "model = nlp.load(\"en.image_deid\")"
+   ],
+   "metadata": {
+    "collapsed": false,
+    "ExecuteTime": {
+     "end_time": "2024-06-24T10:28:08.210754700Z",
+     "start_time": "2024-06-24T10:27:47.452292500Z"
+    }
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## PDF De-Identification\n",
+    "\n",
+    "With the specified input and output paths provided as arguments, the model efficiently processes PDF files, performing de-identification as needed, and seamlessly stores the processed documents at the designated location."
+   ],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Warning::Spark Session already created, some configs may not take.\n"
+     ]
+    }
+   ],
+   "source": [
+    "! wget https://github.com/JohnSnowLabs/nlu/raw/release/540/tests/datasets/ocr/deid/deid2.pdf\n",
+    "! wget https://github.com/JohnSnowLabs/nlu/raw/release/540/tests/datasets/ocr/deid/download.pdf\n",
+    "\n",
+    "#provide the input and the output path\n",
+    "input_path,output_path = ['download.pdf',' deid2.pdf'], ['download_deidentified.pdf',' deid2_deidentified.pdf']\n",
+    "\n",
+    "#predict and save the deidentified pdf's.\n",
+    "dfs = model.predict(input_path, output_path=output_path)"
+   ],
+   "metadata": {
+    "collapsed": false,
+    "ExecuteTime": {
+     "end_time": "2024-06-24T10:33:43.625036300Z",
+     "start_time": "2024-06-24T10:33:40.477056300Z"
+    }
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [],
+   "metadata": {
+    "collapsed": false
+   }
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 2
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython2",
+   "version": "2.7.6"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}