Merge pull request #273 from JohnSnowLabs/release/540
Release/540
C-K-Loan authored Jul 13, 2024
2 parents 6b0fc85 + 2e0c130 commit 6cda85f
Showing 45 changed files with 1,631 additions and 82 deletions.

Large diffs are not rendered by default.

153 changes: 153 additions & 0 deletions examples/colab/ocr/ocr_visual_document_deid.ipynb
@@ -0,0 +1,153 @@
{
"cells": [
{
"cell_type": "markdown",
"source": [
"![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)\n",
"\n",
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/nlu/blob/master/examples/colab/ocr/ocr_visual_document_deid.ipynb)\n",
"\n",
"\n",
"## De-Identification\n",
"\n",
"This notebook demonstrates a healthcare de-identification model that can be loaded with a single line of code. The pipeline combines `ner_deid_subentity_augmented`, ContextualParser, RegexMatcher, and TextMatcher with a Deidentification stage. It masks sensitive entities such as names, locations, and medical records in medical text. Using OCR, it also redacts the detected information in the document before saving the processed file to the specified location."
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"#### Installing the libraries"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 1,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"📋 Loading license number 0 from C:\\Users\\gadde/.johnsnowlabs\\licenses/license_number_0_for_Spark-Healthcare_Spark-OCR.json\n",
"👌 Launched \u001B[92mcpu optimized\u001B[39m session with with: 🚀Spark-NLP==5.3.2, 💊Spark-Healthcare==5.3.2, 🕶Spark-OCR==5.3.2, running on ⚡ PySpark==3.1.2\n"
]
}
],
"source": [
"!pip install johnsnowlabs\n",
"from johnsnowlabs import nlp\n",
"nlp.install(visual=True)\n",
"nlp.start(visual=True)"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-06-24T10:27:47.436477Z",
"start_time": "2024-06-24T10:27:21.668104700Z"
}
}
},
{
"cell_type": "code",
"execution_count": 3,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Warning::Spark Session already created, some configs may not take.\n",
"Warning::Spark Session already created, some configs may not take.\n",
"pdf_deid_pdf_output download started this may take some time.\n",
"Approx size to download 1.6 GB\n",
"[OK!]\n"
]
}
],
"source": [
"# load the de-identification model\n",
"model = nlp.load(\"en.image_deid\")"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-06-24T10:28:08.210754700Z",
"start_time": "2024-06-24T10:27:47.452292500Z"
}
}
},
{
"cell_type": "markdown",
"source": [
"## PDF De-Identification\n",
"\n",
"Given lists of input and output paths as arguments, the model de-identifies each PDF and saves the processed document to the corresponding output path."
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 7,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Warning::Spark Session already created, some configs may not take.\n"
]
}
],
"source": [
"! wget https://github.com/JohnSnowLabs/nlu/raw/release/540/tests/datasets/ocr/deid/deid2.pdf\n",
"! wget https://github.com/JohnSnowLabs/nlu/raw/release/540/tests/datasets/ocr/deid/download.pdf\n",
"\n",
"# provide the input and output paths\n",
"input_path, output_path = ['download.pdf', 'deid2.pdf'], ['download_deidentified.pdf', 'deid2_deidentified.pdf']\n",
"\n",
"# predict and save the de-identified PDFs\n",
"dfs = model.predict(input_path, output_path=output_path)"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-06-24T10:33:43.625036300Z",
"start_time": "2024-06-24T10:33:40.477056300Z"
}
}
},
{
"cell_type": "markdown",
"source": [],
"metadata": {
"collapsed": false
}
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
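
The PDF cell in the notebook above pairs each input file with an output file by hand, which makes a stray space in a file name easy to miss. A minimal sketch of building both lists programmatically follows. It assumes the `<name>_deidentified.pdf` naming used in the notebook; the helper `deidentified_paths` is hypothetical and is not part of the `johnsnowlabs` library.

```python
from pathlib import Path


def deidentified_paths(input_paths, suffix="_deidentified"):
    """Return (cleaned_inputs, outputs) as two lists of equal length.

    Hypothetical helper: strips surrounding whitespace from each input
    path and derives the output name by appending `suffix` to the stem.
    """
    cleaned = [p.strip() for p in input_paths]
    outputs = []
    for p in cleaned:
        path = Path(p)
        # e.g. download.pdf becomes download_deidentified.pdf
        outputs.append(str(path.with_name(path.stem + suffix + path.suffix)))
    return cleaned, outputs


input_path, output_path = deidentified_paths(["download.pdf", " deid2.pdf"])
print(input_path)
print(output_path)

# The lists can then be passed to the call used in the notebook:
# dfs = model.predict(input_path, output_path=output_path)
```

Deriving the output names from the inputs keeps the two lists the same length and in the same order, which the positional pairing in the notebook's `model.predict` call appears to rely on.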