Merge pull request #976 from sania-16/main
Updating documentation
sonalgoyal authored Nov 28, 2024
2 parents b75ad66 + 40c4235 commit 8cbc26d
Showing 8 changed files with 46 additions and 14 deletions.
1 change: 1 addition & 0 deletions docs/SUMMARY.md
@@ -34,6 +34,7 @@
* [Using Pre-existing Training Data](setup/training/addOwnTrainingData.md)
* [Updating Labeled Pairs](updatingLabels.md)
* [Exporting Labeled Data](setup/training/exportLabeledData.md)
+* [Verification of Blocking Data](verifyBlocking.md)
* [Building And Saving The Model](setup/train.md)
* [Finding The Matches](setup/match.md)
* [Adding Incremental Data](runIncremental.md)
12 changes: 10 additions & 2 deletions docs/accuracy/stopWordsRemoval.md
@@ -4,9 +4,11 @@ Common words like Mr, Pvt, Av, St, Street etc. do not add differential signals a

The stopwords can be recommended by Zingg by invoking:

-`./scripts/zingg.sh --phase recommend --conf <conf.json> --columns <list of columns to generate stop word recommendations>`
+`./scripts/zingg.sh --phase recommend --conf <conf.json> --column <name of column to generate stopword recommendations>`
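For instance, against the bundled febrl example that ships with Zingg (an illustrative invocation, not part of this commit):

`./scripts/zingg.sh --phase recommend --conf examples/febrl/config.json --column fname`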

-By default, Zingg extracts 10% of the high-frequency unique words from a dataset. If the user wants a different selection, they should set up the following property in the config file:
+The generated stopwords are stored at the location models/100/stopWords/columnName. This gives you the list of stopwords along with their frequency.
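As an illustrative sketch of this output (hypothetical words, counts, and column headers; the actual file layout may differ):

```
word,frequency
mr,4953
st,3421
av,1208
```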

+By default, Zingg extracts 10% of the high-frequency unique words from a dataset. If the user wants a different selection, they should set the following property in the config file under the respective field for which they want stopwords:

```
stopWordsCutoff: <a value between 0 and 1>
```
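For example, setting `stopWordsCutoff` to 0.2 would extract roughly the top 20% of high-frequency unique words instead of the default 10%.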
@@ -24,3 +26,9 @@ Once you have verified the above stop words, you can configure them in the JSON
"stopWords": "models/100/stopWords/fname.csv"
},
```
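For context, here is a minimal sketch of how this property sits inside a field definition (the `fname` field and `fuzzy` match type are illustrative, borrowed from the febrl example; only `stopWords` is the subject here):

```
"fieldDefinition": [
  {
    "fieldName": "fname",
    "fields": "fname",
    "dataType": "string",
    "matchType": "fuzzy",
    "stopWords": "models/100/stopWords/fname.csv"
  }
]
```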

+For recommending stopwords in **Zingg Enterprise Snowflake**, invoke:

+`./scripts/zingg.sh --phase recommend --conf <conf.json> --properties-file <path to Snowflake properties file> --column <name of column to generate stopword recommendations>`

+The generated stopwords are stored in the table zingg_stopWords_columnName_modelId, where you can see the list of stopwords and their frequency for the given column name.
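Assuming, for illustration, a column `fname` and model ID `100`, the recommendations could then be inspected with a query along these lines (the table name is constructed per the pattern above; the column layout is an assumption):

```
SELECT * FROM zingg_stopWords_fname_100;
```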
15 changes: 11 additions & 4 deletions docs/deterministicMatching.md
@@ -1,10 +1,15 @@
+---
+description: >-
+  Ensuring higher matching accuracy and performance
+---

# Deterministic Matching

-### Deterministic Matching - _Zingg Enterprise Feature_
+[Zingg Enterprise Feature](#user-content-fn-1)[^1]

+Zingg Enterprise provides the ability to plug in rule-based deterministic matching alongside Zingg AI's probabilistic matching. If the data contains _sure_ identifiers like emails, SSNs, passport IDs, etc., we can use these attributes to resolve records.

-Zingg Enterprise allows the ability to plug rule-based deterministic matching along with already Zingg AI's probabilistic matching. If the data contains _sure_ identifiers like emails, SSNs, passport-ids etc, we can use these attributes to resolve records.\
-\
-The deterministic matching flow is weaved into Zingg's flow to ensure that each record which has a match finds one, probabilistically, deterministically or both. If the data has known identifiers, Zingg Enterprise's deterministic matching highly improves both matching accuracy and performance.
+The deterministic matching flow is woven into Zingg's flow to ensure that each record which has a match finds one, whether probabilistically, deterministically, or both. If the data has known identifiers, Zingg Enterprise's Deterministic Matching greatly improves both matching accuracy and performance.

### Example For Configuring In JSON:

@@ -40,3 +45,5 @@ The above conditions would translate into the following:
2. Those rows which have **exactly** the same `fname`, `dob` and `ssn` => exact match with max score 1\
_OR_
3. Those rows which have **exactly** the same `fname` and `email` => exact match with max score 1
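The JSON example itself is collapsed in this diff, but a hedged sketch consistent with conditions 2 and 3 above might look like the following (the `deterministicMatching` and `matchCondition` key names are assumptions and may differ in the actual product):

```
"deterministicMatching": [
  {
    "matchCondition": [
      { "fieldName": "fname" },
      { "fieldName": "dob" },
      { "fieldName": "ssn" }
    ]
  },
  {
    "matchCondition": [
      { "fieldName": "fname" },
      { "fieldName": "email" }
    ]
  }
]
```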

+[^1]: Zingg Enterprise is an advanced version of Zingg Community with production-grade features
8 changes: 5 additions & 3 deletions docs/runIncremental.md
@@ -1,11 +1,13 @@
---
-description: >-
-  Building a continuosly updated identity graph with new, updated and deleted
-  records
+title: Adding incremental data
+parent: Step By Step Guide
+nav_order: 10
---

# Adding Incremental Data

+## Building a continuously updated identity graph with new, updated and deleted records

[Zingg Enterprise Feature](#user-content-fn-1)[^1]

Rerunning matching on entire datasets is wasteful, and we lose the lineage of matched records against a persistent identifier. Using the [incremental flow](https://www.learningfromdata.zingg.ai/p/zingg-incremental-flow) feature in [Zingg Enterprise](https://www.zingg.ai/company/zingg-enterprise), incremental loads can be run to match existing pre-resolved entities. The new and updated records are matched to existing clusters, and new persistent [**ZINGG\_IDs**](https://www.learningfromdata.zingg.ai/p/hello-zingg-id) are generated for records that do not find a match. If a record gets updated and Zingg Enterprise discovers that it is a more suitable match with another cluster, it will be reassigned. Cluster assignment, merge, and unmerge happen automatically in the flow. Zingg Enterprise also takes care of human feedback on previously matched data to ensure that it does not override the approved records.
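Assuming the phase-based CLI pattern used by the other steps in these docs (a sketch; the exact Zingg Enterprise invocation may differ):

`./scripts/zingg.sh --phase runIncremental --conf <conf.json>`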
6 changes: 6 additions & 0 deletions docs/setup/link.md
@@ -1,3 +1,9 @@
+---
+title: Linking data
+parent: Step By Step Guide
+nav_order: 11
+---

# Linking Across Datasets

In many cases like reference data mastering, enrichment, etc., two individual datasets are duplicate-free but need to be matched against each other. The link phase is used for such scenarios.
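Following the same CLI pattern as the other phases (a sketch; the full invocation is not shown in this diff):

`./scripts/zingg.sh --phase link --conf <conf.json>`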
2 changes: 1 addition & 1 deletion docs/setup/match.md
@@ -1,7 +1,7 @@
---
title: Find the matches
parent: Step By Step Guide
-nav_order: 8
+nav_order: 9
---

# Finding The Matches
2 changes: 1 addition & 1 deletion docs/setup/train.md
@@ -1,7 +1,7 @@
---
title: Build and save the model
parent: Step By Step Guide
-nav_order: 7
+nav_order: 8
---

# Building And Saving The Model
14 changes: 11 additions & 3 deletions docs/verifyBlocking.md
@@ -1,10 +1,13 @@
---
-description: >-
-  Understanding how blocking is working before running match or link
+title: Verifying the blocked data
+parent: Step By Step Guide
+nav_order: 7
---

# Verification of Blocking Data

+## Understanding how blocking is working before running match or link

Sometimes Zingg jobs are slow or fail due to a poorly learnt blocking tree. This can happen for a variety of reasons, for example when:
- A user adds significantly larger training samples compared to the labelling learnt by Zingg. The manually added training samples may have the same type of columns, and the blocking rules learnt are not generic enough. For example, providing training data only from the state of California when matching uses the State column and the data spans multiple states.
- There is a natural bias in the data, with many nulls in the columns used for matching.
@@ -19,4 +22,9 @@ If we have an understanding of how blocking is working before deciding to run a

The output contains two directories, zinggDir/modelId/blocks/timestamp/counts and zinggDir/modelId/blocks/timestamp/blockSamples. These show the counts per block and the top 10% of records associated with the top 3 blocks by count, respectively.
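As an illustration, with the default zinggDir of models, a hypothetical model ID 100, and a hypothetical timestamp, the output can be browsed directly:

```
ls models/100/blocks/1732780800000/counts/
ls models/100/blocks/1732780800000/blockSamples/
```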

-For Enterprise Snowflake, we will be having tables with the names - zingg/modelId/blocks/timestamp/counts where we can see the counts per block and zingg/modelId/blocks/timestamp/blockSamples/hash where we can see the top 10% records associated with the top 3 blocks by counts in these tables respectively.

+For running verifyBlocking in **Zingg Enterprise Snowflake**, invoke:

+`./scripts/zingg.sh --phase verifyBlocking --conf <path to conf> --properties-file <path to Snowflake properties file> <optional --zinggDir <location of model>>`

+This will generate tables named zingg_modelId_blocks_timestamp_counts, where you can see the counts per block, and zingg_modelId_blocks_timestamp_blockSamples_hash, where you can see the top 10% of records associated with the top 3 blocks by count.
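For illustration, with a hypothetical model ID 100 and a hypothetical timestamp, the counts table could be inspected with:

```
SELECT * FROM zingg_100_blocks_1732780800000_counts;
```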
