Merge pull request #976 from sania-16/main
Updating documentation
sonalgoyal authored Nov 28, 2024
2 parents b75ad66 + 40c4235 commit 8cbc26d
Showing 8 changed files with 46 additions and 14 deletions.
1 change: 1 addition & 0 deletions docs/SUMMARY.md
@@ -34,6 +34,7 @@
* [Using Pre-existing Training Data](setup/training/addOwnTrainingData.md)
* [Updating Labeled Pairs](updatingLabels.md)
* [Exporting Labeled Data](setup/training/exportLabeledData.md)
+* [Verification of Blocking Data](verifyBlocking.md)
* [Building And Saving The Model](setup/train.md)
* [Finding The Matches](setup/match.md)
* [Adding Incremental Data](runIncremental.md)
12 changes: 10 additions & 2 deletions docs/accuracy/stopWordsRemoval.md
@@ -4,9 +4,11 @@ Common words like Mr, Pvt, Av, St, Street etc. do not add differential signals a

The stopwords can be recommended by Zingg by invoking:

-`./scripts/zingg.sh --phase recommend --conf <conf.json> --columns <list of columns to generate stop word recommendations>`
+`./scripts/zingg.sh --phase recommend --conf <conf.json> --column <name of column to generate stopword recommendations>`
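For instance, against the bundled febrl example that ships with Zingg (an illustrative invocation, not part of this commit):

`./scripts/zingg.sh --phase recommend --conf examples/febrl/config.json --column fname`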

-By default, Zingg extracts 10% of the high-frequency unique words from a dataset. If the user wants a different selection, they should set up the following property in the config file:
+The generated stopwords are stored at the location models/100/stopWords/columnName. This gives you the list of stopwords along with their frequency.
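As an illustrative sketch of this output (hypothetical words, counts, and column headers; the actual file layout may differ):

```
word,frequency
mr,4953
st,3421
av,1208
```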

+By default, Zingg extracts 10% of the high-frequency unique words from a dataset. If the user wants a different selection, they should set the following property in the config file under the respective field for which they want stopwords:

```
stopWordsCutoff: <a value between 0 and 1>
```
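For example, setting `stopWordsCutoff` to 0.2 would extract roughly the top 20% of high-frequency unique words instead of the default 10%.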
@@ -24,3 +26,9 @@ Once you have verified the above stop words, you can configure them in the JSON
"stopWords": "models/100/stopWords/fname.csv"
},
```
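For context, here is a minimal sketch of how this property sits inside a field definition (the `fname` field and `fuzzy` match type are illustrative, borrowed from the febrl example; only `stopWords` is the subject here):

```
"fieldDefinition": [
  {
    "fieldName": "fname",
    "fields": "fname",
    "dataType": "string",
    "matchType": "fuzzy",
    "stopWords": "models/100/stopWords/fname.csv"
  }
]
```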

+For recommending stopwords in **Zingg Enterprise Snowflake**, invoke:

+`./scripts/zingg.sh --phase recommend --conf <conf.json> --properties-file <path to Snowflake properties file> --column <name of column to generate stopword recommendations>`

+The generated stopwords are stored in the table zingg_stopWords_columnName_modelId, where you can see the list of stopwords and their frequency for the given column name.
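Assuming, for illustration, a column `fname` and model ID `100`, the recommendations could then be inspected with a query along these lines (the table name is constructed per the pattern above; the column layout is an assumption):

```
SELECT * FROM zingg_stopWords_fname_100;
```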
15 changes: 11 additions & 4 deletions docs/deterministicMatching.md
@@ -1,10 +1,15 @@
+---
+description: >-
+  Ensuring higher matching accuracy and performance
+---

# Deterministic Matching

-### Deterministic Matching - _Zingg Enterprise Feature_
+[Zingg Enterprise Feature](#user-content-fn-1)[^1]

+Zingg Enterprise provides the ability to plug in rule-based deterministic matching alongside Zingg AI's probabilistic matching. If the data contains _sure_ identifiers like emails, SSNs, passport IDs, etc., we can use these attributes to resolve records.

-Zingg Enterprise allows the ability to plug rule-based deterministic matching along with already Zingg AI's probabilistic matching. If the data contains _sure_ identifiers like emails, SSNs, passport-ids etc, we can use these attributes to resolve records.\
-\
-The deterministic matching flow is weaved into Zingg's flow to ensure that each record which has a match finds one, probabilistically, deterministically or both. If the data has known identifiers, Zingg Enterprise's deterministic matching highly improves both matching accuracy and performance.
+The deterministic matching flow is woven into Zingg's flow to ensure that each record which has a match finds one, whether probabilistically, deterministically, or both. If the data has known identifiers, Zingg Enterprise's Deterministic Matching greatly improves both matching accuracy and performance.

### Example For Configuring In JSON:

@@ -40,3 +45,5 @@ The above conditions would translate into the following:
2. Those rows which have **exactly** the same `fname`, `dob` and `ssn` => exact match with max score 1\
_OR_
3. Those rows which have **exactly** the same `fname` and `email` => exact match with max score 1
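The JSON example itself is collapsed in this diff, but a hedged sketch consistent with conditions 2 and 3 above might look like the following (the `deterministicMatching` and `matchCondition` key names are assumptions and may differ in the actual product):

```
"deterministicMatching": [
  {
    "matchCondition": [
      { "fieldName": "fname" },
      { "fieldName": "dob" },
      { "fieldName": "ssn" }
    ]
  },
  {
    "matchCondition": [
      { "fieldName": "fname" },
      { "fieldName": "email" }
    ]
  }
]
```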

+[^1]: Zingg Enterprise is an advanced version of Zingg Community with production-grade features
8 changes: 5 additions & 3 deletions docs/runIncremental.md
@@ -1,11 +1,13 @@
---
-description: >-
-  Building a continuosly updated identity graph with new, updated and deleted
-  records
+title: Adding incremental data
+parent: Step By Step Guide
+nav_order: 10
---

# Adding Incremental Data

+## Building a continuously updated identity graph with new, updated and deleted records

[Zingg Enterprise Feature](#user-content-fn-1)[^1]

Rerunning matching on entire datasets is wasteful, and we lose the lineage of matched records against a persistent identifier. Using the [incremental flow](https://www.learningfromdata.zingg.ai/p/zingg-incremental-flow) feature in [Zingg Enterprise](https://www.zingg.ai/company/zingg-enterprise), incremental loads can be run to match existing pre-resolved entities. The new and updated records are matched to existing clusters, and new persistent [**ZINGG\_IDs**](https://www.learningfromdata.zingg.ai/p/hello-zingg-id) are generated for records that do not find a match. If a record gets updated and Zingg Enterprise discovers that it is a more suitable match with another cluster, it will be reassigned. Cluster assignment, merge, and unmerge happen automatically in the flow. Zingg Enterprise also takes care of human feedback on previously matched data to ensure that it does not override the approved records.
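Assuming the phase-based CLI pattern used by the other steps in these docs (a sketch; the exact Zingg Enterprise invocation may differ):

`./scripts/zingg.sh --phase runIncremental --conf <conf.json>`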
6 changes: 6 additions & 0 deletions docs/setup/link.md
@@ -1,3 +1,9 @@
+---
+title: Linking data
+parent: Step By Step Guide
+nav_order: 11
+---

# Linking Across Datasets

In many cases like reference data mastering, enrichment, etc., two individual datasets are duplicate-free but need to be matched against each other. The link phase is used for such scenarios.
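Following the same CLI pattern as the other phases (a sketch; the full invocation is not shown in this diff):

`./scripts/zingg.sh --phase link --conf <conf.json>`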
2 changes: 1 addition & 1 deletion docs/setup/match.md
@@ -1,7 +1,7 @@
---
title: Find the matches
parent: Step By Step Guide
-nav_order: 8
+nav_order: 9
---

# Finding The Matches
2 changes: 1 addition & 1 deletion docs/setup/train.md
@@ -1,7 +1,7 @@
---
title: Build and save the model
parent: Step By Step Guide
-nav_order: 7
+nav_order: 8
---

# Building And Saving The Model
14 changes: 11 additions & 3 deletions docs/verifyBlocking.md
@@ -1,10 +1,13 @@
---
-description: >-
-  Understanding how blocking is working before running match or link
+title: Verifying the blocked data
+parent: Step By Step Guide
+nav_order: 7
---

# Verification of Blocking Data

+## Understanding how blocking is working before running match or link

Sometimes Zingg jobs are slow or fail due to a poorly learnt blocking tree. This can happen for a variety of reasons, for example when:
- A user adds significantly larger training samples compared to the labelling learnt by Zingg. The manually added training samples may have the same type of columns, and the blocking rules learnt are not generic enough. For example, providing training data only from the state of California when matching uses the State column and the data spans multiple states.
- There is a natural bias in the data, with many nulls in the columns used for matching.
@@ -19,4 +22,9 @@ If we have an understanding of how blocking is working before deciding to run a

The output contains two directories, zinggDir/modelId/blocks/timestamp/counts and zinggDir/modelId/blocks/timestamp/blockSamples. These show the counts per block and the top 10% of records associated with the top 3 blocks by count, respectively.
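As an illustration, with the default zinggDir of models, a hypothetical model ID 100, and a hypothetical timestamp, the output can be browsed directly:

```
ls models/100/blocks/1732780800000/counts/
ls models/100/blocks/1732780800000/blockSamples/
```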

-For Enterprise Snowflake, we will be having tables with the names - zingg/modelId/blocks/timestamp/counts where we can see the counts per block and zingg/modelId/blocks/timestamp/blockSamples/hash where we can see the top 10% records associated with the top 3 blocks by counts in these tables respectively.

+For running verifyBlocking in **Zingg Enterprise Snowflake**, invoke:

+`./scripts/zingg.sh --phase verifyBlocking --conf <path to conf> --properties-file <path to Snowflake properties file> <optional --zinggDir <location of model>>`

+This will generate tables named zingg_modelId_blocks_timestamp_counts, where you can see the counts per block, and zingg_modelId_blocks_timestamp_blockSamples_hash, where you can see the top 10% of records associated with the top 3 blocks by count.
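For illustration, with a hypothetical model ID 100 and a hypothetical timestamp, the counts table could be inspected with:

```
SELECT * FROM zingg_100_blocks_1732780800000_counts;
```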
