Merge pull request #977 from sania-16/main
Working Documentation PR
sonalgoyal authored Nov 29, 2024
2 parents 8cbc26d + e16de89 commit b70ebaf
Showing 18 changed files with 181 additions and 70 deletions.
3 changes: 2 additions & 1 deletion docs/README.md
@@ -8,9 +8,10 @@ description: Hope you find us useful :-)

This is the latest documentation for Zingg. Release wise documentation can be accessed through:

* [v0.5.0 ]()
* [v0.4.0 ](https://docs.zingg.ai/zingg0.4.0/)
* [v0.3.4 ](https://docs.zingg.ai/zingg0.3.4/)
* [v0.3.3](https://docs.zingg.ai/zingg0.3.3/)
* [v0.3.3 ](https://docs.zingg.ai/zingg0.3.3/)

## Why?

12 changes: 11 additions & 1 deletion docs/SUMMARY.md
@@ -13,6 +13,12 @@
* [Spark Cluster Checklist](stepbystep/installation/installing-from-release/spark-cluster-checklist.md)
* [Installing Zingg](stepbystep/installation/installing-from-release/installing-zingg.md)
* [Verifying The Installation](stepbystep/installation/installing-from-release/verification.md)
* [Enterprise Installation for Snowflake](stepbystep/installation/installing-snowflake-enterprise/README.md)
* [Setting up Zingg](stepbystep/installation/installing-snowflake-enterprise/installing-zingg-enterprise.md)
* [Snowflake Properties](stepbystep/installation/installing-snowflake-enterprise/snowflake-properties.md)
* [Match Configuration](stepbystep/installation/installing-snowflake-enterprise/match-configuration.md)
* [Running Asynchronously](stepbystep/installation/installing-snowflake-enterprise/running-async-long-jobs.md)
* [Verifying The Installation](stepbystep/installation/installing-snowflake-enterprise/verify-installation.md)
* [Compiling From Source](stepbystep/installation/compiling-from-source.md)
* [Hardware Sizing](setup/hardwareSizing.md)
* [Zingg Runtime Properties](stepbystep/zingg-runtime-properties.md)
@@ -24,6 +30,7 @@
* [Output](stepbystep/configuration/data-input-and-output/output.md)
* [Field Definitions](stepbystep/configuration/field-definitions.md)
* [Deterministic Matching](deterministicMatching.md)
* [Pass Thru Data](passthru.md)
* [Model Location](stepbystep/configuration/model-location.md)
* [Tuning Label, Match And Link Jobs](stepbystep/configuration/tuning-label-match-and-link-jobs.md)
* [Telemetry](stepbystep/configuration/telemetry.md)
@@ -39,14 +46,17 @@
* [Finding The Matches](setup/match.md)
* [Adding Incremental Data](runIncremental.md)
* [Linking Across Datasets](setup/link.md)
* [Explanation of Models](modelexplain.md)
* [Approval of Clusters](approval.md)
* [Relate Feature](relations.md)
* [Data Sources and Sinks](dataSourcesAndSinks/connectors.md)
* [Zingg Pipes](dataSourcesAndSinks/pipes.md)
* [Databricks](dataSourcesAndSinks/databricks.md)
* [Snowflake](dataSourcesAndSinks/snowflake.md)
* [JDBC](dataSourcesAndSinks/jdbc.md)
* [Postgres](connectors/jdbc/postgres.md)
* [MySQL](connectors/jdbc/mysql.md)
* [AWS S3](connectors/aws-s3.md)
* [AWS S3](dataSourcesAndSinks/amazonS3.md)
* [Cassandra](dataSourcesAndSinks/cassandra.md)
* [MongoDB](dataSourcesAndSinks/mongodb.md)
* [Neo4j](dataSourcesAndSinks/neo4j.md)
2 changes: 1 addition & 1 deletion docs/accuracy/definingOwn.md
@@ -20,7 +20,7 @@ Say we have data like this:

| Pair 2 | firstname | lastname |
| :-------: | :-------: | :------: |
| Rrecord A | mary | ann |
| Record A | mary | ann |
| Record B | marry | |

Let us assume we have hash function **first1char** and we want to check if it is a good function to apply to **firstname**:
17 changes: 17 additions & 0 deletions docs/approval.md
@@ -0,0 +1,17 @@
---
title: Approve Clusters
parent: Step By Step Guide
nav_order: 13
---

# Approval of Clusters

##

[Zingg Enterprise Feature](#user-content-fn-1)[^1]



### The approval phase is run as follows:

` `
25 changes: 0 additions & 25 deletions docs/connectors/aws-s3.md

This file was deleted.

12 changes: 6 additions & 6 deletions docs/dataSourcesAndSinks/amazonS3.md
@@ -1,17 +1,17 @@
# S3
# AWS S3

Zingg can use AWS S3 as a source and sink

## Steps to run Zingg on S3

* Set a bucket e.g. zingg28032023 and a folder inside it e.g. zingg
* Set a bucket, for example _zingg28032023_, and a folder inside it, for example _zingg_

* Create an AWS access key and export it via env vars (ensure that the user with the below keys has read/write access to the above bucket)
export AWS_ACCESS_KEY_ID=<access key id>
export AWS_SECRET_ACCESS_KEY=<access key>
`export AWS_ACCESS_KEY_ID=<access key id>`
`export AWS_SECRET_ACCESS_KEY=<access key>`
(if MFA is enabled, the AWS_SESSION_TOKEN env var would also be needed)

* Download hadoop-aws-3.1.0.jar and aws-java-sdk-bundle-1.11.271.jar via maven
* Download _hadoop-aws-3.1.0.jar_ and _aws-java-sdk-bundle-1.11.271.jar_ via maven

* Set above in zingg.conf
spark.jars=/<location>/hadoop-aws-3.1.0.jar,/<location>/aws-java-sdk-bundle-1.11.271.jar
@@ -27,4 +27,4 @@ Zingg can use AWS S3 as a source and sink

## Model location
Models etc. would get saved in
Amazon S3 > Buckets > zingg28032023 >zingg > 100
Amazon S3 > Buckets > zingg28032023 > zingg > 100
29 changes: 0 additions & 29 deletions docs/improving-accuracy/stopwordsremoval/README.md

This file was deleted.

17 changes: 17 additions & 0 deletions docs/modelexplain.md
@@ -0,0 +1,17 @@
---
title: Explanation
parent: Step By Step Guide
nav_order: 12
---

# Explanation of Models

##

[Zingg Enterprise Feature](#user-content-fn-1)[^1]



### The explain phase is run as follows:

` ./scripts/zingg.sh --phase <phase for explanation> --conf <path to config> --mode explain `
8 changes: 8 additions & 0 deletions docs/passthru.md
@@ -0,0 +1,8 @@
---
description: >-
---

# Pass Thru Data

[Zingg Enterprise Feature](#user-content-fn-1)[^1]
12 changes: 6 additions & 6 deletions docs/reading.md
@@ -6,23 +6,23 @@ nav_order: 11



Entity Resolution and The Modern Data Stack.
Entity Resolution and The Modern Data Stack:

* [From Rows to People](https://roundup.getdbt.com/p/from-rows-to-people)

Identity Resolution and Why CDPs fail
Identity Resolution and Why CDPs fail:

* []()


Entity Resolution using a graph database.
Entity Resolution using a graph database:

* [Entity Resolution with TigerGraph](https://towardsdatascience.com/entity-resolution-with-tigergraph-add-zingg-to-the-mix-95009471ca02)

A detailed write-up on entity resolution - the problem, its challenges, and applications.
A detailed write-up on entity resolution - the problem, its challenges, and applications:

* [Entity Resolution](https://towardsdatascience.com/an-introduction-to-entity-resolution-needs-and-challenges-97fba052dde5)

Understanding Master data and Master Data Management.
Understanding Master Data and Master Data Management:

* [Agile Data Mastering](https://towardsdatascience.com/a-guide-to-agile-data-mastering-with-ai-3bf38f103709)

17 changes: 17 additions & 0 deletions docs/relations.md
@@ -0,0 +1,17 @@
---
title: Relations
parent: Step By Step Guide
nav_order: 14
---

# Relate Feature

##

[Zingg Enterprise Feature](#user-content-fn-1)[^1]



### The relate phase is run as follows:

` `
@@ -0,0 +1,15 @@
---
description: From the jars provided
---

# Installing Zingg on Snowflake for Enterprise

## Prerequisites

A) Java JDK 11

B) Snowflake - ZINGG_STAGE created in the Enterprise Snowflake account

***

####
@@ -0,0 +1,21 @@
---
description: Setting things up
---

# Installing Zingg

Copy the release and the license to a folder of your choice, say directly under /home/ubuntu. Then execute the following:

> `gzip -d zingg-enterprise-snowflake-0.4.1-SNAPSHOT.tar.gz `
> `tar xvf zingg-enterprise-snowflake-0.4.1-SNAPSHOT.tar `
> `cd zingg-enterprise-snowflake-0.4.1-SNAPSHOT `
> `export ZINGG_SNOW_JAR=~/zingg-enterprise-snowflake-0.4.1-SNAPSHOT `
> `export ZINGG_SNOW_HOME=~/zingg-enterprise-snowflake-0.4.1-SNAPSHOT `
**It is better to keep ZINGG_SNOW_JAR and ZINGG_SNOW_HOME in .bashrc so that they are always set in the shell.**

> `mv ~/zingg.license . `
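The .bashrc suggestion above can be sketched as follows (the install path is assumed from the steps above):

```shell
# Sketch: persist the Zingg variables in .bashrc so that every
# new shell session has them set automatically.
ZINGG_DIR=~/zingg-enterprise-snowflake-0.4.1-SNAPSHOT
echo "export ZINGG_SNOW_JAR=$ZINGG_DIR" >> ~/.bashrc
echo "export ZINGG_SNOW_HOME=$ZINGG_DIR" >> ~/.bashrc
```

After this, `source ~/.bashrc` (or a new login shell) picks up both variables.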
@@ -0,0 +1,12 @@
---
description:
---

# Match Configuration:

Create the Snowflake config file. This config file contains the model location, the match types defined on fields, and the input and output tables in Snowflake. Please refer to **examples/febrl/configSnow.json** for a sample. Documentation for fields and match types is available at [Zingg Field Definitions](https://docs.zingg.ai/zingg0.4.0/stepbystep/configuration/field-definitions).

Along with the changes to the field definitions, please do the following:
- Give ‘modelId’ a name. An example could be 28NovDev.
- Replace ‘INPUT_TABLE_NAME’ in data with the name of the source table.
- Replace ‘OUTPUT_TABLE_NAME’ in output with the name of the output table.
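For orientation, the three edits above might look like this in the JSON. The key names and nesting here are illustrative assumptions only; the authoritative layout is the shipped sample in **examples/febrl/configSnow.json**:

```json
{
  "_comment": "Illustrative sketch only -- follow examples/febrl/configSnow.json for the real structure",
  "modelId": "28NovDev",
  "data": [
    { "props": { "table": "CUSTOMERS_RAW" } }
  ],
  "output": [
    { "props": { "table": "UNIFIED_CUSTOMERS" } }
  ]
}
```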
@@ -0,0 +1,13 @@
---
description: To ensure Zingg works as a background process
---

# Running Asynchronously For Long Duration Jobs:

Using nohup, we can run Zingg as a background process; even if the SSH connection is broken, the job will continue to run.

> `nohup ./scripts/zingg.sh --properties-file ~/zingg/snowEnv.txt --phase findTrainingData --conf ~/zingg/snowConfigFile.json & `
We can see the script's logs by running the following command:

> `tail -f nohup.out `
@@ -0,0 +1,21 @@
---
description: Connection Details
---

# Snowflake Connection Properties

Zingg needs details about accessing Snowflake, which can be provided through a properties file.

> `touch snowEnv.txt `
### snowEnv.txt format:

```
URL={snowflake_url}
USER={snowflake_user_name}
PASSWORD={snowflake_password}
ROLE={role}
WAREHOUSE={warehouse}
DB={database_name}
SCHEMA={schema}
```
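As an illustration, a filled-in snowEnv.txt could look like the following (all values here are made up; use your own account details):

```
URL=https://myaccount.snowflakecomputing.com
USER=zingg_user
PASSWORD=my_secret_password
ROLE=SYSADMIN
WAREHOUSE=COMPUTE_WH
DB=CUSTOMERS_DB
SCHEMA=PUBLIC
```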
@@ -0,0 +1,13 @@
---
description: To verify that the Zingg Enterprise installation works correctly
---

# Verifying The Installation

Let us now run a sample program to ensure that our installation is correct.

> `./scripts/zingg.sh --properties-file snowEnv.txt --phase findTrainingData --conf examples/febrl/configSnow.json `
The above will build Zingg models and use them to find duplicates in the **examples/febrl/test.csv** file. You will see Zingg logs on the console, and once the job finishes, a table named **UNIFIED_CUSTOMERS_MODELID** will be created, with matching records sharing the same _cluster id_.

Congratulations, Zingg has been installed!
@@ -22,6 +22,6 @@ Let us now run a sample program to ensure that our installation is correct.
> `./scripts/zingg.sh --phase trainMatch --conf examples/febrl/config.json`
The above will build Zingg models and use that to find duplicates in the **examples/febl/test.csv** file. You will see Zingg logs on the console and once the job finishes, you will see some files under **/tmp/zinggOutput** with matching records sharing the same _cluster id_.
The above will build Zingg models and use that to find duplicates in the **examples/febrl/test.csv** file. You will see Zingg logs on the console and once the job finishes, you will see some files under **/tmp/zinggOutput** with matching records sharing the same _cluster id_.

Congratulations, Zingg has been installed!
