Merge pull request #977 from sania-16/main
Working Documentation PR
sonalgoyal authored Nov 29, 2024
2 parents 8cbc26d + e16de89 commit b70ebaf
Showing 18 changed files with 181 additions and 70 deletions.
3 changes: 2 additions & 1 deletion docs/README.md
@@ -8,9 +8,10 @@ description: Hope you find us useful :-)

This is the latest documentation for Zingg. Release wise documentation can be accessed through:

* [v0.5.0 ]()
* [v0.4.0 ](https://docs.zingg.ai/zingg0.4.0/)
* [v0.3.4 ](https://docs.zingg.ai/zingg0.3.4/)
* [v0.3.3](https://docs.zingg.ai/zingg0.3.3/)
* [v0.3.3 ](https://docs.zingg.ai/zingg0.3.3/)

## Why?

12 changes: 11 additions & 1 deletion docs/SUMMARY.md
@@ -13,6 +13,12 @@
* [Spark Cluster Checklist](stepbystep/installation/installing-from-release/spark-cluster-checklist.md)
* [Installing Zingg](stepbystep/installation/installing-from-release/installing-zingg.md)
* [Verifying The Installation](stepbystep/installation/installing-from-release/verification.md)
* [Enterprise Installation for Snowflake](stepbystep/installation/installing-snowflake-enterprise/README.md)
* [Setting up Zingg](stepbystep/installation/installing-snowflake-enterprise/installing-zingg-enterprise.md)
* [Snowflake Properties](stepbystep/installation/installing-snowflake-enterprise/snowflake-properties.md)
* [Match Configuration](stepbystep/installation/installing-snowflake-enterprise/match-configuration.md)
* [Running Asynchronously](stepbystep/installation/installing-snowflake-enterprise/running-async-long-jobs.md)
* [Verifying The Installation](stepbystep/installation/installing-snowflake-enterprise/verify-installation.md)
* [Compiling From Source](stepbystep/installation/compiling-from-source.md)
* [Hardware Sizing](setup/hardwareSizing.md)
* [Zingg Runtime Properties](stepbystep/zingg-runtime-properties.md)
@@ -24,6 +30,7 @@
* [Output](stepbystep/configuration/data-input-and-output/output.md)
* [Field Definitions](stepbystep/configuration/field-definitions.md)
* [Deterministic Matching](deterministicMatching.md)
* [Pass Thru Data](passthru.md)
* [Model Location](stepbystep/configuration/model-location.md)
* [Tuning Label, Match And Link Jobs](stepbystep/configuration/tuning-label-match-and-link-jobs.md)
* [Telemetry](stepbystep/configuration/telemetry.md)
@@ -39,14 +46,17 @@
* [Finding The Matches](setup/match.md)
* [Adding Incremental Data](runIncremental.md)
* [Linking Across Datasets](setup/link.md)
* [Explanation of Models](modelexplain.md)
* [Approval of Clusters](approval.md)
* [Relate Feature](relations.md)
* [Data Sources and Sinks](dataSourcesAndSinks/connectors.md)
* [Zingg Pipes](dataSourcesAndSinks/pipes.md)
* [Databricks](dataSourcesAndSinks/databricks.md)
* [Snowflake](dataSourcesAndSinks/snowflake.md)
* [JDBC](dataSourcesAndSinks/jdbc.md)
* [Postgres](connectors/jdbc/postgres.md)
* [MySQL](connectors/jdbc/mysql.md)
* [AWS S3](connectors/aws-s3.md)
* [AWS S3](dataSourcesAndSinks/amazonS3.md)
* [Cassandra](dataSourcesAndSinks/cassandra.md)
* [MongoDB](dataSourcesAndSinks/mongodb.md)
* [Neo4j](dataSourcesAndSinks/neo4j.md)
2 changes: 1 addition & 1 deletion docs/accuracy/definingOwn.md
@@ -20,7 +20,7 @@ Say we have data like this:

| Pair 2 | firstname | lastname |
| :-------: | :-------: | :------: |
| Rrecord A | mary | ann |
| Record A | mary | ann |
| Record B | marry | |

Let us assume we have hash function **first1char** and we want to check if it is a good function to apply to **firstname**:
17 changes: 17 additions & 0 deletions docs/approval.md
@@ -0,0 +1,17 @@
---
title: Approve Clusters
parent: Step By Step Guide
nav_order: 13
---

# Approval of Clusters

##

[Zingg Enterprise Feature](#user-content-fn-1)[^1]



### The approval phase is run as follows:

` `
25 changes: 0 additions & 25 deletions docs/connectors/aws-s3.md

This file was deleted.

12 changes: 6 additions & 6 deletions docs/dataSourcesAndSinks/amazonS3.md
@@ -1,17 +1,17 @@
# S3
# AWS S3

Zingg can use AWS S3 as a source and sink

## Steps to run Zingg on S3

* Set a bucket e.g. zingg28032023 and a folder inside it e.g. zingg
* Set a bucket, for example _zingg28032023_, and a folder inside it, for example _zingg_

* Create an AWS access key and export it via env vars (ensure that the user with the below keys has read/write access to the above bucket)
export AWS_ACCESS_KEY_ID=<access key id>
export AWS_SECRET_ACCESS_KEY=<access key>
`export AWS_ACCESS_KEY_ID=<access key id>`
`export AWS_SECRET_ACCESS_KEY=<access key>`
(if MFA is enabled, the AWS_SESSION_TOKEN env var would also be needed)

* Download hadoop-aws-3.1.0.jar and aws-java-sdk-bundle-1.11.271.jar via maven
* Download _hadoop-aws-3.1.0.jar_ and _aws-java-sdk-bundle-1.11.271.jar_ via maven

* Set above in zingg.conf
spark.jars=/<location>/hadoop-aws-3.1.0.jar,/<location>/aws-java-sdk-bundle-1.11.271.jar
@@ -27,4 +27,4 @@ Zingg can use AWS S3 as a source and sink

## Model location
Models etc. would get saved in
Amazon S3 > Buckets > zingg28032023 >zingg > 100
Amazon S3 > Buckets > zingg28032023 > zingg > 100
29 changes: 0 additions & 29 deletions docs/improving-accuracy/stopwordsremoval/README.md

This file was deleted.

17 changes: 17 additions & 0 deletions docs/modelexplain.md
@@ -0,0 +1,17 @@
---
title: Explanation
parent: Step By Step Guide
nav_order: 12
---

# Explanation of Models

##

[Zingg Enterprise Feature](#user-content-fn-1)[^1]



### The explain phase is run as follows:

` ./scripts/zingg.sh --phase <phase for explanation> --conf <path to config> --mode explain `
8 changes: 8 additions & 0 deletions docs/passthru.md
@@ -0,0 +1,8 @@
---
description: >-
---

# Pass Thru Data

[Zingg Enterprise Feature](#user-content-fn-1)[^1]
12 changes: 6 additions & 6 deletions docs/reading.md
@@ -6,23 +6,23 @@ nav_order: 11



Entity Resolution and The Modern Data Stack.
Entity Resolution and The Modern Data Stack:

* [From Rows to People](https://roundup.getdbt.com/p/from-rows-to-people)

Identity Resolution and Why CDPs fail
Identity Resolution and Why CDPs fail:

* []()


Entity Resolution using a graph database.
Entity Resolution using a graph database:

* [Entity Resolution with TigerGraph](https://towardsdatascience.com/entity-resolution-with-tigergraph-add-zingg-to-the-mix-95009471ca02)

A detailed write-up on entity resolution - the problem, its challenges, and applications.
A detailed write-up on entity resolution - the problem, its challenges, and applications:

* [Entity Resolution](https://towardsdatascience.com/an-introduction-to-entity-resolution-needs-and-challenges-97fba052dde5)

Understanding Master data and Master Data Management.
Understanding Master Data and Master Data Management:

* [Agile Data Mastering](https://towardsdatascience.com/a-guide-to-agile-data-mastering-with-ai-3bf38f103709)

17 changes: 17 additions & 0 deletions docs/relations.md
@@ -0,0 +1,17 @@
---
title: Relations
parent: Step By Step Guide
nav_order: 14
---

# Relate Feature

##

[Zingg Enterprise Feature](#user-content-fn-1)[^1]



### The relate phase is run as follows:

` `
@@ -0,0 +1,15 @@
---
description: From the jars provided
---

# Installing Zingg on Snowflake for Enterprise

## Prerequisites

A) Java JDK 11

B) Snowflake - ZINGG_STAGE created in the Enterprise Snowflake account

***

####
@@ -0,0 +1,21 @@
---
description: Setting things up
---

# Installing Zingg

Copy the release and the license to a folder of your choice, say directly under /home/ubuntu. Then execute the following:

> `gzip -d zingg-enterprise-snowflake-0.4.1-SNAPSHOT.tar.gz `
> `tar xvf zingg-enterprise-snowflake-0.4.1-SNAPSHOT.tar `
> `cd zingg-enterprise-snowflake-0.4.1-SNAPSHOT `
> `export ZINGG_SNOW_JAR=~/zingg-enterprise-snowflake-0.4.1-SNAPSHOT `
> `export ZINGG_SNOW_HOME=~/zingg-enterprise-snowflake-0.4.1-SNAPSHOT `
**It is better to keep ZINGG_SNOW_JAR and ZINGG_SNOW_HOME in .bashrc so that they are always set in the shell.**

> `mv ~/zingg.license . `
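The .bashrc suggestion above can be sketched as follows (the install path is assumed from the steps above):

```shell
# Sketch: persist the Zingg variables in .bashrc so that every
# new shell session has them set automatically.
ZINGG_DIR=~/zingg-enterprise-snowflake-0.4.1-SNAPSHOT
echo "export ZINGG_SNOW_JAR=$ZINGG_DIR" >> ~/.bashrc
echo "export ZINGG_SNOW_HOME=$ZINGG_DIR" >> ~/.bashrc
```

After this, `source ~/.bashrc` (or a new login shell) picks up both variables.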
@@ -0,0 +1,12 @@
---
description:
---

# Match Configuration:

Create the Snowflake config file. This config file contains the model location, the match types defined on fields, and the input and output tables in Snowflake. Please refer to **examples/febrl/configSnow.json** for a sample. Documentation for fields and match types is available at [Zingg Field Definitions](https://docs.zingg.ai/zingg0.4.0/stepbystep/configuration/field-definitions).

Along with the changes to the field definitions, please do the following:
- Give ‘modelId’ a name. An example could be 28NovDev.
- Replace ‘INPUT_TABLE_NAME’ in data with the name of the source table.
- Replace ‘OUTPUT_TABLE_NAME’ in output with the name of the output table.
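For orientation, the three edits above might look like this in the JSON. The key names and nesting here are illustrative assumptions only; the authoritative layout is the shipped sample in **examples/febrl/configSnow.json**:

```json
{
  "_comment": "Illustrative sketch only -- follow examples/febrl/configSnow.json for the real structure",
  "modelId": "28NovDev",
  "data": [
    { "props": { "table": "CUSTOMERS_RAW" } }
  ],
  "output": [
    { "props": { "table": "UNIFIED_CUSTOMERS" } }
  ]
}
```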
@@ -0,0 +1,13 @@
---
description: To ensure Zingg works as a background process
---

# Running Asynchronously For Long Duration Jobs:

Using nohup, we can run Zingg as a background process; even if the SSH connection is broken, the job will continue to run.

> `nohup ./scripts/zingg.sh --properties-file ~/zingg/snowEnv.txt --phase findTrainingData --conf ~/zingg/snowConfigFile.json & `
We can see the script's logs by running the following command:

> `tail -f nohup.out `
@@ -0,0 +1,21 @@
---
description: Connection Details
---

# Snowflake Connection Properties

Zingg needs details about accessing Snowflake, which can be provided through a properties file.

> `touch snowEnv.txt `
### snowEnv.txt format:

```
URL={snowflake_url}
USER={snowflake_user_name}
PASSWORD={snowflake_password}
ROLE={role}
WAREHOUSE={warehouse}
DB={database_name}
SCHEMA={schema}
```
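As an illustration, a filled-in snowEnv.txt could look like the following (all values here are made up; use your own account details):

```
URL=https://myaccount.snowflakecomputing.com
USER=zingg_user
PASSWORD=my_secret_password
ROLE=SYSADMIN
WAREHOUSE=COMPUTE_WH
DB=CUSTOMERS_DB
SCHEMA=PUBLIC
```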
@@ -0,0 +1,13 @@
---
description: To verify that the Zingg Enterprise installation works correctly
---

# Verifying The Installation

Let us now run a sample program to ensure that our installation is correct.

> `./scripts/zingg.sh --properties-file snowEnv.txt --phase findTrainingData --conf examples/febrl/configSnow.json `
The above will build Zingg models and use them to find duplicates in the **examples/febrl/test.csv** file. You will see Zingg logs on the console, and once the job finishes, a table named **UNIFIED_CUSTOMERS_MODELID** will be created, with matching records sharing the same _cluster id_.

Congratulations, Zingg has been installed!
@@ -22,6 +22,6 @@ Let us now run a sample program to ensure that our installation is correct.
> `./scripts/zingg.sh --phase trainMatch --conf examples/febrl/config.json`
The above will build Zingg models and use that to find duplicates in the **examples/febl/test.csv** file. You will see Zingg logs on the console and once the job finishes, you will see some files under **/tmp/zinggOutput** with matching records sharing the same _cluster id_.
The above will build Zingg models and use that to find duplicates in the **examples/febrl/test.csv** file. You will see Zingg logs on the console and once the job finishes, you will see some files under **/tmp/zinggOutput** with matching records sharing the same _cluster id_.

Congratulations, Zingg has been installed!
