- Introduction
- Folder structure
- Data Contract Validation Module
This folder contains source code (src) in the form of a Python package "gaia_data_products", comprising a number of modules to host reusable Python classes and functions to support and develop Databricks data pipelines (bronze to silver).
- src/ --> container for all reusable code
- src/gaia_data_products/ --> root of the "package" (including an `__init__.py`)
- src/gaia_data_products/dp_validator/ --> module for validating data products (e.g. yaml specs vs yaml of product)
- src/gaia_data_products/dp_metadata/ --> module that reads the yaml of a product and "attaches" that metadata to the product (essentially a wrapper around the current ALTER TABLE statement that sets the TAGS, and the COMMENT statement that sets the description).
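To make the dp_metadata idea concrete, here is a minimal sketch of how the two SQL statements might be built from a parsed product yaml. The helper names (`_sql_quote`, `build_metadata_statements`), the table name, and the rule that every key except `description` becomes a tag are assumptions for illustration, not the package's actual API.

```python
def _sql_quote(value):
    """Render a value as a single-quoted SQL string literal (backslash escaping)."""
    text = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return "'" + text + "'"


def build_metadata_statements(table_name, contract):
    """Return the ALTER TABLE ... SET TAGS and COMMENT ON TABLE statements.

    Hypothetical rule: every key except 'description' becomes a tag,
    and 'description' becomes the table comment.
    """
    tags = {key: value for key, value in contract.items() if key != "description"}
    tag_list = ", ".join(
        f"{_sql_quote(key)} = {_sql_quote(value)}" for key, value in sorted(tags.items())
    )
    statements = []
    if tag_list:
        statements.append(f"ALTER TABLE {table_name} SET TAGS ({tag_list})")
    if "description" in contract:
        comment = _sql_quote(contract["description"])
        statements.append(f"COMMENT ON TABLE {table_name} IS {comment}")
    return statements


# Illustrative input; the catalog, schema and table names are made up.
for statement in build_metadata_statements(
    "my_catalog.my_schema.my_table",
    {"title": "Demo product", "version": "1.0.0", "description": "A demo data product"},
):
    print(statement)
```

In a Databricks notebook, each returned string would typically be passed to `spark.sql(...)`; that step is left out so the sketch runs anywhere.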
Python module for validating the DataContract yaml file (i.e. dp_validator/yaml_validator.py). This module consists of three classes.
This is a utility class (DataContractUtility). All the methods declared in this class are static, so they are bound to the class and not to an object of the class. Methods declared in the class:
- Sorts a Python dictionary based on its keys. Input: Python dictionary Returns: Sorted Python dictionary based on key
- Checks whether a file exists at the given path. Input: File Path Returns: Raises AssertException if the file is not present
- Fetches the yaml file at the given path and returns its content in the form of JSON using the pyyaml library. Input: yaml file path Returns: JSON variable
- Lists the yaml files present in the given directory; if no directory is passed, it lists the yaml files present in the current directory. Input: Directory path Returns: list of yaml files in the given directory
- Takes a data product id as input and returns the domain it belongs to. Input: Data Product ID Returns: Domain of the data product
- Fetches the tags applied to a table in the form of JSON, using the information_schema of the specified catalog. Input: catalog_name, table_name Returns: JSON object containing the tags and their values for the given dataproduct
- Fetches the comment applied to the dataproduct, using the information_schema of the specified catalog. Input: catalog_name, table_name Returns: String containing the comment applied to the given dataproduct
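The file-oriented helpers above can be sketched with the standard library alone. The class and method names below are hypothetical stand-ins, and the sketch assumes that "AssertException" corresponds to the `AssertionError` raised by Python's `assert` statement. The yaml loading and information_schema queries are left out because they need pyyaml and a Databricks session.

```python
import os
import tempfile


class DataContractUtilitySketch:
    """Illustrative stand-in for the utility class; all names are hypothetical."""

    @staticmethod
    def sort_dict(data):
        """Return a new dict whose top-level keys are in sorted order."""
        return dict(sorted(data.items()))

    @staticmethod
    def check_file_exists(file_path):
        """Fail with an AssertionError if the file is missing."""
        assert os.path.isfile(file_path), f"File not found: {file_path}"

    @staticmethod
    def list_yaml_files(directory="."):
        """List .yaml/.yml file names in a directory (current one by default)."""
        return sorted(
            name
            for name in os.listdir(directory)
            if name.lower().endswith((".yaml", ".yml"))
        )


# Static methods are called on the class itself; no instance is needed.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "datacontract.yaml"), "w").close()
    open(os.path.join(tmp, "notes.txt"), "w").close()
    yaml_files = DataContractUtilitySketch.list_yaml_files(tmp)
    print(yaml_files)  # ['datacontract.yaml']
```

Note that `assert` statements are skipped when Python runs with the `-O` flag, so a production version might raise an explicit exception instead.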
This class contains methods for validating the keys of the datacontract. There is a pre-defined method for validating each key in a data contract. In addition, there is a method that checks that no key in the data contract is null.
- Checks that none of the keys in the DataContract is null Input: DataContract JSON (dp_datacontract) Returns: Raises AssertException if any of the keys is null
- Validate the description key value in the datacontract file. Checks if the value in the description key is of string datatype Input: DataContract JSON (dp_datacontract) Returns: Raises AssertException if the description is not of string datatype
- Validate the DataProduct ID (gaia_dp_id key) in the datacontract file. Checks if the id is in the format gaiaxxxxx Input: DataContract JSON (dp_datacontract) Returns: Raises AssertException if the gaia_dp_id is not in valid format
- Validate the title key in the datacontract file. Checks if the value in the title key is of string datatype Input: DataContract JSON (dp_datacontract) Returns: Raises AssertException if the title is not of string datatype
- Validate the product_owner key in the datacontract file. The value contains multiple emails separated by commas (','). Checks if each email is from the syngenta organization (mail ends with @syngenta.com) Input: DataContract JSON (dp_datacontract) Returns: Raises AssertException if the product_owner deviates from the rules specified
- Validate the version key in the datacontract file. Checks if the value follows semantic versioning (SemVer), i.e. has the form MAJOR.MINOR.PATCH or MAJOR.MINOR (e.g. 0.1, 1.0.0, 1.2.0) Input: DataContract JSON (dp_datacontract) Returns: Raises AssertException if the version is not of SemVer type
- Validate the refresh_rate key in the datacontract file. Refresh rate of a dataproduct file must be one of the following 'daily','weekly','monthly','once'. Input: DataContract JSON (dp_datacontract) Returns: Raises AssertException if the refresh_rate key deviates from the rules specified.
- Validate the apiVersion key in the datacontract file. Checks if the value follows semantic versioning (SemVer), i.e. has the form MAJOR.MINOR.PATCH or MAJOR.MINOR (e.g. 0.1, 1.0.0, 1.2.0) Input: DataContract JSON (dp_datacontract) Returns: Raises AssertException if the apiVersion is not of SemVer type
- Validate the kind key in the datacontract file. Checks if the value is equal to DataContract. Input: DataContract JSON (dp_datacontract) Returns: Raises AssertException if the kind is not of DataContract.
- Validate the datasetDomain key in the datacontract file. This checks whether the dataproduct is related to the stated domain. The domain of the dataproduct is fetched based on its id, and the domain must be one of the following 'portfolio','ppd','productsafetyregulatory','realworld','research'. Input: DataContract JSON (dp_datacontract) Returns: Raises AssertException if the domain is not valid or the domain is not related to the dataproduct
- Validate the sourcePlatform key in the datacontract file. This checks whether the sourcePlatform is 'gaia'. Input: DataContract JSON (dp_datacontract) Returns: Raises AssertException if the sourcePlatform key fails the check.
- Validate the sourceSystem key in the datacontract file. This checks whether the sourceSystem is 'databricks' Input: DataContract JSON (dp_datacontract) Returns: Raises AssertException if the sourceSystem key fails the check.
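A few of the checks listed above can be sketched as plain functions that take the parsed contract and fail with an assertion. The function names, the regular expression, and the handling of empty email entries are assumptions for illustration; the gaia_dp_id check is left out because its exact format is not specified here.

```python
import re

ALLOWED_REFRESH_RATES = {"daily", "weekly", "monthly", "once"}
# MAJOR.MINOR or MAJOR.MINOR.PATCH, digits only (e.g. 0.1, 1.0.0).
VERSION_PATTERN = re.compile(r"\d+\.\d+(\.\d+)?")


def validate_not_null(dp_datacontract):
    """Fail if any top-level key has a null (None) value."""
    null_keys = [key for key, value in dp_datacontract.items() if value is None]
    assert not null_keys, f"Keys with null values: {null_keys}"


def validate_version(dp_datacontract):
    """Fail unless version looks like MAJOR.MINOR or MAJOR.MINOR.PATCH."""
    # str() covers the case where YAML parses an unquoted 0.1 as a float.
    version = str(dp_datacontract["version"])
    assert VERSION_PATTERN.fullmatch(version), f"Invalid version: {version}"


def validate_refresh_rate(dp_datacontract):
    """Fail unless refresh_rate is one of the allowed values."""
    refresh_rate = dp_datacontract["refresh_rate"]
    assert refresh_rate in ALLOWED_REFRESH_RATES, f"Invalid refresh_rate: {refresh_rate}"


def validate_product_owner(dp_datacontract):
    """Fail unless every comma-separated email ends with @syngenta.com."""
    owners = [mail.strip() for mail in dp_datacontract["product_owner"].split(",")]
    invalid = [
        mail
        for mail in owners
        if not mail.endswith("@syngenta.com") or mail == "@syngenta.com"
    ]
    assert not invalid, f"Invalid product_owner entries: {invalid}"


# Illustrative contract; the values are made up.
contract = {
    "version": "1.2.0",
    "refresh_rate": "daily",
    "product_owner": "first.owner@syngenta.com, second.owner@syngenta.com",
}
for check in (validate_not_null, validate_version, validate_refresh_rate, validate_product_owner):
    check(contract)
print("contract passed all checks")
```

Keeping one function per key mirrors the structure described above and makes it easy to report exactly which key failed.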
This class contains methods that check if the dataproduct already exists or not, extract the tags and comments of a dataproduct in the form of JSON, fetch the data contract in the form of JSON, fetch the specification YAML in the form of JSON.
- Checks if a dataproduct is present in the specified catalog. If the arguments are not passed, the method uses class variables Input: target_catalog, gaia_dp_id Returns: True if exists, otherwise False
- Fetches the metadata of a dataproduct, including the comment. These values are fetched using the methods in DataContractUtility. Input: target_catalog, gaia_dp_id Returns: Returns sorted JSON of the metadata
- Checks if the DataContract file is present or not. If present, fetches the data in the DataContract.yaml file into a Python JSON object. Input: gaia_dp_id Returns: Returns sorted JSON of the datacontract
- Fetches the specification version against which the datacontract needs to be verified. Input: spec_version Returns: Returns the version
- Fetches the specification yaml file and returns the data as JSON Input: uses class variables, spec_version Returns: Returns sorted JSON of the specification yaml file
- Compares the metadata of a dataproduct (tags, comment) with the datacontract data. If there are updates, returns True and sets the taskValue for key "valid" to true. Input: dp_metadata, data_contract Returns: True if updates are present otherwise False
- Validates the datacontract against the specification yaml of that specific version. Sets taskValues recording whether the datacontract is valid and the version against which it was validated Input: data_contract, dp_specification Returns: Creates two taskValues Valid, Version based on the validation results.
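The last two steps above, comparing applied metadata with the contract and checking the contract against a specification, might look roughly like the sketch below. The function names are hypothetical, and two rules are assumed for illustration only: values are compared as strings, and every top-level key of the specification is treated as required in the contract.

```python
def has_updates(dp_metadata, data_contract):
    """Return True when the contract differs from the metadata already applied.

    Assumption: values are compared as strings, because table tags and
    comments are stored as text.
    """
    applied = {key: str(value) for key, value in dp_metadata.items()}
    desired = {key: str(value) for key, value in data_contract.items()}
    return applied != desired


def validate_against_spec(data_contract, dp_specification):
    """Return (valid, missing_keys).

    Assumption: every top-level key of the specification is required
    in the data contract; the real rules may be richer.
    """
    missing = sorted(set(dp_specification) - set(data_contract))
    return (not missing, missing)


# Illustrative inputs; the values are made up.
metadata = {"title": "Demo", "version": "1.0"}
contract = {"title": "Demo", "version": "1.1"}
spec = {"title": None, "version": None, "kind": None}

print(has_updates(metadata, contract))        # True
print(validate_against_spec(contract, spec))  # (False, ['kind'])
```

In a Databricks job, the results could then be published to downstream tasks with `dbutils.jobs.taskValues.set`; that call is omitted so the sketch runs outside Databricks.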