This repository is specifically for Senzing SDK V4. It is not designed to work with Senzing API V3.
To find the Senzing API V3 version of this repository, visit code-snippets-v3.
Succinct examples of how you might use the Senzing SDK for operational tasks.
- Legend
- Warning
- Senzing Engine Configuration
- Senzing APIs Bare Metal Usage
- Docker Usage
- Items of Note
- 🤔 - A "thinker" icon means that a little extra thinking may be required. Perhaps there are some choices to be made. Perhaps it's an optional step.
- ✏️ - A "pencil" icon means that the instructions may need modification before performing.
- ⚠️ - A "warning" icon means that something tricky is happening, so pay attention.
A JSON configuration string is used by the snippets to specify initialization parameters to the Senzing engine:

```json
{
  "PIPELINE": {
    "SUPPORTPATH": "/home/senzing/mysenzproj1/data",
    "CONFIGPATH": "/home/senzing/mysenzproj1/etc",
    "RESOURCEPATH": "/home/senzing/mysenzproj1/resources"
  },
  "SQL": {
    "CONNECTION": "postgresql://user:password@host:5432:g2"
  }
}
```
The JSON configuration string is set via the environment variable `SENZING_ENGINE_CONFIGURATION_JSON`.
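For illustration, a snippet might read this variable and fall back to an inline configuration when it is unset. This is a minimal sketch: the paths are the placeholder paths from the example above, and only the `SENZING_ENGINE_CONFIGURATION_JSON` variable name is taken from this document.

```python
import json
import os

# Placeholder engine configuration, mirroring the example above.
DEFAULT_SETTINGS = {
    "PIPELINE": {
        "SUPPORTPATH": "/home/senzing/mysenzproj1/data",
        "CONFIGPATH": "/home/senzing/mysenzproj1/etc",
        "RESOURCEPATH": "/home/senzing/mysenzproj1/resources",
    },
    "SQL": {"CONNECTION": "postgresql://user:password@host:5432:g2"},
}


def get_engine_settings() -> dict:
    """Return the engine configuration as a dict, parsing the JSON string
    from SENZING_ENGINE_CONFIGURATION_JSON when it is set."""
    raw = os.environ.get("SENZING_ENGINE_CONFIGURATION_JSON")
    if raw:
        return json.loads(raw)
    return DEFAULT_SETTINGS
```

Parsing once up front also surfaces malformed JSON (as a `json.JSONDecodeError`) before any engine call is attempted.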
You may have already installed Senzing and created a Senzing project by following the Quickstart Guide. If not, and you would like to install Senzing directly on a machine, follow the steps in the Quickstart Guide. Be sure to review the Quickstart Roadmap, especially the System Requirements.
When using a bare metal install, the initialization parameters used by the Senzing Python utilities are maintained within `<project_path>/etc/G2Module.ini`.
🤔 To convert an existing Senzing project G2Module.ini file to a JSON string, use one of the following methods:

- jc (modify the path to your project's `G2Module.ini` file):

  ```shell
  cat <project_path>/etc/G2Module.ini | jc --ini
  ```

- Python one-liner:

  ```shell
  python3 -c $'import configparser; ini_file_name = "<project_path>/etc/G2Module.ini";engine_config_json = {};cfgp = configparser.ConfigParser();cfgp.optionxform = str;cfgp.read(ini_file_name)\nfor section in cfgp.sections(): engine_config_json[section] = dict(cfgp.items(section))\nprint(engine_config_json)'
  ```

✏️ `<project_path>` in the above examples should point to your project.
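The one-liner above can also be written as a short, readable script. This is a sketch using only the standard library; unlike the one-liner (which prints a Python dict), it emits strict JSON via `json.dumps`, and the `ini_to_json` name is illustrative.

```python
import configparser
import json


def ini_to_json(ini_file_name: str) -> str:
    """Convert a G2Module.ini file to the JSON configuration string
    expected in SENZING_ENGINE_CONFIGURATION_JSON."""
    cfgp = configparser.ConfigParser()
    cfgp.optionxform = str  # preserve the case of keys such as SUPPORTPATH
    cfgp.read(ini_file_name)
    engine_config = {
        section: dict(cfgp.items(section)) for section in cfgp.sections()
    }
    return json.dumps(engine_config)


if __name__ == "__main__":
    # Replace the placeholder with the path to your project.
    print(ini_to_json("<project_path>/etc/G2Module.ini"))
```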
- Clone this repository
- Export the engine configuration obtained for your project from Configuration, e.g.,

  ```shell
  export SENZING_ENGINE_CONFIGURATION_JSON='{"PIPELINE": {"SUPPORTPATH": "/<project_path>/data", "CONFIGPATH": "<project_path>/etc", "RESOURCEPATH": "<project_path>/resources"}, "SQL": {"CONNECTION": "postgresql://user:password@host:5432:g2"}}'
  ```

- Source the Senzing project setupEnv file

  ```shell
  source <project_path>/setupEnv
  ```

- Run code snippets

✏️ `<project_path>` in the above examples should point to your project.
The included Dockerfile leverages the Senzing API runtime image to provide an environment to run the code snippets.
Coming soon...
Coming soon...
A feature of Senzing is the capability to pass changes from data manipulation SDK calls to downstream systems for analysis, consolidation and replication. The SDK methods `add_record()`, `delete_record()` and `process_redo_record()` accept a `flags=` argument that, when set to `SzEngineFlags.SZ_WITH_INFO`, returns a response message detailing any entities affected by the call. In the following example (from `add_record("TEST", "10945", flags=SzEngineFlags.SZ_WITH_INFO)`), a single entity with the ID 7903 was affected.
```json
{
  "DATA_SOURCE": "TEST",
  "RECORD_ID": "10945",
  "AFFECTED_ENTITIES": [
    {
      "ENTITY_ID": 7903,
      "LENS_CODE": "DEFAULT"
    }
  ],
  "INTERESTING_ENTITIES": []
}
```
The AFFECTED_ENTITIES object contains a list of all entity IDs affected. Separate processes can query the affected entities and synchronize changes and information to downstream systems. For additional information see Real-time replication and analytics.
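For illustration, a downstream process might pull the affected entity IDs out of a with-info response like the one above. The `affected_entity_ids` helper is hypothetical, not part of the Senzing SDK; the JSON is the sample response shown above.

```python
import json

# Sample with-info response, as returned by add_record() with SZ_WITH_INFO.
with_info = """
{
  "DATA_SOURCE": "TEST",
  "RECORD_ID": "10945",
  "AFFECTED_ENTITIES": [{"ENTITY_ID": 7903, "LENS_CODE": "DEFAULT"}],
  "INTERESTING_ENTITIES": []
}
"""


def affected_entity_ids(response: str) -> list[int]:
    """Return the entity IDs listed in a with-info response."""
    info = json.loads(response)
    return [entity["ENTITY_ID"] for entity in info.get("AFFECTED_ENTITIES", [])]


print(affected_entity_ids(with_info))  # [7903]
```

A replicator would typically queue these IDs and re-query each entity to fetch its current resolved state.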
Many of the example tasks demonstrate concurrent execution with threads. Because the entity resolution process involves I/O operations, the use of concurrent processes and threads when calling the Senzing APIs provides scalability and performance.
Many of the examples demonstrate using multiple threads to utilize the resources available on the machine. To increase the load rate further when loading data into Senzing, loading (and other tasks) can be horizontally scaled by utilizing additional machines.
If a single very large load file and 3 machines were available for performing a data load, the file could be split into 3 parts, with each machine running the sample code or your own application. Horizontal scaling such as this does require the Senzing database to have the capacity to accept the additional workload and not become the bottleneck.
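A round-robin split of a line-oriented load file can be sketched in a few lines; the `split_file` helper and its output naming are illustrative, not part of this repository.

```python
from pathlib import Path


def split_file(source: str, parts: int) -> list[Path]:
    """Split a line-oriented load file into `parts` files round-robin,
    so each machine gets a similar number of records."""
    src = Path(source)
    outputs = [src.with_suffix(f".part{i}{src.suffix}") for i in range(parts)]
    handles = [out.open("w") for out in outputs]
    with src.open() as infile:
        for line_number, line in enumerate(infile):
            handles[line_number % parts].write(line)
    for handle in handles:
        handle.close()
    return outputs
```

Any even split works; round-robin simply avoids a second pass to count lines first.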
When providing your own input file(s) to the snippets or your own applications and processing data manipulation tasks (adding, deleting, replacing), it is important to randomize the file(s) or other input methods when running multiple threads. If source records that pertain to the same entity are clustered together, multiple processes or threads could all be trying to work on the same entity concurrently. This causes contention and overhead, resulting in slower performance. To prevent this contention, always randomize input data.
You may be able to randomize your input files during ETL, when mapping the source data to the Senzing Entity Specification. Otherwise, utilities such as shuf (or terashuf for large files) can be used.
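For files that comfortably fit in memory, the shuffle can also be done in a few lines of Python; the `shuffle_file` helper is illustrative, and shuf or terashuf remain the better choice for very large files.

```python
import random


def shuffle_file(source: str, destination: str, seed=None) -> None:
    """Randomize the line order of a load file that fits in memory.
    A fixed seed makes the shuffle reproducible across runs."""
    with open(source) as infile:
        lines = infile.readlines()
    random.Random(seed).shuffle(lines)
    with open(destination, "w") as outfile:
        outfile.writelines(lines)
```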
When trying out different examples, you may notice consecutive runs complete much faster than the initial run. For example, running a loading task for the first time, without the data in the system, will be representative of the load rate. If the same example is subsequently run again without purging the system, it will complete much faster. This is because Senzing knows the records already exist in the system and skips them.
To run the same example again and see representative performance, first purge the Senzing repository of the loaded data. Some examples don't require purging between runs; for example, the deleting examples require data to be ingested first. See the usage notes for each task category for an overview of how to use the snippets.
There are different sized load files within the data path that can be used to increase the volume of data loaded, depending on the specification of your hardware. Note: Senzing V4 comes with a default license that allows up to 500 source records to be loaded; without a larger license, you will not be able to load these larger files.