The purpose of this project is to demonstrate the efficacy of graph embeddings for detecting similar nodes in an RDF graph. The code is made available so that our experiments can be easily replicated.
The datasets that we use are available at:
The code is divided into folders by purpose.
- Data - contains code for generating the intermediate datasets from the original data in a form that can be consumed by the rest of the code.
- Baseline - contains the code for running the baseline.
- Graph Embeddings - contains the code for creating graph embeddings.
- Classification - contains the code for creating the classifier model and getting classification results for node pairs.
Creating datasets.
- Create a smaller RDF dataset from the large dblp_RDF_GRAPH such that only the neighborhood information of the ground truth nodes is available. Run data\crawler.ipynb
  Input: ground truth and dblp RDF graph files.
  Output: dblp_dataset_*.nt file containing the smaller RDF dataset, and dblp_*.json file containing a dictionary of the ground truth data that was used in creating the smaller dataset.
- Create a list of node pairs along with labelling information. Run data\create_edgelist.py
  Input: dblp_*.json file from the previous step.
  Output: dblp_*.edges, dblp_*.authors and dblp_*.papers files. dblp_*.edges contains the labelled data along with node ids. The other files map node ids to nodes in the RDF graph.
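The idea behind the first step (keeping only the neighborhood of the ground truth nodes) can be illustrated with a minimal sketch. This is not the code in data\crawler.ipynb: it assumes a one-hop neighborhood, triples already parsed into (subject, predicate, object) tuples, and made-up node identifiers. The function name `neighborhood_triples` is hypothetical.

```python
from typing import Iterable, List, Set, Tuple

# A triple is (subject, predicate, object); plain strings are used for simplicity.
Triple = Tuple[str, str, str]


def neighborhood_triples(triples: Iterable[Triple], seeds: Set[str]) -> List[Triple]:
    """Keep only triples whose subject or object is a seed node.

    Assumption: a one-hop neighborhood is enough. The actual crawler
    may use a different depth or additional filters.
    """
    return [t for t in triples if t[0] in seeds or t[2] in seeds]


# Toy data; every identifier below is hypothetical.
triples = [
    ("author:1", "wrote", "paper:A"),
    ("author:2", "wrote", "paper:B"),
    ("paper:B", "cites", "paper:C"),
    ("author:3", "wrote", "paper:C"),
]
seeds = {"author:1", "paper:B"}  # stands in for the ground truth nodes

subset = neighborhood_triples(triples, seeds)
for s, p, o in subset:
    print(s, p, o)
```

In this toy example the triple about author:3 is dropped because neither of its endpoints is a ground truth node.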
Run baseline\run_baseline.ipynb
Input: the dblp_dataset_*.nt file and the dblp_*.edges file containing the labelling information.
Output: Accuracy information of the baseline.
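As a rough illustration of what "accuracy" means for labelled node pairs, the sketch below scores a predictor against a list of pairs. It is not the baseline in baseline\run_baseline.ipynb. The line layout `<id1> <id2> <label>` is an assumption about the dblp_*.edges file, and `parse_edges`, `accuracy` and the always-negative predictor are hypothetical names used only for this example.

```python
from typing import Callable, Iterable, List, Tuple

# (node id 1, node id 2, label); label 1 means "similar", 0 means "not similar".
LabelledPair = Tuple[str, str, int]


def parse_edges(lines: Iterable[str]) -> List[LabelledPair]:
    """Parse lines of the ASSUMED form '<id1> <id2> <label>'."""
    pairs = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        pairs.append((parts[0], parts[1], int(parts[2])))
    return pairs


def accuracy(pairs: List[LabelledPair], predict: Callable[[str, str], int]) -> float:
    """Fraction of pairs for which the predicted label equals the true label."""
    if not pairs:
        return 0.0
    correct = sum(1 for u, v, label in pairs if predict(u, v) == label)
    return correct / len(pairs)


# Toy input standing in for the contents of an edges file.
sample_lines = ["0 1 1", "0 2 0", "1 2 0", "3 4 1"]
pairs = parse_edges(sample_lines)


def always_negative(u: str, v: str) -> int:
    """Trivial predictor that labels every pair as not similar."""
    return 0


acc = accuracy(pairs, always_negative)
print(f"accuracy: {acc:.2f}")
```

A trivial predictor like this gives a floor against which any real baseline or classifier can be compared.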
Refer to the instructions at graph embedding generation
Refer to the instructions at classification
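To make the link between embeddings and node-pair classification concrete, here is a small sketch of how two node embeddings could be combined into one feature vector for a pair. This is an assumption for illustration, not the method used in the Classification folder: the element-wise product and cosine similarity are common choices, and the embedding values and the names `cosine_similarity` and `pair_features` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pair_features(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Element-wise product of the two embeddings, followed by their cosine similarity."""
    return [x * y for x, y in zip(a, b)] + [cosine_similarity(a, b)]


# Made-up two-dimensional embeddings keyed by node id.
embeddings = {
    "n1": [1.0, 0.0],
    "n2": [1.0, 0.0],
    "n3": [0.0, 1.0],
}

sim_same = cosine_similarity(embeddings["n1"], embeddings["n2"])
sim_diff = cosine_similarity(embeddings["n1"], embeddings["n3"])
features = pair_features(embeddings["n1"], embeddings["n3"])
print(sim_same, sim_diff, features)
```

A feature vector of this kind, together with the label from the edges file, is what a pair classifier would typically be trained on.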