DBpedia logs from http://usewod.org
- Emmanuel Desmontils (Emmanuel.Desmontils_at_univ-nantes.fr)
- Patricia Serrano-Alvarado (Patricia.Serrano-Alvarado_at_univ-nantes.fr)
This is a guide to analysing one day of the DBpedia 2015 log. Consider the log of October 31st, located in './data/logs20151031/access.log-20151031.log'.
The first step is to extract the BGPs from each line that corresponds to an HTTP request containing a SPARQL query:
python3.6 bgp-extractor.py -p 64 -d ./data/logs20151031/logs-20151031-extract -f ./data/logs20151031/access.log-20151031.log
The result is a set of directories (one per hour), each containing one file per user, named 'userIp-be4dbp.xml'.
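As a quick sanity check, this layout can be inspected with a few lines of Python. A minimal sketch, assuming only the hierarchy described above (the base path is the one given to '-d'):

    # Minimal sketch: count the extracted user files per hour.
    # Assumes the layout produced by bgp-extractor.py: one directory
    # per hour, one 'userIp-be4dbp.xml' file per user.
    from pathlib import Path

    base = Path('./data/logs20151031/logs-20151031-extract')
    for hour_dir in sorted(p for p in base.iterdir() if p.is_dir()):
        user_files = list(hour_dir.glob('*-be4dbp.xml'))
        print('%s: %d user file(s)' % (hour_dir.name, len(user_files)))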
Then, filter the BGPs that can be executed on the data provider (e.g. a TPF server with a timeout of 20 seconds):
python3.6 bgp-test-endpoint.py -e TPF ./data/logs20151031/logs-20151031-extract/*/*-be4dbp.xml -to 20
The result is, for each user file, a file named 'userIp-be4dbp-tested-TPF.xml', conforming to 'http://documents.ls2n.fr/be4dbp/log.dtd' (which uses 'http://documents.ls2n.fr/be4dbp/bgp.dtd'), where each 'entry' (a BGP) is evaluated against the data provider.
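Conformance to the DTD can be checked with lxml (already a dependency, see the library list at the end of this document). A minimal sketch, assuming local copies of 'log.dtd' and of the 'bgp.dtd' it includes:

    # Minimal sketch: validate a tested user file against log.dtd.
    # Assumes 'log.dtd' (and 'bgp.dtd') have been downloaded next to
    # the XML file; adjust the paths to your setup.
    from lxml import etree

    dtd = etree.DTD('log.dtd')
    doc = etree.parse('userIp-be4dbp-tested-TPF.xml')
    if dtd.validate(doc):
        print('conforms to log.dtd')
    else:
        print(dtd.error_log.filter_from_errors())

The same check applies, with 'ranking.dtd', to the ranking files produced by the next step.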
Next, rank the BGPs to identify the most frequent ones:
python3.6 bgp-ranking-analysis.py ./data/logs20151031/logs-20151031-extract/*/*-tested-TPF.xml
The result is, for each user file, a file named 'userIp-be4dbp-tested-TPF-ranking.xml', valid against 'http://documents.ls2n.fr/be4dbp/ranking.dtd'.
Next, these XML files are given as input to LIFT. We suppose that the LIFT results (for the extracted queries) are in the directory './data/divers/liftDeductions/traces/' (see 'https://github.com/coumbaya/lift' for how to run LIFT). This directory contains a set of directories (one per hour), each containing one file per user (the same hierarchy as for the DBpedia log extraction). As for the BGPs extracted from the DBpedia log, rank the BGPs found by LIFT:
python3.6 bgp-ranking-analysis.py ./data/divers/liftDeductions/traces/*/traces_*-be4dbp-tested-TPF-ranking/*-ldqp.xml -t All
Then, compute precision and recall to produce a set of CSV files:
sh bigCompare.sh
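'bigCompare.sh' confronts the BGPs deduced by LIFT with those extracted from the log. For reference, a minimal sketch of the standard precision/recall computation; the set representation and the names are illustrative, not taken from the script:

    # Illustrative only: standard precision and recall over two sets of
    # BGPs, taking the BGPs extracted from the log as ground truth and
    # the BGPs deduced by LIFT as the result to evaluate.
    def precision_recall(extracted, deduced):
        tp = len(extracted & deduced)  # correctly deduced BGPs
        precision = tp / len(deduced) if deduced else 0.0
        recall = tp / len(extracted) if extracted else 0.0
        return precision, recall

    p, r = precision_recall({'bgp1', 'bgp2', 'bgp3'}, {'bgp2', 'bgp3', 'bgp4'})
    print('precision=%.2f recall=%.2f' % (p, r))  # precision=0.67 recall=0.67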
Finally, to be able to compute aggregates (avg, max, etc.), load the CSV files into a MySQL database (you have to modify 'loadPrecisionRecall_MySQL.sh' to set the name of your database, your user and your password).
sh loadPrecisionRecall_MySQL.sh
Once the CSV files are loaded into the MySQL database, you can execute the script 'queries.sql'.
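For example, averages and maxima can then be computed with standard SQL aggregates. A hypothetical sketch using pymysql (pymysql is not a dependency of this toolchain, and the table and column names are illustrative; the real schema is the one created by 'loadPrecisionRecall_MySQL.sh' and queried by 'queries.sql'):

    # Hypothetical sketch: query aggregates over the loaded CSV data.
    # The database 'be4dbp', the table 'precision_recall' and its columns
    # are illustrative names; 'precision' is a reserved word in MySQL,
    # hence the backticks.
    import pymysql

    conn = pymysql.connect(host='localhost', user='user',
                           password='password', database='be4dbp')
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT AVG(`precision`), MAX(`precision`), "
                        "AVG(`recall`), MAX(`recall`) FROM precision_recall")
            print(cur.fetchone())
    finally:
        conn.close()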
usage: bgp-extractor.py [-h] [-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
[-t REFDATE] [-d BASEDIR] [-r] [--tpfc]
[-e {SPARQLEP,TPF,None}] [-ep EP] [-to TIMEOUT]
[-p NB_PROCESSES]
file
BGP Extractor for DBPedia log.
positional arguments:
file Set the file to study
optional arguments:
-h, --help show this help message and exit
-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --log {DEBUG,INFO,WARNING,ERROR,CRITICAL}
Set the logging level (INFO by default)
-t REFDATE, --datetime REFDATE
Set the date-time to study in the log
-d BASEDIR, --dir BASEDIR
Set the directory for results ('./logs' by default)
-p NB_PROCESSES, --proc NB_PROCESSES
Number of processes used to extract (4 by default)
over 8 usable processes
usage: bgp-test-endpoint.py [-h] [-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
[-p NB_PROCESSES] [-e {SPARQL,TPF}] [-ep EP]
[-to TIMEOUT]
file [file ...]
Request test with SPARQL endpoint or TPF server
positional arguments:
file files to analyse
optional arguments:
-h, --help show this help message and exit
-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --log {DEBUG,INFO,WARNING,ERROR,CRITICAL}
Set the logging level
-p NB_PROCESSES, --proc NB_PROCESSES
Number of processes used (8 by default)
-e {SPARQL,TPF}, --empty {SPARQL,TPF}
Request a SPARQL or a TPF endpoint to verify the query
and test that it returns at least one triple (TPF by
default)
-ep EP, --endpoint EP
The endpoint requested for the '-e' ('--empty') option
(for example 'http://localhost:5001/dbpedia_3_9', the
default for TPF)
-to TIMEOUT, --timeout TIMEOUT
Endpoint timeout in seconds (60 by default). With
'-to 0', a file that has already been tested is not
tested again.
usage: bgp-ranking-analysis.py [-h] [-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
[-p NB_PROCESSES]
[-t {NotEmpty,Valid,WellFormed,All}]
file [file ...]
Ranking analysis of BGPs
positional arguments:
file files to analyse
optional arguments:
-h, --help show this help message and exit
-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --log {DEBUG,INFO,WARNING,ERROR,CRITICAL}
Set the logging level
-p NB_PROCESSES, --proc NB_PROCESSES
Number of processes used (8 by default)
-t {NotEmpty,Valid,WellFormed,All}, --type {NotEmpty,Valid,WellFormed,All}
How to take into account the validation by a SPARQL or
a TPF endpoint (NotEmpty by default)
The '-t' argument describes the entries the process has to take into account:
- 'All': all entries,
- 'WellFormed': only syntactically correct SPARQL queries,
- 'Valid': only queries that are accepted by the endpoint (e.g. the TPF client doesn't accept all SPARQL queries),
- 'NotEmpty': only queries having at least one answer from the endpoint.
Required Python libraries:
- RDFLib: https://github.com/RDFLib/rdflib (doc: https://rdflib.readthedocs.io/)
- SPARQLWrapper: https://github.com/RDFLib/sparqlwrapper (doc: https://rdflib.github.io/sparqlwrapper/)
- lxml: http://lxml.de/