Images for tooltips from wikidata and wikipedia #1

Open
graue70 opened this issue Feb 7, 2020 · 4 comments

graue70 commented Feb 7, 2020

The image in the wikipedia infobox is not always from wikidata. See https://www.wikidata.org/wiki/Q16742294 and https://www.wikidata.org/wiki/Q16742291, which might be helpful in determining differences.

As explained here, the wikipedia image can be queried in the following way: https://en.wikipedia.org/w/api.php?action=query&prop=pageimages&titles=Jaguar&pithumbsize=500&format=json&formatversion=2.

By default, the API returns only images with a free license. The image for The Lord of the Rings is not free, so it is not returned. However, any image (including non-free ones) can be returned by adding the argument pilicense=any, as in https://en.wikipedia.org/w/api.php?action=query&prop=pageimages&titles=The_Lord_of_the_Rings:_The_Fellowship_of_the_Ring&pithumbsize=500&format=json&formatversion=2&pilicense=any.
I don't know what the licensing means for Aqqu tooltips, but there is more info on that here.

It is possible to query multiple images with one query: https://en.wikipedia.org/w/api.php?action=query&prop=pageimages&titles=The_Lord_of_the_Rings:_The_Fellowship_of_the_Ring|Sun|Jaguar&pithumbsize=500&format=json&formatversion=2&pilicense=any.
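
The requests above could be assembled and parsed roughly as follows. This is only a sketch: the helper names `build_pageimages_url` and `extract_thumbnails` are made up for illustration, and the parser assumes the `formatversion=2` response layout, in which `query.pages` is a list and pages without an image simply lack the `thumbnail` key.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"


def build_pageimages_url(titles, thumb_size=500, include_non_free=False):
    """Build a pageimages query URL for one or more page titles (hypothetical helper)."""
    params = {
        "action": "query",
        "prop": "pageimages",
        "titles": "|".join(titles),  # several titles are separated by "|"
        "pithumbsize": thumb_size,
        "format": "json",
        "formatversion": 2,
    }
    if include_non_free:
        # Without this argument only freely licensed images are returned.
        params["pilicense"] = "any"
    return API_URL + "?" + urlencode(params)


def extract_thumbnails(response):
    """Map page title -> thumbnail URL, or None if no image was returned."""
    pages = response.get("query", {}).get("pages", [])
    return {page["title"]: page.get("thumbnail", {}).get("source") for page in pages}
```

Note that the titles in the response are normalized (for example, underscores become spaces), so matching results back to the requested titles may require the `normalized` list that the API includes in the response.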

Maybe one option would be to use the following query:

PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX schema: <http://schema.org/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?x ?m ?image ?sitelinks WHERE {
  ?m schema:about ?x .
  ?m @en@schema:abstract ?abstract .
  OPTIONAL { ?x wdt:P18 ?image . }
  ?m schema:isPartOf <https://en.wikipedia.org/> .
  ?article schema:about ?x .
  ?article wikibase:sitelinks ?sitelinks .
  FILTER (?sitelinks >= "15"^^<http://www.w3.org/2001/XMLSchema#int>)
} ORDER BY DESC(?sitelinks)

and then loop over the results without an image and use the wikipedia image only for those.

On the other hand, maybe one should prefer the Wikipedia image over the Wikidata image. For Mexico, for example, wdt:P18 yields a bunch of images, but an image of the flag (P41) would probably be more useful. Wikipedia uses the flag in this case.
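
Both orders of preference discussed above could be handled by one small merge step. The following is a sketch under assumptions: the function names are hypothetical, and both inputs are assumed to be dictionaries mapping a QID to an image URL (or None), which is not something the existing mapping file is known to provide in this form.

```python
def choose_image(wikidata_image, wikipedia_image, prefer="wikidata"):
    """Return the preferred image, falling back to the other source if it is missing."""
    if prefer not in ("wikidata", "wikipedia"):
        raise ValueError("prefer must be 'wikidata' or 'wikipedia'")
    if prefer == "wikidata":
        return wikidata_image or wikipedia_image
    return wikipedia_image or wikidata_image


def merge_images(wikidata_images, wikipedia_images, prefer="wikidata"):
    """Merge two {qid: image_url_or_None} dicts into one {qid: image_url_or_None} dict."""
    qids = wikidata_images.keys() | wikipedia_images.keys()
    return {
        qid: choose_image(wikidata_images.get(qid), wikipedia_images.get(qid), prefer)
        for qid in qids
    }
```

With `prefer="wikidata"` the Wikipedia image is used only for entities without a Wikidata image; with `prefer="wikipedia"` it is the other way around.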

In either case, the script or command to produce the file qid_to_wikipedia.tsv should be included in the repo for better reproducibility, especially regarding entities with more than one image.

graue70 commented Feb 7, 2020

Ignoring Wikipedia completely for the moment, these are four possible ways to express the image in the SPARQL query from above:

?x wdt:P18 ?image .
?x wdt:P18|wdt:P109|wdt:P14|wdt:P1442|wdt:P154|wdt:P1543|wdt:P158|wdt:P1766|wdt:P1801|wdt:P2096|wdt:P2713|wdt:P2716|wdt:P2910|wdt:P3311|wdt:P3383|wdt:P3451|wdt:P367|wdt:P41|wdt:P4291|wdt:P4640|wdt:P5252|wdt:P5775|wdt:P7407|wdt:P7415|wdt:P94|wdt:P996 ?image .
OPTIONAL { ?x wdt:P18 ?image . }
OPTIONAL { ?x wdt:P18|wdt:P109|wdt:P14|wdt:P1442|wdt:P154|wdt:P1543|wdt:P158|wdt:P1766|wdt:P1801|wdt:P2096|wdt:P2713|wdt:P2716|wdt:P2910|wdt:P3311|wdt:P3383|wdt:P3451|wdt:P367|wdt:P41|wdt:P4291|wdt:P4640|wdt:P5252|wdt:P5775|wdt:P7407|wdt:P7415|wdt:P94|wdt:P996 ?image . }

One still needs to deal with duplicates, because one entity can have multiple images. Some kind of preference order among the properties would be good. That would be possible with the BIND(IF(BOUND())) construct from https://www.wikidata.org/wiki/Wikidata:SPARQL_tutorial#BIND,_BOUND,_IF, but that's not supported by QLever at the moment.
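
As long as the preference cannot be expressed in the query itself, the duplicates could be resolved in a post-processing step. The sketch below is only an illustration: the priority order is an arbitrary assumption, the function name is made up, and the rows are assumed to carry the property that produced each image. The alternation path from above does not report which property matched, so this would need, for example, one query per property.

```python
# Assumed priority, for illustration only:
# image (P18), flag image (P41), logo image (P154), coat of arms image (P94).
PROPERTY_PRIORITY = ["P18", "P41", "P154", "P94"]


def pick_preferred_images(rows, priority=PROPERTY_PRIORITY):
    """Keep one image per entity from (qid, property_id, image_url) rows.

    Properties earlier in `priority` win; unlisted properties rank last.
    Among images with the same rank, the first one seen is kept.
    """
    rank = {prop: i for i, prop in enumerate(priority)}
    best = {}
    for qid, prop, image in rows:
        r = rank.get(prop, len(priority))
        if qid not in best or r < best[qid][0]:
            best[qid] = (r, image)
    return {qid: image for qid, (_, image) in best.items()}
```

Passing a different `priority`, such as `["P41", "P18"]`, would prefer the flag over the generic image for entities that have both.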

PS: The list of predicates was generated with this query:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?image ?label WHERE {
  wd:P18 wdt:P1659 ?image .
  ?image rdfs:label ?label .
  FILTER langMatches(lang(?label), "en") .
}
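
Turning the result of this query into the alternation path used above could look like the following sketch. The function name is hypothetical, and the input is assumed to be the list of ?image values as full entity URIs (for example http://www.wikidata.org/entity/P41). P18 is put first and the remaining properties are sorted as strings, which matches the order of the list above.

```python
def build_property_path(property_uris, first="P18"):
    """Build a 'wdt:P18|wdt:P109|...' path from Wikidata property URIs."""
    # Keep only the local ID (e.g. "P41") and drop duplicates.
    ids = {uri.rsplit("/", 1)[-1] for uri in property_uris}
    ids.discard(first)
    ordered = [first] + sorted(ids)  # plain string sort: "P109" comes before "P14"
    return "|".join("wdt:" + prop_id for prop_id in ordered)
```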

flackbash added a commit that referenced this issue Mar 3, 2020
Before, only information of entities which had a link to an image on
Wikidata via the P18 property was included in the wiki info mapping.
Now, information about all entities that can be mapped to one of (title,
abstract, image) is included.
Images are now either retrieved via the Wikipedia API or via a Wikidata
image property (P18, P109, P15, ...).
This commit adds the necessary scripts to create the mapping from
scratch and adds documentation about the process.
Relates to #1.

flackbash commented

Thanks for the detailed analysis!
The current solution is to query the Wikipedia API for images and to prefer these images over the ones retrieved using a SPARQL query with the properties wdt:P18|wdt:P109|wdt:P14|... as listed in your comment.
What I have not implemented is a preference order among the images retrieved using the SPARQL query. However, if I grepped correctly, only 21,871 of the 404,840 images in the current qid_to_wikipedi_info.tsv file stem from Wikidata anyway. All other images were retrieved using the Wikipedia API, so this should not be a big problem.

graue70 commented Mar 3, 2020

How did you deal with the license question for Wikipedia images?

flackbash commented

Not at all. I skillfully overlooked that part.

So right now, all images are included in the mapping, i.e. the pilicense=any parameter is set. Without this parameter, the final mapping contains 384,879 images instead of 404,840. This is probably good enough if it saves us the hassle.

From what I understood, Wikipedia can use this non-free content under the fair use policy, which exists in the US but not in the EU (which is probably why the English Wikipedia contains theatrical release posters for films and the German Wikipedia does not). Too bad...
