10 useful things about Wikidata & SPARQL that I wish I knew earlier
Wikidata is the nerdy cousin of Wikipedia. It’s a machine-readable, user-editable graph database launched in 2012. SPARQL is the query language used to query Wikidata (and other RDF databases).
The knowledge that you can extract from Wikidata with SPARQL is quite fascinating. Here are a few examples:
- 🐱 Mayors that are any kind of domesticated animal (query)
- ⚰️ Inventors killed by their own invention (query)
- 🦸 Humans whose gender we know we don’t know (query)
- 🥐 Birthplaces of humans named Antoine (query)
From these examples (here’s the full list), it should become obvious that Wikidata is much more powerful than Wikipedia — at least from a knowledge extraction standpoint — because you can cross-reference, compare or accumulate specific properties, instead of having to deal with one wall of text.
You can use this knowledge to create all kinds of serious applications, but also experimental artistic and fun stuff! I do both :-)
- Serious: OpenParliamentTV is a project I contribute to as a data scientist/programmer. It is an interactive database of parliamentary speeches. We use Wikidata to provide additional data about the deputies and other entities.
- Fun: “Speculative Datasets” is a GitHub repository and ongoing art project in which I create and collect custom image datasets for machine learning by leveraging Wikidata queries.
Now let’s get to the 10 useful tips:
1 — The Wikidata Query Service website is the place you want to hang out
This is a GUI including a text editor where you can write your queries and send them to the Wikidata backend.
Very handy: After you’ve hit the “send” button you can click on “Code” to see your query wrapped in ready-made code for several languages (Python, JavaScript, PHP and more)! Very useful if you are making your requests from within a web application.
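If you’d rather see what such a request roughly looks like before clicking around, here is a minimal sketch using only the Python standard library. It is not the code the “Code” button generates. The endpoint URL is the public one of the Wikidata Query Service; the `USER_AGENT` value and the function names are made up for this example.

```python
import json
import urllib.parse
import urllib.request

# Public SPARQL endpoint of the Wikidata Query Service.
ENDPOINT = "https://query.wikidata.org/sparql"

# Hypothetical User-Agent; replace it with one that identifies your own project.
USER_AGENT = "wikidata-tips-example/0.1 (https://example.org/contact)"


def build_request(query: str) -> urllib.request.Request:
    """Build a GET request that asks the endpoint for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    headers = {
        "Accept": "application/sparql-results+json",
        "User-Agent": USER_AGENT,
    }
    return urllib.request.Request(f"{ENDPOINT}?{params}", headers=headers)


def run_query(query: str, timeout: float = 60.0) -> list:
    """Send the query and return the list of result rows (bindings)."""
    with urllib.request.urlopen(build_request(query), timeout=timeout) as response:
        payload = json.load(response)
    return payload["results"]["bindings"]
```

Calling `run_query("SELECT ?item WHERE { ?item wdt:P31 wd:Q146. } LIMIT 5")` should then give you a list of dictionaries, one per result row, as long as the service is reachable.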
2 — Start with an example query and adapt
Check out the example queries (via the button “Examples” at the top of the Wikidata Query Service, or here’s the full list), find one whose structure is close to your use case, and then work from there by changing the properties and values to fit!
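To give an idea of how little has to change, here is a small sketch built around the classic “all cats” example. The helper `instances_of_query` is a name I made up for illustration; Q146 (house cat) and Q144 (dog) are the item identifiers as I know them, so double-check them on Wikidata before relying on them.

```python
# Standard label service block so that ?itemLabel gets filled in (English here).
LABEL_SERVICE = 'SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }'


def instances_of_query(class_qid: str, limit: int = 100) -> str:
    """Return a query listing items that are an instance of (P31) the given class."""
    return f"""
SELECT ?item ?itemLabel WHERE {{
  ?item wdt:P31 wd:{class_qid}.
  {LABEL_SERVICE}
}}
LIMIT {limit}
""".strip()


# The example query, and the same structure adapted to another class.
cats_query = instances_of_query("Q146")
dogs_query = instances_of_query("Q144")
print(dogs_query)
```

The structure stays the same and only one identifier changes, which is usually all it takes to adapt an example.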
3 — Do the official Wikidata SPARQL tutorial
https://www.wikidata.org/wiki/Wikidata:SPARQL_tutorial
4 — In case of utter despair: There are actual human beings (called “query helpers”) that are willing to help you with your query!
But keep in mind: Your entry will disappear after a while. It’s worth creating an account so you get a notification when somebody reacts to your question!
5 — Help, the Wikidata Query Service returns a red error message saying a comma is wrong?!
This is a very unhelpful error message. Most probably there’s nothing wrong with your query or with any comma: your query returned too many results and ran into a time-out! Try filtering more strictly to reduce the number of results.
6 — SPARQL pitfall: Missing properties = missing entries
If you query a certain statement (e.g. “has academic title”) and an item has no entry for this statement, the whole item is dropped from the results. Solution: Put the statement in an OPTIONAL clause. If the completeness of results matters to you, it’s safest to put *every* property that isn’t essential to your selection in its own OPTIONAL clause.
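Here is a minimal sketch of the difference, written as a small Python helper that produces both variants of a query. I use P512 (“academic degree”) as a stand-in property and politicians (Q82955) as the group of interest; both are assumptions for the sake of the example, and `degree_query` is a made-up name.

```python
LABEL_SERVICE = 'SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }'


def degree_query(keep_people_without_degree: bool) -> str:
    """Return a query for politicians and their academic degree (P512)."""
    degree_pattern = "?person wdt:P512 ?degree."
    if keep_people_without_degree:
        # OPTIONAL keeps the person even when no degree statement exists;
        # ?degree is then simply left unbound in that row.
        degree_pattern = f"OPTIONAL {{ {degree_pattern} }}"
    return f"""
SELECT ?person ?personLabel ?degreeLabel WHERE {{
  ?person wdt:P31 wd:Q5;         # instance of: human
          wdt:P106 wd:Q82955.    # occupation: politician
  {degree_pattern}
  {LABEL_SERVICE}
}}
LIMIT 100
""".strip()


print(degree_query(keep_people_without_degree=True))
```

With `keep_people_without_degree=False` the degree is a required pattern, so every politician without that statement silently disappears from the results.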
7 — SPARQL pitfall: Ddduuuuplicaaaations
This one is especially disastrous if you somehow rely on the number of results to count things. You probably shouldn’t.
If a property has multiple values (e.g. someone can have multiple first names) then you get an entry for every one of these values!
And it gets worse: If there are multiple values for more than 1 property (e.g. multiple values for first name AND multiple values for children), then you get an entry for every possible combination of these values!
This is just the way SPARQL works: it’s a feature, not a bug.
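The multiplication is easy to reproduce without touching Wikidata at all. Here is a toy simulation in plain Python; the person, names and children are invented, and `itertools.product` stands in for the way SPARQL combines matching values.

```python
from itertools import product

# Hypothetical toy data: one person with two given names and three children.
given_names = ["Anna", "Maria"]
children = ["child A", "child B", "child C"]

# SPARQL returns one row per combination of matching values,
# which is what itertools.product produces here.
rows = [
    {"person": "Q_EXAMPLE", "givenName": name, "child": child}
    for name, child in product(given_names, children)
]

print(len(rows))  # 6 rows for a single person

# Counting distinct items instead of rows gives the real number of people.
distinct_people = {row["person"] for row in rows}
print(len(distinct_people))  # 1
```

In SPARQL itself, the usual ways to tame this are `COUNT(DISTINCT ?person)` when you are counting, or `GROUP BY ?person` together with `GROUP_CONCAT` to squeeze multiple values into one row.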
8 — Use INCLUDE to avoid time outs
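You can extract parts of your query into a named subquery and pull its results into the main query with the INCLUDE statement. That way the expensive parts only run on the results the subquery has already narrowed down.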
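A minimal sketch of what such a named subquery can look like, kept in a Python string so it can be sent from a script. The `WITH { ... } AS %name` and `INCLUDE %name` syntax is an extension of the Blazegraph backend rather than standard SPARQL, so I’m assuming here that the query runs on the Wikidata Query Service. The subquery name `%politicians` and the choice of politicians (Q82955) are just for illustration.

```python
NAMED_SUBQUERY = """
SELECT ?person ?personLabel WITH {
  # Inner query: do the selective part first and cap the number of rows.
  SELECT ?person WHERE {
    ?person wdt:P31 wd:Q5;
            wdt:P106 wd:Q82955.
  }
  LIMIT 200
} AS %politicians WHERE {
  INCLUDE %politicians.
  # The label service only has to handle the rows selected above.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
""".strip()

print(NAMED_SUBQUERY)
```

The label service is a typical candidate for the outer part: it is comparatively slow, so it helps to run it only on a limited set of rows.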
9 — How to query the description of an item
A Wikidata description isn’t just another property of an item. If you look at an item on the Wikidata website, the description is the text below the title of the item. A description disambiguates the label of an item. E.g. the description of the item with identifier Q420646 (“pho”) is: “Vietnamese noodle soup”
That’s how you query a description (in this example the description will be in Dutch/NL):
Source: https://www.wikidata.org/wiki/Help:Description#How_to_query_them_in_sparql
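A description query along these lines can be sketched as follows. This is my own minimal version and not necessarily identical to the one on the help page: it assumes descriptions are reachable via the `schema:description` predicate and filters them by language tag. `description_query` is a made-up helper name.

```python
def description_query(item_qid: str, language: str = "nl") -> str:
    """Return a query for the description of one item in one language."""
    return f"""
SELECT ?description WHERE {{
  wd:{item_qid} schema:description ?description.
  FILTER(LANG(?description) = "{language}")
}}
""".strip()


# Dutch description of Q420646 ("pho").
pho_query = description_query("Q420646")
print(pho_query)
```

Swapping `"nl"` for `"en"` should give you the English description instead.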
10 — In some cases you just want to avoid SPARQL altogether and work with grep&friends on a Wikidata dump
Parse Wikidata dump in Python: https://gist.github.com/mcobzarenco/863af69690fb44eb22a4
RDF Dump Format: https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format
Wikibase & JSON: https://doc.wikimedia.org/Wikibase/master/php/md_docs_topics_json.html
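To give a feeling for the dump route, here is a minimal sketch of reading a JSON dump line by line in Python. It assumes the layout of the JSON dumps as I understand it: one big JSON array with one entity per line, each line ending in a comma. The sample data and the helper names are invented for this example.

```python
import io
import json


def iter_entities(lines):
    """Yield one entity dict per line of a Wikidata JSON dump."""
    for line in lines:
        # Drop the newline and the trailing comma that separates array elements.
        line = line.strip().rstrip(",")
        # Skip blank lines and the brackets that open and close the array.
        if line in ("", "[", "]"):
            continue
        yield json.loads(line)


def english_label(entity):
    """Return the English label of an entity, or None if there is none."""
    return entity.get("labels", {}).get("en", {}).get("value")


# Tiny in-memory stand-in for a real dump file (invented sample data).
sample_dump = io.StringIO(
    '[\n'
    '{"id": "Q420646", "labels": {"en": {"language": "en", "value": "pho"}}},\n'
    '{"id": "Q_EXAMPLE", "labels": {}}\n'
    ']\n'
)

for entity in iter_entities(sample_dump):
    print(entity["id"], english_label(entity))
```

For a real dump you would pass in a file object opened in text mode, e.g. via `bz2.open(path, "rt")` or `gzip.open(path, "rt")`, so the file is never loaded into memory as a whole.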
More useful links
Wikidata SPARQL query FAQs and discussions: https://www.wikidata.org/wiki/Wikidata_talk:SPARQL_query_service/queries
Wikidata SPARQL query optimization tips: https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/query_optimization
2h YouTube course: https://www.youtube.com/watch?v=kJph4q0Im98
A tool that allows you to create visual queries: https://hay.toolforge.org/vizquery/. It can’t do everything, but it handles most simple queries. And it also translates your visual query into regular SPARQL.
Many thanks to Lucie Kaffee for her generous help, extensive knowledge and infectious enthusiasm for Wikidata and SPARQL!! Thanks to Hay Kranen!!