Published On: April 23rd, 2021

How to dig into the wealth of information stored on Wikidata and use it for journalism: a new R package

An article by Monika Sengul-Jones titled “The promise of Wikidata”, published on datajournalism.com a couple of months ago, highlighted how Wikidata (a sort of database associated with Wikipedia) could be used by data journalists in a number of ways. In recent years, using Wikidata as a source has come up in various brainstorming sessions with colleagues contributing to EDJNet, and we will soon be publishing new material that makes extensive use of Wikidata.

Why aren’t more data journalists using Wikidata? Even beyond issues highlighted by Monika Sengul-Jones in her piece such as (unevenly) incomplete data, we have identified two additional obstacles to wider adoption of Wikidata in this context.

Firstly, getting data out of Wikidata can be an intimidating task even for people who are familiar with coding. Besides data wrangling, one needs some familiarity with the data structure of Wikidata (this is unavoidable, but it’s not too bad) and with SPARQL database queries, a major pain for those unaccustomed to database languages (see Wikidata’s instructions). Exploration of data — a typical component of data journalism — remains complex, and iterative processes less than intuitive.

Secondly, matching Wikidata identifiers to lists of individuals or objects as found “in the wild” is error-prone, and manual checks can be extremely time-consuming.

To deal with this, we have been working on an interface that facilitates matching lists of strings to the relevant Wikidata identifiers; we will release it soon and announce it in a dedicated post.

Today, we are instead presenting a new tool, or rather, a package for the R programming language — tidywikidatar — that facilitates interacting with Wikidata for the many data journalists who use R and are familiar with its established data wrangling tools. In brief, tidywikidatar makes it easier to get data from Wikidata and explore them, without having to deal with complex database queries or nested data structures.

To see it in action, in this post we will outline a basic routine for exploring information stored on Wikidata, and find out what Wikidata knows about members of the European Parliament.

Setting up the package

First, of course, you need to install the package.

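A minimal sketch of the installation step, assuming a version of the package is available on CRAN (the development version is hosted in the EDJNet/tidywikidatar repository on GitHub and can be installed with the remotes package instead):

```r
# Install tidywikidatar only if it is not already available
# (assumption: the package can be installed from CRAN)
if (!requireNamespace("tidywikidatar", quietly = TRUE)) {
  install.packages("tidywikidatar", repos = "https://cloud.r-project.org")
}

# Alternative (assumption: development version installed from GitHub)
# remotes::install_github("EDJNet/tidywikidatar")

library("tidywikidatar")
```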

I would also suggest you enable local caching, ideally in a folder that can be accessed by different projects (as it just caches information from Wikidata, there is mostly no reason to keep it in a folder synced with the likes of Dropbox or include it in backups).

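Caching can be set up along these lines. The folder location is only an example (a hypothetical `tw_data` folder in the home directory) and `cache_folder` is a name introduced for illustration; any folder shared across projects would do.

```r
library("tidywikidatar")

# Hypothetical location: a folder in the home directory, shared across projects
cache_folder <- file.path(path.expand("~"), "R", "tw_data")

tw_enable_cache()                         # store retrieved data locally
tw_set_cache_folder(path = cache_folder)  # tell the package where the cache lives
tw_create_cache_folder(ask = FALSE)       # create the folder without prompting
```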

As you see, all tidywikidatar functions start with tw_ followed by a verb describing what the function does.

Some familiarity with Wikidata is useful to follow along (check out this introduction on Wikidata’s own website). At the most basic, you should know that every item in Wikidata has an id (it always starts with a Q, something like Q123456). Each item is described by properties (they always start with a P, something like P1234), and some of these properties have qualifiers. With this in mind, you can follow along with this post, and find out more about Wikidata on its own website and about tidywikidatar on the package’s website, which includes more examples and more detailed documentation.
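
To get a feel for these identifiers, here is a small sketch that turns an item id and a property id into human-readable labels, and searches for an item by name. It uses the two ids and the name that appear later in this post; the variable names are introduced only for illustration.

```r
library("tidywikidatar")

# Label of an item (ids starting with Q)
item_label <- tw_get_label(id = "Q27169")

# Label of a property (ids starting with P)
property_label <- tw_get_property_label(property = "P39")

# Search Wikidata by name: returns candidate items with id, label and description
search_df <- tw_search(search = "Willy Brandt")

item_label
property_label
search_df
```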

We are now ready to start.

Finding out more about MEPs with Wikidata

To find out more about members of the European Parliament, we must first know who they are.

tidywikidatar allows for basic queries such as this one, if it is given a table of property-value pairs. In our case, let’s ask for everybody in the Wikidata database who has “member of the European Parliament” (Q27169) as “position held” (P39).

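A sketch of such a query, where the table has one row per property-value pair (column `p` for the property, column `q` for the value). The names `query_df` and `all_mep_df` are introduced here for illustration, and the number of rows returned will depend on when the query is run.

```r
library("tidywikidatar")

# One row per condition: "position held" (P39) is
# "member of the European Parliament" (Q27169)
query_df <- tibble::tribble(
  ~p,    ~q,
  "P39", "Q27169"
)

# Returns a data frame with the id, label and description of matching items
all_mep_df <- tw_query(query = query_df)

all_mep_df
```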

Here we end up with a list of 4579 individuals who, according to Wikidata, have been members of the European Parliament.

Check all properties

That’s a lot of MEPs! What does Wikidata know about them?

Here are the twenty MEPs for whom Wikidata has the most properties:

Many of these may not be best known for their past as MEPs, but if they are in this list, it means they have all been members of the European Parliament at some point in their lives.

The “number of properties” column shows high figures, but many of these properties are simply identifiers in other archives.

What matters most, probably, is how complete these data are.

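A minimal sketch of both steps (properties per item, and how many items have each property), assuming a recent version of the package in which `tw_get()` returns a data frame with `id` and `property` columns. Willy Brandt (Q2514) stands in for the full list of MEP ids here; `mep_ids` is a name introduced only for illustration.

```r
library("dplyr", warn.conflicts = FALSE)
library("tidywikidatar")

# Illustrative input: replace with the full vector of MEP ids
mep_ids <- c("Q2514")

# tw_get() returns one row per item-property-value combination
items_df <- tw_get(id = mep_ids)

# Keep only actual Wikidata properties (P...), dropping labels, aliases, etc.
properties_df <- items_df %>%
  filter(grepl("^P[0-9]+$", property)) %>%
  distinct(id, property)

# Number of distinct properties available for each item
properties_per_item_df <- properties_df %>%
  count(id, name = "number_of_properties", sort = TRUE)

# Share of items for which each property is available
completeness_df <- properties_df %>%
  count(property, name = "n_items", sort = TRUE) %>%
  mutate(share = n_items / length(unique(mep_ids)))

properties_per_item_df
completeness_df
```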

We really have rather complete data for only a handful of properties, but this is already a start. We can for example quickly find out the gender balance: about 26 per cent of MEPs were women.

Let’s check out another property about which we apparently have rather complete information: their job. We would expect all MEPs to be politicians, but many likely were not only politicians. What were their other occupations (P106)?

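Both questions (gender and occupation) boil down to counting the values of one property across a set of items. Here is a sketch with a small helper, `count_values()`, which is a hypothetical function written for this example and not part of the package. It assumes a recent version of tidywikidatar, in which `tw_get_property()` returns a data frame with `id` and `value` columns, and again uses Willy Brandt (Q2514) in place of the full list of MEPs.

```r
library("dplyr", warn.conflicts = FALSE)
library("tidywikidatar")

# Illustrative input: replace with the full vector of MEP ids
mep_ids <- c("Q2514")

# Hypothetical helper: count how many items have each value for a property
count_values <- function(ids, p) {
  tw_get_property(id = ids, p = p) %>%
    distinct(id, value) %>%
    count(value, sort = TRUE) %>%
    mutate(label = tw_get_label(id = value))
}

# "sex or gender" (P21)
gender_df <- count_values(ids = mep_ids, p = "P21")

# "occupation" (P106); an item can have more than one occupation
occupation_df <- count_values(ids = mep_ids, p = "P106")

gender_df
occupation_df
```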

Some interesting hints, but surely also a testament to the fact that the data are not really complete. This is perhaps not surprising, keeping in mind that this list includes all MEPs starting from 1958, and Wikidata may not have much information about politicians who had a brief public career half a century ago. Shall we focus on those who were MEPs in the latest legislature?

Who’s been an MEP in which legislature?

This is where Wikidata starts to get a bit more complex, as this kind of information is stored as qualifiers of properties.

If we take, for example, Willy Brandt (Q2514), we can see that he held many positions (P39):

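A sketch of this lookup, assuming a recent version of the package in which `tw_get_property()` returns a data frame with a `value` column; the name `brandt_positions_df` is introduced for illustration.

```r
library("dplyr", warn.conflicts = FALSE)
library("tidywikidatar")

# All values of "position held" (P39) for Willy Brandt (Q2514),
# with a human-readable label next to each Wikidata id
brandt_positions_df <- tw_get_property(id = "Q2514", p = "P39") %>%
  mutate(position = tw_get_label(id = value))

brandt_positions_df
```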

For each of these, we have some qualifiers. Let’s look at what Wikidata knows about Brandt’s stint as an MEP (Q27169).

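In code, this could look as follows. The sketch assumes that the data frame returned by `tw_get_qualifiers()` has a `qualifier_id` column identifying the value being qualified (here, the position held); other column names may differ between package versions.

```r
library("dplyr", warn.conflicts = FALSE)
library("tidywikidatar")

# All qualifiers attached to the positions held (P39) by Willy Brandt (Q2514)
brandt_qualifiers_df <- tw_get_qualifiers(id = "Q2514", p = "P39")

# Keep only qualifiers that refer to his time as
# member of the European Parliament (Q27169)
brandt_mep_df <- brandt_qualifiers_df %>%
  filter(qualifier_id == "Q27169")

brandt_mep_df
```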

So now we know that Willy Brandt was a member of the first directly elected European Parliament (Q17315702).

If we are interested only in “parliamentary terms” (P2937) in the European Parliament (Q27169) for all the people on Wikidata we know have been members of the EP… we just have to ask.

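One way to ask is sketched below with a hypothetical helper, `get_ep_terms()`, written for this example and not part of the package. It assumes that `tw_get_qualifiers()` accepts a vector of ids and returns `qualifier_id` and `qualifier_property` columns; Willy Brandt (Q2514) stands in for the full list of MEP ids.

```r
library("dplyr", warn.conflicts = FALSE)
library("tidywikidatar")

# Hypothetical helper: parliamentary terms (P2937) recorded as qualifiers of
# the position "member of the European Parliament" (Q27169)
get_ep_terms <- function(ids) {
  tw_get_qualifiers(id = ids, p = "P39") %>%
    filter(
      qualifier_id == "Q27169",
      qualifier_property == "P2937"
    )
}

# Illustrative input: replace with the full vector of MEP ids
ep_terms_df <- get_ep_terms(ids = c("Q2514"))

ep_terms_df
```

Filtering the resulting rows for a given legislature then takes one more `filter()` call on the column holding the qualifier value.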

Let’s take only MEPs who at any point in time have been members of the Ninth European Parliament (Q64038205) (due to Brexit, many already had to abandon their seat).

Having more complete data about them, we can ask some more interesting questions. For example, where have they been “educated at” (P69)?

We have this information for 593 out of 795: not bad!

So here’s the top 20 institutions where MEPs have studied at least for some time:

Or… how many of them were born in a capital city? We just have to ask for their place of birth (P19), and then ask Wikidata about that place.

Out of 795 MEPs that have been members of the Ninth European Parliament, we know the place of birth for 787 of them.

Which is the city where most MEPs have been born? [I feel this could be one of those clickbaity posts such as “You would never guess number 1”]

Honestly, I somehow expected more concentration. If we ask Wikidata about those places of birth, it will tell us that almost half of them were born in a “big city” (Q1549591), but given that Wikidata considers as such any town with more than 100,000 residents, this is not surprising at all.

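The two-step logic (first get the place of birth, then ask Wikidata about that place) can be sketched as follows. Two things here are my assumptions rather than something stated above: that the type of a place is read from “instance of” (P31), and that capitals are identified through the “capital of” (P1376) property. Willy Brandt (Q2514) stands in for the full list of MEP ids, and all variable names are illustrative.

```r
library("dplyr", warn.conflicts = FALSE)
library("tidywikidatar")

# Illustrative input: replace with the full vector of MEP ids
mep_ids <- c("Q2514")

# Step 1: place of birth (P19) of each MEP
birth_place_df <- tw_get_property(id = mep_ids, p = "P19") %>%
  transmute(id, place_id = value, place = tw_get_label(id = value))

place_ids <- unique(birth_place_df$place_id)

# Step 2: what each place is an "instance of" (P31)
place_type_df <- tw_get_property(id = place_ids, p = "P31") %>%
  transmute(place_id = id, type_id = value, type = tw_get_label(id = value))

# Places of birth that Wikidata classifies as "big city" (Q1549591)
big_city_df <- place_type_df %>%
  filter(type_id == "Q1549591")

# Assumption: a place counts as a capital if it has a
# "capital of" (P1376) statement
capital_df <- tw_get_property(id = place_ids, p = "P1376")

birth_place_df
big_city_df
capital_df
```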

148 were born in a capital city. Is it a lot? Not so much?

And now that I think about it, which regions are over-represented at the European Parliament if we take the place of birth of MEPs as a starting point?

Mmmm… do you start to feel the power of Wikidata?

The fun thing with Wikidata is that everything is connected, every piece of information added to it is available to everybody, and you can get this kind of information about lots of different subjects, objects, events, abstract concepts… all of this released with a CC0/public domain/“it’s all yours to enjoy” license.

Check out tidywikidatar’s website for more details about its functions and more examples.

If you feel like exploring further the data presented in this post, here’s a csv file with the Wikidata ids of all MEPs of the 9th legislature, generated with this script.