Search4EU: Building a journalist’s ideal librarian
This piece details the work of the Datactivist team with the European Data Journalism Network (EDJNet) to master and use Aleph to build a search engine for public documents, regardless of their format or language.
The repository containing the code for the Search4EU project is available under a Creative Commons license. We are also assembling a user guide as a wiki for the repository, for those who would like to use Aleph for their own crawling and indexation projects.
EDJNet was the perfect place to explore this tool. We also thank and praise the wonderful team at OCCRP who built this game-changing tool and who welcomed, listened to and helped us all along: your work is gold.
Introduction
Sylvain discovered the Aleph stack at the 2018 Open Government Partnership summit in Tbilisi (Georgia). After a fairly classic presentation on open-source investigation, he sat down at the table of a digital nomad/developer then working for OCCRP in Sarajevo, who flipped her laptop open on the black interface he would later become familiar with: a search engine querying several databases covering pretty much every country in the world.
Meanwhile, Mathieu was working on the best ways to explore administrative acts. Because of legal obligations (for example, the need to use the signed version of a city council’s decision to make it valid), most of those acts were published as image-based PDF files, effectively preventing them from being indexed and searched. This trove of documents, with the data and information it contained, was technically locked.
Both of us are cooperators at Datactivist, an open data consulting company based in Provence, France, and we were discussing our research at the Ecomotive café, down the great white stairs of the Saint-Charles station in Marseille. Only after a couple of hours did we realise that one of us might hold the key to the safe the other was trying to open.
Using Aleph to index a batch of files
Our first attempt consisted of using Aleph as a local search engine: feeding it batches of PDF files and letting it sort them out, read their content and connect the dots to support our research through them… a sort of home-made librarian.
This is actually Aleph’s original purpose. OCCRP’s own Aleph searches more than 300 public datasets and leaks, from the German companies registry to the blog of the assassinated Maltese reporter Daphne Caruana Galizia or the WikiLeaks publications. It is, by itself, an open-air gold mine.
We already had a batch of files to dig into, so we decided to take the easy path: throw them at Aleph and see how it chewed them.
Our approach
We decided to start slowly by using Aleph locally first. That meant installing Aleph, updating it and launching it from our command line to get a local URL that could only be accessed from our own computer.
We had already scraped documents from our local governments: registers of administrative acts of the Marseille city council, and reports of local government meetings from several cities in Seine-Saint-Denis.
The operation is quite simple: you open the command line and point Aleph at what you want it to “ingest”, in this case a folder containing all the PDF files. Not complicated, but not very efficient either: without extra metadata explaining what was in each file, the only intel Aleph could rely on was what it could find in the written content of the documents… and it was pretty messy.
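For the curious, here is a minimal sketch of that step using alephclient, the companion client library maintained by OCCRP. It assumes that its Python API (AlephAPI, load_collection_by_foreign_id, ingest_upload) behaves as in the version we used; the host, API key, collection id and folder path are placeholders for your own setup. The alephclient command line also ships a crawldir command that does roughly the same thing in one line.

```python
# Minimal sketch: push a folder of PDFs into a local Aleph instance.
# Assumes alephclient is installed (pip install alephclient); the host,
# API key, collection foreign id and folder path are placeholders.
from pathlib import Path

from alephclient.api import AlephAPI

api = AlephAPI(host="http://localhost:8080", api_key="YOUR_API_KEY")

# Create (or fetch) the collection the documents will land in.
collection = api.load_collection_by_foreign_id("marseille-registers")
collection_id = collection["id"]

# Upload every PDF in the folder; without extra metadata, Aleph only has
# the file name and whatever it can extract from the document itself.
for pdf in Path("data/marseille_registers").glob("*.pdf"):
    metadata = {"file_name": pdf.name}
    api.ingest_upload(collection_id, pdf, metadata=metadata)
```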
In practice: the Marseille registers use a two-column layout, which left Aleph wondering: where does one decision start and where does it end? Where is that date? Where’s the title? In the end, our batch was ingested, but the nutrients for investigation were badly missing…
Lessons learned
Aleph is not a file cabinet, it’s a search engine. And without metadata, a search engine is as messy as the photo folder on your smartphone (when you haven’t sorted it in two or three years).
We thought at first that the information contained in the documents would be enough for Aleph to organise them by itself. That was foolish. Not because Aleph is bad at it, but because documents come in so many shapes and sizes that you can’t predict which data and metadata will eventually be extracted from them!
We needed metadata. And, to gather some, we needed to use the full Aleph stack and start looking at Memorious. To learn how to do it yourself, check the Aleph section of our user guide once it’s out.
Using Memorious to extract public documents
We decided to turn to Memorious, the “crawling” tool of the Aleph stack, to automatically retrieve big batches of documents from public websites. In practical terms, Memorious allows you to manage small programs (the crawlers) that will follow a given path to retrieve documents. These crawlers can be defined using a set of existing functions, whose behavior can be customized through various parameters, as a pipeline of stages: generating seed URLs, connecting to services, following links, cleaning content, storing documents, etc. Of course, you can also write your own functions in Python (and that’s what Mathieu did, on several occasions, to fill the gaps or fine-tune crawlers).
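To give a flavour of what such a custom function looks like, here is a minimal sketch of a custom stage: a plain Python function that receives the crawler context and the data handed over by the previous stage, and emits data to the next one. The function name, the cutoff_year parameter and the filtering logic are made up for the example; the functions we actually wrote live in the repository.

```python
# Illustrative sketch of a custom Memorious stage. The signature (context,
# data) and context.emit() follow the pattern used by Memorious operations;
# the filtering logic itself is invented for this example.

def skip_old_documents(context, data):
    """Drop results published before a cut-off year set in the crawler config."""
    cutoff = int(context.params.get("cutoff_year", 2000))
    year = int(data.get("year", 0))
    if year < cutoff:
        context.log.info("Skipping %s (published %s)", data.get("url"), year)
        return
    # Pass everything else on to the next stage of the pipeline.
    context.emit(data=data)
```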
We wanted to harness the power of Memorious to help our colleagues from Civio investigate access to mental healthcare in Europe. And we aimed too high for a start: retrieving all the public decisions regarding health from the Spanish national and provincial legislatures available online.
In practice, this meant designing one crawler per website that would generate the URL for a decision, retrieve the page, extract the part of the page containing the decision from the layout, clean it and store it. Easier said than done.
Our approach
We first asked our colleague from Civio where to look. She provided us with a long list of websites, mainly national and local authorities: the Boletín Oficial del Estado (the Spanish official journal), its equivalent in each autonomous community and city, the Spanish Congress website, etc.
We decided to focus first on the official gazettes, as they represented the most promising source of documents, with the widest variety.
The first difficulty we encountered was that each website had its own structure, which was reflected in the shape of the URLs of the decisions. The place of publication for the deliberations of the Junta de Andalucía, the Boletín Oficial de la Junta de Andalucía (which we ended up calling the Boja), labels each URL with the year and day of publication, and decisions are published almost every day. Fortunately, Memorious had this covered: with its init methods, we could generate the pieces of the URL, going one day or one year up or down, and paste them together to look for legitimate URLs. We had to do it for each website. And here, we sorely missed our Spanish classes.
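To illustrate the idea outside of Memorious itself, here is a small Python sketch of that date-based URL generation; the URL template is a placeholder, not the real Boja pattern.

```python
# Sketch of the date-based URL generation: walk day by day and build one
# candidate URL per issue of the gazette. The URL template below is a
# placeholder, not the real Boja pattern.
from datetime import date, timedelta

URL_TEMPLATE = "https://example.org/boja/{year}/{day_of_year}/index.html"

def candidate_urls(start, end):
    """Yield one candidate URL per day between start and end (inclusive)."""
    current = start
    while current <= end:
        yield URL_TEMPLATE.format(
            year=current.year, day_of_year=current.timetuple().tm_yday
        )
        current += timedelta(days=1)

for url in candidate_urls(date(2020, 1, 1), date(2020, 1, 7)):
    print(url)  # in a crawler, each URL would be emitted to the fetch stage
```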
But this part was the easy one. Once we got to the page itself, we needed to target the right link to follow to reach the documents. And that’s where things got complicated: some of those sites generate their pages on request (dynamic pages). We hit a wall several times, trying to point at something we could “see” but that wasn’t available at a given URL. We couldn’t point at a place, we had to trace the path to it from a starting page. Memorious can do that, but you have to become a first-class GPS device to get it where you want.
Lessons learned
Back from this Spanish expedition, we came to understand the importance of mastering every step of the pipeline, as well as its versatility: the many options offered by the existing functions, and the level of detail with which you can specify what to keep and what to stash, proved essential for crawling websites as complex as these.
The second important thing: we would have greatly benefited from a good guide through this mess! At the crossroads of legal terminology and local specificities, those pages may look like an amusement park to an experienced hiker of the Spanish legislative trails. For us, they were a foggy maze of exotic references and strange labels. Knowing your way around a website is not something to learn while programming a crawler: it’s a prerequisite.
We’ll compile the step-by-step process of installing Memorious, setting up crawlers and running them in the Memorious section of our user guide.
Combining Memorious and Aleph to create a better search engine for public websites
We had now experienced both the ingestion process and the crawling. Combining the two would allow us to design a custom document search engine, updating itself and indexing documents on the fly. Our original intent, in a way. But for this, we needed two things:
- a source of documents narrow enough to be manageable but diverse enough to cover several topics and countries at once, so as to be useful to as many journalists as possible;
- a source that didn’t already feature a search engine, or whose search engine didn’t fully meet the needs of journalistic methodology (we didn’t want to build a mere duplicate of the internal search engine of a public database).
And so, we decided to use the Aleph stack to improve the search on one of the most European topics of all: competition.
The competition website of the European Commission is a story-rich maze: behind each decision or step of the procedure lies a potential cartel, an infringement of free-market rules or an interesting power play between companies. Covering every economic sector and the whole lifespan of the EU, it nonetheless suffers from a search engine that doesn’t meet the needs of journalistic research:
- the company names are not indexed as a whole but token by token (see the example below);
- worse: the relevant entities mentioned inside the documents (whether web pages, docx or pdf files) are not indexed at all, meaning you can only search for tokens in the titles of decisions or documents.
The topic and its potential were worth the work. So we got to it.
Our approach
As usual, the first step was to learn how to navigate the site. Among its particularities: the site is built from a different fabric depending on the type of decision you look at, and each type (antitrust/cartels, mergers and state aid) displays its content in a different manner. That called for a set of three different crawlers, one per type: even on a single website, in order to navigate efficiently, extract clean metadata and follow the right links, you need to tailor the crawler to specific page templates. We decided to start with the best organised and, from a journalistic perspective, most promising section: antitrust and cartels.
The other problem was that the URLs didn’t follow a clear pattern: case identifiers are (mostly) numeric but non-consecutive, at least within a given section, leaving very large gaps. Memorious handles missing pages very well, but each candidate page still had to be requested, downloaded and analysed… which consumed time, energy and storage. So we decided to use another method: bouncing from page to page.
As Memorious works largely by following links, we decided to target those pointing at lists of cases: the economic sector of the companies involved, the names of the companies, the related cases… The pages returned by those links were full of other cases, sometimes not sorted in chronological order, allowing us to jump from a 2020 cartel investigation to a 1980 business deal. This “bouncing bot” was able to gather as many as 1,000 documents per hour.
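Here is a rough sketch of the principle behind the bouncing bot, written as a custom Memorious-style stage. The selectors, URL patterns and rule names are illustrative, not the ones used in the actual crawler; the rule names would map to the next stages defined in the crawler’s configuration.

```python
# Rough sketch of the "bouncing" approach: from a fetched page, collect
# links that point to other case listings and feed them back into the
# pipeline, alongside links to the documents themselves.
# URL patterns and rule names are illustrative placeholders.
from urllib.parse import urljoin

def bounce(context, data):
    result = context.http.rehash(data)   # reload the cached HTTP response
    page = result.html                   # parsed HTML tree (lxml)
    if page is None:
        return
    for link in page.findall(".//a"):
        href = link.get("href")
        if not href:
            continue
        url = urljoin(result.url, href)
        if "case_list" in url:
            # Listing pages (by sector, company, related cases): crawl again.
            context.emit(rule="fetch", data={"url": url})
        elif url.endswith(".pdf"):
            # Decision documents: send them towards the store/Aleph stage.
            context.emit(rule="store", data={"url": url})
```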
The crawled documents were fed directly from Memorious to Aleph via the Aleph connection module we learned to use (and love). But something was still missing: the documents came with no metadata. No date, no company name, no sector, not even the category of inquiry (cartel or antitrust). Without that metadata, Aleph couldn’t index the documents in a useful way.
So Mathieu dug into the toolbox. Until then, we had focused on using the crawler and its functions as they came, and there was a lot of useful material there, improved by other users. But here we needed an extra layer of code to do what we needed: make Memorious snatch some data about each document from the page containing the link. Those tables we mentioned earlier had it all: company name, type of case, date of publication… So Mathieu wrote a new set of functions of his own, the “competition” set, which extracts this metadata and passes it along to be stored alongside the document.
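To give an idea of the principle (the real “competition” functions are in the repository), here is a simplified sketch: grab the cells of the listing table around each link and pass them along with the document’s URL, so the metadata travels with the document to the store/Aleph stage. The table layout and column order are made up for the example.

```python
# Simplified sketch of the metadata-extraction idea: copy the cells
# surrounding each link in a listing table (company name, case type,
# publication date) into the data dict passed to the next stage.
# Table layout and column indices are invented for this example.

def extract_case_metadata(context, data):
    result = context.http.rehash(data)
    page = result.html
    if page is None:
        return
    for row in page.findall(".//table//tr"):
        cells = [cell.text_content().strip() for cell in row.findall("td")]
        link = row.find(".//a")
        if link is None or len(cells) < 3:
            continue
        context.emit(data={
            "url": link.get("href"),
            "title": cells[0],         # company / case name
            "case_type": cells[1],     # e.g. cartel or antitrust
            "published_at": cells[2],  # publication date
        })
```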
We now had not only the documents but also the right company names, dates and sectors… the things we wish the original search engine had offered its users.
Lessons learned
The first thing we realised here is that Aleph can be very user-friendly. During our previous attempts, we sweated over building our crawlers and then spent hours analysing the documents they brought back… Now that we knew how to connect Aleph and Memorious, we could see the results pop up in real time in the graphical interface!
The second important piece of advice: it’s OK if it’s messy, it will get sorted later. Crawling documents is hard but, if you get all the metadata you need, some sort of organisation will emerge and the collection will soon tidy itself up. Hence, following all the leads you have on a page can be a much more efficient approach to a website than trying desperately to reproduce its architecture.
And last but not least: Aleph and Memorious are designed to allow extension, adaptation and fine-tuning. We were stuck for weeks with the set of functions and pipelines we had read about in the documentation, following them with care, bending our brains to fit squares into circles. In the end, adding our own functions, derived from the existing toolbox, and inserting them into this framework turned out to be both easier and liberating. The stack is solid enough to play with as you see fit.
You’ll find technical details on how to connect Memorious to Aleph and index your docs on the fly in a dedicated sub-section of the user guide.
Conclusion and follow up
The Aleph stack made us dream big. From local government decisions to national health regulation or continent-wide inquiries, we embarked on a journey into open document repositories that made us realise the potential of this tool for journalists (and it’s huge!). But the most important lesson we learned is that every step of setting up these tools requires a lot of expertise, exploration and time.
Aleph accomplishes several tasks no single journalist (or even newsroom) could achieve alone:
- searching through complex paths for documents of all sorts;
- extracting data on the fly, regardless of the format;
- using the metadata and content to sort those documents and connect them, inside and across databases.
The skills required to leverage each of those capabilities are diverse:
- backend developers to set up the tools and manage the databases;
- frontend developers to improve access and build new entry points into the database for users;
- natural language processing experts to improve the detection of the entities (people, companies, places, etc.) that make it possible to filter and explore a collection of documents;
- and journalists, of course, to assess the relevance of the retrieved documents and entities, focus on what matters and make use of it.
The OCCRP team worked hard to bring Aleph to life. The organisation needs more of these specialists to help it deliver everything it can offer journalists.
Next steps
We have a lot of work ahead of us:
- documentation: the user guide, plus some translations to help the folks at OCCRP;
- contribution: Mathieu keeps pushing contributions (fixes and features) to the original repository and we’re digging into the features we think would be useful to add;
- Memorious:
  - the competition crawler needs to be extended, benefiting from the latest updates we made;
  - we’d like to get back to our original project and use our new skills and tools to harness Aleph for public document indexing;
- Aleph:
  - the Aleph instance for EU competition would benefit from some improvements with immediate gains for the user experience (PDF rendering, search, etc.);
  - we’d like to work on new features to dig into some of our topics and help fellow journalists use this tool for their own investigations.
All this can’t be done by just the two of us. We are waiting for colleagues to bring fresh ideas, and we have other leads to follow with people from NGOs. And, of course, we will keep contributing to this great project alongside OCCRP.