Assembling DocNow Data Flow

Dan Chudnov
May 30, 2016


In this early phase of the DocNow project we are experimenting with a few technical approaches for gathering, analyzing, and presenting selected content to users for their review and refinement. Because one of the project’s goals is to put useful tools for this work in the hands of a wide range of people, we want everything we do under the hood to be reliable, repeatable, and amenable to a thoughtful user experience that meaningfully empowers scholars, archivists, and everyone else who uses DocNow. As such, there is a balance to be struck somewhere in between requiring someone to learn and understand the affordances of the Twitter API for themselves and hiding all of those details. As we work toward finding that balance, we want to be sure we minimize the drudgery of using the API and present summary results of collected data that illustrate patterns that might not be obvious.

The rest of this post reviews some aspects of the dnflow repository where we are collecting some of our prototyping code. To be clear, we are still in the early days of the project, and everything about the project’s eventual technical stack is up for discussion. In the meantime, we wanted to share some of our thinking along the way.

Reliable and repeatable data flow

A first piece we have been experimenting with is a repeatable workflow for incoming data. Whether we fetch data through the Twitter API within DocNow or ingest data procured by a scholar through other means, we need to be able to filter through and analyze that data, extract some summary details, and prepare those results for use in the UI. There are many ways to approach this kind of technical workflow. The first approach we’re testing is to define this workflow using Luigi, a Python library designed to “build complex pipelines of batch jobs” and “to address all the plumbing typically associated with long-running batch processes.” We distinguish between the data pipeline as a batch processing job and the user interface as a user-driven environment because we most need the data flow to be reliable and repeatable, whereas the user interface needs to be coherent and responsive. It’s okay for the data pipeline to take several minutes to complete as long as its results are correct, but we wouldn’t want to design a web app that made anyone stare at a spinning ball for several minutes.

Luigi was built by developers at Spotify who use it to process their music listener data, so it fits the problem of data pipelines like ours nicely. Our dnflow prototype has several steps:

  1. Acquire tweets from the Twitter API and write them to disk
  2. Summarize tweets by counting mentions, hashtags, URLs, comparing follower counts and ratios, and so on, and write those counts to disk
  3. Fetch media linked from tweets, such as images
  4. Analyze collected data one level deeper, such as comparing images — more on this in a bit
  5. Prepare summary and analysis results for use in the UI such as indexing, writing to a database, or reformatting files

Each of these steps could take a long while to complete, especially if we are examining a large amount of data. And there is a necessary ordering to the steps as well, as we can’t compare images until after we’ve fetched them. Luigi enables workflows like this by allowing developers to define each step as its own Task, each of which has its own definition of required input and output. In dnflow, for example, we have a CountHashtags task which does pretty much what you’d think: count the hashtags used in all the collected tweets. CountHashtags requires a set of tweets to count, of course, so for the case where we fetch tweets through the Twitter API, CountHashtags requires the output of a FetchTweets task as input. In other words, CountHashtags doesn’t even start until FetchTweets completes successfully. Similarly we can’t fetch images until we’ve run FetchTweets, and we can’t compare fetched images until we’ve run FetchMedia, and so on.
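As a rough sketch of how a pair of tasks like these might be wired together, here is a minimal Luigi example. The class and parameter names echo those described above, but the file paths and tweet-reading details are simplified assumptions for illustration, not the actual dnflow code.

```python
import json
from collections import Counter

import luigi


class FetchTweets(luigi.Task):
    """Fetch tweets matching a search term and write them to disk, one JSON object per line."""
    term = luigi.Parameter()

    def output(self):
        # illustrative path; the real workflow decides where results land
        return luigi.LocalTarget('data/%s-tweets.json' % self.term)

    def run(self):
        # ... call the Twitter search API and write each tweet to self.output() ...
        pass


class CountHashtags(luigi.Task):
    """Count the hashtags used in all the collected tweets."""
    term = luigi.Parameter()

    def requires(self):
        # Luigi will not start this task until FetchTweets has completed successfully
        return FetchTweets(term=self.term)

    def output(self):
        return luigi.LocalTarget('data/%s-hashtags.json' % self.term)

    def run(self):
        counts = Counter()
        with self.input().open('r') as tweets:
            for line in tweets:
                tweet = json.loads(line)
                for tag in tweet.get('entities', {}).get('hashtags', []):
                    counts[tag['text'].lower()] += 1
        with self.output().open('w') as out:
            json.dump(counts.most_common(), out)
```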

Luigi makes it easy for a developer to define tasks such as these and to describe how their inputs and outputs relate to each other. When a Luigi workflow executes, it examines all of these relationships and computes a dependency graph, allowing it to schedule the earliest required steps first and, as each completes successfully, the later steps that depend on them. Here’s a dnflow Luigi workflow in process:

In this dependency graph, visualized by Luigi in its web-based monitoring UI, green tasks are finished, yellow tasks are pending completion of a dependency, and the blue task is active. Currently, FetchMedia (in blue) is running, with FetchTweets and CountMedia (in green) having completed before it and MatchMedia (yellow) and other tasks pending its completion. This represents a state of the workflow that often takes a long time, in that we’re fetching media files linked from tweets from the web. In some kinds of data pipelines, we might think of this typically slow step as a bottleneck — the step that causes the greatest delay, forcing the entire workflow to complete later. In our thinking so far, though, we are less concerned with speed as a bottleneck than we are with having a reliable process that provides meaningful summary feedback to users. Whether it takes ten seconds or ten minutes to complete, none of this processing will be useful if it doesn’t complete successfully or if nobody can understand the results. So far, Luigi has been easy to work with and is proving to be a reliable component of our prototype, letting us shift our focus to user experience.
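For quick local testing of a workflow like this, Luigi tasks can also be kicked off programmatically with its built-in local scheduler (the web-based monitoring UI shown above is served by Luigi’s central scheduler, luigid). A small example, assuming the CountHashtags sketch above:

```python
import luigi

# Schedules FetchTweets first (as a declared dependency), then CountHashtags,
# using the in-process local scheduler rather than a central luigid instance.
luigi.build([CountHashtags(term='stoppolicebrutality')], local_scheduler=True)
```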

Making sense of collected data

This focus on user experience is central to our work as we seek that balance between automation and technical understanding that can best empower DocNow users. As an example, while I was preparing notes for this post, Bergis suggested we look at tweets containing “stoppolicebrutality”. Following this suggestion, I ran a dnflow process which collected over a thousand tweets matching that search term, and saw these hashtag counts:

Top hashtags for tweets matching “stoppolicebrutality”

Even this simple visualization reveals a few key insights. First, we need to enable users to explore this data with some basic manipulations, like removing that first long bar. The problem is plain: telling us that the set of 1,022 tweets collected from a search for “stoppolicebrutality” contains over 1,000 tweets that include the hashtag “stoppolicebrutality” doesn’t add much. It confirms the correctness of the search by matching what we expected to see, but it also obscures the data we didn’t know to expect: every other hashtag used in conjunction with “stoppolicebrutality” gets a proportionally smaller bar. The chart is technically correct, but these other data points could indicate alternate narratives and viewpoints that might be a focal point for research, and in this form the chart makes it difficult to focus on them.
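One simple manipulation would be to drop the search term itself from the hashtag counts before charting them. A minimal sketch of how that filtering might look, using a hypothetical helper and assuming counts keyed by lowercased hashtag text:

```python
from collections import Counter

def top_hashtags(counts, search_term, limit=10):
    """Return the most common hashtags, excluding the search term itself,
    which acts essentially as a stop word within its own result set."""
    filtered = Counter({tag: n for tag, n in counts.items()
                        if tag.lower() != search_term.lower()})
    return filtered.most_common(limit)

# e.g. top_hashtags(hashtag_counts, 'stoppolicebrutality') would surface
# the co-occurring hashtags shown in the updated chart below
```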

If we choose to view the same chart with that one hashtag removed, treating it essentially as a stop word within this search result, we can see a lot more. In particular, the updated chart below indicates that we should also be looking at tweets that match “IEBC” (the Independent Electoral and Boundaries Commission in Kenya, a target of active protests).

Top hashtags for tweets matching “stoppolicebrutality” (same as above), with top hashtag “stoppolicebrutality” removed

With that in mind, we can consider options like making it easy for users to run a new search for just “IEBC” or perhaps both terms together, to compare the results of both searches, or perhaps to combine them, and so on. This is an area of work we expect to spend a lot of time on, and a series of user interviews we’ve recently started should give us a lot to consider as we begin to prioritize development of features like these.

To carry this example a step further, I ran a second search for “IEBC” and reviewed the images linked within the matching result set of tweets. The simplest task here is to count image URLs, and we do that, but we also go a step further to show which sets of images are substantially similar to each other. Here is the result from the latter search, for tweets matching “IEBC”:

Common and similar images linked from tweets matching “IEBC”

Across the top row of “Common images”, we see the most commonly tweeted images from left to right, beginning with that wide group shot, which had 36 tweets linking to it. Here we are simply counting occurrences of the URL linking to each image, so it is unsurprising that another image, a tighter shot of two people, appears twice (in fourth place with 18, and again in seventh place with 14). These latter two very much appear to be the same image, and if we add up their counts, the combined total of 32 would move that image into second place.
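The URL counting itself is straightforward. A minimal sketch, assuming tweets stored one JSON object per line in the format returned by the Twitter API (the function name and file layout are illustrative, not the dnflow implementation):

```python
import json
from collections import Counter

def count_image_urls(tweets_path):
    """Tally how often each photo URL appears across the collected tweets."""
    counts = Counter()
    with open(tweets_path) as tweets:
        for line in tweets:
            tweet = json.loads(line)
            for media in tweet.get('extended_entities', {}).get('media', []):
                if media.get('type') == 'photo':
                    counts[media['media_url']] += 1
    return counts.most_common()
```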

Images are copied and republished easily and often on the web and on social media networks like Twitter, by anyone with an account or a web site. With that in mind, we are also experimenting with reviewing all of the collected image files and looking for matches like this, details of which you can see under the two rows of “Matching images”. When we compare the content of the captured images, we find at least a half-dozen distinct URLs within the collected tweets that reference one of two sets of visibly near-identical copies of the same two source images. And the overall combined counts of each matching image set are much higher than the original counts of each image when counted by unique URL.

There are several approaches to comparing the content of images, one of which reduces the size and color space of images to make comparisons fast and fairly reliable. So far we have had good results using the imagehash Python library for this purpose.
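As a rough illustration of that approach, here is a small sketch using imagehash: a perceptual hash is computed for each file, and images whose hashes sit within a small Hamming distance of each other are grouped together. The grouping logic and threshold are simplified assumptions, not the actual dnflow code.

```python
from PIL import Image
import imagehash

def group_similar_images(paths, max_distance=4):
    """Group image files whose perceptual hashes are nearly identical."""
    hashes = {path: imagehash.average_hash(Image.open(path)) for path in paths}
    groups = []
    for path, h in hashes.items():
        for group in groups:
            # subtracting two imagehash values gives their Hamming distance
            if h - hashes[group[0]] <= max_distance:
                group.append(path)
                break
        else:
            groups.append([path])
    return groups
```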

Another detail we can pick up by looking for similar images is subtle modifications of a source image resulting in derivative forms. For example, from the same set of results, we find these similar images:

Similar images from tweets matching “iebc”, varying in aspect ratio and branding

There are two key differences between these images: the first two have an incorrect aspect ratio (they seem “squeezed”) when compared with the last image at right, and the last image has a television network’s logo in the top right corner and headline information added in a lower third.

More similar images from tweets matching “iebc”, varying by watermark

Above is another example from the same result set, with a photo archive watermark (look for the faint transparent grey text across the middle) appearing on some of the images and not on others. We can imagine the questions we might have upon seeing patterns like these: who tweeted which images, and how do the communities of tweeters and retweeters vary? Which came first, and might the original creator of the image be among them? How do those counts of hashtags, mentions, and influential accounts attached to one version of the image vary with the others?

Reviewing these similar image sets manually, it is easy to spot these similarities, but we might have missed one or more entirely had we skipped this step. When we consider data sets containing hundreds of thousands or millions of tweets, the possibility of missing details like these increases proportionally. Add to that the common trolling technique of recycling old images to bolster false narratives and it is easy to see the potential for additional post-processing of media within DocNow.

Hopefully this peek inside the dnflow prototype has given you an idea of our data pipeline work so far, and the direction of our next steps. The long-term goal of DocNow is to leverage social media streams like Twitter as a way to identify valuable content for research archives. If you have any ideas that you think can help researchers and archivists zoom in on valuable content, please get in touch either in the comments here on Medium or using the dnflow issue tracker.
