The Catalog and the Hydrator

Ed Summers
Published in Documenting DocNow
Jul 24, 2017

This post is a (somewhat extended) version of a lightning talk I gave at the Collections as Data conference on July 25, 2017 at the Library of Congress in Washington, DC.

A year ago Bergis Jules spoke here at the Library of Congress’ Collections as Data conference about how memory workers (archivists, librarians, researchers, activists) must think about practices for social media archiving in the context of the powerful assemblage of state and corporate entities that are actively building their own archives for the purposes of surveillance and control.

In the Documenting the Now Slack channel we’ve been working to understand what our ethical responsibilities are when building collections of social media content for research — especially when it comes to designing tools that help people build those collections. This post is about two small applications that have emerged from this discussion: the Catalog and the Hydrator.

One of the primary goals of the Documenting the Now project has been to support the use of social media data in academic research — with a specific focus on Twitter data. Of course there is no shortage of research using social media data. A quick search in Google Scholar yields thousands of studies that rely on the analysis of Twitter data. But just because the research is happening does not mean that there aren’t challenges and significant roadblocks.

In their ethnographic study of social media researchers’ data management practices, Weller and Kinder-Kurlanda found that problems become evident once you look below the surface at how the research is actually carried out:

Social media researchers’ approaches and practices in general were highly influenced by certain constraints to their work. Constraints mainly concerned the access to data, the sharing of data, and the publication of information about data collection and processing … It may be concluded that in the current state, both data and knowledge are hidden in social media research. Data is hidden in the sense that research datasets are being shared privately rather than formally published for reuse. More than in other disciplines, use of secondary data is difficult and informal data sharing is connected to many uncertainties regarding legal and ethical questions.

Science and knowledge production are built on the foundation of reproducibility, or the ability for scientists to duplicate each other’s results. To reproduce results you need to replicate the method by which the results were generated. What data was analyzed? How was it obtained? How was it manipulated? Access to the data and a description of the data are crucial. In today’s big data research environment, access to data is increasingly mediated by the corporations that own the data. Despite widespread efforts to encourage the use of data management plans, social media data presents researchers with legal and ethical dilemmas that constrain the deposit of datasets and the linking of research to them. A good example of this is Twitter’s Terms of Service, which have struck an interesting and (we think) useful balance between access and control.

Twitter allows researchers to easily access tweets through its streaming and search APIs once they agree to its Terms of Service. An ecosystem of tools has developed around these APIs, one of which is twarc, which we’ve worked on with 25 other collaborators from around the world, including the Social Feed Manager project at George Washington University. However, a significant clause in the Twitter Terms of Service prevents the publication of collected data while still allowing datasets of tweet identifiers to be distributed.

A list of tweet identifiers
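To make the idea of an identifier dataset concrete, here is a minimal sketch of how collected tweets could be reduced to a shareable list of identifiers. It assumes the tweets are stored as line-oriented JSON and that each tweet carries Twitter's `id_str` field; the `dehydrate` function name and the sample tweets are ours, made up for illustration, and this is not necessarily how any particular tool does it.

```python
import json


def dehydrate(json_lines):
    """Yield the tweet identifier found on each line of line-oriented tweet JSON."""
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        # id_str is the string form of the identifier; using it avoids precision
        # loss in tools that read very large integers as floating point numbers.
        yield tweet["id_str"]


# Two made-up tweets (and one blank line) standing in for a collected file.
sample = [
    '{"id_str": "501064141332029440", "text": "first example"}',
    "",
    '{"id_str": "501064171707170816", "text": "second example"}',
]

tweet_ids = list(dehydrate(sample))
print("\n".join(tweet_ids))
```

Everything except the identifier is discarded, which is what makes the resulting file shareable under the terms described above.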

Researchers aren’t really sharing tweet identifier datasets very much yet. Even if a file of tweet identifiers can be obtained, it isn’t always clear how a researcher could easily turn these identifiers back into the information-rich tweet data that can be the subject of analysis. This is where the Documenting the Now project decided to make a small intervention with the Catalog and the Hydrator applications.

The Catalog is a registry of tweet identifier datasets. It doesn’t house the datasets themselves — it is simply a clearinghouse that points to where these datasets live in repositories of various kinds on the web. At the moment there are 35 datasets listed, on topics such as the 2014 protests in Ferguson, the Panama Papers, the 2016 US Presidential Election, 2 years of J K Rowling’s fan correspondence, and more. An interested researcher can read a brief description of the dataset and then go and download the identifiers. Researchers can also publish their own datasets in the catalog. But what should the researcher do after they have downloaded the dataset of identifiers? What is the use of a text file full of numbers?

Hydrator’s datasets and dataset detail consoles

This is where the Hydrator comes in. The Hydrator is a cross-platform desktop application that manages the conversion of the tweet identifiers back into the data-rich, highly structured JSON data for a tweet. The Hydrator allows the user to log in with their own Twitter credentials and then negotiates access to Twitter’s statuses/lookup API to turn the tweet identifiers back into data. Hydrator is careful to stay within the Twitter API rate limits, which constrain how many tweets can be requested in a given amount of time. It lets you run a conversion for days or weeks, over potentially unstable network connections, if the size of the dataset requires it. The person running the Hydrator is the only one who has access to the retrieved data. Once the download is complete it also lets you turn the JSON into a CSV file that can easily be viewed as a spreadsheet in Excel, analyzed in a tool like Stata, or visualized in software like Tableau.
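The workflow described above can be sketched in a few lines of Python. This is a simplified illustration, not the Hydrator's actual code: `lookup` is a hypothetical stand-in for an authenticated request to statuses/lookup (which accepts up to 100 identifiers per request), the fixed `pause_seconds` delay is our assumption about how one might stay under a rate limit, and `fake_lookup` returns made-up tweets so the sketch runs without network access or credentials.

```python
import csv
import io
import time

BATCH_SIZE = 100  # statuses/lookup accepts at most 100 identifiers per request


def batches(ids, size=BATCH_SIZE):
    """Split a list of identifiers into consecutive chunks of at most `size`."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def hydrate(ids, lookup, pause_seconds=0.0, sleep=time.sleep):
    """Yield tweet dicts for `ids`, calling `lookup` once per batch.

    `lookup` is a hypothetical stand-in for an authenticated statuses/lookup
    request: it takes a list of ids and returns a list of tweet dicts.
    `pause_seconds` is an assumed fixed delay between requests.
    """
    for number, batch in enumerate(batches(ids)):
        if number > 0:
            sleep(pause_seconds)  # wait between requests, not before the first
        yield from lookup(batch)


def tweets_to_csv(tweets, out, fields=("id_str", "created_at", "text")):
    """Write one CSV row per tweet, keeping only the chosen top-level fields."""
    writer = csv.DictWriter(out, fieldnames=list(fields))
    writer.writeheader()
    for tweet in tweets:
        writer.writerow({field: tweet.get(field, "") for field in fields})


def fake_lookup(batch):
    # Made-up responses so the sketch runs without network access.
    return [{"id_str": i, "created_at": "", "text": "tweet " + i} for i in batch]


ids = [str(n) for n in range(1, 251)]
batch_sizes = [len(batch) for batch in batches(ids)]
tweets = list(hydrate(ids, fake_lookup))

out = io.StringIO()
tweets_to_csv(tweets, out)
csv_lines = out.getvalue().splitlines()
print(batch_sizes, len(tweets), csv_lines[0])
```

With 250 identifiers the sketch makes three requests (100, 100 and 50 identifiers). A long-running hydration would also need to save its progress and retry failed requests, which is left out here to keep the example short.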

Returning to where we started, what do the Catalog and Hydrator mean in the context of surveillance and researcher ethics? It’s important to note that the process of hydration is lossy. Twitter’s API will not return data for tweets that have been deleted. If a user has deleted a particular set of tweets or protected/deleted their account the Hydrator will not be able to reconstitute the data even if it knows the tweet identifiers.
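One way to see how lossy a particular hydration was is to compare the identifiers that were requested with the ones that came back. The sketch below assumes hydrated tweets are dicts carrying an `id_str` field; the `missing_ids` helper and the sample data are hypothetical and only for illustration.

```python
def missing_ids(requested_ids, hydrated_tweets):
    """Return the requested identifiers that have no matching hydrated tweet."""
    returned = {tweet["id_str"] for tweet in hydrated_tweets}
    return [tweet_id for tweet_id in requested_ids if tweet_id not in returned]


# Made-up example: four identifiers requested, two tweets still available.
requested = ["1", "2", "3", "4"]
hydrated = [{"id_str": "1"}, {"id_str": "4"}]

gone = missing_ids(requested, hydrated)
share = f"{len(gone) / len(requested):.0%}"
print(gone, share)
```

Reporting this share alongside a published analysis lets readers judge how far a rehydrated dataset may differ from the one originally collected.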

From the perspective of reproducibility this is potentially problematic, since you are not guaranteed the exact same dataset. But from the perspective of users’ right to control their own content, bearing in mind that they may not have consented to being in your study in the first place, this is a feature, not a bug. The Documenting the Now project has made a deliberate decision to work for the rights of the content creator, because for us, the positionality of the archive relative to the archived matters. It matters because we’re not just interested in studying the data, we’re interested in preserving it for the long term.

Tools like Politwhoops that hold powerful public figures to account for what they say in social media have an important role. But when it comes to documenting events like social activism and protest, it is important that our tools and collections do what they can to respect and empower the decisions content creators make about whether to remain legible. Legibility often involves great risk, especially for marginalized communities. And as Proferes recently discovered in a study of users’ knowledge of information flows in Twitter, the public/private distinction isn’t a binary either/or but a continuum:

… only a slim majority of users accurately indicated that Tweets are set to be public by default. Given the common refrain that Twitter is a “public” platform, having 33% of respondents indicate they are uncertain whether or not Twitter is public by default suggests some users may not actively perceive it this way. This raises many questions about the kinds of literacy work that needs to be done to improve user understanding of what it means for a platform to be “public.” Together, these individual findings suggest that the problems of inaccurate knowledge of information flow highlighted by these three anecdotes may be more common across a wider swath of users.

As we have moved into our second year of work we are increasingly focused on opportunities to connect researchers and archivists with content creators in social media. As Alexandra Dolan-Mescal wrote previously, this involves adequately documenting the decisions being made about what and how to collect so that these collections can be properly understood later. It also means we must look for opportunities to work with social media users to gain their consent when it comes to the long term preservation of their content.

Recently Twitter updated its Terms of Service to cap the maximum number of tweet identifiers that can be shared at 1.5 million. Thanks to a post by Justin Littman of the Social Feed Manager project and subsequent awareness raising, Andy Piper of Twitter clarified that “researchers affiliated with an accredited academic institution” are exempt from this limitation.

This is good news for the research community and for users of the Hydrator and the Catalog. It also highlights the precarity of this type of memory work, and the need to work together as a community to share and describe our collections in a way that encourages preservation, access and use. If you get a chance to use either the Catalog or the Hydrator and have any feedback please let us know here or over on GitHub — we would love to hear from you.

I’m a software developer at @umd_mith & study archives on/of the web at @iSchoolUMD