Fork in path, Holkham woods by Derek Harper

Looking Forwards

Ed Summers
Published in Documenting DocNow
Jul 19, 2023

Since 2014 the Documenting the Now project has worked to develop tools and cultivate a community of practice for collecting social media. Documenting the Now began as a collective memory project amidst the protests following the murder of Mike Brown in Ferguson. Informed by early work studying the political and social significance of social media (Freelon, McIlwain & Clark, 2016), the project hosted workshops, ran a Slack forum for discussing how to collect social media data, and developed several specialized tools for Twitter, including:

  • twarc: for collecting data from the Twitter v1.1, v2 and Enterprise APIs
  • Catalog: a clearinghouse for sharing tweet id datasets
  • Hydrator: a desktop application for converting tweet id datasets into JSON and CSV
  • DocNow: a web application for collecting Twitter data and obtaining consent from content creators

According to Google Scholar, twarc has been cited over 500 times as a tool for collecting and hydrating Twitter datasets. It has been starred 1,300 times and forked 250 times on GitHub, and has been incorporated into other software tools such as the Social Feed Manager. Numerous guides have been independently prepared by data scientists, Software Carpentry, and even Twitter themselves as part of their Twitter Research outreach. Enhanced access under Twitter’s Academic Research Product Track enabled searching the full archive of public tweets back to Twitter’s beginning in 2006.

After the acquisition of Twitter by Elon Musk, the changes to the Twitter API have resulted in the complete dismantling of DocNow’s tools. While twarc still technically functions (if you still have API keys), it is basically useless unless you want to pay $100/month to retrieve 10,000 tweets ($0.01 per tweet), or $5000/month for up to 1 million tweets ($0.005 per tweet). These prices and the new policies around deletion of Twitter data are designed for commercial use, and to terminate academic research and study of the platform. The DocNow application, which was built around Academic Research Product Track quotas and historical search, no longer functions with the new API regime. Twitter has revoked the keys for Hydrator and for the instance we have been running at community.docnow.io.
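To make the pricing concrete, here is a rough back-of-the-envelope sketch in Python. It uses only the two price points quoted above; the tier labels and function names are made up for illustration, and it assumes the quoted tweet counts are hard monthly caps with no other fees, which may not match how the API is actually billed.

```python
# Price points quoted in this post: (USD per month, tweets per month).
# The tier labels are hypothetical names used only for this sketch.
TIERS = {
    "100_per_month": (100, 10_000),
    "5000_per_month": (5_000, 1_000_000),
}


def cost_per_tweet(monthly_price, monthly_cap):
    """Cost of one tweet if the monthly cap is fully used."""
    return monthly_price / monthly_cap


def months_to_collect(n_tweets, monthly_cap):
    """Whole months needed to retrieve n_tweets (ceiling division)."""
    return -(-n_tweets // monthly_cap)


def total_cost(n_tweets, monthly_price, monthly_cap):
    """Total spend, assuming you pay for every month you need."""
    return months_to_collect(n_tweets, monthly_cap) * monthly_price


if __name__ == "__main__":
    dataset_size = 1_000_000  # e.g. hydrating a one-million tweet id dataset
    for label, (price, cap) in TIERS.items():
        print(
            label,
            f"per tweet: ${cost_per_tweet(price, cap):.3f}",
            f"months: {months_to_collect(dataset_size, cap)}",
            f"total: ${total_cost(dataset_size, price, cap):,}",
        )
```

Under these assumptions, a one-million tweet dataset would take 100 months on the cheaper tier, which is why even modest research collections are out of reach.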

The truth is that the writing has been on the wall for access to corporate social media platform APIs for some time (Freelon, 2018; Bruns, 2019). Twitter and Reddit have been anomalies more than examples of the norm when it comes to access to platform data. Scraping practices are widespread, but they are fragile in the face of interface changes, suffer from a lack of documentation, and present major reproducibility and provenance challenges. While social media APIs have typically been layered on top of the web, their key management, authentication, URL routes, and data representations have required the development of specialized tooling, which has also fragmented methods and diminished cooperation amongst projects.

With these challenges, and the shuttering of API access except for the most powerful, we believe that the social media and web research community needs to embrace the use of general-purpose web archiving technology and practices to preserve and study social media platforms and the web. As divergent as they are, most platforms still have some kind of web surface as an interface. Service providers such as the Internet Archive’s Wayback Machine, and tools such as those provided by the Webrecorder project, need support from the research community. Tools for analyzing web archive data, such as those by the Archives Unleashed Project, and specialized methodological approaches to using WARC data in research (Proferes & Summers, 2019) need to be further developed and socialized. Expertise in using WARC data needs to become more commonplace through curriculum development, training, and community workshops.
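For readers who have not looked inside a WARC file, here is a minimal sketch in Python of what reading one involves. It is a deliberately simplified, standard-library-only reader written for illustration: it assumes an uncompressed file with well-formed records, and the function name and sample data are made up. Real research workflows would more likely rely on established web archiving tools rather than hand-rolled parsing like this.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream.

    Simplified sketch: no gzip support, no header continuation lines,
    and no validation beyond the version line.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank lines separating records
        version = line.strip().decode("utf-8")
        if not version.startswith("WARC/"):
            raise ValueError(f"unexpected line at record start: {version!r}")

        # Named header fields run until the first empty line.
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()

        # The content block is exactly Content-Length bytes long.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


# Made-up sample data: two tiny records for example.org.
SAMPLE = (
    b"WARC/1.0\r\n"
    b"WARC-Type: request\r\n"
    b"WARC-Target-URI: https://example.org/\r\n"
    b"Content-Length: 16\r\n"
    b"\r\n"
    b"GET / HTTP/1.1\r\n"
    b"\r\n\r\n"
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: https://example.org/\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
)

if __name__ == "__main__":
    for headers, payload in iter_warc_records(io.BytesIO(SAMPLE)):
        print(headers["WARC-Type"], headers["WARC-Target-URI"], len(payload))
```

The point of the sketch is that each record carries its own provenance (record type, target URI, and so on) alongside the captured bytes, which is part of what makes WARC data useful for reproducible research.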

In addition to supporting existing tools and service providers, we should foster a more diverse community of people who can use web and social media archiving tools to create data collections. A more diverse set of users, beyond academic researchers and the handful of archivists at academic and national libraries, will allow us to grow the number of collections available for research. A more diverse set of users of these tools also makes it possible to envision new types of tools and practices for collecting this data, brings new meanings to what is valuable as web data, and builds a more inclusive historical record. Documenting the Now and the Archiving the Black Web project will be working more towards this community-building and education approach in the coming years.

Some aspects of traditional archiving practices also present a way forward. Working with individuals to donate their collections (or social media data), respecting an individual’s right to consent to have their information collected, building publicly accessible digital repositories of archival materials, and collaborative collection development are all useful practices that can help us imagine new ways of working with social media data in this post-open-API era.

No matter the path, the preservation of social media and the web will require not only our willingness to collaborate and share but also our recognition that its preservation is our responsibility. These platforms will not protect us in the physical world or the worlds we’ve created online.

References

Bruns, A. (2019). After the ‘APIcalypse’: Social media platforms and their fight against critical scholarly research. Information, Communication & Society. https://www.tandfonline.com/doi/abs/10.1080/1369118X.2019.1637447

Freelon, D. (2018). Computational research in the post-API age. Political Communication. https://www.tandfonline.com/doi/abs/10.1080/10584609.2018.1477506

Freelon, D., McIlwain, C. D., & Clark, M. D. (2016). Beyond the Hashtags: #Ferguson, #BlackLivesMatter, and the struggle for online justice. Center for Media & Social Impact. http://www.cmsimpact.org/sites/default/files/beyond_the_hashtags_2016.pdf

Proferes, N., & Summers, E. (2019). Algorithms and agenda-setting in Wikileaks’ #Podestaemails release. Information, Communication & Society, 22(11), 1630–1645. https://doi.org/10.1080/1369118X.2019.1626469

I’m a software developer at @umd_mith & study archives on/of the web at @iSchoolUMD