
Get all twitter posts by any user | CDX API use case

Akash Mahanty edited this page Apr 2, 2022 · 1 revision

Ever wanted to get the URLs of all tweets by a Twitter user, including the deleted ones? If so, you are in luck: the Wayback Machine makes it possible. Before going further, I should clarify that the title is a bit clickbaity: we can't get the URLs of every tweet, only of the tweets that were archived on the Wayback Machine. Also, this guide is for educational purposes only.

Wayback Machine has three useful public APIs: SavePageNow API, Availability API, and the CDX API. For fetching all archived tweets by a user, we will be using the CDX API.

What is CDX API?
The wayback-cdx-server is a standalone HTTP servlet that serves the index that the Wayback Machine uses to look up captures. The index format is known as 'cdx' and contains various fields representing the capture, usually sorted by URL and date. The CDX server responds to GET requests and returns the CDX data. [ref]
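Since the CDX server answers plain GET requests, a query is just a URL with parameters. The following is a minimal sketch of how such a URL can be assembled; build_cdx_query is a hypothetical helper written for this illustration (it is not part of waybackpy), and the output and limit parameters are shown only as examples of CDX server options. Nothing is sent over the network here.

```python
from urllib.parse import urlencode

# Public CDX search endpoint of the Wayback Machine.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(url, **params):
    """Return the GET URL for a CDX lookup of `url` (hypothetical helper)."""
    query = {"url": url, **params}
    # urlencode percent-encodes the values, so "/" becomes "%2F".
    return f"{CDX_ENDPOINT}?{urlencode(query)}"


query_url = build_cdx_query("twitter.com/jack", output="json", limit=5)
print(query_url)
```

waybackpy builds and sends requests like this for you, so the rest of this guide uses its WaybackMachineCDXServerAPI class instead of raw URLs.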

We will use the Twitter user jack in the example code.

Retrieving all archives of the Twitter profile of a user

In this section, we will be retrieving all archives for the webpage at https://twitter.com/jack.

The WaybackMachineCDXServerAPI class is the interface for the CDX API. It can take many arguments including url and user_agent. Other arguments will be explored later in this post.

from waybackpy import WaybackMachineCDXServerAPI

user_agent = "Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/604.1"
url = "https://twitter.com/jack"

wayback = WaybackMachineCDXServerAPI(url=url, user_agent=user_agent) # only the exact match

snapshots = wayback.snapshots() # <class 'generator'>

for snapshot in snapshots:
    print(snapshot.archive_url)

Equivalent CLI interface command:

waybackpy --url "https://twitter.com/jack" --user-agent "Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/604.1" --cdx --cdx-print archiveurl

The snapshots method of WaybackMachineCDXServerAPI returns a generator rather than a list, because a list could exhaust memory: some URLs, such as https://www.google.com, have millions of archives.
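A practical consequence of getting a generator is that you can stop after a handful of items without loading everything. Here is a minimal sketch using itertools.islice; fake_snapshots is a hypothetical stand-in for wayback.snapshots() with made-up archive URLs, so the example runs offline.

```python
from itertools import islice


def fake_snapshots():
    """Hypothetical stand-in for wayback.snapshots(): yields archive URLs lazily."""
    n = 0
    while True:  # pretend the list of archives is effectively endless
        timestamp = 20100101000000 + n  # made-up capture timestamps
        yield f"https://web.archive.org/web/{timestamp}/https://twitter.com/jack"
        n += 1


# Only three items are ever produced; the rest of the generator is never evaluated.
first_three = list(islice(fake_snapshots(), 3))

for archive_url in first_three:
    print(archive_url)
```

The same islice call should work on any generator, so it can be applied to the real snapshots generator when you only want a quick sample.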

Retrieving all archives for all posts by a Twitter user

Notice the argument "prefix" passed to the match_type parameter. It tells the CDX API to return every snapshot (archive) whose URL has https://twitter.com/jack as a prefix.

from waybackpy import WaybackMachineCDXServerAPI

user_agent = "Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/604.1"
url = "https://twitter.com/jack"

wayback = WaybackMachineCDXServerAPI(url=url, user_agent=user_agent, match_type="prefix") # prefix match enabled
snapshots = wayback.snapshots() # <class 'generator'>

for snapshot in snapshots:
    print(snapshot.archive_url)

Equivalent CLI interface command:

waybackpy --url "https://twitter.com/jack" --user-agent "Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/604.1" --cdx --cdx-print archiveurl --match-type prefix

But the result contains many duplicate archives of the same URLs. How do we remove them? Collapsing is a CDX server feature that removes duplicates from the API response. In the next section, we make the API stop returning multiple archives of the same post.

Making the CDX API not return multiple archives for the same URL

The collapses parameter of the WaybackMachineCDXServerAPI class takes a list of fields to collapse on. See https://github.com/akamhy/waybackpy/wiki/Python-package-docs#collapsing for more info about collapsing.
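To get a feel for what collapsing on urlkey does, here is a small offline sketch. It assumes the behaviour described in the CDX server documentation, where only the first of a run of adjacent rows with the same field value is kept; the rows are made up and collapse_adjacent is a hypothetical helper, not the server's actual implementation.

```python
from itertools import groupby

# Made-up (urlkey, timestamp) rows, sorted by urlkey and then by timestamp.
rows = [
    ("com,twitter)/jack", "20100101000000"),
    ("com,twitter)/jack", "20110101000000"),
    ("com,twitter)/jack/status/20", "20120101000000"),
    ("com,twitter)/jack/status/20", "20130101000000"),
]


def collapse_adjacent(rows, key_index=0):
    """Keep only the first row of each run of adjacent rows sharing a field."""
    return [next(group) for _, group in groupby(rows, key=lambda row: row[key_index])]


collapsed = collapse_adjacent(rows)

for urlkey, timestamp in collapsed:
    print(urlkey, timestamp)
```

With these sample rows, one capture per urlkey survives, which is the effect we want for a list of tweets.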

from waybackpy import WaybackMachineCDXServerAPI

user_agent = "Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/604.1"
url = "https://twitter.com/jack"

wayback = WaybackMachineCDXServerAPI(url=url, user_agent=user_agent, match_type="prefix", collapses=["urlkey"]) 
#  prefix match enabled with urlkey based collapsing active

snapshots = wayback.snapshots() # <class 'generator'>

for snapshot in snapshots:
    print(snapshot.archive_url)

Equivalent CLI interface command:

waybackpy --url "https://twitter.com/jack" --user-agent "Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/604.1" --cdx --cdx-print archiveurl --match-type prefix --collapse urlkey

Retrieving archives of tweets between a start date and an end date

The start_timestamp and end_timestamp parameters each take a timestamp as an argument. The range is inclusive, and timestamps are specified in the same 1 to 14 digit format used for Wayback captures: yyyyMMddhhmmss. Visit https://github.com/akamhy/waybackpy/wiki/Python-package-docs#filtering for more information.
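If your date range starts out as Python datetime objects, the full 14 digit form can be produced with strftime. A minimal sketch; to_wayback_timestamp is a hypothetical helper name chosen for this illustration.

```python
from datetime import datetime


def to_wayback_timestamp(moment):
    """Format a datetime as a 14 digit yyyyMMddhhmmss string (hypothetical helper)."""
    return moment.strftime("%Y%m%d%H%M%S")


start = to_wayback_timestamp(datetime(2010, 1, 1))
end = to_wayback_timestamp(datetime(2014, 12, 31, 23, 59, 59))

print(start, end)
```

The resulting strings are in the format described above, so they can be supplied wherever a timestamp is expected. The example that follows uses the shorter year-only form instead.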

from waybackpy import WaybackMachineCDXServerAPI

user_agent = "Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/604.1"
url = "https://twitter.com/jack"

wayback = WaybackMachineCDXServerAPI(url=url, user_agent=user_agent, match_type="prefix", collapses=["urlkey"], start_timestamp="2010", end_timestamp="2014") 
#  timeframe bound prefix matching enabled along with active urlkey based collapsing

snapshots = wayback.snapshots() # <class 'generator'>

for snapshot in snapshots:
    print(snapshot.archive_url)

Equivalent CLI interface command:

waybackpy --url "https://twitter.com/jack" --user-agent "Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/604.1" --cdx --cdx-print archiveurl --match-type prefix --collapse urlkey --to 2014 --from 2010
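A prefix match on https://twitter.com/jack can also return pages that are not tweets, for example other paths under the profile or other usernames that merely start with "jack". The sketch below is one possible way to keep only tweet URLs; it assumes archive URLs of the form https://web.archive.org/web/<timestamp>/<original URL> and tweet URLs of the form twitter.com/jack/status/<id>. The helper names and the sample archive URLs are made up for this illustration.

```python
import re

# Archive URL layout assumed here: https://web.archive.org/web/<timestamp>/<original URL>
ARCHIVE_URL = re.compile(r"^https?://web\.archive\.org/web/\d{1,14}/(.+)$")
# Tweet URL layout assumed here: twitter.com/jack/status/<numeric id>
TWEET_URL = re.compile(r"^https?://(?:www\.)?twitter\.com/jack/status/(\d+)", re.IGNORECASE)


def tweet_id(archive_url):
    """Return the tweet id found in an archive URL, or None if it is not a tweet."""
    archived = ARCHIVE_URL.match(archive_url)
    if archived is None:
        return None
    tweet = TWEET_URL.match(archived.group(1))
    return tweet.group(1) if tweet else None


# Made-up archive URLs standing in for snapshot.archive_url values.
sample_archive_urls = [
    "https://web.archive.org/web/20120101000000/https://twitter.com/jack/status/20",
    "https://web.archive.org/web/20120101000000/https://twitter.com/jack/following",
    "https://web.archive.org/web/20120101000000/https://twitter.com/jackson/status/1",
]

tweet_ids = [tid for tid in map(tweet_id, sample_archive_urls) if tid is not None]
print(tweet_ids)
```

In real use you would feed snapshot.archive_url values from the loop above into tweet_id instead of the sample list.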