Get all twitter posts by any user | CDX API use case
Ever wanted to get the URLs of all tweets, including the deleted ones, for a Twitter user? If so, you are in luck: the Wayback Machine makes it possible. Before we go ahead, I would like to clarify that the title is a bit clickbaity: we can't get every tweet URL, but we can certainly get all the tweets that were archived on the Wayback Machine. Also, this guide is for educational purposes only.
The Wayback Machine has three useful public APIs: the SavePageNow API, the Availability API, and the CDX API. To fetch all archived tweets by a user, we will use the CDX API.
What is the CDX API?
The wayback-cdx-server is a standalone HTTP servlet that serves the index that the wayback machine uses to lookup captures. The index format is known as 'cdx' and contains various fields representing the capture, usually sorted by URL and date. The CDX server responds to GET requests and returns the CDX data. [ref]
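To make the "responds to GET requests" part concrete, here is a minimal sketch of how such a request URL could be assembled by hand. The endpoint and the `output` and `limit` parameters are taken from the public CDX server documentation as I understand it; the helper name `build_cdx_query` is hypothetical and is not part of waybackpy. The sketch only builds the URL and does not send the request.

```python
from urllib.parse import urlencode

# Public CDX endpoint (assumed from the CDX server documentation).
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(url, **params):
    """Return the GET URL for a CDX lookup of `url` with optional parameters."""
    query = {"url": url}
    query.update(params)
    # urlencode percent-encodes reserved characters such as "/".
    return CDX_ENDPOINT + "?" + urlencode(query)


print(build_cdx_query("twitter.com/jack", output="json", limit=5))
```

Opening the printed URL in a browser should return a handful of index rows for the page. waybackpy wraps this kind of request, so the rest of this post does not need to build URLs manually.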
We will use the Twitter user jack in the example code.
In this section, we will be retrieving all archives for the webpage at https://twitter.com/jack.
The WaybackMachineCDXServerAPI class is the interface for the CDX API. It can take many arguments, including url and user_agent. Other arguments will be explored later in this post.
from waybackpy import WaybackMachineCDXServerAPI
user_agent = "Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/604.1"
url = "https://twitter.com/jack"
wayback = WaybackMachineCDXServerAPI(url=url, user_agent=user_agent) # only the exact match
snapshots = wayback.snapshots() # <class 'generator'>
for snapshot in snapshots:
    print(snapshot.archive_url)
Equivalent CLI interface command:
waybackpy --url "https://twitter.com/jack" --user-agent "Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/604.1" --cdx --cdx-print archiveurl
The snapshots method of WaybackMachineCDXServerAPI returns a generator rather than a list, because a list could exhaust memory. For some URLs, such as https://www.google.com, there are millions of archives.
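Because the result is a generator, you can stop early instead of walking through every archive. The sketch below shows one way to do that with itertools.islice. The `take` helper and the `fake_snapshots` stand-in generator are hypothetical names used only for illustration, so the example runs without network access.

```python
from itertools import islice


def take(iterable, n):
    """Return at most the first n items of an iterable as a list."""
    return list(islice(iterable, n))


def fake_snapshots():
    """Stand-in for wayback.snapshots(): lazily yields placeholder items forever."""
    count = 0
    while True:
        count += 1
        yield f"snapshot-{count}"


# Only three items are ever produced, even though the generator is endless.
print(take(fake_snapshots(), 3))
```

With waybackpy, the same idea would presumably look like `take(wayback.snapshots(), 10)`, which reads only the first ten snapshots.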
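The first example matches only the exact URL, so it returns captures of the profile page but not of individual tweets. Notice the argument prefix passed to the match_type parameter in the next example. This tells the CDX API to return the snapshots (archives) whose URLs have https://twitter.com/jack as a prefix.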
from waybackpy import WaybackMachineCDXServerAPI
user_agent = "Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/604.1"
url = "https://twitter.com/jack"
wayback = WaybackMachineCDXServerAPI(url=url, user_agent=user_agent, match_type="prefix") # prefix match enabled
snapshots = wayback.snapshots() # <class 'generator'>
for snapshot in snapshots:
    print(snapshot.archive_url)
Equivalent CLI interface command:
waybackpy --url "https://twitter.com/jack" --user-agent "Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/604.1" --cdx --cdx-print archiveurl --match-type prefix
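A prefix match is purely textual, so it can also return URLs that are not tweets by jack, for example other pages under the profile or usernames that merely start with "jack". Below is a minimal sketch of one way to keep only tweet URLs. It assumes tweet URLs look like https://twitter.com/<username>/status/<id> (with the older /statuses/ form and an optional port also accepted), and the helper names `tweet_pattern` and `tweet_id` are hypothetical. It works on plain URL strings; with waybackpy you would presumably pass the snapshot's original URL.

```python
import re


def tweet_pattern(username):
    """Compile a pattern matching tweet URLs that belong to `username`."""
    return re.compile(
        r"^https?://(?:www\.|mobile\.)?twitter\.com(?::\d+)?/"
        + re.escape(username)
        + r"/status(?:es)?/(\d+)",
        re.IGNORECASE,
    )


def tweet_id(original_url, username):
    """Return the tweet ID as a string, or None if the URL is not a tweet by `username`."""
    match = tweet_pattern(username).match(original_url)
    return match.group(1) if match else None


candidates = [
    "https://twitter.com/jack/status/20",
    "https://twitter.com/jackson/status/123",  # different user, same prefix
    "https://twitter.com/jack",                # profile page, not a tweet
]
# Keep only the URLs that are tweets by "jack".
print([url for url in candidates if tweet_id(url, "jack")])
```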
The prefix query returns many duplicate archives for the same URLs. How do we remove them? Collapsing is a feature supported by the CDX server that removes duplicates from the API response. In the next section, we make the API return only one archive per post.
The collapses parameter of the WaybackMachineCDXServerAPI class takes a list as an argument. See https://github.com/akamhy/waybackpy/wiki/Python-package-docs#collapsing for more info about collapsing.
from waybackpy import WaybackMachineCDXServerAPI
user_agent = "Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/604.1"
url = "https://twitter.com/jack"
wayback = WaybackMachineCDXServerAPI(url=url, user_agent=user_agent, match_type="prefix", collapses=["urlkey"])
# prefix match enabled with urlkey based collapsing active
snapshots = wayback.snapshots() # <class 'generator'>
for snapshot in snapshots:
    print(snapshot.archive_url)
Equivalent CLI interface command:
waybackpy --url "https://twitter.com/jack" --user-agent "Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/604.1" --cdx --cdx-print archiveurl --match-type prefix --collapse urlkey
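To build some intuition for what collapsing on urlkey does, here is a rough local imitation. As far as I understand the CDX server documentation, collapsing compares adjacent index rows and drops every row after the first in a run that shares the same field value. The `collapse_adjacent` helper and the sample rows are made up for illustration; the real work happens on the server.

```python
from itertools import groupby


def collapse_adjacent(rows, field):
    """Keep only the first row of each run of adjacent rows sharing `field`."""
    return [next(group) for _, group in groupby(rows, key=lambda row: row[field])]


# Illustrative rows only, not real CDX output.
rows = [
    {"urlkey": "com,twitter)/jack", "timestamp": "20100101000000"},
    {"urlkey": "com,twitter)/jack", "timestamp": "20110101000000"},
    {"urlkey": "com,twitter)/jack/status/20", "timestamp": "20120101000000"},
    {"urlkey": "com,twitter)/jack/status/20", "timestamp": "20130101000000"},
]

for row in collapse_adjacent(rows, "urlkey"):
    print(row["urlkey"], row["timestamp"])
```

Since the index is usually sorted by URL and date, collapsing on urlkey leaves the earliest capture of each distinct URL in this sketch.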
The start_timestamp and end_timestamp parameters each take a timestamp as an argument. The range is inclusive and is specified in the same 1 to 14 digit format used for Wayback captures: yyyyMMddhhmmss. Visit https://github.com/akamhy/waybackpy/wiki/Python-package-docs#filtering for more information.
from waybackpy import WaybackMachineCDXServerAPI
user_agent = "Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/604.1"
url = "https://twitter.com/jack"
wayback = WaybackMachineCDXServerAPI(url=url, user_agent=user_agent, match_type="prefix", collapses=["urlkey"], start_timestamp="2010", end_timestamp="2014")
# timeframe bound prefix matching enabled along with active urlkey based collapsing
snapshots = wayback.snapshots() # <class 'generator'>
for snapshot in snapshots:
    print(snapshot.archive_url)
Equivalent CLI interface command:
waybackpy --url "https://twitter.com/jack" --user-agent "Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/604.1" --cdx --cdx-print archiveurl --match-type prefix --collapse urlkey --to 2014 --from 2010
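If your time range starts out as Python datetime objects, the full 14 digit form can be produced with strftime. This is a small sketch using only the standard library; the helper names `to_wayback_timestamp` and `from_wayback_timestamp` are hypothetical and are not part of waybackpy.

```python
from datetime import datetime

# yyyyMMddhhmmss expressed as a strftime/strptime pattern.
WAYBACK_FORMAT = "%Y%m%d%H%M%S"


def to_wayback_timestamp(moment):
    """Format a datetime as a 14 digit Wayback-style timestamp string."""
    return moment.strftime(WAYBACK_FORMAT)


def from_wayback_timestamp(stamp):
    """Parse a full 14 digit Wayback-style timestamp back into a datetime."""
    return datetime.strptime(stamp, WAYBACK_FORMAT)


start = to_wayback_timestamp(datetime(2010, 1, 1))
end = to_wayback_timestamp(datetime(2014, 12, 31, 23, 59, 59))
print(start, end)
```

The resulting strings could then be passed as start_timestamp and end_timestamp in place of the shorter "2010" and "2014" used in the example.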