
blacklight integration #273

Open
redshiftzero opened this issue Sep 28, 2020 · 5 comments

Comments

@redshiftzero
Contributor

The Markup released a tool last week for scanning sites for the use of ad trackers, third-party cookies, key logging, and session recording, among other technologies that privacy-conscious readers should know about. Links:

It would be worthwhile to integrate this tool into STN so that we can track the use of these technologies over time on major news sites. If we like this, there are a few questions to decide here:

  • How to (or whether to) incorporate these privacy metrics into the grading scheme? (one rough option is sketched below this list)
  • Which items to highlight on the leaderboard? We could reduce the set of HTTPS-related items on the leaderboard in order to make space for a small number of the privacy-related items (e.g. number of ad trackers perhaps).
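
To make the first question concrete, one entirely hypothetical option would be to keep the existing HTTPS grade as the primary score and subtract a small, capped privacy penalty on top. The weights and inputs below are guesses for illustration, not a proposal for the actual grading scheme:

```python
# Illustrative only: the weights, inputs, and penalty model are placeholders,
# not the actual STN grading scheme.
def privacy_adjusted_score(https_score: float, ad_trackers: int,
                           canvas_fingerprinting: bool, session_recording: bool) -> float:
    """Start from the existing HTTPS score and subtract a capped privacy penalty."""
    penalty = min(ad_trackers, 10) * 0.5  # at most 5 points for ad trackers
    if canvas_fingerprinting:
        penalty += 5
    if session_recording:
        penalty += 5
    return max(https_score - penalty, 0.0)
```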

Another question is how best to integrate the tool itself: to perform the scans we need node installed, and the container we currently use for scanning doesn't have it. After initial discussions today, it seemed like installing node in the container where scans are done is a reasonable/acceptable approach for this purpose, but noting here in case folks come up with other ideas.
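
For concreteness, here's a rough sketch of what the Python side of that could look like, assuming a hypothetical run_blacklight.js wrapper script (not something that exists in Blacklight today) that drives the scan and writes its findings to a JSON file:

```python
import json
import subprocess
import tempfile
from pathlib import Path

def blacklight_scan(domain: str) -> dict:
    """Invoke a hypothetical Node wrapper script and return its JSON findings."""
    with tempfile.TemporaryDirectory() as outdir:
        outfile = Path(outdir) / "inspection.json"
        # run_blacklight.js is a stand-in for whatever glue script we end up writing;
        # it would drive the Blacklight collector and write results to --out.
        subprocess.run(
            ["node", "run_blacklight.js", "--url", f"https://{domain}", "--out", str(outfile)],
            check=True,
            timeout=300,
        )
        return json.loads(outfile.read_text())
```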

@eloquence
Member

I would suggest starting by including the scan results in the detail view (e.g. https://securethe.news/sites/the-intercept) and the API. That gives us some time to live with the data and check for false positives/false negatives without immediately modifying the scores.
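
If we go that route, one lightweight way to hold the raw results without touching scoring could be an extra JSON field on the scan model. The model and field names below are illustrative, not the actual STN schema:

```python
from django.db import models

class Scan(models.Model):
    # ...existing HTTPS-related fields...

    # Raw Blacklight output stored verbatim, so the detail view and API can
    # surface it without it feeding into the grade yet.
    blacklight_results = models.JSONField(null=True, blank=True)
```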

At least from the web UI it looks like Blacklight identifies some particularly egregious practices:

  • Canvas fingerprinting
  • Session/keystroke recording

These may be good candidates for surfacing on the leaderboard in the near term. https://themarkup.org/blacklight?url=theintercept.com seems to employ its own scoring under the hood ("more than the average" number of trackers etc.) -- perhaps we could collaborate with them on a privacy score?

@conorsch
Contributor

conorsch commented Oct 6, 2020

I would suggest starting by including the scan results in the detail view [...] without immediately modifying the scores.

Agreed, that sounds like a modest investment, and allows us to add the integration with minimal commitment.

Another question is how best to integrate the tool itself

Regarding the architecture, I'll summarize out-of-band conversations with a few folks. The STN scanning code to date is all Python, and the Blacklight code is JS. We could try templating out JS files with the domain name hardcoded, evaluate them, and then read in the file that was written to disk. It might be cleanest to bolt a simple HTTP GET service onto the existing JS functionality, then poll that endpoint from the Python app code. That'd allow us to keep the Python & Node containers completely separate, and the Node container wouldn't need to be publicly accessible; it'd only be available to the app for local requests and responses.
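
To illustrate that second option, the Python side could stay as small as the sketch below, assuming the Node container exposes a hypothetical /scan endpoint on the internal network (the hostname, port, endpoint, and response shape are all guesses):

```python
import requests

# Reachable only on the internal network between containers; never exposed publicly.
# Hostname, port, endpoint, and response shape are assumptions, not Blacklight's API.
BLACKLIGHT_SERVICE = "http://blacklight:3000"

def fetch_blacklight_results(domain: str, timeout: int = 300) -> dict:
    """Ask the hypothetical Node sidecar to scan a domain and return its JSON findings."""
    response = requests.get(
        f"{BLACKLIGHT_SERVICE}/scan",
        params={"url": f"https://{domain}"},
        timeout=timeout,
    )
    response.raise_for_status()
    return response.json()
```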

There's still a bit of JS code to write to make that work, but having the Blacklight scanning logic separate from the bulk of the Wagtail code sounds worth the effort. Might be worth pinging the Markup folks if we have trouble cobbling together that solution.

@eloquence
Member

@redshiftzero Checking in, are you planning to work on this in the near future / already working on it? If so, will add to the web board for visibility.

@redshiftzero
Contributor Author

redshiftzero commented Oct 13, 2020

I'm not actively working on this right now, but I'll assign myself if/when I do (looks like I have permissions to do that now).

@conorsch
Contributor

Had a chat with Surya & Simone at the Markup recently, and they magnanimously offered to let us poll their API directly for inclusion in STN, rather than bundle up the Blacklight scanning code and re-run scans for each website ourselves. That's certainly a far sight simpler than pulling in the code ourselves! Looking forward to stubbing out some endpoints locally, although the question of how to present the findings on the site still leaves us a lot of options in terms of design.
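
The consuming code on our side could then stay very small; something along these lines, with the endpoint URL and parameters purely placeholders until we have their API docs in hand:

```python
import requests

# Placeholder only: the real endpoint, parameters, auth, and response shape
# will come from The Markup's API documentation.
MARKUP_BLACKLIGHT_API = "https://blacklight-api.example.org/scan"

def fetch_markup_scan(domain: str) -> dict:
    """Fetch Blacklight findings for a domain from the (hypothetical) hosted API."""
    response = requests.get(MARKUP_BLACKLIGHT_API, params={"url": domain}, timeout=120)
    response.raise_for_status()
    return response.json()
```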
