This repository was archived by the owner on Dec 22, 2021. It is now read-only.

Analysis on #34 #75

Open
wants to merge 4 commits into
base: master

Conversation

Soumya0803
Contributor

No description provided.

Contributor

birdsarah left a comment

Hi, two big things here:

  1. Your notebook articulates what you've done, but it does not articulate why. What does this analysis help us understand? Why are you interested in this question? What do you want to know about fingerprinting, tracking, and so on? How does this analysis further the work on the core issue in #34, "Can we build a heuristic for browser attribute fingerprinting?"

  2. Your methodology does not support your stated aim. You say "It calculates the percentage of each of the three scripts with respect to the total number of scripts" but that's not what you've done. Please think about why this might be, and update your methodology. I'm deliberately not telling you the answer, to give you an opportunity to think more about what the data is yourself.

Smaller things:

  1. You've used a 10% sample. Is that appropriate for the analysis you're doing?
  2. You have done a lot of unnecessary computation, which I assume took up RAM and time. I don't see a need for any of the df.compute() calls you've made.
  3. As a result of the df.compute() calls, you've output a lot of data that adds nothing to my reading of the analysis you've performed. Try to keep things clean so the knowledge you're generating is easy to read.
  4. You are manually transcribing counts into values. Store each count in a variable instead.
  5. It's not clear to me why cell 40 and cell 45 have the same values.

@Soumya0803
Contributor Author

Hi @birdsarah, I realized I wrongly assumed that each row holds a unique script and did not account for the same script appearing in many rows. I first need to find the total count of unique scripts, and then the counts of unique fingerprintjs, hs-analytics, and akam scripts.
I should be using the value_1000_only dataset, which contains all the rows of the dataset but truncates the value field, keeping only the first 1000 characters in a column called value_1000.
I will keep in mind to use df.head() instead of df.compute() to keep things readable.
I am working on making these changes and on adding the details requested in your first point.
Thanks for reviewing.
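
To make the row-versus-script distinction concrete, here is a minimal sketch in plain Python. The sample rows, the `script_url` field name, and the helper `unique_script_share` are all assumptions made up for illustration; they are not taken from the notebook. It shows how a percentage computed over rows can differ from one computed over unique scripts, and it keeps every count in a variable rather than transcribing it by hand.

```python
# Hypothetical sample rows: (script_url, symbol). These URLs are invented for
# illustration; in the real dataset the same script appears in many rows.
rows = [
    ("https://cdn.example/fingerprintjs/fp.js", "window.navigator.userAgent"),
    ("https://cdn.example/fingerprintjs/fp.js", "window.screen.colorDepth"),
    ("https://cdn.example/fingerprintjs/fp.js", "window.navigator.plugins"),
    ("https://js.example/hs-analytics/1.js", "window.document.cookie"),
    ("https://site.example/akam/10/abc", "window.navigator.userAgent"),
    ("https://site.example/akam/10/abc", "window.document.cookie"),
    ("https://site.example/app.js", "window.localStorage"),
    ("https://site.example/app.js", "window.document.cookie"),
    ("https://other.example/main.js", "window.name"),
    ("https://third.example/lib.js", "window.name"),
]

# Substrings used to recognise the three script families named in the thread.
patterns = ["fingerprintjs", "hs-analytics", "akam"]


def unique_script_share(rows, patterns):
    """Return (total unique scripts, {pattern: percent of unique scripts})."""
    unique_scripts = {url for url, _ in rows}  # de-duplicate by script URL
    total_unique = len(unique_scripts)
    shares = {}
    for pattern in patterns:
        matching = sum(1 for url in unique_scripts if pattern in url)
        shares[pattern] = 100 * matching / total_unique if total_unique else 0.0
    return total_unique, shares


total_unique, shares = unique_script_share(rows, patterns)

# Row-based share, for comparison: this counts calls, not scripts.
fp_rows = sum(1 for url, _ in rows if "fingerprintjs" in url)
row_share_fp = 100 * fp_rows / len(rows)

print(f"unique scripts: {total_unique}")
print(f"fingerprintjs share of rows: {row_share_fp:.1f}%")
print(f"fingerprintjs share of unique scripts: {shares['fingerprintjs']:.1f}%")
```

With a Dask DataFrame, a comparable unique count could probably be obtained with something like `df['script_url'].nunique().compute()`, assuming the column is named `script_url`; that reduces to a single number before anything is materialised.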

Soumya0803 changed the title from "Analysis on #34, calculating percentage of scripts present in dataset" to "Analysis on #34, calculating percentage of scripts present in dataset [WIP]" on Mar 21, 2019
Soumya0803 changed the title from "Analysis on #34, calculating percentage of scripts present in dataset [WIP]" to "Analysis on #34" on Apr 9, 2019

2 participants