Analysis on #34 #75

Soumya0803 · 2019-03-19T17:11:51Z

No description provided.

birdsarah

Hi, two big things here:

Your notebook articulates what you've done, but it does not articulate why. What does this analysis help us understand? Why are you interested in this question? What do you want to know about fingerprinting, tracking, ...? How does this analysis further the work on the core issue, "Can we build a heuristic for browser attribute fingerprinting?" (#34)?
Your methodology does not support your stated aim. You say "It calculates the percentage of each of the three scripts with respect to the total number of scripts" but that's not what you've done. Please think about why this might be, and update your methodology. I'm deliberately not telling you the answer, to give you the opportunity to think more about what the data actually is yourself.

Smaller things:

You've used a 10% sample. Is that appropriate for the analysis you're doing?
You have done a lot of unnecessary computation, which I'm assuming took up RAM and time. I don't see any need for any of the df.compute() calls you've made.
The result of the df.compute() calls is that you've output a lot of data which adds nothing to my reading of the analysis you've performed. Try to keep things clean so the knowledge you're generating is easy to read.
You are manually transcribing counts into values. Use a variable.
It's not clear to me why in cell 40 and cell 45 you have the same values.
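
To illustrate the points about unnecessary output and hand-transcribed counts, here is a minimal sketch in pandas. The data is a toy stand-in, and the `script_url` column name and the `"fingerprint"` substring are assumptions for illustration, not taken from the notebook under review. The idea is to reduce to scalar counts, hold them in variables, and derive every later figure from those variables.

```python
import pandas as pd

# Toy stand-in for the crawl data; `script_url` is an assumed column name.
df = pd.DataFrame({
    "script_url": [
        "https://a.example/fingerprint2.min.js",
        "https://a.example/fingerprint2.min.js",
        "https://b.example/app.js",
        "https://c.example/hs-analytics.js",
    ]
})

# Reduce to scalars and keep them in variables,
# rather than displaying whole frames and copying numbers by hand.
n_rows = len(df)
n_fingerprint_rows = int(
    df["script_url"].str.contains("fingerprint", regex=False).sum()
)

# Every derived figure is computed from the variables above.
pct_fingerprint_rows = 100 * n_fingerprint_rows / n_rows
print(f"{n_fingerprint_rows} of {n_rows} rows ({pct_fingerprint_rows:.1f}%)")
```

With a dask dataframe the same expressions stay lazy, so only the final scalar aggregates would need to be computed, not the full frame.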

Soumya0803 · 2019-03-21T13:40:06Z

Hi @birdsarah, I realized I wrongly assumed that each row holds a unique script, and I did not account for redundancy. I first need to find the total count of unique scripts and the counts of unique fingerprintjs, hs-analytics, and akam scripts.
I should be using the value_1000_only dataset, which contains all the rows of the dataset but truncates the value field, keeping only the first 1000 characters in a column called value_1000.
I will keep in mind to use df.head() instead of df.compute() to keep things readable.
I am working on making these changes and on adding the other details requested in the first point.
Thanks for reviewing.
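
The revised plan above (count unique scripts first, then compute each family's share) could be sketched roughly as follows. This is only an illustration on toy data: the `script_url` column name, the substring patterns, and the helper `percent_of_unique_scripts` are all hypothetical and are not the code from this pull request.

```python
import pandas as pd

# Toy data: the same script appears on several rows, as in the real dataset.
df = pd.DataFrame({
    "script_url": [
        "https://a.example/fingerprint2.min.js",
        "https://a.example/fingerprint2.min.js",
        "https://a.example/fingerprint2.min.js",
        "https://b.example/app.js",
        "https://c.example/hs-analytics.js",
        "https://d.example/akam/10/abc",
    ]
})

# Assumed substrings used to recognise each script family.
patterns = {
    "fingerprintjs": "fingerprint",
    "hs-analytics": "hs-analytics",
    "akam": "akam",
}


def percent_of_unique_scripts(frame, patterns):
    """Return each family's share of the *unique* script URLs, in percent."""
    unique_urls = frame["script_url"].drop_duplicates()
    n_unique = len(unique_urls)
    return {
        name: 100 * int(unique_urls.str.contains(sub, regex=False).sum()) / n_unique
        for name, sub in patterns.items()
    }


per_unique = percent_of_unique_scripts(df, patterns)

# For contrast: the per-row share, which over-counts scripts that repeat.
per_row_fingerprint = 100 * int(
    df["script_url"].str.contains("fingerprint", regex=False).sum()
) / len(df)

print(per_unique)
print(per_row_fingerprint)
```

On this toy data the fingerprint script makes up half of the rows but only a quarter of the unique scripts, which is the kind of gap the row-uniqueness assumption hides.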

percent_analyses

c85a3f1

birdsarah suggested changes Mar 20, 2019

Soumya0803 changed the title ~~Analysis on #34, calculating percentage of scripts present in dataset~~ Analysis on #34, calculating percentage of scripts present in dataset [WIP] Mar 21, 2019

Soumya0803 added 2 commits April 9, 2019 22:39

Merge remote-tracking branch 'upstream/master' into analyses

14906b7

issue34_analysis

55027ec

Soumya0803 force-pushed the analyses branch from acf7a01 to 55027ec Compare April 9, 2019 17:25

issue34_analysis adding links

cfba3e9

Soumya0803 changed the title ~~Analysis on #34, calculating percentage of scripts present in dataset [WIP]~~ Analysis on #34 Apr 9, 2019
