
The format of the raw data


Raw format

Let's look at a single file:

s3://wikimedia-pagecounts/2013/2013-09/pagecounts-20130901-000000.gz

In the first few lines, after some that say *, we see:

aa.b File:Admin_mop.PNG 1 9410
aa.b Special:CentralAutoLogin/createSession 2 1486
aa.b Special:Imagelist 1 631
aa.b Special:Recentchanges 1 635
aa.b User:Az1568 1 6663
aa.b User:EVula 1 7571
aa.b User:Manecke 1 284
aa.d Main_Page 2 6266

Column 1 is the name of the wiki (the project code), Column 2 is the URL-encoded page title, Column 3 is the number of hits, and Column 4 is the number of bytes transferred.
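For example, a quick per-wiki tally of hits can be pulled straight out of one of these files with awk (just a sketch, using the file named above):

# total hits (column 3) per wiki (column 1), top ten
zcat pagecounts-20130901-000000.gz | awk '{hits[$1] += $3} END {for (w in hits) print w, hits[w]}' | sort -k2 -nr | head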

Considerations for Hadoop processing

It's slightly awkward that the timestamp is not contained in the file itself; rather, it is encoded in the filename. If the Mapper or the InputFormat has access to the FileSplit, it can get the file name, so this information can be recovered and attached to each record.
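If you use Hadoop Streaming, for example, the input file name is passed to the mapper as a job property (map.input.file, which shows up in the mapper's environment as map_input_file on older releases and mapreduce_map_input_file on newer ones). A minimal streaming-mapper sketch, assuming one of those variable names is present in your Hadoop version:

#!/bin/sh
# streaming mapper: prefix every record with the timestamp taken from the
# input file name, e.g. pagecounts-20130901-000000.gz -> 20130901-000000
ts=$(basename "${mapreduce_map_input_file:-$map_input_file}" .gz | sed 's/^pagecounts-//')
awk -v ts="$ts" '{print ts "\t" $0}'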

The files in the current format are typically in the 50MB-100MB range, which is close to the 64MB block size HDFS uses by default; that is, they are sized reasonably for efficient processing. We can't split the gzip files, but there are so many of them that we can still read plenty of them concurrently.

If we're interested in filtering a subset of the data based on time, the original organization is excellent.
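Since the date and hour are right in the key names, a glob or prefix is enough to pick out a time slice. For instance, hadoop fs understands globs, so something like this (the s3n:// scheme and the credentials setup are assumptions about your cluster) lists just the first week of September 2013:

hadoop fs -ls 's3n://wikimedia-pagecounts/2013/2013-09/pagecounts-2013090[1-7]-*.gz'

The same pattern can be handed to a job as its input path.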

Pre-filtering

If we're initially interested in just the English Wikipedia ("en"), we find that roughly a third of the records (2239416 of 6353504, about 35%) concern English, and that these make up about 27% of the uncompressed data (81253340 of 305379210 bytes). (Probably other languages have a lot of %-escapes that bulk things up.)

$ zcat pagecounts-20130901-000000.gz | awk '{if ($1=="en") print}' | wc
2239416 8957664 81253340
$ zcat pagecounts-20130901-000000.gz | wc
6353504 25414016 305379210

So, to a first approximation, we could get a 3x speedup if we processed just the English data.

For further compression, note that if we get rid of the "en" at the beginning of each line and drop the downloaded-bytes count (I don't care how big the pages are myself), we get a considerable improvement in gzip-compressed size:

$ ls -l k*.gz
-rw-rw-r-- 1 ec2-user ec2-user 27804723 Dec 13 21:16 kompressed1.gz
-rw-rw-r-- 1 ec2-user ec2-user 19452591 Dec 13 21:17 kompressed2.gz
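The exact commands aren't recorded here; presumably kompressed1.gz is the English-only rows left intact and kompressed2.gz is the version with the wiki name and byte count stripped, along the lines of:

zcat pagecounts-20130901-000000.gz | awk '$1=="en"' | gzip -c > kompressed1.gz
zcat pagecounts-20130901-000000.gz | awk '$1=="en" {print $2 " " $3}' | gzip -c > kompressed2.gz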

The minimal data set required for working in English is thus about 23% of the size of the whole set. One would expect prefiltering to speed up the map tasks, and reduce their cost, by roughly that factor.

Note that doing the prefiltering in Hadoop would be awkward (I think) if we wanted to keep the output organized the same way as the input, because (1) a job with a reduce task will respray the data, so the "file names linked with dates" property is lost, and (2) even a map-only job will, by default, scramble the dates, because it numbers its output files sequentially in whatever order the inputs happen to be processed.

Probably, with the right kind of file output format, we could name the output files the way the originals were named, but I'll say that file output formats can be tricky to debug and often have problems that don't turn up in unit tests. Some non-Hadoop system that just builds a list of files to work on, downloads them, does something like

zcat pagecounts-20130901-000000.gz | awk '{if ($1=="en") print $2 " " $3}' | gzip -c > kompressed2.gz

and uploads them to another spot would do the trick.
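A minimal sketch of that non-Hadoop approach, assuming s3cmd is configured, that the bucket layout is the one shown at the top of this page, and with a made-up destination bucket name:

#!/bin/sh
# download each hourly file, keep only the English rows (title and hit count),
# and upload the shrunken file under the same name to another bucket
for key in $(s3cmd ls s3://wikimedia-pagecounts/2013/2013-09/ | awk '{print $4}'); do
  name=$(basename "$key")
  s3cmd get "$key" "$name"
  zcat "$name" | awk '$1=="en" {print $2 " " $3}' | gzip -c > "en-$name"
  s3cmd put "en-$name" "s3://my-filtered-pagecounts/2013/2013-09/$name"
  rm -f "$name" "en-$name"
done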

More thoughts on compression...

To look at one single file, I considered the impact of (1) removing the 'bytes' column, which I'm not interested in, and (2) compressing with bzip2 instead of gzip.

[ec2-user@ip-10-78-145-122 sizeCompare]$ ls -l
total 169396
-rw-rw-r-- 1 ec2-user ec2-user 74619886 Jan  5 20:24 pagecounts-20121225-050000.gz
-rw-rw-r-- 1 ec2-user ec2-user 44227143 Jan  5 20:28 pagecounts-nobytes.bz2
-rw-rw-r-- 1 ec2-user ec2-user 54423694 Jan  5 20:26 pagecounts-nobytes.gz
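The nobytes files were produced with something along these lines (the obvious reconstruction, not the recorded commands):

zcat pagecounts-20121225-050000.gz | awk '{print $1 " " $2 " " $3}' | gzip -c  > pagecounts-nobytes.gz
zcat pagecounts-20121225-050000.gz | awk '{print $1 " " $2 " " $3}' | bzip2 -c > pagecounts-nobytes.bz2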

What seems to be most pressing at the moment is storage costs; let's say those run at $300 a month. Dropping the bytes column shrinks the gzipped file to about 73% of its original size (54423694 vs. 74619886 bytes), which gets storage down to about $218 a month, a savings of roughly $80 a month. Over a two-year planning horizon that comes to about $1920, which justifies roughly 20 hours of work to realize the savings. Execution time of the jobs will also improve, so it will be cheaper to run jobs against this data, saving even more money.

The bzip2 case gets storage costs down to about $177 a month, saving $123 a month, or $2952 over the two-year horizon. But bzip2 is more expensive to decompress, so we'd need some real numbers on that, and some idea of how often we'll run jobs over the data, before deciding to bzip2-compress it.
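For the record, the arithmetic behind those figures, assuming the $300/month baseline and that storage cost scales linearly with compressed size:

full=74619886; nobytes_gz=54423694; nobytes_bz2=44227143
echo "gzip, no bytes:  \$$(echo "300 * $nobytes_gz / $full" | bc) per month"    # prints 218
echo "bzip2, no bytes: \$$(echo "300 * $nobytes_bz2 / $full" | bc) per month"   # prints 177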


Note that a more radical option is to filter out the many junk URIs that occur only infrequently. We could probably cut storage costs down to something more like $30 a month, and greatly speed things up, if we deleted URIs that show up fewer than a certain number of times per month. We could make the list of "good" URIs, build a Bloom filter from it, and then filter the records for that month against it to greatly shrink the data set; running jobs against this data would of course be much faster too.
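A Bloom filter needs a real library, but the idea can be sketched with an exact whitelist in awk, assuming the month's files have been downloaded locally. The caveat is that pass 1 below holds every distinct URI in memory, which is exactly the scaling problem the Bloom filter is meant to avoid; the 10-hits-per-month threshold is an arbitrary number for illustration:

# pass 1: whitelist of (wiki, title) pairs seen at least 10 times in the month
zcat pagecounts-201309*.gz | awk '{n[$1 FS $2] += $3} END {for (k in n) if (n[k] >= 10) print k}' > good-uris.txt

# pass 2: keep only the records whose (wiki, title) pair is on the whitelist
zcat pagecounts-20130901-000000.gz |
  awk 'NR==FNR {good[$1 FS $2]=1; next} ($1 FS $2) in good' good-uris.txt - |
  gzip -c > pagecounts-20130901-000000-filtered.gz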