Dat update #9

Especially in the social sciences, researchers depend on large, public datasets …

## Review of Existing Tools and Approaches

The Open Data Institute has [a nice post](https://theodi.org/blog/adapting-git-simple-data) outlining the challenges of using standard, software-oriented version control software (namely, git) for version control of data. The main issue is that git, like almost all version control systems, is designed to monitor changes to lines in text files. This makes a lot of sense for code, as well as for articles, text, etc., but it starts to make less sense for digital objects where a line is not a meaningful unit. This becomes especially clear when we try to version something like a comma-separated values (CSV) file (as the ODI post describes). Changing a single data field leads to a full-line change, even though only one cell actually changed. A similar problem emerges in XML, JSON, or other text-delimited formats (though note that [the Open Knowledge Foundation seems to favor JSON as a storage mode](http://dataprotocols.org/data-packages/)).
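The full-line-change problem is easy to demonstrate with Python's standard `difflib`, which computes the same kind of line-oriented diff git does. Changing a single cell in a CSV row causes the entire row to be reported as removed and re-added:

```python
import difflib

# Two versions of a tiny CSV: only one cell (Bob's age) differs.
old = "name,age,city\nAlice,30,Boston\nBob,25,Chicago\n".splitlines()
new = "name,age,city\nAlice,30,Boston\nBob,26,Chicago\n".splitlines()

# A line-oriented diff (what git effectively computes) reports the
# whole row as changed, even though only the "age" cell is different.
diff = list(difflib.unified_diff(old, new, lineterm=""))
print("\n".join(diff))
```

The diff output contains `-Bob,25,Chicago` and `+Bob,26,Chicago`: the whole record, not the single changed cell.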

Using the linebreak (`\n`) as the delimiter for tracking changes to an object does not work well for data. [daff](https://github.com/paulfitz/daff) is a JavaScript tool that tries to make up for that weakness. It provides a diff (comparison of two file versions) that respects the tabular structure of many data files by highlighting cell-specific changes. This is not a full version control system, though; it is simply a diff tool that can serve as a useful add-on to git.
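The kind of cell-level comparison daff performs can be sketched in a few lines (this is not daff's own code; `cell_diff` is a hypothetical helper that assumes both versions share the same header and row order):

```python
import csv
import io

def cell_diff(old_csv, new_csv):
    """Return {(row_index, column): (old_value, new_value)} for changed cells.

    A toy illustration of cell-level diffing: it assumes both CSV strings
    have identical headers and rows in the same order.
    """
    old_rows = list(csv.DictReader(io.StringIO(old_csv)))
    new_rows = list(csv.DictReader(io.StringIO(new_csv)))
    changes = {}
    for i, (o, n) in enumerate(zip(old_rows, new_rows)):
        for col in o:
            if o[col] != n[col]:
                changes[(i, col)] = (o[col], n[col])
    return changes

old = "name,age\nAlice,30\nBob,25\n"
new = "name,age\nAlice,30\nBob,26\n"
print(cell_diff(old, new))  # {(1, 'age'): ('25', '26')}
```

Unlike the line-oriented diff, this pinpoints exactly which cell changed; a real tool like daff additionally handles row insertion, deletion, and reordering.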

[A post on the Open Knowledge Foundation blog](http://blog.okfn.org/2013/07/02/git-and-github-for-data/) argues that there is some logic to line-based changesets because a standard CSV (or any tabular data file) typically records a single observation on its own line. Line-based version control then makes sense for recording changes to observations (thus privileging rows/observations over columns/variables).

[dat](https://dat.foundation/) set out to be a "git for data" and has since evolved into an open protocol for tracking and sharing data, enabling peer-to-peer file sharing and automatic version control over a synced filesystem. Dat is maintained as [datproject on GitHub](https://github.com/datproject), and a number of community tools and projects built with dat are collected under [dat-land](https://github.com/dat-land).

[A Q/A on the Open Data StackExchange](http://opendata.stackexchange.com/questions/748/is-there-a-git-for-data) points to some further resources for data-appropriate git alternatives, but there's nothing comprehensive.

[Universal Numeric Fingerprint (UNF)](http://guides.dataverse.org/en/latest/developers/unf/index.html) provides a potential strategy for version control of data. Indeed, that's specifically what it was designed to do. It's not a version control system per se, but it provides a hash that can be useful for determining when a dataset has changed. It has some nice features:
- File format independent (so better than an MD5)
But UNF is not perfect. The problems include:
- It is quite sensitive to data structure (e.g., "wide" and "long" representations of the same dataset produce different UNFs)
- It is not a version control system and provides essentially no insights into what changed, only that a change occurred
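The structure-sensitivity problem is easy to see with a drastically simplified stand-in for UNF (this is not the real UNF algorithm, which normalizes numbers to significant digits, handles encodings, and so on; `toy_fingerprint` is a hypothetical illustration that just canonicalizes values and hashes column by column):

```python
import hashlib

def toy_fingerprint(columns):
    """A simplified UNF-like fingerprint: canonicalize each value to a
    string, hash each column, then hash the combined column digests.
    Columns are sorted by name so column order does not matter."""
    digests = []
    for _name, values in sorted(columns.items()):
        canon = "\n".join(str(v) for v in values)
        digests.append(hashlib.sha256(canon.encode()).hexdigest())
    return hashlib.sha256("".join(digests).encode()).hexdigest()[:16]

# The same four measurements, stored "wide" and "long".
wide = {"x_2000": [1, 2], "x_2001": [3, 4]}
long_ = {"year": [2000, 2000, 2001, 2001], "x": [1, 2, 3, 4]}

print(toy_fingerprint(wide) == toy_fingerprint(long_))  # False
```

The two shapes encode identical information but hash differently, which is why a fingerprint like this can confirm *that* a dataset changed without saying *what* changed, and why reshaping alone changes the fingerprint.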

[DVC](https://github.com/iterative/dvc) (Data Version Control) is a data-versioning tool whose workflow closely mirrors git's. You can learn about it through [the getting-started guide](https://dvc.org/doc/get-started) and [the tutorial](https://dvc.org/doc/tutorial).

All of these tools also focus on the data themselves, rather than associated metadata (e.g., the codebook describing the data). While some data formats (e.g., proprietary formats like Stata's .dta and SPSS's .sav) encode this metadata directly in the file, it is not a common feature of widely used text-delimited formats. Sometimes codebooks are modified independently of data values and vice versa, but it is rare to see large public datasets provide detailed information about changes to either the data or the codebook, except in occasional releases.

Another major challenge to data versioning is that existing version control tools are not well-designed to handle provenance. When data are generated, stored, or modified, a software-oriented version control system has no obvious mechanism for recording *why* values in a dataset are what they are or why changes were made to particular values. A commit message might provide this information, but as soon as a value is changed again, the history of changes *to a particular value* is lost in the broader history of the data file as a whole.