This repository was archived by the owner on Mar 10, 2025. It is now read-only.

Commit c81c4b7

Initial outline draft.
0 parents  commit c81c4b7

2 files changed, +78 -0 lines changed

.gitignore

+3
@@ -0,0 +1,3 @@
*.pyc
*.swp
*.swo

README.md

+75
@@ -0,0 +1,75 @@
Information Management for Journalists
======================================

Introducing data
----------------

* Types of data:
    * Simple types
        * Numbers (bytes, ints, floats)
        * Text (chars, strings)
        * Boolean
        * Enumerations
        * Binary blobs (images)
    * Complex types
        * Lists (1D)
        * Tables (2D)
        * Relationships (3 or more dimensions)
        * Objects (hierarchical data)
* Metadata:
    * Unique ids
    * Constraints ("business rules")
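
As a rough illustration of the types listed above, here is a minimal Python sketch. The names and values (a made-up budget record, the `Status` enumeration, the `satisfies_constraints` rule) are assumptions invented for the example and are not part of the outline. Relationships between tables are left out to keep the sketch short.

```python
from enum import Enum


class Status(Enum):
    """An enumeration: a value drawn from a fixed set of choices."""
    DRAFT = "draft"
    PUBLISHED = "published"


# Simple types
line_count = 3                         # number (int)
amount = 1250.75                       # number (float)
department = "Parks"                   # text (string)
verified = True                        # Boolean
status = Status.DRAFT                  # enumeration
png_magic = bytes([137, 80, 78, 71])   # binary blob (first bytes of a PNG image)

# Complex types
years = [2019, 2020, 2021]             # list (1D)
table = [                              # table (2D): a header row plus data rows
    ["year", "amount"],
    [2019, 1000.0],
    [2020, 1250.75],
]
record = {                             # object (hierarchical data)
    "id": 1,                           # metadata: a unique id
    "department": {"name": department, "tags": ["parks", "recreation"]},
    "amount": amount,
}


def satisfies_constraints(rec):
    """A hypothetical business rule: ids are positive, amounts are not negative."""
    return rec["id"] > 0 and rec["amount"] >= 0
```

A record that breaks the rule, such as one with a negative amount, would be rejected before it is stored.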

Storing data
------------

* Lists: Plain text
* Tables: Spreadsheets (CSV, Excel, Google Docs)
* Relationships: SQL Databases (MySQL, PostgreSQL, SQLite)
* Objects: Flat files (JSON, YAML) or Object Databases (MongoDB, CouchDB)
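
To make the storage options above concrete, the following sketch writes the same two records as CSV, as JSON, and into an in-memory SQLite table, using only the Python standard library. The records, field names, and the `budget` table are hypothetical and chosen for illustration.

```python
import csv
import io
import json
import sqlite3

# Hypothetical records; every value is text, as it would be when read from a CSV.
rows = [
    {"id": "1", "department": "Parks", "amount": "1250.75"},
    {"id": "2", "department": "Libraries", "amount": "980.00"},
]
FIELDS = ["id", "department", "amount"]


def to_csv(records):
    """Serialize records as CSV text (a table)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


def from_csv(text):
    """Parse CSV text back into a list of dicts."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


# Table: CSV
csv_text = to_csv(rows)

# Object: JSON, which can nest the rows inside a larger structure
json_text = json.dumps({"budget": rows}, indent=2)

# Relationships: a SQL database, here SQLite held in memory
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE budget ("
    "id INTEGER PRIMARY KEY, "
    "department TEXT NOT NULL, "
    "amount REAL CHECK (amount >= 0))"
)
conn.executemany(
    "INSERT INTO budget VALUES (?, ?, ?)",
    [(int(r["id"]), r["department"], float(r["amount"])) for r in rows],
)
total = conn.execute("SELECT SUM(amount) FROM budget").fetchone()[0]
```

The CSV and JSON forms round-trip as text, while the SQL table adds typed columns and lets the database enforce a constraint such as `amount >= 0`.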

Tracking changes
----------------

* What constitutes a change in version?
    * The timestamp: atomic unit of change
    * Sometimes it makes sense to record that things don't change too
* Files and folders (a.k.a. naming things)
    * Always name from less specific to more specific
    * Always timestamp things you're changing
    * Always make copies (you have unlimited hard drive space--use it)
* Dropbox: The poor man's version control
* git: The engineer's version control
    * git in 5 minutes
    * Commit messages matter
    * Works for text formats (CSV, flat files), but not for binaries
    * Not designed for big data
* Change tables and data warehouses: creating problems to solve problems
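
The three naming rules above (less specific to more specific, always timestamp, always copy) could be sketched in Python as follows. The helper names `timestamped_name` and `snapshot`, the underscore separator, and the timestamp layout are assumptions made for this example, not conventions taken from the outline.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path


def timestamped_name(parts, when, suffix):
    """Join name parts from less specific to more specific, then add a UTC timestamp.

    `when` is expected to be a timezone-aware datetime.
    """
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return "_".join(list(parts) + [stamp]) + suffix


def snapshot(path, parts, when=None):
    """Copy `path` next to itself under a timestamped name.

    The original file is left untouched, so there is always a copy to go back to.
    """
    path = Path(path)
    when = when or datetime.now(timezone.utc)
    target = path.with_name(timestamped_name(parts, when, path.suffix))
    shutil.copy2(path, target)
    return target
```

For example, `timestamped_name(["budget", "chicago", "parks"], datetime(2013, 2, 1, 9, 30, tzinfo=timezone.utc), ".csv")` yields `budget_chicago_parks_20130201T093000Z.csv`, a name that sorts first by topic and then by time.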


Data pipelines
--------------

* Why repeatable processes?
    * Ensure clear provenance of the data
    * Defend against attacks on integrity
    * Self-documenting workflow
    * Show your work
* Acquisition
    * Ways to get data
        * Curl
        * APIs
        * Scrapers
    * Documenting provenance
* Processing
    * Scripts
        * Bash scripts
        * csvkit
        * Python
    * Documenting order of operations
    * Naming things, revisited
    * Shims, adapters and glue code
    * Cron jobs
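
One way to picture a repeatable pipeline that documents its own provenance and order of operations is the sketch below. It is a minimal illustration under stated assumptions: the function names, the shape of the log, and the two cleaning steps are hypothetical, and acquisition is represented by bytes already in hand rather than a real download.

```python
import hashlib
from datetime import datetime, timezone


def sha256_of(data):
    """Return a hex digest that identifies this exact sequence of bytes."""
    return hashlib.sha256(data).hexdigest()


def run_pipeline(raw, source, steps):
    """Apply each step in order and record what was done to the data."""
    log = {
        "source": source,
        "acquired_at": datetime.now(timezone.utc).isoformat(),
        "input_sha256": sha256_of(raw),
        "steps": [],
    }
    data = raw
    for step in steps:
        data = step(data)
        # The order of entries in the log documents the order of operations.
        log["steps"].append(
            {"name": step.__name__, "output_sha256": sha256_of(data)}
        )
    return data, log


def strip_blank_lines(data):
    """Drop empty lines and end the result with a single newline."""
    lines = [line for line in data.splitlines() if line.strip()]
    return b"\n".join(lines) + b"\n"


def lowercase_header(data):
    """Lowercase the first line (the header row) and leave the rest as is."""
    header, _, rest = data.partition(b"\n")
    return header.lower() + b"\n" + rest
```

Because every step is a plain function of its input, running the same steps on the same bytes produces the same output and the same checksums, which gives a reader something concrete to verify.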