- person: an entity that has a first name `fname`, a last name `lname`, a date of birth `dob`, and a telephone number `phone`. A person is uniquely identified by their `fname`, `lname`, and `dob`.
- database: a PostgreSQL instance
- persons table: a table in the database containing person records
- person record: a person stored in the persons table
- XML input file: a file on disk that contains the data to be merged
- entry: an XML element contained in the XML input file that represents a person with the following form:
<member>
  <firstname>JOHN</firstname>
  <lastname>DOE</lastname>
  <date-of-birth>2002-02-01</date-of-birth>
  <phone>9548938821</phone>
</member>
- well-formed (said of an entry or a person record): having `fname`, `lname`, and `dob` non-empty and of the correct type
- importer: a program that takes the entries in the XML input file and asks the database to merge them with the person records in the persons table
- merge process: update the persons table in such a way that, for each entry:
  - if the `phone` in the persons table is equal to the `phone` in the entry, nothing should be changed in the persons table
  - if the `phone` in the persons table is not equal to the `phone` in the entry, it should be changed in the persons table
  - if the entry is not stored in the persons table, a new person record needs to be created
- merge job: a sub-unit consisting of a given number of entries, into which the merge process is split for efficiency reasons
Write a piece of software that executes the merge process in a way such that:
- it must be clean
- it must be correct
- it should provide basic loading statistics at the end of the merge process
- it should process the XML input file efficiently
- it should minimize the overall run-time of the merge process
Besides that, reason about performance and memory usage of the merge process
# create and start database
docker-compose up
# compile the importer
stack build
# run the importer
stack exec importer -- update-file.xml 10000
The importer is currently meant to be run manually from the command line. It
would not be hard to put it in a Docker container and run it in an automated
fashion.
The db dump and update file need to be downloaded separately as they are too
big. Copy the db dump into the top-level folder of the project since that's
where the Dockerfile expects it.
- database: PostgreSQL 10. I use PL/pgSQL mostly to provide stats on person records created or updated during the merge process.
- importer: Haskell. It's a purely functional, strongly and statically typed language that gave me high assurance about the correctness and cleanness of my code and let me develop it in a type-driven fashion.
- database communication: PostgREST is a tool that provides an easy and convenient way to interface with a PostgreSQL database instance using a REST API over HTTP. It also takes care of authentication using JWT (authorization is left to the database).
- Docker: PostgreSQL and PostgREST run inside Docker containers using Docker Compose. The importer can be run from any machine manually or be packaged in a Docker container as well.
- The Haskell ecosystem is big and provides a lot of nice tools for handling concurrency, such as Software Transactional Memory and the async library, which made expressing relatively complex logic quite straightforward and enjoyable
- The vast Haskell ecosystem also made it easy to efficiently read and parse entries from the XML input file with a minimum amount of code, and to communicate with the database over its REST API
- Haskell profiling tools are really enjoyable to work with and make it easy to pinpoint performance bottlenecks
- Working with the DB through a REST API was also enjoyable and helped with separation of concerns, since PostgREST forces consumers to write stored procedures in the database to define user-defined workflows. This is also good for performance
- Haskell is not currently used at Lambdawerk as far as I understood, and this could make it harder to get deep feedback or for the reviewer to easily understand the code I've written
- PostgREST is a relatively simple tool that can be used as a black box, but internally it depends on other tools and frameworks - e.g. nginx - and these become transitive dependencies
- Docker to have a sandboxed and reproducible infrastructure (tested on Version 17.09.0-ce-mac35 (19611), Channel: stable, a98b7c1b7c)
- Stack build tool to have a sandboxed and reproducible Haskell environment and install all needed dependencies (tested on Version 1.5.1 x86_64 hpack-0.17.1)
Components need some configuration parameters that must be supplied via environment variables, specifically:
- to set the environment variables needed for the PostgreSQL instance, you could create a file named `postgres.env` and provide values for the following variables: `POSTGRES_DB`, `POSTGRES_USER`, `POSTGRES_PASSWORD`. Have a look at the sample file `postgres.env.sample`
- to set the environment variables needed for the PostgREST instance, you could create a file named `postgREST.env` and provide values for the following variables: `PGRST_DB_URI` (of the form `postgres://[POSTGRES_USER]:[POSTGRES_PASSWORD]@postgres_container_alias:5432/[POSTGRES_DB]`), `PGRST_DB_SCHEMA`, `PGRST_DB_ANON_ROLE`, `PGRST_SERVER_PROXY_URI`, `PGRST_JWT_SECRET`. Have a look at the sample files `postgREST.env.sample` and `rsa.jwk.pub.sample`. The latter contains a sample secret that can be set as `PGRST_JWT_SECRET` and used to generate JWT tokens to use as `API_TOKEN`
- the environment variables needed for the importer are: `API_ENDPOINT`, `API_TOKEN`
While pondering and evaluating a solution, I identified three sub-problems:
1. reading and parsing the XML input file
2. serialising the content of the XML input file
3. merging entries respecting the given invariants
To achieve a clean architecture, my design decision early on was that the importer would be responsible for 1. and the database for 3. The following sequence diagram aims to give a high-level view of the architecture of the software I implemented.
It was clear from reading the problem definition that the XML input file was
potentially too large to be read and parsed in memory. So the alternative was to
stream it. Because lazy I/O is quite tricky and has subtleties that make it easy
to do something wrong, I used the conduit library, which provides a safer (when
it comes to releasing resources) and more convenient way of streaming content.
The importer reads and parses `batchSize` entries at a time, where `batchSize`
is a parameter that must be configured when invoking the importer to tell it how
many entries should be parsed in one "read & parse" pass before being submitted
to the database.
When choosing a `batchSize` one should consider at least the following factors:
- how much memory is available on the machine where the importer runs
- how much memory is needed to represent a person (I wrote down details about that in `Persons.hs:22`)
- the optimal batch size for multi-row `INSERT`s in PostgreSQL
The database is responsible for merging the entries supplied by the importer in a correct and efficient way. I explored various possibilities:
- `SELECT`ing person records one-by-one and checking the invariants is hopelessly slow
- conditional `UPDATE`s don't cover the case of new person records
- `INSERT`ing using the `ON CONFLICT` clause looked promising since it's basically what the invariants describe. The only catch here is that the `conflict_target` must be unique. That meant adding a `PRIMARY KEY` constraint to the persons table.
Ultimately I decided to implement the last option: when the database is created
and populated, the person table is `ALTER`ed to add a `PRIMARY KEY`, but that led
to another issue: 257 person records are not well-formed. The "Assumptions" and
"Sanitising data" sections talk about how I decided to tackle this.
The final SQL statement to merge entries looks like the following:
CREATE OR REPLACE FUNCTION merge(entries json) RETURNS json AS $$
DECLARE
  row_stats integer;
BEGIN
  -- Insert every entry; on a primary-key conflict only the phone number is
  -- updated, and only when it differs from the stored one.
  INSERT INTO person AS p
  SELECT * FROM json_populate_recordset(null::person, entries)
  ON CONFLICT (fname,lname,dob) DO UPDATE
  SET phone = EXCLUDED.phone
  WHERE p.phone != EXCLUDED.phone OR p.phone IS null;
  -- Report back how many person records were inserted or updated.
  GET DIAGNOSTICS row_stats = ROW_COUNT;
  RETURN json_build_object('row_stats', row_stats);
END;
$$ LANGUAGE plpgsql;
It uses PL/pgSQL to be able to return basic statistics about the affected person records back to the importer.
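As an illustration of the other side of this exchange, here is a minimal sketch of how the importer could call such a `merge` function through PostgREST's `/rpc` interface using `http-conduit`, reading `API_ENDPOINT` and `API_TOKEN` from the environment; the exact request code in the importer differs:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import           Data.Aeson            (ToJSON, Value, object, (.=))
import qualified Data.ByteString.Char8 as BS
import           Network.HTTP.Simple
import           System.Environment    (getEnv)

-- Send one merge job to the merge function exposed by PostgREST at /rpc/merge
-- and return the statistics it reports (e.g. {"row_stats": ...}).
submitBatch :: ToJSON entry => [entry] -> IO Value
submitBatch entries = do
  endpoint <- getEnv "API_ENDPOINT"
  token    <- getEnv "API_TOKEN"
  request  <- parseRequest ("POST " ++ endpoint ++ "/rpc/merge")
  let request' = setRequestBodyJSON (object ["entries" .= entries])
               $ setRequestHeader "Authorization" [BS.pack ("Bearer " ++ token)]
                 request
  getResponseBody <$> httpJSON request'
```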
I ran these benchmarks on my laptop: MacBook Pro (Retina, 15-inch, Mid 2015), 2.5 GHz Intel Core i7, 16 GB 1600 MHz DDR3. First I did some general tests, playing around with different batch sizes to check how they impacted performance:
- `batchSize=1000, max_wal_size=1GB` => 69.265649s (lots of connection errors: PostgREST can't keep up, apparently)
- `batchSize=10000, max_wal_size=1GB` => 62.356193s (2-3 req/sec)
- `batchSize=100000, max_wal_size=1GB` => 65.404345s (1 req/~5secs)
Looking at the PostgreSQL logs I noticed lots of warnings like
`checkpoints are occurring too frequently (29 seconds apart)`, so I configured
PostgreSQL to use a bigger `max_wal_size`:
- `batchSize=2000, max_wal_size=2GB` => 61.797248s (20+ req/sec)
- `batchSize=5000, max_wal_size=2GB` => 64.429382s (4-5 req/sec)
- `batchSize=10000, max_wal_size=2GB` => 61.140297s (2-3 req/sec)
The warnings were gone but the running time of the merge process wasn't changed.
Changing the batch size didn't seem to affect the total running time of the merge process. In order to make it run faster I could think of two strategies:
- make the SQL in the database more efficient
- make the parsing/serialising in the parser more efficient
Let's start with the database:
- sending the entries to the database: there is going to be some I/O involved; I already tried tuning that by changing the batch size and didn't really get anything out of it
- deserialisation of the input: this is internal to the database and cannot be optimised as far as I know
- optimising the SQL statement: the statement is a database function, i.e. a stored procedure. It's already compiled and optimised by the database query planner. The index on the `PRIMARY KEY` slows `INSERT`ions down, but it cannot be removed since the constraint is needed for `ON CONFLICT` to work. Tuning some database settings is also an option; I explored changing `max_wal_size` but didn't get any substantial performance gains
- hardware: this is out of the scope of the assignment
- serialisation of the output: the database returns a minimal response and I don't see this as a bottleneck
I didn't go down this road for too long because, from my tests, it didn't seem like the database was the bottleneck. So I moved my attention to the importer, with the goal of raising the number of requests per second that it could send to the database, focusing on:
- deserialisation of the entries in the XML input file
- serialisation of the entries into JSON
- deserialisation of the database response
Here things got more interesting: while profiling the importer to optimise its running time I actually found a space leak that prevented it from running in constant memory and fixed it. The following image shows how memory consumption grows linearly with the number of entries (50000 in this test)
and the following shows how the memory consumption stays mostly constant
When testing the space-leak-free version of the importer I noticed that the merge process running time increased dramatically. It turned out the importer wasn't running merge jobs fully in parallel, that is, it was waiting for a merge job to complete before merging the next one. The way I fixed it is also described in the commit message whose SHA-1 is 333086d.
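In essence, the fix is to start each merge job asynchronously and only wait for all of them at the end, so that parsing the next batch is never blocked by an in-flight request. A simplified sketch of that pattern with the async library (not the literal code from the commit):

```haskell
import           Conduit
import           Control.Concurrent.Async (async, wait)

-- Start one merge job per batch without waiting for the previous job to
-- finish; all jobs are awaited only once the whole file has been parsed.
runJobsConcurrently :: ConduitT () [entry] (ResourceT IO) ()  -- stream of batches
                    -> ([entry] -> IO ())                     -- submit one merge job
                    -> IO ()
runJobsConcurrently batches submitBatch = do
  jobs <- runResourceT $ runConduit $
            batches .| mapMC (liftIO . async . submitBatch) .| sinkList
  mapM_ wait jobs
```

Note that nothing here bounds the number of in-flight jobs; the back-pressure concern discussed under open problems applies.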
- The database is hosted on a remote machine, therefore the importer must connect to it over the internet.
- Creating the persons table is a one-time task, and the one-time costs to pay upfront to sanitise the person records (see below) so that a `PRIMARY KEY` constraint can be created are acceptable. A `PRIMARY KEY` constraint is added because it is generally good practice to have one and it allows implementing the merge process conveniently and efficiently.
- Removing the 267 entries that do not have either a last name or a birthdate would be an unacceptable data loss.
There are some not well-formed entries in the XML input file, e.g.:
`"ABD AL", "PIKULSKI", "\\N", "3158419207"`.
I decided to simply sanitise them this way:
`"ABD AL", "PIKULSKI", "-infinity", "3158419207"`.
Other options I considered were:
- filtering out the entry
- somehow reporting the entries that are not well-formed
- rejecting the whole file
The focus of the assignment wasn't on this particular aspect and 267 person records are also not well-formed, so I applied the easiest solution I could think of. For missing values:
- `dob`: `-infinity`
- `lname`, `fname`: `_` (in Haskell `_` means: "ignore this")
This is a short analysis of the various solutions I pondered before implementing the final one
- parse all entries in the XML input file
- `SELECT` all person records
- for each person record check if its telephone number is different from the one in the XML file: if so, `UPDATE` the person record, otherwise do nothing
- `INSERT` all remaining entries

The algorithm needs O(P) memory and would require O(1) `SELECT`s and O(P) `UPDATE`s / `INSERT`s, where P is the number of entries in the XML input file.
The memory requirements are undesirable; the ideal solution would require constant memory.
- parse one entry in the XML input file
- `SELECT` the person record corresponding to the entry from the database
- if the result is empty, `INSERT` the entry; otherwise check if the phone number of the person record is different from the one of the entry: if so, `UPDATE` the person record, otherwise do nothing
- repeat for all entries in the XML input file

The algorithm needs O(1) memory but would require O(P) `SELECT`s and O(P) `UPDATE`s / `INSERT`s, where P is the number of entries in the XML input file.
Constant memory usage is good, but O(P) `SELECT`s are required. That's slow and
puts too much load on the database.
- parse `batchSize` entries in the XML input file (`batchSize` might depend on different factors, e.g. memory available)
- `SELECT` the corresponding `batchSize` person records
- for each person record check if its telephone number is different from the one in the XML file: if so, `UPDATE` the person record, otherwise do nothing
- `INSERT` all remaining entries
- repeat for all entries in the XML input file

The algorithm needs O(1) memory and would require O(P / batchSize) `SELECT`s and O(P) `UPDATE`s / `INSERT`s, where P is the number of entries in the XML input file and `batchSize` is the fixed batch size, which might depend on different factors, e.g. memory available.
Constant memory usage is good, but again O(P / batchSize) `SELECT`s are required. That's
better than the previous algorithm but still potentially slow. Is it possible to
avoid this step completely?
- `UPDATE` the persons table and add default `lname`s and `dob`s where missing, in order to be able to add a primary key constraint
- `ALTER` the persons table and add a `PRIMARY KEY` constraint needed by the `ON CONFLICT` statement.
- parse `batchSize` entries from the XML input file (`batchSize` might depend on different factors, e.g. memory available)
- `UPSERT` the `batchSize` entries, leveraging the database's own merge capabilities
- repeat for all entries in the XML input file
Merging would look like this in SQL:
INSERT INTO person AS p VALUES
('JIARA','HERTZEL','1935-06-05','5859012134'),
('RONJARVIOU','COMELLO','1932-09-27','7702713416')
ON CONFLICT (fname,lname,dob) DO UPDATE
SET phone = EXCLUDED.phone
WHERE p.phone != EXCLUDED.phone OR p.phone IS null;
The algorithm needs O(1) memory and O(P) `UPSERT`ions, where P is the number of
persons in the XML input file.
- client runs in constant memory
- leverages the native merging capabilities of the database to:
  - simplify the importer logic
  - avoid the need to do any (slow) I/O to `SELECT` person records first and apply the merging in the client
  - achieve good separation of concerns:
    - the importer parses the XML input file and sends the entries to the database
    - the database handles the merge logic
- there is a one-time cost to pay to sanitise the data in the persons table
- there is a one-time cost to pay to add a `PRIMARY KEY` constraint to the persons table
The following problems are not solved in a satisfactory fashion, mostly because of time constraints, but I wanted to mention them since I did ponder them.
The importer doesn't have any back-pressure mechanism when it comes to temporarily halting or slowing down the merge process. This might overwhelm the database in case, for some reason, it cannot keep up.
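One possible (not implemented) way to address this would be a counting semaphore that caps the number of in-flight merge jobs, for example:

```haskell
import           Control.Concurrent.QSem (newQSem, signalQSem, waitQSem)
import           Control.Exception       (bracket_)

-- Wrap a submit function so that at most maxInFlight merge jobs can be
-- running against the database at any given time.
boundedSubmit :: Int -> ([entry] -> IO ()) -> IO ([entry] -> IO ())
boundedSubmit maxInFlight submitBatch = do
  sem <- newQSem maxInFlight
  pure (\batch -> bracket_ (waitQSem sem) (signalQSem sem) (submitBatch batch))
```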
If the database is down or there is any other issue that prevents a request from
reaching the database, the importer should store the failed entries - e.g. in an
`update-failed.xml` file - to retry them later or re-submit them manually.
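A minimal sketch of what that could look like, where `logFailed` (e.g. appending the batch to an `update-failed.xml` file) is hypothetical:

```haskell
import           Control.Exception (SomeException, try)

-- Try to submit a merge job; if anything goes wrong, hand the batch to
-- logFailed so it can be retried or re-submitted manually later.
submitOrLog :: ([entry] -> IO ()) -> ([entry] -> IO ()) -> [entry] -> IO ()
submitOrLog submitBatch logFailed batch = do
  result <- try (submitBatch batch) :: IO (Either SomeException ())
  case result of
    Left _   -> logFailed batch
    Right () -> pure ()
```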
- I'm not entirely sure how the merge can be tested in an automated fashion. That would involve spinning up the infrastructure that holds the database, invoking its `merge` function, and then checking the returned stats and querying the database. I did do some manual tests to assure the correctness of the `merge` function.
- Fuzz testing of parsing / serialising entries in the importer.