If you’ve ever designed a database from scratch, or worked with a database migrations, then you know how important it is to get the data schema right the first time. If you get them wrong, then when you (inevitably) have to fix it, you must do a major overhaul of your code just to fit the updated schema. In Rails, this means re-implementing potentially dozens of objects. In Haskell, it’s not as bad because the compiler handles most error checking (with Persistent, anyhow), but it’s still not a trivial exercise.
So how do you design a database with Persistent? You begin with database normalization (DN) in mind.
The objectives of DN are to minimize the amount of redesign you have to do later, reduce data redundancy, and reduce data anomalies (double insertions, old data, records don’t get deleted, etc):
- To free the collection of relations from undesirable insertion, update and deletion dependencies;
- To reduce the need for restructuring the collection of relations, as new types of data are introduced, and thus increase the life span of application programs;
- To make the relational model more informative to users;
- To make the collection of relations neutral to the query statistics, where these statistics are liable to change as time goes by.
E.F. Codd, “Further Normalization of the Data Base Relational Model”
In toto, database normalization means to simplify the design of the database. Let’s see what that means mathematically, and then work through an example in Haskell.
(All from Wikipedia. You might want to jump down to the example below, and refer back here if you don’t understand something.)
In order to understand the normal forms, we first need to understand non-prime attributes.
A non-prime attribute of a relation R is an attribute that does not belong to any candidate key of R
Okay, so it’s a column in the table that is not part of a candidate key. So, what’s a candidate key?
A candidate key of a relation is a minimal superkey for that relation; that is, a set of attributes such that:
1. the relation does not have two distinct tuples (i.e. rows or records in common database language) with the same values for these attributes (which means that the set of attributes is a superkey)
2. there is no proper subset of these attributes for which (1) holds (which means that the set is minimal).
First Normal Form (1NF)
A relation is in first normal form if and only if the domain of each attribute contains only atomic (indivisible) values, and the value of each attribute contains only a single value from that domain. (Wikipedia)
So basically, 1NF means each column in the database is as small as we can make it. By “small” I mean simple, atomic, indivisible. This is intentionally vague and does not have a concrete definition. As the database designer, one of your design decisions includes determining what that atomic parts of the schema should be.
Second Normal Form (2NF)
So, the Wikipedia definition is rather dense:
a relation is in 2NF if it is in 1NF and no non-prime attribute is dependent on any proper subset of any candidate key of the relation. A non-prime attribute of a relation is an attribute that is not a part of any candidate key of the relation.
As far as I can tell, 2NF is about removing duplication by minimizing the number of possible primary keys in the row. There is a rigorous, mathematical definition for this, but I’m not going that rabbit hole today.
Third Normal Form (3NF)
3NF has 2 requirements:
- The table/relation (R) must be in 2NF
- Every non-prime attribute of R is non-transitively dependent on every key of R.
Finally, now that we have our terms defined, let’s work through an example. Say I have the following “Customers” table in my database (also from Wikipedia):
|Customer ID||First Name||Surname||Telephone Number|
|456||San||Zhang||(555) 403-1659 Ext. 53; 182-929-2929|
This can be described with the following Persistent entity:
Customers firstname Text surname Text telephoneNumber Text
You’ll notice this is definitely not normal. The telephone column does not have a format to which all phone numbers conform, therefore there is no norm. Imagine writing a decoder for the telephone column: it would be difficult and annoying. We need to normalize this data to make life easier. The rest of this post will go about refactoring this Persistent entity into a normal form.
The simplest and most flexible solution is to break up the telephone column into its own table:
This establishes a one-to-many relationship between customers and telephone numbers: as you can see, the
Customer Id field has
456 listed twice. In Persistent this would look like:
Customer firstname Text surname Text CustomerTelephone customer CustomerId telephoneNumber Text
CustomerId is a foreign key linking the customer’s telephone number to the customer’s name.
Breaking up your columns into tables like this also allows for arbitrary extension. What if, later on, you decide you need addresses for each of your customers? Doing so is as easy as:
Address street Text city Text country Text zipcode Text CustomerAddress customer CustomerId address AddressId UniqCustomerAddress customer address
CustomerAddress is a join table that establishes a many-to-many relationship; each customer can be linked to multiple addresses, and vice versa.
Anyway I hope this helps someone trying to learn Haskell and Persistent. I wrote this to aid my own education as I learn to use Persistent, and I have more learning to do. I’d like to know the best way to query these normalized databases, how to use Esqueleto, and what the performance characteristics are of queries to normalized databases.