Your first database
Normalised vs Denormalised Data
How to decide which database structure is best for you
There are a lot of things to consider, and a million and one opinions on the internet about the right way to do it
Being handed your first complete greenfield project - where you’ll be building a whole new piece of software from the ground-up, from database to back-end service to the client - is great and exciting until you actually start working on it, when it swiftly becomes terrifying and overwhelming. There are a lot of things to consider, and a million and one opinions on the internet about the right way to do it.
One specific problem I had when building the Mongoose/MongoDB database architecture for Notches - a bespoke piece of project management software and my first professional attempt to build something in its entirety - was working out the relations between different documents within the same collection, and between collections. The main difficulty was knowing when to reference data vs when to embed it - aka normalised vs denormalised data - and sure enough, it was easy to find plenty of opinions about why I should definitely be doing one and not the other.
This is not a blog that has a strong opinion either way (spoilers: I went for a mixture of both), nor is it meant to be a definitive example of the best or most efficient way to structure your database, but if you find yourself in a similar situation on your first project, it should hopefully make those decisions a bit clearer and provide some examples.
The shape of things
The core piece of data in Notches is a workItem
- an umbrella term that can refer to a feature
, bug
, or task
, e.g., the things a user will be tracking work against - and it looks like this in the database:
Key | Type |
---|---|
name | String |
_id | ObjectId |
friendlyReference | Int32 |
users | Array |
effort | Array |
parent | Object |
businessValue | Object |
pointsEstimate | Int32 |
comments | Array |
description | String |
sprint | Object |
state | Object |
timestamps | Object |
type | String |
siblings | Array |
The keys we’re concerned with when it comes to the relations in the database are:
effort
parent
businessValue
comments
sprint
state
siblings
and the principal question in each case is this: should the data for each of these keys exist independent of the workItem
with some kind of reference to connect the two, should it exist solely on the workItem
, or both?
Let’s take state
as an example. A workItem
will pass through many states during the work cycle, e.g., Open
, Planned
, In progress
, etc., and the same states will be shared across all workItems
, that is, a state
is not unique to a workItem
.
It’s an object that looks like this:
{ name: “Open”, _id: '5e3af5a4548644000823794e' }
As states
are their own entities that will exist across Notches (we’ll also be using them with requests
, for example), they’ll live in their own collection in MongoDB.
The question then becomes whether to embed a workItem
’s state on the workItem
itself, or instead to keep a reference of the relationship between a workItem
and its current state
and combine the two elsewhere in the back-end service. The difference looks like this:
Embedded:
workItem = {
...workItemKeys,
state: {
name: “Open”,
_id: '5e3af5a4548644000823794e'
}
}
Relational:
workItem = {
...workItemKeys,
state: {
_id: '5e3af5a4548644000823794e',
},
}
The differences look slight here, but imagine an object with multiple keys, such as another workItem
(we’ll come to that later).
Trading places
There are two obvious trade-offs between embedding vs referencing data: efficiency and accuracy. If we embed state
on a workItem
, then whenever we want to fetch that workItem
, all the information we want about its state
will already be there, so we only need to make one call to the database.
If we just have a reference to the state
(e.g., the _id
), then we’ll have to make two calls: one to the workitems
collection for the workItem
and another to the states
collection for the corresponding state
.
To be clear: the workItem
would have a state
object on it, containing just an _id
field. Once we’ve retrieved the workItem
from the database, we’ll then use state._id
to find the corresponding state
, retrieve state.name
and add it to the workItem
object. (Mongoose does have a .populate()
method to handle this for us, which I’ll come to later, but that’s still effectively what it’s doing.)
Seems easy then: embed the data. Sure, that means that there will be lots of copies of the same data (once in the state
collection, and also a copy on each workItem
), but in the vast majority of cases - and certainly with Notches - the impact will be negligible, and you’re reducing the number of calls you’re making to the database.
But what happens when you want to change the name of the state
from “Open” to “Available”? You update the state.name
in the states
collection, but what about all the copies that exist on all the workItems
? The state.name
on those will still be “Open” because there’s no direct relation between the state
collection and the state
information on the workItem
.
What if we’d referenced - or normalised - the data instead? We’d use the state._id
reference on the workItem
to also get the corresponding state
information every time we fetched a workItem
, and therefore we’d always get the most up-to-date state
information.
How to choose which one to use then? For Notches, I chose to embed the state
data, along with the data for businessValue
and sprint
, even though the data existed in separate collections. But why? Let’s look at another example of relations between workItems
in the database with the parent
object.
I get that reference
The parent
object describes the relationships between the types of workItem
: features
, bugs
and tasks
. There is a hierarchy to these workItems
, where a feature
can be a parent
to either a bug
or a task
. There also exist other parents
throughout Notches, such as projects
that have features
as children
, but we don't need to worry about those here.
A typical parent
object will look like this:
{ entity: feature, _id: '5d8b4d9125f75e2405665505' }
This is an example of how we’d set up information for a normalised database. The only information about a task
’s parent feature
that exists on it is the feature
’s _id
, so every time we get a task
and we also want information about its feature
, we use the _id
on the parent
object to get the feature
, then add that feature
information on to the task
elsewhere in the service.
You may have noticed as well that a workItem
has a parent
object, but not a children
array. If we get a feature
, how do we then get all the bugs
and tasks
that are its children? Simple - we take the feature._id
and use that to list all the bugs
and tasks
where parent._id
matches feature._id
, and then add them to the feature
object. This maintains a “one-to-many” relationship among the data. It means that the relations object - in this case parent
- only needs to contain one _id
. If instead we kept a record of all a feature
’s children, we’d have a potentially longer array of child _id
s to iterate through and a more complex series of Mongoose queries.
So why embed - or denormalise - state
data and reference - or normalise - relations between workItems
? One big factor in deciding which technique to use for Notches came down to how often the data was going to be updated. Things like states
may change, but they’re not likely to change very often. We can write an updateMany
query that will update the state
information on workItems
at the same time as it updates the state
in the state
collection, and the chances of the database being out of sync is minimal.
Ultimately, it’s a read/write, efficiency/integrity trade-off
A task
, however, is likely to be changing all the time, and we don’t want to be having to update things more than once every time we make a change. It’s relatively expensive, and the chances of something going wrong increase as well. It’s much more reliable to update it in one place, and then just retrieve the latest version using the _id
whenever we need it in relation to another workItem
elsewhere. Ultimately, it’s a read/write, efficiency/integrity trade-off.
Are you going to be reading the data often, do you want it as quick as possible, and are you rarely going to be updating the relations between the data? Then you probably want to denormalise your data. If the related data are going to be updated relatively often and you want to reduce redundancy and inconsistency in your database, then you’ll probably want to normalise your data.
.populate()
For referencing between models or collections in a database, Mongoose has the populate()
method, and it handles filling in the details of the referenced item. You can see how it works in the Mongoose documentation.
We ended up not using .populate()
for Notches for a few reasons: it didn’t offer the flexibility we wanted in choosing how to populate all keys on workItem
throughout the service (such as repositories
, which exist in another service and collection); the complexity of the syntax - and the subsequent bugs - grew exponentially once we wanted to .populate()
beyond the top level of a document; and we also ran into bugs trying to implement it alongside Mongoose discriminators, a very useful method for differentiating between documents in the same collection (in this case, it allowed us to differentiate between features
, bugs
and tasks
in the workitems
collection).
I’ll be going over how we did our own version of populating in the Notches service in a subsequent blog.
The easy bits
And what about effort
and comments
? These are the simple ones. An effort
or a comment
will only ever relate to a single workItem
. It’s not data that needs to be shared between documents or collections, we don’t want the data to exist after we delete a workItem
, and we only ever want to retrieve the data in the context of its parent workItem
. This is the kind of data that can easily be embedded.
Ready to dive in?
At JDLT, we manage IT systems & build custom software for organisations of all sizes.
Get in touch to see how we can help your organisation.
Book a call