Your first database

Normalised vs Denormalised Data

Richard TzanovRichard Tzanov
5 months ago | 9 min read

Colourful thread connected by pins

How to decide which database structure is best for you

Being handed your first complete greenfield project - where you’ll be building a whole new piece of software from the ground-up, from database to back-end service to the client - is great and exciting until you actually start working on it, when it swiftly becomes terrifying and overwhelming. There are a lot of things to consider, and a million and one opinions on the internet about the right way to do it.

One specific problem I had when building the Mongoose/MongoDB database architecture for Notches - a bespoke piece of project management software and my first professional attempt to build something in its entirety - was working out the relations between different documents within the same collection, and between collections. The main difficulty was knowing when to reference data vs when to embed it - aka normalised vs denormalised data - and sure enough, it was easy to find plenty of opinions about why I should definitely be doing one and not the other.

This is not a blog that has a strong opinion either way (spoilers: I went for a mixture of both), nor is it meant to be a definitive example of the best or most efficient way to structure your database, but if you find yourself in a similar situation on your first project, it should hopefully make those decisions a bit clearer and provide some examples.

The shape of things

The core piece of data in Notches is a workItem - an umbrella term that can refer to a feature, bug, or task, e.g., the things a user will be tracking work against - and it looks like this in the database:

KeyType
nameString
_idObjectId
friendlyReferenceInt32
usersArray
effortArray
parentObject
businessValueObject
pointsEstimateInt32
commentsArray
descriptionString
sprintObject
stateObject
timestampsObject
typeString
siblingsArray

The keys we’re concerned with when it comes to the relations in the database are:

  • effort
  • parent
  • businessValue
  • comments
  • sprint
  • state
  • siblings

and the principal question in each case is this: should the data for each of these keys exist independent of the workItem with some kind of reference to connect the two, should it exist solely on the workItem, or both?

Let’s take state as an example. A workItem will pass through many states during the work cycle, e.g., Open, Planned, In progress, etc., and the same states will be shared across all workItems, that is, a state is not unique to a workItem.

It’s an object that looks like this:

{ name: “Open”, _id: '5e3af5a4548644000823794e' }

As states are their own entities that will exist across Notches (we’ll also be using them with requests, for example), they’ll live in their own collection in MongoDB.

The question then becomes whether to embed a workItem’s state on the workItem itself, or instead to keep a reference of the relationship between a workItem and its current state and combine the two elsewhere in the back-end service. The difference looks like this:

Embedded:

workItem = {
  ...workItemKeys,
  state: {
    name: “Open”,
    _id:  '5e3af5a4548644000823794e'
  }
}

Relational:

workItem = {
  ...workItemKeys,
  state: {
    _id: '5e3af5a4548644000823794e'
  }
}

The differences look slight here, but imagine an object with multiple keys, such as another workItem (we’ll come to that later).

Trading places

There are two obvious trade-offs between embedding vs referencing data: efficiency and accuracy. If we embed state on a workItem, then whenever we want to fetch that workItem, all the information we want about its state will already be there, so we only need to make one call to the database.

If we just have a reference to the state (e.g., the _id), then we’ll have to make two calls: one to the workitems collection for the workItem and another to the states collection for the corresponding state.

To be clear: the workItem would have a state object on it, containing just an _id field. Once we’ve retrieved the workItem from the database, we’ll then use state._id to find the corresponding state, retrieve state.name and add it to the workItem object. (Mongoose does have a .populate() method to handle this for us, which I’ll come to later, but that’s still effectively what it’s doing.)

Seems easy then: embed the data. Sure, that means that there will be lots of copies of the same data (once in the state collection, and also a copy on each workItem), but in the vast majority of cases - and certainly with Notches - the impact will be negligible, and you’re reducing the number of calls you’re making to the database.

But what happens when you want to change the name of the state from “Open” to “Available”? You update the state.name in the states collection, but what about all the copies that exist on all the workItems? The state.name on those will still be “Open” because there’s no direct relation between the state collection and the state information on the workItem.

What if we’d referenced - or normalised - the data instead? We’d use the state._id reference on the workItem to also get the corresponding state information every time we fetched a workItem, and therefore we’d always get the most up-to-date state information.

How to choose which one to use then? For Notches, I chose to embed the state data, along with the data for businessValue and sprint, even though the data existed in separate collections. But why? Let’s look at another example of relations between workItems in the database with the parent object.

I get that reference

The parent object describes the relationships between the types of workItem: features, bugs and tasks. There is a hierarchy to these workItems, where a feature can be a parent to either a bug or a task. There also exist other parents throughout Notches, such as projects that have features as children, but we don't need to worry about those here.

A typical parent object will look like this:

{ entity: feature, _id: '5d8b4d9125f75e2405665505' }

This is an example of how we’d set up information for a normalised database. The only information about a task’s parent feature that exists on it is the feature’s _id, so every time we get a task and we also want information about its feature, we use the _id on the parent object to get the feature, then add that feature information on to the task elsewhere in the service.

You may have noticed as well that a workItem has a parent object, but not a children array. If we get a feature, how do we then get all the bugs and tasks that are its children? Simple - we take the feature._id and use that to list all the bugs and tasks where parent._id matches feature._id, and then add them to the feature object. This maintains a “one-to-many” relationship among the data. It means that the relations object - in this case parent - only needs to contain one _id. If instead we kept a record of all a feature’s children, we’d have a potentially longer array of child _ids to iterate through and a more complex series of Mongoose queries.

So why embed - or denormalise - state data and reference - or normalise - relations between workItems? One big factor in deciding which technique to use for Notches came down to how often the data was going to be updated. Things like states may change, but they’re not likely to change very often. We can write an updateMany query that will update the state information on workItems at the same time as it updates the state in the state collection, and the chances of the database being out of sync is minimal.

A task, however, is likely to be changing all the time, and we don’t want to be having to update things more than once every time we make a change. It’s relatively expensive, and the chances of something going wrong increase as well. It’s much more reliable to update it in one place, and then just retrieve the latest version using the _id whenever we need it in relation to another workItem elsewhere. Ultimately, it’s a read/write, efficiency/integrity trade-off.

Are you going to be reading the data often, do you want it as quick as possible, and are you rarely going to be updating the relations between the data? Then you probably want to denormalise your data. If the related data are going to be updated relatively often and you want to reduce redundancy and inconsistency in your database, then you’ll probably want to normalise your data.

.populate()

For referencing between models or collections in a database, Mongoose has the populate() method, and it handles filling in the details of the referenced item. You can see how it works in the Mongoose documentation.

We ended up not using .populate() for Notches for a few reasons: it didn’t offer the flexibility we wanted in choosing how to populate all keys on workItem throughout the service (such as repositories, which exist in another service and collection); the complexity of the syntax - and the subsequent bugs - grew exponentially once we wanted to .populate() beyond the top level of a document; and we also ran into bugs trying to implement it alongside Mongoose discriminators, a very useful method for differentiating between documents in the same collection (in this case, it allowed us to differentiate between features, bugs and tasks in the workitems collection).

I’ll be going over how we did our own version of populating in the Notches service in a subsequent blog.

The easy bits

And what about effort and comments? These are the simple ones. An effort or a comment will only ever relate to a single workItem. It’s not data that needs to be shared between documents or collections, we don’t want the data to exist after we delete a workItem, and we only ever want to retrieve the data in the context of its parent workItem. This is the kind of data that can easily be embedded.

Share me!