Audit trail handling with MongoDB

04. August 2021

A few years ago, we faced the important task of adding an audit trail to our CRM web application. The term may sound fancy, but it basically means a history of changes. Whether or not the company is subject to an audit, it is always useful to have a list of changes in the system with the author's name and a timestamp attached. I was excited to implement the audit trail because, at the time, I was working in my free time on a different project for which I had envisioned a system based on storing just differences.

We have several different entities in the CRM, and I didn't want to write a specific implementation for each one. The ideal state was that the audit trail could be easily connected to any entity in the system without any additional knowledge of the tracked entities. Tibor and I designed a service that takes advantage of MongoDB. This popular NoSQL database stores data as JSON-like documents. We exploited this fact by implementing a service that calculates the difference between an old state and a new state of the tracked entity. This difference is represented as a JSON Patch and saved to a chronologically sorted list.
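The diffing step can be sketched in a few lines of Python. This is a minimal illustration, not our actual service (in practice a library such as jsonpatch could do this); `make_patch` and the sample documents are hypothetical, and lists are replaced wholesale for brevity:

```python
def make_patch(old, new, path=""):
    """Compute a minimal RFC 6902-style JSON Patch between two documents.

    Recurses into nested dicts; scalars and lists are replaced wholesale.
    """
    ops = []
    if isinstance(old, dict) and isinstance(new, dict):
        for key in old:
            if key not in new:
                ops.append({"op": "remove", "path": f"{path}/{key}"})
        for key, value in new.items():
            if key not in old:
                ops.append({"op": "add", "path": f"{path}/{key}", "value": value})
            elif old[key] != value:
                ops.extend(make_patch(old[key], value, f"{path}/{key}"))
    elif old != new:
        ops.append({"op": "replace", "path": path, "value": new})
    return ops

old = {"name": "Acme", "address": {"city": "Brno"}, "phone": "123"}
new = {"name": "Acme Ltd", "address": {"city": "Prague"}}
patch = make_patch(old, new)
# [{'op': 'remove', 'path': '/phone'},
#  {'op': 'replace', 'path': '/name', 'value': 'Acme Ltd'},
#  {'op': 'replace', 'path': '/address/city', 'value': 'Prague'}]
```

The resulting list of operations is what gets appended to the chronologically sorted history for the entity.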

By applying these JSON patches to the current document, we can travel back to any version of the document. This system also preserves schema changes, which is neat. The older the version is, the more performance-heavy the operation becomes, because more patches have to be applied. To mitigate this problem, only a limited number of patches is saved together with a full document - we call this entity an Audit Page. This means the number of operations is always bounded by the maximum count of patches on an Audit Page. Getting a historical version means finding the correct Audit Page and applying its patches to the full document stored on the page. The example above shows only 4 patches per page, but we use far more in the CRM. The specific number depends on how large (complicated) the tracked document is.
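A minimal sketch of the reconstruction step, with hypothetical names (`apply_patch`, `version_back`) and a simplified Audit Page layout: assume each page stores a full snapshot plus a list of reverse patches, so stepping back n versions means replaying the first n patches onto the snapshot:

```python
import copy

def apply_patch(doc, ops):
    """Apply simple RFC 6902-style operations (nested dicts only)."""
    doc = copy.deepcopy(doc)
    for op in ops:
        parts = op["path"].lstrip("/").split("/")
        parent = doc
        for part in parts[:-1]:
            parent = parent[part]
        if op["op"] == "remove":
            del parent[parts[-1]]
        else:  # "add" or "replace"
            parent[parts[-1]] = op["value"]
    return doc

def version_back(page, n):
    """Step n versions back by replaying the first n reverse patches
    onto the full snapshot stored on the Audit Page."""
    doc = page["snapshot"]
    for ops in page["patches"][:n]:
        doc = apply_patch(doc, ops)
    return doc

page = {
    "snapshot": {"name": "Acme Ltd", "address": {"city": "Prague"}},
    "patches": [  # reverse patches, newest change first
        [{"op": "replace", "path": "/address/city", "value": "Brno"}],
        [{"op": "replace", "path": "/name", "value": "Acme"},
         {"op": "add", "path": "/phone", "value": "123"}],
    ],
}
previous = version_back(page, 2)
```

Because the patch count per page is capped, the cost of reconstructing any version is bounded regardless of how long the entity's history grows.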

Some of our tracked entities are more complex than others. Showing all of them in the same format in the GUI was tricky, but we agreed that a simple text representation would be enough. We implemented a transformation function for each document type, which transforms the JSON into a suitable text representation. In the case of an update, both texts are rendered next to each other, and the differences are highlighted similarly to how GitHub highlights changes.
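A rough idea of what such a transformation and diff could look like, using Python's standard difflib module; `render_entity` and `diff_view` are hypothetical stand-ins for our per-type transformation functions:

```python
import difflib

def render_entity(doc, indent=0):
    """Flatten a JSON-like document into stable, human-readable lines."""
    lines = []
    for key in sorted(doc):
        value = doc[key]
        if isinstance(value, dict):
            lines.append("  " * indent + key + ":")
            lines.extend(render_entity(value, indent + 1))
        else:
            lines.append("  " * indent + f"{key}: {value}")
    return lines

def diff_view(old_doc, new_doc):
    """Unified diff of two versions, with GitHub-style +/- markers."""
    return list(difflib.unified_diff(
        render_entity(old_doc), render_entity(new_doc),
        fromfile="before", tofile="after", lineterm=""))

out = diff_view(
    {"name": "Acme", "address": {"city": "Brno"}},
    {"name": "Acme Ltd", "address": {"city": "Brno"}})
```

Sorting the keys keeps the text form stable between versions, so the diff highlights only real changes rather than reordered fields.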

I recently attended Data + AI Summit 2021 and learned how Delta Lake storage works internally. Delta Lake is a storage layer for Spark which saves tables as files. It was nice to realize that this interesting technology uses a similar system of differences interlaced with full snapshots under the hood. It enables the same time-traveling capabilities, but at the table level.

Author: Luděk Novotný