Releasing diepvries, a Data Vault framework for Python

Matthieu Caneill
Picnic Engineering
Published in
4 min readAug 26, 2021

--

diepvries is a simple Python interface that generates complex SQL queries used to load a complete Data Vault model.

On behalf of the Picnic Data Engineering team, I’m very happy to announce the first public release of diepvries under the MIT license.

If looking at source code works better for you than reading a blog post, you can already head over to the Github repository. Otherwise, hold your breath for some background on this framework.

Background: Data Vault

Before diving into our solution, a prime reminder about Data Vault modeling. This modeling method aims to capture data from operational systems, for long-term storage. It scales very well in terms of data volume and data model changes, with the goal of archiving “all the data, all of the time”. Among its main characteristics, we can find the organization of tables into Hubs, Links, and Satellites, all having metadata columns storing attributes such as data sources and capture timestamps.

At Picnic, we use Data Vault extensively. It is at the core of the Data Warehouse, and powers many applications. For example, you can read more about how it’s used for Data Science or as the foundation of a “lakeless” approach.

What is diepvries?

diepvries is the name of a Python library that we are releasing today under a free software license. It automates the data loading process for Data Vault and avoids the maintenance of repetitive SQL queries for ETL jobs. Its main functionalities are:

  • Data Vault programmability. Through the representation of a Data Vault model using Python objects such as Hub, Satellite, Link or Field, diepvries allows a data model to be programmable, shifting the focus of otherwise cumbersome and repetitive work to more interesting challenges, such as model validations, sanity checks, or documentation generation.
  • SQL generation. Given a defined data model, and some data in a staging state, diepvries can automatically generate SQL statements to load this data in a chosen list of Data Vault target tables — the hubs, links, and satellites modeled around the data.
  • Data model deserialization. Maintaining a Data Vault model both in SQL and Python is error-prone and would duplicate data structures. To avoid this, diepvries can automatically deserialize SQL tables into Python objects, using database metadata.

Although this might seem a lot, the scope of this library is constrained by design to be a Python interface for an SQL Data Vault, tailored for data loading. This makes diepvries very flexible, as it can easily be integrated into ETL jobs without fundamentally changing their architecture.

Typical Data Vault pipeline. diepvries automates data loading from the staging area to the Data Vault structures.

To implement these features, diepvries is based on the following principles:

  • Convention over configuration. Following strong naming conventions for tables and columns brings many benefits, such as disambiguation and consistency. Semantic names put humans and computers on the same page: knowing that h_customer is the Hub for customers, or hs_customer the Satellite, allows Data Engineers to immediately know what they’re talking about, and allows diepvries to deserialize SQL structures into the right Python objects.
  • Automatic field mapping. Following naming conventions mentioned above, using the same column names for the source (operational datastores) and target (Data Vault) systems enables diepvries to automatically map them — without having to manually specify every field mapping.
  • Self-contained and restricted scope. diepvries does one thing and does it well — hence it doesn’t rely on a lot of dependencies. In fact, it only has one, the Snowflake connector for Python. This keeps the library lightweight and creates less dependency conflicts when integrated into other components.

Currently, diepvries only supports Snowflake as the Data Vault backend. Indeed, the generated SQL is not database-agnostic, and reading table and column metadata works differently for every engine. But this is only the first step, and we might consider supporting more databases in the future.

diepvries at Picnic

The development of diepvries started in 2019 when the Data Engineering team was looking for a scalable solution to maintain dozens of ETL Data Vault jobs, without duplicated SQL queries or boilerplate code. It has been part of the transition from Pentaho to Python, and after the library was deemed successful, we decided to release it to a wider audience, joining a growing list of free software developed at Picnic.

You might wonder where the name comes from. diepvries is Dutch for freezer, which resonates well with a Data Vault (once data enters it, it is frozen there forever), and of course with the business of Picnic, the delivery of groceries (including frozen products!). Oh, and after realizing the acronym DV can stand both for Data Vault and DiepVries, it was a no-brainer to pick this name!

diepvries has been running in production at Picnic for almost 2 years, and powers the entirety of our Data Vault, which comprises thousands of tables.

What’s the future of diepvries?

While we have many ideas for the directions diepvries can take in the future, there is not a defined roadmap. A first public release will change how the library is developed, as we expect to receive external feedback and contributions.

We believe in the fundamental ideas behind the library, namely convention over configuration (semantic names make everything easier), and restricted scope (do one thing and do it well). That being said, we are looking forward to seeing how diepvries can do these things better, and we aim to make Data Vault programming a little bit simpler at every iteration.

We hope to see diepvries adopted by other Data Vault practitioners, and to collaborate with them to define and design its future. Directions might be uncertain, but certainly exciting!

Want to learn more? Be sure to check out the source code repository and the documentation website.

--

--