How to use YAML Schema to validate your YAML files

Published in Picnic Engineering · Nov 29, 2017

At Picnic, we often use JSON or YAML files to manage the configuration of parts of our services. This approach enables employees with a non-engineering background to configure the essential properties for how our systems work. However, mistakes in such files are easily made.

These mistakes could have catastrophic consequences if the services which rely on these files are not implemented defensively enough. Moreover, engineer support is often still required to understand the full potential of the configuration files.

We have been investigating how to introduce stricter validation of our configuration files. This is intended to mitigate the risk of invalid configuration files, whilst simultaneously providing file users with more informative feedback.

In this article, I intend to give you a practical example of how to do some simple validations, using the open-source JSON/YAML validation tool called Rx. We will use YAML for all our examples in this article, as it’s easier to read. Since YAML is a superset of JSON, the same approach can also be applied to JSON files.

The concept of a Schema

In order to validate something, one should first describe the rules of conformity. In the context of validating a YAML file, this is referred to as a “schema”. A schema defines the general structure to which files should adhere. A specific file can then be checked against that schema to verify whether it is valid or not.

You might wonder which superior language has been cooked up to write such schema files. The answer is simple, and awesome: you write them in YAML. How meta! 🤓

Getting our hands dirty

Let’s get our hands dirty by writing our first schema and validating a file against it. We’ll be using Ruby here, but implementations of Rx in many languages are available on GitHub. You can pick a language of your choosing.

The first thing we need to implement is a simple wrapper around Rx, in order to use it from the command line. The following will suffice for the scope of this article:
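A minimal sketch of such a wrapper might look as follows. It assumes a validation engine that offers a `make_schema` method returning an object whose `check!` method raises on invalid input. `TinyEngine` is a hypothetical stand-in that lets the sketch run without any library installed; it is not the Rx API, and how a real Rx engine is constructed should be checked against the implementation you pick. The printed messages follow the console output shown later in this article.

```ruby
require 'yaml'

# Hypothetical stand-in for a schema engine (NOT the Rx API). It only checks
# that the document is a map containing every key listed under "required".
class TinyEngine
  Schema = Struct.new(:required_keys) do
    def check!(data)
      raise ArgumentError, 'Expected a record' unless data.is_a?(Hash)

      missing = required_keys - data.keys
      raise ArgumentError, "Hash is missing keys: #{missing}" unless missing.empty?

      true
    end
  end

  def make_schema(definition)
    raise ArgumentError, 'Schema must be a map' unless definition.is_a?(Hash)

    Schema.new(definition.fetch('required', {}).keys)
  end
end

# Loads the schema, validates the document, and reports progress in the style
# of the console output shown in this article. Returns true for a valid file.
def run_validation(document_yaml, schema_yaml, engine)
  puts 'Loading schema to validate against'
  begin
    schema = engine.make_schema(YAML.safe_load(schema_yaml))
  rescue StandardError => e
    puts "    ❌  Schema could not be loaded: #{e.message}"
    return false
  end
  puts '    ✅  Schema loaded successfully'

  puts 'Validating'
  begin
    schema.check!(YAML.safe_load(document_yaml))
  rescue StandardError => e
    puts '    ❌  An error occurred validating the file against the schema'
    puts "        #{e.message}"
    return false
  end
  puts '    ✅  File is according to schema.'
  true
end
```

In a command-line script, the two strings would typically come from `File.read(ARGV[0])` and `File.read(ARGV[1])`, and the stand-in engine would be swapped for the real one.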

Our first schema

Let’s pretend that we’re building a static site generator which allows a user to generate a weblog from a set of YAML files, each representing a post. Whenever the user saves a YAML file, we’ll validate the file to ensure that it’s correct. As a first step, we define a valid blogpost as a YAML file that contains two entries: a ‘title’ and a ‘body’ with the following schema:

This schema states that a valid YAML file must contain a record (//rec) as its top-level element. This record must contain exactly two keys, ‘title’ and ‘body’, each with a value of type ‘string’ (//str).
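To make the description concrete, here is a small sketch of that schema, embedded in Ruby so it can be parsed directly. The `record_errors` helper is a hand-rolled illustration of what "conforming" means for this one schema shape; it is an assumption made for this example and not how Rx itself is implemented.

```ruby
require 'yaml'

# The schema described above: a record with two required string entries.
FIRST_SCHEMA = YAML.safe_load(<<~DOC)
  type: //rec
  required:
    title: //str
    body: //str
DOC

# Hand-rolled illustration for this one schema shape only. It reports missing
# keys, unexpected keys, and required values that are not strings.
def record_errors(schema, data)
  return ['Expected a record'] unless data.is_a?(Hash)

  required = schema.fetch('required', {})
  missing = required.keys - data.keys
  extra = data.keys - required.keys
  errors = []
  errors << "Hash is missing keys: #{missing}" unless missing.empty?
  errors << "Hash had extra keys: #{extra}" unless extra.empty?
  required.each do |key, type|
    next unless data.key?(key) && type == '//str'

    errors << "Value of #{key} is not a string" unless data[key].is_a?(String)
  end
  errors
end

# An illustrative post with made-up content; an empty list means it conforms.
puts record_errors(FIRST_SCHEMA, { 'title' => 'Hello', 'body' => 'World' }).inspect
```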

We can now write our first blogpost:

… and validate this post against our schema with the help of the validator we wrote in the previous section:

~ ./validator.rb post.yaml schema.yaml
Loading schema to validate against
    ✅  Schema loaded successfully
Validating
    ✅  File is according to schema.

Now, suppose our user publishes this post but does not see it go viral immediately. They have been playing the content marketing game for quite some time, so they know how to fix it and change the YAML file as follows:

Clickbait subtitle, check! Emoji, check! Profit guaranteed! Let’s validate that file and publish the story:

~ ./validator.rb updated-post.yaml schema.yaml
Loading schema to validate against
    ✅  Schema loaded successfully
Validating
    ❌  An error occurred validating the file against the schema
        Hash had extra keys: ["subtitle"] (/rec)

Great, we caught an issue before it hit production 🎉. However, this subtitle thing seems like something we might wish to support in our blogging platform.

So, we update our schema as follows:

With this schema, both posts will pass validation; the one without a subtitle, and also the updated post with a subtitle.
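As a sketch of why both posts pass, the subtitle can be listed under `optional` instead of `required`. The example below only looks at keys: a post is acceptable when every required key is present and no key falls outside the required and optional sets. The post contents and the `keys_acceptable?` helper are assumptions for illustration, not part of Rx.

```ruby
require 'yaml'

# Schema with the subtitle listed under "optional" rather than "required".
SCHEMA_WITH_SUBTITLE = YAML.safe_load(<<~DOC)
  type: //rec
  required:
    title: //str
    body: //str
  optional:
    subtitle: //str
DOC

# Illustrative posts with made-up content.
PLAIN_POST = YAML.safe_load("title: Hello\nbody: World\n")
CLICKBAIT_POST = YAML.safe_load("title: Hello\nsubtitle: You will not believe this\nbody: World\n")

# Key-level check only: every required key is present, and no key falls
# outside the union of required and optional keys. Value types are ignored.
def keys_acceptable?(schema, data)
  required = schema.fetch('required', {}).keys
  optional = schema.fetch('optional', {}).keys
  (required - data.keys).empty? && (data.keys - required - optional).empty?
end

puts keys_acceptable?(SCHEMA_WITH_SUBTITLE, PLAIN_POST)
puts keys_acceptable?(SCHEMA_WITH_SUBTITLE, CLICKBAIT_POST)
```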

A note on Rx core types

In the examples above, we used two types available in Rx: //rec and //str. In total, Rx supports 13 core types. The most notable are as follows:

//bool: accepts true and false.
//num: accepts any numerical value. This type can be parameterised with an exact value or a range.
//int: same as //num, but accepts integers only.
//str: accepts any string of arbitrary length. This type can be parameterised with an exact value, a minimum and a maximum length.
//arr: accepts a list of values, all of which must have the same type. This type can be parameterised with a minimum and maximum array length.
//seq: accepts a list of values that don’t necessarily have the same type.
//rec: accepts a map of keys to values, where the value for each key needs to conform to its own schema.
//map: accepts a map of keys to values, where the value for all keys needs to conform to the same schema.
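The parameterised forms of these types could be written as in the sketch below. The field names (`rating`, `slug`, and so on) are invented for this example, and the parameter names (`range`, `length`, `contents`, `values`) reflect my reading of the Rx specification, so please verify them against the documentation of the implementation you use. The Ruby code merely parses the schema and lists the type chosen for each field.

```ruby
require 'yaml'

# Hypothetical schema exercising the parameterised forms of several core types.
PARAMETERISED_SCHEMA = YAML.safe_load(<<~DOC)
  type: //rec
  required:
    rating:
      type: //int
      range: { min: 1, max: 5 }
    slug:
      type: //str
      length: { min: 1, max: 80 }
    tags:
      type: //arr
      contents: //str
      length: { min: 1 }
    coordinates:
      type: //seq
      contents: [//num, //num]
    metadata:
      type: //map
      values: //str
DOC

# List each field together with the core type it uses.
PARAMETERISED_SCHEMA['required'].each do |field, definition|
  puts "#{field}: #{definition['type']}"
end
```

Note the difference between the two list types: `tags` holds any number of strings, whereas `coordinates` holds exactly two numbers in fixed positions.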

With the help of these core types, we can already make our schema a bit more interesting. For example, we can add a required field ‘draft’ with a boolean value, indicating whether or not our blogpost is a draft. We can also allow a list of tags to be specified, as follows:

The following will now pass validations:

Whereas the following won’t, as the tags provided are not strings:
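Putting these pieces together, the extended schema and the two posts might look like the sketch below. The post contents are made up, and `conforms?` is a small stand-in checker that supports only the four core types used here; it is an assumption for illustration and not the Rx implementation.

```ruby
require 'yaml'

# Extended schema: a required boolean "draft" and an optional list of tags.
POST_SCHEMA = YAML.safe_load(<<~DOC)
  type: //rec
  required:
    title: //str
    body: //str
    draft: //bool
  optional:
    subtitle: //str
    tags:
      type: //arr
      contents: //str
DOC

VALID_POST = YAML.safe_load(<<~DOC)
  title: Hello
  body: World
  draft: true
  tags: [yaml, validation]
DOC

INVALID_POST = YAML.safe_load(<<~DOC)
  title: Hello
  body: World
  draft: true
  tags: [1, 2]
DOC

# Stand-in checker for //str, //bool, //arr and //rec only. A schema may be
# given in short form ("//str") or long form ({ "type" => "//arr", ... }).
def conforms?(schema, value)
  schema = { 'type' => schema } if schema.is_a?(String)
  case schema['type']
  when '//str' then value.is_a?(String)
  when '//bool' then [true, false].include?(value)
  when '//arr'
    value.is_a?(Array) && value.all? { |item| conforms?(schema['contents'], item) }
  when '//rec'
    return false unless value.is_a?(Hash)

    required = schema.fetch('required', {})
    known = required.merge(schema.fetch('optional', {}))
    (required.keys - value.keys).empty? &&
      value.all? { |key, item| known.key?(key) && conforms?(known[key], item) }
  else
    raise ArgumentError, "Unsupported type: #{schema['type']}"
  end
end

puts conforms?(POST_SCHEMA, VALID_POST)
puts conforms?(POST_SCHEMA, INVALID_POST)
```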

Moving beyond core types

The core types defined by Rx are nice, but it’s easy to come up with cases for which no core type is available. To overcome this limitation, schema languages allow for the specification of custom types by so-called “meta schemas”.

As an example of where this would come in handy, suppose we’re so happy with what the schema for posts allows us to achieve that we decide to specify our landing page with a YAML configuration as well. Our landing page consists of two parts: a sticky post at the top, and a list of posts below.

Let’s introduce a schema for that:
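A sketch of such a landing-page schema is shown below. The key names `sticky` and `posts` are assumptions for this example, and the post definition is kept to title, body and subtitle for brevity. The final line compares the two embedded post definitions to make the repetition explicit.

```ruby
require 'yaml'

# Landing page: one sticky post plus a list of posts, each spelled out in full.
LANDING_SCHEMA = YAML.safe_load(<<~DOC)
  type: //rec
  required:
    sticky:
      type: //rec
      required:
        title: //str
        body: //str
      optional:
        subtitle: //str
    posts:
      type: //arr
      contents:
        type: //rec
        required:
          title: //str
          body: //str
        optional:
          subtitle: //str
DOC

sticky_definition = LANDING_SCHEMA['required']['sticky']
list_item_definition = LANDING_SCHEMA['required']['posts']['contents']

# The same post definition appears twice in this one file.
puts sticky_definition == list_item_definition # prints "true"
```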

As you can see, there is significant duplication within this schema file. Moreover, we also have the same schema in the schema definition we validate our blogposts against. Let’s overcome this duplication by lifting the blogpost schema to a type.

To that end, we introduce a new type, blogpost, with the following meta-schema:

Note that we prepend the name of our type with ‘tag:’. This is an implementation detail required for Rx to recognise the type. With this set, we still need to add support for meta schemas to our validator, by adding the following snippet:

We can now adapt our schema for the landing page as follows:
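The sketch below shows the idea: a named type is shorthand for a schema that the validator has learned beforehand. The layout of the type definition (a `uri` plus a `schema` key) and the `expand` helper are assumptions made for this example. A real validator such as Rx registers learned types rather than inlining them, and this naive version does not handle types that refer to themselves.

```ruby
require 'yaml'

# Hypothetical layout for a custom type definition: a name plus its schema.
BLOGPOST_TYPE = YAML.safe_load(<<~DOC)
  uri: tag:blogpost
  schema:
    type: //rec
    required:
      title: //str
      body: //str
    optional:
      subtitle: //str
DOC

# The landing page schema, now referring to the blogpost type by name.
LANDING_PAGE_SCHEMA = YAML.safe_load(<<~DOC)
  type: //rec
  required:
    sticky: tag:blogpost
    posts:
      type: //arr
      contents: tag:blogpost
DOC

# Registry of learned types: type name => schema.
LEARNED_TYPES = { BLOGPOST_TYPE['uri'] => BLOGPOST_TYPE['schema'] }.freeze

# Recursively replaces references to learned type names with their schemas.
def expand(schema, learned)
  case schema
  when String
    learned.key?(schema) ? expand(learned[schema], learned) : schema
  when Hash
    return expand(schema['type'], learned) if learned.key?(schema['type'])

    schema.transform_values { |child| expand(child, learned) }
  when Array
    schema.map { |child| expand(child, learned) }
  else
    schema
  end
end

puts expand(LANDING_PAGE_SCHEMA, LEARNED_TYPES).to_yaml
```

The expanded result is equivalent to a landing-page schema with the post definition written out in full, but the definition now lives in a single place.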

… and validate it using our adapted validator:

~ ./validator2.rb page.yaml landing-page-schema.yaml blogpost.yaml
Learning meta schemas
    ✅  Learned meta scheme tag:blogpost
Loading schema to validate against
    ✅  Schema loaded successfully
Validating
    ✅  File is according to schema.

Wrap-up

I hope that this article has covered the basics you need to quickly write your own schemas and validate your YAML files. In my experience, schema validation is an extremely useful tool for protecting the boundaries between your system and the outside world.

Please note that Rx is by no means the only option available for all your validation needs. In fact, if this article excites you, you may want to check out JSON Schema, of which draft 7 is currently under review by the Internet Engineering Task Force. Happy validating!