The YAML doc from hell
written by Ruud van Asseldonk
published 11 January, 2023
For a data format, YAML is extremely complicated. It aims to be a human-friendly format, but in striving for that it introduces so much complexity, that I would argue it achieves the opposite result. Yaml is full of footguns and its friendliness is deceptive. In this post I want to demonstrate this through an example.
This post is a rant, and more opinionated than my usual writing.
Yaml is really, really complex
Json is simple. The entire json spec consists of six railroad diagrams. It's a simple data format with a simple syntax and that's all there is to it. Yaml on the other hand, is complex. So complex, that its specification consists of 10 chapters with sections numbered four levels deep and a dedicated errata page.
The json spec is not versioned. There were two changes to it in 2005 (the removal of comments, and the addition of scientific notation for numbers), but it has been frozen since, for almost two decades now. The yaml spec on the other hand is versioned. The latest revision is fairly recent, 1.2.2 from October 2021. Yaml 1.2 differs substantially from 1.1: the same document can parse differently under different yaml versions. We will see multiple examples of this later.
Json is so obvious that Douglas Crockford claims to have discovered it, not invented it. I couldn't find any reference for how long it took him to write up the spec, but it was probably hours rather than weeks. The change from yaml 1.2.1 to 1.2.2 on the other hand, was a multi-year effort by a team of experts:
This revision is the result of years of work by the new YAML language development team. Each person on this team has a deep knowledge of the language and has written and maintains important open source YAML frameworks and tools.
Furthermore this team plans to actively evolve yaml, rather than to freeze it.
When you work with a format as complex as yaml, it is difficult to be aware of all the features and subtle behaviors it has. There is an entire website dedicated to picking one of the 63 different multi-line string syntaxes. This means that it can be very difficult for a human to predict how a particular document will parse. Let's look at an example to highlight this.
The yaml document from hell
Consider the following document.
server_config:
  port_mapping:
    # Expose only ssh and http to the public internet.
    - 22:22
    - 80:80
    - 443:443
  serve:
    - /robots.txt
    - /favicon.ico
    - *.html
    - *.png
    - !.git  # Do not expose our Git repository to the entire world.
  geoblock_regions:
    # The legal team has not approved distribution in the Nordics yet.
    - dk
    - fi
    - is
    - no
    - se
  flush_cache:
    on: [push, memory_pressure]
    priority: background
  allow_postgres_versions:
    - 9.5.25
    - 9.6.24
    - 10.23
    - 12.13
Let's break this down section by section and see how the data maps to json.
Sexagesimal numbers
Let's start with the port_mapping section, something that you might find in a container runtime configuration. Depending on the parser, its first element loads as the number 1342 instead of the string "22:22".
Huh, what happened here? As it turns out, numbers from 0 to 59 separated by colons are sexagesimal (base 60) number literals, so 22:22 means 22 × 60 + 22 = 1342. This arcane feature was present in yaml 1.1, but silently removed from yaml 1.2, so the list element will parse as 1342 or "22:22" depending on which version your parser uses. The other two elements, 80:80 and 443:443, contain components above 59 and remain strings either way. Although yaml 1.2 is more than 10 years old by now, you would be mistaken to think that it is widely supported: the latest version of libyaml at the time of writing (which is used among others by PyYAML) implements yaml 1.1 and parses 22:22 as 1342.
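The arithmetic behind 1342 can be reproduced with the standard library alone. The sketch below is not PyYAML's code; it assumes the base-60 integer pattern from the yaml 1.1 type definitions, and the names SEXAGESIMAL_INT and parse_sexagesimal are made up for illustration.

```python
import re

# Base-60 integer pattern as given in the yaml 1.1 type definitions:
# a leading component, then one or more ":NN" groups with NN in 0..59.
SEXAGESIMAL_INT = re.compile(r"[-+]?[1-9][0-9_]*(?::[0-5]?[0-9])+")


def parse_sexagesimal(scalar):
    """Return the base-60 value of scalar, or None if it does not match."""
    if not SEXAGESIMAL_INT.fullmatch(scalar):
        return None
    sign = -1 if scalar.startswith("-") else 1
    value = 0
    for part in scalar.lstrip("+-").replace("_", "").split(":"):
        value = value * 60 + int(part)
    return sign * value


for element in ["22:22", "80:80", "443:443"]:
    print(element, "->", parse_sexagesimal(element))
```

Only 22:22 matches the pattern; the other two elements return None here, which corresponds to them staying plain strings.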
Anchors, aliases, and tags
The serve section is actually invalid as written.
Yaml allows you to create an anchor by adding an & and a name in front of a value, and then you can later reference that value with an alias: a * followed by the name. The elements *.html and *.png therefore parse as aliases, not as glob patterns. In this case no anchors are defined, so the aliases are invalid. Let's avoid them for now and see what happens.
Now the interpretation depends on the parser you are using. The element starting with ! is a tag. This feature is intended to enable a parser to convert the fairly limited yaml data types into richer types that might exist in the host language. A tag starting with ! is up to the parser to interpret, often by calling a constructor with the given name and providing it the value that follows after the tag. This means that loading an untrusted yaml document is generally unsafe, as it may lead to arbitrary code execution. (In Python, you can avoid this pitfall by using yaml.safe_load instead of yaml.load.) In our case above, PyYAML fails to load the document because it doesn't know the .git tag. Go's yaml package is less strict and returns an empty string.
The Norway problem
This pitfall is so infamous that it became known as "the Norway problem": in geoblock_regions, the unquoted country code no loads as false, so the list becomes "dk", "fi", "is", false, "se".
What is that false doing there? The literals off, no, and n, in various capitalizations (but not any capitalization!), are all false in yaml 1.1, while on, yes, and y are true. In yaml 1.2 these alternative spellings of the boolean literals are no longer allowed, but they are so pervasive in the wild that a compliant parser would have a hard time reading many documents. Go's yaml library therefore made the choice of implementing a custom variant somewhere in between yaml 1.1 and 1.2 that behaves differently depending on the context:
The yaml package supports most of YAML 1.2, but preserves some behavior from 1.1 for backwards compatibility. YAML 1.1 bools (yes/no, on/off) are supported as long as they are being decoded into a typed bool value. Otherwise they behave as a string.
Note that it only does that since version 3.0.0, which was released in May 2022. Earlier versions behave differently.
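The "various capitalizations, but not any capitalization" rule is easier to see written out. The sketch below hard-codes the spellings listed for the bool type in the yaml 1.1 type definitions; resolve_plain_scalar is a hypothetical helper, and individual parsers may implement only part of this list.

```python
# Spellings listed for the bool type in the yaml 1.1 type definitions.
YAML11_TRUE = {"y", "Y", "yes", "Yes", "YES", "true", "True", "TRUE", "on", "On", "ON"}
YAML11_FALSE = {"n", "N", "no", "No", "NO", "false", "False", "FALSE", "off", "Off", "OFF"}


def resolve_plain_scalar(scalar):
    """Return a bool for yaml 1.1 boolean spellings, else the string itself."""
    if scalar in YAML11_TRUE:
        return True
    if scalar in YAML11_FALSE:
        return False
    return scalar


regions = ["dk", "fi", "is", "no", "se"]
print([resolve_plain_scalar(region) for region in regions])

# Only the listed capitalizations count: "nO" stays a string.
print(repr(resolve_plain_scalar("nO")))
```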
Non-string keys
While keys in json are always strings, in yaml they can be any value, including booleans.
Combined with the previous feature of interpreting on as a boolean, this leads to a dictionary with true as one of the keys. It depends on the language how that maps to json, if at all. In Python it becomes the string "True". The key on is common in the wild because it is used in GitHub Actions. I would be really curious to know whether GitHub Actions' parser looks at "on" or true under the hood.
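How the boolean key ends up as text depends on the conversion used. A minimal sketch with only the standard library, assuming a yaml 1.1 parser has already produced the dictionary below: str() renders the key as "True", while the json module writes it as "true". Either way, the original spelling on is gone.

```python
import json

# What a yaml 1.1 parser hands to Python for the flush_cache mapping:
# the unquoted key `on` has become the boolean True.
flush_cache = {True: ["push", "memory_pressure"], "priority": "background"}

print(str(True))                # the key rendered with str()
print(json.dumps(flush_cache))  # the key rendered by the json module
print("on" in flush_cache)      # looking up the original spelling
```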
Accidental numbers
Leaving strings unquoted can easily lead to unintentional numbers. In allow_postgres_versions, 9.5.25 and 9.6.24 stay strings because they contain two dots, but 10.23 and 12.13 load as floating-point numbers.
Maybe the list is a contrived example, but imagine updating a config file that lists a single value of 9.6.24 and changing it to 10.23. Would you remember to add the quotes? What makes this even more insidious is that many dynamically typed applications implicitly convert the number to a string when needed, so your document works fine most of the time, except in some contexts it doesn't. For example, a Jinja template that branches on whether version is truthy accepts both version: "0.0" and version: 0.0, but it only takes the true-branch for the former.
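The difference comes down to truthiness. The template itself is not reproduced here, so the following is a plain-Python stand-in with a hypothetical render function; it relies on the non-empty string "0.0" being truthy while the float 0.0 is falsy, a rule Jinja shares with Python.

```python
def render(version):
    """Hypothetical stand-in for a template that branches on `version`."""
    if version:
        return f"Latest version: {version}"
    return "Version not specified"


print(render("0.0"))  # quoted in yaml: a non-empty string, truthy
print(render(0.0))    # unquoted in yaml: the float zero, falsy
```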
Runners-up
There is only so much I can fit into one artificial example. Some arcane yaml behaviors that did not make it in are directives, integers starting with 0 being octal literals (but only in yaml 1.1), ~ being an alternative spelling of null, and ? introducing a complex mapping key.
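The octal case is easy to quantify. A small sketch that uses Python's int to mimic both readings of the same scalar; no yaml parser is involved and the variable names are illustrative.

```python
scalar = "010"

as_yaml_1_1 = int(scalar, 8)   # a leading 0 means octal in yaml 1.1
as_yaml_1_2 = int(scalar, 10)  # yaml 1.2 writes octal as 0o10, so this is decimal

print(as_yaml_1_1, as_yaml_1_2)
```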
Syntax highlighting will not save you
You may have noticed that none of my examples have syntax highlighting enabled. Maybe I am being unfair to yaml, because syntax highlighting would highlight special constructs, so you can at least see that some values are not normal strings. However, due to multiple yaml versions being prevalent, and highlighters having different levels of sophistication, you can't rely on this. I'm not trying to nitpick here: Vim, my blog generator, GitHub, and Codeberg, all have a unique way to highlight the example document from this post. No two of them pick out the same subset of values as non-strings!
Templating yaml is a terrible, terrible idea
I hope it is clear by now that working with yaml is subtle at the very least. What is even more subtle is concatenating and escaping arbitrary text fragments in such a way that the result is a valid yaml document, let alone one that does what you expect. Add to this the fact that whitespace is significant in yaml, and the result is a format that is meme-worthily difficult to template correctly. I truly do not understand why tools based on such an error-prone practice have gained so much mindshare, when there is a safer, easier, and more powerful alternative: generating json.
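A minimal sketch of the difference, using only the Python standard library. The render_template and render_json helpers and the country key are hypothetical; the point is that string substitution leaves quoting to chance, whereas a json serializer always emits an unambiguous, escaped string.

```python
import json


def render_template(value):
    """Naive templating: paste the value into the document as-is."""
    return f"country: {value}"


def render_json(value):
    """Serialize instead: strings are always quoted and escaped."""
    return json.dumps({"country": value})


for value in ["se", "no", 'a "quoted" name: x']:
    print(render_template(value))
    print(render_json(value))
```

The templated line for no is the Norway problem all over again, and the last value produces a line with a second colon in it, while the serialized form keeps every value a string.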
Alternative configuration formats
I think the main reason that yaml is so prevalent despite its pitfalls, is that for a long time it was the only viable configuration format. Often we need lists and nested data, which rules out flat formats like ini. Xml is noisy and annoying to write by hand. But most of all, we need comments, which rules out json. (As we saw before, json had comments very early on, but they were removed because people started putting parsing directives in there. I think this is the right call for a serialization format, but it makes json unsuitable as a configuration language.) So if what we really need is the json data model but a syntax that allows comments, what are some of the options?
- Toml: Toml is similar to yaml in many ways: it has mostly the same data types; the syntax is not as verbose as json; and it allows comments. Unlike yaml it is not full of footguns, mostly because strings are always quoted, so you don't have values that look like strings but aren't. Toml is widely supported; you can probably find a toml parser for your favorite language. It's even in the Python standard library, unlike yaml! A weak spot of toml is deeply nested data.
- Json with comments, Json with commas and comments: There exist various extensions of json that extend it just enough to make it a usable config format without introducing too much complexity. Json with comments is probably the most widespread, as it is used as the config format for Visual Studio Code. The main downside of these is that they haven't really caught on (yet!), so they aren't as widely supported as json or yaml.
- A simple subset of yaml: Many of the problems with yaml are caused by unquoted things that look like strings but behave differently. This is easy to avoid: always quote all strings. (Indeed, you can tell that somebody is an experienced yaml engineer when they defensively quote all the strings.) We can choose to always use true and false rather than yes and no, and generally stay away from the arcane features. The challenge with this is that any construct not explicitly forbidden will eventually make it into your codebase, and I am not aware of any good tool that can enforce a sane yaml subset.
Generating json as a better yaml
Often the choice of format is not ours to make, and an application only accepts yaml. Not all is lost though, because yaml is a superset of json, so any tool that can produce json can be used to generate a yaml document.
Sometimes an application will start out with a need for just a configuration format, but over time you end up with many, many similar stanzas, and you would like to share parts between them, and abstract some repetition away. This tends to happen in, for example, Kubernetes and GitHub Actions. When the configuration language does not support abstraction, people often reach for templating, which is a bad idea for the reasons explained earlier. Proper programming languages, possibly domain-specific ones, are a better fit. Some of my favorites are Nix and Python:
- Nix: Nix is the language used by the Nix package manager. It was created for writing package definitions, but it works remarkably well as a configuration format (and indeed it is used to configure NixOS). Functions, let-bindings, and string interpolation make it powerful for abstracting repetitive configuration. The syntax is light like toml, and it can export to json or xml. It works well for simplifying a repetitive GitHub Actions workflow file, for example.
- Python: Json documents double as valid Python literals with minimal adaptation, and Python supports trailing commas and comments. It has variables and functions, powerful string interpolation, and json.dump built in. A self-contained Python file that prints json to stdout goes a long way!
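A sketch of what such a Python file might look like. The job names, the steps, and the make_job helper are invented for illustration and do not follow any real schema; the only point is that the repetition lives in a function and the output is plain json.

```python
import json
import sys


def make_job(name, python_version):
    """Hypothetical stanza that would otherwise be copy-pasted per version."""
    return {
        "name": f"{name} (python {python_version})",
        "steps": [
            {"run": "pip install -r requirements.txt"},
            {"run": "pytest"},  # trailing commas and comments are fine here
        ],
    }


config = {
    "jobs": [make_job("test", version) for version in ["3.10", "3.11", "3.12"]],
}

# Versions stay strings because json.dump quotes every str it writes.
json.dump(config, sys.stdout, indent=2)
```

Since the result is json, it can be fed directly to a tool that expects yaml.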
Finally there are some tools in this category that I haven't used enough to confidently recommend, but which deserve to be mentioned:
- Dhall: Dhall is like Nix, but with types. It is less widespread, and personally I find the built-in function names unwieldy.
- Cue: Like Dhall, Cue integrates type/schema information into the config format. Cue is a superset of json, but despite that, I find the files that actually use Cue's features to look foreign to me. Cue is on my radar to evaluate further, but I haven't encountered a problem where Cue looked like the most suitable solution yet.
- Hashicorp Configuration Language: I haven't used HCL extensively enough to have a strong opinion on it, but in the places where I worked with it, the potential for abstraction seemed more limited than what you can achieve with e.g. Nix.
Conclusion
Yaml aims to be a more human-friendly alternative to json, but with all of its features, it became such a complex format with so many bizarre and unexpected behaviors, that it is difficult for humans to predict how a given yaml document will parse. If you are looking for a configuration format, toml is a friendly format without yaml's footguns. For cases where you are stuck with yaml, generating json from a more suitable language can be a viable approach. Generating json also opens up the possibility for abstraction and reuse, in a way that is difficult to achieve safely by templating yaml.