title: The yaml document from hell
url: https://ruudvanasseldonk.com/2023/01/11/the-yaml-document-from-hell
hash_url: 3b05eb0d7d0409bcfd53b4cdf6c20daa
<article id="content" itemscope><p><span class="run-in">For a data format</span>, yaml is extremely complicated. It aims to be a human-friendly format, but in striving for that it introduces so much complexity, that I would argue it achieves the opposite result. Yaml is full of footguns and its friendliness is deceptive. In this post I want to demonstrate this through an example.</p><p>This post is a rant, and more opinionated than my usual writing.</p><h2 id="yaml-is-really-really-complex"><a href="#yaml-is-really-really-complex"></a>Yaml is really, really complex</h2><p>Json is simple. <a href="https://www.json.org/json-en.html">The entire json spec</a> consists of six railroad diagrams. It’s a simple data format with a simple syntax and that’s all there is to it. Yaml on the other hand, is complex. So complex, that <a href="https://yaml.org/spec/1.2.2/">its specification</a> consists of <em>10 chapters</em> with sections numbered four levels deep and a dedicated <a href="https://yaml.org/spec/1.2/errata.html">errata page</a>.</p><p>The json spec is not versioned. There were <a href="https://youtu.be/-C-JoyNuQJs?t=965">two changes</a> to it in 2005 (the removal of comments, and the addition of scientific notation for numbers), but it has been frozen since, for almost two decades now. The yaml spec on the other hand is versioned. The latest revision is fairly recent, 1.2.2 from October 2021. Yaml 1.2 differs substantially from 1.1: the same document can parse differently under different yaml versions. We will see multiple examples of this later.</p><p>Json is so obvious that Douglas Crockford claims <a href="https://www.youtube.com/watch?v=-C-JoyNuQJs">to have discovered it</a>, not invented it. I couldn’t find any reference for how long it took him to write up the spec, but it was probably hours rather than weeks.
The change from yaml 1.2.1 to 1.2.2 on the other hand, was <a href="https://yaml.com/blog/2021-10/new-yaml-spec/">a multi-year effort by a team of experts</a>:</p><blockquote><p>This revision is the result of years of work by the new <abbr>YAML</abbr> language development team. Each person on this team has a deep knowledge of the language and has written and maintains important open source <abbr>YAML</abbr> frameworks and tools.</p></blockquote><p>Furthermore this team plans to actively evolve yaml, rather than to freeze it.</p><p>When you work with a format as complex as yaml, it is difficult to be aware of all the features and subtle behaviors it has. There is <a href="https://yaml-multiline.info/">an entire website</a> dedicated to picking one of <a href="https://stackoverflow.com/a/21699210/135889">the 63 different multi-line string syntaxes</a>. This means that it can be very difficult for a human to predict how a particular document will parse. Let’s look at an example to highlight this.</p><h2 id="the-yaml-document-from-hell"><a href="#the-yaml-document-from-hell"></a>The yaml document from hell</h2><p>Consider the following document.</p><pre><code>server_config:
  port_mapping:
    # Expose only ssh and http to the public internet.
    - 22:22
    - 80:80
    - 443:443
  serve:
    - /robots.txt
    - /favicon.ico
    - *.html
    - *.png
    - !.git # Do not expose our Git repository to the entire world.
  geoblock_regions:
    # The legal team has not approved distribution in the Nordics yet.
    - dk
    - fi
    - is
    - no
    - se
  flush_cache:
    on: [push, memory_pressure]
    priority: background
  allow_postgres_versions:
    - 9.5.25
    - 9.6.24
    - 10.23
    - 12.13</code></pre><p>Let’s break this down section by section and see how the data maps to json.</p><h2 id="sexagesimal-numbers"><a href="#sexagesimal-numbers"></a>Sexagesimal numbers</h2><p>Let’s start with something that you might find in a container runtime configuration:</p><pre><code>port_mapping:
  - 22:22
  - 80:80
  - 443:443</code></pre><div class="sourceCode" id="cb3"><pre class="sourceCode json"><code class="sourceCode json"><span id="cb3-1"><span class="fu">{</span><span class="dt">"port_mapping"</span><span class="fu">:</span> <span class="ot">[</span><span class="dv">1342</span><span class="ot">,</span> <span class="st">"80:80"</span><span class="ot">,</span> <span class="st">"443:443"</span><span class="ot">]</span><span class="fu">}</span></span></code></pre></div><p>Huh, what happened here? As it turns out, numbers from 0 to 59 separated by colons are <a href="https://yaml.org/spec/1.1/#id858600">sexagesimal (base 60) number literals</a>, so <code>22:22</code> is read as 22 × 60 + 22 = 1342, while <code>80:80</code> and <code>443:443</code> stay strings because their second components exceed 59. This arcane feature was present in yaml 1.1, but silently removed from yaml 1.2, so the list element will parse as <code>1342</code> or <code>"22:22"</code> depending on which version your parser uses. Although yaml 1.2 is more than 10 years old by now, you would be mistaken to think that it is widely supported: the latest version of libyaml at the time of writing (which is used among others by <a href="https://pypi.org/project/PyYAML/6.0/">PyYAML</a>) implements yaml 1.1 and parses <code>22:22</code> as <code>1342</code>.</p><p>The following snippet is actually invalid:</p><pre><code>serve:
  - /robots.txt
  - /favicon.ico
  - *.html
  - *.png
  - !.git</code></pre><p>Yaml allows you to create an <em>anchor</em> by adding an <code>&amp;</code> and a name in front of a value, and then you can later reference that value with an <em>alias</em>: a <code>*</code> followed by the name. In this case no anchors are defined, so the aliases are invalid. Let’s avoid them for now and see what happens.</p><pre><code>serve:
  - /robots.txt
  - /favicon.ico
  - !.git</code></pre><div class="sourceCode" id="cb6"><pre class="sourceCode json"><code class="sourceCode json"><span id="cb6-1"><span class="fu">{</span><span class="dt">"serve"</span><span class="fu">:</span> <span class="ot">[</span><span class="st">"/robots.txt"</span><span class="ot">,</span> <span class="st">"/favicon.ico"</span><span class="ot">,</span> <span class="st">""</span><span class="ot">]</span><span class="fu">}</span></span></code></pre></div><p>Now the interpretation depends on the parser you are using. The element starting with <code>!</code> is a <a href="https://yaml.org/spec/1.2.2/#3212-tags">tag</a>. This feature is intended to enable a parser to convert the fairly limited yaml data types into richer types that might exist in the host language. A tag starting with <code>!</code> is up to the parser to interpret, often by calling a constructor with the given name and providing it the value that follows after the tag. This means that <strong>loading an untrusted yaml document is generally unsafe</strong>, as it may lead to arbitrary code execution. (In Python, you can avoid this pitfall by using <code>yaml.safe_load</code> instead of <code>yaml.load</code>.) In our case above, PyYAML fails to load the document because it doesn’t know the <code>.git</code> tag. Go’s yaml package is less strict and returns an empty string.</p><h2 id="the-norway-problem"><a href="#the-norway-problem"></a>The Norway problem</h2><p>This pitfall is so infamous that it became known as “<a href="https://hitchdev.com/strictyaml/why/implicit-typing-removed/">the Norway problem</a>”:</p><pre><code>geoblock_regions:
  - dk
  - fi
  - is
  - no
  - se</code></pre><div class="sourceCode" id="cb8"><pre class="sourceCode json"><code class="sourceCode json"><span id="cb8-1"><span class="fu">{</span><span class="dt">"geoblock_regions"</span><span class="fu">:</span> <span class="ot">[</span><span class="st">"dk"</span><span class="ot">,</span> <span class="st">"fi"</span><span class="ot">,</span> <span class="st">"is"</span><span class="ot">,</span> <span class="kw">false</span><span class="ot">,</span> <span class="st">"se"</span><span class="ot">]</span><span class="fu">}</span></span></code></pre></div><p>What is that <code>false</code> doing there? The literals <code>off</code>, <code>no</code>, and <code>n</code>, in various capitalizations (<a href="https://yaml.org/type/bool.html">but not any capitalization</a>!), are all <code>false</code> in yaml 1.1, while <code>on</code>, <code>yes</code>, and <code>y</code> are true. In yaml 1.2 these alternative spellings of the boolean literals are no longer allowed, but they are so pervasive in the wild that a compliant parser would have a hard time reading many documents. Go’s yaml library therefore <a href="https://github.com/go-yaml/yaml/tree/v3.0.1#compatibility">made the choice</a> of implementing a custom variant somewhere in between yaml 1.1 and 1.2 that behaves differently depending on the context:</p><blockquote><p>The yaml package supports most of <abbr>YAML</abbr> 1.2, but preserves some behavior from 1.1 for backwards compatibility. <abbr>YAML</abbr> 1.1 bools (yes/no, on/off) are supported as long as they are being decoded into a typed bool value. Otherwise they behave as a string.</p></blockquote><p>Note that it only does that since version 3.0.0, which was released in May 2022.
<a href="https://github.com/go-yaml/yaml/commit/b145382a4cda47600eceb779844b8090b5807c4f">Earlier versions behave differently</a>.</p><h2 id="non-string-keys"><a href="#non-string-keys"></a>Non-string keys</h2><p>While keys in json are always strings, in yaml they can be any value, including booleans.</p><pre><code>flush_cache:
  on: [push, memory_pressure]
  priority: background</code></pre><div class="sourceCode" id="cb10"><pre class="sourceCode json"><code class="sourceCode json"><span id="cb10-1"><span class="fu">{</span></span>
<span id="cb10-2">  <span class="dt">"flush_cache"</span><span class="fu">:</span> <span class="fu">{</span></span>
<span id="cb10-3">    <span class="dt">"True"</span><span class="fu">:</span> <span class="ot">[</span><span class="st">"push"</span><span class="ot">,</span> <span class="st">"memory_pressure"</span><span class="ot">]</span><span class="fu">,</span></span>
<span id="cb10-4">    <span class="dt">"priority"</span><span class="fu">:</span> <span class="st">"background"</span></span>
<span id="cb10-5">  <span class="fu">}</span></span>
<span id="cb10-6"><span class="fu">}</span></span></code></pre></div><p>Combined with the previous feature of interpreting <code>on</code> as a boolean, this leads to a dictionary with <code>true</code> as one of the keys. It depends on the language how that maps to json, if at all. In Python it becomes the string <code>"True"</code>. The key <code>on</code> is common in the wild because <a href="https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#on">it is used in GitHub Actions</a>. I would be really curious to know whether GitHub Actions’ parser looks at <code>"on"</code> or <code>true</code> under the hood.</p><h2 id="accidental-numbers"><a href="#accidental-numbers"></a>Accidental numbers</h2><p>Leaving strings unquoted can easily lead to unintentional numbers.</p><pre><code>allow_postgres_versions:
  - 9.5.25
  - 9.6.24
  - 10.23
  - 12.13</code></pre><div class="sourceCode" id="cb12"><pre class="sourceCode json"><code class="sourceCode json"><span id="cb12-1"><span class="fu">{</span><span class="dt">"allow_postgres_versions"</span><span class="fu">:</span> <span class="ot">[</span><span class="st">"9.5.25"</span><span class="ot">,</span> <span class="st">"9.6.24"</span><span class="ot">,</span> <span class="fl">10.23</span><span class="ot">,</span> <span class="fl">12.13</span><span class="ot">]</span><span class="fu">}</span></span></code></pre></div><p>Maybe the list is a contrived example, but imagine updating a config file that lists a single value of 9.6.24 and changing it to 10.23. Would you remember to add the quotes? What makes this even more insidious is that many dynamically typed applications implicitly convert the number to a string when needed, so your document works fine most of the time, except in some contexts it doesn’t. For example, the following Jinja template accepts both <code>version: "0.0"</code> and <code>version: 0.0</code>, but it only takes the true-branch for the former.</p><pre><code>{% if version %}
Latest version: {{ version }}
{% else %}
Version not specified
{% endif %}</code></pre><h2 id="runners-up"><a href="#runners-up"></a>Runners-up</h2><p>There is only so much I can fit into one artificial example. Some arcane yaml behaviors that did not make it in are <a href="https://yaml.org/spec/1.2.2/#68-directives">directives</a>, integers starting with <code>0</code> being octal literals (but only in yaml 1.1), <code>~</code> being an alternative spelling of <code>null</code>, and <code>?</code> introducing a <a href="https://yaml.org/spec/1.2.2/#example-mapping-between-sequences">complex mapping key</a>.</p><h2 id="syntax-highlighting-will-not-save-you"><a href="#syntax-highlighting-will-not-save-you"></a>Syntax highlighting will not save you</h2><p>You may have noticed that none of my examples have syntax highlighting enabled. Maybe I am being unfair to yaml, because syntax highlighting would highlight special constructs, so you can at least see that some values are not normal strings. However, due to multiple yaml versions being prevalent, and highlighters having different levels of sophistication, you can’t rely on this. I’m not trying to nitpick here: Vim, my blog generator, GitHub, and Codeberg, all have a unique way to highlight the example document from this post. No two of them pick out the same subset of values as non-strings!</p><h2 id="templating-yaml-is-a-terrible-terrible-idea"><a href="#templating-yaml-is-a-terrible-terrible-idea"></a>Templating yaml is a terrible, terrible idea</h2><p>I hope it is clear by now that working with yaml is subtle at the very least. What is even more subtle is concatenating and escaping arbitrary text fragments in such a way that the result is a valid yaml document, let alone one that does what you expect. Add to this the fact that whitespace is significant in yaml, and the result is a format that is <a href="https://twitter.com/memenetes/status/1600898397279502336">meme-worthily</a> difficult to template correctly.
I truly do not understand why <a href="https://helm.sh/docs/chart_best_practices/templates/">tools based on such an error-prone practice</a> have gained so much mindshare, when there is a safer, easier, and more powerful alternative: generating json.</p><h2 id="alternative-configuration-formats"><a href="#alternative-configuration-formats"></a>Alternative configuration formats</h2><p>I think the main reason that yaml is so prevalent despite its pitfalls, is that for a long time it was the only viable configuration format. Often we need lists and nested data, which rules out flat formats like ini. Xml is noisy and annoying to write by hand. But most of all, we need comments, which rules out json. (As we saw before, json had comments very early on, but they were removed because people started putting parsing directives in there. I think this is the right call for a serialization format, but it makes json unsuitable as a configuration language.) So if what we really need is the json data model but a syntax that allows comments, what are some of the options?</p><ul><li><a href="https://toml.io/en/"><strong>Toml</strong></a> — Toml is similar to yaml in many ways: it has mostly the same data types; the syntax is not as verbose as json; and it allows comments. Unlike yaml it is not full of footguns, mostly because strings are always quoted, so you don’t have values that look like strings but aren’t. Toml is widely supported, you can probably find a toml parser for your favorite language. It’s even in the Python standard library — unlike yaml! 
A weak spot of toml is deeply nested data.</li><li><a href="https://code.visualstudio.com/docs/languages/json#_json-with-comments"><strong>Json with comments</strong></a>, <a href="https://nigeltao.github.io/blog/2021/json-with-commas-comments.html"><strong>Json with commas and comments</strong></a> — There exist various extensions of json that extend it just enough to make it a usable config format without introducing too much complexity. Json with comments is probably the most widespread, as it is used as the config format for Visual Studio Code. The main downside of these is that they haven’t really caught on (yet!), so they aren’t as widely supported as json or yaml.</li><li><strong>A simple subset of yaml</strong> — Many of the problems with yaml are caused by unquoted things that look like strings but behave differently. This is easy to avoid: always quote all strings. (Indeed, you can tell that somebody is an experienced yaml engineer when they defensively quote all the strings.) We can choose to always use <code>true</code> and <code>false</code> rather than <code>yes</code> and <code>no</code>, and generally stay away from the arcane features. The challenge with this is that any construct not explicitly forbidden will eventually make it into your codebase, and I am not aware of any good tool that can enforce a sane yaml subset.</li></ul><h2 id="generating-json-as-a-better-yaml"><a href="#generating-json-as-a-better-yaml"></a>Generating json as a better yaml</h2><p>Often the choice of format is not ours to make, and an application only accepts yaml. Not all is lost though, because yaml is a superset of json, so any tool that can produce json can be used to generate a yaml document.</p><p>Sometimes an application will start out with a need for just a configuration format, but over time you end up with many many similar stanzas, and you would like to share parts between them, and abstract some repetition away. 
This tends to happen in for example Kubernetes and GitHub Actions. When the configuration language does not support abstraction, people often reach for templating, which is a bad idea for the reasons explained earlier. Proper programming languages, possibly domain-specific ones, are a better fit. Some of my favorites are Nix and Python:</p><ul><li><a href="https://nixos.org/manual/nix/stable/language/index.html"><strong>Nix</strong></a> — Nix is the language used by the <a href="https://nixos.org/">Nix package manager</a>. It was created for writing package definitions, but it works remarkably well as a configuration format (and indeed it is used to configure NixOS). Functions, let-bindings, and string interpolation make it powerful for abstracting repetitive configuration. The syntax is light like toml, and it can <a href="https://nixos.org/manual/nix/stable/language/builtins.html#builtins-toJSON">export to json</a> or xml. It works well for simplifying a repetitive GitHub Actions workflow file, for example.</li><li><a href="https://www.python.org/"><strong>Python</strong></a> — Json documents double as valid Python literals with minimal adaptation, and Python supports trailing commas and comments. It has variables and functions, powerful string interpolation, and <a href="https://docs.python.org/3/library/json.html?highlight=json%20dump#json.dump"><code>json.dump</code></a> built in. A self-contained Python file that prints json to stdout goes a long way!</li></ul><p>Finally there are some tools in this category that I haven’t used enough to confidently recommend, but which deserve to be mentioned:</p><ul><li><a href="https://dhall-lang.org/"><strong>Dhall</strong></a> — Dhall is like Nix, but with types. It is less widespread, and personally I find the built-in function names unwieldy.</li><li><a href="https://cuelang.org/"><strong>Cue</strong></a> — Like Dhall, Cue integrates type/schema information into the config format. 
Cue is a superset of json, but despite that, I find the files that actually use Cue’s features to look foreign to me. Cue is on my radar to evaluate further, but I haven’t encountered a problem where Cue looked like the most suitable solution yet.</li><li><a href="https://github.com/hashicorp/hcl"><strong>Hashicorp Configuration Language</strong></a>: I haven’t used <abbr>HCL</abbr> extensively enough to have a strong opinion on it, but in the places where I worked with it, the potential for abstraction seemed more limited than what you can achieve with e.g. Nix.</li></ul><h2 id="conclusion"><a href="#conclusion"></a>Conclusion</h2><p>Yaml aims to be a more human-friendly alternative to json, but with all of its features, it became such a complex format with so many bizarre and unexpected behaviors, that it is difficult for humans to predict how a given yaml document will parse. If you are looking for a configuration format, toml is a friendly format without yaml’s footguns. For cases where you are stuck with yaml, generating json from a more suitable language can be a viable approach. Generating json also opens up the possibility for abstraction and reuse, in a way that is difficult to achieve safely by templating yaml.</p></article>
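<p>To make the "generate json" suggestion concrete, here is a minimal sketch in Python, assuming a hypothetical config shaped like the example document from this post. The helper names <code>port</code>, <code>build_config</code>, and <code>render</code> are made up for illustration and are not part of any tool. Because <code>json.dumps</code> always double-quotes strings and keys, values such as <code>"no"</code>, <code>"22:22"</code>, and <code>"10.23"</code> should reach a yaml parser as strings rather than as a boolean, a sexagesimal integer, or a float.</p>

```python
import json


def port(host: int, container: int) -> str:
    """Format a port mapping as an explicit string, e.g. "22:22"."""
    return f"{host}:{container}"


def build_config() -> dict:
    """Build a hypothetical config mirroring the example document.

    Every value that should be a string is a Python str, so nothing
    depends on how a yaml parser guesses the type of a bare scalar.
    """
    return {
        "server_config": {
            # A comprehension replaces three near-identical lines.
            "port_mapping": [port(p, p) for p in (22, 80, 443)],
            "geoblock_regions": ["dk", "fi", "is", "no", "se"],
            "flush_cache": {
                "on": ["push", "memory_pressure"],
                "priority": "background",
            },
            "allow_postgres_versions": ["9.5.25", "9.6.24", "10.23", "12.13"],
        }
    }


def render(config: dict) -> str:
    """Serialize to json; the output is meant to be fed to a yaml consumer."""
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    print(render(build_config()))
```

<p>Running the file prints a json document that can be redirected to a <code>.yaml</code> path for an application that only accepts yaml. Comments live in the Python source, and repetition can be factored out with ordinary functions instead of text templating.</p>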