2023-09-17

JSON vs YAML vs TOML

A detailed guide to the differences between three of the most popular data serialization formats.


TL;DR

An introduction to JSON, YAML, and TOML.

The JSON (JavaScript Object Notation) and YAML (YAML Ain't Markup Language) formats both came about in the early 2000s as alternatives to XML for data storage and transport. They're both less verbose than XML, making them easier to read and write. JSON's syntax follows on from JavaScript and is so simple that Douglas Crockford claims to have discovered it rather than invented it. YAML, on the other hand, follows Python's indentation-style syntax and has evolved to become a superset of JSON. Going far beyond JSON, YAML supports extra types and constructs, enabling complex data structures like acyclic graphs. These features, in addition to some interesting type choices, lead to a lot of unnecessary complexity in YAML files for most use cases. So, in 2013 a guy named Tom released TOML (Tom's Obvious Minimal Language): a language that's almost as readable as YAML but does away with a lot of the complications. Syntactically it's based upon .ini files but with the bonus of a formal specification.

If you're about to write an application for web, desktop or mobile, chances are you'll encounter at least one of the three formats. Throw in a CI/CD pipeline to deploy it and you'll be up to two. It's inevitable you'll come across them so let's dive into the key differences between them.

General Structure

All three formats use key/value pairs to build base types, such as integers and strings, into more complex hierarchical structures. The idea of each is that it will be deserialized into some known struct or into an object/dictionary/hashmap. When specifying a key, JSON restricts you to quoted strings. TOML also lets you write bare keys, including ones made up only of digits, although it still interprets them as strings. YAML allows genuinely numeric keys and then goes a step further, supporting complex keys such as sequences and maps. The example below demonstrates how these different structures are represented in JSON, YAML, and TOML.

JSON
{
    "key": "value1",
    "123": "value2",
    "escaped": "value3",
    "fruit": {
        "apple": "tasty",
        "orange": "yuck"
    }
}  
    
YAML
key: value1
123: value2
"escaped": "value3"
fruit:
  apple: tasty
  orange: yuck

# Block sequence as a key. No equivalent in JSON or TOML
? - West Ham
  - Brighton & Hove
: - 2023-09-21

# Flow sequence as a key. No equivalent in JSON or TOML
? [ Manchester United,
    Bayern Munich ]
: [ 2021-07-02, 2021-08-12,
    2021-08-14 ]
    
TOML
key = "value1"
123 = "value2"
"escaped" = "value3"
fruit.apple = "tasty"
fruit.orange = "yuck"
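
To make the point about JSON keys concrete, here is a minimal sketch using Python's standard `json` module (my choice of parser, other languages may behave differently). It shows that a numeric-looking key is still a string after parsing, and that this particular encoder turns an integer key into a string on the way out.

```python
import json

# JSON object keys are always strings, even when they look numeric.
parsed = json.loads('{"key": "value1", "123": "value2"}')
key_types = {k: type(k).__name__ for k in parsed}

# Python's json module coerces an int key to a string when encoding,
# so a round trip does not preserve the original key type.
round_tripped = json.loads(json.dumps({123: "value2"}))

print(key_types)
print(round_tripped)
```

If your application relies on integer keys, that conversion has to happen in your own code after parsing.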
    

YAML has yet another advanced feature. If you wish to reuse a block of data throughout your file you can use anchors and aliases to reference the same data. In JSON and TOML you'll unfortunately have to copy and paste.

JSON
{
  "common-config": {
    "port": 8765,
    "label": "MyPort"
  },
  "server-one": {
    "image": "postgres",
    "config": {
      "port": 8765,
      "label": "MyPort"
    }
  },
  "server-two": {
    "image": "node",
    "config": {
      "port": 8765,
      "label": "MyPort"
    }
  }
} 
    
YAML
common-config: &common
  port: 8765
  label: MyPort

server-one:
  image: postgres
  config: *common

server-two:
  image: node
  config: *common
    
TOML
[common-config]
port = 8765
label = "MyPort"

[server-one]
image = "postgres"

[server-one.config]
port = 8765
label = "MyPort"

[server-two]
image = "node"

[server-two.config]
port = 8765
label = "MyPort"
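
If you're stuck with JSON and want to avoid the copy and paste, one workaround is to generate the file from code. The following is only a sketch in Python: `build_config` is a hypothetical helper of my own, and the keys simply mirror the example above.

```python
import copy
import json

common = {"port": 8765, "label": "MyPort"}

def build_config(common_config):
    # Hypothetical helper: each server gets its own copy of the shared block,
    # which is what the JSON and TOML files above spell out by hand.
    return {
        "common-config": common_config,
        "server-one": {"image": "postgres", "config": copy.deepcopy(common_config)},
        "server-two": {"image": "node", "config": copy.deepcopy(common_config)},
    }

config = build_config(common)
text = json.dumps(config, indent=2)
print(text)
```

The shared block is now defined once in code, but the generated file still repeats it three times, which is exactly what YAML's anchors avoid.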
    

YAML also supports inheritance, naming variables for reuse, strict ordering of maps, and multiple "documents" in a single file. It's honestly such a vast, complex format that I can't cover it all here. Now that we've gone over the generic, structural differences, let's move on to more granular ones.

The elephant in the room: comments

JSON doesn't support comments. When you're using JSON to transport data from client to server, a lack of comments isn't an issue at all. The issues lie almost entirely with configuration and manifest files. When writing out the configuration for an application, a comment with the rationale behind certain decisions can be invaluable. This is especially true in the case of a manifest file where dependencies and their versions are being declared. People do like to find workarounds though, such as using a "__comment" key to store their comments, but we all know this is a terrible idea. It was this sort of misuse which caused comments to be removed from JSON in the first place. Early on comments were supported, but they were being used to store parsing directives, a move which Douglas Crockford thought would ruin the portability of the language, and so he removed them. Very annoying. However, if you're tied to JSON for any reason, then JSONC, JSON5, and Hjson all support comments as well as being generally more human friendly, though they are not as widely supported yet.

JSON
{"no": "comments",
 "bad": "luck"}  
    
YAML
# Comments begin with a hash
supports: comments # More comments!
    
TOML
# Comments begin with a hash
supports = "comments" # More comments!
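
As a quick check of how strict this is in practice, here is a small sketch with Python's standard `json` module (one parser among many, and `parses_as_json` is just a helper name I made up). A document with a `//` comment is rejected outright.

```python
import json

def parses_as_json(text):
    """Return True if the standard-library parser accepts the text."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

plain = '{"no": "comments", "bad": "luck"}'
with_comment = '{"no": "comments" // explain why\n}'

print(parses_as_json(plain))
print(parses_as_json(with_comment))
```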
    

Data types

While YAML and TOML support similar types, JSON supports a very limited set. This was by design and its simplicity is a likely reason for its popularity, but since the early 2000s a lot has changed. YAML and TOML both attempt to address what's lacking and the following sections discuss why that's both a good and a bad idea. But before diving into that, let's discuss how each format tackles typing itself.

JSON and TOML use explicit typing. When you write a string, you surround it with quotes, making it unambiguously a string. This isn't true for YAML. The data type is inferred from the content's appearance, leading to a few quirks which can cause some pervasive bugs. To counter this, YAML has a tagging system which can be used to explicitly declare data as a certain type. To make a tag you simply write the type, prefixed with exclamation marks, in front of your value, for example "!!str hi". What happens then if you want to store a string which starts with an exclamation mark? Well, remember to wrap it in quotation marks, otherwise you'll get some unwanted behavior!

---
a-string-number: !!str 1234
not-a-date: !!str 2023-03-03

# Ensure strings starting with ! are escaped.
do-not-display: "!.private"

Let's demonstrate some of the problems caused by implicit typing starting with the well known "Norway Problem".

Booleans and the Norway problem

JSON, YAML, and TOML all obviously support booleans. YAML 1.1, though, also supports boolean equivalents such as 'on' or 'off' and 'yes' or 'no'. Write these in your YAML file and send them off to be parsed and they'll be automatically converted to booleans. This becomes an issue if you've written Norway's two-letter country code 'NO' without quoting it. This behavior was removed in YAML 1.2, but popular parsers such as PyYAML still mainly support 1.1, so the problem is still present in a large number of codebases. Both my VS Code highlighter and highlight.js, which is highlighting the code you see below, still highlight it as a boolean!

countries:
    - UK
    - EE
    - NO
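
To see which unquoted values are at risk, here is a rough sketch in Python that checks a scalar against the list of boolean spellings in the YAML 1.1 bool type. It is not a YAML parser, `looks_like_yaml11_bool` is a name I made up, and individual parsers may recognise a slightly different set.

```python
import re

# The plain scalars that the YAML 1.1 bool type resolves to booleans.
YAML11_BOOL = re.compile(
    r"y|Y|yes|Yes|YES|n|N|no|No|NO"
    r"|true|True|TRUE|false|False|FALSE"
    r"|on|On|ON|off|Off|OFF"
)

def looks_like_yaml11_bool(scalar):
    """True if an unquoted scalar matches the YAML 1.1 bool pattern."""
    return YAML11_BOOL.fullmatch(scalar) is not None

countries = ["UK", "EE", "NO"]
at_risk = [c for c in countries if looks_like_yaml11_bool(c)]
print(at_risk)
```

Running a check like this over the values in a config file is a cheap way to find entries that need quoting.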

Sexagesimal

Continuing on the theme of YAML adding unnecessary complexity, 1.1 supported sexagesimal (base 60) numbers. If you write numbers under 60 separated by colons, such as 3:25:45, they are interpreted as a single integer, here 12345 (3 × 3600 + 25 × 60 + 45). This becomes a problem when you're writing your port configuration for Docker and you write something like,

redis:
  build:
    context:
      dockerfile: Dockerfile-redis
      port: 12:12

and your port gets parsed as 732. Make sure you're escaping your ports with quotation marks or better still, using a language where it's enforced. Much like the different boolean interpretations Sexagesimal support has been cleaned up in YAML 1.2. But again, what use is this if it still hasn't been widely adopted?
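
The arithmetic behind those two numbers can be sketched in a few lines of Python. This only illustrates the base 60 conversion: `sexagesimal_to_int` is a made-up helper that ignores signs and does no validation, so it is not how any particular parser implements it.

```python
def sexagesimal_to_int(text):
    """Interpret colon-separated groups as base 60 digits, as YAML 1.1 does."""
    value = 0
    for group in text.split(":"):
        value = value * 60 + int(group)
    return value

print(sexagesimal_to_int("3:25:45"))  # 12345
print(sexagesimal_to_int("12:12"))    # 732
```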

Multiline strings

Let's stop tormenting YAML for a moment and focus on some of the missing components of JSON. If you want a multiline string in JSON, you're out of luck. Of course you can throw a '\n' into a string to split its content over multiple lines, but in the format itself there's no valid way to write a string across multiple lines. This isn't an issue for data transport, but when it comes to configuration and manifest files a long string can become very difficult to read. Both TOML and YAML support various types of multiline strings and clear the problem up, as do JSON5 and Hjson.

JSON
{
  "multiline-string": "A multiline\n string"
}  
    
YAML
multiline-string: | 
    A multiline
    string
    
TOML
multiline-string = """
A multiline
string
"""
    
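
Here is a short sketch of the JSON side using Python's standard `json` module, which is my choice of parser for illustration. The escaped form parses into two lines, while a raw line break inside the string is rejected.

```python
import json

# The escaped form from the JSON example above: one line in the file,
# two lines once parsed.
doc = '{"multiline-string": "A multiline\\n string"}'
value = json.loads(doc)["multiline-string"]

# A raw line break inside a JSON string is rejected by the parser.
broken = '{"multiline-string": "A multiline\n string"}'
try:
    json.loads(broken)
    raw_newline_ok = True
except json.JSONDecodeError:
    raw_newline_ok = False

print(value.splitlines())
print(raw_newline_ok)
```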

Datetime

JSON doesn't support datetimes at all, TOML supports RFC 3339 formatted datetimes, and YAML supports ISO 8601 datetimes. ISO 8601 is the most comprehensive set, supporting durations, stand-alone year numbers and so on. RFC 3339 supports a subset of ISO 8601 which must represent a complete date, time, or datetime. It also supports some alternative formatting outside of ISO 8601. Here's an excellent site that succinctly visualizes the different datetime formats. Personally, I don't like the idea of leaving the parsing of a datetime from a string up to the implementation of the serializer, as JSON does. I'd much rather have a defined format with which to represent it. TOML is a good middle ground between the two and defines datetimes much more clearly in its specification than YAML does. If your data contains a lot of datetimes and you want it to be easier to read and to leave less room for misunderstanding, I'd definitely choose TOML. Unfortunately there are still going to be some discrepancies between different implementations in how datetime parsing is handled, so instead of leaving you with a lovely example, I'm going to tell you to read the docs of your chosen serializer.

Floats

For floating point numbers, there's a small but very annoying issue with JSON: no support for inf or NaN. If the result of a calculation is undefined, for example tan(90 deg), you're stuck with representing it as null or a string containing "NaN". Null generally signifies that an object or value is empty or yet to be assigned. NaN however stands for "Not a Number" and is used to represent mathematical operations which have no real number answer, for instance dividing 0 by 0 or the square root of -1. This is the first issue that affects JSON for data transport and not just a dig at people using it for configuration files. JSON5, Hjson, TOML and YAML all support NaN in addition to inf.

As for the format of floating point numbers, JSON and TOML do not allow a leading or trailing decimal point, so .12 and 12. are both invalid. I think this strictness is better: the tiny sacrifice in utility is more than made up for with readability. 0.12 is much easier to read than .12, especially when you're skimming down a file. YAML supporting this just seems unnecessary. YAML and TOML do however allow explicit "+" signs which, in files with lots of negative numbers, is extremely helpful.

JSON
{
    "inf1": "inf",
    "inf2": "+inf",
    "inf3": "-inf",
    "nan1": "nan",
    "nan2": "+nan",
    "nan3": "-nan",
    "dec1": 2,
    "dec2": 0.23,
    "dec3": 0.1
}
    
YAML
inf1: .inf
inf2: +.inf
inf3: -.inf

nan1: .nan
# YAML has no signed NaN: +.nan and -.nan would be read as strings

dec1: +2
dec2: .23
dec3: +.1
    
TOML
inf1 = inf
inf2 = +inf
inf3 = -inf

nan1 = nan
nan2 = +nan
nan3 = -nan

dec1 = +2
dec2 = +0.23
dec3 = +0.1
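
The NaN gap is easy to trip over in practice. As a sketch, this is how Python's standard `json` module (again just my choice of library) handles it. By default it writes a bare `NaN` token that isn't valid JSON, and with `allow_nan=False` it refuses to encode the value at all.

```python
import json
import math

# By default Python's json module emits the bare token NaN,
# which is not valid JSON and may be rejected by other parsers.
lenient = json.dumps({"result": math.nan})

# With allow_nan=False the encoder refuses instead.
try:
    json.dumps({"result": math.nan}, allow_nan=False)
    strict_ok = True
except ValueError:
    strict_ok = False

# The two usual workarounds mentioned above: null or a string.
as_null = json.dumps({"result": None})
as_string = json.dumps({"result": "NaN"})

print(lenient, strict_ok, as_null, as_string)
```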
    

Hexadecimal & Octal

YAML and TOML support hexadecimal and octal notation to represent integers and JSON does not. Once again, for data transfer I really don't believe this is an 'issue', as a simple conversion can be done if needed. In configuration files octal can be handy to represent Linux file permissions, and hexadecimal can be handy for writing memory addresses, IP addresses and colours. The extra readability offered by this simple feature is great.

Binary

JSON is used to send a lot of binary data though it doesn't actually support it. The binary has to be escaped and embedded in a string, normally Base64 encoded. It's possible that someone may be sending you Base85 encoded data, or any other encoding for that matter, so you do need to be careful about interpreting it. TOML and YAML have in-built support for binary; however, TOML only allows simple binary such as integers whereas with YAML type tags you can define data as being binary. Again, there's no formal specification for how it should be encoded but they do recommend Base64.

JSON
{
  "hex": 2023,
  "oct": 2023,
  "bin": 2023,
  "bin-encoded-str": "ImhpIg=="
}
    
YAML
hex: 0x7E7
oct: 0o3747
bin: 0b11111100111

bin-encoded-str: ImhpIg==
bin-encoded-typed: !!binary ImhpIg==
    
TOML
hex = 0x7E7
oct = 0o3747
bin = 0b11111100111

bin-encoded-str = "ImhpIg=="
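
To tie the examples together, here is a small Python sketch (standard library only) confirming that the three literals above are the same number, that JSON rejects a hex literal, and what the encoded string contains if you assume it is Base64.

```python
import base64
import json

# The three literals from the YAML and TOML examples all denote 2023.
values = [int("0x7E7", 16), int("0o3747", 8), int("0b11111100111", 2)]

# JSON has no such literals, so a hex number in a document is a syntax error.
try:
    json.loads('{"hex": 0x7E7}')
    hex_ok = True
except json.JSONDecodeError:
    hex_ok = False

# Binary travels as text; here the payload is assumed to be Base64.
payload = base64.b64decode("ImhpIg==")

print(values, hex_ok, payload)
```

Nothing in the JSON document itself tells the receiver that the string is Base64, so that assumption has to be agreed out of band.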
    

Versioning

Something which has cropped up, especially with YAML, is versioning. YAML has different versions which support different features. 1.2 addresses some common pitfalls but adoption is still slow. TOML too is versioned, but it's backwards compatible and to my knowledge no features have been removed, only added. Now that it's hit 1.0 it's unlikely to undergo any major changes, which is great because it's the widely adopted version. On the other hand, JSON isn't versioned. The spec hasn't been changed in almost 20 years and it's unlikely it ever will change now. More likely than not a new format will come along and replace it, like JSONC, JSON5, or Hjson.

The fact that YAML 1.2 has removed features, albeit bad features, is a red flag as this has the potential to break hundreds of older implementations. I guess we'll just have to see how adoption pans out. Like the JSON adaptations, StrictYAML addresses a lot of YAML issues but still leaves YAML more complex than TOML.

Conclusions

Clearly JSON is a simple but excellent format. Using it for data transfer now and in the foreseeable future is a great choice. Supporting minimal types leads to quick parsing and the ability to easily spot where a mistake has been made. Conversely, using JSON for configuration files is a disaster. Lack of comments alone is enough to come to this conclusion, but no support for advanced types and a lot of syntactic noise put the final nails in the coffin. That leaves you with YAML or TOML for writing your configuration files. YAML is by far the more complex of the two and includes features which leave you open to making easy mistakes that could take you quite some time to find and rectify. Furthermore, YAML 1.2, which patches some footguns in the language, has still not been widely adopted despite being nearly 15 years old. This leaves you with TOML for your configuration files: a good rigid language with a clear specification and strictly essential features. Though when things get complex, TOML's syntax can make it difficult to see hierarchies in your data, so make good use of indentation, which the parser will ignore. I don't think the perfect configuration language has been written yet and I'm not sure it ever will be, as in the end it comes down to personal preference. And before you think about going and creating your own configuration language, I'll leave you with this.



Competing Standards Meme.
