YAML Gotchas

At Genius.com, we use YAML to create fixture files for testing DB dependencies. YAML is a great way to easily store many kinds of data in a text file, especially database entries. Despite the incredible ease with which we can write fixtures using YAML, we have found that occasionally YAML does not work quite the way we would expect because of how it parses some data types. Below are several of the “YAML Gotchas” we have run into and a couple more we found while researching data types. Hopefully these can help you avoid some of the debugging that we’ve gone through and illuminate some of YAML’s more interesting features. You can find a full definition of all of the YAML types on YAML’s website.

Note that we’ve come across most of these using the YAML parser Syck for PHP. Keep in mind that although YAML has a specification, not all implementations follow it exactly.

Booleans

Let’s say you have a survey stored in the database where one column can hold strings, either Yes, No, or Maybe. Your YAML file will look something like this:

survey:
    recommendAFriend: Yes

After loading this file, you may expect that within survey, you would have a key-value mapping of recommendAFriend to the string Yes. However, you will find that the value Yes has been interpreted by YAML as the boolean value true. In fact, there are many values that YAML will parse into booleans:

y, Y, yes, Yes, YES
n, N, no, No, NO
true, True, TRUE
false, False, FALSE
on, On, ON
off, Off, OFF

If you want to use any of the above as strings, explicitly tell YAML to parse them as strings, either by quoting them or by casting them with the !!str tag:

survey:
    recommendAFriend1: 'Yes'
    recommendAFriend2: "Yes"
    recommendAFriend3: !!str Yes
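
To catch these values before they reach a fixture, a small check can flag them. The following is a minimal Python sketch, not part of any YAML library: the names YAML_11_BOOLEANS, is_ambiguous_boolean, and quote_if_needed are hypothetical, and the set simply mirrors the list above rather than the behavior of any particular parser.

```python
# Plain scalars that the list above says YAML will load as booleans.
YAML_11_BOOLEANS = {
    "y", "Y", "yes", "Yes", "YES",
    "n", "N", "no", "No", "NO",
    "true", "True", "TRUE",
    "false", "False", "FALSE",
    "on", "On", "ON",
    "off", "Off", "OFF",
}


def is_ambiguous_boolean(scalar: str) -> bool:
    """Return True if an unquoted scalar would load as a boolean, not a string."""
    return scalar.strip() in YAML_11_BOOLEANS


def quote_if_needed(scalar: str) -> str:
    """Wrap a risky scalar in single quotes so that it stays a string."""
    return f"'{scalar}'" if is_ambiguous_boolean(scalar) else scalar


for answer in ("Yes", "No", "Maybe"):
    print(f"recommendAFriend: {quote_if_needed(answer)}")
```

Running a check like this over hand-written fixture values is cheaper than discovering a stray boolean in a string column later.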

Times and colons

In this survey, you also ask the user what time they usually go to sleep, which you will store in a MySQL time column.

survey:
    timeSleep: 01:30:00

You may expect this to parse the string 01:30:00 as the value for timeSleep, but instead you will find that it’s the integer 5400. This is because YAML will parse numbers separated by colons as sexagesimal (base 60). This can become even stranger when you try to insert this value into a MySQL database, because MySQL will interpret this integer as a time in the HHMMSS format or even MMSS if it makes sense as a time. In the above example, 5400 will go into the database as 00:54:00. Again, this possible problem can be solved by ensuring that you explicitly cast your times as strings so that they don’t mistakenly get interpreted as integers.
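
The arithmetic behind that surprise can be traced by hand. This Python sketch is illustrative only: parse_sexagesimal and mysql_time_from_integer are hypothetical helpers that mimic the two interpretations described above, and they do not call into Syck or MySQL.

```python
def parse_sexagesimal(text: str) -> int:
    """Read colon-separated digits as a single base-60 integer."""
    value = 0
    for part in text.split(":"):
        value = value * 60 + int(part)
    return value


def mysql_time_from_integer(number: int) -> str:
    """Read the decimal digits of an integer as HHMMSS."""
    hours, remainder = divmod(number, 10000)
    minutes, seconds = divmod(remainder, 100)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


loaded = parse_sexagesimal("01:30:00")    # 1*3600 + 30*60 + 0
print(loaded)                             # 5400
print(mysql_time_from_integer(loaded))    # 00:54:00
```

The value is transformed twice, once by the base-60 reading and once by the HHMMSS reading, which is why the stored time bears no obvious relation to the one in the fixture.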

Octal

Starting a number with 0 will cause it to be parsed as octal, as long as it contains no digits greater than 7.

survey:
    customerCode: 01234567

The value for customerCode will parse to the integer 342391.
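
The conversion is easy to verify. The sketch below is written in Python and follows only the rule stated above; parse_leading_zero_integer is a hypothetical helper, and real parsers may treat inputs such as 0899 differently.

```python
def parse_leading_zero_integer(text: str) -> int:
    """Read a digit string as octal when it starts with 0 and uses only 0-7."""
    is_octal = (
        len(text) > 1
        and text.startswith("0")
        and all(ch in "01234567" for ch in text)
    )
    return int(text, 8) if is_octal else int(text, 10)


print(parse_leading_zero_integer("01234567"))  # 342391
print(parse_leading_zero_integer("1234567"))   # 1234567
```

Quoting the value in the fixture, as with the boolean examples, keeps a code like 01234567 intact as a string.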

Underscores

Though it isn’t mentioned in the main specification, YAML allows the use of underscores for digit grouping, which can make visually interpreting large numbers easier.

survey:
    phoneNumber: 650_212_2050

This feature is not handled equally by all YAML implementations: PHP’s Syck parser reads the value of the phoneNumber key above as the string 650_212_2050.
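
For comparison, a parser that does support digit grouping would drop the underscores and produce an integer. A minimal Python sketch of that behavior, using a hypothetical parse_grouped_integer helper rather than any real parser's API:

```python
def parse_grouped_integer(text: str):
    """Drop grouping underscores; return the text unchanged if it is not a number."""
    digits = text.replace("_", "")
    return int(digits) if digits.isdigit() else text


print(parse_grouped_integer("650_212_2050"))  # 6502122050
print(parse_grouped_integer("snake_case"))    # snake_case
```

Either result is probably wrong for a phone number, so this is another value that is safer quoted.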

Maximum integer size

Remember that, depending on which implementation and language you use, integers may be bounded by the maximum integer size. For example, on a 32-bit machine, any values larger than 2,147,483,647 may be silently converted to that value. This is particularly important if you use a mixture of 32-bit and 64-bit machines.
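
Python integers do not overflow, so the sketch below only simulates the clamping described above. The helpers fits_in_int32 and saturate_to_int32 are hypothetical, and the sample value is the phone number from the previous section read as an integer.

```python
INT32_MAX = 2**31 - 1   # 2,147,483,647
INT32_MIN = -(2**31)


def fits_in_int32(value: int) -> bool:
    """Return True if the value survives a signed 32-bit integer unchanged."""
    return INT32_MIN <= value <= INT32_MAX


def saturate_to_int32(value: int) -> int:
    """Clamp a value to the signed 32-bit range, as a silent conversion would."""
    return max(INT32_MIN, min(value, INT32_MAX))


phone_number = 6502122050
print(fits_in_int32(phone_number))      # False
print(saturate_to_int32(phone_number))  # 2147483647
```

A check like fits_in_int32 in a fixture test makes the failure loud instead of silent.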

Null

According to YAML’s specification, ~, null, Null, NULL, and an empty value are all interpreted as null, both as values and as keys. With Syck in PHP, null keys and their corresponding values are silently ignored because PHP cannot have null as a key. However, with Ruby’s YAML module, null keys will be parsed.
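
The difference between the two behaviors can be modeled in a few lines. This Python sketch does not run either parser: resolve_null and the sample pairs are hypothetical, and the two dictionaries only imitate what the paragraph above describes for Ruby and for Syck in PHP.

```python
# Tokens that the specification treats as null ("" stands for an empty value).
YAML_NULLS = {"~", "null", "Null", "NULL", ""}


def resolve_null(scalar: str):
    """Map a null token to None; leave every other scalar as it is."""
    return None if scalar in YAML_NULLS else scalar


# Hypothetical key/value pairs as they might appear in a fixture.
pairs = [("~", "orphan"), ("name", "null"), ("city", "Los Angeles")]

# Null keys kept, as described for Ruby's YAML module.
keeps_null_keys = {resolve_null(k): resolve_null(v) for k, v in pairs}

# Null keys and their values dropped, as described for Syck in PHP.
drops_null_keys = {k: v for k, v in keeps_null_keys.items() if k is not None}

print(keeps_null_keys)
print(drops_null_keys)
```

The second dictionary has lost the "orphan" entry entirely, which is the kind of silent data loss that is hard to trace back to a fixture file.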

Conclusions

While sometimes helpful, the automatic translation of data types in the YAML specification can be perplexing if you aren’t well versed in what those special data types are. To save frustration, it is safest to explicitly mark all data types or at least be familiar with the common pitfalls mentioned above. For sanity’s sake, when debugging applications remember that even complicated things like YAML parsers can be sneaky behind the scenes.

  • http://olabini.com/blog Ola Bini

    As someone who has written several YAML parsers, let me first correct your last statement: YAML parsers are NOT simple things. … =)

    Very good list of pitfalls. Of course, these pitfalls are generally only pitfalls for handwritten YAML – since the implementations are all pretty good at roundtripping. One good way of guarding against things like that is to have roundtripping tests. Meaning, have different example data, load them in, write them out again and see if they generate equivalent output. If they don’t, you probably have a pitfall like this.

    In more advanced YAML parsers, you can generally plug in different algorithms for the handling of data types. YAML 1.1 doesn’t explicitly require the above types. In JvYAMLb you can avoid things like this by using the BaseConstructorImpl instead of SafeConstructorImpl or ConstructorImpl. The latter two define construction of different types such as the boolean problems you mentioned.

  • http://www.genius.com Ryan Ausanka-Crues

    Ola, you’re totally right. We’ve updated the post to reflect reality: while YAML is really easy to use, the parsing of YAML is extremely complicated. I’ve never written a YAML parser (thanks to wonderful people like you) but looking at the YAML spec overwhelms me so I can only imagine how complicated it must be to write a parser that even approaches the standard set by the 1.2 spec. Thanks for your correction!

    In regards to your comment about the pitfalls only being a problem when writing YAML by hand, that’s very true and exactly what we are doing when we write DB fixtures. Your suggestion about roundtripping is a good one. Unfortunately, the wisdom of using roundtripping is usually only gained after beating your head against a cash box trying to figure out an issue.

  • http://endofline.wordpress.com Adam Sanderson

    Not sure how this will come out, but if you’re using an editor like TextMate, you can write a pretty handy preview command like this:


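    #!/usr/bin/env ruby
    require 'yaml'
    require 'pp'

    # Pretty-print a value with blank lines before and after it.
    def print_code(code)
      $stdout.write("\n\n")
      PP.pp(code, $stdout)
      $stdout.write("\n\n")
    end

    begin
      yaml = YAML.load(STDIN.read)
      $stdout.write "Valid YAML"
      print_code(yaml)
    rescue ArgumentError, Psych::SyntaxError => ex
      # Syck reports bad input as ArgumentError; Psych raises Psych::SyntaxError.
      $stdout.write "Invalid YAML"
      print_code(ex.message)
    end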

    I found it helps enormously when coming across edge cases.

  • Oren Ben-Kiki

    This made it to reddit, so I responded there as follows:

    As a member of the “YAML triumvirate”, I can say that while the problems described in the article are very real, they result more from the current state of YAML implementations than from the YAML specification itself.

    In the “bad old days”, the YAML spec did not specify any default types. We did have a “type repository” to help people define their types in a consistent way, but none of the types were mandatory (well, except the core mapping, sequence and string).

    However implementers – long-suffering and greatly appreciated for their patience – did use these types by default all over the place. This turned out to be not-quite-a-good thing, and anyone using an old implementation (syck is practically ancient) is SOL.

    Luckily for all of us, JSON burst on the scene a few years after. This allowed us to modify the spec to include a non-controversial default set of recommended types. We also fixed all the little nasty JSON incompatibility bugs. The result is YAML 1.2, which is 99.99% backward compatible with YAML 1.1 (syck implements 1.0 which is truly ancient and should not be used :-). The 1.2 spec is available at http://yaml.org/. It is being actively reviewed right now to become a final “formal” spec somewhen in the next few weeks.

    Under the 1.2 spec, any well behaved YAML parser will accept any JSON data without a hitch (and will not suffer from the annoying issues listed in the article). The spec still allows for additional/custom types to be used, if so desired. This is something often lost on people (implementers included). The set of types used in a YAML document (its “schema”) need not be the same everywhere. The spec does lay down guidelines as to recommended core types that everyone should play nice with, and we intend to use the type repository to align additional optional custom types between implementations.

    A well behaved YAML parser allows configuring the set of used types (and their formats), so if someone wants (say) automatic recognition of “localtime” style dates, he can have it – without forcing someone else to complain that all the date-looking strings in his data are not loaded as strings.

    YAML 1.2 implementations should roll out “soon” (e.g., Xitology is maintaining libyaml which is almost 1.2-compliant). We realize this is small comfort for someone using syck, but that’s the best we can do with limited resources. Anyone want to volunteer to replace syck with libyaml? :-)

    As to the need and usefulness of yet-another-data-format, YAML’s goals are different from XML and JSON. JSON is a least-common-denominator machine-oriented wire format. Like XML, it is only nominally readable. Also, like XML, you can’t serialize arbitrary types without some additional magic (e.g., XML’s SOAP).

    YAML is first and foremost a readable format (that is, a format allowing one to write readable files). It also allows serializing arbitrary data (e.g., graphs with cycles) without requiring an additional definition layer. Yes, the combination makes the YAML spec intimidating (as the author, let me tell you writing it is even more intimidating :-). It is almost as bad as Perl’s syntax – except that YAML’s syntax does have a formal definition, gnarly though it may be, and a reference implementation based on it (the YamlReference Haskell package).

    It turns out that for many use cases (configuration files in particular), YAML is a great tool, warts and all – and we believe YAML 1.2 is as wart-free as humanly possible at this point.

    In short, we feel that YAML does have its place, and being a superset of JSON is a good place to be. It’s just a matter for the tools to catch up, that’s all. Also, as usual, YMMV, pick the right tool for the job, and all that.

  • Rob Desbois

    One that got me for ages when hand-editing a file is that tab characters are never allowed as indentation. Bit of a pain when your editor converts leading spaces to tabs, took me a while to find that one :-(

    –rob