Serialization is one of those things you can easily do without, until one day you suddenly really need it. That’s pretty much how it went with me. I was happily using and learning Ruby for months before I ever ran into a situation where serializing a few objects really would have made my life easier. Even then I avoided looking into it. After all, you can very easily convert the important data from an object into a string and write that out to a file. Then, when you need to, you just read the file, parse the string and recreate the object. What could be simpler? Of course, it could be much simpler indeed, especially when you’re dealing with a deep hierarchy of objects. I think that being weaned on languages like Java, you come to expect operations like serialization to be non-trivial. Don’t get me wrong, it is not really difficult in Java, but neither is it simple, and if you want your serialized object to be human-readable, then you’re into third-party library land, where things can get easier or harder depending on your needs. Suffice it to say, bad experiences in the past don’t fill you with a lot of enthusiasm for the future.
When I started looking into serialization in Ruby, I fully expected to have to look into third-party solutions. Surely the serialization mechanisms built into the language couldn’t possibly fit my needs that easily. As usual, I was pleasantly surprised. Now, like that proverbial hammer, serialization seems to be useful all the time :). Anyway, I’ll let you judge for yourself. Let’s take a look at the best and most common options you have when it comes to serialization with Ruby.
Human-Readable Objects
Ruby has two object serialization mechanisms built right in. One serializes into a human-readable format, the other into a binary format. I will look into the binary one shortly, but for now let’s focus on the human-readable one. Pretty much any object you create in Ruby can be serialized into YAML format with next to no effort on your part. Let’s make some objects:
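A minimal sketch of such a YAML round trip, assuming a small class `A` whose `to_s` prints in the same style as the output shown here. The class body and the `load_trusted_yaml` helper are illustrative assumptions. Recent versions of Psych restrict `YAML.load` to basic types by default, so the helper falls back to `YAML.unsafe_load` where it exists; that is only reasonable for data you wrote yourself.

```ruby
require 'yaml'

# Hypothetical example class; the name A is borrowed from the output format.
class A
  def initialize(string, number)
    @string = string
    @number = number
  end

  def to_s
    "In A: #{@string}, #{@number}"
  end
end

# Assumption: the YAML text is trusted because we produced it ourselves.
# Newer Psych versions refuse to build custom classes through YAML.load,
# so prefer unsafe_load when the installed version provides it.
def load_trusted_yaml(text)
  YAML.respond_to?(:unsafe_load) ? YAML.unsafe_load(text) : YAML.load(text)
end

original   = A.new("hello world", 5)
serialized = YAML.dump(original)   # a multi-line, human-readable string
restored   = load_trusted_yaml(serialized)

puts original   # prints "In A: hello world, 5"
puts restored   # prints "In A: hello world, 5"
```

Note that YAML does need `require 'yaml'`, whereas Marshal is available without requiring anything.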
In A: hello world, 5
In A: hello world, 5
As you can see from the output, the objects before and after serialization with Marshal are the same. You don’t even need to require anything :). The thing to watch out for when writing multiple Marshalled objects to the same file is the record separator. Since you’re writing binary data, it is quite possible to end up with a newline byte somewhere inside a record, which will stuff everything up if you rely on newlines when you read the objects back in. So two rules of thumb to remember are:
- don’t use puts when writing Marshalled objects to a file (use print instead); this way you avoid the extra newline that puts appends
- use a record separator other than newline; you can make up anything unlikely (_if you scroll down a bit you will see that I used ‘——’ as a separator_)
The disadvantage of Marshal is the fact that its output is not human-readable. The advantage is its speed.
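The two rules of thumb above can be sketched as follows. The class `A`, the `SEPARATOR` value and the temporary file path are all assumptions made for illustration. One of the objects deliberately contains a newline to show why newline would be a poor record separator.

```ruby
require 'tmpdir'

# Hypothetical example class.
class A
  def initialize(string, number)
    @string = string
    @number = number
  end

  def to_s
    "In A: #{@string}, #{@number}"
  end
end

# Assumed separator: any byte sequence unlikely to occur in the data.
SEPARATOR = "---_---"

objects = [A.new("hello world", 5), A.new("line one\nline two", 6)]
path = File.join(Dir.tmpdir, "marshal_records_#{Process.pid}.bin")

# Binary mode, and print rather than puts, so no stray newline is appended.
File.open(path, "wb") do |file|
  objects.each do |object|
    file.print(Marshal.dump(object))
    file.print(SEPARATOR)
  end
end

# Read everything back, split on the separator, and rebuild each object.
restored = File.binread(path).split(SEPARATOR).map { |record| Marshal.load(record) }
File.delete(path)

restored.each { |object| puts object }
```

A separator is still only a convention: if the same bytes ever show up inside a record, the split will break. `Marshal.load` also accepts an IO object and reads one object per call, which is another way to store several objects in one file.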
Which One To Choose?
It’s simple: if you need to be able to read your serialized data, then you have to go with one of the human-readable formats (YAML or JSON). I’d go with YAML purely because you don’t need to do any work to get your custom objects to serialize properly, and the fact that it serializes each object as a multiline string is not such a big deal (as I showed above). The only times I would go with JSON (aside from the whole wide support and sending it over the wire deal) are when you need to be able to easily edit your data by hand, or when you need human-readable data and you have a lot of data to deal with (_see benchmarks below_).
If you don’t really need to be able to read your data, then always go with Marshal, especially if you have a lot of data.
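The extra work JSON asks of custom objects, mentioned above, might look something like this. It is a sketch under assumptions: the class `A` and its `from_json` class method are hypothetical names chosen here, and the object is simply mapped to a hash of its fields.

```ruby
require 'json'

# Hypothetical example class that knows how to turn itself into JSON and back.
class A
  def initialize(string, number)
    @string = string
    @number = number
  end

  # Called when the object is converted to JSON; delegates to Hash#to_json.
  def to_json(*args)
    { "string" => @string, "number" => @number }.to_json(*args)
  end

  # Assumed helper: rebuilds an A from the JSON text produced by to_json.
  def self.from_json(text)
    data = JSON.parse(text)
    new(data["string"], data["number"])
  end

  def to_s
    "In A: #{@string}, #{@number}"
  end
end

original = A.new("hello world", 5)
line     = original.to_json
restored = A.from_json(line)

puts line       # prints {"string":"hello world","number":5}
puts restored   # prints "In A: hello world, 5"
```

Each object ends up as a single line of text, which makes one record per line workable and keeps the file easy to edit by hand.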
Here is a situation I commonly have to deal with. I have a CSV file, or some other kind of data file, and I want to read it, parse it and create an object per row, or at least a hash per row, to make the data easier to deal with. What I like to do is read this CSV file, create my objects and serialize them to a file at the same time using Marshal. This way I can operate on the whole data set, or parts of it, by simply reading the serialized objects in, and it is orders of magnitude faster than reading the CSV file again.
Let’s do some benchmarks. I will create 500000 objects (a relatively small set of data) and serialize them all to a file using all three methods.
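If you want to try a comparison like this yourself, here is a scaled-down sketch. It is not the benchmark that produced the numbers in this post: it assumes plain hashes instead of custom objects, uses 5,000 records instead of 500000, and serializes the whole array in memory rather than writing record by record to a file. The `RECORD_COUNT` constant and the `load_trusted_yaml` helper are illustrative names.

```ruby
require 'benchmark'
require 'json'
require 'yaml'

RECORD_COUNT = 5_000  # assumed size, far smaller than the run in this post

# One hash per "row", as you might build from a parsed data file.
records = Array.new(RECORD_COUNT) { |i| { "name" => "row #{i}", "value" => i } }

# Assumption: we only ever load YAML that this script produced itself.
def load_trusted_yaml(text)
  YAML.respond_to?(:unsafe_load) ? YAML.unsafe_load(text) : YAML.load(text)
end

# Each format maps to a pair of lambdas: [serialize, deserialize].
formats = {
  "YAML"    => [->(data) { YAML.dump(data) },     ->(text) { load_trusted_yaml(text) }],
  "JSON"    => [->(data) { JSON.generate(data) }, ->(text) { JSON.parse(text) }],
  "Marshal" => [->(data) { Marshal.dump(data) },  ->(text) { Marshal.load(text) }]
}

results = {}
formats.each do |name, (dumper, loader)|
  serialized = nil
  restored   = nil
  dump_time = Benchmark.realtime { serialized = dumper.call(records) }
  load_time = Benchmark.realtime { restored = loader.call(serialized) }
  results[name] = restored
  puts format("%-8s dump: %.4f sec  load: %.4f sec  array size: %d",
              name, dump_time, load_time, restored.size)
end
```

The absolute timings will depend on your machine, your Ruby version and the shape of your data, so treat any numbers as relative rather than exact.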
YAML: Time: 45.9780583381653 sec
JSON: Time: 5.44697618484497 sec
Marshal: Time: 2.77714705467224 sec
What about deserializing all the objects:
YAML: Array size: 500000 Time: 19.4334170818329 sec
JSON: Array size: 500000 Time: 18.5326402187347 sec
Marshal: Array size: 500000 Time: 14.6655268669128 sec
As you can see, it is significantly faster to serialize objects when you’re using Marshal, although JSON is only about 2 times slower. YAML gets left in the dust, taking more than 16 times as long as Marshal. When deserializing, the differences are not as apparent, although Marshal is still the clear winner. The more data you have to deal with, the more telling these results will be. So, for pure speed, choose Marshal. For speed and human readability, choose JSON (at the expense of having to add methods to custom objects). For human readability with relatively small sets of data, go with YAML.
That’s pretty much all you need to know, but it is not all I have to say on serialization. One of the more interesting (and cool) features of Ruby is how useful blocks can be in many situations, so you will inevitably, eventually run into a situation where you may want to serialize a block and this is where you will find trouble! We will deal with block serialization issues and what (if anything) you can do about it in a subsequent post. More Ruby soon :).
Images by Andrew Mason and just.Luc