Merging Ruby Hashes And Bang Method Usage

The other day something curious happened that made me question how I use bang methods (and their non-bang cousins). Here is how I normally look at using bang methods in Ruby: always use the non-bang version, unless you have a really, really good reason not to. It just seems somehow safer this way. When you’re passing an object (String, Array etc.) around as a parameter, don’t modify the parameter itself unless that is exactly the behaviour you’re looking for. Use methods that don’t modify the original object, but rather work on a copy, and then return the copy to the calling code. Maybe I am being too asinine about this, do come and let me know if I am :). Regardless, I have trained myself to avoid bang methods unless I really need them, but a few days ago I needed to merge some hashes.
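To make the convention concrete, here is a minimal sketch of the difference. The helper names `shout` and `shout!` are made up purely for illustration; they are not from any library.

```ruby
# Hypothetical helpers, for illustration only.

# Non-bang style: works on a copy and leaves the caller's object alone.
def shout(text)
  text.upcase + "!"
end

# Bang style: modifies the object that was passed in.
def shout!(text)
  text.upcase!   # changes the caller's string in place
  text << "!"
end

greeting = "hello".dup   # dup so we have an unfrozen string to work with
loud = shout(greeting)
puts greeting            # the original is untouched
puts loud

shout!(greeting)
puts greeting            # the original itself has changed
```

With the non-bang helper the caller decides what happens to the result; with the bang helper the change leaks back into whatever code still holds a reference to `greeting`.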

You see, I had a file full of smaller hashes (serialized) that I wanted to read in and merge into one big hash. As usual, I tried using the non-bang version of the merge method and it just crawled. Let’s recreate this code:

```ruby
require 'benchmark'

main_hash = {}
time = Benchmark.realtime do
  (1..10000).each do |number|
    simple_hash = {}
    simple_hash[number.to_s] = number
    # merge returns a new hash each time, so we reassign main_hash
    main_hash = main_hash.merge(simple_hash)
  end
end
puts "Time elapsed #{time} seconds"
puts "Main hash key count: #{main_hash.keys.count}"
```

This produces:

Time elapsed 13.7789120674133 seconds
Main hash key count: 10000

As you can see, it took 13 seconds to merge only 10000 tiny hashes into one. I had significantly more than ten thousand and mine were somewhat larger. The surprising thing was that when you replace the merge method with its bang equivalent, the output produced is:

Time elapsed 28.7179946899414 milliseconds
Main hash key count: 10000
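For reference, the bang variant of the loop might look like the sketch below. The only real change from the code above is `merge!`, which updates `main_hash` in place, so the reassignment is no longer needed. The timing is multiplied by 1000 here on the assumption that this is how the milliseconds figure was produced.

```ruby
require 'benchmark'

main_hash = {}
time = Benchmark.realtime do
  (1..10000).each do |number|
    simple_hash = {}
    simple_hash[number.to_s] = number
    # merge! adds the entries to main_hash itself instead of building a new hash
    main_hash.merge!(simple_hash)
  end
end
# Benchmark.realtime returns seconds, so convert for display
puts "Time elapsed #{time * 1000} milliseconds"
puts "Main hash key count: #{main_hash.keys.count}"
```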

That’s 28 MILLIseconds, which is about 500 times faster. Now admittedly, this use case was probably a good candidate for using the bang version of merge from the start, but that is not immediately obvious, and such a major disparity in performance made me question my non-bang convention. Was this the case with all the bang/non-bang pairs of methods in Ruby? Should I perhaps forget about the non-bang methods altogether and switch to bang exclusively? Well, the first question is easy to answer, since we can test it out empirically. Let’s pick another bang/non-bang pair and see if we get the same performance disparity. We’ll use Array flatten:

```ruby
require 'benchmark'

array = []
time = Benchmark.realtime do
  (1..1000).each do |number|
    string = "string#{number}"
    array << string
    inner_array = []
    (1..50).each do |inner_number|
      inner_array << inner_number
    end
    array << inner_array
    # flatten returns a new, flattened array, so we reassign array
    array = array.flatten
  end
end
puts "Time elapsed #{time} seconds"
```

When using the non-bang version, the output is:

Time elapsed 5.95429491996765 seconds

But when we switch to the bang version (array.flatten!):

Time elapsed 6.41582012176514 seconds

Hmm, there is almost no difference; in fact the non-bang version is a little faster. I tried similar code with String’s reverse bang/non-bang pair and the results were similar to what we got for the array: the performance differences were negligible. What gives?
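The String test is not shown here, so the following is only a guess at what such a comparison could look like. The test string and the iteration count are arbitrary assumptions, not the original inputs.

```ruby
require 'benchmark'

# Arbitrary test data; the inputs of the original experiment are not known.
text = "abcdefghij" * 1_000

non_bang = Benchmark.realtime do
  10_000.times { text = text.reverse }   # builds a new string each time
end

bang = Benchmark.realtime do
  10_000.times { text.reverse! }         # reverses the same string in place
end

puts "reverse:  #{non_bang} seconds"
puts "reverse!: #{bang} seconds"
```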

The good news is that using non-bang methods is fine if, like me, this is your preference. Unless, of course, you’re trying to merge hashes, in which case we need to dig deeper still. Let’s hook up a profiler to our original code and see what we get (I’ll cover Ruby profiling in more detail in a later post; for now, just bear with me). Here is our new code:

```ruby
require 'benchmark'
require 'profiler'  # shipped with the standard library in older Rubies

main_hash = {}
time = Benchmark.realtime do
  Profiler__::start_profile
  (1..10000).each do |number|
    simple_hash = {}
    simple_hash[number.to_s] = number
    main_hash = main_hash.merge(simple_hash)
  end
  Profiler__::stop_profile
  # write the profile table to stderr
  Profiler__::print_profile($stderr)
end
puts "Time elapsed #{time} seconds"
puts "Main hash key count: #{main_hash.keys.count}"
```

The profiler output looks like this:

%   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 92.95    12.78     12.78    10000     1.28     1.28  Hash#initialize_copy
  4.51    13.40      0.62        1   620.00 13750.00  Range#each
  1.89    13.66      0.26    10000     0.03     1.30  Hash#merge
  0.51    13.73      0.07    10000     0.01     0.01  Fixnum#to_s
  0.15    13.75      0.02    10000     0.00     0.00  Hash#[]=
  0.00    13.75      0.00        1     0.00 13750.00  #toplevel
Time elapsed 14.436450958252 seconds
Main hash key count: 10000

The culprit here is clear – initialize_copy, that’s where we are spending the vast majority of our time. So, what is initialize_copy? It is basically a hook method that Ruby provides which is invoked after an object has been cloned (using dup or clone). When you clone or dup an instance in Ruby, it is usually a shallow copy and so the fields in the cloned instance will still be referencing the same objects as the fields in the original instance. You can override initialize_copy to fix this issue. More info on cloning and initialize_copy can be found here.

So we can now intuitively say that when the merge method creates a copy of the hash to work on, it will need to iterate through all the entries of the original hash and copy them into the new hash. Only after it has done this will it be able to go through the second hash (i.e. other_hash) and merge the values from there into this new copy of our original hash. Of course, as the hash gets bigger from merging more and more keys into it, it takes longer and longer to copy the entries across, and so we spend more time inside initialize_copy, which becomes our bottleneck. On the other hand, if we use the bang version of merge, we never need to clone our hash and this bottleneck does not exist, so everything is fast. Makes sense, right?
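Here is a small sketch of the hook in action. The `Playlist` class is hypothetical and exists only for this example. The last few lines also check what a merged hash shares with the original: the copy that merge works on is itself shallow, so the values are the same objects in both hashes.

```ruby
# Hypothetical class, for illustration only.
class Playlist
  attr_reader :songs

  def initialize(songs)
    @songs = songs
  end

  # Ruby calls this on the new object after dup or clone,
  # passing in the object that is being copied.
  def initialize_copy(original)
    super
    @songs = original.songs.dup   # give the copy its own array
  end
end

first = Playlist.new(["Song A"])
second = first.dup
second.songs << "Song B"

puts first.songs.inspect    # only the original song
puts second.songs.inspect   # the copy has both songs

# merge returns a new hash and leaves the receiver alone,
# but the values in both hashes are the very same objects.
base = { "list" => [1, 2] }
merged = base.merge({ "other" => 3 })
puts merged["list"].equal?(base["list"])
puts base.key?("other")
```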

It does, the only problem I have is this. Why does the same thing not happen when we try to flatten the array? When we try to profile our array flattening code we get output similar to the following:

%   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 48.73     1.34      1.34     1000     1.34     1.34  Array#flatten
 31.27     2.20      0.86     1001     0.86     4.04  Range#each
 10.18     2.48      0.28    21000     0.01     0.01  Fixnum#to_s
  9.82     2.75      0.27    22000     0.01     0.01  Array#<<
  0.00     2.75      0.00        1     0.00  2750.00  #toplevel
Time elapsed 3.29541611671448 seconds

As you can see, initialize_copy is not even in the picture. So does that mean we can clone an array (when using the non-bang version of flatten) without having to then go over all the elements and copy them across? I had a quick look at the C code that backs both the Array and the Hash implementation in Ruby, but I don’t know Ruby native code well enough to get a definitive answer without spending an inordinate amount of time on it (*sigh* another thing to go on the list of things to learn :)). If you do know your way around Ruby’s C code, could you please enlighten the rest of us regarding this interesting issue?

In the meantime the lesson is this: non-bang methods are fine, but if you’re going to merge a bunch of hashes into one, use the bang version of merge. The other lesson is the fact that C is still as relevant as ever, if you want to know interesting things :). Although, I probably need to go through my copy of K&R before digging into Ruby’s C code – it has been a while. Anyways, that’s a story for another time.

Image by rachel_r

The Most Annoying Habit Of A Software Manager

I really hate it when managers refer to people (developers) as resources! I am not sure if this is an issue in other fields, but I do know software and it is rampant. Everyone is always concerned with resources.

“We’re going to need more resources”

“Are you sure we have the resources?”

It really is hard to get good resources these days. The longer I spend building software, the more I find myself annoyed when I hear this talk of resources. Hardware is a resource, so is possibly computing power, certainly crude oil; people are not resources!

Referring to people as resources creates an impression that developers are plug-and-play components. Worse than that, it makes it seem as if there is a readily available and inexhaustible supply of these “resources”. Of course, these days we all know that even real resources such as oil and gas are not inexhaustible or as readily available as they have been in the past. But the attitude fostered by using the word remains the same.

The problem with this one is that it’s quite insidious. That’s how all the big boys talk; you want to play with the big boys, you gotta pick up the lingo. Any fresh-faced young manager or developer can instantly make themselves sound more “with-it” by throwing the R word around. And when everyone around you is doing it, you can’t help but fall into it as well. It happens to me all the time, so I have to mentally kick myself every time I catch myself doing it.

It’s about respect, you see. Like calling the waiter serving you in a restaurant – “garçon”. No one likes being referred to as “boy” and they like it even less if you equate them to an inanimate carbon rod. If you’re going to treat your developers as amorphous balls of goo, don’t be surprised when they don’t buy into your “corporate vision” and couldn’t give a rat’s ass about the products you’re building.

If you’re a manager or even a developer with a penchant for calling people resources, please stop! If you have non-verbal references (in spreadsheets, schedules etc), go and change them all to the names that should have been there in the first place. If you hear others using it, pull them up on it. I am a believer in the fact that a lot of small changes, over time, can add up to making a big difference (more on that later) and this one small change will make a big difference all on its own – I guarantee it.

What I like to do these days, every time I hear the word “resources”, is ask the question:

“You mean people, right?”

Cause you never know, they could mean gold bullion, in which case I would agree – those things are hard to come by and you can never have too much.

Image by Uncle Kick-Kick

The Raven 2.0

The other day, during the course of my web browsing, I stumbled upon “The Raven” by Edgar Allan Poe. I love that poem; if you’ve never read it, go ahead and do so, it’s a classic. For some reason I was feeling a bit creative at the time (that happens to me sometimes), so I decided to write my own version. Now, I know “The Simpsons” did a version of it, but that was for mass consumption, mine was going to be strictly for programmer consumption :). Anyway, a few hours later (many more than I would have expected, it’s a freaking long poem), I ended up with the following. Hope you like it (it really helps to read the original before you read this).

Once upon a midnight dreary, while I pondered weak and weary,
Over many a steaming pile of spaghetti code galore,
While I nodded bored and napping, suddenly there came a tapping
As if some asshole sharply rapping, rapping at my office door
"'Tis my manager," I muttered, "tapping at my office door -
I'll ignore him, nothing more."

Ah distinctly I remember frantic unit test refactors
As my dual screens and laptop cast their light upon the floor
Eagerly I wished for coffee, vainly I had sought for hours
From the web some help or answer for Issue number 424
For that stupid, f*cking issue that Jira named 424,
I had solved it once before.

And that blinking red uncertain, of the build-light by the curtain
Galled me - filled me with annoyance I had never felt before;
With my flow interrupted, at the screen I harshly swore
"'Tis my manager the asshole, whose very presence I abhor -
What's he doing at this hour, rapping on my office door; -
I will kill him, that's for sure"

My annoyance growing stronger; hesitating then no longer,
"WHAT!" I thought "just enter, will you, but stop knocking or I'll kill you
Fact is, I was trying to fix this stupid Issue 424
Then my flow was interrupted, when some asshole started rapping, loudly on my office door
How about you knock some more - cause I don't think they heard you on the 22nd floor"
Here I opened wide the door; - Darkness there, and nothing more.

Deep into that darkness peering, long I stood there scowling, sneering
Dreaming up four-letter words, no mortal dared to dream before
But, the silence was unbroken, so with a final vulgar token
I turned around and whispered loudly, "Back to Issue 424"
And the silent gloomy office echoed - "... Issue 424"
Merely this and nothing more.

Back to my computer turning, with my indignation burning,
Soon again I heard the rapping somewhat louder than before
"WTF, it's my computer, not some asshole at my door
Let me look then at this problem and this mystery explore -
I'll take some deep breaths for a moment and this mystery explore; -
F*ck!!! The build just failed once more!"

I cracked my box without a stutter, but then my heart did do a flutter
As with a final fateful sputter a fan fell out on to the floor
And without even trying, it rolled across the whole office
And stopped itself against the wall in a corner by the door
Right below the pic of Dijkstra in the corner by the door
Stopped, and fell, and nothing more.

Then this broken fan beguiling my annoyance into smiling
I thought - "laying in the corner it enhances the decor"
"That was perhaps not unexpected, since I’ve oft before suspected"
That the hardware we're using is from the saintly days of yore,
Tell me how I'll now deal, with Jira Issue 424
Since the build without doubt was as broken as before

Much I marvelled at the errors on my laptop screen so plain,
Though they to my addled brain - little relevancy bore;
For we cannot help agreeing that no living human being
Should be ever cursed with seeing errors at 12:44,
Using nothing but his laptop to fix Issue 424
And the build as yet remained just as broken as before

So there I was just sitting lonely, staring at my laptop only,
Flinging curses, as if my soul in those words I did outpour
"Stupid management", I uttered, "Buys crappy hardware", I muttered
Now any chance I had of sleep was in the corner by the door,
On the morrow I planned, to give those bastards the 'what for'
And the build then failed once more

Startled at the stillness broken by the build-noise loudly spoken
"Doubtless," said I, "this issue - seems a little too hardcore
This code is a complete disaster, I should revert it all to master"
The dudes to blame for this regression have a lot to answer for
"No! I can't waste the work of hours, I know I've solved it once before"
So, I'll revert it nevermore

But the errors from compiling, were still my tired brain beguiling
So I went and got a Coke and slammed it down before the door
Then upon my Aeron sinking I betook myself to thinking
Maybe there was a header that I forgot to link before
Or maybe I forgot an option when I built the code before
But I think that's unrelated to Jira Issue 424

Thus I sat engaged in guessing, it was really quite depressing
And it seemed I'd never find, the key to Issue 424,
This and more I sat divining, was it variable assigning?
No, I have been down that track, 9 hours before
"Just start working!!!" I implore, then I try to build once more
But the build-light remains, as red and blinking as before

Then, methought, the air grew hotter, did the aircon stop working?
Cause that's exactly what I needed, it's as if I'm in a war
"Sh*t," I cried, "what's next, an earthquake - that would make things much simpler
It's as if some deity hates me and is invoking Murphy's Law!
I give up; I'll quaff a Coke and watch my code explode once more",
Sure enough - it dumped the core.

"Laptop, laptop on the table, am I really, truly able
Will I ever comprehend how I solved it all before?
Full of Coke but still undaunted, in a deserted office haunted -
By smelly code full of horrors - tell me truly, I implore -
Will I ever find the key to Issue number 424? - tell me - tell me, I implore"
But the build just fails once more.

"Laptop, laptop on the table, am I really truly able,
By the space that bends above us - by the science we both adore -
Tell this soul with sorrow laden, if within the next few hours,
It shall find by intuition the key to Issue 424 -
Or maybe I just need more tests to cover Issue 424?"
But the build just fails once more.

"Screw this crap!" I shrieked upstarting, "That's my cue to be departing -
This can wait until tomorrow, for new hardware galore!"
I should have left it for tomorrow and saved myself a bunch of sorrow,
When the fan from my computer rolled across the office floor!
"I am leaving!" I announce, but then to spite me to the core,
The build decides to fail once more.

And the broken fan unwitting, still is sitting, still is sitting
Right below the pic of Dijkstra in the corner by the door;
And the bug still needs resolving, but my brain isn't working,
And the red and blinking build-light still casts shadows on the floor;
I'm still sure I'll fix the issue that Jira named 424
But only after I've calmed down, and only once I've slept some more!

That’s it. I’d love to hear what you guys think, whether just general thoughts or adaptations to make it better; any feedback is welcome. Don’t forget to subscribe to my feed if you haven’t already :).

Image by Atli Harðarson