Writing A More Ruby-ish Array Intersection Function And Sorting Structs

09/03/2010 1774 words 9 min read

In my previous post on boolean search, we wrote an efficient list (array) intersection function. But, to be honest, I wasn’t very happy with it. Considering the fact that we were using Ruby, it just wasn’t very Ruby-like. So, this time I am going to try and show how that function can be re-written to not only be more Ruby-ish but to also be tighter and easier to understand. In addition I will look at a couple of interesting aspects of OpenStruct, which will lead nicely into my next post on skip pointers (I’ve spoken about covering skip pointers before).

The Ruby-ish Intersection Function

The last time we implemented an array intersection function it looked like this:

```ruby def intersect_lists(list1, list2) final_list = [] current_list1_index = 0 current_list2_index = 0

while(current_list1_index < list1.length && current_list2_index < list2.length) if list1[current_list1_index] == list2[current_list2_index] final_list << list1[current_list1_index] current_list1_index += 1 current_list2_index += 1 elsif list1[current_list1_index] < list2[current_list2_index] current_list1_index += 1 else current_list2_index += 1 end end final_list end```

There is actually nothing wrong with the code itself. It is an efficient implementation. The reason it is efficient is due to the fact that we only need to walk through each of the lists once, in order to produce the intersected list. So if the sizes of our two lists are x and y, the intersection takes x+y operations, i.e. the complexity of the algorithm is O(x+y). If we were writing that function in C, everything would be fine at this point, but we’re using Ruby. What we really want our function to look like is something along the lines of:

ruby def intersect_lists(list1, list2) list1.each do |item1| list2.each do |item2| #stuff end end end

This would actually work, but we go back to the naive array intersection implementation where we scan through the whole of one list for each of the elements in the other. What we really want is to be able to exit the inner each iterator to allow the outer one to advance, but then restart where we left off the next time we enter the inner one. This way we can compare all the values in both arrays to produce the intersection while still scanning both lists only once. Fortunately with Ruby, we can simply open up the array class and create just such an iterator:

```ruby class Array def each_continued if !@current_index reset_current_index end while @current_index < self.size next_index = yield(self[@current_index],@current_index) if next_index @current_index = next_index break end @current_index += 1 end end

def reset_current_index @current_index = 0 end end```

As you can see our new iterator uses a class level variable (_@currentindex) to keep track of where the iteration is at. We stop the iteration based on the value that is returned from the block that needs to be passed to this iterator. If the value returned from the block is nil, we continue to iterate. When the value returned is a number, we set it to be the index from which we will restart the iteration next time the method is called. This allows us to not only stop and restart the iteration where we left off, but also roll the iteration forward or back if necessary. This is actually important as we will see shortly. There is only one issue with this implementation, the fact that we need to use a class level variable. The problem is the fact that when one full iteration through the array completes, you need to remember to rewind (i.e. must call reset_current_index) – not great. The only way to overcome this would be keep track of the current position of the iteration externally and pass this value in as a parameter to the iterator, but we won’t worry about that for now, for our purposes, resetting the array is not such a big deal.

We can now re-implement our intersection function, it will look like this:

ruby def rubyish_intersect_lists(list1, list2) list3 = [] list1.each do |value1| list2.each_continued do |value2, index2| if value1 == value2 list3 << value2 index2 + 1 elsif value1 < value2 index2 end end end list3 end

You’ll probably agree that’s much tighter and more Ruby-ish looking. There are two important things to note, in order to understand how it works.

When the values from both lists match, we copy the result into our intersection array, we then need to advance both lists. In order to do this, we get the block of the inner iterator to return the index of the next value in the list. This will cause the inner iterator to terminate, but will also prime it to restart from the index we return out of the block (see what I mean about the importance of giving our iterator the ability to restart the iteration from any index we want). This way we advance our inner iterator and naturally advance the outer one, since the inner one has exited.
When the value of the array in the inner loop is bigger than the outer one, we want to keep advancing the outer iterator, but keep the inner one where it is. We do this very easily, by making the block of the inner iterator return the current index of the list. This has the effect of exiting the inner iterator and priming it to restart in exactly the same place on the next iteration, and of course it naturally advances the outer iterator.

The only other scenario is when the value of the outer loop iterator is bigger then the inner one, in which case we want to advance the inner one. This will happen by default, and so needs no special code to handle it.

Sorting OpenStructs

As I said, this post actually a lead into another post, which will be about skip pointers. This in itself, is no big deal, we still want to intersect lists, the only issue is, they have to be lists of objects rather than lists of primitive integers. After all you actually need somewhere to put the skip pointers (if you don’t really get it, don’t worry, I’ll explain in the next post). When I was playing around with skip pointers initially I decided to use OpenStruct to wrap the integers (which are the values of the arrays we were going to intersect). In hindsight I should have created an object, but I was a bit too far along and didn’t want to go around changing a bunch of code. Here is the problem.

In order to test my intersection function, I generate arrays of integers, using the following function.

ruby def create_raw_mock_postings_list(min_postings, max_postings, corpus_size) postings_list = [] num_postings = min_postings + rand(max_postings) num_postings.times do postings_list << rand(corpus_size) + 1 end postings_list.sort.uniq end

This allows me to dummy up what a posting list for a term would look like (a particular number of document ids each of which is a random number between 1 and the size of the corpus). The important bits are in the end i.e. _postingslist.sort.uniq. It is very important that the posting list is sorted and all values are distinct. This is fine when your value are integers, it all just works. It would also be fine if the value were objects, since all you would need to do would be to implement the spaceship operator (<=>, so that you can compare them to each other) for everything to once again work (which is why I should have used an object). I, of course, was using OpenStruct and I don’t know of any way to implement the comparison operator.

When it comes to sort, this is reasonably easy to overcome, since that function takes a block that it can use for comparison, but when it comes to uniq, we’re out of luck. I needed a function similar to the following:

ruby def create_wrapped_mock_postings_list(min_postings, max_postings, corpus_size) postings_list = [] num_postings = min_postings + rand(max_postings) num_postings.times do wrapped_id = OpenStruct.new wrapped_id.value = rand(corpus_size) + 1 postings_list << wrapped_id end postings_list.sort{|x,y| x.value <=> y.value}.block_uniq{|item| item.value} end

Ruby goodness came to the rescue, since we can, once again, open up Array and create our own uniq function. Like so:

ruby class Array def block_uniq uniqness_hash = {} uniqed_array = [] self.each do |value| if block_given? uniqness_value = yield(value) else uniqness_value = value end if !uniqness_hash.key?(uniqness_value) uniqed_array << value uniqness_hash[uniqness_value] = nil end end uniqed_array end end

We use a hash, to make sure the items are unique (i.e. if it is already in the hash, discard it). If no block is given to this function it will just operate like a normal uniq function, but when the block is there, it will use the return value of the block rather than the array items themselves to do it’s job. Implementing this function allows our create_wrapped_mock_postings_list function to work as intended. One last thing to note is the fact that our new _blockuniq function preserves the order of the elements of the array we are uniquing, always good to keep things neat.

Intersecting Structs

Now that we’re dealing with structs rather than integers, our previous intersection function is no longer correct (for exactly the same reason, why sort and uniq were such a pain, I really should have used an object instead of OpenStruct, it’s a good learning experience, but let that be a lesson to me for later).

We could just slightly alter our intersection function to work with our structs:

ruby def rubyish_intersect_for_wrapped(list1, list2) list3 = [] list1.each do |item1| list2.each_continued do |item2, index2| if item1.value == item2.value list3 << item2 index2 + 1 elsif item1.value < item2.value index2 end end end list3 end

This is actually what I did, but is not necessarily a good solution, since we can augment the original function to work either with integers or with structs or anything else, in a similar way to how we implemented our _blockuniq function. For example:

ruby def rubyish_intersect_for_wrapped(list1, list2) list3 = [] list1.each do |item1| list2.each_continued do |item2, index2| if yield(item1) == yield(item2) list3 << item2 index2 + 1 elsif yield(item1) < yield(item2) index2 end end end list3 end

We can now call this function like so, when we want to work with structs:

```ruby

when we have just plain integers, we would do:

rubyish_intersect_for_wrapped(list1, list2) {|item| item}```
Of course, it would be nicer if we had used the _blockgiven? method so that we don’t have to supply any block at all in the second case, but I’ll leave that one as an exercise :).
We’re now ready to tackle skip pointers and see if can actually get better performance (than O(x+y)) from our intersection function. This is coming up very shortly. 
Image by Thomas Hawk

Why Your Developers Don’t Want To Swap Pairs – Practical Agility

08/03/2010 1272 words 6 min read

Much of the agile process literature is very general. There is a lot of focus on practices and philosophy and empowering teams (nothing wrong with that by the way), but there is also often a lack of practical advice on how to solve the day-to-day problems you run into when working in an agile environment. It is difficult, as every team encounters different problems and even similar problems often have a different root cause depending on the environment. Regardless, I can see a need for a cookbook type of text, but in the meantime I’d like to weigh on an issue that I have seen quite often with agile teams, that of pair programming and the apparent reluctance of many developers to actually swap pairs regularly.

If this kind of thing is of interest to you then remember to subscribe to my RSS feed as I often write about this sort of stuff.

A Novice Agile Team

I don’t really want to spend too much time here as the issues in the case of novice agile teams are reasonably clear. It is likely a people or a team dynamics problem. I would look at one of two things:

Did you introduce all the agile practices at once (the big bang approach)? If so, you’ve likely found your problem. It is often best to introduce practices slowly to allow everyone to acclimatize. Otherwise people start to feel adrift and try to anchor themselves somehow. The easiest way to do that (anchor yourself) is to make sure you complete the stories you start and so people avoid swapping pairs. It may be worth ignoring the lack of pair swapping and giving the team some more time to get used to the new process (focus on pair swapping a little later).
If people seem comfortable with the practices then your problem is probably lack of buy-in. The team doesn’t understand the benefits of pair swapping and sees no need for it. Once again, there is a certain level of discomfort involved with pair swapping. You’re leaving a story in a half-finished state and developers don’t like to do that. You’re also potentially leaving someone you get along with really well (you current pair) for a person who is less familiar to you. Without buy-in there is nothing to counter-balance that discomfort and pair swapping suffers. Your path here is clear, you need to sell people on the benefits that pair swapping brings. Communicate with the team and explain, until you get to the point where people start to prompt each other to swap pairs.

It is that simple when it comes to a novice agile team.

The Experienced Team

When it comes to an experienced agile team, the problems are not as easy to diagnose. The people are fully aware of the benefits and everyone is comfortable with the process and yet pair swapping is still an issue. It is likely not a people or a team dynamics problem, the culprit is elsewhere in your process. Consider the fact that your stories might be too big – it takes too long for a pair to finish a story.

In order to comfortably swap pairs a developer needs to reach a certain level of understanding regarding a story:

have a clear idea of the business requirements (where the current story is concerned)
be comfortable with how everything will be implemented from a technical perspective
have a good idea of how everything will look in a finished state

To get to this level of understanding you need to be involved with a story until it is approximately half-finished (based on personal experience). Otherwise you feel like you don’t really have a handle on the situation, haven’t yet made enough of a contribution and so will be reluctant to move on to something else.

One of the benefits of pair swapping is to get more people across a particular story, so it is actually beneficial to swap half way through a story. Unfortunately, not all stories are the same size even when the estimated point value is the same, so all stories reach the half-way point at different times. This is why most teams try to swap pairs at regular time intervals (i.e. half way through the day, start of every day etc.). But, if your stories take two or more days to implement, there will be a lot of reluctance to swap half way through the day. You can certainly make the swap interval longer, but then the new person needs to make up so much ground as far as context is concerned that it is almost counter-productive, they become a passenger. Not to mention the fact that when a story is long and complex you want to see it through to the end to get that satisfied, “job-well-done” feeling.

You would be better off recalibrating the point values for your estimates so that a 1-pointer becomes a lot smaller (less complex and so takes less time to implement) and consequently larger point-value stories are also smaller. I would go as far as to say that the majority of the stories should be 1-pointers with a few 2-pointers, anything bigger than that should be broken up.

If all of this is unconvincing, here are some reasons why smaller stories are better stories.

They allow the team to check in more often (not as many conflicts)
Smaller stories are easier to wrap your head around
Go along way to preventing situations where you’re half way through multiple stories at the end of an iteration (because when you underestimate a small 1-pointer, you’re only a few hours off, when you underestimate a huge 3-pointer, you could be off by more than a day)

And of course, people will be a lot happier to swap pairs when the stories are smaller as it takes a lot less time to reach that level of understanding (that I talked about above), where a pair swap is no longer a pain.

The “Rules” For Breaking Up Functionality

I have found that people tend to write stories according to a set of pet rules, e.g.:

each story must be a vertical
each story must have tangible business value (must be tied directly to requirements)
you must be able to express a story in the form – “As an X I want to be able to do Y”
etc.

The thing to remember is that all those are guidelines. If sticking to those rules makes a story too big then break them. Don’t worry too much about business value and it is ok not to make a story a complete vertical. When you break a story up, remember that stories don’t have to be completely independent. Agile teams often try to maintain the illusion that each story is independent and the system is built up from a bunch of discrete verticals. This is never true, there are always dependencies, so if you need to, you can reorganise your task board to make these explicit, this way it becomes easier to break up stories, since you no longer have to try and make each story make sense in isolation.

The lesson here is this, if you see an issue in an experienced agile team (like the reluctance to pair swap), don’t automatically assume it is a people problem or lack of understanding. Look to other parts of your process, there may be a connection you wouldn’t expect. Do you think there are other reasons why people are often reluctant to pair swap? If so, please leave a comment, I’d love to hear them.

Image by mikebaird

The Most Common Things You Do To A Large Data File With Bash

07/03/2010 465 words 3 min read

I find that whenever I get a large data file from somewhere (i.e. extract some data from a database, crawl some sites and dump the data in a file) I always need to do just that little bit of extra processing before I can actually use it. This processing is always just non-trivial enough and I do it just uncommonly enough for me to always forget exactly how to go about it. Of course, this is to be expected, if you learn something and want it to stick you have to keep doing it. It’s all part and parcel of how our brain works when it comes to learning new skills, but that doesn’t make it any less annoying.

Back to our data file, for me I find that I almost always need to do 3 things (amongst others) before doing anything else with my file.

delete the first line (especially when pulling data out of the database)
delete the last line
remove all blank lines

Don’t ask me why but for whatever reason, you always get an extraneous first line and unexpected blank lines (and less often an extraneous last line) no matter how you produce the file :).

Anyways, my tool of choice in the matter is bash – it is just too trivial to use anything else (plus I love the simplicity and power of the shell). So, to make sure I never forget again here is the easiest way of doing all the three things above using sed:

sed 1d input_file | sed '$d' | sed '/^$/d' > output_file

Update: As Evan pointed out in the comments, it would be more efficient to do the following:

sed -e 1d -e ‘$d’ -e ‘/^$/d’ input_file > output_file

_This way the file doesn’t have to go through multiple pipes.

_

Of course since we’re using bash, there should be numerous ways of doing the above.

You can remove the first line using awk:

awk 'FNR>1'

but I don’t know how to remove the last line using awk. Anyone?

You can use head or tail to get rid of the first and last line:

head --lines=-1 input_file | tail --lines=+2

but not to remove blank lines.

You can use grep to remove blank lines

grep -v "^$" input_file

but it would be silly to try and use it to remove the first and last line (possible though).

If you know of an easier way to do the above three things in a one-liner using bash – do share it.

What are some of the most common (but non-trivial enough) things that you find yourself doing with bash when it comes to pre-processing that large data file?

For more tips and opinions on software development, process and people subscribe to skorks.com today.

Image by rachel_thecat