In my previous post on boolean search, we wrote an efficient list (array) intersection function. But, to be honest, I wasn’t very happy with it. Considering the fact that we were using Ruby, it just wasn’t very Ruby-like. So, this time I am going to try and show how that function can be re-written to not only be more Ruby-ish but to also be tighter and easier to understand. In addition I will look at a couple of interesting aspects of OpenStruct, which will lead nicely into my next post on skip pointers (I’ve spoken about covering skip pointers before).
The Ruby-ish Intersection Function
The last time we implemented an array intersection function it looked like this:
```ruby def intersect_lists(list1, list2) final_list = [] current_list1_index = 0 current_list2_index = 0
while(current_list1_index < list1.length && current_list2_index < list2.length) if list1[current_list1_index] == list2[current_list2_index] final_list << list1[current_list1_index] current_list1_index += 1 current_list2_index += 1 elsif list1[current_list1_index] < list2[current_list2_index] current_list1_index += 1 else current_list2_index += 1 end end final_list end```
There is actually nothing wrong with the code itself. It is an efficient implementation. The reason it is efficient is due to the fact that we only need to walk through each of the lists once, in order to produce the intersected list. So if the sizes of our two lists are x and y, the intersection takes x+y operations, i.e. the complexity of the algorithm is O(x+y). If we were writing that function in C, everything would be fine at this point, but we’re using Ruby. What we really want our function to look like is something along the lines of:
ruby
def intersect_lists(list1, list2)
list1.each do |item1|
list2.each do |item2|
#stuff
end
end
end
This would actually work, but we go back to the naive array intersection implementation where we scan through the whole of one list for each of the elements in the other. What we really want is to be able to exit the inner each iterator to allow the outer one to advance, but then restart where we left off the next time we enter the inner one. This way we can compare all the values in both arrays to produce the intersection while still scanning both lists only once. Fortunately with Ruby, we can simply open up the array class and create just such an iterator:
```ruby class Array def each_continued if !@current_index reset_current_index end while @current_index < self.size next_index = yield(self[@current_index],@current_index) if next_index @current_index = next_index break end @current_index += 1 end end
def reset_current_index @current_index = 0 end end```
As you can see our new iterator uses a class level variable (_@currentindex) to keep track of where the iteration is at. We stop the iteration based on the value that is returned from the block that needs to be passed to this iterator. If the value returned from the block is nil, we continue to iterate. When the value returned is a number, we set it to be the index from which we will restart the iteration next time the method is called. This allows us to not only stop and restart the iteration where we left off, but also roll the iteration forward or back if necessary. This is actually important as we will see shortly. There is only one issue with this implementation, the fact that we need to use a class level variable. The problem is the fact that when one full iteration through the array completes, you need to remember to rewind (i.e. must call reset_current_index) – not great. The only way to overcome this would be keep track of the current position of the iteration externally and pass this value in as a parameter to the iterator, but we won’t worry about that for now, for our purposes, resetting the array is not such a big deal.
We can now re-implement our intersection function, it will look like this:
ruby
def rubyish_intersect_lists(list1, list2)
list3 = []
list1.each do |value1|
list2.each_continued do |value2, index2|
if value1 == value2
list3 << value2
index2 + 1
elsif value1 < value2
index2
end
end
end
list3
end
You’ll probably agree that’s much tighter and more Ruby-ish looking. There are two important things to note, in order to understand how it works.
- When the values from both lists match, we copy the result into our intersection array, we then need to advance both lists. In order to do this, we get the block of the inner iterator to return the index of the next value in the list. This will cause the inner iterator to terminate, but will also prime it to restart from the index we return out of the block (see what I mean about the importance of giving our iterator the ability to restart the iteration from any index we want). This way we advance our inner iterator and naturally advance the outer one, since the inner one has exited.
- When the value of the array in the inner loop is bigger than the outer one, we want to keep advancing the outer iterator, but keep the inner one where it is. We do this very easily, by making the block of the inner iterator return the current index of the list. This has the effect of exiting the inner iterator and priming it to restart in exactly the same place on the next iteration, and of course it naturally advances the outer iterator.
The only other scenario is when the value of the outer loop iterator is bigger then the inner one, in which case we want to advance the inner one. This will happen by default, and so needs no special code to handle it.
Sorting OpenStructs
As I said, this post actually a lead into another post, which will be about skip pointers. This in itself, is no big deal, we still want to intersect lists, the only issue is, they have to be lists of objects rather than lists of primitive integers. After all you actually need somewhere to put the skip pointers (if you don’t really get it, don’t worry, I’ll explain in the next post). When I was playing around with skip pointers initially I decided to use OpenStruct to wrap the integers (which are the values of the arrays we were going to intersect). In hindsight I should have created an object, but I was a bit too far along and didn’t want to go around changing a bunch of code. Here is the problem.
In order to test my intersection function, I generate arrays of integers, using the following function.
ruby
def create_raw_mock_postings_list(min_postings, max_postings, corpus_size)
postings_list = []
num_postings = min_postings + rand(max_postings)
num_postings.times do
postings_list << rand(corpus_size) + 1
end
postings_list.sort.uniq
end
This allows me to dummy up what a posting list for a term would look like (a particular number of document ids each of which is a random number between 1 and the size of the corpus). The important bits are in the end i.e. _postingslist.sort.uniq. It is very important that the posting list is sorted and all values are distinct. This is fine when your value are integers, it all just works. It would also be fine if the value were objects, since all you would need to do would be to implement the spaceship operator (<=>, so that you can compare them to each other) for everything to once again work (which is why I should have used an object). I, of course, was using OpenStruct and I don’t know of any way to implement the comparison operator.
When it comes to sort, this is reasonably easy to overcome, since that function takes a block that it can use for comparison, but when it comes to uniq, we’re out of luck. I needed a function similar to the following:
ruby
def create_wrapped_mock_postings_list(min_postings, max_postings, corpus_size)
postings_list = []
num_postings = min_postings + rand(max_postings)
num_postings.times do
wrapped_id = OpenStruct.new
wrapped_id.value = rand(corpus_size) + 1
postings_list << wrapped_id
end
postings_list.sort{|x,y| x.value <=> y.value}.block_uniq{|item| item.value}
end
Ruby goodness came to the rescue, since we can, once again, open up Array and create our own uniq function. Like so:
ruby
class Array
def block_uniq
uniqness_hash = {}
uniqed_array = []
self.each do |value|
if block_given?
uniqness_value = yield(value)
else
uniqness_value = value
end
if !uniqness_hash.key?(uniqness_value)
uniqed_array << value
uniqness_hash[uniqness_value] = nil
end
end
uniqed_array
end
end
We use a hash, to make sure the items are unique (i.e. if it is already in the hash, discard it). If no block is given to this function it will just operate like a normal uniq function, but when the block is there, it will use the return value of the block rather than the array items themselves to do it’s job. Implementing this function allows our create_wrapped_mock_postings_list function to work as intended. One last thing to note is the fact that our new _blockuniq function preserves the order of the elements of the array we are uniquing, always good to keep things neat.
Intersecting Structs
Now that we’re dealing with structs rather than integers, our previous intersection function is no longer correct (for exactly the same reason, why sort and uniq were such a pain, I really should have used an object instead of OpenStruct, it’s a good learning experience, but let that be a lesson to me for later).
We could just slightly alter our intersection function to work with our structs:
ruby
def rubyish_intersect_for_wrapped(list1, list2)
list3 = []
list1.each do |item1|
list2.each_continued do |item2, index2|
if item1.value == item2.value
list3 << item2
index2 + 1
elsif item1.value < item2.value
index2
end
end
end
list3
end
This is actually what I did, but is not necessarily a good solution, since we can augment the original function to work either with integers or with structs or anything else, in a similar way to how we implemented our _blockuniq function. For example:
ruby
def rubyish_intersect_for_wrapped(list1, list2)
list3 = []
list1.each do |item1|
list2.each_continued do |item2, index2|
if yield(item1) == yield(item2)
list3 << item2
index2 + 1
elsif yield(item1) < yield(item2)
index2
end
end
end
list3
end
We can now call this function like so, when we want to work with structs:
```ruby
when we have just plain integers, we would do:
rubyish_intersect_for_wrapped(list1, list2) {|item| item}```
Of course, it would be nicer if we had used the _blockgiven? method so that we don’t have to supply any block at all in the second case, but I’ll leave that one as an exercise :).
We’re now ready to tackle skip pointers and see if can actually get better performance (than O(x+y)) from our intersection function. This is coming up very shortly.
Image by Thomas Hawk