Let's Roll Our Own Boolean Query Search Engine

Of course, we're not going to write a full-on search engine in this one post – that would take at least two :). But, surprisingly enough, given our knowledge of inverted indexes (which I talked about previously), we can cobble together a very basic boolean information retrieval system relatively easily. All we need is a little more knowledge and a couple of algorithms, so guess what this post is going to be about.

Boolean Queries

So, what are boolean queries? I guess the best way to explain it is to contrast boolean queries with the type of search we know best – web search. Web search is an example of a ranked retrieval system. In a ranked system users typically use free text queries to define their search parameters and the results returned by the system are ranked in order of relevance (hence the name). A boolean system on the other hand has the following properties:

  • users employ a special syntax (i.e. operators such as AND, OR, NOT etc) to define their queries
  • the results are not ranked by relevance (being a boolean system)

Let us use an example to illustrate the point. In web search, a user searching for 'christmas tree' is basically asking the following:

“Give me all the documents which contain the phrase ‘christmas tree’ in order from most relevant to least relevant.”

On the other hand, if we’re dealing with a boolean system the user would have to frame his query in a language the system can understand e.g.:

christmas AND tree – “Give me all documents that contain the word christmas and the word tree”

christmas OR tree – “Give me all documents that contain the word christmas or the word tree”

etc.

You get the picture. Given a collection of documents from which we have created an inverted index, we need relatively little extra effort to expand our index into a basic boolean retrieval system.
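To make that concrete, here is a tiny sketch (with made-up postings lists rather than real documents) showing how each boolean operator maps onto a set operation over document ids:

```ruby
# Toy postings lists: each term maps to the sorted ids of documents containing it.
postings = {
  "christmas" => [1, 3, 5, 8],
  "tree"      => [2, 3, 8, 9]
}
all_docs = (1..10).to_a  # every document id in this tiny collection

puts (postings["christmas"] & postings["tree"]).inspect       # christmas AND tree => [3, 8]
puts (postings["christmas"] | postings["tree"]).sort.inspect  # christmas OR tree  => [1, 2, 3, 5, 8, 9]
puts (all_docs - postings["tree"]).inspect                    # NOT tree => [1, 4, 5, 6, 7, 10]
```

The rest of the post is essentially about doing these three operations efficiently when the lists live on disk and are much larger.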

Basic Boolean Queries And The Scope Of Our System

All we really need is a query parser which will understand the syntax of our queries, and some sort of query executor which can access our index and spit out the results we're after. So, let's get implementing and comment as we go.

First things first, we need an inverted index. We have the one from my previous post, but it was a tiny toy index and won't let us appreciate the scope of the problem. We could of course find a bunch of documents to index, but that would add a whole level of complexity that we don't really need. The best thing to do, therefore, is to dummy up an inverted index. We don't need real documents; all we need is a dictionary of terms with a postings list (of dummy document ids) for each term – living on disk. This will allow us to generate postings lists of a size that will let us appreciate the scope of the search problem. Let's dummy up our inverted index:

```ruby
require 'ostruct'

POSTINGS_LIST_LOCATION = "/home/alan/tmp"
MIN_POSTING_LIST_SIZE = 8000
MAX_POSTING_LIST_SIZE = 10000

def create_mock_postings_list(min_postings, max_postings)
  postings_list = []
  num_postings = min_postings + rand(max_postings - min_postings)
  num_postings.times do
    postings_list << rand(max_postings)
  end
  postings_list.uniq.sort
end

def write_mock_postings_list(postings, file_name)
  File.open(file_name, "w") do |file|
    file.write(postings)
  end
end

def file_name_for_word(path, word)
  "#{path}/#{word}.postings"
end

def create_test_inverted_index(dictionary, path_to_posting_files)
  inverted_index = {}
  dictionary.each do |word|
    postings_list = create_mock_postings_list(MIN_POSTING_LIST_SIZE, MAX_POSTING_LIST_SIZE)
    value = OpenStruct.new
    value.postings_file = file_name_for_word(path_to_posting_files, word)
    value.document_frequency = postings_list.length
    write_mock_postings_list(postings_list.join(","), value.postings_file)
    inverted_index[word] = value
  end
  inverted_index
end

dictionary = %w{hello world ruby quick fox lazy dog random stuff blah yadda}
index = create_test_inverted_index(dictionary, POSTINGS_LIST_LOCATION)
```

As you can see our dictionary contains only a few terms:

  • we create a posting list of random size for each term
  • we've set our postings list maximum size to 10000 and minimum size to 8000 to make sure there is sufficient crossover between the postings lists for each term. It also makes the postings lists large enough to be interesting, since we're essentially saying that our collection is 10000 documents (in real life document collections are usually much larger, but 10000 is enough for our purposes)
  • our inverted index is a little “_musclier_” since it no longer keeps posting lists in memory, they are instead living on disk with the index only keeping the on-disk location
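The on-disk layout is simple enough to sketch in isolation – a postings list is just comma-separated ids in a file, and the index entry keeps only the path and the document frequency. (The temporary directory and the sample ids below are purely illustrative, not part of the real index.)

```ruby
require 'ostruct'
require 'tmpdir'

# Round-trip a postings list through disk the same way the mock index does:
# the index entry itself holds only the file path and the document frequency.
Dir.mktmpdir do |dir|
  postings = [2, 7, 11, 42]
  entry = OpenStruct.new
  entry.postings_file = "#{dir}/hello.postings"
  entry.document_frequency = postings.length

  File.write(entry.postings_file, postings.join(","))

  reloaded = File.read(entry.postings_file).split(",").map(&:to_i)
  puts reloaded == postings  # true: nothing is lost in the round trip
end
```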

We’re now ready to start writing our query executor (we’ll look at query parsing a little later). We’ll only concern ourselves with very simple queries that use one or two terms and one or more of three operators (AND, OR, NOT) e.g.:

  • hello AND world
  • lazy OR dog
  • random AND NOT stuff
  • blah OR NOT yadda

Yes, it does mean we can only handle four types of queries (unless you count single term queries), but this will be enough to illustrate the complexity of the problem while still providing a reasonably usable system in the end.
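As a sketch of what the (still unwritten) parser has to do, a hypothetical classify_query helper could recognise these shapes with a few regular expressions. The method name and the [operation, terms...] return format here are my own invention, just to show the idea:

```ruby
# Classify one of the five supported query shapes; returns the operation
# plus the terms involved, or raises for anything we don't support.
def classify_query(query)
  case query
  when /\A(\w+) AND NOT (\w+)\z/ then [:and_not, $1, $2]
  when /\A(\w+) OR NOT (\w+)\z/  then [:or_not, $1, $2]
  when /\A(\w+) AND (\w+)\z/     then [:and, $1, $2]
  when /\A(\w+) OR (\w+)\z/      then [:or, $1, $2]
  when /\A(\w+)\z/               then [:single, $1]
  else
    raise ArgumentError, "unsupported query: #{query}"
  end
end

puts classify_query("hello AND world").inspect       # [:and, "hello", "world"]
puts classify_query("random AND NOT stuff").inspect  # [:and_not, "random", "stuff"]
```

The executor then only needs to dispatch on the first element to one of the implementations below.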

Naive Vs Efficient Implementations

Let's look at all the possible queries our system will need to handle, in order of complexity.

Single Term

A single term query is trivial: we just retrieve the postings list for the term from our inverted index. At that point we would fetch the documents for all the ids and return them to the user (for our purposes, once the list of ids is returned we consider the query satisfied). Here is the implementation (without calling code):

```ruby
def load_postings_list_for_word(word)
  postings = []
  postings_list_file = file_name_for_word(POSTINGS_LIST_LOCATION, word)
  File.open(postings_list_file) do |file|
    file.each_line do |line|
      line.split(',').each { |id| postings << id.to_i }
    end
  end
  postings
end

def single_term_query(word)
  load_postings_list_for_word(word)
end
```

We will be able to reuse the load_postings_list_for_word helper for the other query types as well.

2 Term AND

These are queries such as (hello AND world), where we basically want to find all documents which contain both terms, in boolean algebra terms – a conjunction.

In the case of our inverted index, it simply means we want to intersect the postings lists for the two terms and return the result. Of course, things are a little more complicated than that: we need to do this efficiently. Unlike enterprise software development, where implementing things naively is often a good thing (it keeps things simple), in search you will quickly pay for being naive – with an unusable system. A naive way to implement the postings list intersection might be something like this:

```ruby
def naive_and_words(word1, word2)
  final_list = []
  postings_list1 = load_postings_list_for_word(word1)
  postings_list2 = load_postings_list_for_word(word2)
  postings_list1.each do |id|
    if postings_list2.include? id
      final_list << id
    end
  end
  final_list
end
```

This means we're potentially scanning the whole of one postings list for each value in the other; considering that we're dealing with very large lists, this would grind our system to a halt. We can be smarter and scan both postings lists only once:

```ruby
def intersect_lists(list1, list2)
  final_list = []
  current_list1_index = 0
  current_list2_index = 0

  while current_list1_index < list1.length && current_list2_index < list2.length
    if list1[current_list1_index] == list2[current_list2_index]
      final_list << list1[current_list1_index]
      current_list1_index += 1
      current_list2_index += 1
    elsif list1[current_list1_index] < list2[current_list2_index]
      current_list1_index += 1
    else
      current_list2_index += 1
    end
  end
  final_list
end

def and_words(word1, word2)
  postings_list1 = load_postings_list_for_word(word1)
  postings_list2 = load_postings_list_for_word(word2)

  intersect_lists(postings_list1, postings_list2)
end
```

This is somewhat more complex, but will allow our system to scale to very large postings lists. Of course, we could simply use Ruby's built-in operators to perform the intersection:

```ruby
def ruby_and_words(word1, word2)
  postings_list1 = load_postings_list_for_word(word1)
  postings_list2 = load_postings_list_for_word(word2)
  postings_list1 & postings_list2
end
```

This is in fact much simpler and works just as well, but we don't learn any interesting algorithms from doing it this way :). Here is a sample run where we timed the intersection of the postings lists for two words using all three methods:

Elapsed time (naive): 1.3064 sec
Elapsed time (our AND): 0.027263 sec
Elapsed time (ruby AND): 0.012857 sec

As you can see, our "good" implementation is much faster than the naive one, but still more than twice as slow as Ruby's – then again, we have to remember that Ruby's implementation is written in C :).
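For reference, numbers like the ones above can be produced with Ruby's Benchmark module, much like this sketch (with in-memory lists standing in for the on-disk ones, so the exact figures will differ):

```ruby
require 'benchmark'

# Build two overlapping sorted id lists, roughly the size of our mock postings.
list1 = (0...10_000).to_a.sample(8_000).sort
list2 = (0...10_000).to_a.sample(8_000).sort

naive   = Benchmark.realtime { list1.select { |id| list2.include?(id) } }
builtin = Benchmark.realtime { list1 & list2 }

puts "Elapsed time (naive): #{naive} sec"
puts "Elapsed time (ruby AND): #{builtin} sec"
```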

Challenge/Question!

Can you write an implementation of the intersection algorithm, in pure Ruby, that would approach the speed of Ruby’s core one (which is written in C)?

2 Term OR

These are queries such as (hello OR world), where we want all the documents that contain either or both of the words – a disjunction:

I won't bother with the naive implementation (the story is the same, just a different operator). Here we need to walk through the postings lists for the two terms and add everything we find to our result list – without duplicates:

```ruby
def add_lists(list1, list2)
  final_list = []
  current_list1_index = 0
  current_list2_index = 0

  while current_list1_index < list1.length || current_list2_index < list2.length
    if current_list1_index >= list1.length
      final_list << list2[current_list2_index]
      current_list2_index += 1
    elsif current_list2_index >= list2.length
      final_list << list1[current_list1_index]
      current_list1_index += 1
    elsif list1[current_list1_index] == list2[current_list2_index]
      final_list << list1[current_list1_index]
      current_list1_index += 1
      current_list2_index += 1
    elsif list1[current_list1_index] < list2[current_list2_index]
      final_list << list1[current_list1_index]
      current_list1_index += 1
    else
      final_list << list2[current_list2_index]
      current_list2_index += 1
    end
  end
  final_list
end

def or_words(word1, word2)
  postings_list1 = load_postings_list_for_word(word1)
  postings_list2 = load_postings_list_for_word(word2)
  add_lists(postings_list1, postings_list2)
end
```

Ruby will of course let us do this even more easily (if we’re dealing with sets or arrays that is):

```ruby
def ruby_or_words(word1, word2)
  postings_list1 = load_postings_list_for_word(word1)
  postings_list2 = load_postings_list_for_word(word2)
  postings_list1 | postings_list2  # set union: unlike +, | removes duplicates
end
```

The timing comparison is similar in this case as well: our implementation does well, but Ruby's C implementation is more than twice as fast.

2 Term AND NOT

Now we’re introducing the NOT operator (hello AND NOT world). The situation is similar (again :)) to the regular AND, but the algorithm is slightly different in that we want to find all the documents that contain the first term and don’t contain the second:

```ruby
def list_difference(list1, list2)
  final_list = []
  current_list1_index = 0
  current_list2_index = 0

  while current_list1_index < list1.length
    if current_list2_index >= list2.length || list1[current_list1_index] < list2[current_list2_index]
      final_list << list1[current_list1_index]
      current_list1_index += 1
    elsif list1[current_list1_index] == list2[current_list2_index]
      current_list1_index += 1
      current_list2_index += 1
    else
      current_list2_index += 1
    end
  end
  final_list
end

def and_not_words(word1, word2)
  postings_list1 = load_postings_list_for_word(word1)
  postings_list2 = load_postings_list_for_word(word2)

  list_difference(postings_list1, postings_list2)
end
```

Ruby has a handy operator for this also:

```ruby
def ruby_and_not_words(word1, word2)
  postings_list1 = load_postings_list_for_word(word1)
  postings_list2 = load_postings_list_for_word(word2)
  postings_list1 - postings_list2
end
```

As usual Ruby’s C implementation is faster, but our implementation scales well also.

2 Term OR NOT (also single term NOT)

These last two are a little trickier. For hello OR NOT world we want all documents that contain the first term or don't contain the second – in other words, everything except the documents that contain world but not hello. In the limiting case of a single term NOT (i.e. NOT world) we want every document in the collection that doesn't contain the term. Either way we need to know which documents a term does not appear in, and that's the tricky part. Our inverted index only contains associations between terms and the documents they belong to; there are no associations for the documents a term doesn't belong to. Unless we augment our index with further information when we construct it, we really only have two ways out of this situation that I can see:

  1. if we assume that we know how many documents our collection contains in total and our documents have consecutive ids then we can create a list of all document ids and take away the ids we don’t want (i.e. perform an AND NOT operation)
  2. we go through the whole dictionary, retrieve each posting list and construct a list of all document ids in the collection that way, then we once again perform the AND NOT operation against the posting list of the term we don’t want – this doesn’t seem very efficient.
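Option 1 is easy to sketch if we grant ourselves the assumption of consecutive ids and a known collection size (neither of which our index currently records – the COLLECTION_SIZE constant below is that assumption made explicit):

```ruby
COLLECTION_SIZE = 10_000  # assumed to be known; our index doesn't store this

# NOT term: every id in the collection minus the ids the term appears in.
def not_postings(postings_for_term)
  (0...COLLECTION_SIZE).to_a - postings_for_term
end

# hello OR NOT world: union the first term's postings with the complement
# of the second term's postings.
def or_not_postings(postings1, postings2)
  (postings1 | not_postings(postings2)).sort
end

puts not_postings([0, 2, 4]).first(5).inspect  # [1, 3, 5, 6, 7]
```

Building the full id list this way is O(collection size) per query, which hints at why augmenting the index at construction time is attractive.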

Is augmenting the index at construction time the only real way to solve this problem or can anyone see a better way of doing this?

Query Parsing And Optimization Of Complex Queries

Since we so handily restricted the possible queries we want to process, writing a parser would be relatively trivial. However, there are caveats even here. When dealing with large document collections (almost always, for search problems), we always try to minimize the time taken to generate our results as much as we can. So is there anything more we can do for our queries? It turns out that at least in the case of AND queries there is. If we sort the query terms by document frequency (remember, we store document frequency as part of index construction), we can begin processing with the lowest-frequency term first. If we look at our implementation of the AND query, we can see that we stop once the smaller list is exhausted. Therefore starting with the lowest-frequency term lets us do the least amount of work and speeds up our processing even more.
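Here is a sketch of that idea for an AND over any number of terms. The index is simplified to a hash of term => sorted postings list (so document frequency is just the list length), and the method name is my own:

```ruby
# Intersect the postings lists for all terms, rarest term first, so the
# intermediate result stays as small as possible.
def optimized_and(terms, index)
  ordered = terms.sort_by { |term| index[term].length }
  result = index[ordered.shift]
  ordered.each do |term|
    break if result.empty?  # an empty intermediate result ends the query early
    result &= index[term]
  end
  result
end

index = {
  "hello" => [1, 2, 3, 4, 5, 6],
  "world" => [2, 4, 6, 8],
  "ruby"  => [4, 6]            # rarest term, so it is intersected first
}

puts optimized_and(["hello", "world", "ruby"], index).inspect  # [4, 6]
```

Note the early break: once an intermediate intersection is empty, no further term can bring anything back, so we can stop.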

We’re now in the realm of query optimization. For a two term query this may not be such a big win, but if we decide to allow more complex queries (i.e. AND with arbitrary number of terms), the savings will begin to add up. If we were to allow arbitrary boolean queries we can quickly end up with a parser and optimizer that is quite complex, consider:

((a AND b OR c) AND NOT d) OR e AND f OR NOT g AND (h AND NOT i) – scary and ugly

The good news is that the query above can be optimized. For example, we can rewrite it in conjunctive normal form or disjunctive normal form (err, probably – I haven't actually tried).

Challenge!

Have a go at writing a parser/optimizer that will take an arbitrary query that uses AND, OR, NOT and break it down into one of the normal forms. For some extra points re-arrange it for the most efficient query execution time. This is a non-trivial problem, I hope you remember your boolean algebra :). De Morgan’s law and distributive law would be a good place to start I think.

Problems With Boolean Systems

As fun as boolean systems are, we can immediately spot some issues.

  • you need to learn the syntax; for some systems this can get extremely complex, especially when more than just three operators are involved – in fact, people have made careers out of being experts at using a particular boolean system
  • AND type queries will usually get you very relevant results but few of them (high precision, low recall), while OR type queries will usually net you low precision and high recall; there doesn't seem to be a decent way to combine these to get a happy medium
  • searching for multi-word phrases that denote a single concept (e.g. toilet paper) becomes problematic, unless the system includes some type of proximity operator – and even then it is not perfect

Ok, so I lied, we didn’t really put together a complete boolean retrieval system in this post. But you know what - we got close, and we learned a bit about search and boolean queries, and we got to practice our coding skills, and we gained an appreciation for how much complexity is hidden behind even seemingly simple search concepts. I do believe this is more than enough for one sitting. I’d love to hear any comments/criticisms you may have and if you can expand on anything I’ve said then please do, I don’t fancy myself an expert by any stretch of the imagination, so any knowledge sharing is welcome. If you enjoyed this post and would like to read more of my musings then do consider subscribing.

Image by rikhei

Thoughts On TDD (A Case Study With Ruby And RSpec)

Oh yeah, we do TDD; after all, we're an agile team! That's what we tell our peers, and it is even true – it is just not true 100% of the time. But everyone kind of agrees not to dig too deep – after all, they are in exactly the same boat – and we all get to feel good about our process and how we do things in our neck of the woods. Let's face it, we all fall back into non-TDD practices every day. That doesn't mean we don't write tests, it just means we don't always write the tests first. For some reason people often feel like they need to cover this up, as if they lose some credibility by not being a TDD maniac, and that's patent nonsense. In the kind of work we do as developers, it is perfectly natural not to be doing TDD all the time; the breadth of technology we work with on a daily basis almost demands this. Let's examine a typical TDD scenario (or at least typical for me) and perhaps things will get a little clearer.

A Typical TDD Scenario

We have a class and we feel the need for a new method, we write an empty method definition and we’re now almost ready to TDD.

```ruby
class FileOperations
  def read_file_and_print_line
  end
end
```

Of course we don’t just start hacking away, we need to build a picture in our head of what we want our new method to do. In this case we have the following:

  • find the path of the file we want to open
  • open the new file and read it line by line
  • when we find the relevant line we want to print it out
  • the line is considered relevant if it matches a certain condition

We’re now ready to get testing:

```ruby
describe "FileOperations" do
  before(:each) do
    @file_ops = FileOperations.new
  end

  describe "read_file_and_print_line" do
    it "should match a line in a file" do
    end

    it "should not match a line in a file" do
    end
  end
end
```

The fact that we're trying to match against a condition is an immediate flag that we can have a positive and a negative outcome, so we require two tests. Without the second test, you could easily implement the method incorrectly without your tests picking it up. Alright, let's fill in our tests and then pick them apart.

```ruby
describe "FileOperations" do
  before(:each) do
    @file_ops = FileOperations.new
  end

  describe "read_file_and_print_line" do
    it "should match a line in a file" do
      @file_ops.should_receive(:find_path_of_file_to_open).and_return("path to file")

      mock_file = mock(File)
      File.should_receive(:open).with("path to file").and_yield(mock_file)

      mock_file.should_receive(:each).with(no_args()).and_yield("string1").and_yield("string2").and_yield("string3")

      @file_ops.should_receive(:line_matches_condition).with("string1").and_return(false)
      @file_ops.should_receive(:line_matches_condition).with("string2").and_return(true)
      @file_ops.should_receive(:line_matches_condition).with("string3").and_return(false)
      @file_ops.should_receive(:puts).once().with("string2")

      @file_ops.read_file_and_print_line
    end

    it "should not match a line in a file" do
      @file_ops.should_receive(:find_path_of_file_to_open).and_return("path to file")

      mock_file = mock(File)
      File.should_receive(:open).with("path to file").and_yield(mock_file)

      mock_file.should_receive(:each).with(no_args()).and_yield("string1").and_yield("string2").and_yield("string3")

      @file_ops.should_receive(:line_matches_condition).with("string1").and_return(false)
      @file_ops.should_receive(:line_matches_condition).with("string2").and_return(false)
      @file_ops.should_receive(:line_matches_condition).with("string3").and_return(false)
      @file_ops.should_not_receive(:puts)

      @file_ops.read_file_and_print_line
    end
  end
end
```

There are many things going on so we’ll start at the beginning.

  • We want to cover our method with a minimum number of tests; this allows us to keep our methods small and tight and makes it easier for everyone. This means that whenever there is an opportunity to push some functionality into a collaborator, we need to take it. In our tests we did this twice: find_path_of_file_to_open and line_matches_condition. We don't care how these collaborator methods work – we can figure that out later; for now we simply mock how we expect them to behave.
  • Because we know a little bit about Ruby file system access we know that when we open a file we can yield to a block so we need to represent this in our test (to enforce this behaviour on our method implementation). We also know that we can call each on our file which will allow us to yield each line of the file to a block. Enforcing things like this can potentially make the test brittle if we decide to change our mind about the internal implementation of our method, but it also protects us from implementing the method incorrectly and having our tests still pass (more on this later).
  • We've set our tests up in such a way that we know how many strings should match, so we know that one line should be printed out in the first test and no lines in the second. We need to be explicit about this; once again, it ensures that the actual implementation can only go one way.

Essentially our two tests ensure that the easiest way to implement the method is also the correct way, which is the goal of TDD. If you find that your tests allow you an easier path to implement your method incorrectly (and still have the tests pass), then you need to tweak your tests. Oh, and here is our finished method – only one way to write it now:

```ruby
class FileOperations
  def read_file_and_print_line
    file_to_open = find_path_of_file_to_open
    File.open(file_to_open) do |file|
      file.each do |line|
        puts line if line_matches_condition line
      end
    end
  end

  private

  def find_path_of_file_to_open
  end

  def line_matches_condition line
  end
end
```

The two collaborators are still empty; we can now TDD them and fill in their implementations.
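For instance, one possible shape for line_matches_condition – the "starts with ERROR" condition here is invented purely for illustration; the real condition is whatever your application needs:

```ruby
class FileOperations
  # A hypothetical condition, for illustration only: treat lines that look
  # like error log entries as matches.
  def line_matches_condition(line)
    line.start_with?("ERROR")
  end
end

ops = FileOperations.new
puts ops.line_matches_condition("ERROR disk is full")   # true
puts ops.line_matches_condition("all systems nominal")  # false
```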

What We Learned

Having gone through the above exercise a couple of things become abundantly clear:

  1. We form a more or less complete picture in our head of where we want our method to go before we start writing the tests or the implementation
  2. We need to have a reasonably good understanding of the tech we’re working with in order for us to write our tests first (i.e. how Ruby IO works, blocks, etc.)

What if you're new to Ruby and have no idea how file IO works? What picture will you form in your head (a blank page), and how will you evolve the functionality through tests (with great difficulty)? You're much more likely to go off, do a little research and try some stuff out in order to understand the tech you're dealing with – and of course, as you're trying stuff out, the solution to your current problem will naturally take shape and your method is more or less done. Once you're fairly confident of what you're doing, you can delete it and start over, or you can just write the tests after the fact. In my opinion either way is correct. Of course I picked something simple to illustrate my example (Ruby file IO), but we run into this kind of situation all the time in our day-to-day work as developers. When I need to write some code using a framework, API or language I am not really familiar with, how do I fit TDD into that scenario? Not easily.

And before we bring up the whole Spike argument, let's be clear. The research you do into tech you're working with (but are not very familiar with) is not really a Spike; it is certainly Spike-like, but it is not explicit. A Spike has to be explicit (i.e. there is a card for it). On the whole I consider this "research" part of natural knowledge acquisition, so there is no need to throw your code out after you've finished "doodling" (although you certainly can if you're so inclined and are not pressed for time).

The number of APIs, libraries, DSLs, formats and standards that a typical project deals with can be truly formidable; no one can be an expert on all of it. And by the time you start to get a handle on things, you move on to a different project with a different tech stack and you're "adrift" again. When you come back to more familiar ground, it is not so familiar any more, and so the cycle continues. Is it any wonder that we tend to fall back into non-TDD from time to time?

Being Careless

The argument is that TDD lets you evolve your tests so that both your code and your tests are better as a result. I say being careful and thinking about what you're doing is what allows you to have better code and better tests. Let's say we wrote the method above first and then decided to write the tests afterwards; we could potentially end up with something like this:

```ruby
describe "bad tests read_file_and_print_line" do
  it "should forget to yield" do
    @file_ops.should_receive(:find_path_of_file_to_open).and_return("path to file")

    mock_file = mock(File)
    File.should_receive(:open).with("path to file").and_return(mock_file)

    @file_ops.read_file_and_print_line
  end

  it "should forget to output" do
    @file_ops.should_receive(:find_path_of_file_to_open).and_return("path to file")

    mock_file = mock(File)
    File.should_receive(:open).with("path to file").and_yield(mock_file)

    mock_file.should_receive(:each).with(no_args()).and_yield("string1").and_yield("string2").and_yield("string3")

    @file_ops.read_file_and_print_line
  end
end
```

Both of those tests pass, but they are clearly nowhere near as good as our original pair. Being careful and conscientious developers, we know this already – we wouldn't leave them in this state. But if we didn't know our tech well enough and thought that these two tests were fine, no amount of TDD would have helped. It's not about the TDD; it's about knowledge, practice, attitude and experience.

The worst thing, in my opinion, is when you try to force the use of TDD where you would be better off without it. You don't know the tech, and yet you try to force the tests (which end up crap anyway), spend exorbitant amounts of time on them and get nowhere. In this situation TDD can seriously slow you down, and when a deadline looms, can you really afford that?

Look, TDD can be a great tool in your arsenal as a developer (especially if you can manage to TDD your acceptance or integration tests i.e. black box) but there is no need to be a purist about it and there is no need to feel guilty when you choose not to employ this particular tool.

For more tips and opinions on software development, process and people subscribe to skorks.com today.

Image by onkel_wart

The Perfect Size For An Agile Team – 1 Person – It’s Crazy!

Whenever several people get together to form a team, issues always arise; that's a fact. Developers are just not a very homogeneous bunch (or is that humans?). Everyone has an opinion, everyone thinks their way of achieving the goal (whatever it happens to be) is best – it is a recipe for confrontation. Of course we all learn to work together eventually; teams gel or at the very least learn to function. It takes a while, but we live with it; after all, it is just the forming, storming and norming before we get to the chunky goodness that is performing. However, just because there is no obvious dysfunction doesn't mean the team is a well-oiled machine. We find a common framework, but our opinions and values are still the same and potentially different from those of our team members; and so the tension simmers.

As agilists we know the importance of getting along with everyone, of putting people first and being pragmatic – does that mean we’re immune? Have you ever worked on an agile team, it’s all hat tipping and tea parties, isn’t it?

“After you George.”
“No, no I couldn’t possibly, after you Bill.”

Sound familiar? I didn't think so. If anything, we argue and debate even more than non-agile teams; we're pragmatic, we know it's not personal, so why not get everything out in the open for a better outcome? Our 'problem' is that we take too much interest in the projects we work on; we invest ourselves in our work and so feel even more strongly about making it the best it can be, according to our values, knowledge and experience (sort of like the Army, but we try to tone down the killing). At least this is better than taking no interest in the work you're doing, but no matter how good the final outcome, all this back-and-forth does waste time and energy. It is a necessary waste – the final product is improved as a result – but what if we could make this waste unnecessary?

Intrigued?

Then walk with me.

Let's suppose we could create a team with a full-on agile mindset minus the inevitable tension, conflict and argument that cuts into our precious productivity. It's possible – we just need to make sure there is no one to argue with. If our team only has one person, we're golden!

“Hahaha Skorks, you’re such a kidder! Everyone knows you can’t have a team …”

Why not, though? (I really gotta work on the whole interrupting thing – when you start doing it to yourself, while writing, you know it's an issue.) We've all read those studies that say the best programmers are 10 times more productive than the average ones, and we all know a guy who worked with a guy who was an absolute gun. I mean, I personally have never worked with anyone who was 10 times more productive than me. Hang on, that must mean … oh my god … can I truly not be aware of the extent of my awesomeness?

“Hahaha Skorks, you really are a kidder!”

Seriously though, I sure as hell am not 10 times more productive than the people I work with, which means those super-developers are either ultra-rare or mythical. When I brought this point up with @mat_kelcey and @markmansour, they assured me the super-devs exist, since they've worked with some previously, so I will take their word for it and keep rolling. So let us say we obtain this ultra-rare (take it back, waiter, I prefer my steak well-done) dev; he is ten times more productive than your average bear and has a suitably agile mindset. Let's make him our team: we have our 5-7 obligatory developers, with 3-5 to spare. Just think of the awesomeness. The need for stand-ups and retros is gone – the guy will just adjust as he goes along. No need to justify his decisions to anyone; just start coding at line 10 and keep going until evening. All refactorings will be consistent, test coverage perfect, no mish-mash of ideas as far as code style or how and when to stub/mock etc. Pairing will be an issue for obvious reasons, but we're agile, we adapt. The amount and quality of documentation, up-front design and estimation will be completely dependent on the conscientiousness of this person, but he's got the can-do mindset and attitude – all these artefacts will be of just the right size, with just enough information. Sweet! Just think: this one person is doing the job of 5-7 people with less overhead and higher quality overall. If you're any kind of manager, you should be seeing big dollar signs in front of your eyes at this point.

Before we initiate the breeding program for our race of super-developers who will take the human race on a journey into software nirvana, let’s consider the risks. There are risks, the obvious and the not-so-obvious. First the obvious. No matter how good our guy is, he is still a single point of failure, a knowledge silo and a succession-planning nightmare – well, not personally, but you know what I mean. What if this guy gets hit by a bus, or gets sick, or quits, or decides to go to Russia for 6 months because he’s not getting any younger and there is more to life than code, or so I am told? It will be a minor disaster, not a good situation, and one which could easily be avoided if only he had some team members to start with … I feel like I am back at the beginning of this post.

“Damn you – space-time continuum!”

What about the not-so-obvious? You see, I am a bit of a student of human nature…

“I’ll have the 1937 Pinot Noir, my good man, it’s a fine vintage with a particularly pungent bouquet, har, har, har”

No seriously, I am not trying to be pretentious. The thing I’ve observed about developers/people is that we need others to push us to be better than we are by ourselves; this is true in absolutely everything. It means that no matter how agile a mindset we have individually, we will be taking shortcuts before we know it, unless there are others there to keep us honest. How many times have you taken shortcuts on personal projects: no tests, no source control, no need to refactor that class – only I will ever see it. We tell ourselves it is because we’re only working on personal stuff, it doesn’t matter, plus we don’t really have the time anyway, but those are all excuses. In reality we just can’t be bothered, since there is no one there to guilt us into it; it is simply human nature. Our one-man-team super-developer will begin to fail as an agilist on day 2 if you’re lucky, if not – towards the end of day 1.

I know, a one-man team is not really an option, but still, wouldn’t it be an interesting experiment to try? Could we prove conclusively that one person really can successfully do the work of a whole team?

Alright, what’s my point, or am I just rambling mindlessly? The point is, our whole industry is constantly stuck in a funk, once in a while something comes along to shake things up a little, but the majority of the time we’re in a deep rut. Why? Because everyone is constantly managing risk and trying to be responsible, we’re limiting ourselves with popular belief and best practice:

“Thou shalt have 5-7 developers in thy agile team,
Chaos will be sown in their passage,
So sayeth the wise Aloundo”

There is nothing wrong with managing risk and being responsible – it makes things more predictable, but it also keeps you in that rut I talked about. One way to achieve the impossible is to be completely ignorant of the fact that it is impossible; another is to actively try to achieve the impossible while knowing full well what you’re doing. Like, for example, creating an agile software team with only one brilliant developer and seeing what they can produce :). You will likely fail again and again and again, but when you do succeed … oh baby – it will be all kinds of awesome.

Well, I must say, this post has been a bit of a journey (it even freaks ME out sometimes – the way my mind works), which is not a bad thing – probably, but I do hope you enjoyed it. If not, here is a tip for the future. If you ever find yourself sitting on top of a rampaging buffalo, don’t try to fight it, or steer, or jump off – you’ll only hurt yourself. My advice is to get comfortable and enjoy the beautiful scenery; the place you end up may not be better than where you started, but you’re bound to learn something along the way, such as how to ride a rampaging buffalo. Bring on your thoughts, ideas, rebuttals, comments. Peace.

Image by dedde`