A little while ago I wrote a tiny little crawler, at the time I promised myself that having dabbled in crawling I would also cover searching, indexing and other web-search related activities. Little did I know back then just how sparse my knowledge in the area actually was, although some of the comments I received regarding the limitations of my crawler should have probably clued me in :). I can’t rightly say that I am an expert now, but I did learn a little bit in the last few months, enough to know just how complex this area can be. Regardless, I feel that I have reached a milestone in that I can now write about search without making a complete idiot out of myself (I hope :)), which means it is time to fulfill that promise I made to myself.
I’d actually love to cover the ins and outs of writing a big-boy crawler, but that’s a story for another post, for the moment I’ll begin with the fundamentals:
- some basic terminology for you to throw around at parties
- the anatomy of a basic index, the fundamental underpinning of any search engine
Why Index?
Wherever we find a search engine, we also find a set of documents. You’re lucky if your set is small, but if it is small enough you can probably scan it by hand. If you need a search engine, chances are your document collection is big and if you’re looking at writing a search engine for the web, it is very, very big. It is ludicrous to expect a search engine to scan all the documents every time you do a query, it would take forever. To make your queries fast and efficient a search engine will pre-process the documents and create an index.
The Heart Of Every Search Engine
At the core of every modern search engine is an inverted index, this is a standard term the reason for which will become clear shortly. The most basic thing we want to do is be able to quickly tell if a word occurs in any of the documents in our collection. We can assign a set of ids to our documents and then associate all the words that occur in a particular document with it’s id, this is rather inefficient for obvious reasons (duplicate words and all). Instead we invert the concept. We take all the words/terms that occur in all the documents in our collection – this is called a vocabulary (also standard terminology) – we then map each term to a set of document ids it occurs in. Each document id is called a posting and a set of document ids is a postings list. So, the most basic inverted index is a dictionary of terms each of which is associated with a postings list.
It goes without saying that an inverted index is built in advance to support future queries. On the surface this is done in a manner you would expect:
- go through all the documents, assign each an id and tokenize each one for words
- process all the tokens (linguistic processing), to produce a list of normalized tokens
- for each token create a postings list, i.e. a list of document ids it occurs in
Of course those 3 simple steps hide infinite layers of complexity. What we want to end up with is a sorted list of terms each of which is associated with a list of document ids. We can also start storing some extra info even with this basic inverted index, such as the document frequency for each term (how many documents the term occurs in). This extra information will eventually become useful when we want to rank our search results.
Let’s Play With Some Code
Too much theory without any practice would be cruel and unusual considering we’re hardcore developer types, so a good exercise at this point would be to construct a small inverted index given a set of documents, to help crystallize the concepts. A small set (2-3 docs) is best so that results can be visually checked. Here is my 10 minute attempt in ruby (it actually took longer but it should have taken 10 minutes, need more skillz :)).
Before we begin the indexing we pre-process out documents to produce a dictionary of terms, while at the same time retaining the document ids for each term:
```ruby require ‘ostruct’
documents = [“the quick brown fox jumped over the lazy dog while carrying a chicken and ran away”, “a fox and a dog can never be friends a fox and a cat is a different matter brown or otherwise”, “i don’t have a dog as i am not really a dog person but i do have a cat it ran away but came back by itself”]
def pre_process_documents(documents) dictionary=[] documents.each_with_index do |document, index| document.split.each do |word| record = OpenStruct.new record.term = word record.doc_id = index dictionary << record end end sort_initial_dictionary dictionary end
def sort_initial_dictionary(dictionary) dictionary.sort do |record1, record2| record1.term <=> record2.term end end
initial_dictionary = pre_process_documents documents```
Some things to note for this phase are:
- the documents we are indexing live in an in-memory array, in real life they would live on disk or on the web or whatever, but I didn’t bother with that – for simplicity
- the ids that are assigned to the documents are simply a sequence which is assigned to a document as we encounter it for the first time, in our case we make do with an array index, this is infact analogous to how a real search engine index would do this
- we didn’t really have to sort in this initial phase, but sorting makes me happy so there
We are now ready to begin constructing our inverted index:
```ruby def index(dictionary) inverted_index_hash = {} dictionary.each do |record| postings_list = inverted_index_hash[record.term] || [] postings_list << record.doc_id inverted_index_hash[record.term] = postings_list.sort.uniq end finalize_index inverted_index_hash.sort end
def finalize_index index final_inverted_index = [] index.each do |term_array| final_record = OpenStruct.new final_record.term = term_array[0] final_record.postings = term_array[1] final_record.frequency = term_array[1].size final_inverted_index << final_record end final_inverted_index end
final_index = index initial_dictionary```
There are a few more things of note here:
- I used a hash to begin constructing my inverted index, which I then converted to an array for the final index, this was done for the sake of simplicity once again, but as you might imagine it is not really optimal, especially as your index gets larger. I should have been looking at using some sort of balanced binary tree, which would allow me to keep things sorted and allow reasonably quick access after
- The postings list was kept in memory together with the term, for a larger index these would probably live on disk and each term in memory would have a reference to the on-disk location of it’s postings list
- After the index is finished I go over it again to compute and store the document frequencies for the terms, this may not be terribly efficient, but it is a lot more efficient than computing these on the fly during query time, the larger your document collection the more the savings add up. Plus there are many other things we can compute and store at this point to potentially speed up our queries, it is a classic trade-off
All that’s left to do is print out the final index to make sure everything looks the way we expect.
```ruby def print_index(index) index.each do |term_struct| puts “#{term_struct.term} (#{term_struct.frequency}) -> #{term_struct.postings.inspect}” end end
print_index(final_index)```
This produces the following output:
a (3) -> [0, 1, 2] am (1) -> [2] and (2) -> [0, 1] as (1) -> [2] away (2) -> [0, 2] back (1) -> [2] be (1) -> [1] brown (2) -> [0, 1] but (1) -> [2] by (1) -> [2] came (1) -> [2] can (1) -> [1] carrying (1) -> [0] cat (2) -> [1, 2] chicken (1) -> [0] different (1) -> [1] do (1) -> [2] dog (3) -> [0, 1, 2] don't (1) -> [2] fox (2) -> [0, 1] friends (1) -> [1] have (1) -> [2] i (1) -> [2] is (1) -> [1] it (1) -> [2] itself (1) -> [2] jumped (1) -> [0] lazy (1) -> [0] matter (1) -> [1] never (1) -> [1] not (1) -> [2] or (1) -> [1] otherwise (1) -> [1] over (1) -> [0] person (1) -> [2] quick (1) -> [0] ran (2) -> [0, 2] really (1) -> [2] the (1) -> [0] while (1) -> [0]
We can visually verify that our index looks correct based on the input (our initial collection of 3 documents). There are many things we can note about this output, but for the moment only one is significant:
- As we can see, most of the words in our index only occur in one document and only a couple occur in all three, this is due to the size of our collection i.e. this is bound to happen when indexing a small number of documents. If we add more and more documents the postings lists for all the terms will begin to grow (pretty self-explanatory really).
You’re welcome to go through a similar exercise of constructing an index for some simple documents, it is actually a reasonably decent code kata, if you do I’d love to hear about it. If you want something a little bit more involved consider using some real text documents (rather than contrived ones) and fetching them from disks. Avoid using hashes and arrays for the final index, be a real man/woman and use some sort of tree. Make sure, with your final index, the postings lists live on disk while the dictionary lives in memory and everything still works. There are quite a number of ways to make the exercise a little bit more complex and interesting (unless you’re afraid of a little extra complexity, but then you wouldn’t be the developer I think you are, we eat extra complexity for breakfast, mmmmm complexity). Of course for a web search engine even the term dictionary is potentially too big to be kept in memory, but we won’t get into distributed stuff at this time, that’s a whole different kettle of fish.
This is a very basic explanation of very basic indexing, but surprisingly enough even Google’s mighty index is just a much more advanced descendant of just such an inverted index. I’ll try and get into how we can give our basic index a little bit more muscle in a later post. In the meantime if you want to know more about search, information retrieval and other topics along similar vein, here are some resources (these are the books that I used, so they are all tops :)). For a gentler more mainstream introduction have a look at Programming Collective Intelligence and Collective Intelligence In Action, but if you want to get straight into the nitty gritty and are not afraid of a little bit of maths try Introduction To Information Retrieval, this one is a textbook though with lots of exercises and references and stuff – you’ve been warned. As always, I welcome any comments/questions you might have, even if you just want to say hi :).
Image by koalazymonkey