Using Multiple Rubies Seamlessly On The One Machine With RVM

If you’re into Ruby and are not yet using RVM (Ruby Version Manager), you’re doing yourself a disservice. It’s a great little tool that lets you easily have multiple Ruby installs on the one machine and manages all the associated complexity for you (@markmansour originally put me onto it). You can switch between different Ruby versions instantly, and if you need to make sure that your code works with multiple Ruby versions (e.g. 1.8 and 1.9, or 1.8 and JRuby), then you will really, really love it. Well, I hope you’re excited, so let’s get you set up with your very own RVM install. You do need Linux (I am using Ubuntu), so if you need to work with multiple Rubies on Windows, may god have mercy on your soul.

Installing RVM And Multiple Rubies

Ok, first things first: RVM is distributed as a Ruby gem, so you will need some sort of Ruby install on your system already. It is a bit of a pain, but a small price to pay for the blessing you’re about to receive. Setting up RVM is pretty simple.

Firstly, install the gem:

gem install rvm

Once that’s done, we need to add some hooks. RVM comes with a convenient install script, but unless your gem bin directory is in your path (which it isn’t in my case) you will need to run the script from the rvm gem’s installation directory (the directory name will contain the rvm version, so substitute the x.x.x part for whatever version you have installed).

The last thing to do is to add an extra line to your .bashrc:

echo 'if [[ -s "$HOME/.rvm/scripts/rvm" ]]  ; then source "$HOME/.rvm/scripts/rvm" ; fi' >> ~/.bashrc

At this point your rvm install is good to go and you can forget about using your original system Ruby from now on; instead, let’s install some rvm-managed Rubies. I am going to install Ruby Enterprise Edition (i.e. ree – all the other Rubies follow the same pattern):

rvm install ree

You will need to wait for rvm to do its thing:

Installing Ruby Enterprise Edition from source to: /home/alan/.rvm/rubies/ree-1.8.7-2010.01
Downloading ruby-enterprise-1.8.7-2010.01, this may take a while depending on your connection...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 7295k  100 7295k    0     0   176k      0  0:00:41  0:00:41 --:--:--  154k
Extracting ruby-enterprise-1.8.7-2010.01 ...
Installing ree-1.8.7-2010.01, this may take a while, depending on your cpu(s)...
Installing rubygems dedicated to ree-1.8.7-2010.01...
Installing rubygems for /home/alan/.rvm/rubies/ree-1.8.7-2010.01/bin/ruby
Installation of rubygems ree-1.8.7-2010.01 completed successfully.
Installing rake
Installing gems for ree-1.8.7-2010.01.
Installing rake
Installation of gems for ree-1.8.7-2010.01 is complete.

To install other Rubies you can do the following:

rvm install 1.8.7
rvm install 1.9.1

The above will install the latest known patch level of the Rubies you specified. After you have finished, check that all your Ruby installations are there:

rvm list

   ree-1.8.7-2010.01 [ x86_64 ]
   ruby-1.8.7-p248 [ x86_64 ]
   ruby-1.9.1-p378 [ x86_64 ]
   system [ ]

As you can see, I have several Rubies installed, including the system one, which is your original Ruby (the one under which rvm was installed). It is also the one currently used as the default Ruby by every shell that you open, e.g.:

ruby -v

ruby 1.8.7 (2009-06-12 patchlevel 174) [x86_64-linux]

We can fix that, however. Let’s say we want ree to be the default Ruby from now on; all we need is this:

rvm ree --default
rvm list

=> ree-1.8.7-2010.01 [ x86_64 ]
   ruby-1.8.7-p248 [ x86_64 ]
   ruby-1.9.1-p378 [ x86_64 ]
=> (default) ree-1.8.7-2010.01 [ x86_64 ]
   system [ ]

Now every shell we start will be using ree as its default Ruby:

rvm use default
ruby -v

ruby 1.8.7 (2009-12-24 patchlevel 248) [x86_64-linux], MBARI 0x6770, Ruby Enterprise Edition 2010.01

Pretty handy, but what if I want to quickly switch the Ruby version I am currently using in my shell? All you need to do is this:

rvm use 1.9.1

And, magically, my shell is using a different Ruby:

ruby -v

ruby 1.9.1p378 (2010-01-10 revision 26273) [x86_64-linux]

This is really all you need to know to start using rvm. There are lots of other, more advanced commands, but for regular day-to-day usage I haven’t really found a need for any of them.

A couple of points to remember. Firstly, every time you use rvm to install a new Ruby version, rubygems and rake come for free, i.e. rvm will install them for you for that particular Ruby installation. This of course means that every rvm Ruby installation has its own set of gems, so if you have 30 gems installed under one Ruby and want to try your app out on another, you will need to install all those gems again for that Ruby – makes sense. The rvm site has all the info you need to work effectively, so go forth and explore if you feel like you need to know more.
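
If you ever want to double-check which interpreter a particular shell is wired up to and where its gems go, a couple of lines of plain Ruby will tell you (a minimal sanity check I find handy, nothing rvm-specific):

```ruby
# which Ruby is this shell using, and where do its gems get installed?
require 'rubygems'

puts "ruby #{RUBY_VERSION} (#{RUBY_PLATFORM})"
puts "gems live in: #{Gem.dir}"
```

Run it under each rvm Ruby (e.g. straight after an rvm use) and you will see a different gem directory every time, which is exactly why the gems need to be installed per Ruby.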

Getting All The Rubies To Work With My IDE (Netbeans)

So you’re working happily with your multiple Rubies in the shell, and then you crack open Netbeans for some of the more complex Ruby editing, only to find that none of your rvm-managed Ruby installations are there, and there doesn’t seem to be any way to get them into Netbeans. Normally you would go to Tools->Ruby Platforms in Netbeans to add new Ruby versions:

The easiest thing to do is to autodetect platforms, but this doesn’t seem to find any of your rvm-managed Rubies. If you try to add a platform manually you still run into trouble: rvm installs its Rubies in a sub-directory under .rvm in your home folder, and unfortunately Netbeans won’t let you dig into hidden directories to find Rubies – stupid.

The only way around it seems to be to launch Netbeans from a shell where the rvm-managed Ruby platform you want to add is the currently active Ruby. If you do that, autodetect platforms seems to find the Ruby installation fine. So, to add our 1.9.1 Ruby install to Netbeans we do the following:

rvm use 1.9.1

cd /home/alan/programs/netbeans_6_8/bin   # or wherever your Netbeans is installed

./netbeans

Once Netbeans opens, go to Tools->Ruby Platforms and press Autodetect Platforms; Netbeans should add Ruby 1.9.1 to its list. It’s a pain to do things this way when you have lots of Rubies to add to Netbeans, but that’s just the way it is. If you know of a better/faster way, then do share!

Well, there you go: we’re now set up with multiple Rubies in the shell and in our IDE. Just imagine the fun we can have :).

For more tips and opinions on software development, process and people (and rubies :)) subscribe to skorks.com today.

Image by jaja_1985’s

Search Fundamentals – Basic Indexing

A little while ago I wrote a tiny little crawler; at the time I promised myself that, having dabbled in crawling, I would also cover searching, indexing and other web-search related activities. Little did I know back then just how sparse my knowledge in the area actually was, although some of the comments I received regarding the limitations of my crawler should probably have clued me in :). I can’t rightly say that I am an expert now, but I did learn a little bit in the last few months, enough to know just how complex this area can be. Regardless, I feel that I have reached a milestone in that I can now write about search without making a complete idiot out of myself (I hope :)), which means it is time to fulfill that promise I made to myself.

I’d actually love to cover the ins and outs of writing a big-boy crawler, but that’s a story for another post; for the moment I’ll begin with the fundamentals:

  • some basic terminology for you to throw around at parties
  • the anatomy of a basic index, the fundamental underpinning of any search engine

Why Index?

Wherever we find a search engine, we also find a set of documents. If your set is small enough, you’re lucky – you can probably scan it by hand. But if you need a search engine, chances are your document collection is big, and if you’re looking at writing a search engine for the web, it is very, very big. It is ludicrous to expect a search engine to scan all the documents every time you run a query; it would take forever. To make queries fast and efficient, a search engine will pre-process the documents and create an index.

The Heart Of Every Search Engine

At the core of every modern search engine is an inverted index (this is a standard term; the reason for it will become clear shortly). The most basic thing we want to do is to be able to quickly tell whether a word occurs in any of the documents in our collection. We could assign an id to each of our documents and then associate with each document id all the words that occur in that document, but this is rather inefficient for obvious reasons (duplicate words and all). Instead we invert the concept. We take all the words/terms that occur in all the documents in our collection – this is called the vocabulary (also standard terminology) – and we map each term to the set of document ids it occurs in. Each document id is called a posting and the set of document ids for a term is its postings list. So, the most basic inverted index is a dictionary of terms, each of which is associated with a postings list.
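
To make that concrete, here is a tiny made-up example (two one-line documents, separate from the code later in this post) of what such a dictionary of terms and postings lists looks like:

```ruby
# doc 0: "the quick fox", doc 1: "the lazy dog"
inverted_index = {
  "the"   => [0, 1],  # "the" occurs in both documents, so its postings list has two postings
  "quick" => [0],
  "fox"   => [0],
  "lazy"  => [1],
  "dog"   => [1]
}
```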

It goes without saying that an inverted index is built in advance to support future queries. On the surface this is done in a manner you would expect:

  • go through all the documents, assign each an id and tokenize each one into words
  • process all the tokens (linguistic processing), to produce a list of normalized tokens
  • for each token create a postings list, i.e. a list of document ids it occurs in

Of course those 3 simple steps hide infinite layers of complexity. What we want to end up with is a sorted list of terms each of which is associated with a list of document ids. We can also start storing some extra info even with this basic inverted index, such as the document frequency for each term (how many documents the term occurs in). This extra information will eventually become useful when we want to rank our search results.
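
To make the second step slightly more concrete, normalization can be as simple as downcasing and stripping punctuation; here is a hedged sketch (the code further down skips this step entirely and just splits on whitespace):

```ruby
# a very naive normalizer: lower-case the token and strip anything that
# isn't a letter, a digit or an apostrophe
def normalize(token)
  token.downcase.gsub(/[^a-z0-9']/, '')
end

normalize("Dog,")   # => "dog"
normalize("don't")  # => "don't"
```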

Let’s Play With Some Code

Too much theory without any practice would be cruel and unusual considering we’re hardcore developer types, so a good exercise at this point is to construct a small inverted index for a given set of documents, to help crystallize the concepts. A small set (2-3 docs) is best so that the results can be checked visually. Here is my 10 minute attempt in Ruby (it actually took longer, but it should have taken 10 minutes – need more skillz :)).

Before we begin the indexing, we pre-process our documents to produce an initial dictionary of terms, while at the same time retaining the document id for each term:

```ruby
require 'ostruct'

documents = ["the quick brown fox jumped over the lazy dog while carrying a chicken and ran away",
             "a fox and a dog can never be friends a fox and a cat is a different matter brown or otherwise",
             "i don't have a dog as i am not really a dog person but i do have a cat it ran away but came back by itself"]

# split every document into words, recording the document id against each term
def pre_process_documents(documents)
  dictionary = []
  documents.each_with_index do |document, index|
    document.split.each do |word|
      record = OpenStruct.new
      record.term = word
      record.doc_id = index
      dictionary << record
    end
  end
  sort_initial_dictionary dictionary
end

# sort the (term, doc_id) records alphabetically by term
def sort_initial_dictionary(dictionary)
  dictionary.sort do |record1, record2|
    record1.term <=> record2.term
  end
end

initial_dictionary = pre_process_documents documents
```

Some things to note for this phase are:

  • the documents we are indexing live in an in-memory array, in real life they would live on disk or on the web or whatever, but I didn’t bother with that – for simplicity
  • the ids that are assigned to the documents are simply a sequence, assigned to each document as we encounter it for the first time; in our case we make do with the array index, which is in fact analogous to how a real search engine index would do it
  • we didn’t really have to sort in this initial phase, but sorting makes me happy so there

We are now ready to begin constructing our inverted index:

```ruby
# build the inverted index: map each term to a sorted list of unique document ids
def index(dictionary)
  inverted_index_hash = {}
  dictionary.each do |record|
    postings_list = inverted_index_hash[record.term] || []
    postings_list << record.doc_id
    inverted_index_hash[record.term] = postings_list.sort.uniq
  end
  finalize_index inverted_index_hash.sort
end

# turn the sorted [term, postings] pairs into records, storing the
# document frequency (the size of the postings list) along the way
def finalize_index(index)
  final_inverted_index = []
  index.each do |term_array|
    final_record = OpenStruct.new
    final_record.term = term_array[0]
    final_record.postings = term_array[1]
    final_record.frequency = term_array[1].size
    final_inverted_index << final_record
  end
  final_inverted_index
end

final_index = index initial_dictionary
```

There are a few more things of note here:

  • I used a hash to begin constructing my inverted index, which I then converted to an array for the final index. This was done for the sake of simplicity once again, but as you might imagine it is not really optimal, especially as your index gets larger; I should really have been looking at using some sort of balanced binary tree, which would allow me to keep things sorted and still allow reasonably quick access afterwards
  • The postings list was kept in memory together with the term; for a larger index these would probably live on disk, and each term in memory would hold a reference to the on-disk location of its postings list
  • After the index is finished I go over it again to compute and store the document frequency for each term. This may not be terribly efficient, but it is a lot more efficient than computing these on the fly at query time, and the larger your document collection the more the savings add up. Plus there are many other things we can compute and store at this point to potentially speed up our queries – it is a classic trade-off

All that’s left to do is print out the final index to make sure everything looks the way we expect.

```ruby
# print each term with its document frequency and postings list
def print_index(index)
  index.each do |term_struct|
    puts "#{term_struct.term} (#{term_struct.frequency}) -> #{term_struct.postings.inspect}"
  end
end

print_index(final_index)
```

This produces the following output:

a (3) -> [0, 1, 2]
am (1) -> [2]
and (2) -> [0, 1]
as (1) -> [2]
away (2) -> [0, 2]
back (1) -> [2]
be (1) -> [1]
brown (2) -> [0, 1]
but (1) -> [2]
by (1) -> [2]
came (1) -> [2]
can (1) -> [1]
carrying (1) -> [0]
cat (2) -> [1, 2]
chicken (1) -> [0]
different (1) -> [1]
do (1) -> [2]
dog (3) -> [0, 1, 2]
don't (1) -> [2]
fox (2) -> [0, 1]
friends (1) -> [1]
have (1) -> [2]
i (1) -> [2]
is (1) -> [1]
it (1) -> [2]
itself (1) -> [2]
jumped (1) -> [0]
lazy (1) -> [0]
matter (1) -> [1]
never (1) -> [1]
not (1) -> [2]
or (1) -> [1]
otherwise (1) -> [1]
over (1) -> [0]
person (1) -> [2]
quick (1) -> [0]
ran (2) -> [0, 2]
really (1) -> [2]
the (1) -> [0]
while (1) -> [0]

We can visually verify that our index looks correct based on the input (our initial collection of 3 documents). There are many things we can note about this output, but for the moment only one is significant:

  • As we can see, most of the words in our index occur in only one document and only a couple occur in all three; this is down to the size of our collection, i.e. it is bound to happen when indexing a small number of documents. If we add more and more documents, the postings lists for all the terms will begin to grow (pretty self-explanatory really).

You’re welcome to go through a similar exercise of constructing an index for some simple documents; it is actually a reasonably decent code kata, and if you do I’d love to hear about it. If you want something a little bit more involved, consider using some real text documents (rather than contrived ones) and fetching them from disk. Avoid using hashes and arrays for the final index – be a real man/woman and use some sort of tree. Make sure that, in your final index, the postings lists live on disk while the dictionary lives in memory, and that everything still works. There are quite a number of ways to make the exercise a little bit more complex and interesting (unless you’re afraid of a little extra complexity, but then you wouldn’t be the developer I think you are; we eat extra complexity for breakfast, mmmmm complexity). Of course, for a web search engine even the term dictionary is potentially too big to be kept in memory, but we won’t get into distributed stuff at this time – that’s a whole different kettle of fish.
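
To give you a taste of one such extension, here is a hedged sketch (my own addition, not part of the code above) of answering a simple two-term AND query by intersecting the postings lists stored in the final_index we built earlier:

```ruby
# documents containing both terms = the intersection of their postings lists
def and_query(index, term1, term2)
  record1 = index.find { |record| record.term == term1 }
  record2 = index.find { |record| record.term == term2 }
  return [] unless record1 && record2
  record1.postings & record2.postings  # array intersection, already sorted and unique
end

and_query(final_index, "dog", "cat")  # => [1, 2] for our three documents
```

Because the postings lists are sorted, a real engine would intersect them with a merge-style walk rather than Ruby’s generic Array#&, but the idea is the same.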

This is a very basic explanation of very basic indexing, but surprisingly enough even Google’s mighty index is just a much more advanced descendant of just such an inverted index. I’ll try to get into how we can give our basic index a little bit more muscle in a later post. In the meantime, if you want to know more about search, information retrieval and other topics in a similar vein, here are some resources (these are the books that I used, so they are all tops :)). For a gentler, more mainstream introduction have a look at Programming Collective Intelligence and Collective Intelligence In Action, but if you want to get straight into the nitty gritty and are not afraid of a little bit of maths, try Introduction To Information Retrieval – this one is a textbook though, with lots of exercises and references and stuff, you’ve been warned. As always, I welcome any comments/questions you might have, even if you just want to say hi :).

Image by koalazymonkey

How To Speed Up Your Website By 80% Or More

This book should have been a 3 page whitepaper. Instead they wrote a 100 page book and sold it for $30 ($20 on Amazon). I am talking about High Performance Websites. I don’t like to rant about books – I believe you can never read too many – but in this case, paying that much money for a 2 hour read stretched even my patience. And still, I would have been happy if it was 100 pages packed full of awesome content. But, you guessed it, in this case, if you cut out the filler you could really fit all the useful info into about 3 pages (which would have made those 3 pages a really awesome resource).

Still, I can’t be 100% critical: the book did teach me a few things I didn’t know before, and if you’re predominantly a back-end developer you will probably pick out a few useful tidbits as well. Even so, after you finish it you kind of wish you had stopped reading after the table of contents, which covers 80% of the useful info in the book (but of course you wouldn’t know this until you’ve read through the whole thing). Luckily, since I’ve already been through it, I can save many other people the time and the money and create a summary – which is what this book should have been to start with.

The Summary

If you examine the HTTP requests made as a web page loads in a browser, you will see that at least 80% of the response time is spent loading the components on the page (scripts, images, CSS etc.) and only about 20% is spent downloading the actual HTML document (and that 20% includes all the back-end processing). It therefore behooves us to spend some time on front-end optimization if we want to significantly speed up our website loading times. There are 14 main points to look at when we’re trying to do this:

1. Try to make fewer HTTP requests

  • try using image maps instead of having separate images
  • you may also try using CSS sprites instead of separate images
  • it is also sometimes possible to inline the images in your HTML page (base64 encoded)
  • if you have multiple JavaScript or CSS files, get your build process to combine them into one master file (one for CSS, one for JavaScript) – see the sketch just below this list
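
As a hedged illustration of that last point, the combining step can be as small as a few lines of Ruby in your build (the file names and paths here are made up for the example):

```ruby
# glue all JavaScript files in a directory into a single master file so the
# page only needs one <script> request (do the same for CSS)
js_files = Dir.glob("public/javascripts/*.js").sort.reject { |path| path.end_with?("all.js") }

File.open("public/javascripts/all.js", "w") do |master|
  js_files.each { |path| master.puts(File.read(path)) }
end
```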

2. Use a content delivery network (CDN)

  • a content delivery network is a collection of web servers distributed across multiple locations
  • this allows browsers to download from servers that are geographically closer, which can speed up download times
  • there are several CDNs that major websites use, e.g. Akamai, Mirror Image, Limelight etc.

3. Add a far future Expires header to all your resources

  • more specifically, add a far future expires header to allow the browser to cache resources for a long time
  • you can use apache mod_expires to take care of this for you (a Ruby-flavoured sketch of setting the header by hand follows just after this list)
  • you don’t get the savings the first time users visit (obviously), only on subsequent visits
  • you should add far future expires headers for images, scripts and CSS
  • you should introduce revision numbers for your scripts and CSS to allow you to modify these resources and not worry about having to expire what is already cached by the browser
  • you can hook creating revision numbers for your scripts and CSS into your build process
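
The book leaves the details to apache, but if you happen to be serving a Ruby app, here is a hedged sketch of what setting the header by hand might look like as a tiny Rack middleware (the class name is mine, and mod_expires remains the easier option):

```ruby
require 'time'  # for Time#httpdate

# hypothetical Rack middleware that adds a far future Expires header,
# roughly ten years out, to every response it sees
class FarFutureExpires
  def initialize(app)
    @app = app
  end

  def call(env)
    status, headers, body = @app.call(env)
    headers['Expires'] = (Time.now + (10 * 365 * 24 * 60 * 60)).httpdate
    [status, headers, body]
  end
end

# in config.ru:  use FarFutureExpires
```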

4. Gzip components

  • you should gzip your HTML pages, scripts and CSS when they are sent to the browser
  • you can use apache mod_gzip (for apache 1.3) or mod_deflate (for apache 2.x) to handle all this for you (see the Rack aside just after this list)
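
If your app happens to be a Ruby one sitting behind Rack, the bundled Rack::Deflater middleware gives you much the same thing as mod_deflate (a hedged aside; the MyApp class below is just a placeholder):

```ruby
# config.ru - gzip responses for clients that advertise Accept-Encoding: gzip
require 'rack'

use Rack::Deflater
run MyApp.new  # placeholder for your actual Rack application
```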

5. Put stylesheets at the top (in the document HEAD using the LINK tag)

  • we want the website to render progressively in the browser (i.e. to show content as it becomes available), but many browsers will block rendering until all stylesheets have loaded, so loading stylesheets as soon as possible is preferable
  • having CSS at the top may actually make the page load a little slower (since it can load stylesheets it doesn’t need), but it will feel faster to the users due to progressive rendering

6. Put scripts at the bottom

  • normally, according to the HTTP spec, a browser can make only two parallel requests to the same hostname; splitting components across multiple hostnames can therefore improve performance
  • scripts block parallel downloads, having scripts at the top will block all other components from downloading until the scripts have finished loading
  • having scripts at the bottom allows all other components to load and take advantage of parallel requests

7. Avoid CSS expressions

  • CSS expressions are re-evaluated very frequently, so they can degrade page performance even after the page has loaded
  • instead use one-time expressions or better yet use event handlers

8. Make JavaScript and CSS external

  • if users visit infrequently, you’re better off inlining your CSS and JavaScript into your HTML, as the page is unlikely to be in the browser cache anyway and this minimizes requests
  • if users visit frequently, you’re better off having separate files for your CSS and JavaScript as this allows the browser to cache these components and only need to fetch the HTML page which is smaller due to the fact that CSS and JavaScript are externalized

9. Reduce DNS lookups

  • if the browser or OS has a DNS record in its cache, no DNS lookup is necessary, which saves time
  • you can use Keep-Alive to avoid DNS lookups, if there is an existing connection no DNS lookup is needed
  • if there are fewer hostnames, fewer DNS lookups are needed, but more hostnames allow more parallel requests

10. Minify your JavaScript

  • this means removing unnecessary stuff from your scripts, such as whitespace, comments etc.; this makes the scripts much smaller
  • you can also obfuscate, but the extra savings compared to minification are not worth it, especially if gzip is used
  • you can use JSMin to minify your JavaScript
  • you can also minify inline scripts
  • minifying CSS is possible but usually not worth it

11. Avoid redirects

  • a redirect means an extra request, and it prevents all other components from loading until it completes, which hurts performance
  • don’t use redirects to fix trivialities such as a missing trailing slash; this can be handled through apache configuration
  • you don’t need to use redirects for tracking internal traffic, you can instead parse Referer logs

12. Remove duplicate scripts

  • duplicate scripts often creep in; they make web pages larger and require more requests, which hurts performance
  • implement processes to make sure scripts are included only once

13. Configure or remove ETags

  • ETags are used by servers and browsers to validate cached components
  • if you’re using Expires headers, the last modified date may be used by browsers to check whether a component needs to be re-fetched; ETags are an alternative to the last modified date
  • the problem is that ETags are typically constructed to be specific to a single server, which becomes an issue when a site is served by several servers
  • there are apache modules that can customize ETags to not be server specific
  • if the last modified date is good enough, it is best to remove ETags

14. Make AJAX cacheable

  • the same rules as above apply to AJAX requests as to all other requests; it is especially important to gzip components, reduce DNS lookups, minify JavaScript, avoid redirects and configure ETags
  • try to make the response to AJAX requests cacheable
  • add a far future Expires header for your cacheable AJAX requests

That’s it – that was the whole book, without having to spend $20-30. There are so many things that could have been expanded on in this book: examples, step-by-step instructions, configuration snippets etc. With a bit more effort it could have been made into a valuable resource, worthy of its price tag. Alternatively, it could have been priced in a fashion commensurate with the level/amount of content. The way it stands though, it was just a little annoying. Enjoy the summary.


For more tips and opinions on software development, process and people subscribe to skorks.com today.

Image by Alex C Jones