How A Good Manager Is Like A Ninja

Let me start by saying that I am not recommending you dress up in black pyjamas and go on a killing spree :). With that out of the way, this idea came to me while I was reading chapter 4 of Beautiful Teams (I’ve mentioned Beautiful Teams before). For whatever reason I must have had ninjas on my mind (as you do :)) and thought it would be interesting to draw some parallels, since ninjas and managers have a lot more in common than the obvious potential scariness factor.

You see, ninjas are not all about killing. Yes, they may be assassins for hire, but they are, in essence, great enablers. They remove obstacles and through their actions make the impossible probable. They remain unseen, in the shadows, but what they do has a tremendous impact. This is what every manager should be – an enabler. They should concentrate on removing obstacles so that you can have the most impact as a professional. When there is an organisational obstacle that you see no way to overcome, they should be able to use their skills and cunning to make the impossible happen. All this should be done without kicking up too much of a fuss. All you would need to do is watch, slightly bemused, as the roadblocks that were preventing you from doing what you needed to do disappear almost by magic. That is true management ninjutsu.

Let’s speak plainly. You help your people, but you allow them to shine. You don’t seek credit (how useful is an assassin that seeks fame?), you seek to achieve goals. You gain validation through the achievements of your team. What’s the main theme here? It is the fact that as a manager, you’re actually a servant – not a dictator. You serve your people and try to make their lives better, kinda like what a politician should be :). This in turn will allow your people to have a greater positive impact on the wider community (i.e. the company). Everyone benefits in the long run.

Enough Ninjas

The rest of what I have to say doesn’t really fit my metaphor, so I am going to abandon it at this point (I don’t really want to engage in any kind of verbal contortionism :)). It is reasonably easy to distil the qualities that a good manager needs (even though everyone has different ideas here) – I have even done it once before in a guest post. The difficult part is HOW you come to embody those qualities, especially when some of them go directly against your personality.

It really comes down to one factor: you need to genuinely care about the people who work for you (if you don’t give a shit about people, you’re out of luck). And I don’t mean you ‘say’ that you care in a group meeting; you actually have to show it, on an individual basis, and then back it up when it counts (i.e. when the person really needs your help). And you have to keep showing that you care – for each person as an individual – this is how you build trust. It is difficult for most people to talk about “touchy-feely” stuff in a one-on-one setting; it makes us feel uncomfortable. So, we overcome it by avoiding it. Not a good strategy – all it gets you is a generic corporate atmosphere, and it does not engender any loyalty. If you can generate loyalty, through building trust and genuinely caring for your people, then when you ask me to put in some extra effort, I am likely to do it, because I know you care and I know you wouldn’t ask frivolously.

A good manager is a good listener; more than that, they’re good at reading between the lines. Remember what I said about removing roadblocks? Well, you’re lucky if the issues are visible and clear-cut (e.g. “We need more machines”). Often the issues are subtle: two team-members don’t get along, people don’t feel empowered, personal problems are affecting work etc. You would be the luckiest manager in the world if someone just came out and told you that these problems exist. Often your people themselves aren’t even aware of the issue, or when they are, they may not be comfortable confiding in you, or simply can’t articulate it. But through your individual interactions with the people you manage, you should be able to deduce what the problems are, and it is then up to you to do something about it (yeah, you have to be almost a mind-reader – I didn’t say I had quick and easy answers :)).

Lastly, a good manager, just like a good developer, should know his/her business. It doesn’t matter how you fell into your management role; it is up to you to become a professional in what is now your trade. When the only qualification a manager has is their age, or being in the right place at the right time, it is more than just annoying – it is dangerous. When you as a manager don’t constantly try to improve yourself, it paints an unflattering picture and undermines everything you try to do. Like a fat personal trainer – it is tantamount to hypocrisy.

How Search Engines Process Documents Before Indexing

From my last two posts about searching (i.e. basic indexing and boolean queries), you might get the impression that writing a search engine is a fairly straightforward process. In fact, nothing could be further from the truth. To highlight this, I am going to explore one of the issues I have glossed over previously in a little more detail – the linguistic preprocessing of documents before indexing occurs. This is a massive topic and I have no hope of covering all of it in any kind of detail in just one post; instead I will try to touch on some of the main concepts and issues involved, to give everyone an idea of the scope that even this tiny area of information retrieval can cover.

So, let’s assume we have just fetched a document – what exactly do we need to do before we begin indexing?

Determining Document Encoding And Getting A Character Sequence

The first thing we need to be aware of is the fact that all we have is a byte stream. We don’t necessarily know what kind of document we have just fetched. Of course, if we are building a custom search system where we know the kinds of documents we’ll be dealing with, then this is not the case. But for a complex search engine that needs to deal with multiple document formats, before doing anything else we need to determine the encoding of the document. We need to do this in order to correctly turn our byte stream into a character stream (some encodings, such as UTF-8, use a variable number of bytes per character). This may be simple for some documents, as they will contain their encoding as metadata; otherwise it is somewhat more complex. I won’t go into too much more detail here – there is plenty of material around on charset detection, so do look around if you’re interested and don’t forget to let me know what you find :).
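
To make this concrete, here is a minimal Ruby sketch (assuming Ruby 2.x String encodings) of turning a byte stream into a character stream. It only sniffs Unicode byte order marks and falls back to a UTF-8 validity check – a real system would also use document metadata and statistical charset detection, and the Latin-1 fallback here is purely a guess:

# A minimal byte-stream to character-stream sketch. The Latin-1 fallback
# is a guess – any byte sequence is valid Latin-1, so it never fails,
# it just may be wrong.
def to_characters(raw)
  bytes = raw.dup.force_encoding(Encoding::BINARY)
  return bytes[3..-1].force_encoding(Encoding::UTF_8)    if bytes.start_with?("\xEF\xBB\xBF".b)
  return bytes[2..-1].force_encoding(Encoding::UTF_16LE) if bytes.start_with?("\xFF\xFE".b)
  return bytes[2..-1].force_encoding(Encoding::UTF_16BE) if bytes.start_with?("\xFE\xFF".b)
  guess = bytes.dup.force_encoding(Encoding::UTF_8)
  guess.valid_encoding? ? guess : bytes.force_encoding(Encoding::ISO_8859_1)
end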

Once we have determined the encoding, our work is not yet over. Our ultimate aim is to get a character stream, but the document may be an archive (e.g. zip) which we may need to extract in order to work with its contents. Alternatively, a document may be in a binary format (e.g. PDF, or an MS Word document) where getting the character stream is quite a bit more involved. Even when this is not the case, most documents will contain spurious data that we don’t want to deal with (e.g. HTML or XML tags) and which we will therefore need to strip from our document. Having done all of that, we eventually end up with a character sequence that we can process further. It is also worth noting that this part of the system will potentially need to be highly extensible in order to cater for different encodings and document formats (including future ones).
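
As a tiny taste of the stripping step, here is a deliberately naive Ruby sketch that removes HTML-ish tags – real markup (scripts, comments, entities, malformed tags) needs a proper parser, not a regex:

# Naive tag stripper: replaces anything that looks like a tag with a space,
# then collapses runs of whitespace. Good enough for an illustration only.
def strip_tags(html)
  html.gsub(/<[^>]*>/, " ").gsub(/\s+/, " ").strip
end

strip_tags("<p>Hello <b>world</b></p>") # => "Hello world"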

Finding The Document Unit

Having extracted the relevant characters from the raw document, we now need to determine exactly what constitutes a document for the purposes of our index. In other words, we need to find what’s known as a document unit – i.e. the chunk of text that will be treated as a distinct document by our indexer. This is not as simple as it might appear. Consider: we have pulled in a document which contains an e-mail conversation thread. Do you treat each e-mail as a separate document for indexing purposes, or do you just take the whole thread? What if some of the e-mails in the thread have attachments – do you ignore them, treat each one as a separate document, or make it part of the e-mail it was attached to? Taking this idea further, let’s say you’re indexing very large documents like books. Is the whole book one document? If someone does a multi-word query, you don’t necessarily want the book to come back as a result if one of the words appears at the start and the other at the end – they may be completely unrelated and would not satisfy the query. So, do you index the book with each chapter being a document? How about each paragraph? If your document units are too granular, however, you potentially miss out on concepts that ARE related and thereby fail to return relevant results. As you can see, it is a precision versus recall trade-off. Depending on your document collection (what you’re trying to index), you may already have a clearly delineated document unit, or you may employ heuristic methods to work out the document units based on the content.
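
For example, if we decided that each chapter of a book should be a document unit, a first cut (assuming the hypothetical convention that chapters in this collection start with a “Chapter N” heading) might look like this:

# Split a book into chapter-sized document units. The heading pattern is
# an assumption about this particular collection, not a general rule.
def document_units(book_text)
  book_text.split(/^Chapter \d+.*$/i).map(&:strip).reject(&:empty?)
end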

Tokenizing The Document

Once we know our document unit, it is time to tokenize. Once again, this is deceptively complex. We want to split our document units into parts that will form the terms of our index. In the simplest case we split into words on whitespace and punctuation, but this is only of limited utility. For example, if we have words such as “are not” everything is fine, but what about “aren’t”? The semantic meaning is the same, so we potentially want to get the same two words “are not” out of it, but if we simply split on punctuation and whitespace, we’ll get “aren t” – not quite as useful. There are countless examples like that one. But wait, there is more. What about names and terms such as C++, where discarding the plus signs will alter the meaning significantly? Then there are special cases such as URLs, dates, IP addresses, etc. that only make sense if taken as a single token. And if that wasn’t enough, consider terms that are made up of multiple words which lose their meaning (or have different meanings) when split up, for example Baden-Baden, San Francisco, Sydney Opera House etc. There are many other similar considerations to be aware of.

The crucial thing to remember is this: whatever tokenizer we apply to our documents during indexing, we also need to apply to our queries at search time, otherwise we will get, at best, unpredictable results – pretty obvious. As you can see, tokenization can be a complicated problem, and if you’re working with documents in multiple languages it’s an order of magnitude more complicated still, as each language presents its own set of challenges (if you speak a second language you’ll see what I mean; if you don’t, consider the fact that some Asian languages are written without spaces :)).
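
Here is a toy Ruby tokenizer along those lines – the protected patterns (URLs, C++-style names, IP addresses) are just illustrative, and a production tokenizer would have many more rules. The same function would run over documents at indexing time and over queries at search time:

# Patterns are tried in order; special cases are protected before the
# generic word pattern gets a chance to split them apart.
TOKEN_PATTERN = Regexp.union(
  %r{https?://\S+},                  # URLs survive as single tokens
  /[a-z]\+\+/,                       # c++ keeps its plus signs
  /\d{1,3}(?:\.\d{1,3}){3}/,         # IP addresses stay whole
  /[[:alnum:]]+(?:'[[:alnum:]]+)?/   # ordinary words, keeping apostrophes
)

def tokenize(text)
  text.downcase.scan(TOKEN_PATTERN)
end

tokenize("Isn't C++ running at 10.0.0.1? See http://example.com")
# => ["isn't", "c++", "running", "at", "10.0.0.1", "see", "http://example.com"]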

Removing Stop Words (Or Not)

Once we have tokenized successfully, we are almost ready to index, but some of the words add very little or no value to the index. These are normally called stop words (think of words such as: a, an, the etc.) and they are often discarded from the index entirely. To find the stop words, we can sort our list of terms by frequency and pick the most frequent ones according to their lack of semantic value (this can be done by hand, as stop lists are normally quite small). While stop words usually don’t add much value and can be discarded safely, this is not always the case. For example, think of the phrase “to be or not to be”, where every word is potentially a stop word – it is not the best outcome when the system returns no results for that kind of query. For this reason, web search engines don’t usually employ stop lists. If a search engine does decide to use a stop list, it must once again be applied not only to the documents during indexing, but also to the query during the search.
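
Picking the candidates is easy to sketch in Ruby – the hand-vetting for semantic value is the part that can’t be automated:

# Return the top_n most frequent tokens as stop-word candidates;
# a human still has to decide which of them are semantically empty.
def stop_word_candidates(tokens, top_n = 25)
  counts = Hash.new(0)
  tokens.each { |t| counts[t] += 1 }
  counts.sort_by { |_, c| -c }.first(top_n).map(&:first)
end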

Token Normalization

If you think that we are now pretty much ready to start indexing, you would be wrong. The next thing we want to do with our tokens is normalize them. This means we turn each token into its canonical form. For example, as far as we are concerned, the terms co-operation and cooperation are equivalent, so we can say that the canonical form of both of those terms is cooperation. We want searches for either term (co-operation or cooperation) to return results applicable to both. We can do this in a couple of ways:

  • Equivalence classing – this is an implicit process based on a set of rules and heuristics. For example, we can say that the canonical form of a word with a hyphen is the same word without one, and we will therefore create an equivalence class between those words (e.g. co-operation, cooperation -> cooperation). These equivalence classes are applied to the document terms during indexing as well as to the queries at runtime, so that a query containing one of the words in the equivalence class will return documents that contain any of the words.
  • Synonym lists – this is a more explicit and labour-intensive process, but it can give more control. Synonym lists are constructed by hand (e.g. think – chair, stool, seat, throne). These synonym lists can then be applied in one of two ways. At indexing time you can index the same document under all the terms that are synonyms of each other; this requires a bit more space, but at query time any of the synonym terms will match the document. Alternatively, you may apply the synonym lists at runtime, whereby you expand the query to contain more terms (i.e. all the synonyms), which means the query takes a bit longer to execute. How you decide to do it is up to you – it is a classic space/time trade-off (see the sketch after this list).

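Both approaches boil down to a normalizing function that must run over document tokens at indexing time and over query tokens at search time. Here is a minimal Ruby sketch, with illustrative rules (hyphen removal, case folding) and a made-up synonym table:

# Map each token to its canonical form. The rules and the synonym table
# here are toy examples, not a recommended rule set.
SYNONYMS = {
  "stool"  => "chair",
  "seat"   => "chair",
  "throne" => "chair"
}

def normalize(token)
  canonical = token.downcase.delete("-")  # equivalence class: co-operation -> cooperation
  SYNONYMS.fetch(canonical, canonical)    # explicit synonym list lookup
end

normalize("Co-operation") # => "cooperation"
normalize("Throne")       # => "chair"
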
There are other normalization steps that can be applied to our terms, such as case folding, whereby we lower-case all our terms. This, of course, needs to be applied to the query as well, but it can come back to bite us, especially where proper names are concerned. For example, Crunchy Nut Cornflakes is a breakfast cereal, while crunchy nut cornflakes are just three random words. Despite this, case folding is normally used by most search engines.

Stemming And Lemmatization

We’ve already done a lot of processing on our documents and terms, but there is still more to do. Different words often derive from the same root, e.g. differ, different, difference, differing etc. It makes sense that a query for one of these words could potentially be satisfied by a different form of it. We can therefore try to equate all of these words for indexing purposes. For this we can use either stemming or lemmatization (which are, I guess, types of normalization, but deserve a section of their own in my opinion).

  • Stemming basically concerns itself with chopping off suffixes, using various heuristics to find the more or less correct base form of the word. There are several well-known stemming algorithms. The Lovins algorithm is one of the earliest stemmers and the Porter stemmer is one of the best known. Stemming tends to increase recall while harming precision.
  • Lemmatization is a more complex process that uses morphological analysis and vocabularies to arrive at the base (dictionary) form of a word, known as the lemma.

Neither stemming nor lemmatization tends to provide a lot of benefit when dealing with English language documents (which doesn’t mean they are never used :)); however, languages with richer morphologies will often benefit to a greater degree.
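
To show the flavour of suffix stripping, here is a toy stemmer in Ruby – far cruder than the Porter algorithm, but it conflates the example words above to a common base:

# A toy suffix stripper. Real stemmers use ordered rule sets with
# conditions on what remains after stripping; this just hacks off
# a few suffixes when enough of the word is left.
SUFFIXES = %w[ence ent ing ed s]

def crude_stem(word)
  SUFFIXES.each do |suffix|
    if word.end_with?(suffix) && word.length - suffix.length >= 3
      return word[0...-suffix.length]
    end
  end
  word
end

%w[differ different difference differing].map { |w| crude_stem(w) }
# => ["differ", "differ", "differ", "differ"]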

And there you have it. I hope this has given you an appreciation for how complex even a small area, like preprocessing, can be when it comes to search engines. Of course I am far from an expert, so if you have something to add or want to correct me :), then do feel free to leave a comment. I hope you’ve been enjoying my posts about searching – there is more to come.

Using Ruby On Rails With Oracle And Deploying It All To Tomcat

Ruby on Rails projects usually use MySQL or PostgreSQL for their database, but in the corporate world Oracle is king. As much as you might like to have a Postgres backend, the powers-that-be have decreed, and you must obey. Don’t worry though, all is not lost – you don’t have to slink back to Java. Here is how you can get your Rails app working with an Oracle database. Note: all of this works with Rails 2.3.5; I haven’t tried it with Rails 3, but feel free to give it a go and don’t forget to let me know how it works out.

Getting Rails To Play With Oracle

Alright, first things first – I hope you’re using RVM, cause it’s awesome and will make your life easier. With that out of the way, you need to install two gems before trying to use Oracle with Rails.

  1. ruby-oci8 gem – this is the Ruby interface for Oracle, using the OCI8 API. You need to have a version of Oracle installed on your machine for this gem to work; otherwise the OCI8 library will not be available and this will make your life difficult (i.e. impossible :)), as you might imagine.
  2. activerecord-oracle_enhanced-adapter gem – this is an ActiveRecord adapter with useful extra methods for working with new and legacy Oracle databases from Rails.

You can potentially vendor the oracle enhanced adapter gem within your Rails app, but I wouldn’t vendor the ruby-oci8 gem as it has native extensions. Normally that would be fine, but we’re going to make a Java application out of this in the end (for deployment to Tomcat) and native extensions will once again make life very difficult (i.e. impossible :)).

So, to install these gems we do:

gem install ruby-oci8
gem install activerecord-oracle_enhanced-adapter

This will usually go smoothly, but I’ve had one situation where the native extensions failed to build properly when installing ruby-oci8, and I couldn’t find anything that should have caused it. To fix the issue, I simply compiled the gem from source and then re-installed it, at which point everything went smoothly. Something to keep in mind.

Rails Configuration

As you would expect, we now need to modify our database.yml to configure our Oracle database. This is reasonably straightforward, but there are a few things to be aware of. By default your database.yml file will have something like this:

development:
  adapter: mysql
  database: rails_development
  username: user
  password: pass

Of course there will also be similar entries for test and production.

Oracle has a slightly different model: we don’t create different databases, we create different database users, so the username and password are what distinguish one Oracle “database” from another.

Our configuration will now look like this:

development:
  adapter: oracle_enhanced
  database: YOUR_ORACLE_SID
  host: localhost
  username: user
  password: pass
  port: 1522

The differences from the norm are as follows:

  • you need to have an adapter entry and the value will always be oracle_enhanced
  • the database entry should contain the SID of your Oracle installation; this too will remain the same across environments
  • as I mentioned, the username and password are not just credentials, they also identify the Oracle database (as per how Oracle works)
  • if you changed the default port (1521) when installing Oracle, you need to specify that too

You will once again need similar entries for test and production to maintain the standard Rails conventions.
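
For completeness, the extra entries might look something like this (the host and users are obviously placeholders – substitute your own):

test:
  adapter: oracle_enhanced
  database: YOUR_ORACLE_SID
  host: localhost
  username: test_user
  password: pass
  port: 1522

production:
  adapter: oracle_enhanced
  database: YOUR_ORACLE_SID
  host: your_production_host
  username: prod_user
  password: pass
  port: 1522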

Modifying Your Normal Workflow

Of course things are not quite as simple as that (they never are with Oracle :)). Firstly, some of the rake commands you’re used to will not work with Oracle, such as:

rake db:create
rake db:drop

You will need to create your Oracle databases by hand, which in “Oracle world” means you will need to create some users. You have to log in to Oracle as an admin user, create another user and give them the right privileges. To log in to Oracle using sqlplus, you can do the following:

sqlplus admin_user/admin_password as sysdba

You can then execute a series of SQL commands similar to those below, in order to create a new user and give them the right privileges:

CREATE USER my_user IDENTIFIED BY my_password DEFAULT TABLESPACE some_tablespace TEMPORARY TABLESPACE temp QUOTA UNLIMITED ON some_tablespace;
CREATE ROLE my_role;
GRANT all privileges TO my_role;
GRANT my_role TO my_user;

The tablespace and quota stuff is optional (I think :)) so feel free to leave it out. If things are still not working after you’ve configured it all in Rails and tried it out, you can also try:

GRANT all privileges TO my_user;

This should fix most of the issues you might encounter. At this point all the migrate commands should work, such as:

rake db:migrate
rake db:migrate:down

These are the important ones as far as I am concerned, so I am not too worried about losing some of the others.

One last thing to note: if you don’t create separate development and test users due to laziness or whatever (which is exactly what happened in my case :)), be aware that if you run tests through rake, all your dev data will be blown away (Rails blows away the data in the test database every time).

But What If You Need To Deploy To Tomcat?

Once again the corporate overlords make their presence felt :) and you’re only allowed to deploy to Tomcat. You could develop everything in JRuby and then roll a WAR file at the end, but why bother when you have Warbler?

Warbler is a gem that will make a WAR file out of your Rails app – pretty convenient. The way it does this is by bundling a copy of JRuby (i.e. you don’t need to supply it) and then packaging all the resources in your Rails app into a WAR structure in a sensible fashion. To install Warbler, all you need to do is:

gem install warbler

There are a few different things you can do with Warbler, but you really just want to create a WAR file, so all you need to do is go to the root of your Rails app and type:

warble

This will produce a WAR file for you which you can deploy directly to Tomcat – easy. There are however a few things to be aware of, especially in our case.

  1. As I mentioned already, make sure you’re not using any gems that have native extensions (e.g. hpricot). Warbler will try to bundle all the gems it needs, and it would be pretty difficult (i.e. impossible :)) to do the compilation steps from within a container.
  2. We’re using Oracle, but we will not need the ruby-oci8 gem (which has native extensions); instead we can use the ojdbc14 JAR. All you need to do is place the JAR file in your Rails application’s /lib folder and Warbler will take care of the rest (see the config sketch after this list).
  3. Incidentally, if you need to include any other JAR files, also put them in the /lib folder of your Rails app.
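
If you do need to tweak what gets bundled, Warbler can generate a config file for you (warble config creates config/warble.rb). As a rough sketch – the exact option names may differ between Warbler versions, so check the generated template – it looks something like this:

# config/warble.rb – a sketch only; your Warbler version's generated
# template is the authoritative reference for these options.
Warbler::Config.new do |config|
  config.gems += ["activerecord-oracle_enhanced-adapter"]  # bundle the adapter gem
  config.java_libs += FileList["lib/ojdbc14.jar"]          # extra JARs to ship in the WAR
end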

That’s it. Unless you were doing something really fancy, you should be able to produce a WAR file, drop it into Tomcat and have a working Java web application with no extra effort. The good thing about this is that you can develop your application in pure Ruby (i.e. you don’t need JRuby) – JRuby only comes into the picture when you need to deploy. I would recommend deploying as frequently as you can though, to make sure you don’t run into any nasty surprises.

If you think this unlikely scenario (using Ruby on Rails with Oracle for dev and deploying it all as a Java app to Tomcat) can never happen, think again. This is all based on “real world” events and no animals were harmed in the making of it :). There are a couple of other interesting things we did in the course of that project, such as serving user-uploaded images from outside the Rails folder structure – I’ll do a quick post about that at a later date, so if you’re interested – stay tuned.
