This post was written for the YunoJuno tech blog, where I’m currently freelancing, and was first posted there.
ElasticSearch is a powerful, easy-to-install, easy-to-use, low-config, clustered search engine based on Lucene, and is a popular choice for all sorts of applications.
In the Django world, Haystack seems to be the de facto way to implement search outside the ORM – and it certainly seems the fastest to implement.
There are a lot of good reasons to use Haystack: it’s very easy to get up and running, it’s super-simple to configure based on your existing models, it comes with a great SearchQuerySet that lets you chain filters and use Django ORM-like syntax to create queries that lazy load, and it’s portable between ElasticSearch, Solr, Xapian and Whoosh.
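To illustrate the chaining style mentioned above, here's a sketch of a SearchQuerySet query (it assumes a configured Haystack project; the `Note` model and `myapp` module are hypothetical):

```python
# Sketch of Haystack's SearchQuerySet chaining -- the imports are
# deferred because this only runs inside a configured Django project.
def recent_matching_notes(term):
    from haystack.query import SearchQuerySet  # requires Haystack setup
    from myapp.models import Note              # hypothetical model

    # Filters chain ORM-style, and the query is lazily evaluated:
    # nothing hits the search backend until the results are sliced
    # or iterated.
    return (SearchQuerySet()
            .models(Note)
            .filter(content=term)
            .order_by("-pub_date")[:10])
```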
Although I’d implemented Lucene some years ago and knew a little about search in general, I’d never used either ES or Haystack before I was asked to improve the existing YunoJuno search functionality. One of the ideas I wanted to implement was a tag search, where users type into a text box, matching tags appear as autocomplete suggestions, and the selected tags are then used to filter the results.
To do this I realised I’d need an autocomplete-type nGram search that’s superfast for the input typeahead, and also some fancy tokenising to handle multiple-word tags and a synonym feature to keep the total number of tags low.
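To make that concrete, here's a sketch of the kind of index settings this calls for: an edge-nGram analyser for the typeahead, plus a synonym filter to collapse equivalent tags. All names and synonym entries are illustrative, and the filter type is `edge_ngram` in current ES versions (older releases spelled it `edgeNGram`):

```python
# Sketch of ES index settings for tag autocomplete. The index-time
# analyser emits prefix grams ("py", "pyt", ...); the search-time
# analyser deliberately does not, or every two-letter query would
# match everything.
TAG_INDEX_SETTINGS = {
    "settings": {
        "analysis": {
            "filter": {
                "tag_edge_ngram": {
                    "type": "edge_ngram",
                    "min_gram": 2,
                    "max_gram": 15,
                },
                "tag_synonyms": {
                    "type": "synonym",
                    "synonyms": [
                        "js, javascript",       # illustrative entries
                        "postgres, postgresql",
                    ],
                },
            },
            "analyzer": {
                # applied at index time: lowercase, fold synonyms,
                # then emit prefix grams for the typeahead
                "tag_autocomplete": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "tag_synonyms", "tag_edge_ngram"],
                },
                # applied at query time: same normalisation, no nGrams
                "tag_search": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "tag_synonyms"],
                },
            },
        }
    }
}
```

The asymmetry between the index-time and search-time analysers is the whole trick: grams are generated once, at index time, and queries match against them directly.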
It turns out that, in exchange for making a powerful search very easy to add to your Django project, Haystack exacts a price by putting a fairly low ceiling on what you can achieve with the search. In order to preserve portability, some advanced ES features such as aliases and warming aren’t supported in Haystack.
We can live with this, but I was a little disappointed by a couple of the other design decisions Haystack makes (presumably for the same reasons). It can only run queries against a single field on the document, which you implement as a kind of munged data catchall, using Django’s template language to combine all the model fields you want indexed. It also puts all your models into a single ‘type’ under the index of your choice, so if you want to split them out you have to split them into new indices. Both decisions mean you end up with an artificially large number of artificially large indices.
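For anyone who hasn't peered inside a Haystack index, the mapping it generates looks roughly like this (field analyzers simplified; the `modelresult` type and the `django_ct`/`django_id` bookkeeping fields are Haystack's, the rest is illustrative):

```python
# Roughly the shape of the ES mapping Haystack creates: every indexed
# model lands in one document type, and all searchable content is
# munged into a single catchall "text" field.
HAYSTACK_STYLE_MAPPING = {
    "modelresult": {
        "properties": {
            # which Django model and primary key this document came from
            "django_ct": {"type": "string", "index": "not_analyzed"},
            "django_id": {"type": "string", "index": "not_analyzed"},
            # the catchall field, rendered from a Django template that
            # concatenates whichever model fields you want searchable
            "text": {"type": "string", "analyzer": "snowball"},
        }
    }
}
```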
Worse still, Haystack comes with no out-of-the-box way to change the analysers or tokenisers you use: it just lets you pick between snowball (a great English-focused text analyser) and nGrams, which didn’t even work as I expected.
That was when the real fun began. Nothing seemed to work. Sure that it was my inexperience, I spent a couple of days playing around with analysers and tokenisers, validating them all through the ElasticSearch analyse and mapping APIs.
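The analyse API is the key debugging tool here: you hand it some text and an analyser name, and it returns the tokens that analyser actually emits. A sketch using elasticsearch-py (it assumes a local node and an index created with a custom analyser named `tag_autocomplete` – both illustrative):

```python
# Validating a custom analyser via the _analyze API: if the tokens
# that come back aren't what you expect, your queries never had a
# chance.
ANALYZE_REQUEST = {
    "analyzer": "tag_autocomplete",   # illustrative analyser name
    "text": "JavaScript developer",
}

def analyze_tokens(index, request):
    """Return the token strings the named analyser emits for the text."""
    from elasticsearch import Elasticsearch  # elasticsearch-py

    es = Elasticsearch()  # localhost:9200 by default
    response = es.indices.analyze(index=index, body=request)
    return [token["token"] for token in response["tokens"]]

if __name__ == "__main__":
    # e.g. an edge-nGram analyser should emit prefixes like
    # "ja", "jav", "java", ... rather than whole words
    print(analyze_tokens("tags", ANALYZE_REQUEST))
```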
The truth was slowly dawning on me: although Haystack was accepting my custom analysers, and ES was being told about them when I indexed my files, the analysers simply weren’t being applied to the fields as I indexed them, only to the queries I ran. This means that, at best, my custom analysis was being ignored and, at worst, it was being applied to only one side of the equation, giving me bogus – or no – results.
Which isn’t to say that elasticstack – a third-party package that extends Haystack’s ElasticSearch backend with configurable analysis settings – doesn’t work: if all you want is synonyms it should do the job really well. Just don’t expect anything really complex from Haystack.
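For reference, elasticstack is driven from Django settings along these lines (setting names as I recall them from its README – treat them as an assumption and check the current docs):

```python
# Django settings sketch for elasticstack: swap the analyser Haystack
# uses and supply custom index settings -- synonyms included -- without
# leaving Haystack. Analyser and filter names are illustrative.
ELASTICSEARCH_DEFAULT_ANALYZER = "synonym_analyzer"

ELASTICSEARCH_INDEX_SETTINGS = {
    "settings": {
        "analysis": {
            "filter": {
                "tag_synonyms": {
                    "type": "synonym",
                    "synonyms": ["js, javascript"],  # illustrative
                }
            },
            "analyzer": {
                "synonym_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "tag_synonyms"],
                }
            },
        }
    }
}
```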
So now we’re starting the process of communicating directly with elasticsearch-py, the official ES Python library. We’re writing a lightweight wrapper that favours ES configuration and style over Django’s, giving us full access to all ES functionality at the cost of portability.
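A hypothetical sketch of the shape such a wrapper might take – not our actual code, just the idea: it owns the client and the index settings, and exposes the handful of operations the codebase needs, in ES terms rather than ORM terms:

```python
# Sketch of a thin elasticsearch-py wrapper for one index. The class,
# method names and arguments are illustrative; the point is that raw
# ES settings and query DSL pass straight through.
class SearchClient:
    """Thin wrapper around elasticsearch-py for a single index."""

    def __init__(self, index, settings, hosts=None):
        from elasticsearch import Elasticsearch  # deferred: optional dep
        self.index = index
        self.settings = settings
        self.es = Elasticsearch(hosts)

    def rebuild(self):
        """Drop and recreate the index with our custom analysis settings."""
        self.es.indices.delete(index=self.index, ignore=[404])
        self.es.indices.create(index=self.index, body=self.settings)

    def index_doc(self, doc_type, doc_id, document):
        """Index (or reindex) a single document."""
        self.es.index(index=self.index, doc_type=doc_type,
                      id=doc_id, body=document)

    def search(self, query):
        """Run a raw ES query body -- the full query DSL, no abstraction."""
        return self.es.search(index=self.index, body=query)
```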
Luckily our codebase only touches ES in a couple of places…