Productification

Intersection between Technology, User Experience and Product Innovation


17 Jul

Lucene / Solr still needs hackers to get up and running


Just read an article posted on the Lucene blog - “Lucene and the Corporate Environment”.

If the list of companies using Lucene are not “corporate” environments, then I don’t know what corporate means. If by corporate packaging, you mean it has a lot of bloat and charges exorbitant license fees, then no, unfortunately, Lucene is not ready to succeed in the corporate environment. If by corporate environment, it means it is used to save time/money/energy, then Lucene should break out the khakis and button-down shirt and start punching the clock.

Before I go on with this rant, I want to say that I love Lucene/Solr for what they have to offer, and I have been using, customizing, and delivering innovative solutions based on Lucene since 2001. But after spending years with other enterprise search products as well as Lucene/Solr, I agree with the statement that Lucene/Solr still has some way to go before corporate/enterprise adoption is really easy.

One of the things I see in a leading enterprise solution is that it is relatively easy to see the value of the product right after installation, and it does not require a bunch of hackers to get it up and running. From where I am looking, most companies deploying a Solr/Lucene based solution need programmers who understand IR/search on their payroll to get the system running.

In a decent-sized deployment, there is infrastructure work to do on monitoring, replication, performance, etc., which in other enterprise search products is mostly built in.

For corporations that have data in various silos (DBs, file systems, intranet), Solr/Lucene does not yet provide a full suite of connectors to ingest that data. There are connectors, but again, one has to understand them, how to use them, and how to integrate them into Solr/Lucene. Good enterprise software solutions provide management interfaces to configure connectors, along with a variety of connector choices (commercial and open source).

Yes, Lucene/Solr is a great platform for companies that want to go above and beyond in delivering value, but having the right expertise in house is key to success. And yes, some work is still needed before Solr/Lucene is an easy deployment for enterprises.


01 Jul

How to influence the query plan in Lucene / Solr?


I was looking through Lucene’s source code today (okay - night) to find out whether you can provide hints to Lucene to change the clause precedence during query execution. Unfortunately, I found that Lucene does not let users supply any such hint (I was looking at ConjunctionScorer).

At work, we have a use case where we have knowledge about the data that gets indexed into the system. After query classification and pre-processing, we can use this knowledge to tell Lucene what we think the execution order of the clauses in the query should be. This could drastically improve query performance, mostly for AND queries where, when run independently, one clause would return a handful of results and the other would return thousands or even tens of thousands. Driving the intersection from the small clause means only a handful of candidates ever need to be checked against the large one.

In the past, I have worked on IR systems that maintain cardinalities (statistics) of the index, which help optimize the query and produce the best plan possible within the given time and resources.
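To make the cost argument concrete, here is a minimal sketch in Python of a cardinality-ordered AND. It is a toy model, not Lucene's ConjunctionScorer: the posting lists, the clause names, and the helpers `intersect` and `run_conjunction` are all hypothetical, and list length stands in for the cardinality statistic.

```python
from bisect import bisect_left


def intersect(lead, other):
    """Intersect two sorted doc-id lists, driving from `lead`.

    Returns the matching ids and the number of binary-search probes made
    into `other`, used here as a rough stand-in for query cost.
    """
    matches, probes, start = [], 0, 0
    for doc_id in lead:
        # Skip ahead in `other` to the first id >= doc_id.
        start = bisect_left(other, doc_id, start)
        probes += 1
        if start == len(other):
            break  # `other` is exhausted, nothing further can match
        if other[start] == doc_id:
            matches.append(doc_id)
    return matches, probes


def run_conjunction(postings_by_clause):
    """AND all clauses together, lowest-cardinality clause first."""
    ordered = sorted(postings_by_clause.values(), key=len)
    result, total_probes = ordered[0], 0
    for postings in ordered[1:]:
        result, probes = intersect(result, postings)
        total_probes += probes
    return result, total_probes


# Hypothetical index: one rare clause and one very common clause.
postings = {
    "sku:AB-1234": [7, 5000, 9001],           # a handful of hits
    "status:active": list(range(0, 10000)),   # ten thousand hits
}
hits, probes = run_conjunction(postings)
```

With the rare clause leading, the intersection needs one probe per rare hit. Calling `intersect` with the arguments swapped returns the same hits but probes once for nearly every document in the common clause, which is the gap a user-supplied hint or an index statistic would let the engine avoid.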

Does anyone know whether Lucene maintains these cardinalities internally? If so, how does it impact the query execution plan?


2 Responses Filed under: Development, Search
19 May

Solr/Lucene Feature Alert: TrieRange Capabilities


Are you doing range searches in Lucene / Solr in your application? If so, you can get a performance boost by using the new TrieRange package.
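The idea behind trie-encoded ranges is to index each numeric value at several precisions, so that a wide range can be matched with a small number of coarse terms plus a few exact ones at the edges. The following Python sketch illustrates that idea only; it is not the TrieRange API, and `trie_terms`, `cover_range`, and the 4-bit precision step are assumptions made for the example.

```python
def trie_terms(value, levels):
    """Terms indexed for one value: the value truncated at each precision level."""
    return {(shift, value >> shift) for shift in levels}


def cover_range(lo, hi, levels):
    """Cover the inclusive range [lo, hi] with aligned prefix terms.

    Greedy: at each position take the largest aligned block that still fits.
    `levels` must include 0 so a single value can always be covered.
    """
    terms, v = [], lo
    while v <= hi:
        for shift in sorted(levels, reverse=True):
            size = 1 << shift
            if v % size == 0 and v + size - 1 <= hi:
                terms.append((shift, v >> shift))
                v += size
                break
    return terms


LEVELS = (0, 4, 8)  # hypothetical precision step of 4 bits
query_terms = set(cover_range(3, 1000, LEVELS))


def matches(value):
    """A document matches if any of its indexed terms is in the query cover."""
    return bool(trie_terms(value, LEVELS) & query_terms)
```

Here the range 3 to 1000 is covered by a few dozen terms rather than 998 individual values, which is where the speedup over a naive term-by-term range scan comes from.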

Here is a ppt that details the capability.

If you want to read more, you can read the article posted by Grant on Lucid’s site.


No Response Filed under: Search
16 Jan

Providing Search in Django based applications


I have seen many Django based applications that do not provide intuitive and powerful search capabilities to their users. If you pick up a product created in Django and try to do a search, you will be disappointed by how primitive the search is. No spelling correction, no fuzzy searching, and no complex multi-field searches either. Oh .. and moreover, don’t even think about relevance based search.

I have created some applications, but I have mostly written one-off custom code to integrate best-of-breed open source engines like Solr and Lucene into my web applications. While doing that, it got me thinking -

What options does a developer have to provide search in a Django based application?

  1. Use “contains” via the QuerySet API
  2. Use django-sphinx
  3. Use the django-search-lucene app

The first approach is by far the most commonly used in Django applications. It makes use of the underlying LIKE operator of the database. The problem with this approach is that it’s too primitive: no relevancy, no complex constraints, and it won’t work for a multi-word query where the two words don’t appear together.
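The multi-word limitation is easy to reproduce without Django. The sketch below uses Python's built-in sqlite3 module to run the kind of `LIKE '%...%'` query that a `contains` lookup boils down to; the `article` table, its rows, and the `like_search` helper are made up for the illustration.

```python
import sqlite3


def like_search(conn, query):
    """Substring match on title, the SQL shape behind a `contains` lookup."""
    pattern = f"%{query}%"
    rows = conn.execute(
        "SELECT title FROM article WHERE title LIKE ? ORDER BY id", (pattern,)
    )
    return [row[0] for row in rows]


# Hypothetical data set held in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO article (title) VALUES (?)",
    [
        ("Django search with Lucene",),
        ("Lucene internals",),
        ("Search tips for Django",),
    ],
)

adjacent = like_search(conn, "Django search")   # words appear together, in order
reordered = like_search(conn, "search Django")  # same words, different order
```

`adjacent` finds only the title where the two words sit next to each other, and `reordered` finds nothing, even though two titles contain both words. The results also come back in id order, with no notion of which match is more relevant.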

The second option is to use the django-sphinx project. Sphinx is an open source search engine that was primarily written to integrate well with databases (SQL focused). Though it seems to be gaining some momentum, I have found Lucene much better as a really powerful and featureful search engine. Also, the integration with Django requires you to install Sphinx and set it up as a separate server, which is more than you are usually looking for.

Lastly, there is an app called django-search-lucene that provides Lucene and Django integration using PyLucene (a Python extension that wraps Java Lucene rather than a pure-Python port). The application provides easy integration with the Django ORM and simple APIs to perform search. Moreover, for power users, it exposes an API through which you can fire native Lucene queries. It also exposes some basic status reporting in the Django admin, which is helpful in monitoring the index and searches.

I am also noticing a flurry of activity in the last 48 hours on the project and am curious to know what new additions are being made.

The next time you create a Django based application, look at the kind of search you are providing to your user base and see if you can use one of the two options above (2 or 3) to enhance their experience. Also, do tell me what your experience was - I always look forward to hearing that.


1 Response Filed under: Django, Search
06 Nov

Solr 1.3 comes with Search enhancements


Grant Ingersoll has published a new article on IBM developerWorks that talks about new features in Solr 1.3:

  1. “Did you mean” Spellchecking
  2. Finding similar pages (More like this)
  3. Editorial results placement - Ability to specify that a particular document (or documents) appear at a particular place in the search results.
  4. Distributed Search - Solr adds distributed search capabilities that scale the index size by splitting the documents across several machines (shards)
  5. Performance Gains - ~5x improvement in indexing speed
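The distributed search item is the one that changes how a query actually runs, so here is a toy Python sketch of the general scatter-gather pattern: split documents across shards, ask each shard for its own top results, then merge. This is not Solr's implementation; the modulo sharding, the term-count scoring, and all function names are assumptions made for the illustration.

```python
import heapq


def shard_for(doc_id, num_shards):
    """Assign a document to a shard (simple modulo on a numeric id)."""
    return doc_id % num_shards


def search_shard(shard, term, k):
    """Score docs in one shard by raw term count and return its top k."""
    scored = [(text.lower().split().count(term), doc_id)
              for doc_id, text in shard.items()]
    return heapq.nlargest(k, [hit for hit in scored if hit[0] > 0])


def distributed_search(shards, term, k):
    """Ask every shard for its top k, then merge into a global top k."""
    partial = [hit for shard in shards for hit in search_shard(shard, term, k)]
    return [doc_id for _, doc_id in heapq.nlargest(k, partial)]


# Hypothetical corpus split across two shards.
docs = {
    1: "solr solr search",
    2: "lucene index",
    3: "solr shards",
    4: "solr solr solr",
}
shards = [{}, {}]
for doc_id, text in docs.items():
    shards[shard_for(doc_id, 2)][doc_id] = text

top_two = distributed_search(shards, "solr", 2)
```

Each shard only has to hold and scan its own slice of the documents, and the coordinator only merges short per-shard result lists, which is why the total index can grow by adding machines.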

Solr is maturing as it adds the features and functionality that modern information access applications need. I have spent a lot of time integrating and enhancing product offerings that used Lucene as the underlying engine to provide search. Now, it’s great to see the Solr project moving forward with some thrust.

Companies that use Lucene in their applications and products should definitely start evaluating Solr and how they can take advantage of it.