Development

How to influence the query plan in Lucene / Solr?

I was looking through Lucene’s source code today (okay - night) to find out whether you can give Lucene hints that change the clause precedence during query execution. Unfortunately, I found that Lucene does not let users supply any such hint (I was looking at ConjunctionScorer).

At work, we have a use case where we know a good deal about the data that gets indexed into the system. After query classification and pre-processing, we can use this knowledge to tell Lucene the order in which we think the clauses of a query should be executed. This could drastically improve query performance, mostly for AND queries in which one clause, run on its own, would return a handful of results and the other would return thousands or even tens of thousands.
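To make the idea concrete, here is a minimal sketch of the kind of hint I have in mind. It is not something Lucene exposes: the `Clause` class, its `estimatedHits` field and the numbers used with it are hypothetical stand-ins for whatever the classification step would supply. All it does is order the clauses of an AND query so that the most selective one comes first.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ClauseOrdering {
    /** Hypothetical clause descriptor: the query text plus an estimated hit count. */
    static final class Clause {
        final String text;
        final long estimatedHits;

        Clause(String text, long estimatedHits) {
            this.text = text;
            this.estimatedHits = estimatedHits;
        }
    }

    /** Returns a sorted copy, fewest estimated hits first; the input list is left untouched. */
    static List<Clause> mostSelectiveFirst(List<Clause> clauses) {
        List<Clause> ordered = new ArrayList<>(clauses);
        ordered.sort(Comparator.comparingLong(c -> c.estimatedHits));
        return ordered;
    }
}
```

Given a clause such as content:planet with an estimate in the tens of thousands and a clause such as author:Asimov with an estimate of a dozen, `mostSelectiveFirst` would put the author clause at the front of the plan.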

In the past, I have worked on IR systems that maintained cardinalities (statistics) for the index, which helped the optimizer produce the best plan it could within the given time and resources.

Does anyone know whether Lucene maintains these cardinalities internally? If so, how does it impact the query execution plan?

Discussion

2 comments for “How to influence the query plan in Lucene / Solr?”

  1. Have you tried asking about this on the java-user@lucene mailing list?

    I’m not a low-level internals guy, so I’m not great at explaining the nitty-gritty details, but ConjunctionScorer evaluates the subclauses in parallel using skip lists. ConjunctionScorer asks each clause for the ‘first’ doc (in docId order) that matches; whatever the *highest* docId returned is, ConjunctionScorer then tells all of the other clauses to “skipTo” that docId, and it keeps doing that until all of the subclauses agree that the “next” docId they match on is the same.

    So when you query for “+author:Asimov +content:planet” *you* don’t need to tell ConjunctionScorer that author:Asimov is the most restrictive clause in that query so it should check it first … the “author:Asimov” clause indicates directly that it is more restrictive.

    But like I said, I’m not an internals guy; if you ask about this on the list you can get a much better explanation.

    Posted by Hoss | July 1, 2009, 9:18 am
  2. Hoss, I tried doing some searches on the Lucene mailing list, but could not find what I was looking for.

    I saw the docId comparison that they were doing in the code, but I am not entirely clear on how much work it saves (in some cases, it clearly does).

    The documents (ids) matching author:Asimov can be all over the place and are not necessarily grouped together. Are they?

    Posted by Sameer | July 1, 2009, 2:02 pm
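The leapfrogging described in the first comment can be sketched roughly as follows. This is a simplified illustration and not Lucene's actual ConjunctionScorer: the `Cursor` class and `intersect` method are hypothetical, each clause is modelled as a sorted array of unique doc IDs, and a binary search stands in for the skip list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LeapfrogSketch {
    /** Hypothetical cursor over one clause's matching doc IDs (sorted ascending, unique). */
    static final class Cursor {
        static final int NO_MORE_DOCS = Integer.MAX_VALUE;
        private final int[] docs;
        private int pos = 0;

        Cursor(int[] sortedDocIds) {
            this.docs = sortedDocIds;
        }

        int doc() {
            return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        }

        /** Moves forward to the first doc ID >= target and returns it. */
        int skipTo(int target) {
            int i = Arrays.binarySearch(docs, pos, docs.length, target);
            pos = i >= 0 ? i : -i - 1;
            return doc();
        }
    }

    /** Returns the doc IDs matched by every clause. */
    static List<Integer> intersect(Cursor... clauses) {
        List<Integer> hits = new ArrayList<>();
        if (clauses.length == 0) {
            return hits;
        }
        int candidate = 0;
        while (true) {
            boolean allAgree = true;
            for (Cursor c : clauses) {
                int doc = c.skipTo(candidate);
                if (doc == Cursor.NO_MORE_DOCS) {
                    return hits; // one clause is exhausted, so no further matches exist
                }
                if (doc > candidate) {
                    candidate = doc; // a clause jumped ahead: the others must catch up
                    allAgree = false;
                }
            }
            if (allAgree) {
                hits.add(candidate);
                candidate++;
            }
        }
    }
}
```

In this sketch, the doc IDs of the restrictive clause do not have to be grouped together for skipping to pay off. Each time the sparse clause jumps ahead, the dense clause is asked to skip straight to that doc ID, so the entries in between are never visited one by one.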
