20 Aug
Grant interviewed Sammy Yu, who worked on the search system for Digg.com, which uses Solr as its platform. Here are some notes from the interview:
Number of Documents in Digg’s index: 13 Million
Index Size (Lucene) on Disk: 8 GB
Architecture: Master-slave setup, with 10 slaves running behind a load balancer with some caching.
Query Volume: 4.8 million queries / day
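To put the query volume in perspective, here is a quick back-of-the-envelope calculation. It assumes (my assumption, not something stated in the interview) that traffic is spread evenly over the day and across all 10 slaves, and that no query is answered from the cache, so real peak numbers will differ.

```python
# Figures from the interview notes above.
QUERIES_PER_DAY = 4_800_000
SLAVES = 10

SECONDS_PER_DAY = 24 * 60 * 60

# Average load, assuming perfectly even traffic and no cache hits
# (both are simplifying assumptions).
avg_qps = QUERIES_PER_DAY / SECONDS_PER_DAY
avg_qps_per_slave = avg_qps / SLAVES

print(f"{avg_qps:.1f} queries/sec overall")          # 55.6 queries/sec overall
print(f"{avg_qps_per_slave:.1f} queries/sec per slave")  # 5.6 queries/sec per slave
```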
17 Jul
Just read an article posted on the Lucene blog - “Lucene and the Corporate Environment”
If the list of companies using Lucene are not “corporate” environments, then I don’t know what corporate means. If by corporate packaging, you mean it has a lot of bloat and charges exorbitant license fees, then no, unfortunately, Lucene is not ready to succeed in the corporate environment. If by corporate environment, it means it is used to save time/money/energy, then Lucene should break out the khakis and button-down shirt and start punching the clock.
Before I go on with this rant, I want to say that I love Lucene/Solr for what they have to offer, and I have been using, customizing, and delivering innovative solutions based on Lucene since 2001. But after spending years with other enterprise search products as well as Lucene/Solr, I agree with the statement that Lucene/Solr still has some way to go before corporate/enterprise adoption is really easy.
One of the things I see in a leading enterprise solution is that it is relatively easy to see the value of the product after installation, and it does not require a bunch of hackers to get it up and running. From where I am looking, most companies deploying a Solr/Lucene-based solution need programmers who understand IR/search on their payroll to get the system running.
In a decent-sized deployment, there needs to be infrastructure work on monitoring, replication, performance, etc., which in other enterprise search products is mostly built in.
For corporations that have data in various silos (DBs, file systems, intranet), Solr/Lucene does not yet provide a full suite of connectors to ingest that data. There are connectors, but again, one has to understand them, their use, and how to integrate them into Solr/Lucene. Good enterprise software solutions provide management interfaces to configure connectors, along with a variety of connector choices (commercial and open source).
Yes, Lucene/Solr is a great platform for companies that want to go above and beyond in delivering value, but having the right expertise in house is key to success. And yes, some work is still needed before Solr/Lucene is an easy deployment for enterprises.
01 Jul
I was looking through Lucene’s source code today (okay, tonight) to find out whether you can give Lucene hints to change the clause precedence during query execution. Unfortunately, I found that Lucene does not let users supply any such hint (I was looking at ConjunctionScorer).
At work, we have a use case where we have knowledge about the data that gets indexed into the system. After query classification and pre-processing, we can use this knowledge to tell Lucene what we think the execution order of the clauses in the query should be. This could drastically improve query performance, mostly for AND queries where, run independently, one clause would return a handful of results and the other would return thousands or even tens of thousands.
In the past, I have worked on IR systems that maintained cardinalities (statistics) of the index, which helped optimize the query and produce the best plan achievable within the given time and resources.
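To show why clause order matters for an AND query, here is a minimal sketch in Python. It is not how Lucene's ConjunctionScorer works; the function `intersect` and the two posting lists are made up for illustration. Each clause is a sorted list of document ids, and the "lead" clause drives the intersection by probing the other list.

```python
from bisect import bisect_left

def intersect(lead, other):
    """Intersect two sorted doc-id lists, iterating over `lead`.

    Returns (matches, probes), where probes is the number of
    binary searches made into `other`.
    """
    matches = []
    probes = 0
    pos = 0
    for doc_id in lead:
        probes += 1
        # Skip forward in `other` to the first id >= doc_id.
        pos = bisect_left(other, doc_id, pos)
        if pos == len(other):
            break  # `other` is exhausted, nothing more can match
        if other[pos] == doc_id:
            matches.append(doc_id)
    return matches, probes

# Hypothetical clauses: one matches a handful of docs, one matches thousands.
rare = [5, 5000, 40000]
common = list(range(0, 50000, 2))

fast_matches, fast_probes = intersect(rare, common)
slow_matches, slow_probes = intersect(common, rare)

print(fast_matches, fast_probes)
print(slow_matches == fast_matches, slow_probes)
```

Both orders return the same documents, but leading with the rare clause needs one probe per rare document, while leading with the common clause walks most of the large list. That difference is what a hint (or index statistics) would let the engine exploit.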
Does anyone know whether Lucene maintains these cardinalities internally? If so, how does it impact the query execution plan?
19 May
Are you doing range searches in Lucene/Solr in your application? If so, you can get a performance boost by using the new TrieRange package.
Here is a ppt that details the capability.
If you want to read more, you can read the article posted by Grant on Lucid’s site.
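As I understand it, the idea behind trie-style range search is to index each number at several precisions, so that a range can be covered by a few coarse terms in the middle and a few exact terms at the edges. The sketch below is a simplified Python illustration of that idea, not the TrieRange code itself; the names `index_terms` and `range_terms`, the 16-bit value size, and the step of 4 bits are all assumptions made for the example.

```python
def index_terms(value, step=4, bits=16):
    """Terms stored for one value: the value with 0, step, 2*step, ...
    low bits shifted off, tagged with the shift."""
    return {(shift, value >> shift) for shift in range(0, bits, step)}

def range_terms(lo, hi, step=4, bits=16):
    """Cover [lo, hi] with as few (shift, prefix) terms as possible."""
    terms = []
    cur = lo
    while cur <= hi:
        shift = 0
        # Grow the block while it stays aligned, inside the range,
        # and within the precisions that index_terms() stores.
        while (shift + step < bits
               and cur % (1 << (shift + step)) == 0
               and cur + (1 << (shift + step)) - 1 <= hi):
            shift += step
        terms.append((shift, cur >> shift))
        cur += 1 << shift
    return terms

lo, hi = 100, 5000
query = set(range_terms(lo, hi))

# A value matches if any of its indexed terms is one of the query terms.
def matches(value):
    return bool(index_terms(value) & query)

print(len(query), "terms instead of", hi - lo + 1)
print(matches(99), matches(100), matches(5000), matches(5001))
```

A plain term-enumerating range query would have to visit every distinct value between the bounds; here the same range is covered by a much smaller set of terms.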
28 Dec
Solr now supports Tika through ExtractingRequestHandler
It is now possible to send any of Tika’s supported document types (MS Office, PDF, XML, HTML, etc.) and have the content extracted and then indexed, all within Solr.
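As a rough sketch of what sending a document looks like, the Python below builds (but does not send) a POST request for the extracting handler. The `/update/extract` path, the host and port, and the document id are assumptions: the path depends on how the handler is registered in solrconfig.xml. `literal.id` and `commit` are the request parameters I would expect to use, and `build_extract_request` is just a helper name for this example.

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_extract_request(solr_base, doc_id, body, content_type):
    """Build a POST request that hands a raw document to Solr for
    extraction and indexing. Nothing is sent over the network here."""
    params = urlencode({"literal.id": doc_id, "commit": "true"})
    url = f"{solr_base.rstrip('/')}/update/extract?{params}"
    return Request(url, data=body,
                   headers={"Content-Type": content_type},
                   method="POST")

# Placeholder bytes stand in for a real PDF file read from disk.
req = build_extract_request("http://localhost:8983/solr", "doc1",
                            b"%PDF-1.4 placeholder", "application/pdf")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would actually submit the document, which of course needs a running Solr instance with the handler enabled.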
A natural extension to a metadata extraction and identification toolkit would be to layer a content analysis framework on top. For some verticals (especially news), there is value in extracting named entities from the content of documents or web pages. These named entities can then be added to Solr, allowing users to slice and dice information by people, companies, places, etc. Once that is in place, Solr becomes a great platform for entrepreneurs to develop applications on without having to worry about entity extraction.
There are already a number of options for extracting entities from text (LingPipe, OpenCalais). The task is to standardize and wrap them in a framework that can be easily plugged into Solr (at least to start with).
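To make the "standardize and wrap" idea a bit more concrete, here is a small Python sketch of what such a plug-in layer could look like. Everything in it is hypothetical: the `EntityExtractor` interface, the toy `GazetteerExtractor` (a real one would wrap something like LingPipe or OpenCalais), and the `person`/`company`/`place` field names are all invented for illustration.

```python
from typing import Dict, List, Protocol

class EntityExtractor(Protocol):
    """Common interface any extraction backend would be wrapped in."""
    def extract(self, text: str) -> Dict[str, List[str]]: ...

class GazetteerExtractor:
    """Toy backend: finds known names from a fixed dictionary."""
    def __init__(self, gazetteer: Dict[str, str]):
        self.gazetteer = gazetteer  # name -> entity type

    def extract(self, text: str) -> Dict[str, List[str]]:
        found: Dict[str, List[str]] = {}
        for name, kind in self.gazetteer.items():
            if name in text:
                found.setdefault(kind, []).append(name)
        return found

def enrich(doc: dict, extractor: EntityExtractor, text_field: str = "content") -> dict:
    """Return a copy of the document with one multi-valued field per
    entity type, ready to be indexed and faceted on."""
    enriched = dict(doc)
    for kind, names in extractor.extract(doc[text_field]).items():
        enriched[kind] = sorted(names)
    return enriched

extractor = GazetteerExtractor({
    "Grant Ingersoll": "person",
    "Digg": "company",
    "Boston": "place",
})
doc = {"id": "1", "content": "Grant Ingersoll interviewed an engineer from Digg."}
print(enrich(doc, extractor))
```

The point of the interface is that the indexing side only sees `extract()`, so swapping the backend does not touch the code that builds the Solr document.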
Grant, is someone already working on it? Any plans in the pipeline?
06 Nov
Grant Ingersoll has published a new article on IBM developerWorks that talks about new features in Solr 1.3:
- “Did you mean” Spellchecking
- Finding similar pages (More like this)
- Editorial results placement - Ability to specify that a particular document (or documents) appear at a particular place in the search results.
- Distributed Search - Solr adds distributed search capabilities that let the index scale by splitting the documents across several machines (shards)
- Performance Gains - ~5x improvement in indexing speed
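The distributed search item above boils down to "split the documents, ask every shard, merge the answers". Here is a minimal in-process Python sketch of that flow. It is not Solr's implementation: the hash-based routing, the toy term-count score, and all function names are assumptions for illustration. In Solr, as far as I know, the list of shards to query is passed with the request through the `shards` parameter.

```python
import heapq
import zlib

def shard_for(doc_id, num_shards):
    """Route a document to a shard by a stable hash of its id."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

def build_shards(docs, num_shards):
    shards = [dict() for _ in range(num_shards)]
    for doc_id, text in docs.items():
        shards[shard_for(doc_id, num_shards)][doc_id] = text
    return shards

def search_shard(shard, term, k):
    """Top-k (score, doc_id) pairs from one shard; score = term count."""
    scored = ((text.lower().split().count(term), doc_id)
              for doc_id, text in shard.items())
    return heapq.nlargest(k, (hit for hit in scored if hit[0] > 0))

def distributed_search(shards, term, k):
    """Ask every shard for its top k, then merge into a global top k."""
    partial = [hit for shard in shards for hit in search_shard(shard, term, k)]
    return heapq.nlargest(k, partial)

docs = {
    "a": "solr adds distributed search",
    "b": "lucene powers solr and solr builds on lucene",
    "c": "digg runs solr behind a load balancer",
    "d": "range queries in lucene",
    "e": "solr solr solr",
}
shards = build_shards(docs, 3)
print(distributed_search(shards, "solr", 2))
```

Because the toy score depends only on the document itself, merging per-shard results gives the same answer as searching one big index. Real relevance scores use index-wide statistics, so shard-local scores are not always directly comparable.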
Solr is maturing as it adds the features and functionality that modern information access applications need. I have spent a lot of time integrating and enhancing product offerings that used Lucene as the underlying engine to provide search. Now, it’s great to see the Solr project moving forward with some thrust.
Companies that use Lucene in their applications and products should definitely start evaluating Solr and how they can take advantage of it.