Productification

Intersection between Technology, User Experience and Product Innovation

28 Dec

Solr adds Tika support - Entity Extraction next?


Solr now supports Tika through the ExtractingRequestHandler.

It is now possible to send any of Tika’s supported document types (MS Office, PDF, XML, HTML, etc.) and have the content extracted and then indexed, all within Solr.
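
As a rough illustration of what "sending a document to Solr" could look like from a client, here is a minimal Python sketch that only builds the HTTP request. The base URL, the `/update/extract` path (the handler's path depends on how it is registered in solrconfig.xml) and the document id are assumptions for the example; `literal.id` and `commit` are request parameters the handler accepts.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_extract_request(solr_url, doc_id, payload, content_type="application/pdf"):
    """Build (but do not send) a POST request for the extracting handler.

    solr_url and the /update/extract path are assumptions; adjust them to
    match the handler registration in your own solrconfig.xml.
    """
    params = urlencode({"literal.id": doc_id, "commit": "true"})
    return Request(
        url=f"{solr_url}/update/extract?{params}",
        data=payload,  # raw bytes of the PDF, Word file, HTML page, ...
        headers={"Content-Type": content_type},
        method="POST",
    )


# Hypothetical local instance and placeholder bytes standing in for a real file.
req = build_extract_request("http://localhost:8983/solr", "doc1", b"%PDF-1.4 ...")

# To actually send it (needs a running Solr):
# from urllib.request import urlopen
# with urlopen(req) as resp:
#     print(resp.status)
```

The request is built separately from sending so the example can be inspected without a running server.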

A natural extension to this metadata extraction and identification toolkit would be to layer a content analysis framework on top. For some verticals (especially news), there is value in extracting named entities from the content of documents or web pages. These named entities can then be added to Solr, allowing users to slice and dice information by People, Company, Places, etc. Once that is in place, Solr becomes a great platform for entrepreneurs to build applications on without having to worry about entity extraction.
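
To make the "slice and dice" idea concrete, here is a small Python sketch of how extracted entities might be mapped onto multi-valued fields and then faceted on. The field names (`people`, `companies`, `places`) and the entity type labels are hypothetical and would have to exist in your own schema; `facet` and `facet.field` are standard Solr query parameters.

```python
from urllib.parse import urlencode

# Hypothetical mapping from an extractor's entity types to Solr field names.
FIELD_FOR_TYPE = {"PERSON": "people", "ORG": "companies", "LOCATION": "places"}


def to_solr_doc(doc_id, text, entities):
    """Turn (type, name) pairs into multi-valued fields on a Solr document."""
    doc = {"id": doc_id, "text": text}
    for etype, name in entities:
        field = FIELD_FOR_TYPE.get(etype)
        if field is None:
            continue  # entity type we do not index
        values = doc.setdefault(field, [])
        if name not in values:  # keep each entity once per document
            values.append(name)
    return doc


def facet_query(q, fields):
    """Build a query string that asks Solr for facet counts on the given fields."""
    params = [("q", q), ("facet", "true")]
    params += [("facet.field", f) for f in fields]
    return urlencode(params)


doc = to_solr_doc(
    "news-1",
    "Grant Ingersoll spoke about Apache Solr.",
    [("PERSON", "Grant Ingersoll"), ("ORG", "Apache"), ("PERSON", "Grant Ingersoll")],
)
query = facet_query("solr", ["people", "companies", "places"])
```

With documents shaped like this, a facet request returns counts per person, company or place, which is what a "browse by entity" interface would be built on.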

There are already a number of options for extracting entities from text (LingPipe, OpenCalais). The task is to standardize them and wrap them in a framework that can be easily plugged into Solr (at least to start with).
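
One way to picture such a wrapper is a single small interface that every extraction backend implements. The Python sketch below is only an assumption about what that could look like: `Entity`, `EntityExtractor` and `GazetteerExtractor` are hypothetical names, and the toy dictionary lookup stands in for a real adapter around LingPipe or OpenCalais, whose actual APIs are not shown here.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Entity:
    type: str   # e.g. "PERSON", "ORG"
    text: str   # surface form found in the document
    start: int  # character offset where the mention starts
    end: int    # character offset just past the mention


class EntityExtractor(ABC):
    """Common interface each backend adapter would implement."""

    @abstractmethod
    def extract(self, text: str) -> List[Entity]:
        ...


class GazetteerExtractor(EntityExtractor):
    """Toy backend: exact string lookup against a name -> type dictionary."""

    def __init__(self, gazetteer: Dict[str, str]):
        self.gazetteer = gazetteer

    def extract(self, text: str) -> List[Entity]:
        found = []
        for name, etype in self.gazetteer.items():
            start = text.find(name)
            while start != -1:
                found.append(Entity(etype, name, start, start + len(name)))
                start = text.find(name, start + 1)
        return sorted(found, key=lambda e: e.start)


extractor = GazetteerExtractor({"Solr": "PRODUCT", "Grant": "PERSON"})
entities = extractor.extract("Grant works on Solr")
```

The indexing side would depend only on `EntityExtractor`, so swapping one backend for another would not touch the code that feeds Solr.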

Grant, is someone already working on it? Any plans in the pipeline?



2 Responses to “Solr adds Tika support - Entity Extraction next?”

  1. By Sameer on Jan 6, 2009

    Grant responded to my comment on his blog. Here is the discussion:

    Tom Morton and I have written on this in “Taming Text” (http://www.manning.com/ingersoll). The associated code has integration between Solr and OpenNLP, which can do Named Entity Recognition. That’s a starting point. You could also easily plug in other algorithms, I think, but I don’t know if anyone is currently offering that in Solr.

  2. By Julien Nioche on Feb 22, 2010

    Pushing the output of a NER system into Solr would be quite straightforward. One could, for instance, write a plugin for GATE or UIMA that posts an annotated document to a Solr instance, instead of doing the extraction from within Solr.
    That’s exactly the sort of thing that Behemoth (http://code.google.com/p/behemoth-pebble/) can help with.
