Productification

Intersection between Technology, User Experience and Product Innovation


28 Dec

Solr adds Tika support - Entity Extraction next?


Solr now supports Tika through the ExtractingRequestHandler.

It is now possible to send any of Tika’s supported document types (MS Office, PDF, XML, HTML, etc.) and have the content extracted and then indexed, all within Solr.
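As a rough illustration of what sending a document looks like, here is a minimal Python sketch that builds the HTTP request for the handler. It assumes the handler is mapped to `/update/extract` and that Solr runs at `http://localhost:8983/solr`; both depend on your own setup, and the helper name `build_extract_request` is made up for this example.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_request(solr_base, doc_id, data, content_type):
    """Build a POST request that hands raw document bytes to Solr.

    Assumption: the ExtractingRequestHandler is mapped to /update/extract.
    literal.id sets the document's id field; commit=true makes it searchable.
    """
    params = urlencode({"literal.id": doc_id, "commit": "true"})
    url = f"{solr_base.rstrip('/')}/update/extract?{params}"
    return Request(
        url,
        data=data,
        headers={"Content-Type": content_type},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical file name and Solr location; adjust to your installation.
    with open("report.pdf", "rb") as handle:
        request = build_extract_request(
            "http://localhost:8983/solr", "doc1", handle.read(), "application/pdf"
        )
    with urlopen(request) as response:
        print(response.status)
```

Tika does the parsing on the Solr side, so the client only has to ship the bytes and say which id the resulting document should get.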

A natural extension to this metadata extraction and identification toolkit would be to layer a content analysis framework on top. For some verticals (especially news), there is value in extracting named entities from the content of documents or web pages. These named entities can then be added to the Solr index, allowing users to slice and dice information by people, companies, places, etc. Once that is in place, Solr becomes a great platform for entrepreneurs to build applications on, without having to worry about entity extraction themselves.

There are already a number of options for extracting entities from text (LingPipe, OpenCalais). The task is to standardize them and wrap them in a framework that can be easily plugged into Solr (at least to start with).
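To make the idea of a common wrapper concrete, here is a small Python sketch of what such a framework might look like. Everything in it is hypothetical: the `EntityExtractor` interface, the toy `GazetteerExtractor` (a plain dictionary lookup standing in for a real engine such as LingPipe or OpenCalais, whose APIs are not used here), and the field names `person`, `company` and `place`.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol


@dataclass(frozen=True)
class Entity:
    kind: str  # e.g. "person", "company", "place" (hypothetical field names)
    text: str


class EntityExtractor(Protocol):
    """Common interface that each wrapped extraction engine would implement."""

    def extract(self, text: str) -> List[Entity]: ...


class GazetteerExtractor:
    """Toy stand-in for a real engine: looks up known names in the text."""

    def __init__(self, gazetteer: Dict[str, str]):
        self.gazetteer = gazetteer  # maps entity name -> entity kind

    def extract(self, text: str) -> List[Entity]:
        return [
            Entity(kind, name)
            for name, kind in self.gazetteer.items()
            if name in text
        ]


def enrich(doc: dict, extractors: List[EntityExtractor]) -> dict:
    """Return a copy of doc with one multi-valued field per entity kind."""
    enriched = dict(doc)
    for extractor in extractors:
        for entity in extractor.extract(doc["text"]):
            values = enriched.setdefault(entity.kind, [])
            if entity.text not in values:  # avoid duplicates across engines
                values.append(entity.text)
    return enriched
```

The enriched dictionary could then be indexed like any other document, and the entity fields used for faceting by person, company or place. Because every engine sits behind the same `extract` method, swapping one for another would not touch the indexing code.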

Grant, is someone already working on it? Any plans in the pipeline?