I have been working on several large entity extraction projects in the last few months, and I have been very impressed by the scope and depth of some of the new open-source entity extraction tools, as well as the robustness of the commercial products. I thought I would discuss this, since these technologies could start to move the semantic web (Web 3.0) up the hockey-stick growth curve.
If you are not familiar with Entity Extraction (EE), think of it as pulling the most relevant nouns out of a text document. EE technologies go beyond simple keyword matching: they analyze the syntax of each sentence to differentiate nouns from verbs and locate the most critical entities in each document.
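To make the idea concrete, here is a toy sketch of entity spotting. It only uses a naive capitalization heuristic, nothing like the syntactic analysis a real EE engine performs, but it shows the basic shape of the task: scan sentences and pull out candidate entity phrases.

```python
import re

def extract_candidate_entities(text):
    """Toy entity spotter: treats runs of capitalized words as
    candidate entities, skipping words that merely start a sentence.
    Real EE systems use full syntactic and statistical models."""
    # A run of one or more capitalized words, e.g. "Apache Software Foundation"
    pattern = re.compile(r"\b(?:[A-Z][a-zA-Z]+)(?:\s+[A-Z][a-zA-Z]+)*\b")
    candidates = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        for match in pattern.finditer(sentence):
            # A capital at sentence start may just be sentence case, not a name
            if match.start() == 0:
                continue
            candidates.append(match.group())
    return candidates
```

Running it over "Yesterday the Apache Software Foundation released UIMA." yields "Apache Software Foundation" and "UIMA" while ignoring the sentence-initial "Yesterday" — a crude stand-in for what the real tools do with far higher precision.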
There are very primitive EE tools in place today, such as the tools that extract InfoBox data from Wikipedia pages. But these are only the beginning. There are many more to come.
One of the most significant developments in Entity Extraction is the Apache UIMA (Unstructured Information Management Architecture) project. This is a massive project being undertaken by the Apache Foundation to allow non-programmers to extract entities from free-form text using an innovative, high-performance architecture for document analysis. The best way to describe UIMA is as a version of UNIX pipes for entity extraction, but with an important twist: the data does not have to physically move between processes. Annotator processes dance over the documents, and out pop documents with high-precision annotations.
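The "pipes without moving data" idea is worth sketching. UIMA itself is a Java framework, so the Python below is only a toy model of the architecture: a shared analysis structure (UIMA calls it the CAS) holds the document text in place, and each annotator in the pipeline attaches stand-off annotations to it rather than producing a transformed copy.

```python
import re

class CAS:
    """Minimal stand-in for UIMA's Common Analysis Structure: the
    document text stays in one place while annotators attach
    stand-off annotations (label, start, end) to it."""
    def __init__(self, text):
        self.text = text
        self.annotations = []

def year_annotator(cas):
    # Tag four-digit numbers as Year annotations.
    for m in re.finditer(r"\b\d{4}\b", cas.text):
        cas.annotations.append(("Year", m.start(), m.end()))

def org_annotator(cas):
    # Toy rule: "Apache <Word>" looks like an organization/project name.
    for m in re.finditer(r"\bApache [A-Z]\w+", cas.text):
        cas.annotations.append(("Organization", m.start(), m.end()))

def run_pipeline(text, annotators):
    # The "pipe": every stage reads and writes the same in-place CAS,
    # so no document data is copied between stages.
    cas = CAS(text)
    for annotate in annotators:
        annotate(cas)
    return cas

cas = run_pipeline("Apache UIMA was demonstrated in 2008.",
                   [year_annotator, org_annotator])
```

The payoff of this design is that annotators can be developed, configured, and chained independently — which is what makes the framework approachable for non-programmers assembling pipelines from existing components.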
The second demonstration you can try is the Thomson Reuters ClearForest/OpenCalais demos. This is a commercial product with an excellent Firefox add-on called Gnosis that can be used to see the incredible progress that Entity Extraction has made in the last few years. The add-on does a great job of extracting precise entities from any free-form text. The results can also be formatted in RDF or other formats.
If you are working on a Wikipedia article, the Gnosis add-on is very useful for finding out which words should have Wikipedia links. You will find that many of the words Gnosis finds are already in Wikipedia. They just need to have the wiki markup tags added.
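That last step — adding the markup once you know which terms have articles — is mechanical enough to automate. A small sketch, where `known_titles` stands in for the list of terms an EE tool such as Gnosis would hand you:

```python
import re

def wikify(text, known_titles):
    """Wrap terms that already have Wikipedia articles in [[...]]
    wiki-link markup. known_titles is a stand-in for the entity
    list an EE tool would produce; matching here is exact-case."""
    # Longest titles first, so "New York City" wins over "New York"
    for title in sorted(known_titles, key=len, reverse=True):
        text = re.sub(r"\b" + re.escape(title) + r"\b",
                      "[[" + title + "]]", text)
    return text
```

So `wikify("Entity extraction powers the semantic web.", {"semantic web"})` returns `Entity extraction powers the [[semantic web]].` — the editing step reduced to a lookup plus a substitution.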
Using these tools is creating some new strategic thinking in the industry. In the near future the UIMA tools will become mature enough that you can simply configure your web server to automatically create a "richly linked" view of each web page. This will have many implications for the search marketplace and for web site stickiness. Once a user is on a richly linked page, you will be able to let them "find other documents that reference this term" and keep them from going to a search engine.
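A server-side filter for such a "richly linked" view could be quite small once the entities are extracted. The sketch below wraps each entity in a link to a site-local search; the `/related` endpoint is a hypothetical name, not part of any existing tool:

```python
import html
from urllib.parse import quote

def richly_linked_view(text, entities):
    """Sketch of a server-side page filter: wrap each extracted
    entity in a link to a local "find other documents that mention
    this term" search, keeping readers on the site. The /related
    endpoint is a made-up example, not a real API."""
    out = html.escape(text)
    # Longest entities first so shorter substrings don't clobber them
    for entity in sorted(entities, key=len, reverse=True):
        link = '<a href="/related?term={}">{}</a>'.format(
            quote(entity), html.escape(entity))
        out = out.replace(html.escape(entity), link)
    return out
```

This is the stickiness play in miniature: every recognized term becomes an internal navigation hop instead of an exit to an external search engine.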
When you combine EE with a native XML database such as eXist (eXist-db.org), a search application can be created with just a few pages of XQuery. In fact, it is possible that an XQuery module will be created to automate the EE process.
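One reason the eXist combination is attractive is that eXist exposes a REST interface that accepts an XQuery in the `_query` request parameter, so even the glue code is thin. A small Python sketch of building such a request URL — the server address, collection name, and document structure here are assumptions for illustration:

```python
from urllib.parse import urlencode

def exist_query_url(base, collection, xquery):
    """Build a URL for eXist-db's REST interface, which runs the
    XQuery passed in the _query parameter against a collection.
    Adjust base URL and collection path for your installation."""
    return "{}/rest{}?{}".format(base, collection,
                                 urlencode({"_query": xquery}))

# Hypothetical collection of EE-annotated documents
url = exist_query_url("http://localhost:8080/exist", "/db/articles",
                      '//doc[.//entity = "OpenCalais"]')
```

Fetching that URL (e.g. with `urllib.request.urlopen`) would return the matching documents as XML — essentially the "find other documents that reference this term" feature in one query.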
In summary, I think that the newer generation of EE tools will add an incredible amount of fuel to the semantic web fire. This is going to ignite several new business strategies.
What do you think? How many years before most web sites offer EE views? What will the impact be on your business plan?