How Entity Extraction is Fueling the Semantic Web Fire

By Dan McCreary
February 23, 2009 | Comments: 9

I have been working on several large entity extraction projects over the last few months, and I have been very impressed by the scope and depth of some of the new open source entity extraction tools, as well as the robustness of the commercial products. I thought I would discuss this because these technologies could start to move the semantic web (Web 3.0) up the hockey-stick growth curve.

If you are not familiar with Entity Extraction (EE), think of it as pulling the most relevant nouns out of a text document. EE technologies analyze the syntax of each sentence to differentiate nouns from verbs and locate the most important entities in each document.
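To make that concrete, here is a deliberately naive sketch. A real EE system uses syntactic and statistical models; this toy just treats runs of capitalized words as candidate entities, which is enough to show the input/output shape of the task.

```python
import re

def extract_entities(text):
    """Toy entity extractor: pulls multi-word capitalized phrases out of
    text as a crude stand-in for proper-noun detection. Real EE systems
    analyze sentence syntax instead of pattern-matching on case."""
    # Runs of two or more words that each start with a capital letter
    pattern = r"\b(?:[A-Z][A-Za-z]+)(?:\s+[A-Z][A-Za-z]+)+\b"
    return re.findall(pattern, text)

doc = "Dan McCreary wrote about Apache UIMA and the Open Calais service."
print(extract_entities(doc))  # -> ['Dan McCreary', 'Apache UIMA', 'Open Calais']
```

Even this crude version hints at why EE output is useful: the extracted phrases are exactly the strings you would want to link, index, or tag.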

There are some very primitive EE tools in place today, such as the tools that extract InfoBox data from Wikipedia pages. But these are only the beginning; there are many more to come.

One of the most significant developments in Entity Extraction is the Apache UIMA (Unstructured Information Management Architecture) project. This is a major effort, hosted by the Apache Software Foundation, to let non-programmers extract entities from free-form text using an innovative high-performance architecture for document analysis. The best way to describe UIMA is as a version of UNIX pipes for entity extraction, but with an important twist: the data does not have to physically move between processes. Annotator processes pass over the documents, and out pop documents with high-precision annotations.
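The real UIMA framework is Java, but the pipe-like pattern can be sketched in a few lines of Python: each annotator enriches one shared analysis structure (UIMA calls it a CAS) with stand-off annotations, so the document text itself never moves between stages. All of the names below are illustrative, not the actual UIMA API.

```python
class CAS:
    """Minimal stand-in for UIMA's Common Analysis Structure:
    one immutable text plus a growing list of stand-off annotations."""
    def __init__(self, text):
        self.text = text
        self.annotations = []   # (begin, end, type) triples

def token_annotator(cas):
    # First pipeline stage: mark every whitespace-delimited token.
    pos = 0
    for word in cas.text.split():
        begin = cas.text.index(word, pos)
        end = begin + len(word)
        cas.annotations.append((begin, end, "Token"))
        pos = end

def capitalized_annotator(cas):
    # Second stage: promote capitalized tokens to (very naive) entities.
    for begin, end, typ in list(cas.annotations):
        if typ == "Token" and cas.text[begin].isupper():
            cas.annotations.append((begin, end, "Entity"))

def run_pipeline(text, annotators):
    cas = CAS(text)
    for annotate in annotators:   # each stage enriches the same CAS in place
        annotate(cas)
    return cas

cas = run_pipeline("Apache hosts UIMA", [token_annotator, capitalized_annotator])
print([(cas.text[b:e], t) for b, e, t in cas.annotations if t == "Entity"])
```

The design point to notice is that annotations carry only character offsets: stages compose like piped processes, but the text is read in place rather than copied downstream.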

The second demonstration you can try is the Thomson Reuters ClearForest/OpenCalais demo. This is a commercial product with an excellent Firefox plugin called Gnosis that shows the incredible progress Entity Extraction has made in the last few years. The Firefox add-on does a great job of extracting precise entities from any free-form text. The results can also be formatted as RDF or in other formats.

If you are working on a Wikipedia article, it is very useful to run the Gnosis plugin to find out which words should have Wikipedia links. You will find that many of the words Gnosis finds are already in Wikipedia; they just need the wiki markup tags added.

But using these tools is prompting some new strategic thinking in the industry. In the near future the UIMA tools will become mature enough that you can configure your web server to automatically create a "richly linked" view of each web page. This will have many implications for the search marketplace and for web site stickiness. Once users are on a richly linked page, you can let them "find other documents that reference this term" and keep them from leaving for a search engine.
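As a rough illustration of what such a richly linked view might involve, here is a toy Python function that rewrites known entity mentions as search links. The `link_base` URL and all names here are invented for the example; a real server-side implementation would also need to handle overlapping mentions and avoid rewriting inside existing markup.

```python
import html
import re

def richly_link(text, entities, link_base="/search?term="):
    """Hypothetical sketch: wrap each known entity mention in a link to a
    'find other documents that reference this term' search page."""
    out = html.escape(text)
    # Longest entities first, so "New York City" wins over "New York".
    for entity in sorted(entities, key=len, reverse=True):
        target = f'<a href="{link_base}{entity}">{entity}</a>'
        out = re.sub(re.escape(html.escape(entity)), target, out)
    return out

print(richly_link("UIMA is an Apache project.", ["UIMA", "Apache"]))
```

The stickiness argument in the paragraph above falls out directly: every extracted entity becomes an internal navigation path instead of a reason to open a search engine.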

When you combine EE with a native XML database such as eXist, a search application can be created with just a few pages of XQuery. In fact, it is possible that an XQuery module will be created to automate the EE process.
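One way to picture the combination is a simple inverted index from extracted entities to the documents that mention them. This Python sketch (all names invented for illustration) stands in for what a native XML database would answer with a short XQuery path expression over stored annotations:

```python
from collections import defaultdict

# Toy sketch of the EE-plus-database idea: store each document's extracted
# entities in an inverted index, then answer "which documents mention X?".
index = defaultdict(set)

def add_document(doc_id, entities):
    """Index a document under every entity that EE found in it."""
    for entity in entities:
        index[entity].add(doc_id)

def documents_mentioning(entity):
    """The query a 'find other documents with this term' link would run."""
    return sorted(index.get(entity, set()))

add_document("d1", ["UIMA", "Apache"])
add_document("d2", ["UIMA", "OpenCalais"])
print(documents_mentioning("UIMA"))   # -> ['d1', 'd2']
```

In an XML database the same lookup is a path expression over the annotation elements, which is why a complete search application can stay so small.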

In summary, I think that the newer generation of EE will add an incredible amount of fuel to the semantic web fire. This is going to ignite several new business strategies.

What do you think? How many years before most web sites offer EE views? What will the impact be on your business plan?

"could start to move the semantic web (Web 3.0)"

Hey, I thought Web 3.0 was about the programmable web, i.e. webhooks. Dang! I just sent all of my marketing materials to the printer.

I agree that entity and relationship extraction are major distinguishing factors in the quality of search engines, i.e., they are what separates commodity search from intelligent search. However, as for your high recommendation of UIMA, this is not really an Apache effort but rather an IBM effort that was "donated" (dumped?) to Apache by IBM as it didn't gain any big market attention. Besides, the UIMA architecture is not made for high-performance throughput and has some other deficiencies. So, I would not regard UIMA as a valuable element in the whole picture of intelligent information access.

While XML repositories certainly have an interesting part in the "intelligent search" game, eXist is known to scale badly and to be anything but performant.

As the top of this page says "O'Reilly", you should rather have a look at the Mark Logic Server (e.g., used in the former SafariU) and its capabilities. The open source offerings are still a bit weak in this domain.

I would submit that Entity extraction is of secondary importance to non-programmers. What they need is knowledge extraction. And not just on text / web pages - it should be speech as well. See the latest technologies relating to text mining and speech analytics.

Interesting post, and very much in line with a post I just made on our blog. As the first commenter hints, one of the big wins for entity extraction is that it gives search engines better and richer content to index, and better content equals better results. Our post digs into a new tool that lets users build their own entity recognizers (think legal terms, diseases, etc.), which we think is the next step in exposing the real value of entity extraction to the masses.

I think it is unfair to suggest that UIMA was 'dumped' by IBM. First, UIMA is used in several IBM products, such as its search engines, and I understand that IBM continues to invest in the technology, both for product use and to support the Apache effort. I think that once they started to use UIMA themselves, IBM realized that it would be to their advantage if there was a community of developers that was familiar with the technology and who could customize the products. Furthermore, I think IBM decided that they wouldn't be able to make a business out of selling the framework (which is basically what UIMA is) and might as well give it away as charge money for it. So for all these reasons it made sense for IBM to release UIMA into open source.

As several groups have recognized, there is a need for a modular, open framework for text analysis. Previously, customers were tied into proprietary frameworks if they wanted to use commercially-available modules for entity and relationship extraction. It was very difficult and expensive to mix-and-match modules from different vendors. Customers had to build ad-hoc solutions for doing so. UIMA solves that problem and reduces the necessary work. Kudos to IBM for recognizing this and making a solution available.

Can't say how EE helps the "semantic web", but it certainly adds a semantic boost to search. The challenge is to enhance EE with disambiguation so when you extract the term Paris, you can determine if this is the city in Texas, the capital of France, or the first name of a socialite.

@Mahesh speaks of knowledge extraction. To my mind EE + disambiguation = KE.

We're doing what we can to bring EE to the masses. Try MashLogic's Firefox add-on to see real-time EE and dynamic aggregation of web services.

Entity extraction is transforming unstructured content on the web. Proper disambiguation will be key to "building out" the semantic web from unstructured resources.

AlchemyAPI provides entity extraction and disambiguation capabilities for more than two dozen entity types, and supports more than half a dozen languages (English, French, etc.), mining OCRed documents, and more. Inform and BasisTech also make quality entity extraction APIs.

This is a very helpful background article, and the comments are useful too.

As a search person, I do want to urge people to use extracted entities to enrich the searchable index, add helpful weights to the results ranking, and use entity information for faceting. But XQuery is more like SQL than like Google: simple text search is where most people like to start.


Want to see Entity Extraction in action?

I would like to invite you to try ctrl-News, an online news service powered by the CTRL semantic engine.

Using the CTRL semantic engine, ctrl-News fetches articles for you daily from the top online news sources - articles that are semantically/topically related to your specific subjects of interest. You can also see an automatically generated summary, the key topics, and the entities (people, companies, brands, countries, etc.) identified for every news article retrieved.

Click here to subscribe and try ctrl-News.
We certainly welcome any feedback you might have.
