Editor’s note: This is a guest post by IBM Senior Technical Staff Member and Apache UIMA Project Management Committee Chairman Marshall Schor. Meet Mr. Schor and Watson at the 2011 Impact Conference, April 10-14.
Natural language is messy. Slang, puns and the context of when and where something is spoken all influence meaning. Watson tackled the problem of understanding the natural language of Jeopardy! with a mess of algorithms, all managed by an open source architecture.
The open source Unstructured Information Management Architecture (Apache UIMA™), which IBM Research donated to the Apache Software Foundation in 2006, is what makes Watson’s hundreds of independent algorithms, written in different languages, work together. Watson combines legacy code written in C and C++, developed before Java became popular, with pattern matching algorithms written in Prolog. The majority of the algorithms are coded in Java because it is the most popular general purpose, high performance, object oriented language in use today.
IBM researchers came up with UIMA about a decade ago to connect colleagues who worked on language processing and unstructured information analytics. UIMA (an OASIS standard) wraps independent algorithms in a common architecture so they can work together. When UIMA-AS (Asynchronous Scaleout) was added to take advantage of multi-core machines and networks of machines, it was a natural fit for Watson.
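To make the "common architecture" idea concrete, here is a minimal sketch in Python. It is not the UIMA API: the names AnalysisRecord, Annotation, token_annotator, capitalized_annotator and run_pipeline are hypothetical, invented for this illustration. It only shows the general pattern of independent components that cooperate by reading from and writing to one shared analysis structure.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Annotation:
    """A labeled span of the text (hypothetical type, not a UIMA class)."""
    kind: str
    begin: int
    end: int


@dataclass
class AnalysisRecord:
    """The shared structure that every component reads and extends."""
    text: str
    annotations: List[Annotation] = field(default_factory=list)

    def covered_text(self, ann: Annotation) -> str:
        return self.text[ann.begin:ann.end]


# Every component has the same shape: it takes the shared record and adds to it.
Annotator = Callable[[AnalysisRecord], None]


def token_annotator(record: AnalysisRecord) -> None:
    """Mark each whitespace-separated token with its character offsets."""
    pos = 0
    for word in record.text.split():
        begin = record.text.index(word, pos)
        end = begin + len(word)
        record.annotations.append(Annotation("token", begin, end))
        pos = end


def capitalized_annotator(record: AnalysisRecord) -> None:
    """Build on the tokens left by an earlier component."""
    tokens = [a for a in record.annotations if a.kind == "token"]
    for tok in tokens:
        if record.covered_text(tok)[0].isupper():
            record.annotations.append(
                Annotation("capitalized", tok.begin, tok.end))


def run_pipeline(text: str, annotators: List[Annotator]) -> AnalysisRecord:
    """Run each component in order over one shared record."""
    record = AnalysisRecord(text)
    for annotate in annotators:
        annotate(record)
    return record
```

Calling run_pipeline("Watson plays Jeopardy! well", [token_annotator, capitalized_annotator]) leaves both token and capitalized annotations on the same record. The second component never calls the first directly; it only relies on what the first one left in the shared structure, which is what lets separately written algorithms be combined.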
Watson runs on POWER7 because of its suitability for highly parallelized applications and the high bandwidth between memory and the 32 cores of each node. UIMA scales out its components across thousands of these cores so Watson can answer a single Jeopardy! clue in about three seconds.
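The scale-out idea can be sketched in a few lines of Python. This is an assumption-laden toy, not how UIMA-AS or Watson is implemented: score_candidate and best_answer are hypothetical names, the word-overlap score is a stand-in for real evidence scoring, and a local thread pool stands in for many cores and machines. It only shows the shape of the work: score many candidate answers independently, then merge the results.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict


def score_candidate(clue: str, evidence: str) -> int:
    """Toy score: how many distinct clue words appear in the evidence."""
    clue_words = set(clue.lower().split())
    evidence_words = set(evidence.lower().split())
    return len(clue_words & evidence_words)


def best_answer(clue: str, candidates: Dict[str, str], workers: int = 4) -> str:
    """Score every candidate independently, then pick the top one.

    candidates maps a candidate answer to its supporting evidence text.
    """
    names = list(candidates)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as names.
        scores = list(pool.map(
            lambda name: score_candidate(clue, candidates[name]), names))
    # Ties on score fall back to comparing the candidate names.
    return max(zip(scores, names))[1]
```

Because each candidate is scored without reference to the others, the work can be spread over as many workers as are available. In CPython, threads will not speed up CPU-bound scoring, so the thread pool here illustrates only the fan-out and merge structure.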
Algorithms at work: Watson learning across categories
Where else is Watson’s software?
UIMA is embedded in several IBM products, including IBM InfoSphere Warehouse, which performs text analytics for both structured and unstructured content. InfoSphere BigInsights has been used to run UIMA analytics within the Apache Hadoop framework for scalable, distributed computing, analyzing and processing a broad set of information that includes unstructured content.