How IBM is digitizing the world's text

The idea of digitizing all books and making them available in electronic libraries can be traced back to 1945, when Dr. Vannevar Bush wrote "As We May Think" in the July issue of The Atlantic. His visionary description of an information-centric device called "memex" influenced the development of the hypertext concept and the Internet. Projects in the 1970s – such as Michael Hart's Project Gutenberg and futurist Ray Kurzweil's Optical Character Recognition (OCR) technology – continued the effort toward the digitization of textual information. But while billions of people access the Internet today, full digitization and availability of past textual information is still a work in progress.

Among the many efforts currently underway, IBM is working with the European Union on project IMPACT (IMProving ACcess to Text) to efficiently produce digital replicas of historically significant texts and make them widely available, editable and searchable online. As part of the project, IBM researchers in Haifa, Israel developed CONCERT (COoperative eNgine for Correction of ExtRacted Text). It automates simple, repetitious operations using an adaptive OCR engine that automatically learns from its text recognition errors.
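To make the idea of "learning from recognition errors" concrete, here is a minimal sketch in Python. It is not CONCERT's algorithm, which this article does not describe. It only assumes that human corrections can be remembered and reused once the same fix has been seen often enough. The class name `CorrectionMemory`, its methods and the `min_votes` threshold are all hypothetical.

```python
from collections import Counter, defaultdict


class CorrectionMemory:
    """Toy model (hypothetical): remember human fixes to OCR output and reuse them."""

    def __init__(self, min_votes=2):
        # For each misrecognized word, count how often each correction was chosen.
        self.votes = defaultdict(Counter)
        # Assumed threshold: only trust a fix that has been seen this many times.
        self.min_votes = min_votes

    def record(self, recognized, corrected):
        """Store one human correction of an OCR result."""
        if recognized != corrected:
            self.votes[recognized][corrected] += 1

    def suggest(self, recognized):
        """Return the most frequent fix if it is trusted, else the word unchanged."""
        if recognized not in self.votes:
            return recognized
        best, count = self.votes[recognized].most_common(1)[0]
        return best if count >= self.min_votes else recognized


memory = CorrectionMemory()
memory.record("tbe", "the")
memory.record("tbe", "the")
memory.record("rnodern", "modern")  # seen only once, so not trusted yet

words = ["tbe", "rnodern", "library"]
print([memory.suggest(w) for w in words])  # ['the', 'rnodern', 'library']
```

A real adaptive engine would feed corrections back into character recognition itself rather than keep a word list, but the feedback loop has the same shape: each confirmed fix reduces the repetitive work left for the next page.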

Digitizing Japanese literature

The diverse nature of the Japanese writing system poses a serious challenge to digitizing the country's literature. Unlike most other languages, which are written with a few dozen standard characters, Japanese uses far more. In addition to the Japanese syllabary characters – hiragana and katakana – Japanese includes about 10,000 kanji characters (including old characters, variants and the 2,136 commonly used characters), as well as ruby, small syllabary characters printed next to a kanji as a reading aid. On top of that, texts can mix vertical and horizontal layouts.
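The mix of scripts described above can be seen by sorting the characters of a short phrase into Unicode blocks. The Python sketch below is only an illustration and is not part of the IBM system. The function names are hypothetical, and the code point ranges cover the main Hiragana, Katakana and CJK Unified Ideographs blocks (plus Extension A), not every variant an OCR engine for historical texts would meet.

```python
def script_of(ch):
    """Roughly classify one character by Unicode block (illustrative ranges only)."""
    cp = ord(ch)
    if 0x3041 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    # CJK Unified Ideographs and Extension A; rarer extensions are ignored here.
    if 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF:
        return "kanji"
    return "other"


def script_counts(text):
    """Count how many characters of each script appear in the text."""
    counts = {}
    for ch in text:
        if ch.isspace():
            continue
        script = script_of(ch)
        counts[script] = counts.get(script, 0) + 1
    return counts


sample = "東京でコーヒーを飲む"  # "drink coffee in Tokyo"
print(script_counts(sample))  # {'kanji': 3, 'hiragana': 3, 'katakana': 4}
```

Even this ten-character phrase uses three scripts, which suggests why a recognizer tuned for a small alphabet does not carry over directly to Japanese text.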

The National Diet of Japan is Japan's bicameral legislature. It is composed of a lower house, called the House of Representatives, and an upper house, called the House of Councillors. Both houses of the Diet are directly elected under a parallel voting system. In addition to passing laws, the Diet is formally responsible for selecting the Prime Minister.


Last year, IBM researchers in Tokyo combined their Social Accessibility tool with CONCERT to create a full-text digitization system prototype for the National Diet Library (NDL) of Japan. Dr. Makoto Nagao, the director of the National Diet Library, wrote the book "Digital Library" in 1994, in which he argued that the digitization of books is the first step toward realizing an ideal electronic library. The next step is to create a system that allows users to take full advantage of digitized information.

"The system needs to have capabilities that are close to how we hold and utilize knowledge in our brain," said Dr. Nagao.

In addition to helping Japanese Diet members perform their duties, the NDL preserves all materials published in Japan as national cultural heritage and makes them available to government institutions and the general public. (As part of this effort, the NDL also launched the International Library of Children’s Literature in 2000.)

The NDL is making recorded academic literature available online to the public, including making it accessible to the visually impaired, and is lending the recordings to libraries throughout Japan.

The IBM Research – Tokyo team developed the full-text digitization system prototype to improve the digitization of Japanese literature printed during and after the Meiji Period (1868-1912); improve accessibility for people with disabilities that make reading printed text difficult; and facilitate effective searching and viewing of full-text data. The prototype is also designed with an eye toward future international collaboration and standardization of libraries, including the digitization of historically significant literature, broad utilization of books for various academic activities and online searching.

In a matter of years, all of our textual information will be fully digitized in a reusable way.
