8.01.2011

How IBM is digitizing the world's text

The idea of digitizing all books and making them available in electronic libraries can be traced back to 1945, when Dr. Vannevar Bush wrote "As We May Think" in the July issue of The Atlantic. His visionary description of an information-centric device called the "memex" influenced the development of the hypertext concept and the Internet. Projects in the 1970s – such as Michael Hart's Project Gutenberg, and futurist Ray Kurzweil's Optical Character Recognition (OCR) technology – continued the effort toward the digitization of textual information. But while billions of people access the Internet today, full digitization and availability of past textual information is still a work in progress.

Among the many efforts underway, IBM is working with the European Union on project IMPACT (IMProving ACcess to Text) to efficiently produce digital replicas of historically significant texts and make them widely available, editable, and searchable online. As part of the project, IBM researchers in Haifa, Israel, developed CONCERT (COoperative eNgine for Correction of ExtRacted Text). It automates simple, repetitive operations using an adaptive OCR engine that learns from its text recognition errors.
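The article does not describe CONCERT's internals, but the "learn from corrections" idea can be illustrated with a minimal, hypothetical sketch (all names below are invented for illustration, not taken from CONCERT): accumulate character-level confusions from an operator's fixes, then auto-apply only the substitutions that have been observed consistently.

```python
from collections import Counter, defaultdict

class AdaptiveCorrector:
    """Toy illustration of adaptive OCR correction: learn single-character
    confusions from operator fixes, auto-apply consistently observed ones."""

    def __init__(self, min_count=2):
        self.min_count = min_count              # evidence needed before auto-fixing
        self.confusions = defaultdict(Counter)  # misread char -> observed fixes

    def learn(self, ocr_word, corrected_word):
        # Only learn from same-length pairs (pure character substitutions).
        if len(ocr_word) != len(corrected_word):
            return
        for o, c in zip(ocr_word, corrected_word):
            if o != c:
                self.confusions[o][c] += 1

    def correct(self, text):
        out = []
        for ch in text:
            fixes = self.confusions.get(ch)
            if fixes:
                best, count = fixes.most_common(1)[0]
                if count >= self.min_count:
                    ch = best
            out.append(ch)
        return "".join(out)

corrector = AdaptiveCorrector(min_count=2)
corrector.learn("c0de", "code")    # operator corrected '0' -> 'o' once
corrector.learn("b00k", "book")    # two more observations of the same confusion
fixed = corrector.correct("w0rd")  # '0' now auto-corrected to 'o': "word"
```

A production engine would of course use word context, confidence scores, and language models rather than isolated characters; this only sketches the feedback loop between human correction and automatic recognition.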

Digitizing Japanese literature

The diverse nature of the Japanese language poses a serious challenge to digitizing the country's literature. Japanese script extends well beyond the few dozen standard characters typical of most other languages. In addition to the Japanese syllabary characters – hiragana and katakana – Japanese includes about 10,000 kanji characters (counting old characters and variants alongside the 2,136 in common use), as well as ruby, the small syllabary characters printed next to kanji as a reading aid. Texts also mix vertical and horizontal layouts.
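To give a sense of scale, these character classes map onto distinct Unicode blocks. The short sketch below (illustrative only, not part of any IBM or NDL system, and ignoring ruby and layout entirely) buckets characters by the standard ranges for hiragana, katakana, and the CJK Unified Ideographs block that covers most kanji:

```python
def classify(ch):
    """Bucket a character by standard Unicode block (illustrative only)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"   # Hiragana block
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"   # Katakana block
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"      # CJK Unified Ideographs (most kanji)
    return "other"

# 漢字 = kanji, かな = hiragana, カナ = katakana
labels = [classify(c) for c in "漢字かなカナ"]
```

Even this crude bucketing already spans thousands of code points in the kanji range alone, which hints at why OCR for Japanese is so much harder than for alphabetic scripts.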


The National Diet of Japan is Japan's bicameral legislature. It is composed of a lower house, called the House of Representatives, and an upper house, called the House of Councilors. Both houses of the Diet are directly elected under a parallel voting system. In addition to passing laws, the Diet is formally responsible for selecting the Prime Minister.


-- Wikipedia

Last year, IBM researchers in Tokyo combined their Social Accessibility tool with CONCERT to create a full-text digitization system prototype for the National Diet Library (NDL) of Japan. Dr. Makoto Nagao, the director of the National Diet Library, wrote the book "Digital Library" in 1994, in which he argued that digitizing books is only the first step toward an ideal electronic library. The next step is to create a system that allows users to take full advantage of the digitized information.

"The system needs to have capabilities that are close to how we hold and utilize knowledge in our brain," said Dr. Nagao.

In addition to helping Diet members perform their duties, the NDL preserves all materials published in Japan as part of the national cultural heritage and makes them available to government institutions and the general public. (As part of this effort, the NDL also launched the International Library of Children's Literature in 2000.)

The NDL is making recorded academic literature available online to the public, including versions accessible to the visually impaired, and is lending the recordings to libraries throughout Japan.

The IBM Research – Tokyo prototype improves the digitization of Japanese literature printed during and after the Meiji Period (1868–1912), improves accessibility to printed text for people with disabilities, and facilitates effective searching and viewing of full-text data. The prototype is also designed with an eye toward future international collaboration and standardization among libraries, including the digitization of historically significant literature, broad utilization of books for various academic activities, and online searching.

In a matter of years, all of our textual information will be fully digitized in a reusable way.

3 comments:

  1. I am not comfortable with the conclusion "all of our textual information will be fully digitized in a reusable way."

    For example, unless digitization captures line breaks, how can we distinguish intentional repetition from dittography (the accidental repetition of a line when copying a text)?

    Or, how do we detect the presence/absence of special forms of letters that end particular words or lines (in the Hebrew Bible, for example).

    Even more commonly, how do we detect shifting semantics of words over time? Ask Watson what Lady Gaga and Shakespeare's title "Much Ado About Nothing" have in common?

    You can write to me patrick at durusau.net when you have Watson's answer.

    IBM has done and is doing wonderful things in terms of digitizing manuscripts and other digital preservation projects, but I don't want to lose sight of the potential losses from digitization. The better to avoid them.

    Patrick

    ReplyDelete
  2. Interesting article. There's another digitization project I am aware of, where IBM's collaboration has been fundamental. It used IBM LanguageWare to digitize key historical manuscripts known as the 1641 Depositions. You can read more on the Trinity College Dublin website: http://1641.tcd.ie/project-people.php

    ReplyDelete
  3. Not sure how this compares to Google's book digitization project, which has run into some tangles with authors and publishers over copyright. But what is most intriguing is what scholars have begun to do with the digitized information ... the "culturomics." The only thing I know is what was reported in the New York Times,
    http://www.nytimes.com/2010/12/17/books/17words.html
    I wouldn't be surprised to learn there is some cross-collaboration going on via third-party researchers or directly between IBM and Google. I'd love to learn more.

    ReplyDelete