Among many efforts underway, IBM is working with the European Union on project IMPACT (IMProving ACcess to Text) to efficiently produce digital replicas of historically significant texts and make them widely available, editable and searchable online. As part of the project, IBM researchers in Haifa, Israel developed CONCERT (COoperative eNgine for Correction of ExtRacted Text). It automates simple, repetitive operations using an adaptive OCR engine that automatically learns from its text recognition errors.
Digitizing Japanese literature
The diverse nature of the Japanese language poses a serious challenge to digitizing the country's literature. Unlike most other languages, which typically use a few dozen standard characters, Japanese script uses far more. In addition to the Japanese syllabary characters – hiragana and katakana – Japanese includes about 10,000 kanji characters (including old characters, variants and 2,136 commonly used characters), and ruby, small syllabary characters printed next to a kanji as a reading aid. Texts may also mix vertical and horizontal layouts.
"The system needs to have capabilities that are close to how we hold and utilize knowledge in our brain," said Dr. Nagao.
In addition to helping members of the Japanese Diet perform their duties, the NDL preserves all materials published in Japan as the national cultural heritage, and makes them available to government institutions and the general public. (As part of this effort, the NDL also launched the International Library of Children’s Literature in 2000.)
The NDL is making recorded academic literature available online to the public, including making it accessible to the visually impaired, and lending the recordings to libraries throughout Japan.
The IBM Research – Tokyo team also developed a full-text digitization system prototype that improves the digitization of Japanese literature printed during and after the Meiji Period (1868 - 1912), improves accessibility for people with disabilities that make reading printed text difficult, and facilitates effective searching and viewing of full-text data. The prototype is also designed with an eye toward future international collaboration and standardization of libraries, including the digitization of historically significant literature, broad utilization of books for various academic activities and online searching.
In a matter of years, all of our textual information will be fully digitized in a reusable way.
I am not comfortable with the conclusion "all of our textual information will be fully digitized in a reusable way."
For example, unless digitization captures line breaks, how can we distinguish intentional repetition from dittography (the accidental repetition of a letter, word or line when copying a text)?
Or, how do we detect the presence or absence of special forms of letters that end particular words or lines (in the Hebrew Bible, for example)?
Even more commonly, how do we detect shifting semantics of words over time? Ask Watson what Lady Gaga and Shakespeare's title "Much Ado About Nothing" have in common.
You can write to me at patrick at durusau.net when you have Watson's answer.
IBM has done and is doing wonderful things in digitizing manuscripts and in other digital preservation projects, but I don't want to lose sight of the potential losses from digitization. The better to avoid them.
Patrick
Interesting article. There's another digitization project I am aware of where IBM's collaboration has been fundamental. It used IBM LanguageWare to digitize key historical manuscripts known as the 1641 Depositions. You can read more on the Trinity College Dublin website: http://1641.tcd.ie/project-people.php
Not sure how this compares to Google's digital project, which has run into some tangles with authors and publishers on copyright. But what is most intriguing is what scholars have begun to do with the digitized information ...the "culturomics." The only thing I know is what was reported in the New York Times:
http://www.nytimes.com/2010/12/17/books/17words.html
I wouldn't be surprised to learn there is some cross collaboration going on via third party researchers or directly between IBM and Google. I'd love to learn more.