How To Digitise Content – One E-book At A Time

Inc42 Daily Brief

Stay Ahead With Daily News & Analysis on India’s Tech & Startup Economy

[Note: This article is part of The Junction Series. We will be covering the Media and Entertainment sector in detail at The Junction 2017 in Jaipur. Learn more about The Junction here!]

The push to go paperless is urging organisations, big and small, to take up digitisation. Its value lies in comprehensively collating past data, making it available on various devices, and making the content searchable, extractable, and shareable across media. With rapid access to documents through cloud-based solutions, the once tedious task of building content has also become mobile.

The first thing that comes to mind when we talk about digitisation is “scanning”. Although scanning is the first step of digitisation in most cases, a scan is not the ideal end product. A scanned document is merely a “digital replica” of the printed version; it lacks many functions of a true digital document, viz. reusability, searchability, the ability to extract partial text and, most importantly, the ability to use the e-text for text-to-speech purposes.

How Regional Languages Can Benefit From Digitising Content

Paperless offices are one aspect of digitising content; another fascinating aspect is the conversion of existing literary content, namely books, into dynamic digital content. The idea is to treat literature as reusable data. In India, with so many regional languages, each with its own rich culture and literature, we are sitting on a wealth of reusable literary resources.

The publishing industry has been utilising desktop publishing for over 25 years now. In effect, this means that all material that has been published, or is ready to be published, must have a computerised source.

These sources hold immense potential. With standard guidelines for formats and fonts, they can be converted into accessible e-books and mainstream books in electronic format, which can serve for archiving ancient literature, preserving data in searchable form, and producing e-books, print books, and books for persons with print disabilities.

Unfortunately, for Indian languages, haphazard design, non-standard fonts and scanned documents have kept us from using these electronic source files to address the dire need for books in accessible formats. The lack of proper frameworks, tools and methodology for digitising Indian regional languages also makes it difficult to make information in these languages available on search engines like Google and Yahoo.

Creating Extractable, Searchable, And Usable Data

Innovation in technology has made it possible to create truly extractable, searchable and reusable data in Indian languages just as efficiently as in English. Conversion to a digital format does not merely mean scanning a document and tagging it so that it can be found later. Scanning is only the very first step towards digitisation. The scanned images then need to be passed through optical character recognition (OCR), which gives us the content as Unicode text.
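As a rough illustration of what getting Unicode text out of OCR means in practice, one simple sanity check is to measure how much of the output actually falls in the expected script. The Python sketch below is only an assumption about how such a check could look, not a tool described in this article; the function name `script_share` and the sample strings are hypothetical. Text typed in a legacy, non-Unicode font typically shows up as Latin characters, so it would score close to zero.

```python
import unicodedata


def script_share(text, script="DEVANAGARI"):
    """Return the fraction of non-space characters that belong to `script`.

    A character counts as belonging to the script when its Unicode
    character name starts with the script name (for example
    "DEVANAGARI LETTER BHA").
    """
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    hits = sum(
        1 for ch in chars if unicodedata.name(ch, "").startswith(script)
    )
    return hits / len(chars)


if __name__ == "__main__":
    # Hypothetical samples: real Unicode Hindi versus Latin-coded text.
    print(script_share("भारत"))
    print(script_share("Hkkjr"))
```

A low share on a page that should be in Hindi is a hint that the text needs conversion or a fresh OCR pass before it can be searched or reused.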

To ensure higher accuracy, the text is then run through an intelligent search-and-replace, i.e. it is checked against a vocabulary and a dictionary. The typing discrepancies found at this stage are fixed and matched against the original document. After the intelligent search-and-replace and typing fixes, the digital text moves on to proofreading, where the remaining errors are corrected.
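The dictionary pass described above can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the actual tool used in the process: the function name `correct_tokens`, the replacement table and the tiny vocabulary are hypothetical, and the fuzzy matching relies on the standard library's `difflib`. Words that cannot be matched are flagged so that a human proofreader can look at them.

```python
import difflib
import unicodedata


def correct_tokens(ocr_text, vocabulary, replacements=None, cutoff=0.7):
    """Return (corrected_text, flagged_tokens) for a piece of OCR output.

    1. Normalise to NFC so equivalent Unicode sequences compare equal.
    2. Apply a table of known OCR confusions (plain search-and-replace).
    3. Replace each unknown word with its closest vocabulary entry, or
       flag it for proofreading when nothing is close enough.
    """
    vocab = sorted({unicodedata.normalize("NFC", w) for w in vocabulary})
    known = set(vocab)
    text = unicodedata.normalize("NFC", ocr_text)
    for wrong, right in (replacements or {}).items():
        text = text.replace(wrong, right)

    corrected, flagged = [], []
    for token in text.split():
        if token in known:
            corrected.append(token)
            continue
        matches = difflib.get_close_matches(token, vocab, n=1, cutoff=cutoff)
        if matches:
            corrected.append(matches[0])
        else:
            corrected.append(token)   # leave as-is for the proofreader
            flagged.append(token)
    return " ".join(corrected), flagged


if __name__ == "__main__":
    vocab = ["भारत", "भाषा", "पुस्तक"]  # hypothetical mini-dictionary
    print(correct_tokens("भारन भाषा xyz", vocab))
```

Automatic replacement can be wrong for names and rare words, which is why the flagged list still goes to the proofreading stage rather than being silently accepted.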

Hence, after this rigorous process of converting the content into a viable and dynamic format, we get a master book in electronic format. Such electronic books are already being used for mainstream printing and for projects such as eBasta.

Digitising Content – One E-book At A Time

Thanks to constant innovation in technology, the process needs to be carried out only once: the product is a master electronic document of the entire content, and designing it is likewise a one-time effort. If proper guidelines are followed, with proper tools, a structured working methodology and directed efforts, publishers, authors, archivers, and even government departments, businesses and offices operating in Indian languages like Hindi, Gujarati, Marathi, Odia, Tamil, etc., will need to design their books just once.

Moreover, transliteration from one language to another is now just a click away, which makes the extra paperwork redundant across sectors.
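One reason script-to-script transliteration is cheap once the text is in Unicode is that the major Indic script blocks are laid out in parallel, so many characters sit at the same position in each block. The Python sketch below is a deliberately naive illustration of that idea for Devanagari to Gujarati, assuming a fixed code point offset; the function name is hypothetical, and real transliteration tools handle script-specific exceptions that this sketch ignores.

```python
import unicodedata

DEVANAGARI_START, DEVANAGARI_END = 0x0900, 0x097F
GUJARATI_OFFSET = 0x0A80 - 0x0900  # distance between the two Unicode blocks


def devanagari_to_gujarati(text):
    """Map each Devanagari character to the Gujarati character at the
    same position in its block, when such a character is assigned.

    Anything outside the Devanagari block, or with no assigned Gujarati
    counterpart, is left unchanged.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if DEVANAGARI_START <= cp <= DEVANAGARI_END:
            candidate = chr(cp + GUJARATI_OFFSET)
            if unicodedata.name(candidate, ""):  # empty name = unassigned
                ch = candidate
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(devanagari_to_gujarati("भारत"))  # ભારત
```

The same master text can therefore be shown in another script without retyping, although the result should still be reviewed by a reader of the target script.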

The beauty of this methodology is that such a master document can then be used as a pool of searchable data and as the source for print books, e-books, e-books for people with print disabilities, and Braille books. The latest developments in digital technology have the potential to produce books in all these formats; all we would need is the master book in its electronic format.
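The "design once, publish many times" idea can be pictured as a single structured master record with one small renderer per output format. The Python sketch below is only an illustration under our own assumptions: the `MasterBook` structure and the two renderers are hypothetical, and real e-book, print and Braille production involves far more than this.

```python
from dataclasses import dataclass, field
from html import escape


@dataclass
class MasterBook:
    """A hypothetical master record: the text is stored once, as Unicode."""
    title: str
    language: str                      # language tag such as "hi" or "gu"
    paragraphs: list = field(default_factory=list)


def to_plain_text(book):
    """Plain text, usable for search indexing or text-to-speech."""
    return "\n\n".join([book.title, *book.paragraphs])


def to_xhtml(book):
    """Minimal XHTML page, the kind of markup e-book formats build on."""
    body = "\n".join(f"<p>{escape(p)}</p>" for p in book.paragraphs)
    return (
        f'<html xmlns="http://www.w3.org/1999/xhtml" '
        f'lang="{escape(book.language)}">\n'
        f"<head><title>{escape(book.title)}</title></head>\n"
        f"<body>\n<h1>{escape(book.title)}</h1>\n{body}\n</body>\n</html>"
    )


if __name__ == "__main__":
    book = MasterBook("पुस्तक", "hi", ["पहला अनुच्छेद", "दूसरा अनुच्छेद"])
    print(to_plain_text(book))
    print(to_xhtml(book))
```

Adding another output format then means adding another renderer, while the master text itself stays untouched.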

Note: We at Inc42 take our ethics very seriously. More information about it can be found here.
