An OCR for Classical Indic Documents Containing Arbitrarily Long Words

Agam Dwivedi, Ravi Kiran Sarvadevabhatla, Rohit Saluja
Centre For Visual Information Technology (CVIT), International Institute of Information Technology, Hyderabad (IIIT-H), Gachibowli, Hyderabad, INDIA

Abstract

OCR for printed classical Indic documents written in Sanskrit is a challenging research problem. It involves complexities such as image degradation, lack of datasets and long-length words. Due to these challenges, the word accuracy of available OCR systems, both academic and industrial, is not very high for such documents. To address these shortcomings, we develop a Sanskrit-specific OCR system. We present an attention-based LSTM model for reading Sanskrit characters in line images. We introduce a dataset of Sanskrit document images annotated at line level. To augment real data and enable high performance for our OCR, we also generate synthetic data via curated font selection and rendering designed to incorporate crucial glyph substitution rules. Consequently, our OCR achieves a word error rate of 15.97% and a character error rate of 3.71% on challenging Indic document texts and outperforms strong baselines. Overall, our contributions set the stage for application of OCRs on large corpora of classic Sanskrit texts containing arbitrarily long and highly conjoined words.

Introduction

Optical Character Recognition (OCR) forms an essential component in the workflow of document image analytics. As with other document-related tasks, the advent of deep learning has produced sophisticated and reliable OCR systems for many script systems worldwide. However, barring recent exceptions, progress has been less than satisfactory for the numerous scripts from the Indian subcontinent.

Figure 1: Some images from previous works (above the blue line) and our work (below). Observe the differences in word length and the uneven line alignment in our case.

One reason for the slow progress is the unique challenges with Indic scripts. In many Indic scripts, two or more characters often combine to form conjuncts, which considerably increase the vocabulary to be tackled by OCR systems. Furthermore, the visual appearance of conjunct characters is generally more complicated than that of the individual elementary script characters. Compounding the challenge, the glyph substitution rules for conjunct characters lack consistency across fonts (see Figure 2). Even within Indic documents, those written in the classical language Sanskrit exhibit the highest levels of complexity and variety in terms of conjunct characters. Additionally, these documents, typically rendered in Devanagari, routinely contain sentences where words themselves conjoin to arbitrary lengths (see Figure 1). Due to this phenomenon, approaches relying on character-level processing are not viable. The same is the case with more recent approaches which assume vernacular word-level segmentation and annotation.

Figure 2: Synthetic images for a Sanskrit sentence in 10 different fonts (Shobhika, Sanskrit 2003, Samyak Devanagari, Poppins, Tillana, Kalam, Baloo, Amita, Yatra One, Dekko). Observe that fonts, while seemingly consistent, may contain rendering issues. E.g., note the breaks in the shirorekha (the line joining characters in a Devanagari word) in some of the conjunct characters in the Dekko font. Also, in the Yatra One font, the dot at the end of the last two words is slightly misplaced.

Apart from academic approaches, open and commercial document-level OCR systems are available for Devanagari. However, as we shall show, these approaches do not work satisfactorily across the range of conjuncts and word-lengths present in Sanskrit documents. Another fundamental issue is the very lack of annotated Sanskrit document image datasets. To address the limitations mentioned above, we propose an OCR for classical Indic documents containing arbitrarily long conjoined words.
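The word error rate and character error rate quoted in the abstract are commonly defined as an edit distance between the recognized text and the ground truth, divided by the length of the ground truth. The following is a minimal Python sketch of that common definition, not the authors' evaluation code. The function names are hypothetical, and two simplifications are assumed: characters are counted as Unicode code points (so a Devanagari conjunct counts as several characters), and words are split on whitespace.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance over code points, divided by the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def word_error_rate(reference, hypothesis):
    """Edit distance over whitespace-separated words, divided by the reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    # Assumed example: the hypothesis drops one vowel sign in the second word.
    reference = "धर्मक्षेत्रे कुरुक्षेत्रे"
    hypothesis = "धर्मक्षेत्रे कुरुक्षत्रे"
    print("CER:", character_error_rate(reference, hypothesis))
    print("WER:", word_error_rate(reference, hypothesis))
```

Under this definition a single wrong character makes the whole word wrong, so long conjoined words push the word error rate well above the character error rate, in line with the gap between the two figures reported above.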