OCR for Devnagari scripts

Sumedh Pendurkar

Shivajinagar, Maharashtra

This project aims to extract text from printed documents in Devanagari script. This would make it easy to digitize documents for office use and would reduce the storage required, since scanned documents would otherwise be saved as images.

Project status: Under Development

Artificial Intelligence

Intel Technologies
AI DevCloud / Xeon, Intel Opt ML/DL Framework

Overview / Usage

Currently, much of the work on OCR targets languages written in Latin scripts, such as English. In these scripts the characters are separated from each other, which makes it easy to segment a document into characters. In Devanagari script, however, the 'shirorekha' (the continuous line at the top of each word) complicates segmenting a word into characters. In addition, the Devanagari character set is very large, which adds complexity to the classification problem. We intend to solve these problems specifically, while also handling issues such as the minor rotation and varying intensity introduced when an image is scanned. We are currently developing an end-to-end pipeline that takes a scanned image as input and produces a text file as output.
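To illustrate why the shirorekha gets in the way, here is a minimal sketch on a toy binary image (1 = ink). It is not taken from the project's code: the function names, the "row with the most ink" heuristic, and the use of plain Python lists instead of NumPy/OpenCV arrays are all assumptions made to keep the example dependency-free. With the headline in place the whole word is one connected run of columns; ignoring the headline row lets the two glyphs separate.

```python
def find_shirorekha_row(image):
    """Return the index of the row with the most ink.

    Assumption: the headline is the densest row. Real scans have a
    headline several pixels thick, so this is only a rough proxy.
    """
    counts = [sum(row) for row in image]
    return counts.index(max(counts))


def split_columns(image, skip_row=None):
    """Group consecutive columns that contain ink into (start, end) runs.

    If skip_row is given, ink on that row is ignored, which simulates
    removing the headline before segmentation.
    """
    width = len(image[0])
    runs, start = [], None
    for x in range(width):
        has_ink = any(row[x] for y, row in enumerate(image) if y != skip_row)
        if has_ink and start is None:
            start = x                  # a new run of inked columns begins
        elif not has_ink and start is not None:
            runs.append((start, x))    # the run ended at the previous column
            start = None
    if start is not None:
        runs.append((start, width))    # close a run that reaches the right edge
    return runs


# Toy "word": row 0 is the headline joining two glyphs.
word = [
    [1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 1, 0],
    [1, 1, 1, 0, 1, 1, 1],
    [0, 1, 0, 0, 0, 1, 0],
]

headline = find_shirorekha_row(word)
print(split_columns(word))                     # one run: the word is connected
print(split_columns(word, skip_row=headline))  # two runs: the glyphs separate
```

In real Devanagari text, vowel signs above and below the headline and conjunct characters mean that simply dropping the headline is not enough, which is part of why this is a harder problem than Latin-script segmentation.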

Methodology / Approach

Proposed Methodology:

  • After removing speckle and pepper noise and binarizing the image, contours were extracted from it.
  • Each contour corresponds to a word, because the shirorekha connects all the characters in a word.
  • Each word was deskewed using the Hough line transform.
  • Each word was segmented into characters using a sliding-window approach.
  • A classifier trained on basic characters only (क ख ग ...) was tested on the characters segmented by this process, using an SVM with a linear kernel (130 fonts).
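The sliding-window step could be sketched roughly as follows. The project description does not give the window width, the stride, or how windows are scored, so those parameters and the `classifier` callback below are hypothetical; in the actual pipeline the callback's role would presumably be played by the trained linear-kernel SVM. Plain Python lists stand in for image arrays to keep the sketch self-contained.

```python
def sliding_windows(width, window, stride):
    """Yield (start, end) column ranges that fit inside a word image."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    start = 0
    while start + window <= width:
        yield (start, start + window)
        start += stride


def crop_columns(image, start, end):
    """Return the sub-image made of columns start..end-1 of every row."""
    return [row[start:end] for row in image]


def classify_windows(image, window, stride, classifier):
    """Apply a classifier to every window and collect (range, result) pairs.

    `classifier` is a hypothetical callable taking a cropped patch; here it
    is a stand-in for a trained character model.
    """
    width = len(image[0])
    return [
        ((start, end), classifier(crop_columns(image, start, end)))
        for start, end in sliding_windows(width, window, stride)
    ]


def ink_count(patch):
    """Toy stand-in classifier: counts ink pixels instead of predicting a label."""
    return sum(sum(row) for row in patch)


# Toy binary word image (1 = ink), 7 columns wide.
word_image = [
    [1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 1, 0],
    [1, 1, 1, 0, 1, 1, 1],
    [0, 1, 0, 0, 0, 1, 0],
]

results = classify_windows(word_image, window=3, stride=2, classifier=ink_count)
for columns, score in results:
    print(columns, score)
```

A window that straddles two glyphs (the middle one here) produces a weaker response than windows centred on a glyph, which is the intuition behind picking character boundaries from the per-window scores.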

Technologies Used

  • Intel-optimized libraries: NumPy, SciPy, TensorFlow, Keras, OpenCV, scikit-learn
  • Other tools: Jupyter Notebook
  • Hardware: Intel DevCloud

Repository

https://github.com/ameyaapte1/devanagari_ocr
