Multilanguage OCR

Anubhav Singh

Kolkata, West Bengal

An OCR tool that automatically extracts text in multiple languages using Google's Tesseract library, developed on Intel Optimized Python. Users can add their own handwriting samples or train models for scripts that are not yet supported, so that text in new handwriting can be recognized.

Project status: Published/In Market

Artificial Intelligence, Graphics and Media

Groups
Student Developers for AI, Artificial Intelligence India

Intel Technologies
Intel Python

Code Samples [1]

Overview / Usage

This OCR tool, built on top of Tesseract, can currently extract text in English, Hindi, and Bengali with 70% accuracy. I plan to extend it to cover other Indian languages.
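As a rough illustration of how the three languages could be requested from Tesseract, here is a minimal sketch. It assumes the pytesseract wrapper and Pillow are installed along with the Tesseract language data for eng, hin, and ben; the names TESSERACT_CODES, build_lang_arg, and extract_text are hypothetical and are not taken from the repository, which may call Tesseract differently.

```python
# Tesseract's standard language data names for the languages listed above.
TESSERACT_CODES = {"English": "eng", "Hindi": "hin", "Bengali": "ben"}


def build_lang_arg(languages):
    """Join language names into the 'eng+hin+ben' form that Tesseract accepts."""
    return "+".join(TESSERACT_CODES[name] for name in languages)


def extract_text(image_path, languages=("English", "Hindi", "Bengali")):
    """Run Tesseract on one image (hypothetical helper, not the project's API)."""
    # Imported here so the language helper above works without OCR dependencies.
    import pytesseract
    from PIL import Image

    image = Image.open(image_path)
    return pytesseract.image_to_string(image, lang=build_lang_arg(languages))
```

Supporting another Indian language in this sketch would only mean adding its Tesseract code to TESSERACT_CODES, provided the matching language data is installed.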

Methodology / Approach

First, OpenCV splits the text found in an image into bounding boxes. Then, for each box, a CNN predicts the character it contains. A separate model is used for each language.
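The two steps above can be sketched as follows. This is an illustration under stated assumptions, not the project's code: to stay dependency-free it swaps OpenCV's contour detection for a plain flood fill over a binary grid, and it replaces the CNN with a stand-in callable. The names find_boxes, recognize, and models are hypothetical.

```python
from collections import deque


def find_boxes(binary):
    """Return (x, y, w, h) boxes of 4-connected foreground regions, sorted left to right."""
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not binary[r][c] or seen[r][c]:
                continue
            # Flood-fill one region while tracking its extent.
            seen[r][c] = True
            queue = deque([(r, c)])
            top, bottom, left, right = r, r, c, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and binary[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((left, top, right - left + 1, bottom - top + 1))
    return sorted(boxes)


def recognize(binary, language, models):
    """Crop each box and pass it to the model registered for the given language."""
    predict = models[language]  # one model per language, as described above
    chars = []
    for x, y, w, h in find_boxes(binary):
        crop = [row[x:x + w] for row in binary[y:y + h]]
        chars.append(predict(crop))
    return "".join(chars)


# Toy input: a 2x2 blob and a 1x3 vertical bar on a 5x7 binary image.
image = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 1, 0],
    [0, 1, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
# Stand-in for a trained CNN: labels a crop by its width only.
models = {"English": lambda crop: "A" if len(crop[0]) == 2 else "I"}
text = recognize(image, "English", models)
```

In practice a single character in Hindi or Bengali often consists of several disconnected strokes, so a real segmentation step would need to merge nearby regions before classification.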

Technologies Used

Intel Optimized Python
OpenCV
Tesseract

Repository

https://github.com/xprilion/optical-character-reader
