Multilanguage OCR
Anubhav Singh
Kolkata, West Bengal
- 0 Collaborators
An OCR tool, developed on Intel Optimized Python, that automatically extracts text in multiple languages using Google's Tesseract library. Users can add their own handwriting samples or train models that are not already available, so the tool can recognize text in new handwriting.
Project status: Published/In Market
Artificial Intelligence, Graphics and Media
Groups
Student Developers for AI
Artificial Intelligence India
Intel Technologies
Intel Python
Overview / Usage
This OCR tool, built on top of Tesseract, can currently extract text in English, Hindi and Bengali with 70% accuracy. I plan to extend it to the other Indian languages.
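As a rough illustration of the usage described above, the sketch below asks Tesseract for English, Hindi and Bengali in a single call. It assumes the pytesseract wrapper and Pillow are installed, and that the eng, hin and ben language data files are available to Tesseract. The names LANGUAGE_CODES, tesseract_lang and extract_text are hypothetical and not taken from the project.

```python
from typing import Iterable

# Tesseract language codes for the three languages named above.
LANGUAGE_CODES = {"English": "eng", "Hindi": "hin", "Bengali": "ben"}


def tesseract_lang(languages: Iterable[str]) -> str:
    """Build the '+'-joined language string that Tesseract accepts."""
    return "+".join(LANGUAGE_CODES[name] for name in languages)


def extract_text(image_path: str,
                 languages: Iterable[str] = ("English", "Hindi", "Bengali")) -> str:
    """Run Tesseract on one image with every requested language enabled."""
    # Imported here so the helper above can be used without these packages.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, lang=tesseract_lang(languages))
```

Calling extract_text("scan.png") on a hypothetical file scan.png would return the recognized text as a single string.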
Methodology / Approach
First, the text in each image is segmented into bounding boxes using OpenCV. Then, for each box, a CNN predicts the character it contains. A separate model is used for each language.
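The two steps above might look roughly like the following sketch. It is not the project's actual code: the Otsu threshold plus external-contour segmentation, the reading-order sort, and the names find_character_boxes, sort_reading_order and recognize are all assumptions made for illustration. The per-language CNNs are represented by plain callables supplied by the caller, since the source does not describe the models themselves.

```python
from typing import Any, Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)
Classifier = Callable[[Any, Box], str]  # stand-in for a per-language CNN


def find_character_boxes(image_path: str, min_area: int = 20) -> List[Box]:
    """Segment an image into candidate character boxes with OpenCV."""
    import cv2  # imported here so the helpers below work without OpenCV

    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise FileNotFoundError(image_path)
    # Otsu threshold, inverted so that dark text becomes white foreground.
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # [-2] selects the contour list in both OpenCV 3.x and 4.x.
    contours = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                cv2.CHAIN_APPROX_SIMPLE)[-2]
    boxes = [tuple(cv2.boundingRect(c)) for c in contours]
    # Drop tiny boxes, which are usually noise rather than characters.
    return [b for b in boxes if b[2] * b[3] >= min_area]


def sort_reading_order(boxes: List[Box], line_tolerance: int = 10) -> List[Box]:
    """Group boxes into text lines by y, then order each line left to right."""
    lines: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: b[1]):
        # A box joins the current line if its top edge is close to the
        # top edge of the first box on that line.
        if lines and abs(box[1] - lines[-1][0][1]) <= line_tolerance:
            lines[-1].append(box)
        else:
            lines.append([box])
    return [b for line in lines for b in sorted(line, key=lambda b: b[0])]


def recognize(image: Any, boxes: List[Box], language: str,
              classifiers: Dict[str, Classifier]) -> str:
    """Pick the model for the language and classify each box in order."""
    classify = classifiers[language]  # one model per language
    return "".join(classify(image, box) for box in sort_reading_order(boxes))
```

Keeping the classifiers in a dictionary keyed by language means that supporting another language only requires registering one more model, without changing the segmentation step.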
Technologies Used
Intel Optimized Python
OpenCV
Tesseract