GAPML

David Molina

Portland, Oregon

The Gap NLP/CV data engineering framework is an open source project that provides an easy way to get started with machine learning on two kinds of data: unstructured data in PDF documents, scanned documents, TIFF facsimiles, and camera-captured documents, and image data in image files and image repositories.

Project status: Under Development

Artificial Intelligence

Intel Technologies
AI DevCloud / Xeon

Overview / Usage

This framework is ideal for any organization planning to do:

  • Data extraction from their repository of documents into an RDBMS for CART analysis, linear/logistic regression, or generating word vectors for natural language deep learning (DeepNLP).
  • Generating machine-learning-ready data from their repository of images for computer vision.

Methodology / Approach

The framework consists of a sequence of Python modules which can be retrofitted into a variety of configurations. The framework is designed to fit seamlessly and scale with an accompanying infrastructure. To achieve this, the design incorporates:

  • Problem and Modular Decomposition utilizing Object Oriented Programming Principles.
  • Isolation of Operations and Parallel Execution utilizing Functional Programming Principles.
  • High Performance utilizing Performance Optimized Python Structures and Libraries.
  • High Reliability and Accuracy using Test Driven Development Methodology.
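
As a rough illustration of the first two principles, the sketch below models a page as an immutable object and each processing step as a pure function that returns a new page. The names `Page`, `normalize`, `tokenize`, and `compose` are hypothetical and are not taken from the Gap code base; the point is only that stages which share no mutable state can be recombined freely and are safe to run in parallel.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Page:
    """One page of a document (hypothetical type, not Gap's own class).

    frozen=True makes instances immutable, so no stage can alter a page
    that another worker is still reading.
    """
    number: int
    text: str
    words: Tuple[str, ...] = ()


def normalize(page: Page) -> Page:
    """Return a new Page with lowercased text and collapsed whitespace."""
    return Page(page.number, " ".join(page.text.lower().split()), page.words)


def tokenize(page: Page) -> Page:
    """Return a new Page whose words are the purely alphabetic tokens."""
    stripped = (w.strip(".,;:!?()\"'") for w in page.text.split())
    return Page(page.number, page.text, tuple(w for w in stripped if w.isalpha()))


def compose(*stages: Callable[[Page], Page]) -> Callable[[Page], Page]:
    """Chain stages left to right into a single page -> page function."""
    def run(page: Page) -> Page:
        for stage in stages:
            page = stage(page)
        return page
    return run


# Stages can be rearranged or swapped without touching each other.
pipeline = compose(normalize, tokenize)
result = pipeline(Page(1, "The  Quick, brown FOX."))
print(result.words)
```

Because every stage takes a page and returns a page, a new step (for example, stemming) would slot into `compose` without changes to the existing ones.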

Technologies Used

The Gap framework extensively uses a number of open source applications/modules:

Artifex's Ghostscript - extracting text from text-based PDFs
ImageMagick's Magick - extracting images from scanned PDFs
Google's Tesseract - OCR of scanned or camera-captured text
NLTK (Natural Language Toolkit) - stemming, lemmatization, and part-of-speech annotation
unidecode - transliteration of Unicode text (such as accented Latin characters) to ASCII
NumPy - high-performance in-memory arrays (tensors)
HDF5 - high-performance access to on-disk data (tensors)
OpenCV - image manipulation and processing for computer vision
imutils - image manipulation for computer vision
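
To make the division of labor among the first three tools concrete, here is a minimal sketch that builds the command lines one might use for each: Ghostscript for a text-based PDF, and ImageMagick followed by Tesseract for one page of a scanned PDF. This is an assumption about how the tools can be chained, not a description of how Gap invokes them; the function names and the `work` directory layout are hypothetical. The sketch only constructs the argument lists and does not run anything.

```python
from pathlib import Path
from typing import List


def ghostscript_text_cmd(pdf: Path, out_txt: Path) -> List[str]:
    """Command to extract the text layer of a text-based PDF.

    txtwrite is Ghostscript's text output device; -o names the output
    file and also implies batch, non-interactive operation.
    """
    return ["gs", "-q", "-sDEVICE=txtwrite", "-o", str(out_txt), str(pdf)]


def scanned_page_cmds(pdf: Path, workdir: Path, page_index: int,
                      dpi: int = 300, lang: str = "eng") -> List[List[str]]:
    """Commands to rasterize one scanned page and then OCR it.

    "file.pdf[N]" is ImageMagick's syntax for selecting page N (0-based).
    Tesseract takes an output base name and appends ".txt" itself.
    """
    base = workdir / f"{pdf.stem}-{page_index}"
    image = base.with_name(base.name + ".png")
    return [
        ["magick", "-density", str(dpi), f"{pdf}[{page_index}]", str(image)],
        ["tesseract", str(image), str(base), "-l", lang],
    ]


# Print the planned commands for the first page of a hypothetical scan.
for cmd in scanned_page_cmds(Path("scan.pdf"), Path("work"), 0):
    print(" ".join(cmd))
```

Each list could be passed to `subprocess.run(cmd, check=True)` on a machine where the three tools are installed and on the PATH.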

Gap's performance is being tested on large datasets using multiprocessing and threading on Intel CPUs in the AI DevCloud.
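
The following sketch shows one common way to switch between threads and processes for per-page work using Python's standard `concurrent.futures` module. It is an illustration under assumptions, not Gap's benchmarking code: `count_words` stands in for a real per-page operation, and `run_parallel` is a hypothetical helper.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from typing import Callable, Iterable, List


def count_words(text: str) -> int:
    """Stand-in for a per-page operation; pure, so safe to run concurrently."""
    return len(text.split())


def run_parallel(fn: Callable[[str], int], items: Iterable[str],
                 workers: int = 4, use_processes: bool = False) -> List[int]:
    """Apply fn to every item with a pool of workers, preserving input order.

    Threads suit I/O-bound steps such as reading files or waiting on
    external tools. Processes suit CPU-bound steps, because each process
    has its own interpreter; fn must then be a picklable, module-level
    function.
    """
    executor_cls = ProcessPoolExecutor if use_processes else ThreadPoolExecutor
    with executor_cls(max_workers=workers) as pool:
        return list(pool.map(fn, items))


if __name__ == "__main__":
    pages = ["one two three", "four five", "six"]
    print(run_parallel(count_words, pages))
    print(run_parallel(count_words, pages, use_processes=True))
```

The `__main__` guard matters for the process-based path: on platforms that start workers by spawning a fresh interpreter, the module is re-imported in each worker, and unguarded pool creation would recurse.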

Repository

https://github.com/gapml
