Attention Based Visual Dialogue


Visual Dialogue is a task in which an AI agent holds a meaningful dialogue with humans in natural language about visual content, specifically images.

Project status: Under Development

Artificial Intelligence

Intel Technologies
Intel Opt ML/DL Framework, Intel Python


Overview / Usage

Given an image, a dialogue history, and the current question about the image, the AI agent has to understand the question, gather context from the history, and answer in an intelligent manner. This task lies at the intersection of computer vision and natural language processing, and the agent has to be grounded enough in both domains to serve as a general test of machine intelligence. Convolutional Neural Networks (CNNs) will be used for extracting features from the image, and Long Short-Term Memory (LSTM) networks will be used for processing natural language. The dataset to be used is VisDial v0.9, which contains 1 dialogue with 10 Q-A pairs on ~120K images from COCO, totaling ~1.2M dialogue Q-A pairs.
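
To make the data layout concrete, one training example can be pictured as an image paired with its caption and ten question-answer rounds. The snippet below is only an illustrative, simplified structure; the field names and values are placeholders of my own, not the exact VisDial v0.9 JSON schema:

```python
# Illustrative shape of one VisDial-style training example.
# Field names and values are placeholders; the real VisDial v0.9 files differ in detail.
example = {
    "image_id": 378466,            # the COCO image the dialogue is grounded in
    "caption": "a man riding a wave on top of a surfboard",
    "dialog": [                    # 10 question-answer rounds
        {"question": "is the photo in color?", "answer": "yes"},
        {"question": "is the man wearing a wetsuit?", "answer": "no"},
        # ... 8 more rounds
    ],
}
```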

Future Real-World Applications

  1. Helping visually impaired people understand their surroundings or social media content. Example: (AI: ‘John just uploaded a picture from his vacation in Hawaii’, Human: ‘Great, is he at the beach?’, AI: ‘No, on a mountain’)

  2. Analyzing video content and giving important insights. Example: (Human: ‘Did anyone enter this room last week?’, AI: ‘Yes, 27 instances logged on camera’, Human: ‘Were any of them carrying a black bag?’)

  3. Interacting with an AI assistant. Example: (Human: ‘Alexa – can you see the baby in the baby monitor?’, AI: ‘Yes, I can’, Human: ‘Is he sleeping or playing?’)

Methodology / Approach

Solution

The solution to this problem is a connected model of a CNN, an LSTM, and a Memory Network, with an attention mechanism over both the image and the past Q-A pairs.

Image Encoder

A CNN architecture, namely ResNet-34, will be used to extract image features from the image. The features from its first fully-connected (FC) layer after the convolutional stages will be used and concatenated with the embedding of the current question. The network will be trained end-to-end, and the gradients will be backpropagated into the ResNet layers so as to fine-tune them.

One more thing I want to try is first fine-tuning the ResNet to classify COCO images and then fine-tuning it further during the complete end-to-end training. I will also use an attention mechanism on the image to check whether performance increases.
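
As a rough sketch of this image encoder, assuming torchvision's pretrained ResNet-34 and tapping the pooled features just before its classifier (the exact layer used may change), the backbone can be wrapped as follows, with a flag deciding whether gradients are backpropagated into the ResNet:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ImageEncoder(nn.Module):
    """ResNet-34 feature extractor; a sketch, not the final architecture."""

    def __init__(self, fine_tune=True):
        super().__init__()
        resnet = models.resnet34(pretrained=True)
        # Keep everything up to and including the global average pool,
        # dropping the 1000-way ImageNet classifier.
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        for p in self.backbone.parameters():
            p.requires_grad = fine_tune   # freeze, or fine-tune end-to-end

    def forward(self, images):             # images: (B, 3, 224, 224)
        feats = self.backbone(images)      # (B, 512, 1, 1)
        return feats.flatten(1)            # (B, 512) image encoding

# Example: img_feats = ImageEncoder()(torch.randn(2, 3, 224, 224))  # -> (2, 512)
```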

Q-A Encoder-Decoder

The complete model will be given an image, the current question, and the history of questions. The set of candidate answers will also be given as input to the model if the decoder is discriminative.

Encoder

The current question will be encoded with an LSTM. It will then be concatenated with the image encoding and passed through a fully-connected layer and a tanh non-linearity to get a ‘query vector’. Each caption/Q-A pair in the history will be encoded independently by an LSTM with shared weights. The query vector is then used to compute attention over the t history facts by inner product. The combination of attended history vectors is passed through a fully-connected layer and a tanh non-linearity, and added back to the query vector. This combined representation is passed through another fully-connected layer and a tanh non-linearity and then used to decode the response.

Other types of encoders to try: Transformer network, Hierarchical Recurrent Encoder.
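
A minimal PyTorch sketch of the memory-attention encoder described above is given below; the dimensions, module names, and embedding setup are my own assumptions rather than the final implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryEncoder(nn.Module):
    """Question + image + dialogue-history encoder with inner-product attention.
    A sketch under assumed dimensions, not the project's final implementation."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512, img_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.q_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # current question
        self.h_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # shared over all history facts
        self.query_fc = nn.Linear(hidden_dim + img_dim, hidden_dim)
        self.hist_fc = nn.Linear(hidden_dim, hidden_dim)
        self.out_fc = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, question, history, img_feats):
        # question: (B, Lq) token ids, history: (B, T, Lh) token ids, img_feats: (B, img_dim)
        B, T, Lh = history.shape
        _, (q_h, _) = self.q_lstm(self.embed(question))                   # q_h[-1]: (B, H)
        query = torch.tanh(self.query_fc(torch.cat([q_h[-1], img_feats], dim=1)))

        _, (f_h, _) = self.h_lstm(self.embed(history.reshape(B * T, Lh)))
        facts = f_h[-1].reshape(B, T, -1)                                 # (B, T, H)

        # Inner-product attention of the query over the T history facts.
        att = F.softmax(torch.bmm(facts, query.unsqueeze(2)).squeeze(2), dim=1)  # (B, T)
        attended = torch.tanh(self.hist_fc((att.unsqueeze(2) * facts).sum(1)))   # (B, H)
        return torch.tanh(self.out_fc(query + attended))                  # encoding fed to the decoder
```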

Decoder

Two types of decoder can be used:

a) Generative: it will generate one word at each time step.

b) Discriminative: a set of candidate answers will be chosen based on the input question alone, and the network will be tasked with ranking them by their log-likelihood.

I will try both decoders and will deploy the best model.
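
As an illustration of the discriminative option, here is a hedged sketch (dimensions and names are assumptions, not the final decoder): each candidate answer is encoded by an LSTM and scored against the encoder output with a dot product, the scores are trained with a cross-entropy loss over the candidate set, and at test time the candidates are ranked by score. The generative decoder would instead be an LSTM language model that emits the answer one word per time step.

```python
import torch
import torch.nn as nn

class DiscriminativeDecoder(nn.Module):
    """Ranks candidate answers by dot product with the dialogue encoding.
    A sketch under assumed dimensions, not the project's final decoder."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.a_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, encoding, options):
        # encoding: (B, H) encoder output, options: (B, C, La) candidate-answer token ids
        B, C, La = options.shape
        _, (a_h, _) = self.a_lstm(self.embed(options.reshape(B * C, La)))
        opt_enc = a_h[-1].reshape(B, C, -1)                          # (B, C, H)
        return torch.bmm(opt_enc, encoding.unsqueeze(2)).squeeze(2)  # (B, C) scores

# Training:  loss = nn.CrossEntropyLoss()(scores, gt_index)   # gt_index: (B,) ground-truth option
# Inference: ranking = scores.argsort(dim=1, descending=True)
```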

Technologies Used

● PyTorch
● Intel Python
● Intel Math Kernel Library
● Torchvision
● NumPy
● OpenCV
● Matplotlib
