Multimodal Representations from other Unimodal Representations: a Survey

Fernando Tadao Ito

State of São Paulo

My master's thesis, consisting of several experiments on unimodal and multimodal representations and on how the latter are formed from the former.

Artificial Intelligence

Description

In this project, we explore different unimodal representations for text and images, as well as their combinations into multimodal representations, evaluating them through empirical classification tests on multimodal datasets. The objective is to determine whether multimodal representations bring a substantial gain in classification F-scores, and whether the choice of underlying representations affects this performance.

We are working with e-commerce and news datasets containing both images and text, and combining different representation techniques. For images, we use SIFT, SURF, and ORB visual words; for text, we use LSI and LDA topic representations and GloVe and Word2Vec word embeddings. To generate multimodal representations, we use a simple deep multimodal autoencoder trained for reconstruction.
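
As an illustration of the visual-word side of the pipeline, the sketch below builds a bag-of-visual-words representation with OpenCV's ORB detector and a k-means codebook. The vocabulary size, file paths, and the choice of clustering binary descriptors with k-means are illustrative assumptions, not the project's exact configuration.

```python
# Hypothetical sketch of a bag-of-visual-words pipeline with ORB; codebook
# size and paths are illustrative, not the project's actual settings.
import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans

VOCAB_SIZE = 500  # assumed visual vocabulary size

orb = cv2.ORB_create()

def orb_descriptors(path):
    """Detect ORB keypoints in a grayscale image and return descriptors."""
    image = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, descriptors = orb.detectAndCompute(image, None)
    return descriptors if descriptors is not None else np.empty((0, 32), np.uint8)

# 1. Build the visual vocabulary by clustering descriptors from training images.
train_paths = ["img_001.jpg", "img_002.jpg"]  # placeholder paths
all_descriptors = np.vstack([orb_descriptors(p) for p in train_paths])
codebook = MiniBatchKMeans(n_clusters=VOCAB_SIZE, random_state=0)
codebook.fit(all_descriptors.astype(np.float32))

# 2. Represent each image as a normalized histogram of visual-word counts.
def bovw_histogram(path):
    desc = orb_descriptors(path).astype(np.float32)
    hist = np.zeros(VOCAB_SIZE)
    if len(desc):
        np.add.at(hist, codebook.predict(desc), 1)
        hist /= hist.sum()
    return hist
```

SIFT or SURF descriptors would slot into the same pipeline by swapping the detector.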

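The fusion step can be pictured with a minimal sketch of a deep multimodal autoencoder, written here in Keras (the framework and all layer sizes are my assumptions): each modality is encoded separately, the codes are concatenated into a shared layer, and the network is trained to reconstruct both inputs; the shared layer is then read off as the multimodal representation.

```python
# Hypothetical Keras sketch of a deep multimodal autoencoder trained for
# reconstruction; the framework and layer sizes are assumptions.
from tensorflow import keras
from tensorflow.keras import layers

TEXT_DIM, IMAGE_DIM, SHARED_DIM = 300, 500, 128  # illustrative sizes

text_in = keras.Input(shape=(TEXT_DIM,), name="text")
image_in = keras.Input(shape=(IMAGE_DIM,), name="image")

# Modality-specific encoders.
text_h = layers.Dense(256, activation="relu")(text_in)
image_h = layers.Dense(256, activation="relu")(image_in)

# Shared multimodal code: this layer is the representation we extract.
shared = layers.Dense(SHARED_DIM, activation="relu", name="multimodal")(
    layers.Concatenate()([text_h, image_h])
)

# Modality-specific decoders reconstruct both inputs from the shared code.
text_out = layers.Dense(TEXT_DIM, name="text_rec")(
    layers.Dense(256, activation="relu")(shared)
)
image_out = layers.Dense(IMAGE_DIM, name="image_rec")(
    layers.Dense(256, activation="relu")(shared)
)

autoencoder = keras.Model([text_in, image_in], [text_out, image_out])
autoencoder.compile(optimizer="adam", loss="mse")

# After fitting on (text, image) vector pairs, e.g.
#   autoencoder.fit([X_text, X_img], [X_text, X_img], epochs=50)
# the multimodal representation is read off the shared layer:
encoder = keras.Model([text_in, image_in], shared)
```
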
Early results on product category classification suggest that complexity is not always key: LSI performs remarkably well and beats all other representations on this task, unimodal and multimodal alike. I'll update this project as more data and results come in. An upcoming paper submitted to LREC will be attached here later (fingers crossed that it goes through).
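
For context, the comparison itself amounts to training the same classifier on each representation and comparing F-scores. Below is a hedged scikit-learn sketch of that protocol; the classifier choice, cross-validation scheme, and macro averaging are assumptions, not the thesis's exact setup.

```python
# Hypothetical evaluation sketch: one classifier, many representations,
# compared by macro-averaged F-score over cross-validation.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def f_score(features, labels):
    """Mean macro F1 over 5-fold cross-validation for one representation."""
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, features, labels, cv=5, scoring="f1_macro").mean()

# representations: dict mapping a name ("lsi", "glove", "multimodal", ...) to
# an (n_samples, dim) feature matrix; y holds the product-category labels.
def compare(representations, y):
    for name, X in sorted(representations.items()):
        print(f"{name}: macro F1 = {f_score(X, y):.3f}")
```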
