In this project, we explore different unimodal representations for text and images, as well as their combination into multimodal representations, evaluating them through classification experiments on multimodal datasets. The objective is to determine whether multimodal representations yield a substantial gain in classification F-scores, and whether the choice of representation affects this performance.
We work with e-commerce and news datasets containing both images and text, combining different representation techniques. For images, we use SIFT, SURF, and ORB visual words; for text, we use LSI and LDA topic models, plus GloVe and Word2Vec word embeddings. To generate multimodal representations, we use a simple deep multimodal autoencoder trained for reconstruction.
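To make the fusion step concrete, here is a minimal toy sketch of the idea behind a multimodal autoencoder: concatenate the unimodal feature vectors, compress them into a shared code, and train to reconstruct the input. All names, layer sizes, and hyperparameters below are illustrative assumptions, not the project's actual architecture (which is deeper).

```python
import numpy as np

# Toy sketch (illustrative only): a single-hidden-layer autoencoder
# over concatenated text + image features. The shared hidden code
# serves as the fused multimodal representation.

rng = np.random.default_rng(0)

def train_autoencoder(X, code_dim=8, lr=0.1, epochs=500):
    """Train a tanh autoencoder with plain gradient descent."""
    n, d = X.shape
    W1 = rng.normal(0, 0.1, (d, code_dim))
    b1 = np.zeros(code_dim)
    W2 = rng.normal(0, 0.1, (code_dim, d))
    b2 = np.zeros(d)
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)      # shared multimodal code
        X_hat = H @ W2 + b2           # linear reconstruction
        err = X_hat - X               # gradient of 0.5 * squared error
        gW2 = H.T @ err / n
        gb2 = err.mean(axis=0)
        dH = (err @ W2.T) * (1 - H ** 2)   # backprop through tanh
        gW1 = X.T @ dH / n
        gb1 = dH.mean(axis=0)
        W1 -= lr * gW1; b1 -= lr * gb1
        W2 -= lr * gW2; b2 -= lr * gb2
    return W1, b1

# Hypothetical unimodal features, concatenated per example
# (stand-ins for e.g. LSI text vectors and ORB visual-word histograms).
text_feats = rng.normal(size=(100, 10))
img_feats = rng.normal(size=(100, 6))
X = np.concatenate([text_feats, img_feats], axis=1)

W1, b1 = train_autoencoder(X)
codes = np.tanh(X @ W1 + b1)  # fused representation, fed to a classifier
print(codes.shape)  # → (100, 8)
```

In the reconstruction-training setup, the encoder is then discarded from the loss and the code vectors are used as input features for the downstream category classifier.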
Early results on product category classification suggest that complexity is not always key: LSI performs remarkably well and beats all other representations on this task, unimodal and multimodal alike. I'll update this project as more data and results come in. An upcoming paper submitted to LREC will be attached here later (if it goes through, fingers crossed).