Part of the renewed excitement about research in image and video captioning, which combines Computer Vision, Natural Language Processing, and Knowledge Representation and Reasoning, stems from a belief that it is a major step towards solving AI. Problems at the intersection of Computer Vision and natural language processing are important both for the challenging research questions they raise and for the rich set of applications they enable. Given an input image and an open-ended question in natural language, producing a useful response requires an understanding of the visual elements contained in the image as well as common-sense knowledge. In this review, we examine multiple approaches to this problem, in which images and questions are mapped to a common feature space, along with the datasets available to train and evaluate these systems. We also discuss the future scope of the field and promising directions in which it could be headed.