Recent technological advances have made a wide variety of image-capturing devices available, ranging from handheld mobile phones to space-grade rovers. The resulting flood of visual data has created a need to organize and understand it, and hence to caption thousands of such images automatically; this need has driven extensive research in computer vision and deep learning. Inspired by these works, we present an image caption generation system that uses a convolutional neural network (CNN) to extract a feature embedding of an image and feeds that embedding as input to long short-term memory (LSTM) cells, which generate a caption. We use two CNN models pre-trained on ImageNet, VGG16 and ResNet-101, and test and compare both on the Flickr8K dataset. Experimental results on this public benchmark demonstrate that the proposed architecture performs as efficiently as state-of-the-art image captioning models. © Springer Nature Singapore Pte Ltd. 2020.
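The encoder-decoder pipeline described above (CNN feature embedding fed into an LSTM that emits caption tokens) can be sketched as follows. This is a minimal illustrative sketch in PyTorch, not the authors' implementation: the tiny `Encoder` CNN stands in for the VGG16 / ResNet-101 backbones pre-trained on ImageNet, and the vocabulary size, embedding width, and hidden size are assumed values.

```python
# Illustrative CNN-encoder + LSTM-decoder captioning sketch (assumed
# hyperparameters; a pre-trained VGG16/ResNet-101 would replace Encoder).
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Tiny stand-in CNN that maps an image to a feature embedding."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),      # global pooling -> (B, 16, 1, 1)
        )
        self.fc = nn.Linear(16, embed_dim)  # project to embedding space

    def forward(self, images):
        x = self.cnn(images).flatten(1)     # (B, 16)
        return self.fc(x)                   # (B, embed_dim)

class Decoder(nn.Module):
    """LSTM that consumes the image embedding, then caption tokens."""
    def __init__(self, vocab_size=1000, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, features, captions):
        # Prepend the image embedding as the first input step.
        inputs = torch.cat([features.unsqueeze(1),
                            self.embed(captions)], dim=1)  # (B, T+1, E)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)             # per-step vocabulary logits

images = torch.randn(2, 3, 224, 224)            # dummy batch of images
captions = torch.randint(0, 1000, (2, 10))      # dummy token ids
logits = Decoder()(Encoder()(images), captions)
print(logits.shape)                             # (2, 11, 1000)
```

At training time the logits would be scored with cross-entropy against the ground-truth caption shifted by one step; at inference, tokens are sampled (e.g. greedily or by beam search) one step at a time.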