Image Captioning using Transformer Model

Authors

  • Anisha Adhikari, Nepal College of Information Technology
  • Mahigya Dahal, Nepal College of Information Technology
  • Rudra Nepal, Nepal College of Information Technology
  • Priya Shilpakar, Nepal College of Information Technology

DOI:

https://doi.org/10.65091/icicset.v2i1.14

Abstract

This paper presents a deep learning approach for generating descriptive captions for images by combining deep-learning-driven image understanding with text generation. The rapid progress in deep learning has greatly enhanced machines' capacity to interpret visual information and generate natural-language descriptions. Among these advancements, Transformer architectures have proven especially effective due to their self-attention mechanisms, which enable models to better capture global image-text relationships. Traditional image captioning models that use CNNs for visual encoding and RNNs for sentence generation often struggle to model long-range dependencies and produce contextually rich descriptions.

To address these challenges, this study introduces an intelligent image captioning framework that utilizes InceptionV3 for visual feature extraction and a custom-built Transformer-based encoder-decoder architecture for generating textual descriptions. Unlike conventional systems that depend on pre-trained language models, the decoder in this work is implemented and optimized exclusively for image captioning. Comprehensive training and evaluation are conducted using a large-scale benchmark dataset, MS COCO, to validate the system's effectiveness across diverse image domains. This paper aims to integrate image understanding and linguistic reasoning by providing a practical framework with applications in accessibility, digital asset management, and automated content generation.
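As a rough sketch of the pipeline described above (not the authors' released code), the TensorFlow/Keras snippet below pairs a frozen InceptionV3 feature extractor with a single decoder block that applies masked self-attention over the partial caption and cross-attention over the image features. The layer sizes (d_model = 256, 4 heads, dff = 1024) are illustrative assumptions, not values reported in the paper; use_causal_mask requires TensorFlow 2.10 or newer.

    import tensorflow as tf

    # InceptionV3 as a frozen visual encoder (ImageNet weights). Its final
    # 8x8x2048 feature map is flattened into 64 "visual tokens" for attention.
    cnn = tf.keras.applications.InceptionV3(include_top=False, weights="imagenet")
    cnn.trainable = False

    def extract_features(images):
        """images: float tensor of shape (batch, 299, 299, 3) in [0, 255]."""
        x = tf.keras.applications.inception_v3.preprocess_input(images)
        fmap = cnn(x)                                            # (batch, 8, 8, 2048)
        return tf.reshape(fmap, (tf.shape(fmap)[0], 64, 2048))   # (batch, 64, 2048)

    # One Transformer decoder block: masked self-attention over the partial
    # caption, cross-attention over the visual tokens, then a feed-forward net.
    class DecoderBlock(tf.keras.layers.Layer):
        def __init__(self, d_model=256, num_heads=4, dff=1024):  # illustrative sizes
            super().__init__()
            self.self_attn = tf.keras.layers.MultiHeadAttention(num_heads, d_model // num_heads)
            self.cross_attn = tf.keras.layers.MultiHeadAttention(num_heads, d_model // num_heads)
            self.ffn = tf.keras.Sequential([
                tf.keras.layers.Dense(dff, activation="relu"),
                tf.keras.layers.Dense(d_model),
            ])
            self.norm1 = tf.keras.layers.LayerNormalization()
            self.norm2 = tf.keras.layers.LayerNormalization()
            self.norm3 = tf.keras.layers.LayerNormalization()

        def call(self, captions, visual_tokens):
            # captions: (batch, T, d_model) token embeddings with positions added.
            # The causal mask stops each position from attending to future words.
            x = self.norm1(captions + self.self_attn(captions, captions, use_causal_mask=True))
            x = self.norm2(x + self.cross_attn(x, visual_tokens))
            return self.norm3(x + self.ffn(x))

A full model would embed caption tokens with positional encodings, stack several such blocks, and end with a softmax over the vocabulary to predict the next word at each position.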

Published

2025-12-24

How to Cite

[1] A. Adhikari, M. Dahal, R. Nepal, and P. Shilpakar, “Image Captioning using Transformer Model”, ICICSET2025, vol. 2, no. 1, Dec. 2025.