Generating Sanskrit Captions for Images using Transformers and Long Short-Term Memory
DOI:
https://doi.org/10.65091/icicset.v2i1.10

Abstract
This study addresses the challenge of automated image captioning for Sanskrit, a low-resource language that lacks dedicated models and datasets for vision-language tasks. We propose an encoder-decoder architecture that integrates a Vision Transformer (ViT) for visual feature extraction with a Long Short-Term Memory (LSTM) decoder for generating syntactically coherent Sanskrit captions. We curated and utilized a dataset of 40,000 image-caption pairs, with English captions from Flickr manually translated into Sanskrit. On the validation set, the trained model achieved BLEU-1, BLEU-2, BLEU-3, BLEU-4, and ROUGE-L scores of 0.3082, 0.1843, 0.1115, 0.0639, and 0.3472, respectively.
This work represents a significant advancement in the processing of the Sanskrit language within computer vision, with applications in multimedia retrieval for digital archives, automated content analysis of cultural heritage materials, and the development of assistive accessibility tools.
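To make the described encoder-decoder pairing concrete, the following is a minimal sketch of a ViT-encoder plus LSTM-decoder captioner in PyTorch. The specific backbone (torchvision's vit_b_16), projection layer, hidden sizes, and vocabulary size are illustrative assumptions and not details taken from the paper.

```python
# Sketch of a ViT + LSTM captioning model (illustrative; not the paper's exact implementation).
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights


class ViTLSTMCaptioner(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Vision Transformer encoder; pretrained weights are downloaded on first use.
        self.vit = vit_b_16(weights=ViT_B_16_Weights.DEFAULT)
        self.vit.heads = nn.Identity()                     # drop the classification head
        self.img_proj = nn.Linear(768, hidden_dim)         # project image feature to LSTM state size

        # LSTM decoder over Sanskrit token ids.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # images: (B, 3, 224, 224); captions: (B, T) token ids used with teacher forcing.
        feats = self.img_proj(self.vit(images))            # (B, hidden_dim) pooled image feature
        h0 = feats.unsqueeze(0)                            # (1, B, hidden_dim) initial hidden state
        c0 = torch.zeros_like(h0)
        emb = self.embed(captions)                         # (B, T, embed_dim)
        hidden, _ = self.lstm(emb, (h0, c0))               # (B, T, hidden_dim)
        return self.out(hidden)                            # (B, T, vocab_size) next-token logits


# Example: one training-style forward pass with dummy data and an assumed 8,000-token vocabulary.
model = ViTLSTMCaptioner(vocab_size=8000)
images = torch.randn(2, 3, 224, 224)
captions = torch.randint(0, 8000, (2, 12))
logits = model(images, captions)
print(logits.shape)  # torch.Size([2, 12, 8000])
```

In this sketch the image feature initializes the LSTM hidden state, which is one common way to condition an LSTM decoder on visual features; other variants feed the image vector as the first decoder input instead.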