
Image Captioning with CNN-RNN, Attention and ViT-GPT2

Explored multiple deep learning architectures for automatic image captioning: a CNN-RNN baseline, an attention-augmented variant, and a fine-tuned ViT-GPT2 encoder–decoder model, comparing their quality using ROUGE scores and test loss.

View project report


Introduction

This project investigates different neural architectures for image captioning, from classic CNN-RNN pipelines to transformers. The goal was to move beyond toy examples and build a training-and-evaluation pipeline that makes it easy to benchmark ideas under the same data, metrics, and preprocessing.

We started with a strong baseline using a pre-trained Inception v3 as the feature extractor and a 2-layer LSTM decoder trained with teacher forcing. We then added an attention mechanism over the image features, and finally replaced the entire stack with a ViT-GPT2 encoder–decoder initialized from Hugging Face checkpoints and fine-tuned using Seq2SeqTrainer.
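The attention variant can be sketched in PyTorch. This is a minimal illustration, not the project's exact implementation: it uses additive (Bahdanau-style) attention over the flattened Inception v3 feature grid (8×8×2048) and a single `LSTMCell` with teacher forcing; all layer sizes here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttentionDecoder(nn.Module):
    """Sketch of an LSTM decoder with additive attention over CNN features.

    feature_dim=2048 matches Inception v3's final convolutional grid;
    embed_dim and hidden_dim are hypothetical choices.
    """
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, feature_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attn_feat = nn.Linear(feature_dim, hidden_dim)
        self.attn_hid = nn.Linear(hidden_dim, hidden_dim)
        self.attn_score = nn.Linear(hidden_dim, 1)
        self.lstm = nn.LSTMCell(embed_dim + feature_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, features, captions):
        # features: (B, N, feature_dim) flattened spatial grid, e.g. N = 8 * 8
        # captions: (B, T) token ids; teacher forcing feeds the gold token at
        # each step instead of the model's own previous prediction
        B, T = captions.shape
        h = features.new_zeros(B, self.lstm.hidden_size)
        c = features.new_zeros(B, self.lstm.hidden_size)
        logits = []
        for t in range(T - 1):
            # additive attention: score every spatial location against h
            scores = self.attn_score(torch.tanh(
                self.attn_feat(features) + self.attn_hid(h).unsqueeze(1)))  # (B, N, 1)
            alpha = torch.softmax(scores, dim=1)
            context = (alpha * features).sum(dim=1)  # weighted image summary
            x = torch.cat([self.embed(captions[:, t]), context], dim=1)
            h, c = self.lstm(x, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)  # (B, T-1, vocab_size)
```

At inference the loop would instead feed back the argmax (or sampled) token, since no gold caption is available.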

Key Takeaways

This project provided a practical comparison between classic and transformer-based captioning systems under a controlled setup. It also reinforced the importance of clean preprocessing, consistent evaluation, and strong baselines.
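To make the evaluation side concrete: the project reports ROUGE scores, which in practice come from an evaluation library, but the core idea is simple enough to sketch in plain Python. The snippet below is a minimal, assumption-laden implementation of ROUGE-L F1 (the longest-common-subsequence variant, with naive whitespace tokenization), shown only to illustrate what the metric measures.

```python
def lcs_len(a, b):
    # dynamic-programming longest common subsequence length
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 between a candidate caption and one reference caption.

    Illustrative sketch: lowercases and splits on whitespace, which is far
    cruder than the tokenization a real ROUGE implementation applies.
    """
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_len(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Because all three models were scored with the same metric on the same held-out captions, differences in ROUGE could be attributed to the architecture rather than to evaluation details.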