Research · 2024

Image Captioning System (CNN + Transformer)

End-to-end image-captioning system built around an InceptionV3 visual encoder and a custom multi-head Transformer decoder, trained on COCO. The architecture underpins the IEEE-published paper "AI Narratives: Bridging Visual Content and Linguistic Expression"; this repository lifts the original Kaggle research notebook into a typed, tested, configuration-driven Python package with Pydantic v2 configs, mypy-strict typing, 37 unit tests, and a four-stage notebook parity audit gated by SHA-256. The serving layer is a production-style FastAPI service with a lifespan-managed CaptionPredictor singleton and structured logging with per-request UUIDs, fronted by a React 19 + Vite 8 + Tailwind v4 SPA that drives multipart uploads against POST /v1/captions with AbortController-based timeouts and a typed ApiError boundary. The reference BLEU-4 from the IEEE notebook is roughly 24; beam-search decoding, CIDEr / METEOR / ROUGE-L, and a stabilized COCO training run are in active iteration.

Technology stack

Python 3.10+ · TensorFlow 2.15 · InceptionV3 · Transformer · Pydantic v2 · FastAPI · React 19 · Vite 8 · pytest · mypy · COCO

Problem statement

The IEEE paper "AI Narratives: Bridging Visual Content and Linguistic Expression" introduced a CNN + Transformer architecture for scene-aware image captioning, but the supporting code lived in a Kaggle notebook — fine for reproducing the result, useless for evolving the system. The goal here was to lift the research artifact into a typed, tested, configuration-driven Python package that can be retrained, served, and benchmarked without reconstituting state from notebook cell order — while preserving the published architecture exactly through a SHA-256-locked parity audit.
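
As an illustration of what a SHA-256-locked parity check can look like, the sketch below fingerprints a canonical JSON encoding of an architecture spec and fails when it drifts from a locked digest. What the repository's four-stage audit actually hashes is not stated here, so the spec layout and the names fingerprint, check_parity, and NOTEBOOK_SPEC are assumptions made for the example.

```python
import hashlib
import json


def fingerprint(spec: dict) -> str:
    """Return a SHA-256 hex digest of a canonical JSON encoding of `spec`."""
    # sort_keys plus fixed separators make the encoding independent of dict order.
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_parity(spec: dict, locked_digest: str) -> None:
    """Raise ValueError if `spec` no longer matches the locked digest."""
    actual = fingerprint(spec)
    if actual != locked_digest:
        raise ValueError(f"parity drift: expected {locked_digest}, got {actual}")


# Hypothetical spec layout; the values mirror the numbers quoted in this write-up.
NOTEBOOK_SPEC = {
    "decoder": {"layers": 2, "heads": 8, "embedding_dim": 512},
    "encoder": {"layers": 1, "heads": 1},
    "image_size": [299, 299],
}
LOCKED = fingerprint(NOTEBOOK_SPEC)   # would normally be committed to the repo
check_parity(NOTEBOOK_SPEC, LOCKED)   # passes silently while the spec is unchanged
```

Hashing a canonical encoding rather than raw file bytes keeps the check insensitive to key order and whitespace while still catching any change to a value.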

Dataset & data

COCO 2017 captions: ~120,000 sampled caption-image pairs (data.sample_size in configs/base.yaml), tokenized by a TextVectorization layer adapted to a 15,000-token vocabulary, with an 80/20 train/val split. Images are resized to 299×299 for InceptionV3 ingestion; captions are capped at 40 tokens. The same preprocess_image_tensor runs in the tf.data training pipeline and at inference time, which rules out image-preprocessing skew between training and serving by construction.
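
To make the caption side of that preprocessing concrete, here is a dependency-free sketch of what a capped vocabulary and a fixed 40-token length amount to. It is a stand-in for Keras TextVectorization, not the project's code: build_vocab and vectorize are hypothetical helpers, and the sketch only lowercases and splits on whitespace, whereas TextVectorization's default standardization also strips punctuation.

```python
from collections import Counter

# Keras TextVectorization reserves id 0 for padding and id 1 for unknown tokens
# in integer output mode; the sketch follows the same convention.
PAD_ID, UNK_ID = 0, 1


def build_vocab(captions: list[str], max_tokens: int = 15_000) -> dict[str, int]:
    """Map the most frequent tokens to ids, leaving room for the two reserved ids."""
    counts = Counter(tok for cap in captions for tok in cap.lower().split())
    most_common = counts.most_common(max_tokens - 2)
    return {tok: i + 2 for i, (tok, _) in enumerate(most_common)}


def vectorize(caption: str, vocab: dict[str, int], max_len: int = 40) -> list[int]:
    """Convert a caption to exactly `max_len` ids: truncate, then right-pad."""
    ids = [vocab.get(tok, UNK_ID) for tok in caption.lower().split()][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


vocab = build_vocab(["a dog runs", "a cat sits", "a dog sleeps"])
example = vectorize("A dog flies", vocab, max_len=5)  # "flies" is out of vocabulary
```

The trailing zeros are the positions a masked loss ignores, which is why a fixed length of 40 does not penalize short captions.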

Architecture & design

Pretrained InceptionV3 (ImageNet, frozen) emits 64 spatial patches × 2048 channels per image (the final 8×8 feature map, flattened). A single-layer Transformer encoder with one attention head projects those features into the decoder embedding dimension. The decoder has two layers with eight attention heads, embedding_dim=512, and learned (not sinusoidal) positional embeddings, all preserved verbatim from the IEEE paper. Inference goes through CaptionPredictor.from_artifacts(), with a warmup() call at boot to remove the first-request latency cliff. The FastAPI service uses a lifespan-managed singleton so every request reuses one warm model; the React 19 + Vite 8 + Tailwind v4 frontend drives multipart uploads against POST /v1/captions with AbortController timeouts (3s health, 60s caption) and a typed ApiError boundary.
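
The lifespan-managed singleton can be sketched without any web framework. FastAPI accepts an async context manager through FastAPI(lifespan=...), and the example below drives one by hand. The CaptionPredictor here is a stand-in that models only the lifecycle; the "artifacts/" path, the predict method, and the demo helper are assumptions for illustration, not the repository's implementation.

```python
import asyncio
from contextlib import asynccontextmanager
from types import SimpleNamespace


class CaptionPredictor:
    """Stand-in for the real predictor; only the lifecycle is modelled here."""

    def __init__(self) -> None:
        self.warm = False
        self.calls = 0

    @classmethod
    def from_artifacts(cls, artifact_dir: str) -> "CaptionPredictor":
        # The real method would load weights and vocabulary from artifact_dir.
        return cls()

    def warmup(self) -> None:
        # Stands in for one throwaway forward pass before traffic arrives.
        self.warm = True

    def predict(self, image_bytes: bytes) -> str:
        self.calls += 1
        return f"caption #{self.calls}"


@asynccontextmanager
async def lifespan(app):
    # Startup: build and warm exactly one predictor for the whole process.
    app.state.predictor = CaptionPredictor.from_artifacts("artifacts/")  # hypothetical path
    app.state.predictor.warmup()
    yield
    # Shutdown: drop the reference so the model can be released.
    app.state.predictor = None


async def demo() -> tuple[bool, str, str]:
    app = SimpleNamespace(state=SimpleNamespace())  # mimics FastAPI's app.state
    async with lifespan(app):
        predictor = app.state.predictor
        # Two "requests" share the same warm instance.
        return predictor.warm, predictor.predict(b"img-1"), predictor.predict(b"img-2")


result = asyncio.run(demo())
```

Loading in the lifespan hook rather than at import time keeps test collection cheap and makes the startup cost explicit in one place.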

Training pipeline

Configuration is YAML validated by Pydantic v2 with extra="forbid", so typos in hyperparameters become load-time errors instead of silent drift. Env vars override at any nesting depth via the CAPTIONING__ prefix and double-underscore delimiter, which is useful for CI smoke runs and ablations. Training uses the Adam optimizer with a masked sparse-categorical cross-entropy loss and a masked accuracy metric; callbacks include EarlyStopping(patience=3). Phase 1b adds opt-in label smoothing, a cosine LR schedule, warmup steps, and a dropout-free validation path in configs/train/stabilized.yaml, which is identical to base.yaml except for those four flags, so any quality delta is attributable to them alone.
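
The override rule is easy to state in plain Python. The sketch below is not the project's loader: pydantic-settings offers this behaviour through its env_prefix and env_nested_delimiter settings, and whether the repository uses that or its own code is not stated here. apply_env_overrides, the train.label_smoothing key, and the sample values are hypothetical; only data.sample_size and the CAPTIONING__ prefix come from the text above.

```python
import copy


def apply_env_overrides(config: dict, environ: dict[str, str],
                        prefix: str = "CAPTIONING__", delimiter: str = "__") -> dict:
    """Return a copy of `config` with prefixed env vars applied at any depth."""
    result = copy.deepcopy(config)
    for name, raw in environ.items():
        if not name.startswith(prefix):
            continue  # unrelated environment variable
        path = name[len(prefix):].lower().split(delimiter)
        node = result
        for key in path[:-1]:
            if not isinstance(node.get(key), dict):
                raise KeyError(f"unknown config section in {name}")
            node = node[key]
        leaf = path[-1]
        if leaf not in node:
            # Same spirit as extra="forbid": unknown keys fail loudly.
            raise KeyError(f"unknown config key in {name}")
        current = node[leaf]
        # Leaf values are assumed to be bool, int, float, or str.
        if isinstance(current, bool):
            node[leaf] = raw.strip().lower() in {"1", "true", "yes"}
        else:
            node[leaf] = type(current)(raw)
    return result


BASE = {
    "data": {"sample_size": 120_000},
    "train": {"label_smoothing": False},  # hypothetical key
}
ENV = {
    "CAPTIONING__DATA__SAMPLE_SIZE": "2000",
    "CAPTIONING__TRAIN__LABEL_SMOOTHING": "true",
    "HOME": "/root",
}
SMOKE = apply_env_overrides(BASE, ENV)
```

A misspelled variable such as CAPTIONING__DATA__SAMPLE_SIZ raises KeyError here instead of being ignored, which is the property that makes a CI smoke run trustworthy.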

Results & performance

The reference BLEU-4 from the IEEE notebook is roughly 24. Beam-search decoding now lives at src/captioning/inference/beam.py and dispatches through the same predictor as greedy decoding. CIDEr / METEOR / ROUGE-L are implemented under src/captioning/evaluation/ and emitted into a single metrics.json per run; benchmarking artifacts (metrics.json, predictions.jsonl, diagnostics.jsonl, run_meta.json) are written to results/<run_id>/ on a versioned contract so any two runs can be diffed mechanically. Caption quality from the current modular pipeline is still being stabilized on a freshly trained COCO checkpoint. The serving stack is production-ready, but the bootstrap weights committed today are intentionally random and exist only to exercise lifespan + predictor + multipart upload + frontend integration end to end.
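
For readers unfamiliar with how beam search differs from greedy decoding, here is a minimal, model-free sketch. It is not the contents of beam.py: beam_search, the StepFn signature, and the toy probability table are assumptions for illustration, and the sketch applies no length normalization.

```python
import math
from typing import Callable, Sequence

# A step function maps a token prefix to {next_token: log-probability}.
StepFn = Callable[[Sequence[int]], dict[int, float]]


def beam_search(step_fn: StepFn, start_id: int, end_id: int,
                beam_width: int = 3, max_len: int = 40) -> list[int]:
    """Return the highest-scoring sequence found with a fixed-width beam."""
    beams = [([start_id], 0.0)]  # (tokens, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, logp in step_fn(tokens).items():
                candidates.append((tokens + [tok], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        # Keep the top beam_width candidates; completed ones leave the beam.
        for tokens, score in candidates[:beam_width]:
            (finished if tokens[-1] == end_id else beams).append((tokens, score))
        if not beams:
            break
    pool = finished or beams  # fall back to unfinished beams at max_len
    return max(pool, key=lambda c: c[1])[0]


def toy_step(prefix: Sequence[int]) -> dict[int, float]:
    """Toy model: token 0 starts, token 3 ends; the greedy first step is a trap."""
    table = {
        (0,): {1: math.log(0.6), 2: math.log(0.4)},
        (0, 1): {3: math.log(0.55), 2: math.log(0.45)},
        (0, 2): {3: math.log(0.9), 1: math.log(0.1)},
    }
    return table.get(tuple(prefix), {3: 0.0})


greedy = beam_search(toy_step, start_id=0, end_id=3, beam_width=1)
wider = beam_search(toy_step, start_id=0, end_id=3, beam_width=2)
```

With a width of 1 the search commits to token 1 and ends at a sequence of probability 0.33; with a width of 2 it keeps the less likely first token alive and finds the 0.36 sequence, which is the gain beam search trades extra compute for.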