Spoiler Alert: NLP Classification

Deep Learning for Automatic Spoiler Detection in Movie Reviews


Project Overview

This project builds a system that automatically detects spoilers in movie reviews using natural language processing and deep learning. By analyzing review text alongside movie plot information, the system can warn readers before they encounter spoilers on movie review platforms and discussion forums. The project compares several deep learning architectures (LSTM, BERT, and Longformer) to find the best-performing spoiler detector.

Key Objectives

Dataset

Source: IMDB Spoiler Dataset - movie reviews with spoiler annotations

Total Size: 143,055 movie reviews

Training Data:

Key Features:

Key Insight: 58% of spoilers have combined review + plot text under 512 tokens, and longer reviews are significantly more likely to contain spoilers.
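
The 512-token figure above comes down to a length count over the combined review and plot text. The sketch below is a minimal illustration, not the project's code: `fraction_within_limit` and the sample pairs are hypothetical, and whitespace splitting is only a stand-in for a real subword tokenizer (such as the one used with bert-base-uncased), which would generally produce more tokens for the same text.

```python
from typing import Callable, Iterable, List, Tuple


def fraction_within_limit(
    pairs: Iterable[Tuple[str, str]],
    limit: int = 512,
    tokenize: Callable[[str], List[str]] = str.split,
) -> float:
    """Return the share of (review, plot) pairs with fewer than `limit` combined tokens.

    `tokenize` defaults to whitespace splitting, which is an assumption made
    for illustration; swap in a subword tokenizer for realistic counts.
    """
    total = 0
    within = 0
    for review, plot in pairs:
        total += 1
        n_tokens = len(tokenize(review)) + len(tokenize(plot))
        if n_tokens < limit:
            within += 1
    # Avoid dividing by zero when no pairs are supplied.
    return within / total if total else 0.0


# Hypothetical toy data: two short pairs and one pair well over the limit.
sample_pairs = [
    ("great twist at the end", "a detective hunts a killer"),
    ("slow but rewarding", "two friends travel across the country"),
    ("word " * 600, "plot " * 50),
]
share = fraction_within_limit(sample_pairs, limit=512)
```

Passing a different `tokenize` callable is enough to repeat the count with another tokenizer, and raising `limit` to 4096 gives the corresponding share for a Longformer-sized window.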

Methods & Techniques

Data Preprocessing Pipeline:

Deep Learning Models:

  • Bidirectional LSTM
  • BERT (bert-base-uncased)
  • Longformer (4096 tokens)
  • Ensemble (Two-Tower BERT)

Optimization Techniques:

Results & Performance

  • Longformer Accuracy: 71.5%
  • BERT Accuracy: 68.1%
  • LSTM Accuracy: 65.0%
  • Max Tokens (Longformer): 4096

Key Findings

  • Best Model: Longformer achieved the highest test accuracy (71.5%), demonstrating the advantage of longer context windows for spoiler detection
  • Context Length Matters: With its 4096-token capacity, Longformer outperformed BERT and its 512-token limit by 3.4 percentage points (71.5% vs. 68.1%), consistent with spoilers often requiring extended context
  • Review Length is Predictive: The most significant predictor of spoilers is review length - longer reviews are substantially more likely to contain spoilers
  • Metadata Limitations: Movie genre and ratings showed no meaningful correlation with spoiler likelihood
  • Ensemble Underperformance: The two-tower BERT ensemble reached only 43% accuracy, which suggests that better joint modeling strategies are needed rather than independent tower predictions
  • NLU is Essential: Natural language understanding of both review content and plot context is crucial for accurate spoiler detection
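
The finding that review length is predictive can be illustrated with a length-only baseline: pick a word-count threshold and label every review at or above it as a spoiler. This is a rough sketch under stated assumptions, not the project's analysis; `best_length_threshold` and the toy data are hypothetical and only show how such a baseline could be measured.

```python
from typing import Sequence, Tuple


def best_length_threshold(
    lengths: Sequence[int], labels: Sequence[int]
) -> Tuple[int, float]:
    """Find the length threshold that best separates spoilers by length alone.

    A review is predicted as a spoiler (1) when its length is >= threshold.
    Returns (threshold, accuracy) for the best threshold among observed
    lengths; ties keep the smallest threshold.
    """
    if len(lengths) != len(labels) or not lengths:
        raise ValueError("lengths and labels must be non-empty and the same size")
    best = (0, 0.0)
    for threshold in sorted(set(lengths)):
        correct = sum(
            int((length >= threshold) == bool(label))
            for length, label in zip(lengths, labels)
        )
        accuracy = correct / len(labels)
        if accuracy > best[1]:
            best = (threshold, accuracy)
    return best


# Hypothetical toy data: review word counts and spoiler labels (1 = spoiler).
toy_lengths = [40, 55, 90, 120, 300, 410, 520, 75]
toy_labels = [0, 0, 0, 1, 1, 1, 1, 1]
threshold, accuracy = best_length_threshold(toy_lengths, toy_labels)
```

A baseline like this gives a reference point for the learned models: a text classifier is only adding value to the extent that it beats what length alone achieves on the same split.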

Technologies Used

Python, PyTorch, TensorFlow/Keras, Hugging Face Transformers, BERT, Longformer, NLTK, spaCy, Gensim (Word2Vec), PEFT (LoRA), scikit-learn, pandas, NumPy, CUDA/GPU