Project Overview
This project develops a system that automatically detects spoilers in movie reviews using natural language processing and deep learning. By analyzing review text alongside movie plot information, the system can warn readers before they encounter spoilers, improving the user experience on movie review platforms and discussion forums. The project compares multiple deep learning architectures, including LSTM, BERT, and Longformer models, to identify the best-performing approach to spoiler detection.
Key Objectives
- Build a binary classifier to distinguish spoiler reviews from non-spoiler reviews
- Compare traditional deep learning (LSTM) with transformer-based models (BERT, Longformer)
- Leverage both review content and movie plot information for context-aware detection
- Handle long-form text effectively using advanced transformer architectures
Dataset
Source: IMDB Spoiler Dataset - movie reviews with spoiler annotations
Total Size: 143,055 movie reviews
Training Data:
- LSTM model: 20,000 balanced samples (10,000 spoilers + 10,000 non-spoilers)
- Transformer models: 10,000 balanced samples (5,000 spoilers + 5,000 non-spoilers)
Key Features:
- review_text: User-written movie review content
- plot_summary & plot_synopsis: Movie plot information
- review_summary: Short summary of the review
- Metadata: Movie genre, ratings, token counts
- Target: is_spoiler (binary: 0 = non-spoiler, 1 = spoiler)
Key Insight: 58% of spoilers have combined review + plot text under 512 tokens, and longer reviews are significantly more likely to contain spoilers.
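The balanced training sets above (e.g. 10,000 spoilers + 10,000 non-spoilers for the LSTM) can be drawn with a simple pandas helper. `balanced_sample` is an illustrative sketch, not the project's actual code; it only assumes the `is_spoiler` column described in the feature list.

```python
import pandas as pd

def balanced_sample(df: pd.DataFrame, n_per_class: int, seed: int = 42) -> pd.DataFrame:
    """Draw n_per_class spoiler and n_per_class non-spoiler reviews, shuffled (sketch)."""
    spoilers = df[df["is_spoiler"] == 1].sample(n=n_per_class, random_state=seed)
    non_spoilers = df[df["is_spoiler"] == 0].sample(n=n_per_class, random_state=seed)
    balanced = pd.concat([spoilers, non_spoilers])
    # Shuffle so classes are interleaved rather than stacked
    return balanced.sample(frac=1, random_state=seed).reset_index(drop=True)
```

Sampling with a fixed seed keeps the train/validation/test composition reproducible across runs.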
Methods & Techniques
Data Preprocessing Pipeline:
- Text lowercasing and unicode normalization
- Contraction expansion (e.g., "don't" → "do not")
- URL, HTML tag, and special character removal
- Lemmatization (for LSTM only)
- Balanced sampling to address class imbalance
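The text-cleaning steps above (minus lemmatization, which needs NLTK or spaCy) can be sketched in pure Python. `clean_review` and the abridged contraction table are illustrative, not the project's actual pipeline:

```python
import re
import unicodedata

# Abridged contraction table for illustration; a real pipeline would use a fuller mapping
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "won't": "will not", "it's": "it is"}

def clean_review(text: str) -> str:
    """Lowercase, normalize unicode, expand contractions, strip URLs/HTML/special chars."""
    text = unicodedata.normalize("NFKC", text).lower()
    for contraction, expansion in CONTRACTIONS.items():
        text = text.replace(contraction, expansion)
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"<[^>]+>", " ", text)       # HTML tags
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # special characters
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace
```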
Deep Learning Models:
- Bidirectional LSTM
- BERT (bert-base-uncased)
- Longformer (4096-token context)
- Two-Tower BERT ensemble
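A minimal PyTorch sketch of the bidirectional LSTM classifier, sized for the 100-dimensional Word2Vec embeddings noted under Optimization Techniques (pretrained vectors would be copied into the embedding layer). `SpoilerBiLSTM`, the vocabulary size, and the hidden size are illustrative assumptions, not the project's exact configuration:

```python
import torch
import torch.nn as nn

class SpoilerBiLSTM(nn.Module):
    """Bidirectional LSTM over token embeddings with a binary spoiler head (sketch)."""
    def __init__(self, vocab_size: int = 20000, embed_dim: int = 100, hidden: int = 128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, 1)  # 2x: forward + backward states

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        embedded = self.embedding(token_ids)            # (batch, seq, embed_dim)
        _, (h_n, _) = self.lstm(embedded)               # h_n: (2, batch, hidden)
        final = torch.cat([h_n[-2], h_n[-1]], dim=1)    # last forward + backward states
        return self.classifier(final).squeeze(-1)       # logits, shape (batch,)
```

Training would pair these logits with `nn.BCEWithLogitsLoss` against the `is_spoiler` target.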
Optimization Techniques:
- Word2Vec embeddings (100-dimensional) for LSTM
- LoRA (Low-Rank Adaptation) for efficient transformer fine-tuning
- Early stopping on validation loss to prevent overfitting
- Gradient checkpointing for memory efficiency
- 70-10-20 train-validation-test split
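The 70-10-20 split above can be sketched with a seeded shuffle; `split_70_10_20` is a hypothetical helper, not the project's code:

```python
import random

def split_70_10_20(items, seed: int = 42):
    """Shuffle and split into 70% train / 10% validation / 20% test (sketch)."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded for reproducibility
    n = len(items)
    n_train, n_val = int(0.7 * n), int(0.1 * n)
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]
```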
Results & Performance
- Best test accuracy: 71.5% (Longformer)
- Maximum sequence length: 4096 tokens (Longformer)
Key Findings
- Best Model: Longformer achieved the highest test accuracy (71.5%), demonstrating the advantage of longer context windows for spoiler detection
- Context Length Matters: Longformer's 4096-token capacity significantly outperformed BERT's 512-token limit, as spoilers often require extended context
- Review Length is Predictive: Review length was the strongest single predictor of spoilers; longer reviews are substantially more likely to contain them
- Metadata Limitations: Movie genre and ratings showed no meaningful correlation with spoiler likelihood
- Ensemble Underperformance: The two-tower BERT ensemble reached only 43% accuracy, indicating that independently trained towers are insufficient and that tighter joint modeling of review and plot text is needed
- NLU is Essential: Natural language understanding of both review content and plot context is crucial for accurate spoiler detection
Technologies Used
- Core: Python, PyTorch, TensorFlow/Keras
- NLP: Hugging Face Transformers (BERT, Longformer), NLTK, spaCy, Gensim (Word2Vec)
- Fine-tuning: PEFT (LoRA)
- Data & evaluation: scikit-learn, pandas, NumPy
- Hardware: CUDA/GPU