Project Overview
This project develops a system that automatically detects spoilers in movie reviews using natural language processing and deep learning. By analyzing review text alongside movie plot information, the system can warn readers before they encounter spoilers, improving the user experience on movie review platforms and discussion forums. The project compares multiple deep learning architectures, including LSTM, BERT, and Longformer models, to identify the best-performing approach to spoiler detection.
Key Objectives
- Build a binary classifier to distinguish spoiler reviews from non-spoiler reviews
- Compare traditional deep learning (LSTM) with transformer-based models (BERT, Longformer)
- Leverage both review content and movie plot information for context-aware detection
- Handle long-form text effectively using advanced transformer architectures
Dataset
Source: IMDB Spoiler Dataset - movie reviews with spoiler annotations
Total Size: 143,055 movie reviews
Training Data:
- LSTM model: 20,000 balanced samples (10,000 spoilers + 10,000 non-spoilers)
- Transformer models: 10,000 balanced samples (5,000 spoilers + 5,000 non-spoilers)
Key Features:
- review_text: User-written movie review content
- plot_summary & plot_synopsis: Movie plot information
- review_summary: Short summary of the review
- Metadata: Movie genre, ratings, token counts
- Target: is_spoiler (binary: 0 = non-spoiler, 1 = spoiler)
Key Insight: 58% of spoilers have combined review + plot text under 512 tokens, and longer reviews are significantly more likely to contain spoilers.
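The balanced training sets above (e.g. 10,000 spoilers + 10,000 non-spoilers for the LSTM) can be drawn with a simple pandas helper. `balanced_sample` is an illustrative sketch, not the project's actual code; it only assumes the `is_spoiler` column described in the feature list.

```python
import pandas as pd

def balanced_sample(df: pd.DataFrame, n_per_class: int, seed: int = 42) -> pd.DataFrame:
    """Draw n_per_class spoiler and n_per_class non-spoiler reviews, shuffled (sketch)."""
    spoilers = df[df["is_spoiler"] == 1].sample(n=n_per_class, random_state=seed)
    non_spoilers = df[df["is_spoiler"] == 0].sample(n=n_per_class, random_state=seed)
    balanced = pd.concat([spoilers, non_spoilers])
    # Shuffle so classes are interleaved rather than stacked
    return balanced.sample(frac=1, random_state=seed).reset_index(drop=True)
```

Sampling with a fixed seed keeps the train/validation/test composition reproducible across runs.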
Methods & Techniques
Data Preprocessing Pipeline:
- Text lowercasing and unicode normalization
- Contraction expansion (e.g., "don't" → "do not")
- URL, HTML tag, and special character removal
- Lemmatization (for LSTM only)
- Balanced sampling to address class imbalance
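The text-cleaning steps above (minus lemmatization, which needs NLTK or spaCy) can be sketched in pure Python. `clean_review` and the abridged contraction table are illustrative, not the project's actual pipeline:

```python
import re
import unicodedata

# Abridged contraction table for illustration; a real pipeline would use a fuller mapping
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "won't": "will not", "it's": "it is"}

def clean_review(text: str) -> str:
    """Lowercase, normalize unicode, expand contractions, strip URLs/HTML/special chars."""
    text = unicodedata.normalize("NFKC", text).lower()
    for contraction, expansion in CONTRACTIONS.items():
        text = text.replace(contraction, expansion)
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"<[^>]+>", " ", text)       # HTML tags
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # special characters
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace
```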
Deep Learning Models:
- Bidirectional LSTM
- BERT (bert-base-uncased)
- Longformer (4096-token context)
- Two-Tower BERT ensemble
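A minimal PyTorch sketch of the bidirectional LSTM classifier, sized for the 100-dimensional Word2Vec embeddings noted under Optimization Techniques (pretrained vectors would be copied into the embedding layer). `SpoilerBiLSTM`, the vocabulary size, and the hidden size are illustrative assumptions, not the project's exact configuration:

```python
import torch
import torch.nn as nn

class SpoilerBiLSTM(nn.Module):
    """Bidirectional LSTM over token embeddings with a binary spoiler head (sketch)."""
    def __init__(self, vocab_size: int = 20000, embed_dim: int = 100, hidden: int = 128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, 1)  # 2x: forward + backward states

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        embedded = self.embedding(token_ids)            # (batch, seq, embed_dim)
        _, (h_n, _) = self.lstm(embedded)               # h_n: (2, batch, hidden)
        final = torch.cat([h_n[-2], h_n[-1]], dim=1)    # last forward + backward states
        return self.classifier(final).squeeze(-1)       # logits, shape (batch,)
```

Training would pair these logits with `nn.BCEWithLogitsLoss` against the `is_spoiler` target.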
Optimization Techniques:
- Word2Vec embeddings (100-dimensional) for LSTM
- LoRA (Low-Rank Adaptation) for efficient transformer fine-tuning
- Early stopping on validation loss to prevent overfitting
- Gradient checkpointing for memory efficiency
- 70-10-20 train-validation-test split
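The 70-10-20 split above can be sketched with a seeded shuffle; `split_70_10_20` is a hypothetical helper, not the project's code:

```python
import random

def split_70_10_20(items, seed: int = 42):
    """Shuffle and split into 70% train / 10% validation / 20% test (sketch)."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded for reproducibility
    n = len(items)
    n_train, n_val = int(0.7 * n), int(0.1 * n)
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]
```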
Results & Performance
- Best test accuracy: 71.5% (Longformer)
- Maximum sequence length: 4096 tokens (Longformer)
Key Findings
- Best Model: Longformer achieved the highest test accuracy (71.5%), demonstrating the advantage of longer context windows for spoiler detection
- Context Length Matters: Longformer's 4096-token capacity significantly outperformed BERT's 512-token limit, as spoilers often require extended context
- Review Length is Predictive: Review length was the strongest single predictor of spoilers; longer reviews are substantially more likely to contain them
- Metadata Limitations: Movie genre and ratings showed no meaningful correlation with spoiler likelihood
- Ensemble Underperformance: The two-tower BERT ensemble reached only 43% accuracy, indicating that independently trained towers are insufficient and that tighter joint modeling of review and plot text is needed
- NLU is Essential: Natural language understanding of both review content and plot context is crucial for accurate spoiler detection
Technologies Used
- Core: Python, PyTorch, TensorFlow/Keras
- NLP: Hugging Face Transformers (BERT, Longformer), NLTK, spaCy, Gensim (Word2Vec)
- Fine-tuning: PEFT (LoRA)
- Data & evaluation: scikit-learn, pandas, NumPy
- Hardware: CUDA/GPU