Automated Highlights Detection
Master's thesis research using YouTube's "Most Replayed" feature for automated highlight detection in long-form videos, achieving competitive performance with transformer-based models.
Research Overview
My Master's thesis addressed a critical gap in video analysis: traditional highlight detection requires expensive manual labeling (~$15,000 for 10,000 videos) and focuses mainly on short videos (<5 minutes). I developed a novel approach using YouTube's "Most Replayed" feature as training data for long-form videos (3.5-30 minutes).
This was the first research to leverage real user engagement data from YouTube's "Most Replayed" feature as a training signal, providing authentic feedback about what viewers find most engaging rather than relying on subjective human annotations.

YouTube's "Most Replayed" feature - the novel data source used for training highlight detection models
Technical Implementation
- Adapted Unified Multimodal Transformer (UMT) for highlight detection
- Multimodal fusion: Audio (PANN) + Video (I3D with RAFT optical flow)
- Cross-modal attention mechanisms for joint audio-visual learning
- 150+ YouTube videos across sports and comedy categories
- Video length: 3.5-30 minutes (much longer than typical benchmarks)
- Novel labeling: Percentile-based vs Z-score strategies from engagement data
Research Questions
- Can user engagement data replace manual annotations?
- Are category-specific models more effective than mixed-category models?
- How do different continuous-to-discrete labeling strategies affect performance?

Adapted UMT architecture with multimodal fusion of audio (PANN) and video (I3D) features
UMT Architecture Deep Dive
The Unified Multimodal Transformer (UMT) framework consists of several sophisticated components working together to process video and audio data for highlight detection:
Feature Extraction
- Audio: PANN (Pretrained Audio Neural Networks) extracts 500x2048 audio embeddings from WAV files using a Wavegram-Logmel-CNN architecture
- Video: I3D (Inflated 3D ConvNet) processes RGB frames and RAFT optical flow, producing 500x1024 embeddings for each stream
Uni-modal Encoders
- Transformer self-attention compensates for the local-only context of the feature extractors
- Through attention, each video segment receives global context about the entire video, which is crucial for judging highlight-worthiness
- Output: 500x256 contextually-aware representations for each modality
Cross-modal Bottleneck Fusion
- A bottleneck attention mechanism enables efficient multimodal fusion
- Compression stage: bottleneck tokens capture compressed features from all modalities
- Expansion stage: the compressed information is propagated back to enhance each modality
- Maintains linear computational complexity while minimizing noise incorporation
Prediction Head
- A Multi-Layer Perceptron (MLP) makes an individual prediction for each of the 500 video segments
- Each prediction leverages the rich, contextually-aware representations from the previous layers
- Outputs a probability score indicating highlight-worthiness for each temporal segment (a simplified sketch of the fusion and prediction stages follows this list)
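To make the compression/expansion flow concrete, here is a heavily simplified PyTorch sketch of the bottleneck fusion and per-segment MLP head. The 500-segment length and 256-dimensional features follow the description above; the class name, layer counts, and hyperparameters are illustrative assumptions, not the thesis implementation.

```python
import torch
import torch.nn as nn

class BottleneckFusionHead(nn.Module):
    """Toy bottleneck fusion + per-segment scoring head (illustrative only)."""

    def __init__(self, dim: int = 256, n_bottleneck: int = 4, n_heads: int = 4):
        super().__init__()
        self.bottleneck = nn.Parameter(torch.randn(n_bottleneck, dim))
        self.compress = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.expand_audio = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.expand_video = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.head = nn.Sequential(            # per-segment MLP prediction head
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1)
        )

    def forward(self, audio: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        # audio, video: (batch, 500, 256) contextualized uni-modal features
        batch = audio.size(0)
        tokens = self.bottleneck.unsqueeze(0).expand(batch, -1, -1)
        joint = torch.cat([audio, video], dim=1)
        # Compression: bottleneck tokens attend to both modalities at once
        tokens, _ = self.compress(tokens, joint, joint)
        # Expansion: each modality is enhanced by the compressed bottleneck
        audio_f, _ = self.expand_audio(audio, tokens, tokens)
        video_f, _ = self.expand_video(video, tokens, tokens)
        # One highlight probability per temporal segment
        fused = torch.cat([audio_f, video_f], dim=-1)          # (batch, 500, 512)
        return torch.sigmoid(self.head(fused)).squeeze(-1)     # (batch, 500)

# Example: scores = BottleneckFusionHead()(torch.randn(1, 500, 256), torch.randn(1, 500, 256))
```

In the actual UMT framework the bottleneck transformer is stacked over several layers and trained jointly with the uni-modal encoders; this sketch only illustrates the information flow.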
Continuous-to-Binary Labeling Methodology
A critical innovation in this research was developing methods to convert YouTube's continuous "Most Replayed" engagement scores (0-100) into binary highlight labels for supervised learning. I developed and compared two distinct approaches:
Percentile-Based Labeling
This strategy adapts to the distribution of engagement within each video, ensuring a consistent proportion of highlights across the dataset.
- Formula: If engagement_score ≥ percentile_threshold, then label = 1
- Thresholds tested: 97th (3%), 90th (10%), 85th (15%), 80th (20%), 75th (25%), 70th (30%)
- Advantage: Consistent proportion of highlights per video, stable for model training
- Limitation: May label relatively unengaging moments as highlights in low-engagement videos
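As a concrete illustration of the rule above, here is a minimal NumPy sketch (the function and variable names are mine, not from the thesis code):

```python
import numpy as np

def percentile_labels(engagement: np.ndarray, percentile: float = 90.0) -> np.ndarray:
    """Label a segment as a highlight (1) if its score reaches the per-video percentile threshold."""
    threshold = np.percentile(engagement, percentile)   # e.g. 90th percentile -> top ~10%
    return (engagement >= threshold).astype(int)
```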
Z-Score-Based Labeling
This strategy identifies segments that are statistical outliers in engagement, capturing moments that deviate significantly from the video's average.
- Formula: If engagement_score > μ + (x × σ), then label = 1
- Where: μ = mean engagement, σ = standard deviation, and x is the multiplier (2.5 in this work)
- Constraint: Ensured each video has at least one highlight to prevent edge cases
- Advantage: Captures "true highlights" as significant deviations from average
- Limitation: Variable number of highlights per video, potentially inconsistent training data
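And a matching sketch of the z-score rule, including the at-least-one-highlight constraint (again illustrative, assuming the same per-video score array):

```python
import numpy as np

def zscore_labels(engagement: np.ndarray, x: float = 2.5) -> np.ndarray:
    """Label a segment as a highlight (1) if engagement > mean + x * std."""
    mu, sigma = engagement.mean(), engagement.std()
    labels = (engagement > mu + x * sigma).astype(int)
    if labels.sum() == 0:                 # constraint: every video keeps at least one highlight
        labels[engagement.argmax()] = 1
    return labels
```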
Percentile-based labeling generally outperformed z-score methods, particularly for sports content. The stability of having consistent highlight proportions proved more beneficial for model learning than the theoretical advantage of identifying statistical outliers. This finding validates the importance of dataset consistency in supervised learning scenarios.
Video Segmentation and Processing Pipeline
The processing pipeline transforms raw YouTube videos into model-ready data through several technical steps:
Data Collection
- Custom Python framework using PyTube for video download and the Google API for metadata
- LemnosLife API integration for "Most Replayed" data extraction (see the sketch below)
- Automated filtering: English language, 3.5-30 minutes duration, "Most Replayed" feature enabled
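For illustration, a minimal sketch of how "Most Replayed" data can be retrieved through the LemnosLife YouTube operational API; the endpoint and response field names below are assumptions based on the API's public documentation and may have changed between versions:

```python
import requests

def fetch_most_replayed(video_id: str) -> list[float]:
    """Fetch the ~100 normalized replay-intensity scores for a YouTube video."""
    resp = requests.get(
        "https://yt.lemnoslife.com/videos",
        params={"part": "mostReplayed", "id": video_id},
        timeout=30,
    )
    resp.raise_for_status()
    # Field names are assumptions and have varied between API versions.
    markers = resp.json()["items"][0]["mostReplayed"]["markers"]
    return [m["intensityScoreNormalized"] for m in markers]
```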
Segmentation
- YouTube provides engagement data as 100 equal-length segments
- Extended to 500 segments for finer granularity (1-3 second windows depending on video length)
- Original labels replicated 5x to maintain alignment with the engagement data (see the sketch below)
- Computational constraint: 500 segments maximum due to VRAM limitations during processing
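A minimal sketch of the 100-to-500 expansion described above (names are illustrative):

```python
import numpy as np

def expand_engagement(scores_100: np.ndarray, factor: int = 5) -> np.ndarray:
    """Replicate each of YouTube's 100 engagement values `factor` times (100 -> 500 segments)."""
    assert scores_100.shape[0] == 100
    return np.repeat(scores_100, factor)
```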
Infrastructure
- Dockerized environment ensuring reproducible feature extraction across computing platforms
- Parallel processing of audio and video streams with AWS S3 storage integration
- Database tracking for metadata, processing status, and file locations
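As a rough sketch of the kind of parallel upload step this involves (the bucket name, paths, and helper names are hypothetical, not taken from the thesis code):

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import boto3

def upload_features(feature_dir: str, bucket: str = "highlight-features") -> None:
    """Upload extracted feature files to S3 in parallel (illustrative only)."""
    s3 = boto3.client("s3")
    files = list(Path(feature_dir).glob("*.npy"))
    with ThreadPoolExecutor(max_workers=8) as pool:
        for path in files:
            pool.submit(s3.upload_file, str(path), bucket, f"features/{path.name}")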
Key Results & Findings
- Competitive Performance: Achieved mAP@5 of 0.8513 for sports videos, outperforming TVSum benchmark (0.8314)
- Novel Data Source Validated: Proved user engagement data can accurately identify highlights without manual annotation
- Category Insights: Mixed-category models outperformed category-specific ones for sports, while comedy benefited from specialization
- Labeling Strategy Impact: Percentile-based labeling generally outperformed Z-score methods for user engagement data
- Long-form Video Capability: Successfully extended highlight detection to videos 6x longer than typical benchmarks
Model Performance Visualization
The visualizations below show how well the model's predictions (orange line) align with actual user engagement patterns from YouTube's "Most Replayed" data (blue line). Red dots indicate the model's top 5 highlight predictions.

90th percentile threshold (10% highlights) - More precise but conservative predictions

70th percentile threshold (30% highlights) - Broader highlight detection
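For reference, a minimal matplotlib sketch of how such a comparison plot can be produced, assuming per-segment arrays of engagement scores and model predictions (array and function names are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_predictions(engagement: np.ndarray, predictions: np.ndarray) -> None:
    """Overlay model predictions on the 'Most Replayed' curve and mark the top 5 predictions."""
    segments = np.arange(len(engagement))
    top5 = np.argsort(predictions)[-5:]                       # indices of the 5 highest scores
    plt.plot(segments, engagement, color="tab:blue", label='"Most Replayed" engagement')
    plt.plot(segments, predictions, color="tab:orange", label="Model predictions")
    plt.scatter(top5, predictions[top5], color="red", zorder=3, label="Top 5 predictions")
    plt.xlabel("Video segment")
    plt.ylabel("Normalized score")
    plt.legend()
    plt.show()
```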
Performance Comparison with Benchmarks
Comparison of our models with established benchmark datasets shows competitive performance, especially considering our use of longer videos and novel labeling approach:
| Benchmark / Model | Highlights (%) | Metric | Score |
|---|---|---|---|
| QVHighlights (UMT) | 37.9% | mAP | 0.3985 |
| Our Comedy Model | 30% | mAP | 0.3357 |
| TVSum (UMT) | 33.2% | mAP@5 | 0.8314 |
| Our Sports Model (Best) | 30% | mAP@5 | 0.8513 |
| YouTube Highlights (UMT) | 43.2% | mAP | 0.7493 |
| Our Sports Model | 30% | mAP | 0.3920 |
Note: Our models were trained on significantly longer videos (3.5-30 minutes) compared to typical benchmarks (<5 minutes) and used real user engagement data rather than manual annotations.
Category-Specific vs Mixed-Category Results
One key finding was that the optimal modeling approach varies by content type:
| Model Type | Evaluation | mAP | mAP@5 |
|---|---|---|---|
| Comedy-Specific | Comedy | 0.3357 | 0.7285 |
| Mixed-Category (Balanced) | Comedy | 0.3077 | 0.6723 |
| Sports-Specific | Comedy | 0.3138 | 0.3598 |
| Mixed-Category (Fine-tuned) | Sports | 0.3920 | 0.8513 |
| Sports-Specific | Sports | 0.3228 | 0.6880 |
| Comedy-Specific | Sports | 0.3097 | 0.2544 |
Key insight: Comedy highlights benefit from category-specific models, while sports highlights perform better with mixed-category training, suggesting different underlying patterns in highlight characteristics.