
📊 Data Analysis Case Study: TikTok Claim vs. Opinion Video Classification


2/8/2026 · 3 min read

Exploratory data analysis and visualization to support automated content moderation.

Business Context / Overview

Background:
TikTok’s mission is to inspire creativity and bring joy. With millions of user-generated videos, the platform receives a high volume of user reports flagging content as “claims” (factual assertions) or “opinions.” Moderators face a growing backlog, slowing response times and affecting user trust.

Why this analysis was required:
To lay the analytical groundwork for a predictive model that automatically classifies videos, reducing manual review time and improving moderation efficiency.

Who benefits:

  • TikTok’s Content Moderation Team

  • Data Science and Operations Teams

  • End-users through faster, more accurate content handling

Problem Statement

TikTok’s manual review process for user-reported videos is inefficient and unsustainable due to volume. Without a clear, data-driven understanding of what distinguishes a “claim” from an “opinion,” the team cannot build an accurate classification model, leading to delayed moderation, potential misinformation spread, and user dissatisfaction.

Objectives

  • Perform comprehensive EDA to uncover patterns in video metrics between claims and opinions.

  • Identify key distinguishing features (e.g., view count, author status, engagement metrics) that could predict video type.

  • Deliver clear, accessible visualizations for both technical and non-technical stakeholders.

  • Prepare clean, analysis-ready data for future machine learning modeling.

Dataset Description

Source: Internal TikTok dataset (tiktok_dataset.csv)
Records: 19,382 videos
Features: 12 columns including:

  • claim_status (claim/opinion)

  • video_view_count, video_like_count, video_share_count

  • author_ban_status, verified_status

  • video_duration_sec, video_transcription_text

Time Period: Not specified (static snapshot)

Data Preparation & Cleaning

  • Handled missing values in claim_status and engagement metrics.

  • Verified and corrected data types for numerical vs. categorical fields.

  • Assessed outliers using IQR and median-based thresholds.

  • Standardized column names and ensured consistent formatting.
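The cleaning steps above can be sketched in pandas. The DataFrame below is a small synthetic stand-in (the real tiktok_dataset.csv is internal), and the mixed-case column name is a hypothetical example of the inconsistent formatting being standardized:

```python
import pandas as pd

# Small synthetic stand-in for tiktok_dataset.csv (the real file is internal).
data = pd.DataFrame({
    "claim_status": ["claim", "opinion", None, "claim"],
    "video_view_count": ["1200", "300", "450", None],
    "Video Duration Sec": [32, 18, 25, 41],
})

# 1. Drop rows missing the label or key engagement metrics.
data = data.dropna(subset=["claim_status", "video_view_count"])

# 2. Coerce numeric columns that arrived as strings.
data["video_view_count"] = pd.to_numeric(data["video_view_count"])

# 3. Standardize column names: trimmed, lowercase, snake_case.
data.columns = data.columns.str.strip().str.lower().str.replace(" ", "_")

print(data.dtypes)
print(len(data))  # 2 rows survive the dropna
```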

Exploratory Data Analysis (EDA)

  • Distribution analysis of video duration, views, likes, shares, downloads, and comments.

  • Comparison of claims vs. opinions across verified status and ban status.

  • Outlier detection for engagement metrics using non-parametric methods.

  • Correlation exploration between view count, likes, and claim status.

Analytical Approach

  • Statistical summaries using .describe() and .groupby() methods.

  • Visual distribution checks via boxplots and histograms.

  • Segmentation analysis by author ban status and verification status.

  • Threshold-based outlier identification using median + 1.5 * IQR.
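The threshold in the last bullet (median + 1.5 × IQR rather than the textbook Q3 + 1.5 × IQR, since anchoring on the median is more conservative for heavily right-skewed engagement counts) might look like this; the helper name and the sample series are illustrative:

```python
import pandas as pd

def count_outliers(series: pd.Series) -> int:
    """Count values above median + 1.5 * IQR, a stricter cutoff than the
    usual Q3 + 1.5 * IQR, suited to heavily right-skewed engagement data."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    threshold = series.median() + 1.5 * (q3 - q1)
    return int((series > threshold).sum())

# Synthetic right-skewed example (real engagement counts are internal):
# one viral video dwarfs the rest, as the skew noted above suggests.
views = pd.Series([10, 20, 30, 40, 50, 60, 70, 5000])
print(count_outliers(views))  # → 1
```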

Key Insights

  • Claim videos receive a far higher median view count (~501,000) than opinion videos (~4,953).

  • Verified users are more likely to post opinions, while unverified users post more claims.

  • Authors under review or banned post more claim videos and receive disproportionately high view counts.

  • Engagement metrics (likes, shares, downloads) are highly right-skewed, with a small percentage of videos driving most engagement.

Visualizations & Reporting

  • Python (Matplotlib/Seaborn):

    • Boxplots & histograms for all key metrics

    • Claim vs. opinion bar charts by author status

    • Scatter plots of views vs. likes colored by claim status
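A minimal Matplotlib/Seaborn sketch of the first chart type above, using synthetic view counts since the real data is internal; the log-scaled y-axis is an assumption that makes the skewed distribution readable:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Synthetic stand-in mirroring the claim-vs-opinion view gap.
data = pd.DataFrame({
    "claim_status": ["claim"] * 5 + ["opinion"] * 5,
    "video_view_count": [400_000, 520_000, 610_000, 480_000, 550_000,
                         3_000, 5_200, 4_100, 6_100, 2_800],
})

# Boxplot of views split by claim status, log scale to tame the skew.
fig, ax = plt.subplots(figsize=(6, 4))
sns.boxplot(data=data, x="claim_status", y="video_view_count", ax=ax)
ax.set_yscale("log")
ax.set_title("Video view count by claim status")
fig.savefig("views_by_claim_status.png", dpi=150)
```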

Video Type Distribution:

What percentage of videos are claims vs. opinions?

Engagement Metrics Analysis:

View Count Distribution by Video Type

Author Status Breakdown:

Correlation and Relationship Analysis:

Correlation Heatmap

Views vs Likes Scatter with Trendline:

Outlier Analysis:

Outlier Detection Across Metrics

Summary Dashboard:

Executive Summary Visualization

Accessibility-Friendly Version:

For stakeholders with visual impairments

Tools & Tech Stack

  • Language: Python

  • Libraries: Pandas, NumPy, Matplotlib, Seaborn

  • Environment: Jupyter Notebook

  • Visualization Tool: Tableau Public

  • Data Storage: CSV

Challenges & Limitations

  • Highly skewed engagement data required non-standard outlier treatment.

  • Missing values in key columns reduced usable sample size.

  • Text data (video_transcription_text) was not analyzed in this phase.

  • Domain knowledge required to interpret “ban status” and “verified status” accurately.

Results & Business Impact

  • Clear feature understanding now guides model feature selection.

  • Stakeholder alignment achieved through accessible Tableau dashboards.

  • Moderation team can now prioritize videos based on risk indicators (e.g., high-view claims from unverified authors).

  • The data pipeline is ready for building a classification model.

Recommendations

  • Build a binary classification model using view_count, author_ban_status, and verified_status as top features.

  • Implement automated outlier flagging for viral content to prevent model bias.

  • Expand analysis to text data (NLP) for transcription-based classification.

  • Create real-time dashboards for ongoing monitoring of claim vs. opinion trends.
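The first recommendation could be prototyped with scikit-learn. Everything below is a hedged sketch: the data is synthetic, the toy label rule merely mimics the insight that high-view videos skew toward claims, and logistic regression is one reasonable baseline that the case study itself does not specify:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in using the recommended features (real data is internal).
rng = np.random.default_rng(42)
n = 500
data = pd.DataFrame({
    "video_view_count": rng.lognormal(mean=8, sigma=2, size=n),
    "author_ban_status": rng.choice(["active", "under review", "banned"], size=n),
    "verified_status": rng.choice(["verified", "not verified"], size=n),
})
# Toy label: high-view videos are claims (loosely mirrors the EDA insight).
data["claim_status"] = (
    data["video_view_count"] > data["video_view_count"].median()
).astype(int)

# One-hot encode the categoricals; log-transform the skewed view counts.
X = pd.get_dummies(data[["author_ban_status", "verified_status"]])
X["log_views"] = np.log1p(data["video_view_count"])
y = data["claim_status"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"holdout accuracy: {model.score(X_test, y_test):.2f}")
```

On real data the label would come from claim_status itself, and class imbalance and viral-video outliers (see the flagging recommendation above) would need explicit handling before trusting accuracy alone.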

Key Learnings

  • Business context is critical—understanding TikTok’s moderation needs shaped the analysis direction.

  • Right-skewed social media data requires tailored statistical approaches.

  • Visualization accessibility ensures insights are actionable for all stakeholders.

  • Clean, documented EDA accelerates downstream modeling and decision-making.

Project Status

Status: ✅ Completed