Language Analysis

The Four Whispers

Detecting Filing Language Divergence in SEC 10-K and 10-Q Filings

A quantitative framework for identifying meaningful language shifts that may precede material business changes.

Overview

This system monitors language patterns across quarterly and annual SEC filings for companies within defined peer groups. Rather than predicting outcomes, it detects divergence: moments when a company's filing language deviates significantly from its own historical baseline or from that of its sector peers. The methodology operates in two stages: first, detect statistical deviation; second, investigate what the deviation means. The system is an early warning instrument, not a crystal ball.

Methodology

Filing text is extracted from SEC EDGAR using structured HTML parsing (no LLM extraction). Each filing's MD&A and Risk Factors sections are processed through four analytical lenses: Loughran-McDonald Sentiment Scoring, Term Frequency Deviation (Z-Score), Cosine Similarity Drift, and Conspicuous Silence detection.
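
Two of the four lenses, Term Frequency Deviation and Cosine Similarity Drift, are standard enough to sketch in a few lines. The following is a minimal illustration only, not the system's actual implementation: the function names, the regex tokenizer, the per-1,000-token rate, and the use of raw term counts (rather than TF-IDF or another weighting) are all assumptions made for the example.

```python
import math
import re
from collections import Counter
from statistics import mean, stdev


def tokenize(text):
    """Lowercase a passage and split it into alphabetic tokens (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def term_rate(text, term):
    """Occurrences of `term` per 1,000 tokens; 0.0 for empty text."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return 1000.0 * tokens.count(term) / len(tokens)


def z_score(current, history):
    """Deviation of `current` from the historical mean, in sample standard deviations."""
    if len(history) < 2:
        return 0.0  # too little history to estimate a spread
    spread = stdev(history)
    if spread == 0:
        return 0.0  # flat history: this sketch reports no measurable deviation
    return (current - mean(history)) / spread


def cosine_similarity(text_a, text_b):
    """Cosine similarity between raw term-count vectors of two passages."""
    a, b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cosine_drift(previous_text, current_text):
    """Drift is 1 minus similarity: 0 means identical wording, 1 means no shared terms."""
    return 1.0 - cosine_similarity(previous_text, current_text)
```

In use, `term_rate` would be computed for the same section across a company's prior filings to build the `history` list passed to `z_score`, while `cosine_drift` would compare consecutive filings of the same section. The same `z_score` function could be pointed at peer values instead of a company's own history, though how the system actually combines the two baselines is not specified here.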

Data Pipeline

Source: SEC EDGAR (XBRL financials + filing HTML)
Extraction: edgartools structured API + BeautifulSoup HTML parsing
Storage: SQLite (la_deviations, la_flags, la_term_frequencies, la_lm_scores tables)
Sectors live: Managed Care, Cable-Broadband, Athletic Retail, Chinese ADR
Refresh: After each quarterly filing season
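
As a rough picture of the storage layer, the sketch below creates the four named tables with Python's built-in sqlite3 module. Only the table names come from the pipeline description above; every column, key, and type is a hypothetical guess at what such tables might hold, and the real schema may differ.

```python
import sqlite3

# Table names are from the pipeline description; all columns are hypothetical.
SCHEMA = """
CREATE TABLE IF NOT EXISTS la_term_frequencies (
    ticker      TEXT NOT NULL,
    period      TEXT NOT NULL,   -- e.g. '2024Q1' (assumed format)
    section     TEXT NOT NULL,   -- e.g. 'MD&A' or 'Risk Factors'
    term        TEXT NOT NULL,
    rate_per_1k REAL NOT NULL,
    PRIMARY KEY (ticker, period, section, term)
);
CREATE TABLE IF NOT EXISTS la_lm_scores (
    ticker         TEXT NOT NULL,
    period         TEXT NOT NULL,
    section        TEXT NOT NULL,
    negative_share REAL,
    positive_share REAL,
    PRIMARY KEY (ticker, period, section)
);
CREATE TABLE IF NOT EXISTS la_deviations (
    ticker  TEXT NOT NULL,
    period  TEXT NOT NULL,
    lens    TEXT NOT NULL,       -- which of the four lenses produced the value
    metric  TEXT NOT NULL,
    value   REAL NOT NULL,
    PRIMARY KEY (ticker, period, lens, metric)
);
CREATE TABLE IF NOT EXISTS la_flags (
    id      INTEGER PRIMARY KEY AUTOINCREMENT,
    ticker  TEXT NOT NULL,
    period  TEXT NOT NULL,
    lens    TEXT NOT NULL,
    note    TEXT
);
"""


def init_db(path=":memory:"):
    """Open (or create) the database and ensure the four tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def table_names(conn):
    """Return the names of the la_* tables, sorted alphabetically."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name LIKE 'la_%' ORDER BY name"
    )
    return [row[0] for row in rows]
```

Calling `init_db()` with no argument builds the tables in memory, which is convenient for experimenting before pointing it at a file on disk.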

Validation

The system was validated by examining whether flagged language deviations preceded known material events. Gate 3 hypothesis testing across the managed care and cable sectors confirmed that deviation flags clustered 1-2 quarters before analyst downgrades.
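
The lead-time idea behind this check can be sketched as follows. This is not the Gate 3 procedure itself, which is not described here; it only shows one way to measure how many quarters separate a flag from a later event and what share of pairs fall inside a 1-2 quarter window. The function names and the (year, quarter) tuple representation are assumptions made for the illustration.

```python
def quarter_index(year, quarter):
    """Map a (year, quarter) pair onto a single running count of quarters."""
    return year * 4 + (quarter - 1)


def lead_quarters(flag, event):
    """Quarters from a flag to an event; negative if the flag came after the event."""
    return quarter_index(*event) - quarter_index(*flag)


def share_within_window(pairs, lo=1, hi=2):
    """Fraction of (flag, event) pairs whose lead time falls within [lo, hi] quarters."""
    if not pairs:
        return 0.0
    hits = sum(1 for flag, event in pairs if lo <= lead_quarters(flag, event) <= hi)
    return hits / len(pairs)
```

For example, a flag raised in Q4 2023 followed by an event in Q2 2024 gives `lead_quarters((2023, 4), (2024, 2))` equal to 2, which falls inside the default window.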

Limitations

Filing language is backward-looking, so material events may already be priced in. Small peer groups (4-6 companies) limit statistical power. 8-K filings are excluded from baseline comparison. The system cannot distinguish between genuine concern and boilerplate risk-factor updates.