The Four Whispers

Detecting Filing Language Divergence in SEC 10-K and 10-Q Filings

A quantitative framework for detecting shifts in how companies in a peer group describe their own businesses — measured against peers and against their own history — as an early-warning surface upstream of the numerical data.

Overview

The Language Analysis module measures divergence in the way companies in a curated peer group describe their businesses in SEC 10-K and 10-Q filings. Divergence here has a specific technical meaning: a company's usage of a term, or its sentiment on a filing section, or the structural content of a section itself, sits far enough from the peer-group distribution or from the company's own prior filings that the distance is unlikely to be noise. The pipeline is deliberately two-stage. Stage one is quantitative detection: it answers only the question of whether a filing deviates, not what the deviation means. Stage two is interpretation, performed by a human analyst with the source filing text in hand. The system is an instrument that surfaces anomalies; it is not a classifier that predicts outcomes, and every design choice below is made with that split in mind. The distinction matters because the failure mode of a predictive language model is to confidently mislabel what it detects. A sentiment classifier that returns “bearish” or “bullish” for a paragraph of 10-K risk factor text is claiming to know the direction of a future price move from a piece of audited regulatory disclosure that was written eight to twelve weeks earlier. That claim is not defensible at scale. A detection system that returns “this section changed materially against its prior baseline” makes a much smaller claim, and a true one: the change is measurable, the measurement is reproducible, and the interpretation is left where it belongs.

What the Pipeline Measures

Four distinct signals are computed for every eligible filing section. Together they form what the site calls the four whispers: four independent views of the same underlying question of whether a filing has drifted away from its established pattern.

The first signal is term-frequency deviation, computed in two modes. The cross-sectional mode asks how much more or less often a given company uses a lexicon term in a filing period compared to its peers filing in the same period. The longitudinal mode asks the same question against the company's own historical baseline. Both modes produce a z-score and a percentile rank for each term-company-filing-year combination, flagging any observation at the extreme tail of its reference distribution. Terms that are consistent across the peer group but suddenly disappear from one company's filing, or terms that appear in one company's filing at many multiples of the peer mean, are the observations that populate the heatmap on the data page.

The second signal is Loughran-McDonald sentiment. The LM dictionary is the standard finance-specific tonal lexicon: its categories are negative, positive, uncertain, litigious, and constraining, with a separate complexity score. Applied to 10-K narrative sections, it produces the percentage of scored words in each tonal category. Sentiment is computed per section type (Item 1A Risk Factors, Item 7 MD&A, Item 8 Financials) because the baseline tone of each section differs substantially; averaging them together would produce a meaningless composite. The interesting observation is never the absolute sentiment level; it is the drift of the sector-mean sentiment over time, and the individual filings that sit far from the cohort in any given year.

The third signal is cosine-similarity drift. Each filing section is represented as a term-frequency vector over the sector lexicon, and the cosine similarity between a company's current-year section and its previous-year section is computed. A similarity below the configured threshold (default 0.80, tunable per sector) is flagged. This detects structural rewriting: a Risk Factors section that has been substantively rewritten against the prior year's template is a different object from one that copies the prior year with a few numerical updates, and the difference is visible in the vector distance.

The fourth signal is conspicuous silence. A term that appeared consistently in a company's prior four or more filings but is absent from the current filing produces a term-disappearance flag. The converse, a term that peers are changing together while one company holds its own language fixed, produces a stability flag. Both cases carry information. The term that went missing is often the sharpest anomaly in the dataset, because management had previously judged the topic important enough to discuss and has now chosen to stop.
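The term-disappearance rule in the fourth signal can be illustrated with a minimal sketch. The function name is hypothetical, and reading "appeared consistently" as "nonzero in each of the last four prior filings" is an assumption; the pipeline's actual consistency test is not specified here.

```python
from typing import Sequence

MIN_PRIOR_FILINGS = 4  # "prior four or more filings" per the text


def term_disappeared(history: Sequence[float], current: float,
                     min_prior: int = MIN_PRIOR_FILINGS) -> bool:
    """Return True when a term present in recent filings is now absent.

    history: normalized frequencies of one term in one company's prior
             filings, oldest first.
    current: normalized frequency of the same term in the current filing.
    """
    if len(history) < min_prior:
        return False  # not enough prior observations to issue a flag
    recent = history[-min_prior:]
    # Assumption: "consistent" means nonzero in every one of the most
    # recent min_prior filings.
    return all(freq > 0 for freq in recent) and current == 0


print(term_disappeared([0.0020, 0.0030, 0.0020, 0.0025], 0.0))
```

A term with only three prior observations, or one that was already absent in a recent filing, does not fire under this reading.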

How the Signals Are Computed

Every signal above reduces to arithmetic on tokenized section text against a sector-specific lexicon. The lexicon for a sector is a curated list of terms — single words, bigrams, and trigrams — that matter for how companies in that sector describe their economics, their regulatory posture, and their risk factors. The lexicon for managed care contains terms that would not carry the same weight in cable-broadband filings or athletic-retail filings, and vice versa. The per-sector lexicon is what makes term-frequency deviation meaningful: noise terms that appear uniformly across every 10-K are not included, and terms that carry genuine sector-specific economic content are. For each eligible filing section, the pipeline tokenizes the raw text, counts occurrences of each lexicon term (single-word terms by token match, multi-word terms by non-overlapping phrase match in lowercase raw text), and normalizes by section word count to produce a normalized frequency. The normalized frequency is what feeds every downstream computation. Raw counts would be confounded by section length: a Risk Factors section that grew from 12,000 words to 18,000 words has more raw occurrences of every term by construction, and the interesting question is whether the rate changed, not whether the count did. Cross-sectional scoring proceeds as follows. For each term, for each year, for each company in the peer group, the pipeline computes the mean and standard deviation of the term's normalized frequency across the other peer companies filing in the same year, then scores the target company as a z-score against that peer distribution. Peers with no filing that year, or peers below the minimum word-count floor for valid extraction, are excluded from the reference distribution for that term-year. A percentile rank is computed alongside the z-score, and an observation-count check (default minimum of eight peers; configurable per sector) determines whether the score is flagged as data-sufficient. 
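The normalization and cross-sectional scoring steps can be sketched roughly as follows. The function names and the tokenizer regex are hypothetical. The use of the sample standard deviation is also an assumption, since the text does not say whether the sample or population form is used. The percentile rank is omitted for brevity.

```python
import re
from statistics import mean, stdev

MIN_PEERS = 8  # default minimum peer count from the text


def normalized_frequency(text: str, term: str) -> float:
    """Occurrences of a lexicon term divided by section word count."""
    lowered = text.lower()
    words = re.findall(r"[a-z0-9']+", lowered)  # assumed tokenizer
    if not words:
        return 0.0
    if " " in term:
        # Multi-word term: non-overlapping phrase match in lowercase raw text.
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        count = len(re.findall(pattern, lowered))
    else:
        # Single-word term: exact token match.
        count = words.count(term.lower())
    return count / len(words)


def cross_sectional_z(target: float, peers: list, min_peers: int = MIN_PEERS):
    """Score the target against the OTHER peers filing in the same year.

    Returns (z, data_sufficient). z is None when it cannot be computed.
    """
    sufficient = len(peers) >= min_peers
    if len(peers) < 2:
        return None, sufficient
    sd = stdev(peers)
    if sd == 0:
        return None, sufficient
    return (target - mean(peers)) / sd, sufficient


text = "Medical loss ratio rose. The medical loss ratio fell."
print(normalized_frequency(text, "medical loss ratio"))
print(cross_sectional_z(5.0, [1.0, 2.0, 3.0]))
```

In the second call the z-score is computed, but the three-peer reference group falls below the floor, so the observation is marked as not data-sufficient.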
The floor exists because small peer groups produce unstable z-scores: a three-peer cohort can produce a spurious extreme z on noise alone. Longitudinal scoring uses the same normalized frequency as input but computes the reference distribution from the company's own historical filings. A minimum of four prior observations is required before a longitudinal z-score is computed for a given term-company pair. This is lower than the cross-sectional minimum for the simple reason that most US public companies do not have thirty years of 10-K history to draw from, and requiring a very long baseline would eliminate every company whose filings post-date a spinoff, IPO, or restructuring. The four-observation floor is enough to estimate a baseline with moderate stability while remaining tractable for companies with shorter filing histories. Loughran-McDonald sentiment is applied at the section level. The pipeline loads the LM dictionary once per run, scores each active narrative section against the dictionary, and stores the per-category percentages alongside total word count and scored word count. Scored word count is a transparency input: it reports how many of the total words in the section were matched to any LM category, which in turn indicates whether the LM scoring is covering enough of the section to be meaningful. Very low scored-word counts relative to total words suggest either an unusually technical section or an extraction failure, both of which are worth flagging. Cosine similarity is computed between the current filing's section vector and the prior filing's section vector for the same company and section type. The vectors are the per-term normalized frequencies over the sector lexicon. Two sections that use the same lexicon terms at similar relative rates produce a similarity near 1.0; a section that has been materially rewritten produces a lower number.
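The cosine-similarity computation can be sketched as below, with each section represented as a dictionary from lexicon term to normalized frequency. The function names and the choice to return None for an all-zero vector are assumptions made for illustration.

```python
import math
from typing import Dict, Optional


def cosine_similarity(current: Dict[str, float],
                      prior: Dict[str, float]) -> Optional[float]:
    """Cosine similarity between two per-term frequency vectors."""
    terms = set(current) | set(prior)
    dot = sum(current.get(t, 0.0) * prior.get(t, 0.0) for t in terms)
    norm_current = math.sqrt(sum(v * v for v in current.values()))
    norm_prior = math.sqrt(sum(v * v for v in prior.values()))
    if norm_current == 0 or norm_prior == 0:
        return None  # undefined when a section matches no lexicon terms
    return dot / (norm_current * norm_prior)


def is_drift_flag(similarity: Optional[float], threshold: float = 0.80) -> bool:
    """Flag when similarity falls below the configured drop threshold."""
    return similarity is not None and similarity < threshold


prior_year = {"medical loss ratio": 0.004, "star ratings": 0.002}
this_year = {"medical loss ratio": 0.001, "utilization": 0.005}
sim = cosine_similarity(this_year, prior_year)
print(sim, is_drift_flag(sim))
```

Because cosine similarity depends only on relative rates, a section that doubles every term's frequency in proportion still scores near 1.0.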
Each sector has its own configured drop threshold because rewriting rates differ across industries — a Risk Factors section in a regulated utility changes less year to year than in a consumer-technology company, and the threshold reflects what constitutes a meaningful change within the peer group. Conspicuous silence and stability detection run as separate passes over the per-term frequency history. For each lexicon term and each company, the pipeline examines the time series of normalized frequencies across filings. A term that appears in four or more prior filings with consistent frequency and then drops to zero in the current filing produces a term-disappearance flag. A term whose peer-group mean frequency has shifted materially while the target company's frequency has remained flat produces a stability flag. Both signals take the company's own prior behavior as the reference, and both require multiple prior observations before a flag is issued.
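The longitudinal scoring described in this section follows the same arithmetic as the cross-sectional case, with the company's own history as the reference. A minimal sketch, assuming the sample standard deviation and using hypothetical function names; the 2.0 threshold is the managed-care longitudinal z-threshold.

```python
from statistics import mean, stdev
from typing import Optional, Sequence

MIN_PRIOR_OBS = 4   # minimum prior observations before a z-score is computed
Z_THRESHOLD = 2.0   # managed-care longitudinal z-threshold


def longitudinal_z(history: Sequence[float], current: float,
                   min_obs: int = MIN_PRIOR_OBS) -> Optional[float]:
    """z-score of the current normalized frequency against the company's own past."""
    if len(history) < min_obs:
        return None  # baseline too short to score
    sd = stdev(history)
    if sd == 0:
        return None  # flat history: z-score undefined
    return (current - mean(history)) / sd


def is_longitudinal_flag(z: Optional[float], threshold: float = Z_THRESHOLD) -> bool:
    """Flag observations in either tail of the company's own distribution."""
    return z is not None and abs(z) >= threshold


z = longitudinal_z([1.0, 2.0, 3.0, 4.0], 7.5)
print(z, is_longitudinal_flag(z))
```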

Signal Strength and Filing-Level Aggregation

Individual term-level and section-level flags aggregate up to a filing-level signal strength tier. The three tiers visible on the filing timeline (high, medium, and low) are derived from the configured thresholds in the per-sector config. The managed-care config uses a quiet floor of up to five flags per filing, a notable band up to thirty flags, an elevated band up to eighty flags, and a loud tier above that. Cosine-similarity and section-length changes carry independent threshold gates. A filing with sixty flags and a cosine drop below 0.75 reaches the loud tier regardless of the flag count; a filing with thirty flags and an ordinary cosine profile reads as notable. The aggregation is deliberately coarse. The finer-grained per-term, per-section observations remain available for inspection on the data page, but the timeline view compresses them into a single tier label so that a reader can scan a multi-year sector history and locate the filings worth reading in full. The compression loses information by design: its purpose is to make the long tail of ordinary filings visually recede so that the anomalies are easy to find.
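The banding logic can be sketched with the managed-care cutoffs quoted above. The function name is hypothetical, the band edges are assumed to be inclusive, and the section-length gate is left out because its threshold is not stated.

```python
from typing import Optional

# Managed-care flag-count cutoffs from the text (edges assumed inclusive).
QUIET_MAX, NOTABLE_MAX, ELEVATED_MAX = 5, 30, 80
COSINE_LOUD_GATE = 0.75  # cosine similarity below this forces the loud tier


def signal_tier(flag_count: int, cosine_similarity: Optional[float] = None) -> str:
    """Map a filing's flag count and cosine similarity to a tier label."""
    # The cosine gate is independent of the flag count.
    if cosine_similarity is not None and cosine_similarity < COSINE_LOUD_GATE:
        return "loud"
    if flag_count <= QUIET_MAX:
        return "quiet"
    if flag_count <= NOTABLE_MAX:
        return "notable"
    if flag_count <= ELEVATED_MAX:
        return "elevated"
    return "loud"


print(signal_tier(60, 0.70))  # loud: cosine gate overrides the flag count
print(signal_tier(30, 0.95))  # notable
```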

Data Sources and Pipeline Inputs

Filing text is extracted from SEC EDGAR. No large language model participates in the extraction pipeline. The text of every filing section is pulled via structured HTML parsing (edgartools for the section-aware retrieval where available, with BeautifulSoup as a fallback for filings where the structured accessors fail) and normalized into the la_sections table. Every observation in the downstream database traces back to a named accession number, a named section type (Item 1A, Item 7, Item 8), and an explicit word count. Sections below a minimum word-count floor of 500 words are treated as truncated extractions and excluded from analysis, because deviation scoring on a 200-word extraction produces false positives that swamp the signal. The peer group for each sector is defined in a sectors/<name>/peers.yaml file. The file names every ticker in the peer group along with a role classifier (target, peer, or benchmark) and free-text notes documenting any caveat that affects interpretation: a company whose filings are dominated by a non-core business line, for instance, will be flagged as a potential outlier in language analysis even while remaining in the corpus. The four sectors currently enabled for language analysis are managed care, cable-broadband, athletic retail, and Chinese ADRs. Each has its own peer list and its own configuration of thresholds and lexicon. The sector is the unit of analysis; results do not port across sectors without recalibration. Per-sector configuration files define the thresholds that govern flag generation and signal-strength classification. The managed-care config uses a minimum peer count of eight for cross-sectional scoring, a longitudinal z-threshold of 2.0, a minimum of four longitudinal observations, a cosine-similarity drop threshold of 0.80, and the signal-strength tier cutoffs noted above. These thresholds are not universal.
They were calibrated against the filing history of the managed-care peer group to produce a flag distribution where the notable and elevated tiers contain the filings a human reader would agree merit attention. Other sectors carry different thresholds and would require the same calibration before their outputs could be trusted. Refresh happens after each quarterly filing season. The pipeline picks up new 10-Qs and 10-Ks from EDGAR, runs the full extraction and scoring chain end to end, and updates the database. Older filings are not re-scored unless the lexicon or thresholds change, in which case the run is versioned and the historical database entries are retained for audit.
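For reference, the managed-care thresholds listed in this section could be gathered into one configuration object. This is an illustrative sketch only: the class and field names are hypothetical, the real configuration file format is not shown here, and placing the 500-word section floor inside the per-sector config is an assumption.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SectorConfig:
    """Hypothetical container for one sector's thresholds."""
    min_peer_count: int                  # cross-sectional data-sufficiency floor
    longitudinal_z_threshold: float      # flag threshold on the longitudinal z-score
    min_longitudinal_obs: int            # prior observations required per term
    cosine_drop_threshold: float         # similarity below this is flagged
    min_section_words: int               # shorter sections treated as truncated
    tier_cutoffs: Tuple[int, int, int]   # (quiet, notable, elevated) upper bounds


MANAGED_CARE = SectorConfig(
    min_peer_count=8,
    longitudinal_z_threshold=2.0,
    min_longitudinal_obs=4,
    cosine_drop_threshold=0.80,
    min_section_words=500,
    tier_cutoffs=(5, 30, 80),
)


def section_is_scorable(word_count: int, cfg: SectorConfig) -> bool:
    """Exclude sections below the minimum word-count floor."""
    return word_count >= cfg.min_section_words


print(section_is_scorable(200, MANAGED_CARE))
```

Freezing the dataclass keeps a run's thresholds immutable, which fits the versioned re-scoring described above.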

Validation — Signal vs Noise

The most honest claim the system can make about its own output is that it detects measurable deviations. The harder claim — that these deviations matter for investment outcomes — is the one that validation has to address, and the state of the evidence is limited and appropriately humble. Internal validation so far has focused on three things. The first is coverage: for each enabled sector, every company in the peer list has complete section coverage over the filing window, every scored section clears the minimum word-count floor, and every flagged observation points to a reproducible term-filing-section tuple that a human reader can look up. Coverage is a precondition for any downstream claim and is audited by the check_completeness script against a declarative spec. The second is threshold calibration: the sector thresholds are tuned so that the top tier of the signal-strength distribution contains filings whose language is observably anomalous on manual review, not merely statistically tail-heavy. A threshold that flags fifty percent of filings as high-strength is uninformative by construction; the current managed-care thresholds produce a loud/elevated tier that contains a small fraction of the total corpus. The third is cross-signal agreement: a filing that registers anomalies on three of four signals is a stronger candidate for reading than one that registers on only one. The site's flag count is a proxy for this, and the strongest flagged filings tend to show coincident anomalies across term-frequency, sentiment, and cosine-drift channels. Beyond the internal coverage and calibration checks, the pipeline has been subjected to four independent rigor tests plus an LLM re-grading pass: look-ahead-bias correction (T+1 event-time), Bonferroni multiple-hypothesis-testing correction across 21 tests (7 flag types × 3 horizons), earnings-announcement confound adjustment, and a vocabulary-swap placebo control. 
Of seven flag classes, only term_disappearance — the flag that fires when a company stops using a sector-defining term its peers still use — survives all four tests. It produces a 58.88% hit rate with +5.72% mean excess return at 60 days (n=2,724) and beats a vocabulary-swap placebo by 4.09 percentage points. The other six classes fail at least one test; two invert from predictive-positive to reliably-negative after the look-ahead correction. A single-grader LLM review of 38 high-severity flags produced a 76.3% non-noise rate and identified term_disappearance as 100% material (5 of 5 in the stratified sample). The published paper reports the full per-test numbers and names the look-ahead correction as the central methodological finding. The system in its current form is useful as a reading prioritization tool — it surfaces the filings most likely to reward close reading — and vocabulary-specific claims are scoped to the one flag class that survived the rigor gauntlet. Everything else is flag-generation infrastructure, valuable for monitoring, not for statistical alpha claims.
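The Bonferroni step above amounts to dividing the significance level by the number of tests. A small sketch follows; the family-wise alpha of 0.05 is an assumed value for illustration and is not taken from the paper, and the function names are hypothetical.

```python
N_FLAG_TYPES = 7
N_HORIZONS = 3
N_TESTS = N_FLAG_TYPES * N_HORIZONS  # 21 flag-type x horizon tests


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level under a Bonferroni correction."""
    return alpha / n_tests


def survives(p_value: float, alpha: float = 0.05, n_tests: int = N_TESTS) -> bool:
    """True when a single test's p-value clears the corrected threshold."""
    return p_value < bonferroni_threshold(alpha, n_tests)


print(N_TESTS, bonferroni_threshold(0.05, N_TESTS))
```

Under that assumed alpha, a test needs a p-value below roughly 0.0024 to survive, which is why a result that looks significant in isolation can still fail the correction.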

Known Limitations

The filing is not the event. Quarterly and annual filings are lagged summaries of what management had already observed weeks to months earlier. By the time a Risk Factors section acknowledges a deterioration, the market has often already repriced. Language divergence is an early-warning signal relative to the analyst community, not relative to the tape. A reader looking for a signal that leads price has the wrong instrument. Small peer groups limit statistical power. The managed-care group contains nine tickers, and the minimum peer count of eight means that cross-sectional scoring on a given term-year is effectively using the entire remaining group as the reference distribution. Smaller sectors fall below that floor on any year with a missing filing and produce data-sufficient=false flags instead of z-scores. The sectors currently enabled are those where the peer count is large enough to support the methodology. Filing structure is a noisy input. The pipeline excludes 8-K filings from the baseline comparison because their content varies too much with event type to form a stable reference distribution. A material agreement 8-K, an officer-departure 8-K, and an earnings-release 8-K have effectively no common structure, and pooling them corrupts the baseline. 10-Qs and 10-Ks are the primary inputs; 20-F filings (for foreign private issuers like the Chinese ADR sector) use a different section taxonomy passed explicitly to the scoring functions. The lexicon is the bottleneck. A sector's signal quality is bounded by how well its lexicon captures the language that matters for that sector. A lexicon that omits a key regulatory term will not surface the divergence when management stops using that term; a lexicon padded with generic terms will produce noise. Each sector's lexicon is curated and calibrated against its peer group, and the calibration is part of what enabling a new sector requires. The system cannot distinguish between genuine concern and boilerplate updates. 
A Risk Factors section that adds three new paragraphs about cybersecurity in 2024 could reflect actual new exposure to a specific threat, or it could reflect industry-wide SEC guidance that required every filer to update cybersecurity disclosure language. The divergence is measurable either way; the interpretation requires reading the filing. This is the split between detection and interpretation that the methodology builds in deliberately. An analyst reading a flagged filing is expected to perform the interpretation step, and the site surfaces the flag, the term, the prior filing, and the current filing side by side to support that work. The numbers on this page describe the methodology as implemented in the managed-care sector, which is the most-calibrated sector currently live. The thresholds, signal-strength tiers, and minimum peer counts for the other enabled sectors are set in their own configuration files and may differ. Any claim about the output of those sectors should be read against their specific configurations and not against the managed-care anchors reported here. A look-ahead bias was caught during the paper's own preparation and is disclosed as the central methodological finding. The original backtest overlay anchored T0 at the filing date's same-day close. Because 10-K and 10-Q filings routinely post after market close, the pre-filing reference price already reflected the market's first reaction. Correcting to the next business day's close collapsed the pre-correction longitudinal_spike 90-day hit rate from 65.11% to 42.20% and inverted its mean excess return from +4.37% to -5.60%. The paper treats the gap between the pre- and post-correction numbers as its headline finding rather than burying it in limitations. About one in four high-severity flags is pipeline noise per the LLM re-grading.
The three noise sources are generic English words leaking through sector stoplists, section_length_change flags that fire on prior-year extraction-truncation artifacts, and universal-event flags (COVID in 2020 filings, ASC 842 in 2019 retailer filings) that fire identically across every peer. Remediations are tracked as v2 work. Multiple-hypothesis-testing correction is conservative but scope-limited. Bonferroni is applied to the 21 flag-type × horizon tests. Additional implicit tests across per-ticker or per-term dimensions are not corrected; a tighter correction would require pre-registration. The earnings-confound adjustment uses estimated earnings dates (10-K − 21 business days, 10-Q − 28 BD, 20-F − 45 BD) and crude window subtraction rather than live dates and event-study regression; the live-date upgrade is deferred. The vocabulary-swap placebo was run on managed-care only, and LLM grading is single-grader — inter-rater reliability is not established. None of these caveats flip the surviving term_disappearance finding; they sharpen the scope of what the system claims.
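The T+1 anchoring and the estimated earnings dates described in this section both reduce to business-day arithmetic. The sketch below uses hypothetical function names and skips weekends only; ignoring market holidays is a simplifying assumption, and a real implementation would use an exchange calendar.

```python
from datetime import date, timedelta

# Business-day offsets used to estimate earnings dates, per the text.
EARNINGS_OFFSET_BD = {"10-K": 21, "10-Q": 28, "20-F": 45}


def add_business_days(d: date, n: int) -> date:
    """Shift d by n business days (negative n moves backward).

    Weekends are skipped; market holidays are NOT handled (assumption).
    """
    step = 1 if n > 0 else -1
    remaining = abs(n)
    while remaining:
        d += timedelta(days=step)
        if d.weekday() < 5:  # Monday=0 .. Friday=4
            remaining -= 1
    return d


def t0_anchor(filing_date: date) -> date:
    """Anchor T0 at the next business day's close, not the filing day's."""
    return add_business_days(filing_date, 1)


def estimated_earnings_date(filing_date: date, form: str) -> date:
    """Estimate the earnings date as the filing date minus a fixed offset."""
    return add_business_days(filing_date, -EARNINGS_OFFSET_BD[form])


friday_filing = date(2024, 2, 23)
print(t0_anchor(friday_filing))                        # 2024-02-26
print(estimated_earnings_date(friday_filing, "10-K"))  # 2024-01-25
```

A Friday filing therefore anchors at the following Monday's close, so the reference price no longer includes the market's first reaction to an after-hours filing.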