Headlines Analytics Console
Part I—How this works

What this console is

A private dashboard at data-headlines.pierce.tools that turns published-headline performance into editorial decisions.

Every article McClatchy publishes lands in the Tracker (the content team lead's master sheet). Snowflake enriches that with platform-level page-view data—Apple News, SmartNews, Yahoo, Newsbreak, search, social, subscriber, newsletter. Tarrow adds platform-syndication detail. The console reads from that joined surface and asks one question per tile: which combinations of pub, topic, formula, vertical, and writer beat their baseline by enough to act on?

It's the analytics counterpart to the Keywords Decisioning Console: keywords decides what to write next; headlines learns from what was written.

Who uses each surface

RolePrimary surfaceWhat they do
Content strategistFindings + GraderDaily headline grade + weekly tile review
Content team leadFindings + AuthorsPer-author percentile review, formula coaching
Anyone post-publishGrader pageReview per-headline scorecards from today's automated run
New team memberDocsRead this; then ask questions.

Concepts 101 · zero-knowledge primer

The console reuses a handful of recurring terms. If anything below is unfamiliar, start here.

Page-views (PVs / total_pvs)

The total times an article was loaded across every distribution surface in the period (Apple News + SmartNews + Yahoo + Newsbreak + search + social + direct + newsletter + AMP + subscriber). Sourced from STORY_TRAFFIC_MAIN in Snowflake.

Lift

A multiplier comparing one group's median PVs to a baseline. +50% means the group's median article hit 1.5× the baseline. Negative = below baseline. The baseline depends on context: usually the publication's overall median, sometimes the vertical median, sometimes the untagged-formula baseline.

Median (not mean)

The console reports medians everywhere because page-view distributions are heavy-tailed—one viral hit can shift a mean by 5×. Medians are stable; viral outliers don't distort the signal.

Headline formula

A regex-classified template the headline matches. Priority-ordered lead-position patterns (number_lead, heres, quote_lead, name_lead) then structural cues (did_you_miss, what_to_know, explainer, definitional, how_to, question, dash_directive, superlative_list, comparative, negation_lead, curiosity_gap) and a length-based structural fallback (short_punchy, descriptive_standard, descriptive_long) so every headline buckets. See the full taxonomy in the glossary.

Vertical

One of three named TH-channel verticals: Mind/Body, Everyday Living, Experiences. Articles published outside these channels (search/discover work) aren't bucketed.

Content type

Either News & Regional (T1 newspapers—Miami Herald, KC Star, Charlotte Observer, etc.) or L&E (US Weekly, Woman's World, and Life & Style). The active lens at the top of Findings restricts every tile to one or the other. (InTouch and Closer are retired L&E brands the console recognizes only for News-vs-Entertainment tag coloring—they aren't in the active Findings cohort.)

Verdict

Each tile assigns each row a verdict: go (real lift, strong evidence), test (directional, worth piloting), or skip (below baseline or thinly evidenced). Color-coded green / amber / coral on the tile rail and pills.

P-value

The probability the observed lift could have arisen by chance from the baseline distribution. p<0.05 = real signal; p<0.10 = directional; otherwise no detectable difference. Computed via Mann-Whitney U-test (non-parametric—no assumption of normal distributions).

The lens filter (News / L&E / All)

Three pills at the top of Findings. Active pill governs the cohort (a cohort = a group of publications sharing a content type) + primary metrics across every tile below.

LensCohortPrimary metric framing
All publicationsFull active cohort (count is computed at runtime + shown next to the pill)Aggregate; useful for portfolio audits
News & RegionalT1 newspapers—Miami Herald, KC Star, Charlotte Observer, Fort Worth Star-Telegram, Sacramento Bee, Raleigh News & Observer, Centre Daily Times, Idaho Statesman, plus the rest of the McClatchy T1 portfolioCTR + Search Ranking. Editorial truth: straight headlines outperform punny ones on Google News/Discover.
Lifestyle & EntertainmentUS Weekly, Woman's World, and Life & Style (InTouch + Closer are retired brands, recognized only for tag coloring)Social Shares + Scroll Depth. Editorial truth: track the curiosity gap—questions kept open beat answers spoiled.

Findings: the 7 tiles

The home page (/findings/) renders seven analytical tiles. Each tile is a self-contained question + answer; click anywhere on a tile to open its drawer with the underlying data.

  1. Per-publication: top-performing topics—which topics each pub hits hardest.
  2. Per-publication: top-performing formulas—which headline patterns earn lift at each pub.
  3. Per-vertical formula × site—three TH-channel rails (M/B, EL, Exp) with cluster batting average.
  4. Trends Over Time—per-pub PV trajectory by quarter.
  5. Formula × Topic × Publication—three-way interaction including weather as a topic dim.
  6. Bottom-Performer Patterns—confirmed anti-patterns to avoid.
  7. Team Performance—per-author percentile vs publication median.

Above the tiles sit the lens pills (News / L&E / All) and a filter-chip row that narrows every tile at once to one content lane, publication type, or headline formula. Several tiles (and the home page) also carry a recent top-performers panel you can rank by traffic or by reader-engagement.

Tile · Per-publication topics

For each publication, the topics that exceed the pub's median PVs by the largest margin. Lift = (topic median − pub median) ÷ pub median. n = articles in the topic at that pub during the lookback window.

Use: assignment prioritization. If Politics at Miami Herald is +34% with n=120 and p<0.05, that's a durable signal—the next politics piece probably outperforms a same-pub baseline article.

Caveat: topic mix shifts seasonally. Lift numbers in the per-pub formula + vertical-cube tiles default to recency-weighted (last 12 months, decayed by ~0.95^weeks-since-publish ≈ 3-month half-life) so "what's working now" beats "what worked over the year."

Tile · Per-publication formulas

Same shape as Per-publication topics, but classifying by headline formula instead of topic. The baseline is the pub's overall median PVs across all headlines; lift = (formula median − pub median) ÷ pub median. Every headline buckets into one of the formulas in the taxonomy (the length-based structural fallback catches anything without a signature pattern), so there's no untagged baseline.

News editorial truth: direct-declarative + quote-lead likely outperform wordplay on local news in Google News/Discover. The data confirms or revises that prior per publication.

L&E editorial truth: curiosity-gap likely beats spoiled-answer. Confirmed on UsW with +42% vs −35% (n=19 / n=12, p<0.05).

Tile · Per-vertical formula × site cube

Three rails—Mind/Body, Everyday Living, Experiences—each showing the formula × site combinations with the largest lift vs vertical median.

The cluster batting average in the meta line answers: of all CSA-touched articles (those with a cluster_id), what fraction sit in a cluster with a positive cluster_vs_co_median? It's article-level, so larger clusters count more (more articles = more votes). Reported as "1 in N"—lower N = better.

Use: formula assignment at the vertical level, or evaluating whether a creator's cluster strategy is paying off.

Tile · Trends Over Time

Per-publication median PVs by quarter. Lift trend = (latest quarter median − first quarter median) ÷ first quarter median.

Engagement trajectory (scroll depth, time-on-page, engaged session rate, cluster-continuation) is live via Wing B substrate + Marfeel ETL recurring (2026-05-18) and primary on the engagement-rank lens; PV trend remains as the reference traffic-only read.

Tile · Sweet-spot finder (pub × topic × formula)

A topic might look flat at a pub overall—but a specific formula can unlock it. This tile finds those three-way sweet spots so the per-tile-1 and per-tile-2 single-axis views don't bury them.

What it shows

Every (publication, topic, formula) combination with n ≥ 3 articles, scored by lift vs that publication's median PVs. Sorted by lift descending. Verdict overlays the same go / test / skip rule as everywhere else. The cube is the most fine-grained tile in the console — at small corpus sizes it surfaces only the densest 1-3 cells; as the article volume grows it fills out.

How to use it

  • Look for combinations marked go with high n. Those are the assignable sweet spots—pub + topic + formula triples that beat the baseline reliably.
  • If a topic shows flat or negative on the per-pub topics tile, scan here before writing it off—the right formula may rescue it.
  • Use the highlighted rows to directly compare editorial-principle pairs (e.g. curiosity-gap vs spoiled-answer at UsW).

Weather signal sidecar

The right pane shows the 30-day weather lift on T1 newspapers (median PVs of weather-classified articles vs the T1 baseline) plus the count. Weather is included in the main table as a topic dimension; the sidecar separates it so seasonality is visible at a glance.

n ≥ 3 floor: combinations with fewer than 3 articles are excluded. Mann-Whitney + lift CIs require n ≥ 5 and ≥ 10 respectively, so cells at n=3-4 surface a lift without a stat-test verdict — the confidence pill renders "low" until cohort builds. If the table is sparse on a given week, the next weekly cron may surface more.

Tile · Bottom-Performer Patterns

Anti-patterns: combinations whose median PVs sit ≤0.5× their pub median, with n ≥ 10.

Currently three confirmed patterns: local-team sports without national hook (T1s), spoiled-answer headlines (L&E), punny/wordplay on local news (T1s).

Use: what to avoid. Full guidance with examples lives in the bottom-performer drawer of each per-publication tile.

Tile · Team Performance

Per-author percentile vs publication median. Author shown as initials (first + last name leading letters; collisions disambiguated with a digit suffix). Sorted by pub-percentile descending.

A writer at 70th pctile means their median article ranks at the 70th percentile of articles published at their primary pub—read it like a class rank: their typical article beats 70% of what their pub published.

Two extra percentile readings sit beside the headline one so writers are compared like-for-like: vertical-peer percentile (rank against only other writers in the same content lane—fairer than ranking a recipes writer against a politics writer) and tier-peer percentile (rank against writers at comparable-size publications). The small ± figure on each is the wiggle room from sample size (a bootstrap confidence range).

Anonymization: the live Tracker sheet uses the canonical AUTHOR field; the live site shows the real name. The committed JSON stores initials only, which satisfies the no-proper-names-in-repo rule.

Recent top-performers panel (+ engagement rank)

A side panel that rides several tiles (and the home page) listing the standout published headlines from the last 7 days. New since the 5/04 docs cut: you choose what to rank them by.

An axis toggle above the list—four buttons: PVs, Engaged %, Time on page, Cluster cont.—lets you choose the ranking metric, and your choice is remembered on your next visit:

Rank byWhat it meansNormalized?
PVsPage-view traffic.Ranked by how far the article beat its own publication's normal level, so big and small papers compare fairly.
Engaged %The share of visits where readers actually engaged rather than bounced.Same—vs the publication's normal engaged rate.
Time on pageHow long readers stayed (median).Same—vs the publication's normal dwell.
Cluster cont.The share of readers who clicked on to another linked article in the same group.Absolute (the value is already a normalized rate): 30%+ reads as strong, 15–30% middling, under 15% weak.

An article missing the chosen measure drops off that list (e.g. a piece with no engagement coverage won't appear when you rank by Engaged %). When you rank by PVs, each row also carries a small decile badge (like D8) showing which tenth of its content lane the article lands in by lift (how far it beat the publication median).

Engaged %, time-on-page, and cluster-continuation come from the Wing-B engagement substrate + the recurring Marfeel ETL (live + longitudinally reliable from 2026-05-18 forward). Where an article predates that coverage or wasn't tracked, those axes simply skip it rather than show a zero.

Reading the numbers (lift · confidence range · verdict dot · decile)

Four small affordances appear throughout Findings. None require statistics background—here's how to read each at a glance. Full definitions are in the glossary.

  • Lift — how far above or below normal a group performed vs a typical article in the same publication or lane. +50% = half again as much (1.5×); −20% = a fifth less; 0% = exactly normal.
  • Confidence range — the bracket after a lift number, like [+22, +47], is the likely range of the true effect: about 95% confident the real lift sits between those two numbers. Narrower = more reliable; wider = more uncertainty. It only shows when there are enough articles to compute it (n ≥ 10).
  • Verdict dot — the colored dot in each tile's header is that tile's overall call at a glance: green = trusted and actionable (go), amber = worth testing (test), coral/red = skip, or too uncertain to act on yet.
  • Decile badge — on top-performer rows ranked by traffic, a badge like D8 says which tenth of its content lane the article sits in by lift (how far it beat the publication median); D10 is the very top tenth (the badge is highlighted when an article reaches it).

Each of these has a (?) help bubble on the live page with the same plain-language definition. The docs and the tooltips are kept in lock-step.

Anatomy: tile + drawer

Every tile has the same skeleton:

  • Rail. Vertical accent strip on the left edge. Color encodes verdict—green = go, amber = test, coral = skip.
  • Title row (t1). Tile name + (?) help bubble + a one-line meta summary (n records · key signal). Always visible.
  • Chevron (▸ / ▾). Right edge of the title row. Click anywhere on the title row to expand or collapse the drawer.
  • Drawer (t2). Hidden until expanded. Two-pane: main column with description + data table; side column with recent top-performers (or per-tile signal—formula hit rates, vertical totals + cluster batting average, weather signal).

Default state: all 7 tiles land collapsed. Click to open the drawer for the tile you want to dig into. Multiple drawers can stay open at once.

Grader page + today's grades

Two grader surfaces feed off the same daily run:

  • Today's published grades on the home page—4 KPIs (headlines graded · avg score · range · top issue) reflecting the most recent grader run.
  • Grader page—full per-headline scorecard for every headline in the lookback window, sorted worst-first. Each headline card expands to show the full rule-by-rule breakdown plus per-vertical tips. The page also carries a 30-day history strip, a per-outlet pass-rate matrix (every rule × every publication), and a criterion-correlations panel (which rules tend to be missed together).

The grader runs daily at 10:17 CDT (15:17 UTC) against the Tracker. It scores each headline against the rule set (a mix of fixed computer checks and a Groq LLM for the four judgment-call rules—active voice, lead burial, curiosity, accuracy), writes the results to docs/grader/index.html + docs/data/grader-signals.json + docs/grader/history.json, and commits. The score is the share of rule-weight a headline earned (see Grader tiers + the score).

Manual trigger (operators only): gh workflow run headline-grader.yml. The in-page Run-now button was removed 2026-05-04—a hardcoded passcode + localStorage PAT isn't a defensible auth surface for a public-readable site.

Grader tiers + the score

Every rule belongs to a tier, and tiers carry different weight. A headline's score is the share of applicable rule-weight it earned—not a simple count of passes.

The score, in one sentence

Add up the weights of the rules a headline passed, divide by the weights of all rules that applied to it, multiply by 100. Rules that don't apply (the platform-title rules, or a keyword rule when no keyword was supplied) are left out of the denominator entirely—they neither help nor hurt. So two headlines can be scored against different rule sets and still compare fairly.

The four tiers + their weights

TierShare of scoreWhat it covers
Structure & Length27%The headline's shape: its length, how it opens, whether it reads as plain active English, and whether the news is up front (not buried).
Formula & Signal27%Using a proven headline pattern, including the target search keyword, avoiding "what to know" filler, and—when the story has one—surfacing the expert/study signal.
Quality Flags36%The "don't do this" rules plus two judgment calls: no "did you miss," no question/all-caps/banned punctuation, and yes to curiosity and accuracy. Carries the most weight of any tier.
Platform-Specific9%The custom titles sent to Apple News and SmartNews. The two length rules (an_title_chars / sn_title_chars) are scored (weight 1), but their Tarrow source isn't joined to the grader yet, so they almost always read "?" (N/A) and drop out of the denominator in practice. Only apple_heres is the weight-0 informational rule here.

The Grader page shows these four as a KPI strip with each tier's average pass rate and its weight (e.g. "27% wt"). Within a headline card, a mini stacked bar shows where that headline's points came from by tier (blue = Structure, purple = Formula, green = Quality), with a dark tail for points it didn't earn.

Weights are defined once in generate_grader.py:CRITERIA and roll up automatically; the percentages above are the live tier sums (Structure 6 / Formula 6 / Quality 8 / Platform 2 = 22 total weight → 27 / 27 / 36 / 9%). See the full 19-criteria table for the per-rule weights.

Format-aware grader rules

Some article formats lock the H1 to a pattern that universal rules would penalize. The grader detects the article's Format column and overrides the universal criteria where they conflict with the format spec.

Synthesis rule

Format spec wins where it conflicts with universal rules (since the format spec is the locked editorial standard). Universal rules apply where the format is silent.

Format → required H1 + exempt criteria

FormatRequired H1Exempts (would otherwise penalize)
Everything to Know[Subject]: Everything You Need to Knowno_vague_wtk
What to Know Next[Subject]: What's Happening, Why, and What Could Be Nextno_vague_wtk
Discover ExplainerWhat Is [Topic]? / Who Is [Person]?no_questions
RecipeMust contain "Recipe"
Timeline[Subject]: A Complete Timeline / Breakdownno_articles (leading "A")
Recap[Show] Recap: [N] Biggest Moments From [Episode]
Obituary[Name] Dead: [Descriptor] Was [Age]
Follow-Up Content[Name] Dead at [Age]
Interview[Name] on [Topic]: '[Quote]' (EXCLUSIVE)
Fan Theory[Show] Fan Theory About [X] Will Blow Your Mind
Fan QuestionBiggest Questions About [Show] Answered

Source of truth: csa-content-standards/docs/<format>.md. Each format spec carries the canonical pattern; the grader's _FORMAT_RULES table is a synthesis of those.

What happens if the Format column is blank

The universal rule set applies (the same 19 criteria as everywhere else). No format-specific override fires. The card's meta line shows the author + vertical + platform + brand but no format badge.

The headline experiment (framing test + expert signal)

A live test, run inside the Variant tool, that separates two very different kinds of editorial knowledge: a lever we've proven and want to enforce, and a hypothesis we want to test before trusting. This section is the high-level "what + why"; the operator how-to lives in the Variant tool documentation.

The two halves, and why they're treated differently

The June 2026 headline analysis turned up one finding strong enough to act on now and one question still open. Putting them on the same tool forced a clean split:

  • Proven lever — the expert/study signal. When a story is built on a credentialed expert or a named study, naming that in the headline reliably lifts performance (the analysis measured roughly +71% page-views and +50% engagement time). Because it's a measured result, not a guess, it's enforced: it became a scored grader rule (see Expert/study signal), alongside length and the punctuation ban.
  • Open hypothesis — the trend-stage framing. Whether describing where a trend sits in its lifecycle changes how readers engage is unknown. So it's tested, not scored: a headline can be written in a test framing without that framing helping or hurting its grade. Scoring a framing would amount to claiming we already know the answer—and we don't yet.

The organizing rule is simple: codify what's proven, validate what's not. A framing only graduates into a scored rule once engagement data picks a winner.

The three framing styles under test

Each is a way of framing the same trend in the headline—same article, different angle on where the trend stands:

StyleThe angle it takesFeel of the language
Established (the current default + baseline)The trend is recognized and authority-backed."Explained," "What the Research Says," "Why X Has Become…"
EmergingThe trend is early-stage and just breaking mainstream—the reader is ahead of the curve."Just Starting to…," "Is Growing," "Why Experts Are Increasingly…"
ForecastThe trend is evolving; forward-looking and predictive."What Comes Next," "Where X Is Headed," "What This Means for You"

Writers begin running the test on 2026-06-10. The framing styles are tuned weekly as data arrives, so they're deliberately kept somewhere editable without a redeploy—not baked into the governed rule library. The test runs alongside normal generation: a writer opts a headline into a test style, or generates the usual production headline; it's one labeled lane, not a separate tool.

The per-headline "receipt"

For the eventual "which style won?" to be honest, every generated headline has to be traceable to the exact setup that produced it—the thresholds, the rule set in force, how the expert signal was delivered, and which framing was under test. So each generation records an immutable config snapshot: a content-addressed "receipt." Two headlines with the same receipt provably ran under the same configuration; a different receipt means something changed. When engagement numbers land weeks later, each headline's outcome attributes to its precise receipt rather than to a guess about what was live at the time.

Operator how-to—the test-style controls on the form, the expert-signal inputs, the conformance flag in the results, and the A/B report—lives in the Variant tool documentation. This section stays high-level on purpose.

Bridge to keywords console

The console emits an outbound feed at /data/headline-outcomes.json with every published article's outcome (PV, lift, formula, topic, content type, vertical, publication date). The Keywords Decisioning Console reads that feed for its Decision Log lens—keywords with a published headline get the headline grade + PV outcome paired back automatically.

Inbound bridge (filter Findings tiles by a keyword brief) is stubbed in the t2-side panel—pending the data-keywords brief-id metadata to land in the Tracker.

The (?) help bubbles

Click any (?) icon to open a popover with a tight definition of that term, metric, or tile. Click anywhere outside or press Esc to close. Enter / Space with the icon focused also toggles.

Content lives in docs/js/help-blurbs.js as a single registry—every term documented in one file. To reference a blurb from any page: <span class="help" data-help="some-id"></span>. The icon + popover are auto-hydrated.

What auto-updates vs what's manual

SurfaceCadenceTrigger
Findings tilesWeeklyMon 19:23 UTC cron → tiles_new.py
Today's published gradesDaily10:17 CDT cron → generate_grader.py
Snowflake enrichmentWeeklySame Mon cron (precedes tiles_new.py)
Tarrow XLSXWeekly (Tarrow-side)Auto-fetched at run start; falls back to last successful
Grader runOn demandgh workflow run headline-grader.yml (operators only)

Troubleshooting

Tiles show stale data

Check the live badge in the topbar. If it says missing for a feed, that feed's cron hasn't fired today. If all 3 feeds are bound but data looks old:

  • Hard-refresh (Cmd+Shift+R)—Cloudflare's cache can lag a deploy by a few minutes.
  • Every JS/CSS/JSON URL is now stamped with ?v=YYYYMMDDHHMM based on the cron run, so each new deploy auto-busts both the browser cache and Cloudflare's CDN cache.
  • If the badge says 0 graded, no headlines were published in the last 24h.

Grader page modal won't close

If Cancel doesn't dismiss, click the dimmed area outside the modal box. Resolved as of 2026-05-02 commit 001356d.

Vertical filter doesn't show all options

Three verticals only: Mind/Body, Everyday Living, Experiences. Articles classified as "Discover" in source data are search/discover work mode, not a vertical—they bucket as in vertical-keyed UI.

Writer initials collide

Two writers with same initials get a digit suffix: RB and RB2. The mapping isn't published; reach out if you need to dereference.

FAQ

Why are PV numbers different from what I see in Amplitude?

This console reads from MCC_PRESENTATION.CONTENT_SCALING_AGENT.TRACKER_ENRICHED, which composes per-platform PVs from STORY_TRAFFIC_MAIN + Tarrow XLSX. Amplitude only covers O&O; MSN traffic for example is Tarrow-only. Numbers will differ.

Why isn't InTouch / Closer / Life & Style on the lens filter?

The filter pills show the top-3 buckets (All / News / L&E), not per-publication pills. The active L&E cohort is US Weekly, Woman's World, and Life & Style; an L&E lens covers all three. InTouch and Closer are not in the active cohort—they're retired brands the console recognizes only for News-vs-Entertainment tag coloring, so they're graded for historical continuity but don't appear in the live Findings cohort.

Why is the (?) icon called the help bubble?

The (?) icon opens a popover with a "what is this?" blurb. The pattern is shared across the keywords + headlines consoles.

Part II—Technical specs

Architecture

Static HTML site deployed to Cloudflare Pages. No build step, no React, no bundler—pages are emitted by Python scripts and served directly. JS interactions are vanilla (no framework).

data/                          # source—Snowflake enrichment + Tarrow XLSX
  snowflake_enrichment.json    # ← snowflake_enrich.py (weekly cron)
  Top Stories 2026 Syndication.xlsx  # ← Tarrow (weekly)

tiles_new.py                   # weekly cron → docs/data/tiles.json + docs/index.html
generate_grader.py             # daily cron → docs/grader/index.html + docs/data/grader-signals.json

docs/                          # served by Cloudflare Pages
  index.html                   # home (Findings)
  data/                        # JSON feeds for tile rendering
  js/                          # vanilla JS interactivity
  css/styles.css + legacy-pages.css

Data sources + refresh cadence

SourceWhat it providesRefreshPath
Snowflake TRACKER_ENRICHEDPer-article PVs by platform, vertical, author, formula classifier, cluster_idWeekly (Mon)data/snowflake_enrichment.json
Tarrow XLSXPer-platform syndication titles + PVs (AN, MSN, Yahoo, SmartNews)WeeklyTop Stories 2026 Syndication.xlsx (repo root)
Tracker sheetLive editorial Tracker (headline, author, brand, vertical, dates)—source-of-truth that feeds TRACKER_ENRICHEDReal-time (Google Sheet)Flows into Snowflake; grader reads TRACKER_ENRICHED (since 2026-06-05)
Groq LLMActive-voice + curiosity-gap + accuracy criteriaDaily (per-headline)API call from grader

File map

PathRole
tiles_new.pyTile builders (~1450 lines). Reads enrichment.json, classifies formula + topic, computes lift + confidence ranges + verdicts + deciles + engagement axes, emits tiles.json + headline-outcomes.json + index.html.
generate_grader.pyGrader (~3900 lines). Reads Tracker sheet, scores the 19 criteria (17 weighted + 2 informational), emits grader/index.html + grader-signals.json + history.json.
scripts/apply_v2_styles.pySub-page migration. Strips legacy embedded CSS, links shared stylesheets, injects v2 topbar.
docs/css/styles.cssv2 design system tokens. Mirrors data-keywords/docs/css/styles.css.
docs/css/legacy-pages.cssClass re-themer for legacy generate_site.legacy.py output (.pb-tile, .agg-tbl, etc.).
docs/js/v2-render.jsReads window.__DATA_HEADLINES__ URLs → injects real data into tile bodies.
docs/js/help.js + help-blurbs.js(?) popover system + content registry.
docs/js/table-align.jsAuto-aligns table headers to match data-cell alignment.

Snowflake TRACKER_ENRICHED schema

The vetted boundary table read by snowflake_enrich.py. Path: MCC_PRESENTATION.CONTENT_SCALING_AGENT.TRACKER_ENRICHED.

Columns this console relies on:

ColumnTypeSource
published_urlVARCHARTracker sheet → DYN_STORY_META_DATA
headlineVARCHARTracker H1 column
authorVARCHARTracker AUTHOR column
domainVARCHARParsed from URL
verticalVARCHARAUTHOR_VERTICAL_MAP join
publication_dateDATETracker
total_pvsNUMBERSTORY_TRAFFIC_MAIN sum (reach metric—fan-out included)
search_pvs / social_pvs / direct_pvs / applenews_pvs / smartnews_pvs / newsbreak_pvs / yahoo_pvs / subscriber_pvs / amp_pvs / newsletter_pvsNUMBERPer-platform breakdown from STORY_TRAFFIC_MAIN + Tarrow
pub_median_pvsNUMBERComputed in routine
article_vs_co_medianFLOATComputed (origin-PVs basis post-cross-syndication-screen—see below)
is_hitINTEGER1 if at-or-above publication median (origin-PVs basis post-screen)
cluster_id / cluster_vs_co_median / cluster_hit_rateVARCHAR / FLOAT / FLOATCluster join + computed (origin-PVs basis + winsorized post-screen)
origin_pvs / syndicated_pvs / n_syndication_sites / syndicated_share / syndication_juice / marfeel_mediumsNUMBER / NUMBER / NUMBER / FLOAT / VARCHAR / ARRAYDesigned; merge-ready when Marfeel→Snowflake feed lands. See cross-syndication distortion screen.
Per the data-team boundary rule: consumers read from MCC_PRESENTATION.CONTENT_SCALING_AGENT only. Direct reads of MCC_RAW or upstream tables are not allowed.

Tracker sheet integration

The editorial Tracker is a Google Sheet maintained by the content team lead, treated as the source-of-truth. It flows into the Snowflake model MCC_PRESENTATION.CONTENT_SCALING_AGENT.TRACKER_ENRICHED (rebuilt ~2×/day), and the grader reads that—it switched off the direct Sheets read on 2026-06-05 (the sheet stopped receiving new dated rows, frozen at 6/02, while the Snowflake model stayed current).

Required columns (used by this console): Headline, Author, Brand Type, Publication Date, Published URL/Link, Primary Keywords, Syndication platform, Format, Content Type, Rotation #.

Header aliases are documented in scripts/shared/constants.py—when the content team lead renames a column, the alias mapping keeps the pipeline stable until the next ingest.

Tarrow XLSX intake

Weekly per-platform syndication file from Tarrow (former vendor name retained as data-product label). Provides Apple News / SmartNews / Yahoo / MSN per-platform titles + PVs. MSN data is platform-side only—not in Snowflake—so Tarrow is the only path to MSN signal.

The data team's 2026-04-18 offer to load this XLSX into Snowflake as a proper data model is in flight as substrate work.

Cross-syndication distortion screen

Designed end-to-end; merge-ready when the Marfeel→Snowflake feed lands. Solves a structural data-quality issue that distorts every topic / cluster / format / author rollup the console renders.

Why it exists

On 2026-05-06, exec/leadership flagged that distribution hand-picking already-strong stories for cross-syndication "juices" apparent topic performance—his canonical example was a Field-Level-Media-syndicated piece spread across ~24 syndication targets. When the console reads total_pvs to render leaderboards + cluster aggregates, those juiced articles inflate every rollup they appear in. Operators reading the console can't tell native topic strength from syndication fan-out. The distortion is selection-on-success: syndication is downstream of strong early performance on the origin publication, not a topic-strength signal.

What changes

New columns on TRACKER_ENRICHED (fed by an upcoming MCC_PRESENTATION.CONTENT_SCALING_AGENT.MARFEEL_ARTICLE_BY_MEDIUM table built by the data engineer):

ColumnMeaning
origin_pvsPVs from the home publication only (Marfeel medium = article_domain)
syndicated_pvstotal_pvs − origin_pvs
n_syndication_sitesDistinct non-origin medium count
syndicated_sharesyndicated_pvs / total_pvs, bounded [0,1]
syndication_juicenone / light / heavy tier (heavy ≥10 sites & ≥60% syndicated; light ≥3 & ≥30%)
marfeel_mediumsSorted distinct medium array—the actual syndication targets that carried the article

Performance-signal columns switch from total-PVs basis to origin-PVs basis at the same migration:

  • is_hit—1 if origin_pvs ≥ pub_median_pvs (was total_pvs ≥)
  • article_vs_co_median—uses origin_pvs instead of total_pvs
  • cluster_total_pvs + cluster_hits—recomputed on origin_pvs
  • cluster_vs_co_median—uses new cluster_total_pvs_trimmed with the heaviest-juiced article per cluster excluded (winsorized)
  • article_pv_share_of_domain_month—origin-PVs basis
  • author_hit_count + author_hit_rate—origin-PVs basis

Reach + revenue columns stay on total-PVs basis because every PV is real revenue regardless of medium: total_pvs, article_programmatic_revenue_live, author_avg_pvs, author_pv_stddev.

UI surfacing

Two new operator-facing surfaces ride the new columns:

  1. Per-article juice chip on every card: 🔁 N-site, X% syndicated when syndication_juice != 'none' (orange for light, red for heavy; hover reveals the medium list).
  2. Leaderboard "Hide heavy-juiced" toggle: localStorage-persisted, default off. When on, filters syndication_juice = 'heavy' rows from leaderboard rankings. Article still openable through deeplink.

What this is NOT

  • Not a hide-juiced policy. Heavily-syndicated articles still appear on every card view + every leaderboard by default. The toggle is opt-in.
  • Not a content-quality judgment. The chip is descriptive, not evaluative—a heavily-juiced article isn't bad content; it just shouldn't be read as topic-strength signal.
  • Not a Marfeel feed dependency for correctness logic. The CTE is fail-open (defaults safely if no match): when no Marfeel row matches an article, origin_pvs falls back to total_pvs and syndication_juice defaults to none. Pre-feed-land state behaves identically to pre-screen state.
  • Not a replacement for cross_site_* columns (McClatchy-internal newspaper syndication). Marfeel covers external / platform syndication targets; both apply to most articles; both useful.

Implementation

Master design + SQL diffs in ops-hub/docs/cross-syndication-screen.md. UI spec + snowflake_enrich.py diffs in data-headlines/dev-docs/cross-syndication-screen-ui.md. Governance entry in csa-content-standards data-universe-labeling.

Cron schedules

WorkflowCadenceTrigger
weekly-site-refresh.ymlMon 19:23 UTCops-hub cross-repo-dispatcher.yml (native cron broken post-transfer)
headline-grader.ymlDaily 10:17 CDTGitHub Actions schedule

Both workflows can be triggered manually via gh workflow run or the GitHub Actions UI.

Deployment + Cloudflare Pages

Deployed to data-headlines.pierce.tools from the main branch. Auto-deploys on push (~30s). Cloudflare Pages also exposes a default *.pages.dev URL for the project; that endpoint resolves but is operational-only—always use the custom domain for stakeholder-facing links.

Access is gated by Cloudflare Access—visiting any page bounces through a one-time-PIN login (whitelisted emails only) before serving content. SPA-fallback is enabled, which serves /index.html for unknown paths—keep this in mind when adding new sub-pages (use absolute paths if the page sits at a depth where relative resolution would 404).

Anonymization governance

Per ops-hub/ANONYMIZATION.md: no proper colleague names in any committed file. Pierce is the only permitted name. Authors in tiles.json appear as initials; grader-signals.json drops the author field entirely; meeting transcripts are extracted to anonymized summaries and originals deleted.

Enforcement runs as Step 4.5 of /sync-repos—fail-closed on denylist match, transcript-filename match, verbatim-content patterns, or snapshot-directory contamination.

Quality gates

Pre-push gates from ops-hub:

  • Anonymization—Step 4.5 of /sync-repos.
  • Conciseness—Step 4.6: project descriptions ≤400 chars, nextActions ≤25 words, CONTEXT.md ≤150 lines.
Part III—Glossary

PV / total_pvs

Page-views: total times an article was loaded across every distribution surface (Apple News, SmartNews, Yahoo, Newsbreak, search, social, direct, newsletter, AMP, subscriber). Sourced from STORY_TRAFFIC_MAIN in Snowflake and aggregated into total_pvs on TRACKER_ENRICHED.

Reach metric—fan-out included. total_pvs is the right number for reach + revenue rollups (every PV is real money regardless of medium). It is not the right number for topic-strength / cluster-strength / author-strength judgments—those use origin_pvs post-cross-syndication-screen to remove selection-on-success distortion. See Cross-syndication distortion screen for full context.

Lift

A multiplier comparing one group's median PVs to a baseline. lift = (group_median − baseline_median) ÷ baseline_median. Positive = above baseline, negative = below. The baseline depends on context: usually publication median, sometimes vertical median, sometimes untagged-formula median.

Median (vs mean)

Page-view distributions are heavy-tailed—a single viral hit can shift a mean by 5×. The console uses medians everywhere because they're robust to outliers. When a tile reports n=20 articles with median PVs of 1,200, that's the 10th-ranked article's PV count.

P-value (Mann-Whitney U)

The probability the observed lift could have arisen by chance from the baseline distribution under the null hypothesis (no difference). Computed via Mann-Whitney U-test—non-parametric, rank-based, no normality assumption.

  • p < 0.05—conventional significance; treat as real signal
  • p < 0.10—directional; worth a controlled pilot
  • p ≥ 0.10—no detectable difference; data is too thin

Verdict (go / test / skip)

VerdictColorThreshold
goGreen (--go #4ade80)Real lift, p<0.05, n large enough
testAmber (--test #fbbf24)Directional, worth piloting
skipCoral (--skip #f87171)Below baseline or thinly evidenced

Verdict color encodes on the tile rail (left edge), pills, and KPI accents. The Verdict column pill in tile tables shows the verdict label first followed by the confidence reading: go · p<0.05, test · p<0.10, test · no diff, skip · p<0.05.

Confidence (high / mod / low)

Overlay on verdict, conveying statistical confidence:

  • conf-high (cyan #38bdf8)—p<0.05, n large
  • conf-mod (violet #c084fc)—directional p<0.10
  • conf-low (slate #94a3b8)—anecdote / no detectable difference

Confidence range on lift (the bracket)

The bracket after a lift number—like [+22, +47]—is the likely range of the true effect: we're about 95% confident the real lift sits between those two numbers. A narrower bracket means a more reliable estimate; a wider one means more uncertainty.

It's a 95% normal-approximation interval around the lift, computed in tiles_new.py:_lift_ci_normal. It only appears when the group has at least 10 articles (and the baseline median is non-zero); below that the data is too thin to bracket, so the cell shows the lift alone. Use the bracket to tell a steady, dependable lift (tight bracket on many articles) from a fluky one (wide bracket on few).

Decile badge (D1–D10)

On top-performer rows ranked by traffic, a small badge like D8 says which tenth of its content lane (vertical) the article lands in by lift (how far it beat the publication median). D10 is the top tenth; D1 is the bottom tenth. The badge is highlighted green when an article reaches the top deciles, amber in the middle, and muted lower down—a one-glance read of "how far up its lane did this land."

Computed per vertical in tiles_new.py (lift_decile_within_vertical): the article's traffic rank within its lane, bucketed into ten. It's a within-lane ranking, so a recipes article is placed against other recipes articles, not against politics. It shows only on the PV ranking axis of the recent top-performers panel.

Engagement metrics (engaged % · time on page · cluster continuation)

Beyond raw clicks, three reader-behavior measures say whether a headline actually delivered. They're now the primary read on the top-performers panel; page-view trend stays as the simple traffic-only reference.

  • Engaged % (engaged_session_rate)—the share of visits where readers engaged rather than bounced.
  • Time on page (median_time_on_page_seconds)—how long readers stayed, as a median (one marathon reader doesn't distort it).
  • Cluster continuation (csa_cluster_continuation_rate)—the share of readers who clicked on to another linked article in the same group. 30%+ is strong, 15–30% middling, under 15% weak.

These are Amplitude/Marfeel-derived and land on each article via the Wing-B engagement substrate plus the recurring Marfeel ETL—live and longitudinally reliable from 2026-05-18 forward. They're NULL when an article has no engagement coverage (it predates the window, or wasn't tracked), so the panel skips an article on any axis it's missing rather than show a misleading zero. The "beat the publication's normal level" standard used by the hit-rate-by-formula sidecar uses these measures where they exist and falls back to plain traffic where they don't.

Cluster batting average

Article-level: of all CSA-touched articles (those with a cluster_id), the share whose cluster's median PV beat the publication's company median (cluster_vs_co_median > 0). Reported as "1 in N"—lower N = better.

Larger clusters get proportionally more votes (each article in a winning cluster counts), so this reads as "how often does a CSA-touched article sit in a cluster that's outperforming."

Current batting average sits in the 1-in-2.5 to 1-in-3 range. The exec/leadership Q3 target is 1 in 3.3 (≈30%).

Methodology shift coming. Post-cross-syndication-screen, cluster_vs_co_median uses cluster_total_pvs_trimmed (heaviest-juiced article per cluster excluded) computed against origin_pvs per article. Expect cluster batting averages to shift downward modestly as previously-juiced clusters revert to native strength. See Cross-syndication distortion screen.

Curiosity gap

Editorial principle: lead with the question, not the answer. "The Secret to Her 20-lb Weight Loss" (gap kept open → click) outperforms "She Lost 20-lbs Eating Kale" (answer spoiled → no click). Active in the L&E lens.

Confirmed at US Weekly with n=19 vs n=12 control, +42% vs −35% lift, p<0.05.

Formula taxonomy (full list)

Canonical list lives in formula_taxonomy.py at repo root — single source of truth shared by the Findings classifier (tiles_new.py:classify_formula) and the published-headline grader (generate_grader.py:_formula). The variant tool's Cloudflare Worker (data-headlines-variant, separate codebase) is synced manually against this list.

Priority-ordered: penalized-filler phrases first (so they bucket correctly even when a credit-worthy pattern also matches), then lead-position patterns, then structural cues, then stylistic markers, then a length-based fallback. First match wins. Every non-empty headline buckets — untagged is reserved for the empty-string degenerate case only.

Role column: credit = legitimate editorial formula, earns "Formula present" credit in the grader, classifier buckets here, variant Worker generates candidates here. penalized = recognized vague-filler pattern, classifier buckets it for cohort analysis but the grader does NOT award credit (the per-criterion penalty fires separately). fallback = length-based classifier-only bucket; doesn't represent an editorial formula.

Penalized-filler phrases (priority 1)

These bucket FIRST so a headline like "X: Everything You Need to Know" doesn't accidentally credit on a stylistic marker further down the list.

FormulaTriggerRoleExample
did_you_missPhrase "did you miss"penalized"Did You Miss the Aurora Last Night?"
what_to_know"what to know" / "what you should know"penalized"Severe Weather Hits SC: What to Know"
explainer"everything (you need) to know"penalized"SHA Wellness Retreat: Everything You Need to Know"

Lead-position patterns (priority 2)

FormulaTriggerRoleExample
heresStarts with "Here's" / "Heres" / "Here are"credit"Here's How to Cancel a Subscription"
quote_leadStarts with a quotation markcredit"'Devastated': Family Reacts to Verdict"
name_leadProper-name lead + reporting verb (says / reveals / admits / shares)credit"Kendall Jenner Says This Weighted Blanket Helped…"
inside_leadStarts with "Inside"credit"Inside Hollywood's Ice Bath Obsession"
why_leadStarts with "Why"credit"Why Creative Retreats Are the Biggest Travel Trend of 2026"
action_verb_leadStarts with Watch / See / Meet / Listen / Read / Behold / Witnesscredit"Watch: A NeeDoh Compared to Other Stress Balls"

Numeric + listicle (priority 3)

list_with_quantifier comes before number_lead so "5 Things to Know" buckets as the more specific listicle pattern rather than the bare number-lead bucket.

FormulaTriggerRoleExample
list_with_quantifier"N + plural noun" listicle (things / reasons / ways / tips / hacks / kitchen items / benefits / …)credit"5 Things You Should Know About Sleep"
number_leadStarts with a digit (and isn't a listicle)credit"5 Senate Republicans Break From Trump…"

Structural cues (priority 4)

FormulaTriggerRoleExample
definitional"what is/are…", "why is/are…", "how does/do…", or "explained"credit"What Is a NeeDoh? Everything Rosé Has Said About…"
how_to"how to/i/you can", "ways to", "tips/tricks to"credit"How to Reduce Microplastic Exposure"
questionEnds with question markcredit (+ separate no_questions penalty)"Is the Aurora Coming Back Tonight?"
dash_directiveEm/en-dash rhythm — directive after dash, or any em-dash spacingcredit"Most People Use Too Much Detergent—Here's How to Stop"
superlative_list"best", "top", "greatest", "ultimate", "easiest", "smartest"credit"The 7 Best Cold Plunge Resorts Around the World in 2026"
comparative"vs", "versus", "better than"credit"What Science Says vs. What Influencers Claim"
negation_lead"stop", "avoid", "don't", "never", "skip", "quit"credit"Stop Spending Hundreds on Sleep Aids — Try These 6 Hacks…"
curiosity_gap"secret", "reveal", "surprising", "shocking", "hidden", "little-known"credit"The Secret Behind Jelly Roll's Weight Loss"

Stylistic markers (priority 5)

Fire often, less differentiating. Last in priority so they only catch headlines that don't trigger any more specific editorial pattern above.

FormulaTriggerRoleExample
possessiveApostrophe-s or s-apostrophe possessivecredit"Khloé Kardashian's Beef Tallow Routine"
colon_structuredSubject + colon + space + wordcredit"Beef Tallow: The Wellness Trend That Won't Die"

Length-based fallback (classifier only)

Headlines that don't trigger any pattern above bucket by word count so per-pub × formula groups still populate. These are NOT editorial formulas — the grader returns "No recognized formula" (no credit) for headlines that fall through to this layer.

FormulaTriggerRoleExample
short_punchy≤ 7 wordsfallback"Beef Tallow Is Everywhere Now"
descriptive_standard8–15 wordsfallback"Khloé Kardashian Joins the Growing List of Celebrities Endorsing Beef Tallow for Skin"
descriptive_long≥ 16 wordsfallback"Your Nervous System Needs More Than Just 1 Deep Breath. These Popular Breathwork Apps Can Actually Help With Anxiety"

Topic taxonomy

Regex-based classifier in tiles_new.py:classify_topic(). Priority-ordered: specific patterns fire first; if no rule matches and Snowflake has an IAB tag, that's used; otherwise a token-cue fallback buckets to wellness_general / lifestyle_general / news_general. untagged is reserved for headlines with zero topical cue words (rare in practice).

Traditional news topics

weather_severe, weather_heatwave, weather_general, politics, crime, sports_local, sports_national, royals, finance.

TH wellness/lifestyle topics

Tuned to the Trend Hunter content corpus that dominates today's dataset:

  • wellness_menopause — menopause, perimenopause, hot flashes, HRT, estrogen
  • wellness_sleep — sleep, insomnia, circadian, melatonin, naps
  • wellness_retreat — retreats, spas, wellness boot camps, sanctuaries
  • wellness_stress — stress, anxiety, breathwork, vagus nerve, cortisol, cold plunge, meditation
  • wellness_beauty — skincare, beef tallow, wrinkles, botox, red light therapy
  • wellness_fitness — run clubs, workouts, pilates, sarcopenia, gym
  • wellness_nutrition — protein, electrolytes, hydration, gut health, loaded water, collagen
  • celebrity_health — weight loss, Ozempic, surgery, cancer
  • home_hazards — mold, microplastics, PFAS, forever chemicals, laundry detergent, tap water
  • home_garden — garden, flowers, curb appeal, landscaping, lawn
  • real_estate — home inspection, home buying, mortgage, housing market
  • recipes — recipes, baking, meal prep, dinner/breakfast
  • entertainment — movies, TV shows, Netflix, streaming
  • celebrity_general — Kardashian, Jenner, Hollywood stars (when not under a wellness bucket)
  • lifestyle_trend — viral, TikTok, trend, Gen Z, "explainer"
  • travel_food — travel, destinations, food tours, restaurants

Fallback buckets

wellness_general / lifestyle_general / news_general — fire when a headline contains a general cue word (wellness/health/luxury/trend/report/etc.) but no specific topic pattern matched. These keep per-pub × topic groups populated for analysis even when a headline doesn't fit a named bucket.

Content type (News vs L&E)

Content typePublicationsPrimary metric
News & RegionalMiami Herald, KC Star, Charlotte Observer, Fort Worth Star-Telegram, Sacramento Bee, Raleigh News & Observer, Centre Daily Times, Idaho Statesman, The State, Lexington Herald-Leader, Bradenton Herald, Belleville News-Democrat, Fresno Bee, Modesto Bee, Macon Telegraph, Sun Herald, plus 10 more T1 papersCTR + Search Ranking
Lifestyle & EntertainmentUS Weekly, Woman's World, and Life & Style (InTouch + Closer are retired brands, recognized only for tag coloring)Social Shares + Scroll Depth

Verticals (the 3)

  • Mind/Body (color: emerald --vert-mb)—health, fitness, wellness
  • Everyday Living (color: amber-orange --vert-el)—home, décor, lifestyle, parenting
  • Experiences (color: pink --vert-ex)—travel, recipes, finance, retirement

These are the three named TH-channel verticals. Articles published outside these channels (search/discover work) are bucketed as in vertical-keyed UI; they aren't a fourth vertical.

Writer initials

Author identifier in committed JSON. Computed from full author name: first letter of first word + first letter of last word. "Jane Doe"JD. "Alex Roe-Smith"AR. Collisions get a digit suffix (JD, JD2, …).

Live Tracker sheet uses canonical full AUTHOR field; initials are a render-layer transformation only.

Grader 19 criteria (+ tier weights)

Defined once in generate_grader.py:CRITERIA. A headline's score = (weight of rules it passed) ÷ (weight of rules that applied to it) × 100. Rules with weight 0 are informational (shown, never scored); rules that return "N/A" for a given headline drop out of its denominator. The four tiers carry 27 / 27 / 36 / 9% of the total weight (see Grader tiers + the score).

KeyPlain nameTierWtMethod
char_countCharacter count (outlet-specific)Structure1Rule-based (90–104 default; UsW / WW / Life & Style 90–100)
subject_leadsNamed entity leadsStructure1Rule-based
no_articlesNo "The / A / An" leadStructure1Rule-based
active_voiceActive voiceStructure1LLM (Groq)
no_lead_burialNo lead burialStructure2LLM
formulaFormula presentFormula2Rule-based
no_vague_wtkNo "what to know" fillerFormula1Rule-based
keywordKeyword presentFormula1Rule-based
expert_signalExpert/study signal (proven lever)Formula2Rule-based · scored only when the story has an expert/study to surface (else N/A)
numberNumber leadFormula0Informational
no_dymNo "Did you miss"Quality2Rule-based
no_questionsNo question headlineQuality1Rule-based
no_allcapsNo all-caps wordsQuality1Rule-based
no_banned_punctNo em/en-dash, colon, semicolonQuality1Rule-based · skipped (N/A) for UsW / WW / Life & Style
curiosityCuriosity gapQuality1LLM
accurateFactually accurateQuality2LLM
apple_heresHere's / Here arePlatform0Informational
an_title_charsApple News title (90–120 chars)Platform1Rule-based · scored only when the article is matched in Tarrow (else N/A)
sn_title_charsSmartNews title (70–90 chars)Platform1Rule-based · scored only when matched in Tarrow (else N/A)

Weight 0 = informational (shown but not scored)—number and apple_heres, because whether they help depends on which platform the title goes to, and the grader sees only the website headline. The two platform-title rules (an_title_chars, sn_title_chars) would grade the custom Apple News / SmartNews titles, but those live in the Tarrow file that isn't joined to the grader yet, so they read "?" (N/A) for almost every headline and stay out of the score in practice. All 19 grade the H1 from the Tracker, never the platform-specific syndication titles.

Up from the 15 criteria documented before the 5/04 cut: expert_signal (the proven lever from the June 2026 analysis—see the headline experiment), no_banned_punct (the TH public-facing ban on em/en-dash, colon, semicolon), and the two platform-title rules an_title_chars / sn_title_chars are all now in the registry.

Expert/study signal (proven lever)

When an article is built on an expert or a study—a credentialed person, researcher, or doctor, or named research—putting that signal in the headline reliably lifts performance. The June 2026 headline analysis measured this as the single highest-impact finding (roughly +71% page-views and +50% engagement time), which is why it's an enforced, scored grader rule rather than just a tip.

How the rule behaves: it fails only when the source clearly has expert/study material but the headline leaves it out; it passes when the headline surfaces it. It shows "?" (not scored) when the story has no expert/study material to surface, when the brand is US Weekly, Woman's World, or Life & Style (where the rule doesn't apply), or when the tool can't confirm the expert from the article body.

Anti-fabrication is structural: the signal must come from the article body—the writer has done the research. The tool never invents an expert or study the body doesn't support. To pass: name the expert or study in the headline when the story actually has one.

In the headline experiment this is the "proven" half (codify what's proven). It's armed for grading whenever the story supplies an expert/study; on an ordinary published headline with no such context it simply reads N/A and doesn't affect the score.

Headline letter grades

Score → letter grade mapping for the visual grade pill:

ScoreGrade
90–100A
80–89B
70–79C
60–69D
<60F

Grade trends per author are tracked in docs/grader/history.json (30-day rolling window).

Anonymization in practice

Per ops-hub/ANONYMIZATION.md, no proper colleague names appear in any committed file (Pierce is the sole exception).

Where it shows up

  • docs/data/tiles.json · team_perf.writers[].initials—first letter of first name + first letter of last name (collisions get a digit suffix). No author field at all.
  • docs/data/grader-signals.json · per-headline objects have vertical but no author.
  • docs/grader/index.html · the _author_initials() helper renders authors as initials in every hcard. The full name is only used in-process for the AUTHOR_VERTICAL lookup, never serialized.

Mapping back to actual names

The live Tracker sheet has the canonical AUTHOR field. Operators with sheet access can dereference initials by sorting on AUTHOR + matching to the displayed pub-percentile. Initials map is not published.

Enforcement

Pre-push gate: /sync-repos Step 4.5 invokes /anonymization which scans every committed file against the 55-name denylist + transcript-filename patterns + verbatim-content patterns. Fail-closed.

Part IV—Variant tool

Variant tool · what it is

This Part is the architecture + technical reference (overview, pipeline, models, D1 schema, versioning, deployment, endpoints). For the operator walkthrough—form fields, the style test, the expert lever, reading results—see the Variant tool documentation.

A live editor-in-the-loop tool at /variant/ that takes one headline + its primary keywords (and optionally the article body) and returns 5–10 publication-ready alternative phrasings, plus one structurally-novel ⚡ wildcard for the long-term cohort study.

The tool answers a different question than Findings or Grader. Findings tells you what shape worked across thousands of past articles. Grader scores one headline. Variant generates options for an article you're about to publish—every option verified against the same production grader, every option containing every primary keyword you supplied, every option diverse enough from the seed that selecting one is a real editorial decision.

Who uses it. Content team writers + the content team lead. Each user gets a cap of 10 generations/day, 200/month, with the team sharing a ~700-runs/month budget ceiling, and the tool runs weekday-only (Mon–Fri, Eastern; closed weekends). Pierce is on an unlimited exception list because he funds the underlying API.

Why it exists. Variants offered for a single article were running lexically too similar—sibling bridges in a cluster looked interchangeable. The variant tool is the editorial-side fix: forced facet diversity across an option set when the article body provides article-supported latitude, plus a long-term study aimed at promoting the most-picked off-pattern wildcard structures into the standard format library.

For the full operator reference—form fields, the style test, the expert lever, reading the result cards, rate limits, troubleshooting—see the Variant tool documentation.

The headline experiment in the Variant tool

The headline experiment (trend-stage framing test + the proven expert/study lever) runs inside the Variant tool. Here's where it sits in this tool's architecture; the writer-facing controls and the A/B report are in the Variant tool documentation.

Three rule homes, three change cadences

The experiment is built on a clean split between what's settled, what's under test, and the record that ties a result back to its setup:

LayerLives inChanges how oftenGoverns
Proven / settledThe bundled rule corpus + the scored criteria registryRarely (only when a hypothesis graduates)Length, punctuation ban, keyword presence, formula taxonomy, and the scored expert/study signal.
Under test (volatile)A hot-editable database row (no redeploy)Weekly+The Established / Emerging / Forecast framing definitions + their signal-word menus.
Provenance (the receipt)An immutable, content-addressed config snapshot per generationEvery callThe complete setup—thresholds + rule set + expert-delivery mode + framing under test—so attribution and rollback are exact.

The decision rule: a rule belongs in the settled/scored layer once an engagement winner has been picked and it's being enforced; it stays in the volatile database layer while it's still a hypothesis being injected and validated-but-not-scored. Scored ⇒ settled; under-test ⇒ database.

How it grades

  • Expert signal (proven) grades like any scored rule, using the same "score it when the story supplies an expert/study, otherwise N/A" gate the keyword rule already uses.
  • Framing (hypothesis) runs through a conformance check that confirms a test headline actually follows its assigned style, but never enters the score—so an unproven framing can't tank a production headline's grade or block generation.

Both the expert directive and the assigned framing ride the existing generation call—no extra LLM round-trips, no added cost. All the schema is additive and every new read fails open: a missing row falls back to bundled defaults exactly like the calibration read, so the experiment can never take the production tool down.

Design source: dev-docs/headline-experimentation/spec.md (problem statement + architecture). Writers begin the framing test 2026-06-10; styles are tuned weekly. A winning framing graduates into the scored layer only once engagement data picks it.

Quickstart · 60 seconds

  1. Paste the seed headline you want alternatives for. The tool produces rephrasings of the SAME article—not pitches for a different one.
  2. List the primary keywords (load-bearing SEO terms, comma-separated). Every variant the tool offers will contain ALL of them. Strict editorial rule.
  3. Optional but recommended: paste the full article body. The tool switches to "maximize difference" mode—pairwise diversity gates kick in, and the grader judges accuracy against the article instead of the headline alone. Without article body, the grader stays conservative.
  4. Click Generate variants. Wait 5–15 seconds. Each card shows score + similarity + structural breakdown.
  5. Keep the variant(s) you'd ship—the button toggles to Kept ✓; multi-pick is supported and kept state survives regeneration of the same seed (persists server-side for performance analysis). Or Copy to clipboard, or ↻ Regenerate for a fresh batch with the same inputs.
Form state survives page refresh (sessionStorage). Pasted articles aren't lost on accidental reload.

The form fields

FieldRequiredWhat it does
Seed headlineYes (≥20 chars)The version you wrote. Tool produces alternatives of the same article.
Primary keywordsYesComma-separated. ALL must appear in every variant (case-insensitive). If a keyword being required would over-constrain variants, move it to Secondary.
Secondary keywordsNoComma-separated. Bonus signal if a variant naturally surfaces them; not gated.
Article textNo (recommended)Full article body. Switches the tool to article-mode (see below). Up to 50,000 chars (truncated to 8,000 in the LLM prompt for cost + injection-surface control).
Variants (count)Always 10The tool always targets 10 alternative phrasings. Returns as many as the gates allow (minimum 3, including at least 1 ⚡ wildcard—guaranteed via fallback promotion if the wildcard would otherwise be missing). No operator slider—every call is max-effort.
Also avoid (Advanced)NoOne previously-published headline per line. New variants gated against each by the same cosine ceiling. Use when you've already shipped a deepener + bridges in the same cluster and need the next variant distinct from all of them.
Format / Persona / Platform / Outlet (Advanced)NoApply rule overrides from csa-content-standards. Only ACTIVE (finalized) formats / personas / platforms / outlets are listed—pending or draft entries are excluded. As bundled in the current rule corpus (RULES.js, synced from csa-content-standards): formats × 5 (Everything to Know · What to Know Next · Discover Explainer · Follow-Up Content · FAQ), personas × 2 (Curious Optimizer (TH) · Discover Browser), platforms × 5 (Apple News · SmartNews · MSN · Google News · Trend Hunter B2C), outlets × 4 (US Weekly · Woman's World · Life & Style · Trend Hunter B2C). Formats still pending in the standards (Recipe · Timeline · Recap · Obituary · Interview · Fan Theory · Fan Question) aren't in this list and default to general guidelines for grading.

Default vs article-mode

The tool runs in one of two regimes depending on whether the article body field is filled. The mode pill in the results header tells you which is active.

RegimeActive whenDiversity barAccuracy bar
DefaultNo article body"Options for selection"—variants may share content tokens across options (you pick one, rest are discarded). Only formula cap (≤2 per formula) and cosine (meaning-similarity, 0–1) vs seed enforced.Conservative—variants must stay within the seed headline's stated scope. No "experts say," no specific findings, no listicle counts beyond what the seed implies.
Article-modeArticle body provided"Maximize difference"—pairwise cosine (meaning-similarity, 0–1) ≤ 0.85 + pairwise content-token Jaccard (word-overlap, 0–1) ≤ 0.45 between rule-following variants (tightened 2026-05-27 per exec/leadership variety-weighting ask). Forces facet diversity across the set.Permissive—variants may name any technique / mechanism / finding / source / comparison present in the article body, even if absent from the seed. The grader judges against the article, not the headline alone.

Article-mode is what content team should reach for when she has the article in hand. Without it, the tool is honest about its narrower latitude and shows the conservative regime in the mode pill.

The wildcard slot

Every /generate returns exactly one variant tagged ⚡ wildcard. The wildcard is generated under an off-pattern prompt—its job is structural novelty, not faithful rephrasing.

What's different about a wildcard:

  • Bypasses the rule-following LLM grader (active voice / lead burial / curiosity). Those criteria are calibrated for ordinary variants and would fight the wildcard's role.
  • Still passes every hard structural rule at 100% (length, named-entity lead, formula, primary keywords, banned patterns)—wildcards are publishable shape, not garbage.
  • Single accuracy safety-net check via a one-criterion LLM call—catches wildcards that introduce fabricated claims (e.g. "experts say" when no experts are in the source).
  • Has its own cosine ceiling (0.95 vs seed) regardless of mode—the cohort study values structural fingerprint novelty, not semantic distance (meaning-difference), so a wildcard at 0.95 cosine with a structurally novel POS (part-of-speech) sequence is exactly what the study wants.

Verify wildcards editorially before publishing—they bypass the LLM judgment criteria by design. The chip + tooltip on each wildcard card discloses this.

The wildcard cohort feeds the format-promotion ledger—recurring wildcard structures that the editor consistently picks become candidate new headline standard formats.

Score floor · what's guaranteed

The grader is a port of the same criteria registry in generate_grader.py that runs daily on published headlines (kept in lock-step by a parity test). The variant tool applies it differently to each cohort.

Hard structural rules · 100% pass on every variant (rule-following + wildcard)

  • Length · distinct numbers, not one band. The hard floor is 75 chars—a variant-tool runtime override (VARIANT_CHAR_RANGE.min in index.js) that permits tighter phrasings; it sits below the daily grader's publish floor. The scoring target band is 90–104 (target_min/target_max, matching the daily grader since 2026-06-05). Above target there's an 11-char editorial over-tolerance (VARIANT_CHAR_RANGE.over_tol, added 2026-06-16): options 105–115 chars soft-pass carrying an amber ⚑ editorial-judgment flag instead of dropping, so writers get a fuller set and decide whether to trim; only <75 or >115 drop. This tolerance is variant-tool-only—the daily grader carries no over_tol and still hard-fails any headline over target. So a 75–89-char variant clears the floor and is publishable but scores below one in 90–104; a 105–115-char variant passes flagged. (Rationale: keeps a compressed or slightly-long phrasing from being dropped outright when a fully-compliant set is hard to reach.)
  • Named-entity lead · no "The / A / An / It / This / There / He / She / They / Here" lead
  • Formula present · matches at least one credit-worthy editorial formula (about two dozen, e.g. Number lead · "Here's" · Possessive · Colon-structured · Em-dash · How / Why / Inside leads · Action-verb · List-with-quantifier · Comparative · and more—see the full formula taxonomy)
  • Primary keywords · every primary keyword the operator supplied appears in the variant
  • Banned patterns · no question-mark ends · no "What to Know" / "Did you miss" · no all-caps non-acronyms

LLM-judged criteria · ≥85 composite score (rule-following only)

Four binary criteria graded by an LLM: active_voice, no_lead_burial, curiosity, accurate. Composite score weights each—failing one weight-1 criterion = 94 (passes), failing one weight-2 criterion = 88 (passes), failing two = ≤82 (fails). The 85 floor tolerates a single borderline LLM judgment-call miss but rejects combinations or multiple substantive misses.

Wildcards · one-criterion accuracy safety net

Wildcards skip the 4-criterion LLM grader entirely. Instead a single LLM call judges accuracy alone (does the wildcard introduce a falsifiable claim the source contradicts or doesn't cover?). Default PASS. Catches fabrications without fighting the structural-novelty objective.

Active gates · cosine + grader + diversity

GateDefault modeArticle-modeWhat it catches
Cosine vs seed (rule-following)< 0.97< 0.93Near-literal duplicates of the seed headline. Tighter in article-mode where the article gives variants room to legitimately diverge.
Cosine vs each "Also avoid" entry< 0.97< 0.93Collisions with previously-selected sibling headlines from the same cluster.
Cosine vs seed (wildcard)< 0.95< 0.95Wildcard-specific ceiling—independent of mode (cohort study values structural fingerprint, not semantic distance).
Pairwise cosine between rule-following variants< 0.85Forces facet diversity in article-mode; drops the lower-grader variant on collision (tiebreak: drop higher cosine-to-seed). Tightened from 0.92 on 2026-05-27.
Pairwise content-token Jaccard between rule-following variants< 0.45Catches variants that share most content nouns—different formulas of the same content angle. Tightened from 0.55 on 2026-05-27.
Formula cap≤ 2 per formula≤ 2 per formulaNo formula hogs the option set across the full credit-worthy formula taxonomy. Lowest-scoring excess get dropped.
Grader (rule-following)≥ 85 composite≥ 85 compositeHard structural rules at 100% + LLM-judged criteria pass at ≥85 floor.
Grader (wildcard)Hard rules + accuracy safety-netHard rules + accuracy safety-netHard structural rules at 100%; LLM-judged criteria bypassed; single accuracy check catches fabrications.

If the pool falls below the requested N after all gates, the pipeline runs up to 2 regeneration passes with explicit failure-mode feedback (e.g. "char_count under 80 in 4 variants—extend with article specifics"). After 2 passes, returns whatever's in the pool with a recovery-advice payload.

Quotas + caps

Four guardrails. A weekday-only window (Mon–Fri ET), per-user daily, per-user monthly, and the team-wide budget. Atomic D1 counters—concurrent calls cannot over-shoot. Visible to the user as pills above the form.

CapDefaultResetsBehavior at limit
Daily generations / user10/day00:00 UTC429 + recovery panel ("resets at 00:00 UTC")
Monthly generations / user200/month1st of month UTC429 + recovery panel ("resets on the 1st")
Weekday-only (Mon–Fri ET)Closed Sat/Sun (Eastern)Reopens MondayERR_WEEKEND + recovery panel ("back online Monday")
Team-wide budget$70/month → ~700 runs at $0.10/call1st of month UTC429 to ALL users + "contact ops to extend"
Slack alert80% of team budgetonce per crossingWebhook fires once when team_spend ≥ $56—early warning before hard cap

UI pills: "today X/Y" · "month X/Y" · "team ~X/Y runs left"—pills turn amber as you approach a cap, red when reached. The team pill converts dollars to estimated runs at the current per-call cost so users see a single shared ceiling in the units they care about.

Unlimited-users exception

The UNLIMITED_USER_EMAILS env var (comma-separated emails) bypasses every cap—the per-user daily and monthly limits and the weekend gate—for listed users. Calls still record into the counter tables and accrue cost for transparency in the usage log—they just never get blocked. Pierce is on this list because he funds the underlying API.

To tune the [vars] block: edit workers/variant/wrangler.toml [vars] (RATE_DAILY · RATE_MONTHLY · BUDGET_MONTHLY_USD · BUDGET_ALERT_FRACTION · EST_COST_PER_CALL_USD) and wrangler deploy.

UNLIMITED_USER_EMAILS lives as a Worker secret (not [vars]) because it lists colleague email addresses—committing those would violate anonymization. Update via echo '$LIST' | wrangler secret put UNLIMITED_USER_EMAILS; see DEPLOY.md §4 for rotation procedure.

If 0 variants returned

The tool tries up to 2 regeneration passes before giving up. If the pool is still empty (or short of N), the response carries a recovery_advice array—specific actionable guidance based on the dominant failure mode that pass.

The UI surfaces it as an amber-bordered "what to try" panel below the (empty) results, with a ↻ Regenerate button next to it. Examples of what the panel says depending on the failure pattern:

  • "Most candidates were rejected as too similar to your seed (cosine gate). Try with article body to unlock more varied phrasings."
  • "Most candidates failed the rule-based grader—your seed may be too short to extend into the 90–104 char target or contain banned patterns. Try lengthening the seed."
  • "You're requiring all 4 primary keywords in every variant—that's a strict bar. Move the least-essential keyword to 'secondary keywords' to widen the pool."
  • "Pairwise diversity rejected several candidates—your topic may have only a couple of viable facets. Try generating fewer (e.g., n=5) or remove an 'also avoid' entry."

Form state persists in sessionStorage—the operator can adjust inputs and regenerate without re-pasting everything.

My usage log

Every authenticated page load fetches the user's own usage from D1 (server-side, never localStorage) and populates the pills + the collapsible "My usage log" panel below them.

Personal section (CF-Access-authenticated email is the filter—users see only their own activity):

  • Per-day breakdown · last 30 days (calls / variants returned / cost in $)
  • Recent generations · last 30 (timestamp, headline preview, returned_n / requested_n, mode, request_id for forensics)

Team-wide visibility in the quota pills above the form: today's team total + this month's spend vs budget are surfaced as part of the standard quota line so concurrent operators see capacity without a separate panel. Per-user monthly breakdown is available via GET /metrics/quota-by-user if needed for ops review; not currently rendered in the UI.

Refreshes on every page load. The cap-enforcement counters are atomic in D1; the read endpoint is purely visibility—concurrent /generate calls cannot over-shoot the limits regardless of what this panel reports.

Architecture · Worker + D1 + Workers AI

Cloudflare Worker at workers/variant/ backed by D1 (shared with workers/history) + Cloudflare Workers AI (embeddings) + Anthropic API (generation + grading). Static page at docs/variant/.

ComponentTechnologyPurpose
Static pagePlain HTML/JS/CSS in docs/variant/Form + results UI; served by Cloudflare Pages
APICloudflare Worker (workers/variant/src/index.js)POST /generate-async (live client path) · /generate (legacy sync) · /select · /unselect · /outcome · /regenerate; GET /job/:id · /healthz · /seed/:fp/kept · /metrics/* · /export/*
DatabaseCloudflare D1 (data-headlines-history)Shared with workers/history. 18 variant_* tables (plus schema_versions for migration tracking) + v_active_kept view across migrations 0001–0014.
EmbeddingsWorkers AI @cf/baai/bge-large-en-v1.51024-dim L2-normalized vectors; cosine similarity gates. Free within neuron limits, attached to the same Cloudflare account.
Generation (rule-following)Anthropic Sonnet 4.6 (default; tunable via env var)n−1 publishable variants per call
Generation (wildcard)Anthropic Sonnet 4.6 (default; tunable)Exactly 1 off-pattern variant per call (capped via slice(0,1) in case the model returns more)
LLM grader (rule-following)Anthropic Haiku 4.5 (default; tunable)4-criterion grader · active_voice · no_lead_burial · curiosity · accurate
LLM grader (wildcard accuracy)Same model as the grader1-criterion safety net · accuracy only
AuthCloudflare Access (page) + X-Auth-Token shared bearer (writes)Page is gated by Access; CF-Access-Authenticated-User-Email header drives per-user identity. WRITE_TOKEN secret for /generate /select /outcome.
Rate limit / budgetD1 atomic counters (variant_user_quota_* + variant_team_budget)INSERT...ON CONFLICT DO UPDATE—serializable per-row, cannot over-shoot under concurrency.

Model dispatch is by env-var prefix: any model name starting with @cf/ routes to Workers AI via env.AI.run (free, account-bound); anything else (e.g. claude-sonnet-4-6) routes to Anthropic via fetch. Workers AI Llama 3.3 70B is a tested fallback for cost-down—quality is meaningfully worse on the rule-following generator (more clichés, weaker prompt adherence) but acceptable for the LLM grader role if needed.

Pipeline · stages in order

One POST /generate call traverses these stages. Up to 2 regeneration passes if the pool falls short.

  1. Validate body—headline ≥20 chars, primary keywords required, article ≤50,000 chars, format/persona/platform/site allowlisted, prompt-injection regex screen on the article body.
  2. Idempotency check—if client_request_id matches a recent (≤5min) generation for the same email, return its generation_id without re-running the pipeline.
  3. Rate-limit gate—read D1 counters for daily/monthly/budget. Skip caps if email is in UNLIMITED_USER_EMAILS.
  4. Anonymization scrub—load denylist from the DENYLIST_JSON Cloudflare Secret (cached per-isolate). Replace any matched name with […] in headline, article, primary/secondary keywords, also-avoid entries.
  5. Generate · rule-following + wildcard in parallel via callRuleFollowing + callWildcard. Retry once on transient (429/5xx) Anthropic + Workers AI failures.
  6. Embed · single Workers AI batch call · seed + each also-avoid + each candidate variant. On second-attempt failure, degrade to cosine=0 (rule-based grader still gates).
  7. Cosine gate · per-cohort ceilings (rule-following: 0.97 default / 0.93 article; wildcard: 0.95). Track every dropped candidate in allOfferedCandidates with dropped_at_gate='cosine'.
  8. Rule-based grader pre-check · char_count, formula, named-entity lead, primary keywords, banned patterns. Drop on any non-LLM rule failure with dropped_at_gate='grader_rule'.
  9. LLM grader · batched call for rule-following candidates only · 4 criteria. Wildcards skip this stage. On grader failure, degrade to rule-based-only score (LLM criteria null) instead of 502.
  10. Compose graded set · rule-following with merged LLM results (drop sub-85) + wildcards with rule-based-only score + accuracy safety-net call.
  11. Set-level diversity · formula cap (≤2 per formula). Article-mode adds pairwise cosine ≤0.85 + pairwise Jaccard ≤0.45 between rule-following variants.
  12. Regen loop · if pool < requestedN and passes < 2, re-prompt the LLM with explicit failure-mode feedback (which gate dropped the most variants this pass).
  13. Persist · D1 batch—generation row + every offered candidate (rule + wildcard, kept + dropped, with dropped_at_gate annotation) + back-compat write to legacy variant_offered_wildcards. Atomic.
  14. Record cost · atomic D1 counter increment (always—failed pipelines still count for DoS protection). 80%-budget Slack alert fires once per crossing.
  15. Return · variants + pipeline stats + quota snapshot + recovery_advice if pool short.

Model assignments

All three LLM roles are env-var-driven. Edit workers/variant/wrangler.toml [vars] block + wrangler deploy to swap any role. Dispatch by prefix: @cf/... → Workers AI (free), anything else → Anthropic.

RoleEnv varCurrent defaultWhy this model
Rule-following generatorMODEL_RULE_FOLLOWINGclaude-sonnet-4-6Strong prompt-following—the full credit-worthy formula taxonomy (about two dozen) + scope discipline + multi-facet diversity is genuinely hard for smaller models. Tested Llama 3.3 70B; quality drop was significant (cliché flood, weaker adherence).
Wildcard generatorMODEL_WILDCARDclaude-sonnet-4-6Was Opus 4.7 (cost-heavy); was Llama 3.3 70B (quality dropped + multi-wildcard issues); Sonnet 4.6 is the right cost/quality balance for a single creative variant per call.
LLM graderMODEL_GRADERclaude-haiku-4-5-20251001Classification task with loose default-PASS prompts; Haiku is cheap and reliable. Tested Llama 70B; Haiku's structured-JSON output is more dependable.
Embeddings(hardcoded)@cf/baai/bge-large-en-v1.5Free on Workers AI, 1024-dim, performs well on news-text similarity. Same model the CSA Diff Tool uses for parity.

Cost model. ~$0.06–$0.10 per /generate call on the current Sonnet+Sonnet+Haiku stack. Recent observed range $0.057–$0.099 (varies with article-body length + regen passes).

D1 schema · 18 tables + 1 view

Single D1 database (data-headlines-history) shared with workers/history. Migrations 0001–0014 applied; current schema_version is 14. Migration 0004 added the kept toggle (unkept_at + seed_fingerprint + v_active_kept view); 0006/0007/0008 added partial unique indexes; 0009 lowercased legacy user_email values for case-consistent matching; 0010 added embed_degraded sentinel column on offered_candidates + offered_wildcards and widened the offered_candidates UNIQUE to (generation_id, cohort, variant_text); 0011 widened the variant_selections UNIQUE to match (cohort included) and added partial UNIQUE(is_active=1) on rules_versions + prompts; 0012 added the variant_jobs async-job queue (no version stamp—table-only, so MAX(version) jumps 12→13 across the 0012/0013 pair); 0013 tightened the article-mode pairwise gates; 0014 added the headline-experiment data layer (variant_rule_sets + variant_config_specs, plus rule_set_id/spec_hash/headline_type columns on generations + offered_candidates).

TablePurposeKey columns
variant_generationsOne row per /generate call. Audit + reproducibility tuple.request_id · user_email · headline · calibration_id · prompt_*_id · model_* · rules_source_sha · denylist_version · article_mode · returned_n · est_cost_usd · duration_ms · primary_keywords · secondary_keywords · client_request_id · generated_at_iso
variant_offered_candidatesEvery candidate offered (rule + wildcard, kept + dropped)—canonical surface for the cohort study (added migration 0003).generation_id · cohort · variant_text · similarity_to_input · grader_score · grader_breakdown · fingerprint_json · picked · dropped_at_gate · pass_index · created_at_iso
variant_offered_wildcardsLegacy wildcard-only table from migration 0001. Worker dual-writes for back-compat with older cohort analyzers. Migration 0008 added a UNIQUE(generation_id, variant_text) so the dual-write INSERT OR IGNORE is race-safe.(same as offered_candidates filtered to wildcard cohort)
variant_selectionsOne row per /select. UNIQUE(generation_id, variant_text)—double-click can't double-insert.generation_id · variant_text · cohort · grader_score · grader_breakdown · n_offered_in_set · rank_in_set · selected_at_iso
variant_fingerprintsStructural + lexical features per selected variant (1:1). Powers cohort clustering.selection_id · char_count · leading_pos · primary_verb · formula_match · bigram_signature · trigram_signature · pos_sequence_signature · cluster_key · syllable_count · named_entities_json · dependency_skeleton
variant_outcomes_joinPost-publish outcome metrics joined to selection. UPSERT on selection_id with COALESCE so partial enrichments accumulate.selection_id · url · publication · publish_date · headline_in_production · edit_distance_to_pick · total_pvs · search_pvs · social_pvs · applenews_pvs · smartnews_pvs · ctr · dwell_seconds_p50 · scroll_depth_p50 · ecpm_usd · best_position · outcome_source · matched_at_iso
variant_calibrationsVersioned numeric pipeline knobs. Generation rows reference the active row.score_floor · cosine_gate_input_default · cosine_gate_input_article · cosine_gate_wildcard · pairwise_cosine_max · pairwise_jaccard_max · formula_per_set_max · char_min · char_target_min · char_target_max · max_regen_passes · is_active
variant_promptsSHA-256-keyed prompt registry. One row per (mode, prompt_hash). Auto-seeded by the worker.mode · prompt_hash · prompt_text · applied_at_iso · is_active
variant_modelsProvider + pricing snapshot per model identifier ever seen.name · provider · cost_input_per_m_usd · cost_output_per_m_usd · first_seen_at_iso · is_active
variant_rules_versionscsa-content-standards SHA bundled at deploy time (via sync_rules.py).source_repo · source_sha · generated_at_iso · is_active
variant_rule_setsHot-swappable experimental rule arms for the headline experiment (added migration 0014). Active-row pattern (single-active partial UNIQUE)—the worker reads the active row fail-open at generation time. Holds the CTRL/EMG/FWD framing definitions + signal-word menus + the expert directive in rule_set_json.label · is_active · applied_at_iso · rule_set_json · notes
variant_config_specsImmutable, content-addressed config snapshots—the per-headline "receipt" (added migration 0014). One row per distinct config; spec_hash = SHA-256 of the canonicalized spec JSON; INSERT OR IGNORE so it stays small. A generation's spec_hash dereferences here to the exact config that produced it.spec_hash (PK) · spec_json · created_at_iso · label
variant_format_candidatesCluster of recurring wildcard structures surfaced by analyze_variant_cohort.py. Status enum: surfaced → under_review → promoted / rejected / retired.cluster_key · n_wildcards_offered · n_wildcards_picked · cluster_pick_rate · lift_vs_global · pv_median_ratio_point · status · promoted_format_name · csa_standards_sha
variant_format_eventsAppend-only audit log of every status transition (governance trail).candidate_id · ts_iso · prev_status · new_status · actor · detail
variant_user_quota_dailyAtomic per-user per-day counter (PRIMARY KEY user_email + ymd).user_email · ymd · count
variant_user_quota_monthlyAtomic per-user per-month counter.user_email · ym · count
variant_team_budgetAtomic team-wide spend counter + idempotent 80%-alert flag.ym · spend_usd · alert_80_fired · alert_100_fired
variant_jobsAsync-job lifecycle queue for /generate-async (added migration 0012). One row per job; status queued → running → complete | failed with fine-grained stage tracking. Stores the request body (replay) + the full result_json on complete; rows >7 days are reaped.job_id (PK) · user_email · status · stage · request_body · result_json · error_code · created_at

Portability. Every *_at epoch_ms field has an *_iso mirror column populated at write time. Indexes cover every filter/group-by surface (calibration, prompt, model, rules SHA, cluster_key, total_pvs, publish_date, etc.). Foreign keys declared; PRAGMA foreign_keys = ON set per-request in the worker.

Versioning · calibrations / prompts / models / rules

Every variant_generations row carries the full reproducibility tuple—what numeric calibration was active, which exact prompt template was sent, which model handled each role, which csa-content-standards SHA the rule corpus came from, which denylist version was loaded.

Why this matters. The variant tool will be tuned over time. Score floors will move, prompts will be loosened or tightened, models will be swapped. A/B comparisons across versions become trivial SQL: WHERE calibration_id = X vs Y. Rolling improvements are measurable rather than asserted.

How a new calibration ships

  1. Edit constants in workers/variant/src/index.js (or wrangler.toml env vars for the LLM models).
  2. Run wrangler deploy.
  3. On next request, ensureMetadata() reads the active variant_calibrations row. If the new constants don't match, INSERT a new calibration row + deprecate the prior. Worker stamps the new ID onto subsequent generations.
  4. Compare A/B over a few days: SELECT calibration_id, AVG(returned_n), AVG(est_cost_usd), AVG(rejected_grader) FROM variant_generations WHERE generated_at > ? GROUP BY calibration_id.

How a new prompt ships

Edit the prompt template in workers/variant/src/llm.js. Worker computes SHA-256 of the system prompt at request time. INSERT-OR-IGNORE into variant_prompts (idempotent on hash). Subsequent generations reference the new prompt_id; A/B comparison via WHERE prompt_rule_id = X.

Cohort study + format-promotion ledger

The wildcard cohort is the long-term feedstock for surfacing new headline standard formats—recurring wildcard structures that the operator consistently picks (and that perform once published) become candidates for promotion to the rule library.

The pipeline

  1. Collect · every offered candidate persists with structural fingerprint (cluster_key = formula × leading_pos × first 3 POS tokens). Both kept + dropped wildcards land in variant_offered_candidates.
  2. Cluster · scripts/analyze_variant_cohort.py groups wildcards by cluster_key. Per cluster: count offered, count picked, pick rate, performance lift if outcomes joined. Aborts when /export/* is truncated at 5000 rows so reports are never silently biased.
  3. Surface candidates · clusters with significantly higher pick rate or PV lift than the rule-following baseline land in variant_format_candidates with status='surfaced'.
  4. Review · operator (Pierce + content team lead) reviews surfaced candidates. Status moves to under_reviewpromoted / rejected.
  5. Promote · promoted clusters are codified into csa-content-standards as new headline format guidelines. The promotion event records the csa-content-standards SHA at promotion time, so a reviewer can diff exactly what landed in the standards docs because of this candidate.
  6. Audit · variant_format_events append-only log captures every status transition (who promoted what when, against which standards SHA).

What's shipped vs planned

  • Shipped: migrations 0003-0014 schema (variant_offered_candidates · variant_format_candidates · variant_format_events · v_active_kept · UNIQUE indexes · lowercased user_email · embed_degraded sentinel · cohort-aware UNIQUE on offered_candidates + variant_selections · partial UNIQUE(is_active=1) on rules_versions + prompts · the variant_jobs async queue · the headline-experiment data layer variant_rule_sets + variant_config_specs with per-call/per-headline attribution columns). Worker writes every offered candidate's fingerprint, including diversity-dropped candidates. scripts/analyze_variant_cohort.py computes cluster pick-rates + bootstrap CIs + edit-distance tail; aborts cleanly on /export/* truncation rather than silently bias; non-zero-baseline guards on candidate-format surfacing; isinstance NaN guards on CI bounds; html.escape on operator-supplied report content.
  • Planned: weekly cron to refresh format-candidate rankings; a UI surface for reviewing surfaced candidates; promotion-to-standards integration.

Anonymization · DENYLIST_JSON

The variant tool reads the colleague-name denylist from a Cloudflare Secret (DENYLIST_JSON) at request time—never bundled into the build artifact, never committed to git. Per-isolate cache so we re-parse the secret only on cold start.

Format:

{"version":"2026-05-07","names":["<colleague-1>","<colleague-2>","..."]}

(Legacy: a bare JSON array of strings is also accepted.)

Where it scrubs:

  • Pre-call · headline, article body, primary + secondary keywords, also-avoid entries—all scrubbed before they reach the LLM. Names replaced with […] placeholders.
  • Post-call · LLM outputs scrubbed before they reach the user (defense in depth—the LLM shouldn't introduce new colleague names but if it echoes one that survived input scrub via partial match, this catches it).
  • Hard reject · any candidate variant containing a denylist name post-scrub is dropped from the pool entirely (treated as a denylist failure).

The denylist version stamps onto every variant_generations row (denylist_version column) for reproducibility—if a name is added or removed from the denylist, the version stamp lets you scope queries to before/after the change.

Pierce is the only permitted name across all repos; the denylist excludes him. To set / update:

cd workers/variant && wrangler secret put DENYLIST_JSON < /tmp/denylist.json

Deployment · secrets + migrations

Detail in workers/variant/DEPLOY.md. Summary here.

One-time setup

  1. Apply migrations to remote D1 in order, 00010014 (0001_variant_tables.sql0011_selections_cohort_unique_and_active_partials.sql0012_async_jobs.sql0013_tighten_article_pairwise_gates.sql0014_headline_experimentation.sql). Or use npm run migrate:all from workers/variant/. Then run the one-shot seed 0014_seed_phase1_rule_set.sql ONCE after migrating (it's a deliberate operator step, NOT in the migrate chain; idempotent via WHERE NOT EXISTS). Confirm the schema_version: 14 stamp via /healthz.
  2. Set worker secrets: WRITE_TOKEN (32-hex shared bearer for /generate /select /outcome), ANTHROPIC_API_KEY (McClatchy-billed), DENYLIST_JSON (anonymization), UNLIMITED_USER_EMAILS (comma-separated emails that bypass quota; secret because it lists colleague addresses), SLACK_WEBHOOK_URL (optional, for 80%-budget alerts).
  3. Bundle the rule corpus + denylist before first deploy: python3 scripts/sync_rules.py. (DENYLIST_JSON is runtime-loaded; only RULES.js is bundled.)
  4. Add /variant/ to the existing CF Access app on data-headlines.pierce.tools (already covered by the data-headlines.pierce.tools Access policy by default).

Per-deploy

cd workers/variant
python3 scripts/sync_rules.py       # refresh RULES.js from csa-content-standards
wrangler deploy                     # ship Worker
curl -s https://data-headlines-variant.pierce-williams.workers.dev/healthz | jq

The /healthz response surfaces: schema_version, bindings (ai/db/anthropic_key—the legacy KV binding was retired in migration 0003), active models, rules_loaded, denylist configuration + version, active calibration row. /healthz/deep additionally pings Workers AI for a live latency check. All other GET endpoints require X-Auth-Token.

API endpoints

Method · PathAuthPurpose
GET /healthzNoneDB ping + bindings + active calibration + denylist status
GET /healthz/deepNoneSame + a live env.AI ping (latency + reachability)
POST /generate-asyncX-Auth-Token + CF AccessThe live client path. Enqueues a generation job and returns {job_id, status:'queued', request_id} at HTTP 202 in <1s; the worker runs the full pipeline in the background (Cloudflare Queue consumer, or ctx.waitUntil() fallback) so the client never hits CF's 30s wall-clock cap. Same body as /generate.
GET /job/:idX-Auth-Token + CF AccessOwner-scoped poll of an async job (id = the v4 UUID job_id). Returns stage (queued → generating → regenerating) + terminal status (complete | failed) + created_at/started_at/completed_at; on complete, result is the full /generate response body inline (no second fetch). A poll of a dead/stale job reaps it to failed with a recovery message.
POST /generateX-Auth-Token + CF AccessLegacy synchronous full pipeline—runs the whole thing under one request (superseded by generate-async + poll for the live client). Body: headline, primary_keywords, secondary_keywords, article, n, format/persona/platform/site, also_avoid, client_request_id
POST /regeneratesameAlias for /generate (client passes a fresh nonce; idempotency guard ensures no double-billing)
POST /selectsameKeep variant. UNIQUE(generation_id, variant_text)—double-click safe. Returns idempotent: true on duplicate; re_kept: true when toggling on a previously-soft-deleted row (clears unkept_at). Backfills offered + fingerprint rows if /generate's batch 2 had failed.
POST /unselectsameSoft-delete kept variant. Body accepts either seed_fingerprint (preferred—soft-deletes ALL active rows for this user across generations) or generation_id (legacy). User-scoped—one operator can't unkeep another's row; returns kept_by_other_user noop instead.
POST /outcomesameLink selection → published article + outcome metrics (PVs by surface, CTR, dwell, scroll, eCPM). UPSERT with COALESCE—partial enrichments accumulate. Rejects soft-deleted selections (409 ERR_SELECTION_UNKEPT). publish_date round-trip-validated against the parsed Date so impossible calendar dates are caught.
GET /seed/:fingerprint/keptX-Auth-Token + CF AccessCross-generation hydration. Returns active kept variants for a seed_fingerprint, joined across all generations sharing that fingerprint. Used by the client to paint Kept ✓ on cards after a regenerate.
GET /metrics/quota-mineX-Auth-Token + CF AccessCaller's quota snapshot (read from CF Access email header)
GET /metrics/my-history?days=NX-Auth-Token + CF AccessPer-day breakdown + recent 30 generations for the calling user
GET /metrics/quota-by-userX-Auth-Token + CF Access (or ?token=<WRITE_TOKEN> for admin scripts)Admin: per-user breakdown for today + month + team budget
GET /metrics/cohort-countsX-Auth-Token + CF AccessSelections by cohort: lifetime decisions AND currently-active. Plus wildcards offered/picked + ever-pick rate.
GET /metrics/diversity-rejection-rate?since=X-Auth-Token + CF AccessPipeline rejection-rate aggregates over a window. Validates ?since=.
GET /export/offered-candidates?since=X-Auth-Token + CF AccessEvery offered candidate (cohort + dropped + picked)—canonical cohort study export. Surfaces truncated:bool when LIMIT 5000 hits.
GET /export/wildcards-offered?since=X-Auth-Token + CF AccessLegacy wildcards-only export (back-compat with older analyzers).
GET /export/selections?since=&active_only=&include_emails=&token=X-Auth-Token + CF AccessSelections + fingerprints + outcomes joined. Default returns ALL decisions (including soft-deleted) for cohort counting; pass active_only=1 to filter to currently-kept. user_email is redacted to NULL by default; pass include_emails=1&token=<WRITE_TOKEN> for trusted callers.

Troubleshooting

SymptomLikely causeFix
Pills show "?/?" forever on page load/metrics/quota-mine fetch failed (worker down or Access misconfig)Open browser devtools · Network · refresh; check /metrics/quota-mine response. If 401, CF Access policy doesn't cover the user's email.
0 variants returned, no recovery panelPipeline error before regen loop (embed failure, validation failure)Check response code/error fields. ERR_INJECTION = article body had a prompt-injection pattern; rephrase. ERR_VALIDATION = headline/keyword length issue.
Variants all cluster on one angleDefault mode (no article body); diversity gates inactivePaste the full article body to switch to article-mode (pairwise diversity gates kick in).
"persist_failed: true" on responseD1 batch insert failed but variants still returnedVariants are usable; cohort study just won't capture this run. Copy a variant manually to clipboard. Check D1 status; retry the next generation.
Variants contain a colleague's nameDENYLIST_JSON secret unset or stalewrangler secret put DENYLIST_JSON. /healthz shows denylist.configured.
429 ERR_RATE_DAILY before lunchDaily cap hit (10/day default)Wait until 00:00 UTC OR (admin) reset the user's row in variant_user_quota_daily OR add the email to UNLIMITED_USER_EMAILS.
ERR_WEEKEND on a Saturday or SundayTried on a weekend—the tool is weekday-only (Mon–Fri ET) for capped usersIt's back online Monday. For urgent off-hours access, add the email to UNLIMITED_USER_EMAILS (unlimited users bypass the weekend gate).
429 ERR_BUDGETTeam-wide budget exhausted ($70 default)Edit BUDGET_MONTHLY_USD in wrangler.toml + redeploy. Or wait for the 1st of the month.
Wildcard never appears in resultsWildcard failed cosine OR rule-based pre-check OR accuracy safety netCheck ?debug=1 response. If wildcard cosine repeatedly >0.95, the seed is too unique to wildcardize (rare). If accuracy fails, the wildcard may be hallucinating; review with editorial eye.
Help bubbles render as empty boxesCSS not loadedHard-refresh; cache-bust URL stamp on variant.css.

Variant-specific glossary

Cohort

Either rule_following (n−1 publishable variants per call, full grader floor) or wildcard (1 off-pattern variant per call, hard rules + accuracy safety-net only). Stored on variant_offered_candidates.cohort.

Cluster key

Compact structural fingerprint computed from formula × leading_pos × first 3 POS tokens (e.g. 'Possessive | noun | NN-NN-VB'). Used as the cohort study's primary GROUP BY for surfacing recurring wildcard structures.

Calibration

A bundled snapshot of every numeric pipeline knob (score_floor · cosine gates · pairwise gates · char range · regen passes). Stored as a row in variant_calibrations; the active row's id stamps onto every generation.

Prompt hash

SHA-256 hex of a system prompt template at the moment of dispatch. Stored in variant_prompts with the full template text. Lets you A/B prompt revisions without losing history.

Article-mode

Pipeline regime when the operator pastes the article body. Activates pairwise diversity gates between rule-following variants (cosine ≤ 0.85, Jaccard ≤ 0.45) and switches the LLM grader's accuracy criterion to "judge against the article" (more permissive). Mode pill in the results header tells the operator which regime is active.

Recovery advice

Specific actionable guidance appended to /generate responses when the pool is short of N. Surfaced in the UI as an amber-bordered panel with concrete advice ("move keyword X to secondary," "try without an article body," etc.) and a regenerate button.

Idempotency replay

If client_request_id matches a prior generation by the same email within 5 minutes, the worker returns the prior generation_id without re-running the pipeline. Prevents double-billing on transient client retries.

Unlimited-users exception

Comma-separated emails in the UNLIMITED_USER_EMAILS env var. Listed users bypass all daily/monthly/team-budget caps. Calls still record into the counter tables for transparency. Pierce is on this list (he funds the underlying API).