What this console is
A private dashboard at data-headlines.pierce.tools that turns published-headline performance into editorial decisions.
Every article McClatchy publishes lands in the Tracker (the content team lead's master sheet). Snowflake enriches that with platform-level page-view data—Apple News, SmartNews, Yahoo, Newsbreak, search, social, subscriber, newsletter. Tarrow adds platform-syndication detail. The console reads from that joined surface and asks one question per tile: which combinations of pub, topic, formula, vertical, and writer beat their baseline by enough to act on?
It's the analytics counterpart to the Keywords Decisioning Console: keywords decides what to write next; headlines learns from what was written.
Who uses each surface
| Role | Primary surface | What they do |
|---|---|---|
| Content strategist | Findings + Grader | Daily headline grade + weekly tile review |
| Content team lead | Findings + Authors | Per-author percentile review, formula coaching |
| Anyone post-publish | Grader page | Review per-headline scorecards from today's automated run |
| New team member | Docs | Read this; then ask questions. |
Concepts 101 · zero-knowledge primer
The console reuses a handful of recurring terms. If anything below is unfamiliar, start here.
Page-views (PVs / total_pvs)
The total times an article was loaded across every distribution surface in the period (Apple News + SmartNews + Yahoo + Newsbreak + search + social + direct + newsletter + AMP + subscriber). Sourced from STORY_TRAFFIC_MAIN in Snowflake.
Lift
A multiplier comparing one group's median PVs to a baseline. +50% means the group's median article hit 1.5× the baseline. Negative = below baseline. The baseline depends on context: usually the publication's overall median, sometimes the vertical median, sometimes the untagged-formula baseline.
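As a minimal sketch of the definition above (not the console's actual code), lift can be computed as the relative gap between two medians. The function name and the sample page-view numbers are illustrative assumptions.

```python
from statistics import median

def lift(group_pvs, baseline_pvs):
    """Relative difference between the group median and the baseline median."""
    base = median(baseline_pvs)
    if base == 0:
        raise ValueError("baseline median is zero; lift is undefined")
    return (median(group_pvs) - base) / base

# Hypothetical page-view counts, for illustration only.
baseline = [800, 1000, 1200]   # median 1000
group = [1400, 1500, 1600]     # median 1500
print(f"{lift(group, baseline):+.0%}")  # prints +50%
```

A result of 0.5 reads as +50%, i.e. the group's median article hit 1.5× the baseline.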
Median (not mean)
The console reports medians everywhere because page-view distributions are heavy-tailed—one viral hit can shift a mean by 5×. Medians are stable; viral outliers don't distort the signal.
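A small illustration of why the median is the steadier statistic, using made-up numbers: nine ordinary articles plus one viral hit.

```python
from statistics import mean, median

# Hypothetical page views: nine ordinary articles and one viral outlier.
pvs = [1000] * 9 + [41000]

typical_by_mean = mean(pvs)      # pulled up to 5x the ordinary article
typical_by_median = median(pvs)  # still the ordinary article's level

print(typical_by_mean, typical_by_median)
```

Here the single outlier moves the mean from 1000 to 5000 while the median stays at 1000.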
Headline formula
A regex-classified template the headline matches. Priority-ordered lead-position patterns (number_lead, heres, quote_lead, name_lead) then structural cues (did_you_miss, what_to_know, explainer, definitional, how_to, question, dash_directive, superlative_list, comparative, negation_lead, curiosity_gap) and a length-based structural fallback (short_punchy, descriptive_standard, descriptive_long) so every headline buckets. See the full taxonomy in the glossary.
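The priority-ordered matching with a length fallback could look roughly like the sketch below. The regexes, the subset of formulas covered, and the length thresholds (50 and 90 characters) are all hypothetical; the console's real patterns live in its own classifier and are not reproduced in this doc.

```python
import re

# Hypothetical patterns for a subset of the taxonomy, checked in priority order:
# lead-position patterns first, then structural cues.
LEAD_PATTERNS = [
    ("number_lead", re.compile(r"^\d+\b")),
    ("heres", re.compile(r"^here(?:'s| is| are)\b", re.IGNORECASE)),
    ("quote_lead", re.compile(r"^[\"']")),
]
STRUCTURAL_PATTERNS = [
    ("what_to_know", re.compile(r"\bwhat to know\b", re.IGNORECASE)),
    ("how_to", re.compile(r"^how to\b", re.IGNORECASE)),
    ("question", re.compile(r"\?\s*$")),
]

def classify_formula(headline, short_max=50, long_min=90):
    """Return the first matching formula, else a length-based fallback bucket."""
    text = headline.strip()
    for name, pattern in LEAD_PATTERNS + STRUCTURAL_PATTERNS:
        if pattern.search(text):
            return name
    # Fallback so every headline lands in some bucket (thresholds are assumed).
    if len(text) <= short_max:
        return "short_punchy"
    if len(text) >= long_min:
        return "descriptive_long"
    return "descriptive_standard"

print(classify_formula("Here's What to Know About the Storm"))
```

Because the list is ordered, a headline that matches both a lead-position pattern and a structural cue is tagged with the lead-position formula.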
Vertical
One of three named TH-channel verticals: Mind/Body, Everyday Living, Experiences. Articles published outside these channels (search/discover work) aren't bucketed.
Content type
Either News & Regional (T1 newspapers—Miami Herald, KC Star, Charlotte Observer, etc.) or L&E (US Weekly, Woman's World, and Life & Style). The active lens at the top of Findings restricts every tile to one or the other. (InTouch and Closer are retired L&E brands the console recognizes only for News-vs-Entertainment tag coloring—they aren't in the active Findings cohort.)
Verdict
Each tile assigns each row a verdict: go (real lift, strong evidence), test (directional, worth piloting), or skip (below baseline or thinly evidenced). Color-coded green / amber / coral on the tile rail and pills.
P-value
The probability of seeing a lift at least this large by chance if the group were drawn from the baseline distribution. The console reads p<0.05 as a real signal and p<0.10 as directional; anything higher is treated as no detectable difference. Computed via the Mann-Whitney U test (non-parametric: no assumption of normal distributions).
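For readers who want to see what the test does, here is a self-contained sketch of a two-sided Mann-Whitney U test using the normal approximation with a tie correction. It is an illustration only: the console's own implementation is not shown in this doc and may use a statistics library or exact p-values for small samples.

```python
import math

def mann_whitney_u(group, baseline):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U statistic for `group`, approximate p-value).
    """
    n1, n2 = len(group), len(baseline)
    pooled = sorted([(v, 0) for v in group] + [(v, 1) for v in baseline])
    ranks = [0.0] * len(pooled)
    tie_term = 0
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1          # tied values share the average rank
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    rank_sum = sum(r for r, (_, label) in zip(ranks, pooled) if label == 0)
    u = rank_sum - n1 * (n1 + 1) / 2
    n = n1 + n2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0
    z = (u - n1 * n2 / 2) / math.sqrt(var_u)
    return u, math.erfc(abs(z) / math.sqrt(2))

# Hypothetical page views: every group article beats every baseline article.
u_stat, p_value = mann_whitney_u(list(range(11, 21)), list(range(1, 11)))
print(u_stat, p_value < 0.05)
```

The test only compares ranks, which is why a single viral outlier cannot manufacture significance on its own.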
The lens filter (News / L&E / All)
Three pills at the top of Findings. Active pill governs the cohort (a cohort = a group of publications sharing a content type) + primary metrics across every tile below.
| Lens | Cohort | Primary metric framing |
|---|---|---|
| All publications | Full active cohort (count is computed at runtime + shown next to the pill) | Aggregate; useful for portfolio audits |
| News & Regional | T1 newspapers—Miami Herald, KC Star, Charlotte Observer, Fort Worth Star-Telegram, Sacramento Bee, Raleigh News & Observer, Centre Daily Times, Idaho Statesman, plus the rest of the McClatchy T1 portfolio | CTR + Search Ranking. Editorial truth: straight headlines outperform punny ones on Google News/Discover. |
| Lifestyle & Entertainment | US Weekly, Woman's World, and Life & Style (InTouch + Closer are retired brands, recognized only for tag coloring) | Social Shares + Scroll Depth. Editorial truth: track the curiosity gap—questions kept open beat answers spoiled. |
Findings: the 7 tiles
The home page (/findings/) renders seven analytical tiles. Each tile is a self-contained question + answer; click anywhere on a tile to open its drawer with the underlying data.
- Per-publication: top-performing topics—which topics each pub hits hardest.
- Per-publication: top-performing formulas—which headline patterns earn lift at each pub.
- Per-vertical formula × site—three TH-channel rails (M/B, EL, Exp) with cluster batting average.
- Trends Over Time—per-pub PV trajectory by quarter.
- Formula × Topic × Publication—three-way interaction including weather as a topic dim.
- Bottom-Performer Patterns—confirmed anti-patterns to avoid.
- Team Performance—per-author percentile vs publication median.
Above the tiles sit the lens pills (News / L&E / All) and a filter-chip row that narrows every tile at once to one content lane, publication type, or headline formula. Several tiles (and the home page) also carry a recent top-performers panel you can rank by traffic or by reader-engagement.
Tile · Per-publication topics
For each publication, the topics that exceed the pub's median PVs by the largest margin. Lift = (topic median − pub median) ÷ pub median. n = articles in the topic at that pub during the lookback window.
Use: assignment prioritization. If Politics at Miami Herald is +34% with n=120 and p<0.05, that's a durable signal—the next politics piece probably outperforms a same-pub baseline article.
Caveat: topic mix shifts seasonally. Lift numbers in the per-pub formula + vertical-cube tiles default to recency-weighted (last 12 months, decayed by ~0.95^weeks-since-publish ≈ 3-month half-life) so "what's working now" beats "what worked over the year."
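The decay arithmetic can be checked directly: a weight of 0.95 per week halves after about 13.5 weeks, which is the roughly 3-month half-life quoted above. The weighted median below is one plausible way to apply those weights; whether the console uses a weighted median or another weighted statistic is an assumption here, and the sample data is made up.

```python
import math

DECAY_PER_WEEK = 0.95  # from the ~0.95^weeks-since-publish rule

def recency_weight(weeks_since_publish):
    return DECAY_PER_WEEK ** weeks_since_publish

# Weeks until an article's weight drops to one half.
half_life_weeks = math.log(0.5) / math.log(DECAY_PER_WEEK)

def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half the total."""
    half = sum(weights) / 2
    running = 0.0
    for value, weight in sorted(zip(values, weights)):
        running += weight
        if running >= half:
            return value
    raise ValueError("weighted_median needs at least one value")

# Hypothetical articles: (page views, weeks since publish).
articles = [(3000, 1), (1000, 40), (1200, 45), (900, 50)]
pvs = [a[0] for a in articles]
weights = [recency_weight(a[1]) for a in articles]
recent_view = weighted_median(pvs, weights)
print(round(half_life_weeks, 1), recent_view)
```

In this example the one recent article outweighs three older ones combined, so the recency-weighted view reports 3000 where the unweighted median would sit near 1100.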
Tile · Per-publication formulas
Same shape as Per-publication topics, but classifying by headline formula instead of topic. The baseline is the pub's overall median PVs across all headlines; lift = (formula median − pub median) ÷ pub median. Every headline buckets into one of the formulas in the taxonomy (the length-based structural fallback catches anything without a signature pattern), so there's no untagged baseline.
News editorial truth: direct-declarative + quote-lead likely outperform wordplay on local news in Google News/Discover. The data confirms or revises that prior per publication.
L&E editorial truth: curiosity-gap likely beats spoiled-answer. Confirmed on UsW with +42% vs −35% (n=19 / n=12, p<0.05).
Tile · Per-vertical formula × site cube
Three rails—Mind/Body, Everyday Living, Experiences—each showing the formula × site combinations with the largest lift vs vertical median.
The cluster batting average in the meta line answers: of all CSA-touched articles (those with a cluster_id), what fraction sit in a cluster with a positive cluster_vs_co_median? It's article-level, so larger clusters count more (more articles = more votes). Reported as "1 in N"—lower N = better.
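A sketch of the article-level calculation, assuming "positive" means cluster_vs_co_median > 0 and that the "1 in N" label is the rounded reciprocal of the fraction. Both are assumptions about details this doc does not spell out, and the sample rows are invented.

```python
def cluster_batting_average(articles):
    """Share of clustered articles whose cluster beats the co-median.

    Each article is a dict with `cluster_id` and `cluster_vs_co_median`.
    Counting per article means larger clusters carry more votes.
    """
    touched = [a for a in articles if a.get("cluster_id")]
    if not touched:
        return None
    winners = sum(1 for a in touched if a["cluster_vs_co_median"] > 0)
    return winners / len(touched)

def as_one_in_n(fraction):
    if not fraction:
        return "n/a"
    return f"1 in {round(1 / fraction)}"

# Hypothetical data: cluster A (1 article, positive), cluster B (3 articles, negative),
# plus one article with no cluster, which is ignored.
rows = [
    {"cluster_id": "A", "cluster_vs_co_median": 0.3},
    {"cluster_id": "B", "cluster_vs_co_median": -0.2},
    {"cluster_id": "B", "cluster_vs_co_median": -0.2},
    {"cluster_id": "B", "cluster_vs_co_median": -0.2},
    {"cluster_id": None, "cluster_vs_co_median": None},
]
average = cluster_batting_average(rows)
print(as_one_in_n(average))  # prints 1 in 4
```

Counted per cluster this would be 1 of 2; counted per article it is 1 in 4, because the three-article cluster casts three votes.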
Use: formula assignment at the vertical level, or evaluating whether a creator's cluster strategy is paying off.
Tile · Trends Over Time
Per-publication median PVs by quarter. Lift trend = (latest quarter median − first quarter median) ÷ first quarter median.
Tile · Sweet-spot finder (pub × topic × formula)
A topic might look flat at a pub overall, but a specific formula can unlock it. This tile finds those three-way sweet spots so the single-axis Per-publication topics and Per-publication formulas tiles don't bury them.
What it shows
Every (publication, topic, formula) combination with n ≥ 3 articles, scored by lift vs that publication's median PVs. Sorted by lift descending. Verdict overlays the same go / test / skip rule as everywhere else. The cube is the most fine-grained tile in the console — at small corpus sizes it surfaces only the densest 1-3 cells; as the article volume grows it fills out.
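The grouping described above can be sketched as follows. Field names (`pub`, `topic`, `formula`, `pvs`) and the sample rows are hypothetical, and the verdict overlay is left out because its exact rule is not given here.

```python
from collections import defaultdict
from statistics import median

def sweet_spots(articles, min_n=3):
    """Lift of each (pub, topic, formula) cell vs its publication's median PVs."""
    pub_pvs = defaultdict(list)
    cell_pvs = defaultdict(list)
    for a in articles:
        pub_pvs[a["pub"]].append(a["pvs"])
        cell_pvs[(a["pub"], a["topic"], a["formula"])].append(a["pvs"])

    rows = []
    for (pub, topic, formula), pvs in cell_pvs.items():
        base = median(pub_pvs[pub])
        if len(pvs) < min_n or base == 0:
            continue  # too thin to score, or no usable baseline
        rows.append({
            "pub": pub, "topic": topic, "formula": formula,
            "n": len(pvs), "lift": (median(pvs) - base) / base,
        })
    return sorted(rows, key=lambda r: r["lift"], reverse=True)

def _rows(pub, topic, formula, pvs_list):
    return [{"pub": pub, "topic": topic, "formula": formula, "pvs": p} for p in pvs_list]

# Hypothetical corpus for one publication; the weather cell has n=2 and is dropped.
corpus = (
    _rows("A", "politics", "quote_lead", [2000, 2200, 2400])
    + _rows("A", "sports", "question", [500, 600, 700])
    + _rows("A", "weather", "how_to", [1000, 1000])
)
ranked = sweet_spots(corpus)
print([(r["topic"], r["formula"], r["n"]) for r in ranked])
```

With so few articles only the densest cells survive the n ≥ 3 filter, which matches the note that the cube fills out as volume grows.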
How to use it
- Look for combinations marked go with high n. Those are the assignable sweet spots—pub + topic + formula triples that beat the baseline reliably.
- If a topic shows flat or negative on the per-pub topics tile, scan here before writing it off—the right formula may rescue it.
- Use the highlighted rows to directly compare editorial-principle pairs (e.g. curiosity-gap vs spoiled-answer at UsW).
Weather signal sidecar
The right pane shows the 30-day weather lift on T1 newspapers (median PVs of weather-classified articles vs the T1 baseline) plus the count. Weather is included in the main table as a topic dimension; the sidecar separates it so seasonality is visible at a glance.
Tile · Bottom-Performer Patterns
Anti-patterns: combinations whose median PVs sit ≤0.5× their pub median, with n ≥ 10.
Currently three confirmed patterns: local-team sports without national hook (T1s), spoiled-answer headlines (L&E), punny/wordplay on local news (T1s).
Use: what to avoid. Full guidance with examples lives in the bottom-performer drawer of each per-publication tile.
Tile · Team Performance
Per-author percentile vs publication median. Author shown as initials (first + last name leading letters; collisions disambiguated with a digit suffix). Sorted by pub-percentile descending.
A writer at the 70th percentile has a median article that ranks at the 70th percentile of articles published at their primary pub. Read it like a class rank: their typical article beats 70% of what their pub published.
Two extra percentile readings sit beside the headline one so writers are compared like-for-like: vertical-peer percentile (rank against only other writers in the same content lane—fairer than ranking a recipes writer against a politics writer) and tier-peer percentile (rank against writers at comparable-size publications). The small ± figure on each is the wiggle room from sample size (a bootstrap confidence range).
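One way to read the percentile and its ± figure in code: rank the writer's median article against the comparison pool, then resample the writer's articles to see how much that rank wobbles. The percentile-bootstrap method, the 95% range, the draw count, and the sample numbers are assumptions for illustration; the console's exact procedure is not documented here.

```python
import random
from statistics import median

def percentile_of(value, population):
    """Percent of the population strictly below `value`."""
    below = sum(1 for v in population if v < value)
    return 100 * below / len(population)

def author_percentile(author_pvs, pool_pvs):
    return percentile_of(median(author_pvs), pool_pvs)

def bootstrap_range(author_pvs, pool_pvs, draws=1000, seed=0):
    """Approximate 95% range of the author's percentile via resampling."""
    rng = random.Random(seed)
    stats = sorted(
        author_percentile(rng.choices(author_pvs, k=len(author_pvs)), pool_pvs)
        for _ in range(draws)
    )
    return stats[int(0.025 * draws)], stats[int(0.975 * draws) - 1]

# Hypothetical data: a ten-article publication and a three-article writer.
pool = list(range(100, 1100, 100))   # 100, 200, ..., 1000
writer = [600, 800, 900]             # median 800
point = author_percentile(writer, pool)
low, high = bootstrap_range(writer, pool)
print(point, low, high)
```

Swapping `pool` for same-lane writers or same-tier publications gives the vertical-peer and tier-peer readings; a three-article writer produces a wide range, which is the point of showing it.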
Recent top-performers panel (+ engagement rank)
A side panel that rides several tiles (and the home page) listing the standout published headlines from the last 7 days. New since the 5/04 docs cut: you choose what to rank them by.
An axis toggle above the list—four buttons: PVs, Engaged %, Time on page, Cluster cont.—lets you choose the ranking metric, and your choice is remembered on your next visit:
| Rank by | What it means | Normalized? |
|---|---|---|
| PVs | Page-view traffic. | Ranked by how far the article beat its own publication's normal level, so big and small papers compare fairly. |
| Engaged % | The share of visits where readers actually engaged rather than bounced. | Same—vs the publication's normal engaged rate. |
| Time on page | How long readers stayed (median). | Same—vs the publication's normal dwell. |
| Cluster cont. | The share of readers who clicked on to another linked article in the same group. | Absolute (the value is already a normalized rate): 30%+ reads as strong, 15–30% middling, under 15% weak. |
An article missing the chosen measure drops off that list (e.g. a piece with no engagement coverage won't appear when you rank by Engaged %). When you rank by PVs, each row also carries a small decile badge (like D8) showing which tenth of its content lane the article lands in by lift (how far it beat the publication median).
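A rough sketch of how a decile badge could be assigned: rank the lane's articles by lift and cut the ranking into tenths. How the console handles ties and very small lanes is not documented here, so treat this as an assumption-laden illustration with invented lifts.

```python
def decile_badges(lifts):
    """Return a D1..D10 badge per article, by lift rank within one content lane."""
    n = len(lifts)
    order = sorted(range(n), key=lambda i: lifts[i])
    badges = [None] * n
    for rank, index in enumerate(order):
        badges[index] = f"D{rank * 10 // n + 1}"
    return badges

# Hypothetical lifts for ten articles in one lane.
lane_lifts = [-0.5, -0.3, -0.1, 0.0, 0.1, 0.2, 0.4, 0.6, 0.9, 1.5]
badges = decile_badges(lane_lifts)
print(badges[7], badges[9])  # prints D8 D10
```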
Reading the numbers (lift · confidence range · verdict dot · decile)
Four small affordances appear throughout Findings. None require statistics background—here's how to read each at a glance. Full definitions are in the glossary.
- Lift: how far above or below normal a group performed vs a typical article in the same publication or lane. +50% = half again as much (1.5×); −20% = a fifth less; 0% = exactly normal.
- Confidence range: the bracket after a lift number, like [+22, +47], is the likely range of the true effect: about 95% confident the real lift sits between those two numbers. Narrower = more reliable; wider = more uncertainty. It only shows when there are enough articles to compute it (n ≥ 10).
- Verdict dot: the colored dot in each tile's header is that tile's overall call at a glance: green = trusted and actionable (go), amber = worth testing (test), coral/red = skip, or too uncertain to act on yet.
- Decile badge: on top-performer rows ranked by traffic, a badge like D8 says which tenth of its content lane the article sits in by lift (how far it beat the publication median); D10 is the very top tenth (the badge is highlighted when an article reaches it).
Each of these has a (?) help bubble on the live page with the same plain-language definition. The docs and the tooltips are kept in lock-step.
Anatomy: tile + drawer
Every tile has the same skeleton:
- Rail. Vertical accent strip on the left edge. Color encodes verdict: green = go, amber = test, coral = skip.
- Title row (t1). Tile name + (?) help bubble + a one-line meta summary (n records · key signal). Always visible.
- Chevron (▸ / ▾). Right edge of the title row. Click anywhere on the title row to expand or collapse the drawer.
- Drawer (t2). Hidden until expanded. Two-pane: main column with description + data table; side column with recent top-performers (or per-tile signal—formula hit rates, vertical totals + cluster batting average, weather signal).
Default state: all 7 tiles land collapsed. Click to open the drawer for the tile you want to dig into. Multiple drawers can stay open at once.
Grader page + today's grades
Two grader surfaces feed off the same daily run:
- Today's published grades on the home page—4 KPIs (headlines graded · avg score · range · top issue) reflecting the most recent grader run.
- Grader page—full per-headline scorecard for every headline in the lookback window, sorted worst-first. Each headline card expands to show the full rule-by-rule breakdown plus per-vertical tips. The page also carries a 30-day history strip, a per-outlet pass-rate matrix (every rule × every publication), and a criterion-correlations panel (which rules tend to be missed together).
The grader runs daily at 10:17 CDT (15:17 UTC) against the Tracker. It scores each headline against the rule set (a mix of fixed computer checks and a Groq LLM for the four judgment-call rules—active voice, lead burial, curiosity, accuracy), writes the results to docs/grader/index.html + docs/data/grader-signals.json + docs/grader/history.json, and commits. The score is the share of rule-weight a headline earned (see Grader tiers + the score).
Manual trigger (operators only): gh workflow run headline-grader.yml. The in-page Run-now button was removed 2026-05-04—a hardcoded passcode + localStorage PAT isn't a defensible auth surface for a public-readable site.
Grader tiers + the score
Every rule belongs to a tier, and tiers carry different weight. A headline's score is the share of applicable rule-weight it earned—not a simple count of passes.
The score, in one sentence
Add up the weights of the rules a headline passed, divide by the weights of all rules that applied to it, multiply by 100. Rules that don't apply (the platform-title rules, or a keyword rule when no keyword was supplied) are left out of the denominator entirely—they neither help nor hurt. So two headlines can be scored against different rule sets and still compare fairly.
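The one-sentence rule translates almost directly into code. In this sketch a rule result is True (passed), False (failed), or None (not applicable); the rule names and weights in the example are placeholders, not the grader's real table.

```python
def headline_score(results, weights):
    """Share of applicable rule-weight earned, on a 0-100 scale.

    `results` maps rule name -> True, False, or None (not applicable).
    Not-applicable rules are left out of the denominator entirely.
    """
    earned = sum(weights[rule] for rule, ok in results.items() if ok is True)
    applicable = sum(weights[rule] for rule, ok in results.items() if ok is not None)
    if applicable == 0:
        return None
    return 100 * earned / applicable

# Hypothetical rules and weights, for illustration only.
weights = {"length": 2, "active_voice": 1, "no_questions": 1, "an_title_chars": 1}
results = {"length": True, "active_voice": False, "no_questions": True, "an_title_chars": None}
score = headline_score(results, weights)
print(score)  # prints 75.0
```

The platform-title rule is not applicable, so the headline is scored out of 4 weight points rather than 5: it earned 3 of them.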
The four tiers + their weights
| Tier | Share of score | What it covers |
|---|---|---|
| Structure & Length | 27% | The headline's shape: its length, how it opens, whether it reads as plain active English, and whether the news is up front (not buried). |
| Formula & Signal | 27% | Using a proven headline pattern, including the target search keyword, avoiding "what to know" filler, and—when the story has one—surfacing the expert/study signal. |
| Quality Flags | 36% | The "don't do this" rules plus two judgment calls: no "did you miss," no question/all-caps/banned punctuation, and yes to curiosity and accuracy. Carries the most weight of any tier. |
| Platform-Specific | 9% | The custom titles sent to Apple News and SmartNews. The two length rules (an_title_chars / sn_title_chars) are scored (weight 1), but their Tarrow source isn't joined to the grader yet, so they almost always read "?" (N/A) and drop out of the denominator in practice. Only apple_heres is the weight-0 informational rule here. |
The Grader page shows these four as a KPI strip with each tier's average pass rate and its weight (e.g. "27% wt"). Within a headline card, a mini stacked bar shows where that headline's points came from by tier (blue = Structure, purple = Formula, green = Quality), with a dark tail for points it didn't earn.
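The tier shares in the table follow from the summed rule weights this doc reports (Structure 6, Formula 6, Quality 8, Platform 2). A quick check of that arithmetic; the dictionary name is an assumption:

```python
# Summed rule weights per tier, as reported in this doc.
TIER_WEIGHTS = {
    "Structure & Length": 6,
    "Formula & Signal": 6,
    "Quality Flags": 8,
    "Platform-Specific": 2,
}

total_weight = sum(TIER_WEIGHTS.values())
tier_shares = {tier: round(100 * w / total_weight) for tier, w in TIER_WEIGHTS.items()}
print(total_weight, tier_shares)
```

The rounded shares come out to 27 / 27 / 36 / 9, which sums to 99 rather than 100 because each share is rounded independently.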
Per-rule weights are set in generate_grader.py:CRITERIA and roll up automatically; the percentages above are the live tier sums (Structure 6 / Formula 6 / Quality 8 / Platform 2 = 22 total weight → 27 / 27 / 36 / 9%). See the full 19-criteria table for the per-rule weights.
Format-aware grader rules
Some article formats lock the H1 to a pattern that universal rules would penalize. The grader detects the article's Format column and overrides the universal criteria where they conflict with the format spec.
Synthesis rule
Format spec wins where it conflicts with universal rules (since the format spec is the locked editorial standard). Universal rules apply where the format is silent.
Format → required H1 + exempt criteria
| Format | Required H1 | Exempts (would otherwise penalize) |
|---|---|---|
| Everything to Know | [Subject]: Everything You Need to Know | no_vague_wtk |
| What to Know Next | [Subject]: What's Happening, Why, and What Could Be Next | no_vague_wtk |
| Discover Explainer | What Is [Topic]? / Who Is [Person]? | no_questions |
| Recipe | Must contain "Recipe" | — |
| Timeline | [Subject]: A Complete Timeline / Breakdown | no_articles (leading "A") |
| Recap | [Show] Recap: [N] Biggest Moments From [Episode] | — |
| Obituary | [Name] Dead: [Descriptor] Was [Age] | — |
| Follow-Up Content | [Name] Dead at [Age] | — |
| Interview | [Name] on [Topic]: '[Quote]' (EXCLUSIVE) | — |
| Fan Theory | [Show] Fan Theory About [X] Will Blow Your Mind | — |
| Fan Question | Biggest Questions About [Show] Answered | — |
Source of truth: csa-content-standards/docs/<format>.md. Each format spec carries the canonical pattern; the grader's _FORMAT_RULES table is a synthesis of those.
What happens if the Format column is blank
The universal rule set applies (the same 19 criteria as everywhere else). No format-specific override fires. The card's meta line shows the author + vertical + platform + brand but no format badge.
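A minimal sketch of the override logic, assuming an exempt criterion is simply dropped from the set applied to that headline. The mapping below copies only the exemptions listed in the table above; its name and shape are illustrative and not the grader's actual _FORMAT_RULES structure.

```python
# Exemptions taken from the format table; formats with no exemption are omitted.
FORMAT_EXEMPTIONS = {
    "Everything to Know": {"no_vague_wtk"},
    "What to Know Next": {"no_vague_wtk"},
    "Discover Explainer": {"no_questions"},
    "Timeline": {"no_articles"},
}

def applicable_criteria(universal, article_format):
    """Universal criteria minus any the article's format exempts."""
    exempt = FORMAT_EXEMPTIONS.get((article_format or "").strip(), set())
    return [c for c in universal if c not in exempt]

universal = ["no_vague_wtk", "no_questions", "no_articles"]
print(applicable_criteria(universal, "Discover Explainer"))
print(applicable_criteria(universal, ""))  # blank Format: universal set applies
```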
The headline experiment (framing test + expert signal)
A live test, run inside the Variant tool, that separates two very different kinds of editorial knowledge: a lever we've proven and want to enforce, and a hypothesis we want to test before trusting. This section is the high-level "what + why"; the operator how-to lives in the Variant tool documentation.
The two halves, and why they're treated differently
The June 2026 headline analysis turned up one finding strong enough to act on now and one question still open. Putting them on the same tool forced a clean split:
- Proven lever — the expert/study signal. When a story is built on a credentialed expert or a named study, naming that in the headline reliably lifts performance (the analysis measured roughly +71% page-views and +50% engagement time). Because it's a measured result, not a guess, it's enforced: it became a scored grader rule (see Expert/study signal), alongside length and the punctuation ban.
- Open hypothesis — the trend-stage framing. Whether describing where a trend sits in its lifecycle changes how readers engage is unknown. So it's tested, not scored: a headline can be written in a test framing without that framing helping or hurting its grade. Scoring a framing would amount to claiming we already know the answer—and we don't yet.
The organizing rule is simple: codify what's proven, validate what's not. A framing only graduates into a scored rule once engagement data picks a winner.
The three framing styles under test
Each is a way of framing the same trend in the headline—same article, different angle on where the trend stands:
| Style | The angle it takes | Feel of the language |
|---|---|---|
| Established (the current default + baseline) | The trend is recognized and authority-backed. | "Explained," "What the Research Says," "Why X Has Become…" |
| Emerging | The trend is early-stage and just breaking mainstream—the reader is ahead of the curve. | "Just Starting to…," "Is Growing," "Why Experts Are Increasingly…" |
| Forecast | The trend is evolving; forward-looking and predictive. | "What Comes Next," "Where X Is Headed," "What This Means for You" |
Writers begin running the test on 2026-06-10. The framing styles are tuned weekly as data arrives, so they're deliberately kept somewhere editable without a redeploy—not baked into the governed rule library. The test runs alongside normal generation: a writer opts a headline into a test style, or generates the usual production headline; it's one labeled lane, not a separate tool.
The per-headline "receipt"
For the eventual "which style won?" to be honest, every generated headline has to be traceable to the exact setup that produced it—the thresholds, the rule set in force, how the expert signal was delivered, and which framing was under test. So each generation records an immutable config snapshot: a content-addressed "receipt." Two headlines with the same receipt provably ran under the same configuration; a different receipt means something changed. When engagement numbers land weeks later, each headline's outcome attributes to its precise receipt rather than to a guess about what was live at the time.
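A content-addressed receipt is typically a hash of a canonical serialization of the config. The sketch below uses SHA-256 over sorted-key JSON; the hash choice, the function name, and the config fields are assumptions, since the Variant tool's actual scheme is not described here.

```python
import hashlib
import json

def config_receipt(config):
    """Hash a canonical JSON form of the config so equal configs share a receipt."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical config snapshots; field names are illustrative only.
run_a = {"framing": "emerging", "ruleset": "v7", "max_chars": 90}
run_b = {"max_chars": 90, "ruleset": "v7", "framing": "emerging"}  # same content, different key order
run_c = {"framing": "forecast", "ruleset": "v7", "max_chars": 90}

print(config_receipt(run_a) == config_receipt(run_b))  # prints True
print(config_receipt(run_a) == config_receipt(run_c))  # prints False
```

Sorting keys before hashing is what makes the receipt depend on the configuration's content rather than on how it happened to be written down.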
Bridge to keywords console
The console emits an outbound feed at /data/headline-outcomes.json with every published article's outcome (PV, lift, formula, topic, content type, vertical, publication date). The Keywords Decisioning Console reads that feed for its Decision Log lens—keywords with a published headline get the headline grade + PV outcome paired back automatically.
Inbound bridge (filter Findings tiles by a keyword brief) is stubbed in the t2-side panel—pending the data-keywords brief-id metadata to land in the Tracker.
The (?) help bubbles
Click any (?) icon to open a popover with a tight definition of that term, metric, or tile. Click anywhere outside or press Esc to close. Enter / Space with the icon focused also toggles.
Content lives in docs/js/help-blurbs.js as a single registry—every term documented in one file. To reference a blurb from any page: <span class="help" data-help="some-id"></span>. The icon + popover are auto-hydrated.
What auto-updates vs what's manual
| Surface | Cadence | Trigger |
|---|---|---|
| Findings tiles | Weekly | Mon 19:23 UTC cron → tiles_new.py |
| Today's published grades | Daily | 10:17 CDT cron → generate_grader.py |
| Snowflake enrichment | Weekly | Same Mon cron (precedes tiles_new.py) |
| Tarrow XLSX | Weekly (Tarrow-side) | Auto-fetched at run start; falls back to last successful |
| Grader run | On demand | gh workflow run headline-grader.yml (operators only) |
Troubleshooting
Tiles show stale data
Check the live badge in the topbar. If it says missing for a feed, that feed's cron hasn't fired today. If all 3 feeds are bound but data looks old:
- Hard-refresh (Cmd+Shift+R)—Cloudflare's cache can lag a deploy by a few minutes.
- Every JS/CSS/JSON URL is now stamped with ?v=YYYYMMDDHHMM based on the cron run, so each new deploy auto-busts both the browser cache and Cloudflare's CDN cache.
- If the badge says 0 graded, no headlines were published in the last 24h.
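For reference, a ?v=YYYYMMDDHHMM stamp can be produced as in the sketch below. The helper names are hypothetical, and using the cron run's UTC time is an assumption.

```python
from datetime import datetime, timezone

def version_stamp(run_time):
    """Format a run time as YYYYMMDDHHMM."""
    return run_time.strftime("%Y%m%d%H%M")

def stamp_url(url, run_time):
    """Append the version stamp as a query parameter."""
    separator = "&" if "?" in url else "?"
    return f"{url}{separator}v={version_stamp(run_time)}"

# Example run time: a Monday 19:23 UTC cron firing.
run = datetime(2026, 5, 4, 19, 23, tzinfo=timezone.utc)
print(stamp_url("/data/tiles.json", run))  # prints /data/tiles.json?v=202605041923
```

Because the stamp changes with every run, browsers and the CDN treat each deploy's assets as new URLs.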
Grader page modal won't close
If Cancel doesn't dismiss, click the dimmed area outside the modal box. Resolved as of 2026-05-02 commit 001356d.
Vertical filter doesn't show all options
Three verticals only: Mind/Body, Everyday Living, Experiences. Articles classified as "Discover" in source data are search/discover work mode, not a vertical—they bucket as — in vertical-keyed UI.
Writer initials collide
Two writers with same initials get a digit suffix: RB and RB2. The mapping isn't published; reach out if you need to dereference.
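The suffix scheme can be pictured like this. The names are invented, and the order in which colliding writers receive their suffix is an assumption; the real mapping is not published.

```python
def assign_initials(names):
    """First + last name leading letters; repeats get a digit suffix (RB, RB2, ...)."""
    seen = {}
    labels = {}
    for name in names:
        parts = name.split()
        if not parts:
            continue  # skip blank names
        base = (parts[0][0] + parts[-1][0]).upper()
        seen[base] = seen.get(base, 0) + 1
        labels[name] = base if seen[base] == 1 else f"{base}{seen[base]}"
    return labels

# Hypothetical writers, for illustration only.
labels = assign_initials(["Riley Brooks", "Rowan Bell", "Dana Ortiz"])
print(labels)
```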
FAQ
Why are PV numbers different from what I see in Amplitude?
This console reads from MCC_PRESENTATION.CONTENT_SCALING_AGENT.TRACKER_ENRICHED, which composes per-platform PVs from STORY_TRAFFIC_MAIN + Tarrow XLSX. Amplitude only covers O&O; MSN traffic for example is Tarrow-only. Numbers will differ.
Why isn't InTouch / Closer / Life & Style on the lens filter?
The filter pills show the top-3 buckets (All / News / L&E), not per-publication pills. The active L&E cohort is US Weekly, Woman's World, and Life & Style; an L&E lens covers all three. InTouch and Closer are not in the active cohort—they're retired brands the console recognizes only for News-vs-Entertainment tag coloring, so they're graded for historical continuity but don't appear in the live Findings cohort.
Why is the (?) icon called the help bubble?
The (?) icon opens a popover with a "what is this?" blurb. The pattern is shared across the keywords + headlines consoles.
Architecture
Static HTML site deployed to Cloudflare Pages. No build step, no React, no bundler—pages are emitted by Python scripts and served directly. JS interactions are vanilla (no framework).
data/                                  # source: Snowflake enrichment + Tarrow XLSX
  snowflake_enrichment.json            # ← snowflake_enrich.py (weekly cron)
  Top Stories 2026 Syndication.xlsx    # ← Tarrow (weekly)
tiles_new.py                           # weekly cron → docs/data/tiles.json + docs/index.html
generate_grader.py                     # daily cron → docs/grader/index.html + docs/data/grader-signals.json
docs/                                  # served by Cloudflare Pages
  index.html                           # home (Findings)
  data/                                # JSON feeds for tile rendering
  js/                                  # vanilla JS interactivity
  css/styles.css + legacy-pages.css
Data sources + refresh cadence
| Source | What it provides | Refresh | Path |
|---|---|---|---|
| Snowflake TRACKER_ENRICHED | Per-article PVs by platform, vertical, author, formula classifier, cluster_id | Weekly (Mon) | data/snowflake_enrichment.json |
| Tarrow XLSX | Per-platform syndication titles + PVs (AN, MSN, Yahoo, SmartNews) | Weekly | Top Stories 2026 Syndication.xlsx (repo root) |
| Tracker sheet | Live editorial Tracker (headline, author, brand, vertical, dates)—source-of-truth that feeds TRACKER_ENRICHED | Real-time (Google Sheet) | Flows into Snowflake; grader reads TRACKER_ENRICHED (since 2026-06-05) |
| Groq LLM | Active-voice + lead-burial + curiosity + accuracy criteria (the four judgment-call rules) | Daily (per-headline) | API call from grader |
File map
| Path | Role |
|---|---|
| tiles_new.py | Tile builders (~1450 lines). Reads snowflake_enrichment.json, classifies formula + topic, computes lift + confidence ranges + verdicts + deciles + engagement axes, emits tiles.json + headline-outcomes.json + index.html. |
| generate_grader.py | Grader (~3900 lines). Reads TRACKER_ENRICHED (the Tracker sheet via Snowflake; the direct Sheets read was retired 2026-06-05), scores the 19 criteria (17 weighted + 2 informational), emits grader/index.html + grader-signals.json + history.json. |
| scripts/apply_v2_styles.py | Sub-page migration. Strips legacy embedded CSS, links shared stylesheets, injects v2 topbar. |
| docs/css/styles.css | v2 design system tokens. Mirrors data-keywords/docs/css/styles.css. |
| docs/css/legacy-pages.css | Class re-themer for legacy generate_site.legacy.py output (.pb-tile, .agg-tbl, etc.). |
| docs/js/v2-render.js | Reads window.__DATA_HEADLINES__ URLs → injects real data into tile bodies. |
| docs/js/help.js + help-blurbs.js | (?) popover system + content registry. |
| docs/js/table-align.js | Auto-aligns table headers to match data-cell alignment. |
Snowflake TRACKER_ENRICHED schema
The vetted boundary table read by snowflake_enrich.py. Path: MCC_PRESENTATION.CONTENT_SCALING_AGENT.TRACKER_ENRICHED.
Columns this console relies on:
| Column | Type | Source |
|---|---|---|
| published_url | VARCHAR | Tracker sheet → DYN_STORY_META_DATA |
| headline | VARCHAR | Tracker H1 column |
| author | VARCHAR | Tracker AUTHOR column |
| domain | VARCHAR | Parsed from URL |
| vertical | VARCHAR | AUTHOR_VERTICAL_MAP join |
| publication_date | DATE | Tracker |
| total_pvs | NUMBER | STORY_TRAFFIC_MAIN sum (reach metric, fan-out included) |
| search_pvs / social_pvs / direct_pvs / applenews_pvs / smartnews_pvs / newsbreak_pvs / yahoo_pvs / subscriber_pvs / amp_pvs / newsletter_pvs | NUMBER | Per-platform breakdown from STORY_TRAFFIC_MAIN + Tarrow |
| pub_median_pvs | NUMBER | Computed in routine |
| article_vs_co_median | FLOAT | Computed (origin-PVs basis post-cross-syndication-screen; see below) |
| is_hit | INTEGER | 1 if at-or-above publication median (origin-PVs basis post-screen) |
| cluster_id / cluster_vs_co_median / cluster_hit_rate | VARCHAR / FLOAT / FLOAT | Cluster join + computed (origin-PVs basis + winsorized post-screen) |
| origin_pvs / syndicated_pvs / n_syndication_sites / syndicated_share / syndication_juice / marfeel_mediums | NUMBER / NUMBER / NUMBER / FLOAT / VARCHAR / ARRAY | Designed; merge-ready when Marfeel→Snowflake feed lands. See cross-syndication distortion screen. |
The console reads from MCC_PRESENTATION.CONTENT_SCALING_AGENT only. Direct reads of MCC_RAW or upstream tables are not allowed.
Tracker sheet integration
The editorial Tracker is a Google Sheet maintained by the content team lead, treated as the source-of-truth. It flows into the Snowflake model MCC_PRESENTATION.CONTENT_SCALING_AGENT.TRACKER_ENRICHED (rebuilt ~2×/day), and the grader reads that—it switched off the direct Sheets read on 2026-06-05 (the sheet stopped receiving new dated rows, frozen at 6/02, while the Snowflake model stayed current).
Required columns (used by this console): Headline, Author, Brand Type, Publication Date, Published URL/Link, Primary Keywords, Syndication platform, Format, Content Type, Rotation #.
Header aliases are documented in scripts/shared/constants.py—when the content team lead renames a column, the alias mapping keeps the pipeline stable until the next ingest.
Tarrow XLSX intake
Weekly per-platform syndication file from Tarrow (former vendor name retained as data-product label). Provides Apple News / SmartNews / Yahoo / MSN per-platform titles + PVs. MSN data is platform-side only—not in Snowflake—so Tarrow is the only path to MSN signal.
The data team's 2026-04-18 offer to load this XLSX into Snowflake as a proper data model is in flight as substrate work.
Cross-syndication distortion screen
Designed end-to-end; merge-ready when the Marfeel→Snowflake feed lands. Solves a structural data-quality issue that distorts every topic / cluster / format / author rollup the console renders.
Why it exists
On 2026-05-06, exec/leadership flagged that when distribution hand-picks already-strong stories for cross-syndication, it "juices" apparent topic performance; the canonical example was a Field-Level-Media-syndicated piece spread across ~24 syndication targets. When the console reads total_pvs to render leaderboards + cluster aggregates, those juiced articles inflate every rollup they appear in. Operators reading the console can't tell native topic strength from syndication fan-out. The distortion is selection-on-success: syndication is downstream of strong early performance on the origin publication, not a topic-strength signal.
What changes
New columns on TRACKER_ENRICHED (fed by an upcoming MCC_PRESENTATION.CONTENT_SCALING_AGENT.MARFEEL_ARTICLE_BY_MEDIUM table built by the data engineer):
| Column | Meaning |
|---|---|
| origin_pvs | PVs from the home publication only (Marfeel medium = article_domain) |
| syndicated_pvs | total_pvs − origin_pvs |
| n_syndication_sites | Distinct non-origin medium count |
| syndicated_share | syndicated_pvs / total_pvs, bounded [0,1] |
| syndication_juice | none / light / heavy tier (heavy ≥10 sites & ≥60% syndicated; light ≥3 & ≥30%) |
| marfeel_mediums | Sorted distinct medium array: the actual syndication targets that carried the article |
Performance-signal columns switch from total-PVs basis to origin-PVs basis at the same migration:
- is_hit: 1 if origin_pvs ≥ pub_median_pvs (was total_pvs ≥)
- article_vs_co_median: uses origin_pvs instead of total_pvs
- cluster_total_pvs + cluster_hits: recomputed on origin_pvs
- cluster_vs_co_median: uses new cluster_total_pvs_trimmed with the heaviest-juiced article per cluster excluded (winsorized)
- article_pv_share_of_domain_month: origin-PVs basis
- author_hit_count + author_hit_rate: origin-PVs basis
Reach + revenue columns stay on total-PVs basis because every PV is real revenue regardless of medium: total_pvs, article_programmatic_revenue_live, author_avg_pvs, author_pv_stddev.
UI surfacing
Two new operator-facing surfaces ride the new columns:
- Per-article juice chip on every card: 🔁 N-site, X% syndicated when syndication_juice != 'none' (orange for light, red for heavy; hover reveals the medium list).
- Leaderboard "Hide heavy-juiced" toggle: localStorage-persisted, default off. When on, filters syndication_juice = 'heavy' rows from leaderboard rankings. Article still openable through deeplink.
What this is NOT
- Not a hide-juiced policy. Heavily-syndicated articles still appear on every card view + every leaderboard by default. The toggle is opt-in.
- Not a content-quality judgment. The chip is descriptive, not evaluative—a heavily-juiced article isn't bad content; it just shouldn't be read as topic-strength signal.
- Not a Marfeel feed dependency for correctness logic. The CTE is fail-open (defaults safely if no match): when no Marfeel row matches an article, origin_pvs falls back to total_pvs and syndication_juice defaults to none. Pre-feed-land state behaves identically to pre-screen state.
- Not a replacement for cross_site_* columns (McClatchy-internal newspaper syndication). Marfeel covers external / platform syndication targets; both apply to most articles; both are useful.
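The column definitions and the fail-open fallback in this section can be sketched in a few lines. This is a minimal illustration, not the production CTE: the function name and argument shapes are hypothetical, and only the tier thresholds and the fallback behavior come from the text above.

```python
from typing import Optional


def syndication_columns(total_pvs: int,
                        origin_pvs: Optional[int],
                        n_syndication_sites: int = 0) -> dict:
    """Derive the screen's per-article columns (hypothetical helper)."""
    # Fail-open: with no Marfeel match, origin_pvs falls back to total_pvs,
    # which makes the article look exactly as it did before the screen.
    if origin_pvs is None:
        origin_pvs = total_pvs
        n_syndication_sites = 0

    syndicated_pvs = max(total_pvs - origin_pvs, 0)
    share = syndicated_pvs / total_pvs if total_pvs > 0 else 0.0
    share = min(max(share, 0.0), 1.0)  # bounded [0, 1]

    # Tier rules: heavy needs >=10 sites and >=60% syndicated,
    # light needs >=3 sites and >=30% syndicated, otherwise none.
    if n_syndication_sites >= 10 and share >= 0.60:
        juice = "heavy"
    elif n_syndication_sites >= 3 and share >= 0.30:
        juice = "light"
    else:
        juice = "none"

    return {
        "origin_pvs": origin_pvs,
        "syndicated_pvs": syndicated_pvs,
        "n_syndication_sites": n_syndication_sites,
        "syndicated_share": share,
        "syndication_juice": juice,
    }
```

With made-up numbers, an article with 10,000 total PVs, 3,000 origin PVs, and 24 syndication sites lands in the heavy tier, while the same article with no Marfeel row reports origin_pvs equal to total_pvs and a tier of none.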
Implementation
Master design + SQL diffs in ops-hub/docs/cross-syndication-screen.md. UI spec + snowflake_enrich.py diffs in data-headlines/dev-docs/cross-syndication-screen-ui.md. Governance entry in csa-content-standards data-universe-labeling.
Cron schedules
| Workflow | Cadence | Trigger |
|---|---|---|
| weekly-site-refresh.yml | Mon 19:23 UTC | ops-hub cross-repo-dispatcher.yml (native cron broken post-transfer) |
| headline-grader.yml | Daily 10:17 CDT | GitHub Actions schedule |
Both workflows can be triggered manually via gh workflow run or the GitHub Actions UI.
Deployment + Cloudflare Pages
Deployed to data-headlines.pierce.tools from the main branch. Auto-deploys on push (~30s). Cloudflare Pages also exposes a default *.pages.dev URL for the project; that endpoint resolves but is operational-only—always use the custom domain for stakeholder-facing links.
Access is gated by Cloudflare Access—visiting any page bounces through a one-time-PIN login (whitelisted emails only) before serving content. SPA-fallback is enabled, which serves /index.html for unknown paths—keep this in mind when adding new sub-pages (use absolute paths if the page sits at a depth where relative resolution would 404).
Anonymization governance
Per ops-hub/ANONYMIZATION.md: no proper colleague names in any committed file. Pierce is the only permitted name. Authors in tiles.json appear as initials; grader-signals.json drops the author field entirely; meeting transcripts are extracted to anonymized summaries and originals deleted.
Enforcement runs as Step 4.5 of /sync-repos—fail-closed on denylist match, transcript-filename match, verbatim-content patterns, or snapshot-directory contamination.
Quality gates
Pre-push gates from ops-hub:
- Anonymization (Step 4.5 of /sync-repos).
- Conciseness (Step 4.6): project descriptions ≤400 chars, nextActions ≤25 words, CONTEXT.md ≤150 lines.
PV / total_pvs
Page-views: total times an article was loaded across every distribution surface (Apple News, SmartNews, Yahoo, Newsbreak, search, social, direct, newsletter, AMP, subscriber). Sourced from STORY_TRAFFIC_MAIN in Snowflake and aggregated into total_pvs on TRACKER_ENRICHED.
Reach metric—fan-out included. total_pvs is the right number for reach + revenue rollups (every PV is real money regardless of medium). It is not the right number for topic-strength / cluster-strength / author-strength judgments—those use origin_pvs post-cross-syndication-screen to remove selection-on-success distortion. See Cross-syndication distortion screen for full context.
Lift
A multiplier comparing one group's median PVs to a baseline. lift = (group_median − baseline_median) ÷ baseline_median. Positive = above baseline, negative = below. The baseline depends on context: usually publication median, sometimes vertical median, sometimes untagged-formula median.
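As a worked instance of the formula, here is a small sketch using Python's standard library. The function name and the sample PV lists are illustrative assumptions; returning None for a zero baseline is also an assumption about how an undefined lift might be handled.

```python
from statistics import median


def lift(group_pvs, baseline_pvs):
    """Relative difference between the group median and the baseline median."""
    baseline_median = median(baseline_pvs)
    if baseline_median == 0:
        return None  # lift is undefined against a zero baseline
    return (median(group_pvs) - baseline_median) / baseline_median


# Illustrative numbers only: group median 1,500 vs baseline median 1,000.
example = lift([900, 1500, 4000], [400, 1000, 2500])
print(f"{example:+.0%}")
```

A group median of 1,500 against a baseline median of 1,000 gives (1,500 − 1,000) ÷ 1,000 = 0.5, which the console would display as +50%.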
Median (vs mean)
Page-view distributions are heavy-tailed: a single viral hit can shift a mean by 5×. The console uses medians everywhere because they're robust to outliers. When a tile reports n=20 articles with median PVs of 1,200, that's the midpoint between the 10th- and 11th-ranked articles' PV counts.
P-value (Mann-Whitney U)
The probability of seeing a lift at least as extreme as the observed one by chance alone, under the null hypothesis that the group and the baseline share the same distribution (no difference). Computed via the Mann-Whitney U-test: non-parametric, rank-based, no normality assumption.
- p < 0.05—conventional significance; treat as real signal
- p < 0.10—directional; worth a controlled pilot
- p ≥ 0.10—no detectable difference; data is too thin
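To make the rank-based idea concrete, the sketch below computes a two-sided Mann-Whitney U p-value in pure Python, using the normal approximation with a tie correction and no continuity correction. It is an illustration under those assumptions; the console's own implementation may use an exact method or a statistics library, and the function name here is hypothetical.

```python
import math


def mann_whitney_u(group, baseline):
    """Return (U for the group, two-sided p-value via normal approximation)."""
    n1, n2 = len(group), len(baseline)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")

    # Pool the samples, remembering which sample each value came from.
    combined = sorted([(v, 0) for v in group] + [(v, 1) for v in baseline])
    ranks = [0.0] * len(combined)
    tie_term = 0
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # average of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j

    rank_sum = sum(r for r, (_, src) in zip(ranks, combined) if src == 0)
    u = rank_sum - n1 * (n1 + 1) / 2

    n = n1 + n2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0:
        return u, 1.0  # every value tied: no evidence of a difference
    z = (u - mean_u) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the normal
    return u, p
```

Because only ranks enter the calculation, a single viral article moves the result no more than any other article that ranks first, which is why this test pairs well with median-based lift.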
Verdict (go / test / skip)
| Verdict | Color | Threshold |
|---|---|---|
| go | Green (--go #4ade80) | Real lift, p<0.05, n large enough |
| test | Amber (--test #fbbf24) | Directional, worth piloting |
| skip | Coral (--skip #f87171) | Below baseline or thinly evidenced |
Verdict color is encoded on the tile rail (left edge), pills, and KPI accents. The Verdict column pill in tile tables shows the verdict label first, followed by the confidence reading: go · p<0.05, test · p<0.10, test · no diff, skip · p<0.05.
Confidence (high / mod / low)
Overlay on verdict, conveying statistical confidence:
- conf-high (cyan #38bdf8)—p<0.05, n large
- conf-mod (violet #c084fc)—directional p<0.10
- conf-low (slate #94a3b8)—anecdote / no detectable difference
Confidence range on lift (the bracket)
The bracket after a lift number—like [+22, +47]—is the likely range of the true effect: we're about 95% confident the real lift sits between those two numbers. A narrower bracket means a more reliable estimate; a wider one means more uncertainty.
It's a 95% normal-approximation interval around the lift, computed in tiles_new.py:_lift_ci_normal. It only appears when the group has at least 10 articles (and the baseline median is non-zero); below that the data is too thin to bracket, so the cell shows the lift alone. Use the bracket to tell a steady, dependable lift (tight bracket on many articles) from a fluky one (wide bracket on few).
Decile badge (D1–D10)
On top-performer rows ranked by traffic, a small badge like D8 says which tenth of its content lane (vertical) the article lands in by lift (how far it beat the publication median). D10 is the top tenth; D1 is the bottom tenth. The badge is highlighted green when an article reaches the top deciles, amber in the middle, and muted lower down—a one-glance read of "how far up its lane did this land."
Computed per vertical in tiles_new.py (lift_decile_within_vertical): the article's traffic rank within its lane, bucketed into ten. It's a within-lane ranking, so a recipes article is placed against other recipes articles, not against politics. It shows only on the PV ranking axis of the recent top-performers panel.
Engagement metrics (engaged % · time on page · cluster continuation)
Beyond raw clicks, three reader-behavior measures say whether a headline actually delivered. They're now the primary read on the top-performers panel; page-view trend stays as the simple traffic-only reference.
- Engaged % (engaged_session_rate): the share of visits where readers engaged rather than bounced.
- Time on page (median_time_on_page_seconds): how long readers stayed, as a median (one marathon reader doesn't distort it).
- Cluster continuation (csa_cluster_continuation_rate): the share of readers who clicked on to another linked article in the same group. 30%+ is strong, 15–30% middling, under 15% weak.
These are Amplitude/Marfeel-derived and land on each article via the Wing-B engagement substrate plus the recurring Marfeel ETL—live and longitudinally reliable from 2026-05-18 forward. They're NULL when an article has no engagement coverage (it predates the window, or wasn't tracked), so the panel skips an article on any axis it's missing rather than show a misleading zero. The "beat the publication's normal level" standard used by the hit-rate-by-formula sidecar uses these measures where they exist and falls back to plain traffic where they don't.
Cluster batting average
Article-level: of all CSA-touched articles (those with a cluster_id), the share whose cluster's median PV beat the publication's company median (cluster_vs_co_median > 0). Reported as "1 in N"—lower N = better.
Larger clusters get proportionally more votes (each article in a winning cluster counts), so this reads as "how often does a CSA-touched article sit in a cluster that's outperforming."
Current batting average sits in the 1-in-2.5 to 1-in-3 range. The exec/leadership Q3 target is 1 in 3.3 (≈30%).
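The "1 in N" reading is just the reciprocal of the winning share. A minimal sketch, assuming one cluster_vs_co_median value per CSA-touched article (each article carries its cluster's value, so larger clusters contribute more rows); the function name is hypothetical:

```python
def batting_average(cluster_vs_co_median_per_article):
    """Return (winning share, N in '1 in N') for CSA-touched articles."""
    values = list(cluster_vs_co_median_per_article)
    if not values:
        return None
    wins = sum(1 for v in values if v > 0)  # article sits in a winning cluster
    share = wins / len(values)
    one_in_n = 1 / share if share > 0 else None
    return share, one_in_n


# Illustrative data: 3 of 10 articles sit in clusters above the company median.
share, one_in_n = batting_average([0.2, 0.4, 0.1, -0.3, -0.1, -0.2, -0.5, -0.1, -0.4, -0.2])
print(f"{share:.0%} -> 1 in {one_in_n:.1f}")
```

Three winners out of ten articles gives a 30% share, which reads as 1 in 3.3.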
Methodology shift coming. Post-cross-syndication-screen, cluster_vs_co_median uses cluster_total_pvs_trimmed (heaviest-juiced article per cluster excluded) computed against origin_pvs per article. Expect cluster batting averages to shift downward modestly as previously-juiced clusters revert to native strength. See Cross-syndication distortion screen.
Curiosity gap
Editorial principle: lead with the question, not the answer. "The Secret to Her 20-lb Weight Loss" (gap kept open → click) outperforms "She Lost 20-lbs Eating Kale" (answer spoiled → no click). Active in the L&E lens.
Confirmed at US Weekly with n=19 vs n=12 control, +42% vs −35% lift, p<0.05.
Formula taxonomy (full list)
Canonical list lives in formula_taxonomy.py at repo root — single source of truth shared by the Findings classifier (tiles_new.py:classify_formula) and the published-headline grader (generate_grader.py:_formula). The variant tool's Cloudflare Worker (data-headlines-variant, separate codebase) is synced manually against this list.
Priority-ordered: penalized-filler phrases first (so they bucket correctly even when a credit-worthy pattern also matches), then lead-position patterns, then structural cues, then stylistic markers, then a length-based fallback. First match wins. Every non-empty headline buckets — untagged is reserved for the empty-string degenerate case only.
Role column: credit = legitimate editorial formula, earns "Formula present" credit in the grader, classifier buckets here, variant Worker generates candidates here. penalized = recognized vague-filler pattern, classifier buckets it for cohort analysis but the grader does NOT award credit (the per-criterion penalty fires separately). fallback = length-based classifier-only bucket; doesn't represent an editorial formula.
Penalized-filler phrases (priority 1)
These bucket FIRST so a headline like "X: Everything You Need to Know" doesn't accidentally credit on a stylistic marker further down the list.
| Formula | Trigger | Role | Example |
|---|---|---|---|
| did_you_miss | Phrase "did you miss" | penalized | "Did You Miss the Aurora Last Night?" |
| what_to_know | "what to know" / "what you should know" | penalized | "Severe Weather Hits SC: What to Know" |
| explainer | "everything (you need) to know" | penalized | "SHA Wellness Retreat: Everything You Need to Know" |
Lead-position patterns (priority 2)
| Formula | Trigger | Role | Example |
|---|---|---|---|
| heres | Starts with "Here's" / "Heres" / "Here are" | credit | "Here's How to Cancel a Subscription" |
| quote_lead | Starts with a quotation mark | credit | "'Devastated': Family Reacts to Verdict" |
| name_lead | Proper-name lead + reporting verb (says / reveals / admits / shares) | credit | "Kendall Jenner Says This Weighted Blanket Helped…" |
| inside_lead | Starts with "Inside" | credit | "Inside Hollywood's Ice Bath Obsession" |
| why_lead | Starts with "Why" | credit | "Why Creative Retreats Are the Biggest Travel Trend of 2026" |
| action_verb_lead | Starts with Watch / See / Meet / Listen / Read / Behold / Witness | credit | "Watch: A NeeDoh Compared to Other Stress Balls" |
Numeric + listicle (priority 3)
list_with_quantifier comes before number_lead so "5 Things to Know" buckets as the more specific listicle pattern rather than the bare number-lead bucket.
| Formula | Trigger | Role | Example |
|---|---|---|---|
| list_with_quantifier | "N + plural noun" listicle (things / reasons / ways / tips / hacks / kitchen items / benefits / …) | credit | "5 Things You Should Know About Sleep" |
| number_lead | Starts with a digit (and isn't a listicle) | credit | "5 Senate Republicans Break From Trump…" |
Structural cues (priority 4)
| Formula | Trigger | Role | Example |
|---|---|---|---|
| definitional | "what is/are…", "why is/are…", "how does/do…", or "explained" | credit | "What Is a NeeDoh? Everything Rosé Has Said About…" |
| how_to | "how to/i/you can", "ways to", "tips/tricks to" | credit | "How to Reduce Microplastic Exposure" |
| question | Ends with question mark | credit (+ separate no_questions penalty) | "Is the Aurora Coming Back Tonight?" |
| dash_directive | Em/en-dash rhythm: directive after dash, or any em-dash spacing | credit | "Most People Use Too Much Detergent—Here's How to Stop" |
| superlative_list | "best", "top", "greatest", "ultimate", "easiest", "smartest" | credit | "The 7 Best Cold Plunge Resorts Around the World in 2026" |
| comparative | "vs", "versus", "better than" | credit | "What Science Says vs. What Influencers Claim" |
| negation_lead | "stop", "avoid", "don't", "never", "skip", "quit" | credit | "Stop Spending Hundreds on Sleep Aids — Try These 6 Hacks…" |
| curiosity_gap | "secret", "reveal", "surprising", "shocking", "hidden", "little-known" | credit | "The Secret Behind Jelly Roll's Weight Loss" |
Stylistic markers (priority 5)
Fire often, less differentiating. Last in priority so they only catch headlines that don't trigger any more specific editorial pattern above.
| Formula | Trigger | Role | Example |
|---|---|---|---|
| possessive | Apostrophe-s or s-apostrophe possessive | credit | "Khloé Kardashian's Beef Tallow Routine" |
| colon_structured | Subject + colon + space + word | credit | "Beef Tallow: The Wellness Trend That Won't Die" |
Length-based fallback (classifier only)
Headlines that don't trigger any pattern above bucket by word count so per-pub × formula groups still populate. These are NOT editorial formulas — the grader returns "No recognized formula" (no credit) for headlines that fall through to this layer.
| Formula | Trigger | Role | Example |
|---|---|---|---|
| short_punchy | ≤ 7 words | fallback | "Beef Tallow Is Everywhere Now" |
| descriptive_standard | 8–15 words | fallback | "Khloé Kardashian Joins the Growing List of Celebrities Endorsing Beef Tallow for Skin" |
| descriptive_long | ≥ 16 words | fallback | "Your Nervous System Needs More Than Just 1 Deep Breath. These Popular Breathwork Apps Can Actually Help With Anxiety" |
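The priority-ordered, first-match-wins behavior described in this section can be sketched as an ordered rule list with a length-based fallback. Everything here is illustrative: the regexes are simplified guesses rather than the patterns in formula_taxonomy.py, the rule list is a small subset of the tables above, and the function name is hypothetical.

```python
import re

# Ordered from highest to lowest priority; penalized fillers come first so
# they win even when a lower-priority stylistic marker also matches.
RULES = [
    ("did_you_miss", re.compile(r"\bdid you miss\b", re.I)),
    ("what_to_know", re.compile(r"\bwhat (to|you should) know\b", re.I)),
    ("explainer", re.compile(r"\beverything (you need )?to know\b", re.I)),
    ("heres", re.compile(r"^(here'?s|here are)\b", re.I)),
    ("why_lead", re.compile(r"^why\b", re.I)),
    ("number_lead", re.compile(r"^\d")),
    ("question", re.compile(r"\?\s*$")),
    ("colon_structured", re.compile(r"\w: \w")),
]


def classify_formula_sketch(headline: str) -> str:
    """Return the first matching formula, else a word-count fallback bucket."""
    text = headline.strip()
    if not text:
        return "untagged"  # reserved for the empty-string case
    for name, pattern in RULES:
        if pattern.search(text):
            return name  # first match wins
    words = len(text.split())
    if words <= 7:
        return "short_punchy"
    if words <= 15:
        return "descriptive_standard"
    return "descriptive_long"
```

Under these simplified rules, "SHA Wellness Retreat: Everything You Need to Know" buckets as explainer rather than colon_structured, because the penalized filler is checked first.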
Topic taxonomy
Regex-based classifier in tiles_new.py:classify_topic(). Priority-ordered: specific patterns fire first; if no rule matches and Snowflake has an IAB tag, that's used; otherwise a token-cue fallback buckets to wellness_general / lifestyle_general / news_general. untagged is reserved for headlines with zero topical cue words (rare in practice).
Traditional news topics
weather_severe, weather_heatwave, weather_general, politics, crime, sports_local, sports_national, royals, finance.
TH wellness/lifestyle topics
Tuned to the Trend Hunter content corpus that dominates today's dataset:
- wellness_menopause: menopause, perimenopause, hot flashes, HRT, estrogen
- wellness_sleep: sleep, insomnia, circadian, melatonin, naps
- wellness_retreat: retreats, spas, wellness boot camps, sanctuaries
- wellness_stress: stress, anxiety, breathwork, vagus nerve, cortisol, cold plunge, meditation
- wellness_beauty: skincare, beef tallow, wrinkles, botox, red light therapy
- wellness_fitness: run clubs, workouts, pilates, sarcopenia, gym
- wellness_nutrition: protein, electrolytes, hydration, gut health, loaded water, collagen
- celebrity_health: weight loss, Ozempic, surgery, cancer
- home_hazards: mold, microplastics, PFAS, forever chemicals, laundry detergent, tap water
- home_garden: garden, flowers, curb appeal, landscaping, lawn
- real_estate: home inspection, home buying, mortgage, housing market
- recipes: recipes, baking, meal prep, dinner/breakfast
- entertainment: movies, TV shows, Netflix, streaming
- celebrity_general: Kardashian, Jenner, Hollywood stars (when not under a wellness bucket)
- lifestyle_trend: viral, TikTok, trend, Gen Z, "explainer"
- travel_food: travel, destinations, food tours, restaurants
Fallback buckets
wellness_general / lifestyle_general / news_general — fire when a headline contains a general cue word (wellness/health/luxury/trend/report/etc.) but no specific topic pattern matched. These keep per-pub × topic groups populated for analysis even when a headline doesn't fit a named bucket.
Content type (News vs L&E)
| Content type | Publications | Primary metric |
|---|---|---|
| News & Regional | Miami Herald, KC Star, Charlotte Observer, Fort Worth Star-Telegram, Sacramento Bee, Raleigh News & Observer, Centre Daily Times, Idaho Statesman, The State, Lexington Herald-Leader, Bradenton Herald, Belleville News-Democrat, Fresno Bee, Modesto Bee, Macon Telegraph, Sun Herald, plus 10 more T1 papers | CTR + Search Ranking |
| Lifestyle & Entertainment | US Weekly, Woman's World, and Life & Style (InTouch + Closer are retired brands, recognized only for tag coloring) | Social Shares + Scroll Depth |
Verticals (the 3)
- Mind/Body (color: emerald --vert-mb): health, fitness, wellness
- Everyday Living (color: amber-orange --vert-el): home, décor, lifestyle, parenting
- Experiences (color: pink --vert-ex): travel, recipes, finance, retirement
These are the three named TH-channel verticals. Articles published outside these channels (search/discover work) are bucketed as — in vertical-keyed UI; they aren't a fourth vertical.
Writer initials
Author identifier in committed JSON. Computed from full author name: first letter of first word + first letter of last word. "Jane Doe" → JD. "Alex Roe-Smith" → AR. Collisions get a digit suffix (JD, JD2, …).
Live Tracker sheet uses canonical full AUTHOR field; initials are a render-layer transformation only.
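The initials rule can be sketched as follows. The function name is hypothetical, and two behaviors are assumptions not stated above: the first name seen keeps the bare initials, and a single-word name repeats its first letter.

```python
def author_initials(full_names):
    """Map each full name to initials, suffixing a digit on collisions."""
    seen = {}    # initials -> how many distinct names have produced them
    result = {}  # full name -> rendered initials
    for name in full_names:
        if name in result:
            continue  # same author listed twice: reuse the existing initials
        words = name.split()
        if not words:
            continue
        # First letter of the first word + first letter of the last word.
        base = (words[0][0] + words[-1][0]).upper()
        seen[base] = seen.get(base, 0) + 1
        result[name] = base if seen[base] == 1 else f"{base}{seen[base]}"
    return result


# Placeholder names only.
print(author_initials(["Jane Doe", "Alex Roe-Smith", "John Dale"]))
```

The hyphenated surname counts as one word, so "Alex Roe-Smith" yields AR, and the second author whose initials are JD is rendered as JD2.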
Grader 19 criteria (+ tier weights)
Defined once in generate_grader.py:CRITERIA. A headline's score = (weight of rules it passed) ÷ (weight of rules that applied to it) × 100. Rules with weight 0 are informational (shown, never scored); rules that return "N/A" for a given headline drop out of its denominator. The four tiers carry 27 / 27 / 36 / 9% of the total weight (see Grader tiers + the score).
| Key | Plain name | Tier | Wt | Method |
|---|---|---|---|---|
| char_count | Character count (outlet-specific) | Structure | 1 | Rule-based (90–104 default; UsW / WW / Life & Style 90–100) |
| subject_leads | Named entity leads | Structure | 1 | Rule-based |
| no_articles | No "The / A / An" lead | Structure | 1 | Rule-based |
| active_voice | Active voice | Structure | 1 | LLM (Groq) |
| no_lead_burial | No lead burial | Structure | 2 | LLM |
| formula | Formula present | Formula | 2 | Rule-based |
| no_vague_wtk | No "what to know" filler | Formula | 1 | Rule-based |
| keyword | Keyword present | Formula | 1 | Rule-based |
| expert_signal | Expert/study signal (proven lever) | Formula | 2 | Rule-based · scored only when the story has an expert/study to surface (else N/A) |
| number | Number lead | Formula | 0 | Informational |
| no_dym | No "Did you miss" | Quality | 2 | Rule-based |
| no_questions | No question headline | Quality | 1 | Rule-based |
| no_allcaps | No all-caps words | Quality | 1 | Rule-based |
| no_banned_punct | No em/en-dash, colon, semicolon | Quality | 1 | Rule-based · skipped (N/A) for UsW / WW / Life & Style |
| curiosity | Curiosity gap | Quality | 1 | LLM |
| accurate | Factually accurate | Quality | 2 | LLM |
| apple_heres | Here's / Here are | Platform | 0 | Informational |
| an_title_chars | Apple News title (90–120 chars) | Platform | 1 | Rule-based · scored only when the article is matched in Tarrow (else N/A) |
| sn_title_chars | SmartNews title (70–90 chars) | Platform | 1 | Rule-based · scored only when matched in Tarrow (else N/A) |
Weight 0 = informational (shown but not scored)—number and apple_heres, because whether they help depends on which platform the title goes to, and the grader sees only the website headline. The two platform-title rules (an_title_chars, sn_title_chars) would grade the custom Apple News / SmartNews titles, but those live in the Tarrow file that isn't joined to the grader yet, so they read "?" (N/A) for almost every headline and stay out of the score in practice. All 19 grade the H1 from the Tracker, never the platform-specific syndication titles.
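The scoring formula, with its two exclusions (weight-0 rules and N/A results), can be sketched like this. The function name and the dict shapes are assumptions for illustration; the weights in the example are taken from the table above.

```python
def headline_score(results, weights):
    """Score = passed weight / applicable weight * 100.

    results maps rule key -> True (pass), False (fail), or None (N/A).
    weights maps rule key -> integer weight (0 = informational).
    """
    passed = 0
    applied = 0
    for key, outcome in results.items():
        weight = weights.get(key, 0)
        if weight == 0 or outcome is None:
            continue  # informational or N/A: stays out of the denominator
        applied += weight
        if outcome:
            passed += weight
    if applied == 0:
        return None  # nothing scoreable applied to this headline
    return 100 * passed / applied


weights = {"char_count": 1, "no_lead_burial": 2, "formula": 2,
           "number": 0, "an_title_chars": 1}
results = {"char_count": True, "no_lead_burial": True, "formula": False,
           "number": False, "an_title_chars": None}
print(headline_score(results, weights))
```

In this example the applicable weight is 1 + 2 + 2 = 5 and the passed weight is 1 + 2 = 3, so the score is 60. The failed number rule and the N/A an_title_chars rule change nothing.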
expert_signal (the proven lever from the June 2026 analysis; see the headline experiment), no_banned_punct (the TH public-facing ban on em/en-dash, colon, semicolon), and the two platform-title rules an_title_chars / sn_title_chars are all now in the registry.
Expert/study signal (proven lever)
When an article is built on an expert or a study—a credentialed person, researcher, or doctor, or named research—putting that signal in the headline reliably lifts performance. The June 2026 headline analysis measured this as the single highest-impact finding (roughly +71% page-views and +50% engagement time), which is why it's an enforced, scored grader rule rather than just a tip.
How the rule behaves: it fails only when the source clearly has expert/study material but the headline leaves it out; it passes when the headline surfaces it. It shows "?" (not scored) when the story has no expert/study material to surface, when the brand is US Weekly, Woman's World, or Life & Style (where the rule doesn't apply), or when the tool can't confirm the expert from the article body.
Anti-fabrication is structural: the signal must come from the article body—the writer has done the research. The tool never invents an expert or study the body doesn't support. To pass: name the expert or study in the headline when the story actually has one.
Headline letter grades
Score → letter grade mapping for the visual grade pill:
| Score | Grade |
|---|---|
| 90–100 | A |
| 80–89 | B |
| 70–79 | C |
| 60–69 | D |
| <60 | F |
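The mapping in the table is a simple threshold lookup. In this sketch the function name is hypothetical, and treating each band as a lower bound (so a fractional score such as 89.5 falls in B) is an assumption the table does not spell out.

```python
def letter_grade(score: float) -> str:
    """Map a 0-100 headline score to the letter shown on the grade pill."""
    for floor, grade in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if score >= floor:
            return grade
    return "F"


print([letter_grade(s) for s in (100, 89.5, 70, 60, 59.9)])
```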
Grade trends per author are tracked in docs/grader/history.json (30-day rolling window).
Anonymization in practice
Per ops-hub/ANONYMIZATION.md, no proper colleague names appear in any committed file (Pierce is the sole exception).
Where it shows up
- docs/data/tiles.json · team_perf.writers[].initials: first letter of first name + first letter of last name (collisions get a digit suffix). No author field at all.
- docs/data/grader-signals.json · per-headline objects have vertical but no author.
- docs/grader/index.html · the _author_initials() helper renders authors as initials in every hcard. The full name is only used in-process for the AUTHOR_VERTICAL lookup, never serialized.
Mapping back to actual names
The live Tracker sheet has the canonical AUTHOR field. Operators with sheet access can dereference initials by sorting on AUTHOR + matching to the displayed pub-percentile. Initials map is not published.
Enforcement
Pre-push gate: /sync-repos Step 4.5 invokes /anonymization which scans every committed file against the 55-name denylist + transcript-filename patterns + verbatim-content patterns. Fail-closed.
Variant tool · what it is
A live editor-in-the-loop tool at /variant/ that takes one headline + its primary keywords (and optionally the article body) and returns 5–10 publication-ready alternative phrasings, plus one structurally-novel ⚡ wildcard for the long-term cohort study.
The tool answers a different question than Findings or Grader. Findings tells you what shape worked across thousands of past articles. Grader scores one headline. Variant generates options for an article you're about to publish—every option verified against the same production grader, every option containing every primary keyword you supplied, every option diverse enough from the seed that selecting one is a real editorial decision.
Who uses it. Content team writers + the content team lead. Each user gets a cap of 10 generations/day, 200/month, with the team sharing a ~700-runs/month budget ceiling, and the tool runs weekday-only (Mon–Fri, Eastern; closed weekends). Pierce is on an unlimited exception list because he funds the underlying API.
Why it exists. Variants offered for a single article were running lexically too similar—sibling bridges in a cluster looked interchangeable. The variant tool is the editorial-side fix: forced facet diversity across an option set when the article body provides article-supported latitude, plus a long-term study aimed at promoting the most-picked off-pattern wildcard structures into the standard format library.
For the full operator reference—form fields, the style test, the expert lever, reading the result cards, rate limits, troubleshooting—see the Variant tool documentation.
The headline experiment in the Variant tool
The headline experiment (trend-stage framing test + the proven expert/study lever) runs inside the Variant tool. Here's where it sits in this tool's architecture; the writer-facing controls and the A/B report are in the Variant tool documentation.
Three rule homes, three change cadences
The experiment is built on a clean split between what's settled, what's under test, and the record that ties a result back to its setup:
| Layer | Lives in | Changes how often | Governs |
|---|---|---|---|
| Proven / settled | The bundled rule corpus + the scored criteria registry | Rarely (only when a hypothesis graduates) | Length, punctuation ban, keyword presence, formula taxonomy, and the scored expert/study signal. |
| Under test (volatile) | A hot-editable database row (no redeploy) | Weekly+ | The Established / Emerging / Forecast framing definitions + their signal-word menus. |
| Provenance (the receipt) | An immutable, content-addressed config snapshot per generation | Every call | The complete setup—thresholds + rule set + expert-delivery mode + framing under test—so attribution and rollback are exact. |
The decision rule: a rule belongs in the settled/scored layer once an engagement winner has been picked and it's being enforced; it stays in the volatile database layer while it's still a hypothesis being injected and validated-but-not-scored. Scored ⇒ settled; under-test ⇒ database.
How it grades
- Expert signal (proven) grades like any scored rule, using the same "score it when the story supplies an expert/study, otherwise N/A" gate the keyword rule already uses.
- Framing (hypothesis) runs through a conformance check that confirms a test headline actually follows its assigned style, but never enters the score—so an unproven framing can't tank a production headline's grade or block generation.
Both the expert directive and the assigned framing ride the existing generation call—no extra LLM round-trips, no added cost. All the schema is additive and every new read fails open: a missing row falls back to bundled defaults exactly like the calibration read, so the experiment can never take the production tool down.
Spec: dev-docs/headline-experimentation/spec.md (problem statement + architecture). Writers begin the framing test 2026-06-10; styles are tuned weekly. A winning framing graduates into the scored layer only once engagement data picks it.
Quickstart · 60 seconds
- Paste the seed headline you want alternatives for. The tool produces rephrasings of the SAME article—not pitches for a different one.
- List the primary keywords (load-bearing SEO terms, comma-separated). Every variant the tool offers will contain ALL of them. Strict editorial rule.
- Optional but recommended: paste the full article body. The tool switches to "maximize difference" mode—pairwise diversity gates kick in, and the grader judges accuracy against the article instead of the headline alone. Without article body, the grader stays conservative.
- Click Generate variants. Wait 5–15 seconds. Each card shows score + similarity + structural breakdown.
- Keep the variant(s) you'd ship—the button toggles to Kept ✓; multi-pick is supported and kept state survives regeneration of the same seed (persists server-side for performance analysis). Or Copy to clipboard, or ↻ Regenerate for a fresh batch with the same inputs.
The form fields
| Field | Required | What it does |
|---|---|---|
| Seed headline | Yes (≥20 chars) | The version you wrote. Tool produces alternatives of the same article. |
| Primary keywords | Yes | Comma-separated. ALL must appear in every variant (case-insensitive). If a keyword being required would over-constrain variants, move it to Secondary. |
| Secondary keywords | No | Comma-separated. Bonus signal if a variant naturally surfaces them; not gated. |
| Article text | No (recommended) | Full article body. Switches the tool to article-mode (see below). Up to 50,000 chars (truncated to 8,000 in the LLM prompt for cost + injection-surface control). |
| Variants (count) | Always 10 | The tool always targets 10 alternative phrasings. Returns as many as the gates allow (minimum 3, including at least 1 ⚡ wildcard—guaranteed via fallback promotion if the wildcard would otherwise be missing). No operator slider—every call is max-effort. |
| Also avoid | No | One previously-published headline per line. New variants gated against each by the same cosine ceiling. Use when you've already shipped a deepener + bridges in the same cluster and need the next variant distinct from all of them. |
| Format / Persona / Platform / Outlet | No | Apply rule overrides from csa-content-standards. Only ACTIVE (finalized) formats / personas / platforms / outlets are listed—pending or draft entries are excluded. As bundled in the current rule corpus (RULES.js, synced from csa-content-standards): formats × 5 (Everything to Know · What to Know Next · Discover Explainer · Follow-Up Content · FAQ), personas × 2 (Curious Optimizer (TH) · Discover Browser), platforms × 5 (Apple News · SmartNews · MSN · Google News · Trend Hunter B2C), outlets × 4 (US Weekly · Woman's World · Life & Style · Trend Hunter B2C). Formats still pending in the standards (Recipe · Timeline · Recap · Obituary · Interview · Fan Theory · Fan Question) aren't in this list and default to general guidelines for grading. |
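The primary-keyword rule from the table (all keywords present, case-insensitive) can be sketched as a single check. The function name is hypothetical, and plain substring matching is an assumption; the tool may match on word boundaries instead.

```python
def contains_all_primary_keywords(variant: str, primary_keywords_csv: str) -> bool:
    """True when every comma-separated primary keyword appears in the variant."""
    keywords = [k.strip().lower() for k in primary_keywords_csv.split(",") if k.strip()]
    text = variant.lower()
    return all(keyword in text for keyword in keywords)


print(contains_all_primary_keywords("Beef Tallow Is Everywhere Now", "beef tallow, Everywhere"))
print(contains_all_primary_keywords("Beef Tallow Is Everywhere Now", "beef tallow, skincare"))
```

A variant missing even one primary keyword fails the gate, which is why an over-constraining keyword belongs in the Secondary field.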
Default vs article-mode
The tool runs in one of two regimes depending on whether the article body field is filled. The mode pill in the results header tells you which is active.
| Regime | Active when | Diversity bar | Accuracy bar |
|---|---|---|---|
| Default | No article body | "Options for selection"—variants may share content tokens across options (you pick one, rest are discarded). Only formula cap (≤2 per formula) and cosine (meaning-similarity, 0–1) vs seed enforced. | Conservative—variants must stay within the seed headline's stated scope. No "experts say," no specific findings, no listicle counts beyond what the seed implies. |
| Article-mode | Article body provided | "Maximize difference"—pairwise cosine (meaning-similarity, 0–1) ≤ 0.85 + pairwise content-token Jaccard (word-overlap, 0–1) ≤ 0.45 between rule-following variants (tightened 2026-05-27 per exec/leadership variety-weighting ask). Forces facet diversity across the set. | Permissive—variants may name any technique / mechanism / finding / source / comparison present in the article body, even if absent from the seed. The grader judges against the article, not the headline alone. |
Article-mode is what the content team should reach for when the article is in hand. Without it, the tool is honest about its narrower latitude and shows the conservative regime in the mode pill.
The wildcard slot
Every /generate returns exactly one variant tagged ⚡ wildcard. The wildcard is generated under an off-pattern prompt—its job is structural novelty, not faithful rephrasing.
What's different about a wildcard:
- Bypasses the rule-following LLM grader (active voice / lead burial / curiosity). Those criteria are calibrated for ordinary variants and would fight the wildcard's role.
- Still passes every hard structural rule at 100% (length, named-entity lead, formula, primary keywords, banned patterns)—wildcards are publishable shape, not garbage.
- Single accuracy safety-net check via a one-criterion LLM call—catches wildcards that introduce fabricated claims (e.g. "experts say" when no experts are in the source).
- Has its own cosine ceiling (0.95 vs seed) regardless of mode—the cohort study values structural fingerprint novelty, not semantic distance (meaning-difference), so a wildcard at 0.95 cosine with a structurally novel POS (part-of-speech) sequence is exactly what the study wants.
Verify wildcards editorially before publishing—they bypass the LLM judgment criteria by design. The chip + tooltip on each wildcard card discloses this.
The wildcard cohort feeds the format-promotion ledger—recurring wildcard structures that the editor consistently picks become candidate new headline standard formats.
Score floor · what's guaranteed
The grader is a port of the same criteria registry in generate_grader.py that runs daily on published headlines (kept in lock-step by a parity test). The variant tool applies it differently to each cohort.
Hard structural rules · 100% pass on every variant (rule-following + wildcard)
- Length · distinct numbers, not one band. The hard floor is 75 chars, a variant-tool runtime override (VARIANT_CHAR_RANGE.min in index.js) that permits tighter phrasings; it sits below the daily grader's publish floor. The scoring target band is 90–104 (target_min/target_max, matching the daily grader since 2026-06-05). Above target there's an 11-char editorial over-tolerance (VARIANT_CHAR_RANGE.over_tol, added 2026-06-16): options 105–115 chars soft-pass carrying an amber ⚑ editorial-judgment flag instead of dropping, so writers get a fuller set and decide whether to trim; only <75 or >115 drop. This tolerance is variant-tool-only: the daily grader carries no over_tol and still hard-fails any headline over target. So a 75–89-char variant clears the floor and is publishable but scores below one in 90–104; a 105–115-char variant passes flagged.
- Named-entity lead · no "The / A / An / It / This / There / He / She / They / Here" lead
- Formula present · matches at least one credit-worthy editorial formula (about two dozen, e.g. Number lead · "Here's" · Possessive · Colon-structured · Em-dash · How / Why / Inside leads · Action-verb · List-with-quantifier · Comparative · and more—see the full formula taxonomy)
- Primary keywords · every primary keyword the operator supplied appears in the variant
- Banned patterns · no question-mark ends · no "What to Know" / "Did you miss" · no all-caps non-acronyms
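The length rule in the list above has four outcomes, which a small classifier can make concrete. This is a minimal sketch, not the worker's code: the numbers come from the rule text, while the constant names, the outcome labels, and the use of Python's `len()` for character counting are assumptions.

```python
# Values copied from the length rule; names and labels are illustrative.
CHAR_MIN = 75      # hard floor (variant-tool runtime override)
TARGET_MIN = 90    # lower edge of the scoring target band
TARGET_MAX = 104   # upper edge of the scoring target band
OVER_TOL = 11      # editorial over-tolerance above the target


def classify_length(headline: str) -> str:
    """Return the length outcome for one variant."""
    n = len(headline)  # assumption: the worker may count characters differently
    if n < CHAR_MIN or n > TARGET_MAX + OVER_TOL:
        return "drop"                # under 75 or over 115
    if n < TARGET_MIN:
        return "pass_below_target"   # 75-89: publishable, scores lower
    if n <= TARGET_MAX:
        return "pass_in_target"      # 90-104: full length credit
    return "pass_flagged"            # 105-115: soft-pass with the amber flag
```

The daily grader would not have the last branch, since it carries no over-tolerance.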
LLM-judged criteria · ≥85 composite score (rule-following only)
Four binary criteria graded by an LLM: active_voice, no_lead_burial, curiosity, accurate. Composite score weights each—failing one weight-1 criterion = 94 (passes), failing one weight-2 criterion = 88 (passes), failing two = ≤82 (fails). The 85 floor tolerates a single borderline LLM judgment-call miss but rejects combinations or multiple substantive misses.
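The three example scores are consistent with a deduction of 6 points per weight unit. That figure is inferred from the 94 and 88 examples, not stated anywhere in the grader docs, and which criterion carries which weight is not given here. Under that assumption, a sketch of the arithmetic looks like this (function names are hypothetical):

```python
POINTS_PER_WEIGHT = 6  # inferred from the examples: weight-1 miss = 94, weight-2 miss = 88
SCORE_FLOOR = 85       # composite floor for rule-following variants


def composite(failed_weights):
    """Composite score given the weights of the criteria that failed."""
    return 100 - POINTS_PER_WEIGHT * sum(failed_weights)


def passes(failed_weights):
    """True when the composite clears the 85 floor."""
    return composite(failed_weights) >= SCORE_FLOOR
```

With this formula a weight-1 plus weight-2 miss scores 82 and fails, matching the text. The stated "≤82" ceiling for two misses holds whenever at least one of the two failed criteria is weight-2.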
Wildcards · one-criterion accuracy safety net
Wildcards skip the 4-criterion LLM grader entirely. Instead a single LLM call judges accuracy alone (does the wildcard introduce a falsifiable claim the source contradicts or doesn't cover?). Default PASS. Catches fabrications without fighting the structural-novelty objective.
Active gates · cosine + grader + diversity
| Gate | Default mode | Article-mode | What it catches |
|---|---|---|---|
| Cosine vs seed (rule-following) | < 0.97 | < 0.93 | Near-literal duplicates of the seed headline. Tighter in article-mode where the article gives variants room to legitimately diverge. |
| Cosine vs each "Also avoid" entry | < 0.97 | < 0.93 | Collisions with previously-selected sibling headlines from the same cluster. |
| Cosine vs seed (wildcard) | < 0.95 | < 0.95 | Wildcard-specific ceiling—independent of mode (cohort study values structural fingerprint, not semantic distance). |
| Pairwise cosine between rule-following variants | — | < 0.85 | Forces facet diversity in article-mode; drops the lower-grader variant on collision (tiebreak: drop higher cosine-to-seed). Tightened from 0.92 on 2026-05-27. |
| Pairwise content-token Jaccard between rule-following variants | — | < 0.45 | Catches variants that share most content nouns—different formulas of the same content angle. Tightened from 0.55 on 2026-05-27. |
| Formula cap | ≤ 2 per formula | ≤ 2 per formula | No formula hogs the option set across the full credit-worthy formula taxonomy. Lowest-scoring excess get dropped. |
| Grader (rule-following) | ≥ 85 composite | ≥ 85 composite | Hard structural rules at 100% + LLM-judged criteria pass at ≥85 floor. |
| Grader (wildcard) | Hard rules + accuracy safety-net | Hard rules + accuracy safety-net | Hard structural rules at 100%; LLM-judged criteria bypassed; single accuracy check catches fabrications. |
If the pool falls below the requested N after all gates, the pipeline runs up to 2 regeneration passes with explicit failure-mode feedback (e.g. "char_count under 80 in 4 variants—extend with article specifics"). After 2 passes, returns whatever's in the pool with a recovery-advice payload.
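The cosine and Jaccard gates in the table can be sketched in a few lines. The ceilings are copied from the table; everything else is an assumption made for illustration: the helper names, the tiny stopword list standing in for "content tokens", and the strict less-than comparison (the table writes `<`, the mode section writes `≤`).

```python
import math

# Ceilings from the gates table; the (cohort, mode) keys are illustrative.
COSINE_CEILING = {
    ("rule", "default"): 0.97,
    ("rule", "article"): 0.93,
    ("wildcard", "default"): 0.95,
    ("wildcard", "article"): 0.95,
}
PAIRWISE_JACCARD_MAX = 0.45

# Assumption: a stand-in stopword list; the worker's tokenizer is not documented here.
STOPWORDS = {"a", "an", "the", "to", "of", "in", "on", "for", "and", "is", "are", "with"}


def cosine(u, v):
    """Cosine similarity; for L2-normalized vectors this reduces to the dot product."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def passes_seed_gate(similarity, cohort, mode):
    """True when a candidate is far enough from the seed for its cohort and mode."""
    return similarity < COSINE_CEILING[(cohort, mode)]


def content_tokens(text):
    """Lowercased words minus punctuation and stopwords."""
    words = (w.strip(".,:;!?\"'()").lower() for w in text.split())
    return {w for w in words if w and w not in STOPWORDS}


def jaccard(a, b):
    """Word-overlap between two headlines, 0 to 1."""
    ta, tb = content_tokens(a), content_tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0
```

In article-mode, a pair of rule-following variants whose `jaccard` reaches `PAIRWISE_JACCARD_MAX` would collide, and the lower-scoring one would be dropped.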
Quotas + caps
Four guardrails: a weekday-only window (Mon–Fri ET), a per-user daily cap, a per-user monthly cap, and the team-wide budget. The counters are atomic in D1, so concurrent calls cannot overshoot. Caps are visible to the user as pills above the form.
| Cap | Default | Resets | Behavior at limit |
|---|---|---|---|
| Daily generations / user | 10/day | 00:00 UTC | 429 + recovery panel ("resets at 00:00 UTC") |
| Monthly generations / user | 200/month | 1st of month UTC | 429 + recovery panel ("resets on the 1st") |
| Weekday-only (Mon–Fri ET) | Closed Sat/Sun (Eastern) | Reopens Monday | ERR_WEEKEND + recovery panel ("back online Monday") |
| Team-wide budget | $70/month → ~700 runs at $0.10/call | 1st of month UTC | 429 to ALL users + "contact ops to extend" |
| Slack alert | 80% of team budget | once per crossing | Webhook fires once when team_spend ≥ $56—early warning before hard cap |
UI pills: "today X/Y" · "month X/Y" · "team ~X/Y runs left"—pills turn amber as you approach a cap, red when reached. The team pill converts dollars to estimated runs at the current per-call cost so users see a single shared ceiling in the units they care about.
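The team pill's dollars-to-runs conversion and the 80% alert threshold are simple arithmetic. The sketch below works in integer cents to avoid floating-point drift; that choice, and the function names, are assumptions rather than the worker's implementation.

```python
def runs_left(budget_cents, spend_cents, cost_per_call_cents):
    """Estimated runs remaining under the team budget."""
    remaining = max(budget_cents - spend_cents, 0)
    return remaining // cost_per_call_cents


def alert_due(budget_cents, spend_cents, already_fired, fraction_pct=80):
    """True the first time spend reaches the alert fraction of the budget."""
    threshold = budget_cents * fraction_pct // 100
    return (not already_fired) and spend_cents >= threshold
```

With the defaults from the table, `runs_left(7000, 0, 10)` gives 700 runs, and the alert becomes due once spend reaches 5600 cents ($56).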
Unlimited-users exception
The UNLIMITED_USER_EMAILS env var (comma-separated emails) bypasses every cap—the per-user daily and monthly limits and the weekend gate—for listed users. Calls still record into the counter tables and accrue cost for transparency in the usage log—they just never get blocked. Pierce is on this list because he funds the underlying API.
To tune the caps: edit the [vars] block in workers/variant/wrangler.toml (RATE_DAILY · RATE_MONTHLY · BUDGET_MONTHLY_USD · BUDGET_ALERT_FRACTION · EST_COST_PER_CALL_USD), then run wrangler deploy.
UNLIMITED_USER_EMAILS lives as a Worker secret (not [vars]) because it lists colleague email addresses, and committing those would violate anonymization. Update via echo "$LIST" | wrangler secret put UNLIMITED_USER_EMAILS (double quotes, so the shell expands $LIST); see DEPLOY.md §4 for rotation procedure.
If 0 variants returned
The tool tries up to 2 regeneration passes before giving up. If the pool is still empty (or short of N), the response carries a recovery_advice array—specific actionable guidance based on the dominant failure mode that pass.
The UI surfaces it as an amber-bordered "what to try" panel below the (empty) results, with a ↻ Regenerate button next to it. Examples of what the panel says depending on the failure pattern:
- "Most candidates were rejected as too similar to your seed (cosine gate). Try with article body to unlock more varied phrasings."
- "Most candidates failed the rule-based grader—your seed may be too short to extend into the 90–104 char target or contain banned patterns. Try lengthening the seed."
- "You're requiring all 4 primary keywords in every variant—that's a strict bar. Move the least-essential keyword to 'secondary keywords' to widen the pool."
- "Pairwise diversity rejected several candidates—your topic may have only a couple of viable facets. Try generating fewer (e.g., n=5) or remove an 'also avoid' entry."
Form state persists in sessionStorage—the operator can adjust inputs and regenerate without re-pasting everything.
My usage log
Every authenticated page load fetches the user's own usage from D1 (server-side, never localStorage) and populates the pills + the collapsible "My usage log" panel below them.
Personal section (CF-Access-authenticated email is the filter—users see only their own activity):
- Per-day breakdown · last 30 days (calls / variants returned / cost in $)
- Recent generations · last 30 (timestamp, headline preview, returned_n / requested_n, mode, request_id for forensics)
Team-wide visibility in the quota pills above the form: today's team total + this month's spend vs budget are surfaced as part of the standard quota line so concurrent operators see capacity without a separate panel. Per-user monthly breakdown is available via GET /metrics/quota-by-user if needed for ops review; not currently rendered in the UI.
Refreshes on every page load. The cap-enforcement counters are atomic in D1; the read endpoint is purely visibility—concurrent /generate calls cannot over-shoot the limits regardless of what this panel reports.
Architecture · Worker + D1 + Workers AI
Cloudflare Worker at workers/variant/ backed by D1 (shared with workers/history) + Cloudflare Workers AI (embeddings) + Anthropic API (generation + grading). Static page at docs/variant/.
| Component | Technology | Purpose |
|---|---|---|
| Static page | Plain HTML/JS/CSS in docs/variant/ | Form + results UI; served by Cloudflare Pages |
| API | Cloudflare Worker (workers/variant/src/index.js) | POST /generate-async (live client path) · /generate (legacy sync) · /select · /unselect · /outcome · /regenerate; GET /job/:id · /healthz · /seed/:fp/kept · /metrics/* · /export/* |
| Database | Cloudflare D1 (data-headlines-history) | Shared with workers/history. 18 variant_* tables (plus schema_versions for migration tracking) + v_active_kept view across migrations 0001–0014. |
| Embeddings | Workers AI @cf/baai/bge-large-en-v1.5 | 1024-dim L2-normalized vectors; cosine similarity gates. Free within neuron limits, attached to the same Cloudflare account. |
| Generation (rule-following) | Anthropic Sonnet 4.6 (default; tunable via env var) | n−1 publishable variants per call |
| Generation (wildcard) | Anthropic Sonnet 4.6 (default; tunable) | Exactly 1 off-pattern variant per call (capped via slice(0,1) in case the model returns more) |
| LLM grader (rule-following) | Anthropic Haiku 4.5 (default; tunable) | 4-criterion grader · active_voice · no_lead_burial · curiosity · accurate |
| LLM grader (wildcard accuracy) | Same model as the grader | 1-criterion safety net · accuracy only |
| Auth | Cloudflare Access (page) + X-Auth-Token shared bearer (writes) | Page is gated by Access; CF-Access-Authenticated-User-Email header drives per-user identity. WRITE_TOKEN secret for /generate /select /outcome. |
| Rate limit / budget | D1 atomic counters (variant_user_quota_* + variant_team_budget) | INSERT...ON CONFLICT DO UPDATE—serializable per-row, cannot over-shoot under concurrency. |
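The upsert pattern named in the last table row can be tried locally, since D1 is SQLite-based. This is a sketch using Python's built-in `sqlite3` against an in-memory database. The table and column names follow the schema section, but the increment-then-check flow and the `try_consume` helper are assumptions; the worker may gate differently (for example, checking before it increments).

```python
import sqlite3


def open_db():
    """In-memory stand-in for the shared D1 database."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE variant_user_quota_daily ("
        " user_email TEXT NOT NULL, ymd TEXT NOT NULL, count INTEGER NOT NULL,"
        " PRIMARY KEY (user_email, ymd))"
    )
    return db


def try_consume(db, email, ymd, daily_cap):
    """Bump the caller's counter in one statement, then report whether it fits the cap."""
    key = (email.lower(), ymd)  # emails are stored lowercased
    db.execute(
        "INSERT INTO variant_user_quota_daily (user_email, ymd, count) VALUES (?, ?, 1) "
        "ON CONFLICT(user_email, ymd) DO UPDATE SET count = count + 1",
        key,
    )
    (count,) = db.execute(
        "SELECT count FROM variant_user_quota_daily WHERE user_email = ? AND ymd = ?",
        key,
    ).fetchone()
    db.commit()
    return count <= daily_cap
```

Because the insert and the increment are a single statement, two callers racing on the same row cannot both read a stale count and write the same value back.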
Model dispatch is by env-var prefix: any model name starting with @cf/ routes to Workers AI via env.AI.run (free, account-bound); anything else (e.g. claude-sonnet-4-6) routes to Anthropic via fetch. Workers AI Llama 3.3 70B is a tested fallback for cost-down—quality is meaningfully worse on the rule-following generator (more clichés, weaker prompt adherence) but acceptable for the LLM grader role if needed.
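The prefix rule fits in one line. A minimal sketch, with hypothetical provider labels:

```python
def provider_for(model_name: str) -> str:
    """Route by model-name prefix: @cf/ goes to Workers AI, anything else to Anthropic."""
    return "workers_ai" if model_name.startswith("@cf/") else "anthropic"
```

So `provider_for("@cf/baai/bge-large-en-v1.5")` returns `"workers_ai"` and `provider_for("claude-sonnet-4-6")` returns `"anthropic"`.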
Pipeline · stages in order
One POST /generate call traverses these stages. Up to 2 regeneration passes if the pool falls short.
- Validate body: headline ≥20 chars, primary keywords required, article ≤50,000 chars, format/persona/platform/site allowlisted, prompt-injection regex screen on the article body.
- Idempotency check: if client_request_id matches a recent (≤5min) generation for the same email, return its generation_id without re-running the pipeline.
- Rate-limit gate: read D1 counters for daily/monthly/budget. Skip caps if email is in UNLIMITED_USER_EMAILS.
- Anonymization scrub: load denylist from the DENYLIST_JSON Cloudflare Secret (cached per-isolate). Replace any matched name with […] in headline, article, primary/secondary keywords, also-avoid entries.
- Generate · rule-following + wildcard in parallel via callRuleFollowing + callWildcard. Retry once on transient (429/5xx) Anthropic + Workers AI failures.
- Embed · single Workers AI batch call · seed + each also-avoid + each candidate variant. On second-attempt failure, degrade to cosine=0 (rule-based grader still gates).
- Cosine gate · per-cohort ceilings (rule-following: 0.97 default / 0.93 article; wildcard: 0.95). Track every dropped candidate in allOfferedCandidates with dropped_at_gate='cosine'.
- Rule-based grader pre-check · char_count, formula, named-entity lead, primary keywords, banned patterns. Drop on any non-LLM rule failure with dropped_at_gate='grader_rule'.
- LLM grader · batched call for rule-following candidates only · 4 criteria. Wildcards skip this stage. On grader failure, degrade to rule-based-only score (LLM criteria null) instead of 502.
- Compose graded set · rule-following with merged LLM results (drop sub-85) + wildcards with rule-based-only score + accuracy safety-net call.
- Set-level diversity · formula cap (≤2 per formula). Article-mode adds pairwise cosine ≤0.85 + pairwise Jaccard ≤0.45 between rule-following variants.
- Regen loop · if pool < requestedN and passes < 2, re-prompt the LLM with explicit failure-mode feedback (which gate dropped the most variants this pass).
- Persist · D1 batch: generation row + every offered candidate (rule + wildcard, kept + dropped, with dropped_at_gate annotation) + back-compat write to legacy variant_offered_wildcards. Atomic.
- Record cost · atomic D1 counter increment (always; failed pipelines still count for DoS protection). 80%-budget Slack alert fires once per crossing.
- Return · variants + pipeline stats + quota snapshot + recovery_advice if pool short.
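The idempotency stage in the list above can be illustrated with a small lookup keyed on email and client_request_id. This sketch keeps state in a dictionary purely for demonstration; the worker presumably reads D1, and the class and method names here are hypothetical.

```python
IDEMPOTENCY_WINDOW_S = 5 * 60  # "recent" means within 5 minutes


class RecentGenerations:
    """In-memory stand-in for the recent-generation lookup."""

    def __init__(self):
        # (email, client_request_id) -> (generation_id, created_at_s)
        self._seen = {}

    def record(self, email, client_request_id, generation_id, now_s):
        self._seen[(email.lower(), client_request_id)] = (generation_id, now_s)

    def lookup(self, email, client_request_id, now_s):
        """Return the earlier generation_id if the same request was seen recently."""
        hit = self._seen.get((email.lower(), client_request_id))
        if hit and now_s - hit[1] <= IDEMPOTENCY_WINDOW_S:
            return hit[0]
        return None
```

A hit short-circuits the pipeline, so a retried request returns the earlier generation_id instead of being billed twice.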
Model assignments
All three LLM roles are env-var-driven. Edit workers/variant/wrangler.toml [vars] block + wrangler deploy to swap any role. Dispatch by prefix: @cf/... → Workers AI (free), anything else → Anthropic.
| Role | Env var | Current default | Why this model |
|---|---|---|---|
| Rule-following generator | MODEL_RULE_FOLLOWING | claude-sonnet-4-6 | Strong prompt-following—the full credit-worthy formula taxonomy (about two dozen) + scope discipline + multi-facet diversity is genuinely hard for smaller models. Tested Llama 3.3 70B; quality drop was significant (cliché flood, weaker adherence). |
| Wildcard generator | MODEL_WILDCARD | claude-sonnet-4-6 | Was Opus 4.7 (cost-heavy); was Llama 3.3 70B (quality dropped + multi-wildcard issues); Sonnet 4.6 is the right cost/quality balance for a single creative variant per call. |
| LLM grader | MODEL_GRADER | claude-haiku-4-5-20251001 | Classification task with loose default-PASS prompts; Haiku is cheap and reliable. Tested Llama 70B; Haiku's structured-JSON output is more dependable. |
| Embeddings | (hardcoded) | @cf/baai/bge-large-en-v1.5 | Free on Workers AI, 1024-dim, performs well on news-text similarity. Same model the CSA Diff Tool uses for parity. |
Cost model. ~$0.06–$0.10 per /generate call on the current Sonnet+Sonnet+Haiku stack. Recent observed range $0.057–$0.099 (varies with article-body length + regen passes).
D1 schema · 18 tables + 1 view
Single D1 database (data-headlines-history) shared with workers/history. Migrations 0001–0014 applied; current schema_version is 14. Migration 0004 added the kept toggle (unkept_at + seed_fingerprint + v_active_kept view); 0006/0007/0008 added partial unique indexes; 0009 lowercased legacy user_email values for case-consistent matching; 0010 added embed_degraded sentinel column on offered_candidates + offered_wildcards and widened the offered_candidates UNIQUE to (generation_id, cohort, variant_text); 0011 widened the variant_selections UNIQUE to match (cohort included) and added partial UNIQUE(is_active=1) on rules_versions + prompts; 0012 added the variant_jobs async-job queue (no version stamp—table-only, so MAX(version) jumps 12→13 across the 0012/0013 pair); 0013 tightened the article-mode pairwise gates; 0014 added the headline-experiment data layer (variant_rule_sets + variant_config_specs, plus rule_set_id/spec_hash/headline_type columns on generations + offered_candidates).
| Table | Purpose | Key columns |
|---|---|---|
| variant_generations | One row per /generate call. Audit + reproducibility tuple. | request_id · user_email · headline · calibration_id · prompt_*_id · model_* · rules_source_sha · denylist_version · article_mode · returned_n · est_cost_usd · duration_ms · primary_keywords · secondary_keywords · client_request_id · generated_at_iso |
| variant_offered_candidates | Every candidate offered (rule + wildcard, kept + dropped); canonical surface for the cohort study (added migration 0003). | generation_id · cohort · variant_text · similarity_to_input · grader_score · grader_breakdown · fingerprint_json · picked · dropped_at_gate · pass_index · created_at_iso |
| variant_offered_wildcards | Legacy wildcard-only table from migration 0001. Worker dual-writes for back-compat with older cohort analyzers. Migration 0008 added a UNIQUE(generation_id, variant_text) so the dual-write INSERT OR IGNORE is race-safe. | (same as offered_candidates filtered to wildcard cohort) |
| variant_selections | One row per /select. UNIQUE(generation_id, variant_text), so a double-click can't double-insert. | generation_id · variant_text · cohort · grader_score · grader_breakdown · n_offered_in_set · rank_in_set · selected_at_iso |
| variant_fingerprints | Structural + lexical features per selected variant (1:1). Powers cohort clustering. | selection_id · char_count · leading_pos · primary_verb · formula_match · bigram_signature · trigram_signature · pos_sequence_signature · cluster_key · syllable_count · named_entities_json · dependency_skeleton |
| variant_outcomes_join | Post-publish outcome metrics joined to selection. UPSERT on selection_id with COALESCE so partial enrichments accumulate. | selection_id · url · publication · publish_date · headline_in_production · edit_distance_to_pick · total_pvs · search_pvs · social_pvs · applenews_pvs · smartnews_pvs · ctr · dwell_seconds_p50 · scroll_depth_p50 · ecpm_usd · best_position · outcome_source · matched_at_iso |
| variant_calibrations | Versioned numeric pipeline knobs. Generation rows reference the active row. | score_floor · cosine_gate_input_default · cosine_gate_input_article · cosine_gate_wildcard · pairwise_cosine_max · pairwise_jaccard_max · formula_per_set_max · char_min · char_target_min · char_target_max · max_regen_passes · is_active |
| variant_prompts | SHA-256-keyed prompt registry. One row per (mode, prompt_hash). Auto-seeded by the worker. | mode · prompt_hash · prompt_text · applied_at_iso · is_active |
| variant_models | Provider + pricing snapshot per model identifier ever seen. | name · provider · cost_input_per_m_usd · cost_output_per_m_usd · first_seen_at_iso · is_active |
| variant_rules_versions | csa-content-standards SHA bundled at deploy time (via sync_rules.py). | source_repo · source_sha · generated_at_iso · is_active |
| variant_rule_sets | Hot-swappable experimental rule arms for the headline experiment (added migration 0014). Active-row pattern (single-active partial UNIQUE): the worker reads the active row fail-open at generation time. Holds the CTRL/EMG/FWD framing definitions + signal-word menus + the expert directive in rule_set_json. | label · is_active · applied_at_iso · rule_set_json · notes |
| variant_config_specs | Immutable, content-addressed config snapshots: the per-headline "receipt" (added migration 0014). One row per distinct config; spec_hash = SHA-256 of the canonicalized spec JSON; INSERT OR IGNORE so it stays small. A generation's spec_hash dereferences here to the exact config that produced it. | spec_hash (PK) · spec_json · created_at_iso · label |
| variant_format_candidates | Cluster of recurring wildcard structures surfaced by analyze_variant_cohort.py. Status enum: surfaced → under_review → promoted / rejected / retired. | cluster_key · n_wildcards_offered · n_wildcards_picked · cluster_pick_rate · lift_vs_global · pv_median_ratio_point · status · promoted_format_name · csa_standards_sha |
| variant_format_events | Append-only audit log of every status transition (governance trail). | candidate_id · ts_iso · prev_status · new_status · actor · detail |
| variant_user_quota_daily | Atomic per-user per-day counter (PRIMARY KEY user_email + ymd). | user_email · ymd · count |
| variant_user_quota_monthly | Atomic per-user per-month counter. | user_email · ym · count |
| variant_team_budget | Atomic team-wide spend counter + idempotent 80%-alert flag. | ym · spend_usd · alert_80_fired · alert_100_fired |
| variant_jobs | Async-job lifecycle queue for /generate-async (added migration 0012). One row per job; status queued → running → complete or failed, with fine-grained stage tracking. Stores the request body (replay) + the full result_json on complete; rows >7 days are reaped. | job_id (PK) · user_email · status · stage · request_body · result_json · error_code · created_at |
Portability. Every *_at epoch_ms field has an *_iso mirror column populated at write time. Indexes cover every filter/group-by surface (calibration, prompt, model, rules SHA, cluster_key, total_pvs, publish_date, etc.). Foreign keys declared; PRAGMA foreign_keys = ON set per-request in the worker.
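The content-addressed spec_hash described for variant_config_specs can be sketched as follows. The schema text only says "SHA-256 of the canonicalized spec JSON"; the canonical form used here (sorted keys, compact separators, UTF-8) is an assumption, and the worker's canonicalization may differ in detail.

```python
import hashlib
import json


def spec_hash(spec: dict) -> str:
    """SHA-256 hex digest of a canonical JSON rendering of the config spec."""
    # Assumption: sorted keys + compact separators define "canonical" here.
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two specs with the same keys and values hash identically regardless of key order, which is what lets INSERT OR IGNORE keep one row per distinct config.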
Versioning · calibrations / prompts / models / rules
Every variant_generations row carries the full reproducibility tuple—what numeric calibration was active, which exact prompt template was sent, which model handled each role, which csa-content-standards SHA the rule corpus came from, which denylist version was loaded.
Why this matters. The variant tool will be tuned over time. Score floors will move, prompts will be loosened or tightened, models will be swapped. A/B comparisons across versions become trivial SQL: WHERE calibration_id = X vs Y. Rolling improvements are measurable rather than asserted.
How a new calibration ships
- Edit constants in workers/variant/src/index.js (or wrangler.toml env vars for the LLM models).
- Run wrangler deploy.
- On next request, ensureMetadata() reads the active variant_calibrations row. If the new constants don't match, INSERT a new calibration row + deprecate the prior. Worker stamps the new ID onto subsequent generations.
- Compare A/B over a few days: SELECT calibration_id, AVG(returned_n), AVG(est_cost_usd), AVG(rejected_grader) FROM variant_generations WHERE generated_at > ? GROUP BY calibration_id.
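The A/B comparison query can be rehearsed against a throwaway SQLite database before it is run on D1. In this sketch the table is reduced to the columns the query names, the four rows are invented demo fixtures rather than real measurements, and the ORDER BY is added only to make the output order deterministic.

```python
import sqlite3

AB_QUERY = (
    "SELECT calibration_id, AVG(returned_n), AVG(est_cost_usd), AVG(rejected_grader) "
    "FROM variant_generations WHERE generated_at > ? "
    "GROUP BY calibration_id ORDER BY calibration_id"
)


def demo_db():
    """In-memory table holding invented sample rows (not real data)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE variant_generations (calibration_id INTEGER, returned_n INTEGER,"
        " est_cost_usd REAL, rejected_grader INTEGER, generated_at INTEGER)"
    )
    rows = [
        (1, 8, 0.08, 4, 1000),
        (1, 10, 0.06, 2, 2000),
        (2, 10, 0.07, 1, 3000),
        (2, 10, 0.09, 1, 500),  # older than the cutoff used below, so excluded
    ]
    db.executemany("INSERT INTO variant_generations VALUES (?, ?, ?, ?, ?)", rows)
    return db


def compare_calibrations(db, since_epoch_ms):
    """One row per calibration: id, avg returned_n, avg cost, avg grader rejections."""
    return db.execute(AB_QUERY, (since_epoch_ms,)).fetchall()
```

Calling `compare_calibrations(demo_db(), 900)` yields one summary row per calibration_id, computed only from generations newer than the cutoff.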
How a new prompt ships
Edit the prompt template in workers/variant/src/llm.js. Worker computes SHA-256 of the system prompt at request time. INSERT-OR-IGNORE into variant_prompts (idempotent on hash). Subsequent generations reference the new prompt_id; A/B comparison via WHERE prompt_rule_id = X.
Cohort study + format-promotion ledger
The wildcard cohort is the long-term feedstock for surfacing new headline standard formats—recurring wildcard structures that the operator consistently picks (and that perform once published) become candidates for promotion to the rule library.
The pipeline
- Collect · every offered candidate persists with structural fingerprint (cluster_key = formula × leading_pos × first 3 POS tokens). Both kept + dropped wildcards land in variant_offered_candidates.
- Cluster · scripts/analyze_variant_cohort.py groups wildcards by cluster_key. Per cluster: count offered, count picked, pick rate, performance lift if outcomes joined. Aborts when /export/* is truncated at 5000 rows so reports are never silently biased.
- Surface candidates · clusters with significantly higher pick rate or PV lift than the rule-following baseline land in variant_format_candidates with status='surfaced'.
- Review · operator (Pierce + content team lead) reviews surfaced candidates. Status moves to under_review → promoted / rejected.
- Promote · promoted clusters are codified into csa-content-standards as new headline format guidelines. The promotion event records the csa-content-standards SHA at promotion time, so a reviewer can diff exactly what landed in the standards docs because of this candidate.
- Audit · variant_format_events append-only log captures every status transition (who promoted what when, against which standards SHA).
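The Collect and Cluster steps above reduce to building a key and counting picks per key. This is a minimal sketch, not a port of analyze_variant_cohort.py: the key's string layout (`|` and `-` separators), the dictionary field names, and the sample tag values in the usage note are assumptions.

```python
from collections import defaultdict


def cluster_key(formula, leading_pos, pos_sequence):
    """formula x leading_pos x first 3 POS tokens, joined into one string."""
    return "|".join([formula, leading_pos, "-".join(pos_sequence[:3])])


def pick_rates(candidates):
    """Pick rate per cluster: picked count divided by offered count."""
    offered = defaultdict(int)
    picked = defaultdict(int)
    for c in candidates:
        key = cluster_key(c["formula"], c["leading_pos"], c["pos_sequence"])
        offered[key] += 1
        if c["picked"]:
            picked[key] += 1
    return {key: picked[key] / offered[key] for key in offered}
```

For example, two wildcards that share `number_lead`, `NUM`, and the POS prefix `NUM-NOUN-VERB` fall into one cluster; if one of them was picked, that cluster's pick rate is 0.5. Significance testing (the bootstrap CIs the script computes) is out of scope here.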
What's shipped vs planned
- Shipped: migrations 0003-0014 schema (variant_offered_candidates · variant_format_candidates · variant_format_events · v_active_kept · UNIQUE indexes · lowercased user_email · embed_degraded sentinel · cohort-aware UNIQUE on offered_candidates + variant_selections · partial UNIQUE(is_active=1) on rules_versions + prompts · the variant_jobs async queue · the headline-experiment data layer variant_rule_sets + variant_config_specs with per-call/per-headline attribution columns). Worker writes every offered candidate's fingerprint, including diversity-dropped candidates. scripts/analyze_variant_cohort.py computes cluster pick-rates + bootstrap CIs + edit-distance tail; aborts cleanly on /export/* truncation rather than silently biasing the report; non-zero-baseline guards on candidate-format surfacing; isinstance NaN guards on CI bounds; html.escape on operator-supplied report content.
- Planned: weekly cron to refresh format-candidate rankings; a UI surface for reviewing surfaced candidates; promotion-to-standards integration.
Anonymization · DENYLIST_JSON
The variant tool reads the colleague-name denylist from a Cloudflare Secret (DENYLIST_JSON) at request time—never bundled into the build artifact, never committed to git. Per-isolate cache so we re-parse the secret only on cold start.
Format:
{"version":"2026-05-07","names":["<colleague-1>","<colleague-2>","..."]}
(Legacy: a bare JSON array of strings is also accepted.)
Where it scrubs:
- Pre-call · headline, article body, primary + secondary keywords, also-avoid entries, all scrubbed before they reach the LLM. Names replaced with […] placeholders.
- Post-call · LLM outputs scrubbed before they reach the user (defense in depth: the LLM shouldn't introduce new colleague names, but if it echoes one that survived input scrub via partial match, this catches it).
- Hard reject · any candidate variant containing a denylist name post-scrub is dropped from the pool entirely (treated as a denylist failure).
The denylist version stamps onto every variant_generations row (denylist_version column) for reproducibility—if a name is added or removed from the denylist, the version stamp lets you scope queries to before/after the change.
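The two accepted secret formats and the scrub step can be sketched together. The matching rules here (case-insensitive, whole-word, longest names first) and the "legacy" version label for the bare-array format are assumptions, not the worker's documented behavior; "Jane Roe" in the usage note is a made-up placeholder name.

```python
import json
import re

PLACEHOLDER = "[…]"


def load_names(secret_text):
    """Parse DENYLIST_JSON in either format; returns (version, names)."""
    data = json.loads(secret_text)
    if isinstance(data, list):       # legacy: bare JSON array of strings
        return "legacy", data        # assumption: label for unversioned lists
    return data["version"], data["names"]


def _pattern(name):
    return re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)


def scrub(text, names):
    """Replace every denylisted name with the placeholder."""
    for name in sorted((n for n in names if n), key=len, reverse=True):
        text = _pattern(name).sub(PLACEHOLDER, text)
    return text


def contains_denied(text, names):
    """Hard-reject check: True if any denylisted name is still present."""
    return any(_pattern(n).search(text) for n in names if n)
```

`scrub("Jane Roe explains the budget", ["Jane Roe"])` returns `"[…] explains the budget"`, and `contains_denied` on the scrubbed output is false, so the candidate would survive the hard-reject step.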
Pierce is the only permitted name across all repos; the denylist excludes him. To set / update:
cd workers/variant && wrangler secret put DENYLIST_JSON < /tmp/denylist.json
Deployment · secrets + migrations
Detail in workers/variant/DEPLOY.md. Summary here.
One-time setup
- Apply migrations to remote D1 in order, 0001 → 0014 (0001_variant_tables.sql … 0011_selections_cohort_unique_and_active_partials.sql → 0012_async_jobs.sql → 0013_tighten_article_pairwise_gates.sql → 0014_headline_experimentation.sql). Or use npm run migrate:all from workers/variant/. Then run the one-shot seed 0014_seed_phase1_rule_set.sql ONCE after migrating (it's a deliberate operator step, NOT in the migrate chain; idempotent via WHERE NOT EXISTS). Confirm the schema_version: 14 stamp via /healthz.
- Set worker secrets: WRITE_TOKEN (32-hex shared bearer for /generate /select /outcome), ANTHROPIC_API_KEY (McClatchy-billed), DENYLIST_JSON (anonymization), UNLIMITED_USER_EMAILS (comma-separated emails that bypass quota; secret because it lists colleague addresses), SLACK_WEBHOOK_URL (optional, for 80%-budget alerts).
- Bundle the rule corpus + denylist before first deploy: python3 scripts/sync_rules.py. (DENYLIST_JSON is runtime-loaded; only RULES.js is bundled.)
- Add /variant/ to the existing CF Access app on data-headlines.pierce.tools (already covered by the data-headlines.pierce.tools Access policy by default).
Per-deploy
cd workers/variant
python3 scripts/sync_rules.py # refresh RULES.js from csa-content-standards
wrangler deploy # ship Worker
curl -s https://data-headlines-variant.pierce-williams.workers.dev/healthz | jq
The /healthz response surfaces: schema_version, bindings (ai/db/anthropic_key—the legacy KV binding was retired in migration 0003), active models, rules_loaded, denylist configuration + version, active calibration row. /healthz/deep additionally pings Workers AI for a live latency check. All other GET endpoints require X-Auth-Token.
API endpoints
| Method · Path | Auth | Purpose |
|---|---|---|
| GET /healthz | None | DB ping + bindings + active calibration + denylist status |
| GET /healthz/deep | None | Same + a live env.AI ping (latency + reachability) |
| POST /generate-async | X-Auth-Token + CF Access | The live client path. Enqueues a generation job and returns {job_id, status:'queued', request_id} at HTTP 202 in <1s; the worker runs the full pipeline in the background (Cloudflare Queue consumer, or ctx.waitUntil() fallback) so the client never hits CF's 30s wall-clock cap. Same body as /generate. |
| GET /job/:id | X-Auth-Token + CF Access | Owner-scoped poll of an async job (id = the v4 UUID job_id). Returns stage (queued → generating → regenerating) + terminal status (complete or failed) + created_at/started_at/completed_at; on complete, result is the full /generate response body inline (no second fetch). A poll of a dead/stale job reaps it to failed with a recovery message. |
| POST /generate | X-Auth-Token + CF Access | Legacy synchronous full pipeline—runs the whole thing under one request (superseded by generate-async + poll for the live client). Body: headline, primary_keywords, secondary_keywords, article, n, format/persona/platform/site, also_avoid, client_request_id |
| POST /regenerate | same | Alias for /generate (client passes a fresh nonce; idempotency guard ensures no double-billing) |
| POST /select | same | Keep variant. UNIQUE(generation_id, variant_text)—double-click safe. Returns idempotent: true on duplicate; re_kept: true when toggling on a previously-soft-deleted row (clears unkept_at). Backfills offered + fingerprint rows if /generate's batch 2 had failed. |
| POST /unselect | same | Soft-delete kept variant. Body accepts either seed_fingerprint (preferred—soft-deletes ALL active rows for this user across generations) or generation_id (legacy). User-scoped—one operator can't unkeep another's row; returns kept_by_other_user noop instead. |
| POST /outcome | same | Link selection → published article + outcome metrics (PVs by surface, CTR, dwell, scroll, eCPM). UPSERT with COALESCE—partial enrichments accumulate. Rejects soft-deleted selections (409 ERR_SELECTION_UNKEPT). publish_date round-trip-validated against the parsed Date so impossible calendar dates are caught. |
| GET /seed/:fingerprint/kept | X-Auth-Token + CF Access | Cross-generation hydration. Returns active kept variants for a seed_fingerprint, joined across all generations sharing that fingerprint. Used by the client to paint Kept ✓ on cards after a regenerate. |
| GET /metrics/quota-mine | X-Auth-Token + CF Access | Caller's quota snapshot (read from CF Access email header) |
| GET /metrics/my-history?days=N | X-Auth-Token + CF Access | Per-day breakdown plus the 30 most recent generations for the calling user |
| GET /metrics/quota-by-user | X-Auth-Token + CF Access (or ?token=<WRITE_TOKEN> for admin scripts) | Admin: per-user breakdown for today + month + team budget |
| GET /metrics/cohort-counts | X-Auth-Token + CF Access | Selections by cohort: lifetime decisions AND currently-active. Plus wildcards offered/picked + ever-pick rate. |
| GET /metrics/diversity-rejection-rate?since= | X-Auth-Token + CF Access | Pipeline rejection-rate aggregates over a window. Validates ?since=. |
| GET /export/offered-candidates?since= | X-Auth-Token + CF Access | Every offered candidate (cohort + dropped + picked)—canonical cohort study export. Surfaces truncated:bool when LIMIT 5000 hits. |
| GET /export/wildcards-offered?since= | X-Auth-Token + CF Access | Legacy wildcards-only export (back-compat with older analyzers). |
| GET /export/selections?since=&active_only=&include_emails=&token= | X-Auth-Token + CF Access | Selections + fingerprints + outcomes joined. Default returns ALL decisions (including soft-deleted) for cohort counting; pass active_only=1 to filter to currently-kept. user_email is redacted to NULL by default; pass include_emails=1&token=<WRITE_TOKEN> for trusted callers. |
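
The enqueue-then-poll flow described for POST /generate-async and GET /job/:id can be sketched as a small client-side loop. This is a minimal sketch under stated assumptions, not the console's actual client: `fetch_job` is a hypothetical callable that performs the authenticated GET /job/:id request and returns the parsed JSON body, and the poll interval and deadline are illustrative values that the source does not specify.

```python
import time
from typing import Any, Callable, Dict

# Terminal statuses named in the GET /job/:id row.
TERMINAL = {"complete", "failed"}


def wait_for_job(
    fetch_job: Callable[[str], Dict[str, Any]],  # hypothetical GET /job/:id wrapper
    job_id: str,
    interval_s: float = 2.0,    # assumed poll interval
    timeout_s: float = 300.0,   # assumed overall deadline
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll until the job reaches a terminal status, then return the job body."""
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_job(job_id)
        if job.get("status") in TERMINAL:
            # On "complete" the result field carries the full /generate body inline.
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"job {job_id} still {job.get('status')!r} after {timeout_s}s"
            )
        sleep(interval_s)
```

Passing `fetch_job` and `sleep` in as parameters keeps the loop independent of any HTTP library and lets it be exercised without a live worker. A caller would still need to inspect the returned status, since a failed job returns normally rather than raising.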
Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| Pills show "?/?" forever on page load | /metrics/quota-mine fetch failed (worker down or Access misconfig) | Open browser devtools · Network · refresh; check /metrics/quota-mine response. If 401, CF Access policy doesn't cover the user's email. |
| 0 variants returned, no recovery panel | Pipeline error before regen loop (embed failure, validation failure) | Check response code/error fields. ERR_INJECTION = article body had a prompt-injection pattern; rephrase. ERR_VALIDATION = headline/keyword length issue. |
| Variants all cluster on one angle | Default mode (no article body); diversity gates inactive | Paste the full article body to switch to article-mode (pairwise diversity gates kick in). |
| persist_failed: true on response | D1 batch insert failed but variants were still returned | Variants are usable, but the cohort study won't capture this run. Copy a variant to the clipboard manually, check D1 status, and retry on the next generation. |
| Variants contain a colleague's name | DENYLIST_JSON secret unset or stale | wrangler secret put DENYLIST_JSON. /healthz shows denylist.configured. |
| 429 ERR_RATE_DAILY before lunch | Daily cap hit (10/day default) | Wait until 00:00 UTC OR (admin) reset the user's row in variant_user_quota_daily OR add the email to UNLIMITED_USER_EMAILS. |
| ERR_WEEKEND on a Saturday or Sunday | Tried on a weekend—the tool is weekday-only (Mon–Fri ET) for capped users | It's back online Monday. For urgent off-hours access, add the email to UNLIMITED_USER_EMAILS (unlimited users bypass the weekend gate). |
| 429 ERR_BUDGET | Team-wide budget exhausted ($70 default) | Edit BUDGET_MONTHLY_USD in wrangler.toml + redeploy. Or wait for the 1st of the month. |
| Wildcard never appears in results | Wildcard failed the cosine gate, the rule-based pre-check, or the accuracy safety net | Check the ?debug=1 response. If the wildcard's cosine is repeatedly >0.95, the seed is too unique to wildcardize (rare). If accuracy fails, the wildcard may be hallucinating; review it with an editorial eye. |
| Help bubbles render as empty boxes | CSS not loaded | Hard-refresh the page; check the cache-busting URL stamp on variant.css. |
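
For the 429 ERR_RATE_DAILY row, the wait until the 00:00 UTC reset is plain date arithmetic. The helper below is a hypothetical sketch of how a client might show a countdown; its name is illustrative, and it assumes the caller passes a timezone-aware datetime.

```python
from datetime import datetime, timedelta, timezone


def seconds_until_daily_reset(now: datetime) -> int:
    """Whole seconds from `now` (timezone-aware) until the next 00:00 UTC."""
    now_utc = now.astimezone(timezone.utc)
    # Step to tomorrow's date, then truncate to midnight.
    next_midnight = (now_utc + timedelta(days=1)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    return int((next_midnight - now_utc).total_seconds())
```

Converting to UTC first matters for operators working in ET: a cap hit before lunch in New York resets in the evening local time, not at local midnight.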
Variant-specific glossary
Cohort
Either rule_following (n−1 publishable variants per call, full grader floor) or wildcard (1 off-pattern variant per call, hard rules + accuracy safety-net only). Stored on variant_offered_candidates.cohort.
Cluster key
Compact structural fingerprint computed from formula × leading_pos × first 3 POS tokens (e.g. 'Possessive | noun | NN-NN-VB'). Used as the cohort study's primary GROUP BY for surfacing recurring wildcard structures.
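
As an illustration, the key can be assembled from its three parts and then counted per group. This is a sketch only: the separators are inferred from the single example above, and the function names and input shapes are assumptions rather than the pipeline's own code.

```python
from collections import Counter
from typing import Iterable, Sequence, Tuple


def cluster_key(formula: str, leading_pos: str, pos_tags: Sequence[str]) -> str:
    """Join formula, leading POS, and the first 3 POS tokens into one key."""
    first_three = "-".join(pos_tags[:3])
    return f"{formula} | {leading_pos} | {first_three}"


def count_by_cluster(
    rows: Iterable[Tuple[str, str, Sequence[str]]]
) -> Counter:
    """In-memory analogue of GROUP BY cluster_key with a row count."""
    return Counter(cluster_key(f, lp, tags) for f, lp, tags in rows)
```

Only the first three POS tokens enter the key, so headlines that diverge later in the sentence still fall into the same cluster.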
Calibration
A bundled snapshot of every numeric pipeline knob (score_floor · cosine gates · pairwise gates · char range · regen passes). Stored as a row in variant_calibrations; the active row's id stamps onto every generation.
Prompt hash
SHA-256 hex of a system prompt template at the moment of dispatch. Stored in variant_prompts with the full template text. Lets you A/B prompt revisions without losing history.
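
A minimal sketch of the hash itself, assuming the template is encoded as UTF-8 before hashing (the source does not state the encoding); the function name is illustrative.

```python
import hashlib


def prompt_hash(template: str) -> str:
    """SHA-256 hex digest of the template text (UTF-8 encoding assumed)."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()
```

Any change to the template, including whitespace, produces a different 64-character digest, which is what lets each prompt revision keep its own row.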
Article-mode
Pipeline regime when the operator pastes the article body. Activates pairwise diversity gates between rule-following variants (cosine ≤ 0.85, Jaccard ≤ 0.45) and switches the LLM grader's accuracy criterion to "judge against the article" (more permissive). Mode pill in the results header tells the operator which regime is active.
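
The pairwise gates can be pictured as a check over every pair of rule-following variants. The sketch below uses the two thresholds quoted above, but everything else is an assumption: embeddings are supplied by the caller, Jaccard is computed over lowercased whitespace tokens, and a pair is flagged when it breaches either gate. The real pipeline may tokenize or combine the gates differently.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple

COSINE_MAX = 0.85   # threshold from the Article-mode entry
JACCARD_MAX = 0.45  # threshold from the Article-mode entry


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def jaccard(a: str, b: str) -> float:
    # Assumed tokenization: lowercased, whitespace-split word sets.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def too_similar_pairs(
    variants: List[Tuple[str, Sequence[float]]]
) -> List[Tuple[str, str]]:
    """Return headline pairs that breach either gate (assumed OR semantics)."""
    flagged = []
    for (text_a, emb_a), (text_b, emb_b) in combinations(variants, 2):
        if cosine(emb_a, emb_b) > COSINE_MAX or jaccard(text_a, text_b) > JACCARD_MAX:
            flagged.append((text_a, text_b))
    return flagged
```

Checking both a semantic measure (cosine over embeddings) and a lexical one (Jaccard over tokens) catches variants that reuse the same angle with different words as well as those that reuse the same words.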
Recovery advice
Specific actionable guidance appended to /generate responses when the pool is short of N. Surfaced in the UI as an amber-bordered panel with concrete advice ("move keyword X to secondary," "try without an article body," etc.) and a regenerate button.
Idempotency replay
If client_request_id matches a prior generation by the same email within 5 minutes, the worker returns the prior generation_id without re-running the pipeline. Prevents double-billing on transient client retries.
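
The replay rule can be sketched as a lookup keyed on (email, client_request_id) with a 5-minute window. This is an in-memory illustration under stated assumptions; the class and method names are hypothetical, and the worker presumably keeps this state in its database rather than in process memory.

```python
import time
from typing import Callable, Dict, Optional, Tuple

REPLAY_WINDOW_S = 5 * 60  # the 5-minute window from the entry above


class ReplayGuard:
    """Hypothetical in-memory stand-in for the worker's idempotency check."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._seen: Dict[Tuple[str, str], Tuple[float, str]] = {}

    def lookup(self, email: str, client_request_id: str) -> Optional[str]:
        """Return the prior generation_id if this request is a replay."""
        entry = self._seen.get((email, client_request_id))
        if entry is None:
            return None
        created_at, generation_id = entry
        if self._clock() - created_at <= REPLAY_WINDOW_S:
            return generation_id
        return None  # outside the window: treat as a new request

    def record(self, email: str, client_request_id: str, generation_id: str) -> None:
        self._seen[(email, client_request_id)] = (self._clock(), generation_id)
```

Keying on the email as well as the request id means two users who happen to send the same client_request_id never see each other's generations.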
Unlimited-users exception
Comma-separated emails in the UNLIMITED_USER_EMAILS env var. Listed users bypass all daily/monthly/team-budget caps. Calls still record into the counter tables for transparency. Pierce is on this list (he funds the underlying API).
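
A minimal sketch of reading that variable, assuming entries are trimmed and compared case-insensitively (the source does not say how the worker normalizes emails); the function names are illustrative.

```python
from typing import FrozenSet


def parse_unlimited_users(raw: str) -> FrozenSet[str]:
    """Split the comma-separated env value into a normalized set of emails."""
    return frozenset(part.strip().lower() for part in raw.split(",") if part.strip())


def is_unlimited(email: str, unlimited: FrozenSet[str]) -> bool:
    """True when the caller should bypass the caps (usage is still recorded)."""
    return email.strip().lower() in unlimited
```

Dropping empty entries keeps a trailing comma or an unset variable from accidentally matching a blank email.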