Minnesota Star Tribune · NLP · Audience Insights

What Kind of Story Is This?

Building an NLP Classification System for the Newsroom. At the Star Tribune, better audience insights start with a better understanding of what we publish — so we built a system to describe it systematically.

Role Data Scientist
Methods BERTopic · spaCy · LLM · Prodigy Annotation
Stack Python · MongoDB · Snowflake · SQL

Starting with Editors

Before writing a line of code, we talked to reporters and editors. What, in their experience, moves the needle with readers? The answer came from every direction: the topic, yes — but also the format. An investigation into city government lands differently than a quick-hit news brief on the same subject. Both matter. Neither tells the whole story alone.

That conversation shaped two parallel modeling efforts: one to classify what a story is about, the other to classify how it's told.

Classifying Topics with BERTopic

For content topics, we used BERTopic — a modern approach to topic modeling that groups articles by semantic similarity rather than keyword overlap. Articles were pulled from MongoDB and routed through an ingestion pipeline into Snowflake for processing.

We trained multiple BERTopic variants across a range of embedding strategies, dimensionality reduction techniques, and clustering hyperparameters. Once clusters formed, we used an LLM to sample stories from each and assign semantically meaningful topic names — a step that turned opaque cluster IDs into labels editors could actually use.
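A sweep like this can be sketched as a small configuration grid. The embedding model names and hyperparameter values below are illustrative assumptions, not the ones actually used; the BERTopic fit itself is wrapped in a function with local imports so the grid logic stands alone.

```python
from itertools import product

# Hypothetical search space -- the actual values are assumptions for illustration.
EMBEDDING_MODELS = ["all-MiniLM-L6-v2", "all-mpnet-base-v2"]
UMAP_NEIGHBORS = [10, 15, 30]        # dimensionality-reduction granularity
MIN_CLUSTER_SIZES = [20, 50, 100]    # HDBSCAN cluster-size floor

def variant_grid():
    """Enumerate one BERTopic configuration per combination of settings."""
    for emb, nn, mcs in product(EMBEDDING_MODELS, UMAP_NEIGHBORS, MIN_CLUSTER_SIZES):
        yield {"embedding_model": emb, "n_neighbors": nn, "min_cluster_size": mcs}

def train_variant(docs, cfg):
    """Fit one BERTopic variant (imports kept local so the grid above
    works without the heavy modeling dependencies installed)."""
    from bertopic import BERTopic
    from umap import UMAP
    from hdbscan import HDBSCAN

    topic_model = BERTopic(
        embedding_model=cfg["embedding_model"],
        umap_model=UMAP(n_neighbors=cfg["n_neighbors"]),
        hdbscan_model=HDBSCAN(min_cluster_size=cfg["min_cluster_size"]),
    )
    topics, _ = topic_model.fit_transform(docs)
    return topic_model, topics
```

With two embedding models and three values for each of the two clustering knobs, this particular grid yields 18 candidate variants.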

Model selection relied on two complementary scores: coherence (are stories within a topic genuinely similar?) and diversity (are topics distinct from one another?). Finalists were then evaluated against a hand-labeled set of 100 articles, tested on exact and marginal correctness.
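The two scores can be sketched simply. Topic diversity has a standard closed form (the fraction of unique words across every topic's top-word list); coherence is shown here via gensim's c_v measure, one common choice, with the import kept local. Whether these were the exact metric implementations used is an assumption.

```python
def topic_diversity(topics_top_words):
    """Fraction of unique words across all topics' top-word lists (0..1).
    Higher means topics overlap less with one another."""
    all_words = [w for topic in topics_top_words for w in topic]
    return len(set(all_words)) / len(all_words)

def topic_coherence(topics_top_words, tokenized_docs):
    """c_v coherence via gensim: are the top words of each topic found
    together in the underlying documents? (Local import: heavy dependency.)"""
    from gensim.corpora import Dictionary
    from gensim.models import CoherenceModel

    dictionary = Dictionary(tokenized_docs)
    cm = CoherenceModel(topics=topics_top_words, texts=tokenized_docs,
                        dictionary=dictionary, coherence="c_v")
    return cm.get_coherence()
```

A model whose topics share many top words scores low on diversity even if each topic is internally coherent, which is why the two scores are complementary.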

Scatter plot of BERTopic model candidates plotted by coherence score on the x-axis and diversity score on the y-axis, with three well-balanced candidates highlighted in green
Each point is a trained BERTopic variant. The three green candidates sit in the upper-right — high on both coherence and diversity — and advanced to human evaluation.
Side-by-side table comparing topic labels generated by three model phases, showing how topic names evolved across iterations
Topic labels refined across three model phases. LLM-generated names grew more precise with each iteration as embedding and clustering parameters were tuned.

The final model surfaced a rich map of the Star Tribune's editorial terrain — from hyperlocal coverage like Minnesota House Elections to sports and cultural beats like Minnesota Wild hockey and Music Festivals.

Bubble chart showing the distribution of story topics, where each circle represents a topic cluster, sized by number of stories and positioned by semantic relatedness
The final topic space. Each bubble is a cluster; size reflects story volume and proximity reflects semantic relatedness. Sports topics cluster together on the right, politics and policy on the left.

Classifying Story Archetypes

Topic alone doesn't explain performance. A feature story and an inverted pyramid on the same subject are fundamentally different products. To capture that, we worked with the news team to define four story archetypes.

To generate training data, we used Prodigy, an annotation tool for unstructured text. Journalists and editors labeled hundreds of articles in a working lunch session, sorting stories into the four buckets.

Hand-drawn circular diagram showing the four-stage ML workflow: annotate data, train NLP model, automate predictions, validate model
The archetype classifier's iterative workflow: human-labeled data trains the model, predictions are automated at scale, then spot-checked by editors to catch drift before the next training round.

With that labeled dataset in hand, we trained a multi-class deep learning classifier using spaCy, optimized on macro F1 score to ensure balanced performance across all four categories — including the rarest ones.
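Macro F1 is the unweighted mean of per-class F1, so a rare archetype counts exactly as much as a common one. A minimal, dependency-free sketch of the metric:

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores: rare archetypes weigh
    as much as common ones in the overall number."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)
```

Optimizing this rather than plain accuracy stops the classifier from winning by simply predicting the most common archetype every time.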

Editors sorted hundreds of stories in a single lunch. The annotations they produced in an afternoon became the foundation for a classifier that now labels thousands of articles a month.

What We Learned

With both models running, we could cross-tabulate every article by its topic and archetype — and then model expected subscriber acquisition for each combination using regression.
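One way to set this up, sketched below with synthetic stand-in data rather than the real Snowflake tables, is a least-squares fit on topic × archetype interaction dummies. The function name and input shape are assumptions for illustration.

```python
import numpy as np

def expected_conversions(rows):
    """Fit least squares on topic x archetype interaction dummies and
    return the fitted expected conversions per observed combination.
    `rows` is a list of (topic, archetype, conversions) tuples."""
    combos = sorted({(t, a) for t, a, _ in rows})
    index = {c: i for i, c in enumerate(combos)}

    # One indicator column per topic-archetype cell (cell-means coding).
    X = np.zeros((len(rows), len(combos)))
    y = np.empty(len(rows))
    for r, (t, a, conv) in enumerate(rows):
        X[r, index[(t, a)]] = 1.0
        y[r] = conv

    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return {c: float(beta[i]) for c, i in index.items()}
```

With one dummy per combination and no shared terms, the least-squares solution reduces to each cell's mean; a real model would likely add main effects or controls, but the cross-tabulated structure is the same.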

The results were instructive. Investigations and analytical pieces are, unsurprisingly, among the most powerful acquisition drivers. But the archetype effect isn't uniform. The best format depends heavily on the topic.

Dot plot showing expected number of new subscriber conversions for stories broken down by topic on the y-axis and archetype shown as color-coded dots, revealing that investigative pieces on Minnesota politics are the strongest acquisition driver
Expected subscriber conversions by topic and archetype. For Minnesota Elections and Criminal Cases, investigative pieces (blue) dominate. For Housing and Urban Development, features (yellow) outperform — format advantage varies by beat.

For housing coverage, for example, features — not investigations — were the strongest subscriber acquisition lever. That kind of finding gives editors something concrete: not just what to cover, but how to cover it.