Better together

How The New York Times is using generative AI as a reporting tool

LLMs help reporters transcribe and sort through hundreds of hours of leaked audio.

Kyle Orland
Artist's conception of a New York Times reporter consulting with an LLM for document research. This is exactly what it looked like in real life, shut up. Credit: Getty Images

The rise of powerful generative AI models in the last few years has led to plenty of stories of corporations trying to use AI to replace human jobs. But a recent New York Times story highlights the other side of that coin, where AI models simply become a powerful tool aiding in work that still requires humanity's unique skillset.

The NYT piece in question isn't directly about AI at all. As the headline "Inside the Movement Behind Trump’s Election Lies" suggests, the article actually reports in detail on how the ostensibly non-partisan Election Integrity Network "has closely coordinated with the Trump-controlled Republican National Committee." The piece cites and shares recordings of group members complaining of "the left" rigging elections, talking of efforts to "put Democrats on the defensive," and urging listeners to help with Republican turnout operations.

To report the piece, the Times says it sifted through "over 400 hours of conversations" from weekly meetings by the Election Integrity Network over the last three years, as well as "additional documents and training materials." Going through a trove of information that large is a daunting prospect, even for the team of four bylined reporters credited on the piece. That's why the Times says in a note accompanying the piece that it "used artificial intelligence to help identify particularly salient moments" from the videos to report on.

Let a machine transcribe it all

The first step was using automated tools to transcribe the videos, resulting in a set of transcripts that "totaled almost five million words," the note says. This isn't exactly a bold new use of AI at this point—the Times itself was writing about Otter.ai's automated transcription tools back in 2019.

If your last experience with AI transcription is that old, though, you might not be aware of how much progress has been made in the quality and accuracy of machine transcription. Wirecutter's updated guide to automated transcription services notes that the best AI transcription service it tested in 2018 was only 73 percent accurate, while the least accurate service it tested in 2024 was 94 percent accurate. What's more, Wirecutter notes that the best current systems, like OpenAI's Whisper, "are somewhat more accurate than the least-precise human-powered transcriptions."
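For anyone curious what this step looks like in practice, here's a minimal sketch using the open-source Whisper package mentioned above. The file names are placeholders, and the Times hasn't said which transcription tool it actually used, so treat this as an illustration of the general workflow rather than the paper's method.

```python
# A minimal transcription sketch using the open-source Whisper package
# (pip install openai-whisper). File names are hypothetical placeholders.
import whisper

# Larger model sizes ("small", "medium", "large") trade speed for accuracy.
model = whisper.load_model("base")

# Transcribe one recording; transcribe() returns a dict with the full text
# plus timestamped segments.
result = model.transcribe("weekly_meeting_2023-06-01.mp3")

# Save the plain-text transcript for later searching and analysis.
with open("weekly_meeting_2023-06-01.txt", "w") as f:
    f.write(result["text"])
```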

If you don't have a 1960s secretary who can do your audio transcription for you, AI tools can now serve as a very good stand-in. Credit: Getty Images

This rapid advancement is definitely bad news for people who make a living transcribing spoken words. But for reporters like those at the Times—who can now transcribe hundreds of hours of audio quickly and accurately at a much lower cost—these AI systems are now just another important tool in the reporting toolbox.

Leave the analysis to us?

With the automated transcription done, the NYT reporters still faced the difficult task of reading through 5 million words of transcribed text to pick out relevant, reportable news. To do that, the team says it "employed several large-language models," which let them "search the transcripts for topics of interest, look for notable guests and identify recurring themes."
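The Times doesn't disclose which models or prompts it used, but a minimal sketch of that kind of transcript search might look something like this, assuming OpenAI's Python client and a placeholder model choice:

```python
# A hypothetical sketch of using an LLM to flag relevant transcript passages.
# Model name, prompts, and function are assumptions, not the Times' actual tooling.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def flag_relevant_passages(transcript_chunk: str, topic: str) -> str:
    """Ask the model to quote any passages in a transcript chunk that
    touch on a given topic of interest."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[
            {"role": "system",
             "content": "You review meeting transcripts for a news investigation. "
                        "Quote relevant passages verbatim and note the speaker if named."},
            {"role": "user",
             "content": f"Topic of interest: {topic}\n\nTranscript:\n{transcript_chunk}\n\n"
                        "List any passages that discuss this topic, quoting them exactly."},
        ],
    )
    return response.choices[0].message.content


# Five million words won't fit in a single request, so in practice the
# transcripts would be split into chunks and the flagged passages collected.
```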

Summarizing complex sets of documents and identifying themes has long been touted as one of the most practical uses for large language models. Last year, for instance, Anthropic hyped the expanded context window of its Claude model by showing off its ability to absorb the entire text of The Great Gatsby and "then interactively answer questions about it or analyze its meaning," as we put it at the time. More recently, I was wowed by Google's NotebookLM and its ability to form a cogent review of my Minesweeper book and craft an engaging spoken-word podcast based on it.

There are important limits to LLMs' text analysis capabilities, though. Earlier this year, for instance, an Australian government study found that Meta's Llama 2 was much worse than humans at summarizing public responses to a government inquiry committee.

Australian government evaluators found AI summaries were often "wordy and pointless—just repeating what was in the submission." Credit: Getty Images

In general, the report found that the AI summaries showed "a limited ability to analyze and summarize complex content requiring a deep understanding of context, subtle nuances, or implicit meaning." Even worse, the Llama summaries often "generated text that was grammatically correct, but on occasion factually inaccurate," highlighting the ever-present problem of confabulation inherent to these kinds of tools.

The LLM/human hybrid reporter

These limitations highlight why it's still important to have humans involved in the analysis process. The NYT notes that, after querying its LLMs to help identify "topics of interest" and "recurring themes," its reporters "then manually reviewed each passage and used our own judgment to determine the meaning and relevance of each clip... Every quote and video clip from the meetings in this article was checked against the original recording to ensure it was accurate, correctly represented the speaker’s meaning and fairly represented the context in which it was said."

By using a hybrid approach that involves both LLMs and human analysis, the Times is able to exploit the strengths and limit the weaknesses of both sides. The LLMs—with their ability to quickly digest and sort through mountains of information—provide an extremely useful first pass that picks out potentially relevant recordings for the reporters to analyze. Those reporters in turn provide an important check on the LLM's tendency to confabulate "factually inaccurate" information and help provide the "deep understanding of context, subtle nuances, or implicit meaning" that the Australian government found LLMs were generally unable to deliver.
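The Times doesn't describe its internal tooling, but as a rough illustration of that hybrid workflow, the flagging pass might hand its output to reporters as something as simple as a review sheet with timestamps pointing back to the original recordings. Everything in this sketch—the file names, fields, and sample row—is a hypothetical stand-in:

```python
# A hypothetical handoff step: LLM-flagged passages are written to a review
# sheet so a reporter can check each one against the original audio.
import csv

# Placeholder data standing in for whatever the flagging pass produced.
flagged_passages = [
    {"file": "2023-06-01_weekly_meeting.mp3", "start": "00:41:12",
     "end": "00:42:05", "excerpt": "(verbatim quote here)",
     "llm_note": "discusses turnout operations"},
]

with open("for_human_review.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=[
        "file", "start", "end", "excerpt", "llm_note",
        "verified_accurate", "context_fair",
    ])
    writer.writeheader()
    for row in flagged_passages:
        # The last two columns stay empty until a reporter has listened to the
        # clip and confirmed the quote and its context, per the Times' note.
        writer.writerow({**row, "verified_accurate": "", "context_fair": ""})
```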

Here, generative AI serves a role somewhat akin to a drug-sniffing dog or truffle-hunting pig, pointing out potentially interesting morsels for its human masters to consider. But the automated LLMs aren't as reliable as the animals in these roles, leaving it up to the humans to double-check whether what has been turned up is relevant and accurate.

That analogy probably isn't very comforting to the human transcribers and researchers who would have been needed for a massive reporting task like this in the past. Still, for the reporters who can now quickly automate large parts of this kind of research, generative AI is already proving to be another useful digital tool.

Kyle Orland, Senior Gaming Editor
Kyle Orland has been the Senior Gaming Editor at Ars Technica since 2012, writing primarily about the business, tech, and culture behind video games. He has journalism and computer science degrees from the University of Maryland. He once wrote a whole book about Minesweeper.