Strands vs LangGraph: Benchmark results didn’t match my expectations — sanity check? #1391
I wanted to sanity-check some results I got while comparing Strands with LangGraph. I ran a small, fair, apples-to-apples benchmark using the same set of questions (simple → complex) and the same evaluation criteria.

What I observed:

- LangGraph ran with significantly lower latency and used roughly 25–30× fewer tokens than Strands on the same tasks.
- Strands was noticeably better at first-pass correctness on the complex questions.

This made me wonder whether Strands is generally considered “better” mainly because of validation, auditability, and structured execution, rather than raw latency or token efficiency.

I’d love to understand:

- whether a latency/token gap of this size is expected given the two architectures, and
- whether there are recommended ways to configure Strands to be more token-efficient.

I’m genuinely trying to learn here and make sure my assumptions and setup are sound. Any insight would be appreciated!
Your benchmark results are an accurate snapshot of the architectural trade-offs between these two frameworks. The core difference is explicit orchestration (LangGraph) versus autonomous reasoning (Strands).

LangGraph operates primarily as a state machine: you, the developer, define the logic, nodes, and transitions. Because the path is largely pre-determined, the LLM doesn't have to "think" about what to do next at every step; it only processes the task assigned to that specific node. This lack of overhead is exactly why you are seeing significantly lower latency and token usage: you aren't paying the "reasoning tax" that comes with an agent constantly self-evaluating its next move.

Strands, on the other hand, is built for autonomy and uses a model-driven reasoning loop (often following the ReAct pattern). In this setup, the LLM is responsible for planning, executing tools, observing outputs, and then re-planning. This explains the 25–30× token difference: every time the agent "thinks" or loops back to refine an answer, it re-processes the prompt and tool history. As you observed, this is what makes Strands better at first-pass correctness on complex tasks: it is natively designed to self-correct and audit its own work, whereas LangGraph only self-corrects if you have explicitly built a "retry" or "validation" edge into your graph (see the first sketch below).

To make your Strands configuration more efficient, look into aggressive context management. Since Strands passes the full history of tool calls back and forth, the token count can snowball quickly. Pruning old tool outputs, or using a smaller "haiku-class" model for simple execution steps while reserving larger models for planning, can close much of the gap (see the second sketch below).

Ultimately, you've correctly identified that these tools are optimized for different goals: LangGraph is a builder's tool meant for predictable, high-speed production flows, while Strands is a reasoning engine meant for high-stakes accuracy, where the cost of a token matters less than the quality of the refinement.
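To make the "validation edge" point concrete, here is a minimal sketch using LangGraph's Python `StateGraph` API. The `generate` stub and the `looks_good` check are placeholders I made up for illustration; in a real graph they would call your model and a real validator:

```python
from typing import TypedDict

from langgraph.graph import END, StateGraph


class State(TypedDict):
    question: str
    draft: str
    attempts: int


def generate(state: State) -> dict:
    # Placeholder node: call your LLM here and return the updated fields.
    draft = f"draft answer to: {state['question']}"
    return {"draft": draft, "attempts": state["attempts"] + 1}


def looks_good(draft: str) -> bool:
    # Placeholder validation: swap in a schema check, a judge model, etc.
    return len(draft) > 20


def route(state: State) -> str:
    # Explicit self-correction: loop back until valid or out of attempts.
    if looks_good(state["draft"]) or state["attempts"] >= 3:
        return "done"
    return "retry"


graph = StateGraph(State)
graph.add_node("generate", generate)
graph.set_entry_point("generate")
graph.add_conditional_edges("generate", route, {"retry": "generate", "done": END})
app = graph.compile()

result = app.invoke({"question": "What is 2 + 2?", "draft": "", "attempts": 0})
```

The retry loop here is the explicit, developer-owned equivalent of what Strands does implicitly on every turn: you only pay for re-generation when validation actually fails.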
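And here is a rough, framework-agnostic sketch of the context-management idea. The message shape (role/content dicts) and the model IDs are assumptions for illustration, not Strands' actual types; Strands ships its own conversation-manager hooks, so treat this as the underlying idea rather than its API:

```python
def prune_tool_outputs(messages: list[dict], keep_last: int = 2) -> list[dict]:
    # Replace all but the newest tool results with a short stub so the
    # history the agent re-processes each turn stops snowballing.
    tool_idx = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    stale = set(tool_idx[:-keep_last]) if keep_last else set(tool_idx)
    return [
        {**m, "content": "[tool output pruned]"} if i in stale else m
        for i, m in enumerate(messages)
    ]


def pick_model(step: str) -> str:
    # Tiered routing: reserve the large model for planning turns and send
    # simple execution steps to a cheap "haiku-class" model.
    if step in {"plan", "replan", "final_answer"}:
        return "large-planner-model"    # hypothetical model ID
    return "small-haiku-class-model"    # hypothetical model ID
```

Pruning the history before each model call keeps the per-turn input roughly constant instead of growing with every tool invocation, which is where most of the 25–30× multiplier comes from.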