Benchmarking Web Search in LLMs: OpenAI vs Gemini vs Groq

Benchmarking Web Search in LLMs: OpenAI vs Gemini vs Groq

Testing GPT-4o, Gemini 2.5 Flash, and Groq Compound across three query types — real-time retrieval, fact extraction, and multi-hop reasoning.

Introduction

Modern LLMs are no longer limited to static training data. Web search integration allows them to retrieve real-time information, reduce hallucinations, and stay current beyond their training cutoff.

But this introduces real engineering questions:

  • How does query complexity affect token usage and cost?
  • Does search architecture (tool-call vs. grounding vs. agent routing) change the economics?
  • Which model handles different query types best?
  • What is the latency tradeoff at each complexity level?

This benchmark tests three LLM systems across three distinct query types to answer these questions:

| Model | Search Mechanism | |---|---| | OpenAI GPT-4o | Explicit tool call (web_search) | | Gemini 2.5 Flash | Infrastructure-level Google Search grounding | | Groq Compound | Agent-based multi-model routing |

Note: This benchmark compares three systems with fundamentally different search architectures — tool-based invocation (OpenAI GPT-4o), infrastructure-level grounding (Gemini 2.5 Flash), and agent-based routing (Groq Compound). As a result, metrics such as token usage, cost, and latency are not strictly equivalent across systems. This analysis should be interpreted as a comparison of architectural approaches to search-enabled LLM systems, rather than a direct model-to-model ranking.


Query Design

Three queries were designed to test LLMs across all major search techniques:

queries = [
    {
        "id": 1,
        "type": "search_required",
        "category": "real_time",
        "query": "What are the latest updates about NVIDIA in 2026? List the most recent facts about the Nvidia and use the websearch tool to gather more context about it."
    },
    {
        "id": 2,
        "type": "fact_extraction",
        "category": "real_time",
        "query": "List the 3 most recent announcements made by NVIDIA in 2026 with dates."
    },
    {
        "id": 3,
        "type": "multi_hop_search",
        "category": "real_time",
        "query": "What recent developments by NVIDIA in 2026 are related to AI chips, and how do they compare to competitors?"
    }
]

| Query | Type | What it tests | |---|---|---| | Q1 | Search Required | Broad retrieval — can the model proactively invoke search? | | Q2 | Fact Extraction | Precision retrieval — structured facts with dates | | Q3 | Multi-hop Search | Cross-domain reasoning — retrieve + compare across entities |


Implementation

OpenAI (GPT-4o)

Uses the responses.create endpoint with an explicit web_search tool declaration. GPT-4o autonomously decides when to invoke the tool, fetches results, and injects them into its context as additional input tokens.

Architecture note: Retrieved web content is injected directly into the prompt context. This is why input tokens are high (17K–31.7K) — every retrieved document becomes part of the visible prompt, which is both transparent and expensive.

Gemini (2.5 Flash)

Uses GenerateContentConfig with GoogleSearch() grounding. Unlike OpenAI, retrieved content is not passed through the tokenized prompt — it is fused at Google's infrastructure level before reaching the model.

Architecture note: Input tokens stay near zero (20–36 tokens) regardless of query complexity because web content bypasses the tokenized prompt entirely. Gemini 2.5 Flash also generates internal reasoning tokens before producing output — these are not included in output billing but do contribute to latency.

Groq (Compound)

Uses groq/compound — an agent-based router that internally dispatches to multiple models. For these queries, Groq used llama-3.3-70b-versatile for planning/routing and openai/gpt-oss-120b for generation. Web search is a built-in tool — no explicit declaration needed.

Architecture note: The usage_breakdown field in the response exposes per-model token consumption. Unlike OpenAI, internal routing steps are not visible in LangSmith traces — the system is a black box at the step level.


Results

Latency

| Query | Type | OpenAI GPT-4o | Gemini 2.5 Flash | Groq Compound | |---|---|---|---|---| | Q1 | Search Required | 11.92s | 21.01s | 10.53s | | Q2 | Fact Extraction | 7.74s | 5.03s | 14.03s | | Q3 | Multi-hop Search | 16.41s | 15.91s | 11.41s | | Average | | 12.02s | 13.98s | 11.99s |

Key observations:

  • Groq leads on Q1 and Q3 but is slowest on Q2 (fact extraction). Agent routing overhead is costly for simple queries that don't benefit from parallelism.
  • Gemini is fastest on Q2 (5.03s) — its grounding architecture handles focused retrieval efficiently — but slowest on Q1 (21.01s), likely due to extended reasoning token generation.
  • OpenAI is most consistent across query types (7.74s–16.41s range vs. Gemini's 5.03s–21.01s).

Token Usage

OpenAI GPT-4o

| Query | Type | Input Tokens | Output Tokens | Total Tokens | |---|---|---|---|---| | Q1 | Search Required | 17,000 | 1,200 | 18,200 | | Q2 | Fact Extraction | 17,200 | 458 | 17,658 | | Q3 | Multi-hop Search | 31,700 | 1,800 | 33,500 |

Q3 spike: Multi-hop search caused OpenAI to retrieve results from multiple sources (NVIDIA + competitors), nearly doubling input tokens to 31.7K. Every retrieved document is injected into the prompt — so query breadth directly inflates input cost.

Gemini 2.5 Flash

| Query | Type | Input Tokens | Output Tokens | Reasoning Tokens | Total | |---|---|---|---|---|---| | Q1 | Search Required | 36 | 906 | 1,800 | 2,742 | | Q2 | Fact Extraction | 20 | 305 | 340 | 665 | | Q3 | Multi-hop Search | 26 | 2,100 | 212 | 2,338 |

Reasoning tokens vary based on how the model allocates internal computation. Q1 uses 1,800 reasoning tokens to produce 906 output tokens. Q3 uses only 212 reasoning tokens but produces 2,100 output tokens — the model front-loads effort differently based on query structure.

Input tokens stay flat: 20–36 tokens regardless of query type or retrieved content volume. This is the core advantage of infrastructure-level grounding.

Groq Compound

| Query | Type | Input Tokens | Output Tokens | Total Tokens | |---|---|---|---|---| | Q1 | Search Required | 6,500 | 2,300 | 8,800 | | Q2 | Fact Extraction | 38,500 | 2,200 | 40,700 | | Q3 | Multi-hop Search | 17,800 | 2,900 | 20,700 |

Q2 anomaly — highest tokens for simplest query: Groq Compound used 38.5K input tokens for the fact extraction query — more than any other model on any query. This suggests the agent retrieved extensive context before distilling 3 facts, a sign of inefficient context management for precision tasks.


Cost Breakdown

OpenAI GPT-4o

(Pricing: $2.50/1M input, $10.00/1M output)

| Query | Type | Input Cost | Output Cost | Total Cost | |---|---|---|---|---| | Q1 | Search Required | $0.0426 | $0.0118 | $0.0543 | | Q2 | Fact Extraction | $0.0430 | $0.0046 | $0.0475 | | Q3 | Multi-hop Search | $0.0793 | $0.0179 | $0.0972 | | Total | | $0.1649 | $0.0343 | $0.1990 |

Gemini 2.5 Flash

($0.15/1M input, ~$3.50/1M output, reasoning tokens at output rate)

| Query | Type | Input Cost | Output Cost | Reasoning Cost | Total Cost | |---|---|---|---|---|---| | Q1 | Search Required | $0.0001 | $0.0032 | $0.0063 | $0.0045 | | Q2 | Fact Extraction | $0.0001 | $0.0011 | $0.0012 | $0.0009 | | Q3 | Multi-hop Search | $0.0001 | $0.0074 | $0.0007 | $0.0053 | | Total | | $0.0003 | $0.0117 | $0.0082 | $0.0107 |

Groq Compound

(GPT-OSS-120B: $0.15/1M input, $0.60/1M output. Web Search: $5.00/1,000 requests)

| Query | Type | Input Cost | Output Cost | Search Fee | Total Cost | |---|---|---|---|---|---| | Q1 | Search Required | $0.000975 | $0.001380 | $0.005000 | $0.0074 | | Q2 | Fact Extraction | $0.005775 | $0.001320 | $0.005000 | $0.0121 | | Q3 | Multi-hop Search | $0.002670 | $0.001740 | $0.005000 | $0.0094 | | Total | | $0.009420 | $0.004440 | $0.015000 | $0.0289 |

The $0.005 search fee accounts for 52–68% of Groq's total cost per query. At scale, search request volume — not token count — is the dominant cost driver for Groq Compound.

Full Cost Comparison

| Query | Type | OpenAI GPT-4o | Gemini 2.5 Flash | Groq Compound | |---|---|---|---|---| | Q1 | Search Required | $0.0543 | $0.0045 | $0.0074 | | Q2 | Fact Extraction | $0.0475 | $0.0009 | $0.0121 | | Q3 | Multi-hop Search | $0.0972 | $0.0053 | $0.0094 | | Total (3 queries) | | $0.1990 | $0.0107 | $0.0289 | | Cost ratio vs Gemini | | 18.6× | | 2.7× |


Cross-Model Analysis

How Query Complexity Scales Cost

  • OpenAI scales linearly with retrieved content: Going from Q2 to Q3, input tokens jump from 17.2K to 31.7K — an 84% increase driven by retrieving competitor data alongside NVIDIA data. Cost follows the same curve: Q3 costs 2× Q2.
  • Gemini stays nearly flat: Input tokens across all three queries: 36 → 20 → 26. Output tokens vary but costs remain compressed. The only meaningful variance is reasoning tokens, which Gemini allocates adaptively.
  • Groq is unpredictable: Q2 (the simplest query) produced the highest token count (40.7K) and highest cost ($0.0121) across all Groq queries. The agent appears to over-retrieve for precision tasks. Q1 and Q3 were more efficient despite being more complex.

Latency vs. Cost Efficiency

| Model | Average Latency | Total Cost (3 queries) | Cost per Second | |---|---|---|---| | OpenAI GPT-4o | 12.02s | $0.1990 | $0.0166/s | | Gemini 2.5 Flash | 13.98s | $0.0107 | $0.0008/s | | Groq Compound | 11.99s | $0.0289 | $0.0024/s |

Gemini is not only the cheapest — it delivers the lowest cost-per-second of inference. OpenAI costs 20× more per second of latency than Gemini for the same task set.


Observations

OpenAI GPT-4o

Strengths:

  • Most consistent latency across query types (7.74s–16.41s)
  • Full observability — every tool call, retrieved document, and generation step is visible in LangSmith
  • High output quality and well-structured responses

Weaknesses:

  • Most expensive by a significant margin ($0.199 total vs $0.011 for Gemini)
  • Input token costs scale aggressively with multi-hop queries (Q3: 31.7K input tokens)
  • Search fee is bundled, not transparent — you can't see what you paid per search call

Best for: Production systems where observability, debuggability, and consistent behavior matter more than cost optimization.

Gemini 2.5 Flash

Strengths:

  • Dramatically lowest cost across all three query types ($0.0107 total)
  • Input tokens stay near-zero regardless of query breadth — infrastructure grounding is genuinely architecture-level
  • Reasoning tokens allow adaptive compute: more thinking for complex retrieval, less for simple facts
  • Fastest on focused fact extraction (Q2: 5.03s)

Weaknesses:

  • Slowest on broad retrieval (Q1: 21.01s) — extended reasoning adds latency
  • Less transparent: retrieved content and reasoning steps are not observable
  • Reasoning token billing adds a cost layer that's easy to overlook

Best for: High-volume search workflows, cost-sensitive production APIs, and applications where query breadth varies significantly.

Groq Compound

Strengths:

  • Fastest average latency (11.99s) with best performance on broad retrieval (Q1: 10.53s, Q3: 11.41s)
  • Agent routing handles complex multi-hop queries efficiently
  • Output tokens are consistently high (2,200–2,900) — responses are detailed

Weaknesses:

  • Unpredictable token scaling: Q2 used 40.7K tokens — more than Q3 despite being simpler
  • Limited observability: internal routing steps are opaque
  • Search fee dominates cost structure ($0.005 per request = 52–68% of total cost)
  • Encountered 413 request size errors during broader testing

Best for: Low-latency applications, real-time systems, and complex multi-hop queries where speed is the primary constraint.


Key Engineering Insights

1. Search architecture is the biggest cost determinant — not model size

Gemini 2.5 Flash is not cheaper by specification — it's cheaper because retrieved web content never enters the tokenized prompt. OpenAI's transparency (full context injection) costs 18.6× more for the same task set.

2. Query type exposes architectural weaknesses

Groq's Q2 anomaly (40.7K tokens for the simplest query) reveals that agent-based routing isn't always smarter — it can over-retrieve for precision tasks that would benefit from targeted lookup. Gemini's Q1 slowness (21.01s) shows that extended reasoning, while thorough, isn't always necessary.

3. The cheapest model isn't always the fastest

Gemini is both cheapest and second-fastest on average. But Groq is faster on complex queries (Q1, Q3) while being more expensive. The cost-latency tradeoff isn't linear — architecture matters more than pricing tier.

4. Multi-hop queries punish prompt-injection architectures most

OpenAI's Q3 cost ($0.0972) is nearly 2× its Q1 cost ($0.0543) because multi-hop retrieval pulls content from multiple entities (NVIDIA + AMD + Intel + others), all injected as input tokens. Gemini's Q3 cost ($0.0053) is essentially unaffected by query breadth.

5. Groq's economics flip at scale

At low volume, Groq is cheap. But the flat $0.005 search fee per request means that at 100,000 queries/day, search fees alone cost $500/day — independent of token usage. OpenAI's search is bundled; Gemini's grounding is priced per token (near-zero). Groq's cost structure changes significantly at scale.


Challenges and Limitations

| Challenge | Affected Model | Description | |---|---|---| | Request size errors (413) | Groq | Payload limits exceeded during some test runs | | Opaque agent routing | Groq | Internal model dispatch steps not inspectable | | Reasoning token ambiguity | Gemini | Unclear how reasoning tokens scale with adversarial queries | | Bundled search pricing | OpenAI | Per-search cost not exposed — can't optimize search call frequency | | Search invocation inconsistency | All | Identical queries don't always trigger search |


Final Comparison

| Factor | OpenAI GPT-4o | Gemini 2.5 Flash | Groq Compound | |---|---|---|---| | Avg. Latency | 12.02s | 13.98s | 11.99s ✓ | | Total Cost (3 queries) | $0.1990 | $0.0107 ✓ | $0.0289 | | Cost Scaling with Complexity | Poor (linear) | Flat ✓ | Unpredictable | | Observability | High ✓ | Low | Medium | | Consistency Across Query Types | High ✓ | Medium | Low | | Multi-hop Performance | Medium | Medium | High ✓ | | Fact Extraction Performance | High | High ✓ | Low | | Search Architecture | Prompt injection | Infrastructure grounding | Agent routing |


Conclusion

Running the same three queries across three different architectures surfaces a clear pattern: search mechanism matters more than model capability.

OpenAI GPT-4o pays a transparency tax. Every retrieved document in your prompt is a token you pay for. The upside is full debuggability and consistent behavior. For systems where you need to audit exactly what the model saw, this tradeoff is worth it.

Gemini 2.5 Flash wins on economics by a wide margin. Infrastructure-level grounding decouples retrieval cost from token billing. The 18.6× cost advantage over OpenAI is structural — it won't shrink as you scale up query volume. The main cost is reduced transparency.

Groq Compound is the right tool for speed-sensitive applications, but its cost structure is counterintuitive. The flat search fee means simple queries can cost more than complex ones (as seen in Q2). At high volume, search request count — not tokens — becomes your primary cost.


Recommendations

| Use Case | Recommended Model | Reason | |---|---|---| | Production APIs requiring full audit trails | OpenAI GPT-4o | Complete LangSmith observability | | High-volume search workflows (100K+ queries/day) | Gemini 2.5 Flash | Flat input cost regardless of retrieved content | | Real-time applications (latency < 12s) | Groq Compound | Fastest on complex multi-hop queries | | Precision fact extraction at scale | Gemini 2.5 Flash | Efficient on focused queries, lowest cost | | Multi-hop competitive analysis | Groq Compound | Agent routing handles cross-entity retrieval well | | Cost-sensitive experimentation | Gemini 2.5 Flash | 18.6× cheaper than OpenAI across equivalent tasks |


The best LLM for web search is not the most powerful — it's the one whose search architecture matches your cost, latency, and observability requirements.

Loading...
Newly listed items on Resellpur
Loading...
Recommended items on Resellpur
Download Resellpur app from Play Store – Buy and sell second-hand mobiles, books, and clothes securely