Benchmarks

Our retrieval is the differentiator.

Tableside is measured on the public multimodal and document RAG benchmarks, side by side with the systems most teams choose by default. The headline metric is Recall@5 in the retrieval-only setting, because retrieval is the differentiator. End-to-end accuracy is a secondary metric that shows whether better retrieval translates into better answers.
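For readers who want to check a retrieval number by hand, here is a minimal sketch of Recall@k and a binary-relevance NDCG@k. It is not the official scoring code of any benchmark; the function names, the example ids, and the choice of binary relevance are assumptions made for illustration, and it assumes each retrieved id appears at most once in the ranking.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], gold_ids: Set[str], k: int = 5) -> float:
    """Fraction of gold items found in the top-k results (0.0 if there are no gold items)."""
    if not gold_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & gold_ids) / len(gold_ids)


def ndcg_at_k(ranked_ids: Sequence[str], gold_ids: Set[str], k: int = 5) -> float:
    """Binary-relevance NDCG: discounted gain of the hits over the ideal ordering's gain."""
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, item_id in enumerate(ranked_ids[:k])
        if item_id in gold_ids
    )
    ideal_hits = min(len(gold_ids), k)
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical query: two gold images, only one of them retrieved (at rank 2).
ranked = ["img_7", "img_3", "img_9", "img_1", "img_4"]
gold = {"img_3", "img_8"}
print(recall_at_k(ranked, gold, k=5))  # 0.5: one of two gold items is in the top 5
print(ndcg_at_k(ranked, gold, k=5))
```

A benchmark's own scorer may define recall differently (for example as a per-query hit rate), so treat this only as a way to build intuition for what the columns mean.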

Methodology

Each row in the tables below carries a citation. Re-measured rows are runs we executed under the same conditions as our own system — same corpus, same scoring, same date — and link back to an MLflow run id you can browse for the full per-example breakdown. Paper-reported rows are numbers lifted from the published paper that introduced the benchmark; we cite the table and the date.

We follow the official scoring code for every benchmark. For MRAG-Bench that is eval/score.py: a regex extractor runs first, with a GPT fallback if the model output doesn’t parse. We report the fallback rate; above ~5%, the published number is suspect. Our v3 prompts bring the fallback rate to 0% on the smoke set.
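The regex-first, fallback-second flow and the fallback rate can be sketched as follows. This is an illustration only, not the contents of eval/score.py: the patterns, the A-D answer format, and the `fallback` callable (a stand-in for the GPT call, stubbed here so the sketch runs offline) are all assumptions.

```python
import re
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical patterns: "answer is (B)" / "Answer: C", or an output that is just a letter.
ANSWER_RE = re.compile(r"(?i:answer)\s*(?:(?i:is)|:)?\s*\(?([A-D])\)?\b")
BARE_RE = re.compile(r"^\s*\(?([A-D])\)?[\s.:)]*$")


def extract_choice(output: str) -> Optional[str]:
    """Return the chosen option letter if the output parses, else None."""
    match = ANSWER_RE.search(output) or BARE_RE.match(output)
    return match.group(1) if match else None


def score_outputs(
    outputs: Sequence[str],
    gold: Sequence[str],
    fallback: Callable[[str], Optional[str]],
) -> Tuple[float, float]:
    """Return (accuracy, fallback_rate); the fallback runs only when the regex fails."""
    correct = 0
    fallbacks = 0
    for output, answer in zip(outputs, gold):
        choice = extract_choice(output)
        if choice is None:
            fallbacks += 1
            choice = fallback(output)  # in practice: an LLM judge call
        correct += int(choice == answer)
    total = len(gold)
    return (correct / total, fallbacks / total) if total else (0.0, 0.0)


# Stub fallback for the example; a real one would call a model.
outputs = ["The answer is (B).", "C", "probably the second option"]
gold = ["B", "C", "B"]
accuracy, fallback_rate = score_outputs(outputs, gold, fallback=lambda text: "B")
print(accuracy, fallback_rate)
```

In this toy run one of three outputs fails to parse, so the fallback rate is one third, far above the ~5% threshold, and the accuracy would depend heavily on the judge.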

Full methodology, run commands, and reproduction instructions: see docs/eval/mragbench-methodology.md and docs/eval/document-rag-methodology.md.

Multimodal RAG

MRAG-Bench

Yan et al. 2024 — arXiv:2410.08182

Vision-centric RAG evaluation. 1,353 queries over a 16,130-image corpus across 9 visual-reasoning scenarios.

System	Configuration	R@1	R@5	NDCG@5	Accuracy	Latency p50	$ / q	Citation
Tableside (full multimodal RAG)	Titan Multimodal G1 joint embedding + source-diversity reranking	28.5%	54.8%	0.275	—	—	—	re-measured: MLflow run cd64770e (retrieval-only, full 1,353-query dev set), 2026-05-22
Tableside (text-only embedding)	Titan v2 text embedding (ablation, no multimodal)	—	—	—	—	—	—	re-measured: tbd-mlflow-run-id, tbd
ColPali (PaliGemma-3B retriever)	VLM-as-retriever, no separate text index	—	—	—	—	—	—	re-measured: tbd-mlflow-run-id, tbd
LangChain RetrievalQA	OpenAI text-embedding-3-large + cosine	—	—	—	—	—	—	re-measured: tbd-mlflow-run-id, tbd
GPT-4o (no retrieval, vision-only)	—	—	—	—	58.3%	—	—	paper: Yan et al. 2024, Table 4, 2024-10
Gemini 1.5 Pro (no retrieval)	—	—	—	—	48.5%	—	—	paper: Yan et al. 2024, Table 4, 2024-10
Claude 3.5 Sonnet (no retrieval)	—	—	—	—	47.0%	—	—	paper: Yan et al. 2024, Table 4, 2024-10
LLaVA-NeXT (no retrieval)	—	—	—	—	33.0%	—	—	paper: Yan et al. 2024, Table 4, 2024-10

Document RAG · single-hop

CRAG

Yang et al. 2024 — arXiv:2406.04744

Comprehensive RAG Benchmark. 2,706 open-domain QA queries across Finance, Sports, Music, Movies, and Open domains.

Numbers landing soon — corpus ingest in progress. Check back in 24 hours.

Track progress at github.com/ronnc/restaurant-chat/issues

Document RAG · cross-document

HotpotQA

Yang et al. 2018 — arXiv:1809.09600

7,400 Wikipedia questions, each requiring 2 supporting paragraphs from distinct articles. Scored on Joint Recall@k, which counts a query as a hit only when both gold paragraphs are in the top k, because OR-based recall over-credits partial retrieval and hides cross-document failures. Source-diversity reranking is the differentiating signal.
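The difference between joint and OR-based recall is easiest to see on a single query. The sketch below is illustrative; the function names and paragraph ids are hypothetical, and it is not the scoring code used for the results on this page.

```python
from typing import Sequence, Set


def any_recall_at_k(ranked_ids: Sequence[str], gold_ids: Set[str], k: int) -> float:
    """OR-based recall: 1.0 if at least one gold paragraph is in the top k."""
    top_k = set(ranked_ids[:k])
    return float(any(gold_id in top_k for gold_id in gold_ids))


def joint_recall_at_k(ranked_ids: Sequence[str], gold_ids: Set[str], k: int) -> float:
    """Joint recall: 1.0 only if every gold paragraph is in the top k."""
    top_k = set(ranked_ids[:k])
    return float(bool(gold_ids) and all(gold_id in top_k for gold_id in gold_ids))


# Hypothetical query whose gold paragraphs come from articles A and B.
# The retriever returns two paragraphs from article A and none from B.
ranked = ["para_a1", "para_a2", "para_c1"]
gold = {"para_a1", "para_b1"}
print(any_recall_at_k(ranked, gold, k=3))    # 1.0: partial retrieval is fully credited
print(joint_recall_at_k(ranked, gold, k=3))  # 0.0: the cross-document miss is exposed
```

Averaging the joint variant over all queries gives a number that can only go up when the retriever surfaces both articles, which is the behaviour a diversity-aware reranker is meant to improve.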

Numbers landing soon — corpus ingest in progress. Check back in 24 hours.

Track progress at github.com/ronnc/restaurant-chat/issues

Document RAG · multi-hop

MultiHop-RAG

Tang & Yang 2024 — arXiv:2401.15391

2,556 questions requiring evidence from 2–4 distinct articles each, spanning Inference, Comparison, Temporal, and Null-refusal reasoning types. This is where naive cosine retrieval breaks down and signal-fusion retrievers earn their keep.

Numbers landing soon — corpus ingest in progress. Check back in 24 hours.

Track progress at github.com/ronnc/restaurant-chat/issues

Cross-modal cross-document · apex test

MultiModalQA

Talmor et al. 2021 — arXiv:2104.06039

~29,000 Wikipedia questions where each article has text, tables, and images, and answers require cross-referencing modalities. It is the closest published analog to a real restaurant menu PDF (text description + price table + dish photo per item). This is where the joint multimodal embedding + image-attachment generation pipeline earns its keep or doesn't.

Numbers landing soon — corpus ingest in progress. Check back in 24 hours.

Track progress at github.com/ronnc/restaurant-chat/issues

Every re-measured row links to the MLflow run id it came from. Click a row’s citation to open the full per-example breakdown — every query’s retrieved chunks, the LLM’s raw output, the scoring decision, and the latency + cost trace.