Tableside
Benchmarks

Our retrieval is the differentiator.

We measure Tableside on public multimodal and document RAG benchmarks, side by side with the systems most teams choose by default. The headline metric is Recall@5 in the retrieval-only setting, because retrieval is the differentiator. End-to-end accuracy is a secondary metric that shows whether better retrieval translates into better answers.
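As a rough illustration of the headline metric, the sketch below computes Recall@k as a hit rate: a query counts as a hit if any gold item appears in the top k retrieved ids. That "any gold item" reading is an assumption, and the function names and data shapes are hypothetical. The official scoring code for each benchmark remains the reference.

```python
from typing import Iterable, Sequence, Set, Tuple


def hit_at_k(retrieved_ids: Sequence[str], gold_ids: Set[str], k: int = 5) -> bool:
    """True if at least one gold id appears among the top-k retrieved ids."""
    return any(doc_id in gold_ids for doc_id in retrieved_ids[:k])


def recall_at_k(runs: Iterable[Tuple[Sequence[str], Set[str]]], k: int = 5) -> float:
    """Fraction of queries with a hit in the top k (assumed hit-rate definition)."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(hit_at_k(retrieved, gold, k) for retrieved, gold in runs) / len(runs)


# Two toy queries: the first retrieves its gold item at rank 3, the second misses.
example_runs = [
    (["a", "b", "c"], {"c"}),
    (["x", "y"], {"z"}),
]
recall_5 = recall_at_k(example_runs, k=5)
recall_1 = recall_at_k(example_runs, k=1)
```

With these toy inputs, one of two queries is a hit at k=5 and neither is a hit at k=1.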

Methodology

Each row in the tables below carries a citation. Re-measured rows are runs we executed under the same conditions as our own system — same corpus, same scoring, same date — and link back to an MLflow run id you can browse for the full per-example breakdown. Paper-reported rows are numbers lifted from the published paper that introduced the benchmark; we cite the table and the date.

We follow the official scoring code for every benchmark. For MRAG-Bench that is eval/score.py: a regex extractor runs first, with a GPT fallback when the model output doesn’t parse. We report the fallback rate; above ~5%, the published number is suspect. Our v3 prompts bring the fallback rate to 0% on the smoke set.
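The extract-then-fall-back flow and the fallback rate can be sketched as follows. This is not the contents of eval/score.py. The regex, the A-D answer format, and the `fallback` callable (a stand-in for the GPT call) are all assumptions made for illustration.

```python
import re
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical pattern: "Answer: B", "The answer is (C).", etc.
# Only an uppercase A-D that is not followed by another letter counts as a choice.
ANSWER_RE = re.compile(r"(?i:answer\s*(?:is)?\s*[:\-]?\s*)\(?([A-D])\)?(?![A-Za-z])")


def extract_choice(output: str) -> Optional[str]:
    """Return the parsed choice letter, or None if the output does not parse."""
    match = ANSWER_RE.search(output)
    return match.group(1) if match else None


def score_outputs(
    outputs: Sequence[str], fallback: Callable[[str], str]
) -> Tuple[List[str], float]:
    """Parse each output with the regex; use the fallback only when parsing fails.

    Returns the extracted answers and the fraction of outputs that needed the fallback.
    """
    answers: List[str] = []
    n_fallback = 0
    for output in outputs:
        choice = extract_choice(output)
        if choice is None:
            choice = fallback(output)  # stand-in for an LLM-based extractor
            n_fallback += 1
        answers.append(choice)
    rate = n_fallback / len(answers) if answers else 0.0
    return answers, rate


outputs = ["Answer: B", "The answer is (C).", "probably the second image"]
answers, fallback_rate = score_outputs(outputs, fallback=lambda text: "A")
```

Here the third output does not parse, so one of three outputs goes through the fallback. A rate like that would be well above the ~5% threshold.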

Full methodology, run commands, and reproduction instructions: see docs/eval/mragbench-methodology.md and docs/eval/document-rag-methodology.md.

Multimodal RAG

MRAG-Bench: vision-centric RAG evaluation with 1,353 queries over a 16,130-image corpus, across 9 visual-reasoning scenarios.

| System | R@1 | R@5 | NDCG@5 | Accuracy | Latency p50 | $ / q | Citation |
|---|---|---|---|---|---|---|---|
| Tableside (full multimodal RAG): Titan Multimodal G1 joint embedding + source-diversity reranking | 28.5% | 54.8% | 0.275 | | | | re-measured; mlflow run cd64770e (retrieval-only, full 1,353-query dev set); 2026-05-22 |
| Tableside (text-only embedding): Titan v2 text embedding, ablation, no multimodal | | | | | | | re-measured; tbd-mlflow-run-id; tbd |
| ColPali (PaliGemma-3B retriever): VLM-as-retriever, no separate text index | | | | | | | re-measured; tbd-mlflow-run-id; tbd |
| LangChain RetrievalQA: OpenAI text-embedding-3-large + cosine | | | | | | | re-measured; tbd-mlflow-run-id; tbd |
| GPT-4o (no retrieval, vision-only) | | | | 58.3% | | | paper; Yan et al. 2024, Table 4; 2024-10 |
| Gemini 1.5 Pro (no retrieval) | | | | 48.5% | | | paper; Yan et al. 2024, Table 4; 2024-10 |
| Claude 3.5 Sonnet (no retrieval) | | | | 47.0% | | | paper; Yan et al. 2024, Table 4; 2024-10 |
| LLaVA-NeXT (no retrieval) | | | | 33.0% | | | paper; Yan et al. 2024, Table 4; 2024-10 |
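For readers unfamiliar with the NDCG@5 column, here is a minimal sketch of the metric. It assumes binary relevance (1 for a gold item, 0 otherwise) and unique retrieved ids. The benchmark's official gain definition may differ, and the function names are hypothetical.

```python
import math
from typing import Sequence, Set


def dcg_at_k(gains: Sequence[float], k: int = 5) -> float:
    """Discounted cumulative gain: each gain is divided by log2(rank + 1), ranks from 1."""
    return sum(gain / math.log2(index + 2) for index, gain in enumerate(gains[:k]))


def ndcg_at_k(retrieved_ids: Sequence[str], gold_ids: Set[str], k: int = 5) -> float:
    """NDCG@k under binary relevance; assumes retrieved_ids contains no duplicates."""
    gains = [1.0 if doc_id in gold_ids else 0.0 for doc_id in retrieved_ids]
    # The ideal ranking places every gold item (up to k of them) at the top.
    ideal_gains = [1.0] * min(len(gold_ids), k)
    ideal = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


perfect = ndcg_at_k(["a", "b"], {"a"})   # gold item at rank 1
second = ndcg_at_k(["x", "a"], {"a"})    # gold item at rank 2
```

A gold item at rank 1 scores 1.0. The same item at rank 2 scores 1 / log2(3), about 0.63, which is why NDCG can separate two systems with identical Recall@5.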

Document RAG · single-hop

Comprehensive RAG Benchmark. 2,706 open-domain QA queries across Finance, Sports, Music, Movies, and Open domains.

Numbers landing soon — corpus ingest in progress. Check back in 24 hours.

Track progress at github.com/ronnc/restaurant-chat/issues

Document RAG · cross-document

7,400 Wikipedia questions, each requiring 2 supporting paragraphs from distinct articles. Scored on *Joint Recall@k* — BOTH gold paragraphs in the top-k — because OR-based recall over-credits partial retrieval and hides cross-doc failures. Source-diversity reranking is the differentiating signal.
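The difference between OR-based recall and Joint Recall@k can be sketched with a single query. Function names and ids are hypothetical. The only rule taken from the text above is that a joint hit requires both gold paragraphs in the top k.

```python
from typing import Sequence, Set


def or_hit_at_k(retrieved_ids: Sequence[str], gold_ids: Set[str], k: int) -> bool:
    """OR-based hit: at least one gold paragraph is in the top k."""
    return any(doc_id in gold_ids for doc_id in retrieved_ids[:k])


def joint_hit_at_k(retrieved_ids: Sequence[str], gold_ids: Set[str], k: int) -> bool:
    """Joint hit: every gold paragraph is in the top k."""
    return gold_ids <= set(retrieved_ids[:k])


# The retriever finds one of the two supporting paragraphs but not the other.
retrieved = ["p1", "x", "y", "z", "w"]
gold = {"p1", "p2"}

or_result = or_hit_at_k(retrieved, gold, k=5)
joint_result = joint_hit_at_k(retrieved, gold, k=5)
```

The OR-based check credits this query as a hit, while the joint check does not. That gap is the partial-retrieval over-crediting described above.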

Numbers landing soon — corpus ingest in progress. Check back in 24 hours.

Track progress at github.com/ronnc/restaurant-chat/issues

Document RAG · multi-hop

2,556 questions requiring evidence from 2–4 distinct articles each, spanning Inference, Comparison, Temporal, and Null-refusal reasoning types. This is where naive cosine retrieval breaks down and signal-fusion retrievers earn their keep.

Numbers landing soon — corpus ingest in progress. Check back in 24 hours.

Track progress at github.com/ronnc/restaurant-chat/issues

Cross-modal cross-document · apex test

~29,000 Wikipedia questions where each article has TEXT, TABLES, and IMAGES — answers require cross-referencing modalities. Closest published analog to a real restaurant menu PDF (text description + price table + dish photo per item). This is where the joint multimodal embedding + image-attachment generation pipeline earns its keep or doesn't.

Numbers landing soon — corpus ingest in progress. Check back in 24 hours.

Track progress at github.com/ronnc/restaurant-chat/issues

Every re-measured row links to the MLflow run id it came from. Click a row’s citation to open the full per-example breakdown — every query’s retrieved chunks, the LLM’s raw output, the scoring decision, and the latency + cost trace.