Our retrieval is the differentiator.
Tableside is measured on the public multimodal and document RAG benchmarks, side by side with the systems most teams choose by default. The headline metric is Recall@5 in the retrieval-only setting, because retrieval is the differentiator. End-to-end accuracy is a secondary metric that shows whether the retrieval gains translate into better answers.
Methodology
Each row in the tables below carries a citation. Re-measured rows are runs we executed under the same conditions as our own system — same corpus, same scoring, same date — and link back to an MLflow run id you can browse for the full per-example breakdown. Paper-reported rows are numbers lifted from the published paper that introduced the benchmark; we cite the table and the date.
We follow the official scoring code for every benchmark. For MRAG-Bench that is eval/score.py: a regex extractor first, with a GPT fallback if the model output doesn’t parse. We report the fallback rate; above ~5%, the published number is suspect. Our v3 prompts bring the fallback rate to 0% on the smoke set.
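The regex-first, fallback-second flow described above can be sketched roughly as follows. This is a minimal illustration, not the contents of eval/score.py: the A-D answer format, the two patterns, and the names `extract_choice`, `score_with_fallback`, and `fallback` are all assumptions, and the fallback is a stub standing in for the GPT call.

```python
import re
from typing import Callable, List, Optional, Tuple

# Assumption: answers are a single option letter A-D. These patterns are
# illustrative only and are not copied from eval/score.py.
_PATTERNS = [
    re.compile(r"answer\s*(?:is)?\s*[:\-]?\s*\(?([A-D])\)?\b", re.IGNORECASE),
    re.compile(r"^\s*\(?([A-D])\)?[.\s]*$", re.IGNORECASE),
]


def extract_choice(output: str) -> Optional[str]:
    """Return the option letter if a regex can parse it, else None."""
    for pattern in _PATTERNS:
        match = pattern.search(output)
        if match:
            return match.group(1).upper()
    return None


def score_with_fallback(
    outputs: List[str],
    gold: List[str],
    fallback: Callable[[str], Optional[str]],
) -> Tuple[float, float]:
    """Return (accuracy, fallback_rate) over paired outputs and gold letters."""
    if not gold or len(outputs) != len(gold):
        raise ValueError("outputs and gold must be non-empty and equal length")
    correct = 0
    fallbacks = 0
    for output, answer in zip(outputs, gold):
        choice = extract_choice(output)
        if choice is None:
            # Regex could not parse the output: count it and defer to the
            # fallback (hypothetical stand-in for the GPT-based extractor).
            fallbacks += 1
            choice = fallback(output)
        correct += int(choice == answer)
    return correct / len(gold), fallbacks / len(gold)


if __name__ == "__main__":
    outputs = ["Answer: B", "(C)", "I think it is the second image"]
    gold = ["B", "C", "B"]
    accuracy, fallback_rate = score_with_fallback(outputs, gold, lambda _: "B")
    print(f"accuracy={accuracy:.3f} fallback_rate={fallback_rate:.3f}")
```

Tracking the fallback rate alongside accuracy makes it visible when a score depends heavily on the second-stage extractor rather than on parseable model output.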
Full methodology, run commands, and reproduction instructions: see docs/eval/mragbench-methodology.md and docs/eval/document-rag-methodology.md.
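For readers who want a concrete reference for the retrieval columns (R@1, R@5, NDCG@5), here is a minimal sketch of one common way to compute them for a single query. It assumes binary relevance and defines recall as the fraction of gold items found in the top k; the official scoring code may define these differently, and the function names and document ids are hypothetical.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the gold items that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG@k: DCG of the ranking over the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position 1 -> log2(2)
        for rank, doc_id in enumerate(ranked[:k])
        if doc_id in relevant
    )
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical ranking for one query; gold items sit at positions 2 and 4.
    ranked = ["d3", "d1", "d7", "d2", "d9"]
    relevant = {"d1", "d2"}
    print("R@1   ", recall_at_k(ranked, relevant, 1))
    print("R@5   ", recall_at_k(ranked, relevant, 5))
    print("NDCG@5", round(ndcg_at_k(ranked, relevant, 5), 3))
```

Benchmark-level numbers would then be the mean of these per-query values over the full query set.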
Multimodal RAG
MRAG-Bench
Yan et al. 2024, arXiv:2410.08182
Vision-centric RAG evaluation. 1,353 queries over a 16,130-image corpus across 9 visual-reasoning scenarios.
| System | R@1 | R@5 | NDCG@5 | Accuracy | Latency p50 | $ / query | Citation |
|---|---|---|---|---|---|---|---|
| Tableside (full multimodal RAG): Titan Multimodal G1 joint embedding + source-diversity reranking | 28.5% | 54.8% | 0.275 | — | — | — | mlflow run cd64770e (retrieval-only, full 1,353-query dev set), 2026-05-22 |
| Tableside (text-only embedding): Titan v2 text embedding (ablation, no multimodal) | — | — | — | — | — | — | tbd-mlflow-run-id, tbd |
| ColPali (PaliGemma-3B retriever): VLM-as-retriever, no separate text index | — | — | — | — | — | — | tbd-mlflow-run-id, tbd |
| LangChain RetrievalQA: OpenAI text-embedding-3-large + cosine | — | — | — | — | — | — | tbd-mlflow-run-id, tbd |
| GPT-4o (no retrieval, vision-only) | — | — | — | 58.3% | — | — | paper: Yan et al. 2024, Table 4, 2024-10 |
| Gemini 1.5 Pro (no retrieval) | — | — | — | 48.5% | — | — | paper: Yan et al. 2024, Table 4, 2024-10 |
| Claude 3.5 Sonnet (no retrieval) | — | — | — | 47.0% | — | — | paper: Yan et al. 2024, Table 4, 2024-10 |
| LLaVA-NeXT (no retrieval) | — | — | — | 33.0% | — | — | paper: Yan et al. 2024, Table 4, 2024-10 |
Document RAG · single-hop
Comprehensive RAG Benchmark. 2,706 open-domain QA queries across Finance, Sports, Music, Movies, and Open domains.
Numbers landing soon — corpus ingest in progress. Check back in 24 hours.
Track progress at github.com/ronnc/restaurant-chat/issues
Document RAG · cross-document
7,400 Wikipedia questions, each requiring 2 supporting paragraphs from distinct articles. Scored on *Joint Recall@k* — BOTH gold paragraphs in the top-k — because OR-based recall over-credits partial retrieval and hides cross-doc failures. Source-diversity reranking is the differentiating signal.
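The difference between OR-based recall and Joint Recall@k can be illustrated with a short sketch. The function names and paragraph ids below are hypothetical, and this is only one plausible reading of the two metrics as described above, not the scoring code itself.

```python
from typing import Sequence, Set


def any_recall_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> float:
    """OR-based recall: 1.0 if at least one gold paragraph is in the top-k."""
    return 1.0 if gold & set(ranked[:k]) else 0.0


def joint_recall_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> float:
    """Joint recall: 1.0 only if every gold paragraph is in the top-k."""
    return 1.0 if gold and gold <= set(ranked[:k]) else 0.0


if __name__ == "__main__":
    # Hypothetical query whose two gold paragraphs come from articles A and B.
    gold = {"articleA#p1", "articleB#p7"}
    # The retriever returns five paragraphs, four of them from article A.
    ranked = ["articleA#p1", "articleA#p2", "articleA#p3",
              "articleA#p4", "articleC#p4"]
    print("OR recall@5:   ", any_recall_at_k(ranked, gold, 5))
    print("Joint recall@5:", joint_recall_at_k(ranked, gold, 5))
```

In this example the OR-based score gives full credit even though the second article was never retrieved, while the joint score is zero, which is the cross-document failure the joint metric is meant to expose.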
Numbers landing soon — corpus ingest in progress. Check back in 24 hours.
Track progress at github.com/ronnc/restaurant-chat/issues
Document RAG · multi-hop
MultiHop-RAG
Tang & Yang 2024, arXiv:2401.15391
2,556 questions requiring evidence from 2–4 distinct articles each. This is where naive cosine retrieval breaks down and signal-fusion retrievers earn their keep, across the Inference, Comparison, Temporal, and Null-refusal reasoning types.
Numbers landing soon — corpus ingest in progress. Check back in 24 hours.
Track progress at github.com/ronnc/restaurant-chat/issues
Cross-modal cross-document · apex test
MultiModalQA
Talmor et al. 2021, arXiv:2104.06039
~29,000 Wikipedia questions where each article has TEXT, TABLES, and IMAGES; answers require cross-referencing modalities. Closest published analog to a real restaurant menu PDF (text description + price table + dish photo per item). This is where the joint multimodal embedding + image-attachment generation pipeline earns its keep or doesn't.
Numbers landing soon — corpus ingest in progress. Check back in 24 hours.
Track progress at github.com/ronnc/restaurant-chat/issues
Every re-measured row links to the MLflow run id it came from. Click a row’s citation to open the full per-example breakdown — every query’s retrieved chunks, the LLM’s raw output, the scoring decision, and the latency + cost trace.