[ICCV 2025] AdsQA: Towards Advertisement Video Understanding. arXiv: https://arxiv.org/abs/2509.08621
Updated Oct 30, 2025 - Python
Community benchmark database for running LLMs on Apple Silicon Macs
Benchmark LLMs on real professional tasks, not academic puzzles. YAML-driven experiment pipeline + live React dashboard for GDPVal Gold Subset (220 tasks across 11 industries).
Benchmark for evaluating AI epistemic reliability - testing how well LLMs handle uncertainty, avoid hallucinations, and acknowledge what they don't know.
An agent evaluation framework with native multi-turn feedback iteration.
[ICML 2026] CapBencher toolkit: Give your LLM benchmark a built-in alarm for leakage and gaming
The open-source benchmark for LLM memory decay. Measure how Naive, RAG, Chunked RAG, Cascading, and SummaryMemory degrade over 100 conversation turns. Ebbinghaus forgetting curves, 5-provider LLM eval, multi-seed CI. No API key needed.
Testing how well LLMs can solve jigsaw puzzles
Open benchmark of 44 LLMs with 5,000+ real tests. Alternatives to Claude, GPT-5, and Gemini for N8N agents, OpenClaw, and entrepreneurs. Interactive calculator + LLM-as-Judge Phi-4.
🚀 A modern, production-ready refactor of the LoCoMo long-term memory benchmark.
Self-hosted LLM API benchmark, monitoring, and playground. Compare latency, TTFT, and throughput across OpenAI, Anthropic, Gemini, and any OpenAI-compatible endpoint. Deploy with one command via Docker.
Community-driven behavioral reliability benchmark for LLMs. 231 probes across 19 modules, deterministic scoring, perplexity correlation, layer sensitivity mapping, quant method capture, hardware-stratified community rankings. Every test contributes to the community dataset.
Open-source multi-agent AI debate arena: pit Claude, GPT, Gemini, Ollama & HuggingFace models against each other with frozen-context fairness, evidence-first judging, 20+ personas, code review, and PDF/Markdown reports. CLI + Web UI.
Research question: can the "office intelligence" of an LLM be measured? This is an attempt. 100 scenarios, 10 criteria, Russian corporate context.
Comprehensive benchmark of OpenRouter free-tier LLMs for practical applications. Evaluates models for coding, Thai language, and general use.
🔍 Benchmark jailbreak resilience in LLMs with JailBench for clear insights and improved model defenses against jailbreak attempts.
A reproducible, deterministic CLI to measure political bias and positioning of LLMs on economic and social axes.
A Chinese-language benchmark of high-pressure, complex tasks. Its main purpose is to test whether a model will cause problems in real-world work.
Local LLM benchmarking
RetardBench is an open, no-censorship benchmark that ranks large language models purely on how retarded they are.