Open Source Model Performance
Compare model quality metrics and serving benchmarks
Serving Benchmark Configuration
Model Configuration
Parameters
The total number of learnable weights in the model, indicating its capacity and complexity.
Larger parameter counts generally enable more sophisticated reasoning and knowledge retention, but also require more computational resources for inference. Common notations include 'B' for billions (e.g., 70B = 70 billion parameters) and 'A' for active parameters in Mixture-of-Experts models (e.g., 235B-A22B = 235B total parameters with 22B active per token).
Quantization
A technique to reduce model size and memory footprint by using lower-precision numerical formats.
Various quantization levels exist, including BF16 (Brain Float 16), FP8 (8-bit Floating Point), INT4 (4-bit Integer), and MXFP4 (Microscaling 4-bit Floating Point). Our benchmarks focus on minimizing model size while maintaining equivalent performance across standardized quality metrics. All models listed (except GPT-4o) use FP8 quantization, balancing efficiency and accuracy.
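As a rough illustration of how quantization affects footprint, the sketch below estimates weight memory from parameter count and bits per weight. It is a simplification: real deployments also need memory for the KV cache, activations, and runtime overhead, and MXFP4 carries a small extra cost for its shared scale factors.

```python
# Back-of-the-envelope weight-memory estimate; illustrative only.
BITS_PER_WEIGHT = {"BF16": 16, "FP8": 8, "INT4": 4, "MXFP4": 4}

def weight_memory_gb(params_billion: float, quant: str) -> float:
    """Approximate size of the model weights alone, in gigabytes."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billion * 1e9 * bits / 8 / 1e9  # bits -> bytes -> GB

# Example: a 70B model needs roughly 70 GB of weights in FP8, 140 GB in BF16.
print(weight_memory_gb(70, "FP8"))   # ~70.0
print(weight_memory_gb(70, "BF16"))  # ~140.0
```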
Min GPU
The minimum number of GPUs required to load and run the model.
GPU configurations using powers of 2 (1, 2, 4, 8...) are recommended for optimal performance and parallelization efficiency. For example, while MiniMax-M2 requires a minimum of 3 H100 NVL GPUs, deploying with 4 GPUs is preferable in practice. The value shown varies based on the selected hardware type.
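A minimal sketch of how a minimum and a recommended (power-of-two) GPU count could be derived. The 94 GB per H100 NVL, the 20% headroom factor, and the 230 GB example model are illustrative assumptions, not the values used to produce the table.

```python
import math

H100_NVL_MEM_GB = 94  # assumed per-GPU memory, for illustration

def min_gpus(weights_gb: float, gpu_mem_gb: float, overhead: float = 1.2) -> int:
    """Smallest GPU count whose combined memory fits the weights plus ~20% headroom
    for KV cache and runtime overhead (the 20% figure is an assumption)."""
    return math.ceil(weights_gb * overhead / gpu_mem_gb)

def recommended_gpus(weights_gb: float, gpu_mem_gb: float) -> int:
    """Round the minimum up to the next power of two for tensor-parallel efficiency."""
    n = min_gpus(weights_gb, gpu_mem_gb)
    return 1 << (n - 1).bit_length()

# Illustrative numbers only: a model needing ~230 GB of weights fits on 3 GPUs
# at minimum, but 4 is the preferable deployment size, as noted above.
print(min_gpus(230, H100_NVL_MEM_GB), recommended_gpus(230, H100_NVL_MEM_GB))  # 3 4
```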
Quality Metrics
AI Index
A comprehensive score measuring overall AI capability.
This is a standardized benchmark score that evaluates the model's general intelligence and reasoning abilities across multiple tasks. Scores are sourced from Artificial Analysis.
View source: Artificial Analysis
Retrieval
Measures the model's ability to accurately retrieve relevant documents for answering user questions.
Evaluates the entire flow: correctly classifying the question's intent, calling the appropriate Agent, and executing the retrieve function to fetch relevant documents. The baseline (GPT-4o) achieves 70% when the Agent follows the prescribed flow. Evaluated on 20 fixed questions. Errors may occur due to unintended tool usage, intent misclassification, or retriever performance issues. Additionally, errors can arise when large models have unstable serving, poor exception handling in internal tool usage, or when maximum token length is exceeded.
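A hypothetical sketch of how such a pass/fail retrieval score could be computed over the fixed question set. `run_agent`, the gold document IDs, and the hit criterion are assumptions standing in for the actual evaluation harness.

```python
from dataclasses import dataclass

@dataclass
class Question:
    text: str
    relevant_doc_ids: set[str]  # gold documents for this question

def retrieval_score(questions: list[Question], run_agent) -> float:
    """Fraction of the fixed question set where the agent flow fetched a relevant doc."""
    hits = 0
    for q in questions:
        try:
            fetched = run_agent(q.text)        # intent -> Agent -> retrieve()
            if fetched & q.relevant_doc_ids:   # at least one relevant document retrieved
                hits += 1
        except Exception:
            # unstable serving, tool errors, or exceeded token limits count as misses
            pass
    return hits / len(questions)
```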
Citation
Evaluates whether the model correctly cites sources when providing information based on retrieved documents.
For example, checks if responses like 'The coverage of auto insurance is as follows [cite:doc5]' correctly match the answer content with the cited document. Only evaluated on questions where Retrieval was successful, so the denominator varies by model. This is a key metric for ensuring trustworthiness and verifiability in RAG systems.
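The citation check could look roughly like the sketch below. The `[cite:docN]` tag format is taken from the example above; the pass criterion and helper names are assumptions, and the real judgement also verifies that the cited document's content supports the claim.

```python
import re

CITE_RE = re.compile(r"\[cite:(doc\d+)\]")

def cited_doc_ids(answer: str) -> set[str]:
    """Extract document IDs referenced with [cite:docN] tags."""
    return set(CITE_RE.findall(answer))

def citation_ok(answer: str, supporting_doc_ids: set[str]) -> bool:
    """Pass if the answer cites at least one document and every cited document
    is among the documents that actually support the answer."""
    cited = cited_doc_ids(answer)
    return bool(cited) and cited <= supporting_doc_ids

# Example from the description: an auto-insurance answer citing doc5
print(citation_ok("The coverage of auto insurance is as follows [cite:doc5]", {"doc5"}))  # True
```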
Korean Naturalness
Assesses how natural and fluent the model's Korean language output is.
Reviews grammatical errors, awkward expressions, unnatural sentence structures, and typos. For example, it flags cases like '법적 배상책임을 보상하는 보장입니다' ("coverage that compensates legal liability"), where the near-identical words 보상 and 보장 are redundantly repeated and read awkwardly, or typos like '보항' (a misspelling of 보장, "coverage"). Only evaluated on error-free responses, so the denominator varies by model.
Chinese-Free
Measures the absence of unintended Chinese characters in Korean responses.
Some models developed in China may mix Chinese characters into Korean responses. For example, it detects cases like '손해情形을 커버합니다' ("covers the loss situation"), where the Chinese word '情形' ("situation") is embedded in an otherwise Korean sentence. Higher scores indicate fewer instances of Chinese-character contamination. Only evaluated on error-free responses.
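A simple way to approximate this check is to flag any CJK ideograph in an otherwise Hangul response, as in the sketch below. This is an assumption about the method, and it would also flag intentional Hanja.

```python
import re

# Korean text is written in Hangul; any CJK ideograph in a response is treated
# here as Chinese-character contamination (a simplification).
CJK_IDEOGRAPH_RE = re.compile(r"[\u4e00-\u9fff\u3400-\u4dbf]")

def is_chinese_free(response: str) -> bool:
    return CJK_IDEOGRAPH_RE.search(response) is None

print(is_chinese_free("손해情形을 커버합니다"))   # False: contains 情形
print(is_chinese_free("손해 상황을 커버합니다"))  # True
```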
Serving Performance Metrics
TPS (Tokens Per Second)
The number of tokens the model can generate per second.
Higher values indicate faster generation speed. Measured under the selected hardware configuration, GPU count, context length, and concurrency settings. GPT-4o figures (TPS: 108.71) are included as a baseline reference from Artificial Analysis and are updated regularly; compare other models' performance against this typical user experience.
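Conceptually, TPS is the number of output tokens divided by the elapsed generation time. A minimal sketch, assuming a hypothetical token-streaming iterator; whether first-token latency is counted varies between harnesses.

```python
import time

def measure_tps(stream) -> float:
    """Output tokens per second for one streamed generation.
    `stream` is a hypothetical iterator that yields one output token at a time."""
    start = time.perf_counter()
    n_tokens = sum(1 for _ in stream)
    return n_tokens / (time.perf_counter() - start)
```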
TTFT (Time To First Token)
The latency from sending a request to receiving the first token of the response.
Lower values indicate faster initial response. Critical for user-perceived latency in interactive applications. GPT-4o metrics (TTFT: 1.05s) serve as a baseline reference from Artificial Analysis, regularly updated, for comparing typical user experience across models.
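TTFT can be measured by timing the gap between issuing a streamed request and the arrival of the first token. `send_request` below is a hypothetical stand-in for the serving client.

```python
import time

def measure_ttft(send_request) -> float:
    """Seconds from sending a request to receiving the first streamed token.
    `send_request` is a hypothetical callable that issues the request and
    returns an iterator over output tokens."""
    start = time.perf_counter()
    stream = send_request()
    next(iter(stream))  # block until the first token arrives
    return time.perf_counter() - start
```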
Benchmark Configuration
Hardware (H/W)
The GPU hardware type used for serving the model: H100 NVL or H200 NVL.
Different GPU architectures offer varying levels of performance and memory capacity. H200 NVL generally provides better performance than H100 NVL due to higher memory bandwidth.
Number of GPUs
The number of GPUs used for model inference: 1, 2, or 4.
More GPUs allow for larger models and faster processing through parallelization. Increasing GPU count directly improves throughput (TPS) and enables handling more concurrent users.
Concurrency
The number of simultaneous requests sent to the model: 1, 10, 25, 50, or 100.
Higher concurrency tests how well the model maintains performance under load. As concurrent users increase, both TPS and TTFT typically degrade, helping determine the optimal infrastructure scale for your expected traffic.
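A load test along these lines can be sketched with asyncio: fire all requests at once and aggregate TPS and TTFT across them. `send_async` is a hypothetical async streaming client, and the sketch assumes every request returns at least one token.

```python
import asyncio
import time

async def one_request(send_async):
    """Run a single streamed request; return (ttft, n_tokens, duration)."""
    start = time.perf_counter()
    ttft, tokens = None, 0
    async for _ in send_async():        # hypothetical async token stream
        if ttft is None:
            ttft = time.perf_counter() - start
        tokens += 1
    return ttft, tokens, time.perf_counter() - start

async def run_load(send_async, concurrency: int):
    """Fire `concurrency` requests at once; report aggregate TPS and mean TTFT."""
    results = await asyncio.gather(*(one_request(send_async) for _ in range(concurrency)))
    total_tokens = sum(r[1] for r in results)
    wall_time = max(r[2] for r in results)        # requests start together
    mean_ttft = sum(r[0] for r in results) / concurrency
    return total_tokens / wall_time, mean_ttft
```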
Context Length
The maximum total token capacity for input and output combined: 8K, 16K, 32K, or 64K.
All benchmarks use a fixed 4K output token length. The remaining capacity is allocated to input tokens. For instance, an 8K context length means 4K for input and 4K for output, while a 64K context length allows 60K input tokens with the same 4K output. Longer contexts require more memory and computational resources.
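The input budget is therefore simply the context length minus the fixed 4K output, as in this small sketch:

```python
OUTPUT_TOKENS = 4 * 1024  # all benchmarks fix the output length at 4K tokens

def max_input_tokens(context_length: int) -> int:
    """Input budget left after reserving the fixed 4K output."""
    return context_length - OUTPUT_TOKENS

for ctx_k in (8, 16, 32, 64):
    print(f"{ctx_k}K context -> {max_input_tokens(ctx_k * 1024) // 1024}K input")
# 8K -> 4K, 16K -> 12K, 32K -> 28K, 64K -> 60K
```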
Notes
- "-" indicates the metric has not been measured yet.
- "x" indicates the metric could not be measured (e.g., test failure, incompatible configuration).
- Quality metrics with variable denominators show the actual (correct/total) count below the percentage.