Benchmarking Quantized LLMs: What Works Best for Real Tasks?

Quantization has emerged as a critical method for deploying large language models (LLMs) in constrained environments. This study benchmarks quantized variants of Qwen2.5, DeepSeek, Mistral, and LLaMA 3.3 across five diverse tasks (MMLU, GSM8K, BBH, C-Eval, and IFEval) spanning domains from math reasoning to instruction following. We evaluate each model under multiple quantization schemes (BF16, GPTQ-INT8, INT4, AWQ, GGUF) to assess the trade-offs in accuracy retention and task robustness. Our findings offer actionable insights into quantization format selection for production use, highlighting that Q5_K_M and GPTQ-INT8 offer optimal trade-offs for most domains, while AWQ and lower-bit GGUF formats should be used cautiously.

Background and Motivation

In recent years, large language models (LLMs) have rapidly transitioned from research labs to real-world products powering virtual assistants, developer tools, financial advisors, and even autonomous agents. However, while their capabilities have grown, so too have their computational demands. Full-precision LLMs are often too large, too slow, or too resource-intensive for many real-world deployment scenarios. This is where quantization enters the conversation.

Quantization allows us to compress these models, typically by reducing the bit-width of weights and activations, without retraining them from scratch. In doing so, we significantly lower memory usage and speed up inference, making LLMs deployable even on constrained hardware. However, quantization introduces trade-offs: it can degrade model accuracy, and that degradation is rarely uniform across tasks.
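To make this concrete, here is a minimal sketch of symmetric per-tensor INT8 weight quantization in PyTorch. It is illustrative only; production formats such as GPTQ, AWQ, and GGUF build on this basic idea with calibration data, group-wise scales, and other refinements.

```python
import torch

def quantize_int8(weights: torch.Tensor):
    """Symmetric per-tensor quantization of floating-point weights to INT8 (illustrative)."""
    scale = weights.abs().max() / 127                       # map the largest magnitude to the INT8 range
    q = torch.clamp(torch.round(weights / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximation of the original weights for computation."""
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)                                 # stand-in for one weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(f"max abs error: {(w - w_hat).abs().max().item():.5f}")  # error grows as bit-width shrinks
```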

Despite the availability of dozens of quantized models on platforms like Hugging Face, clear guidance is still lacking on how different quantization formats behave in practical use cases. Most existing benchmarks focus on raw accuracy, usually under ideal conditions, and often overlook critical variables like inference latency, robustness to decoding variation, or task-specific failure modes. For teams building with LLMs in production, where cost, speed, and reliability matter, these one-size-fits-all evaluations fall short.

This blog aims to address that gap. Rather than simply comparing models on standard leaderboards, we take a task-oriented view. We evaluate quantized versions of four leading instruction-tuned model families (Qwen2.5, DeepSeek, Mistral, and LLaMA 3.3) across a wide range of benchmarks, including MMLU, BBH, GSM8K, IFEval, and C-Eval. Each benchmark is tied to a practical domain, from finance to software development to reasoning agents.

Just as importantly, we analyze each model across multiple quantization formats: from BF16 full-precision baselines to INT4/INT8 via GPTQ, AWQ, and the GGUF family of formats like Q4_K_M, Q5_K_M, and Q8_0. This enables us to assess quantization trade-offs across real-world use cases.

This analysis is built around one practical question:

Which quantized format gives me the best trade-off between accuracy, speed, and task suitability for my specific use case?

By the end of this blog, you’ll have a much clearer picture of how quantization affects model performance, not just in abstract benchmarks but in the kinds of real-world applications LLMs are increasingly being asked to support.

Methodology

Model Families and Quantization Schemes

To ensure a meaningful and representative evaluation of quantized LLMs, we selected four model families that reflect current industry practices and deployment realities.

Our evaluation focuses on four major model families: Qwen2.5, DeepSeek, Mistral, and LLaMA 3.3. These were chosen based on their open availability, strong performance on instruction-following tasks, multilingual capabilities, and overall popularity in the research and open-source communities. We specifically targeted model sizes ranging from 7B to 32B parameters, as these offer the best trade-offs between performance and deployability in real-world applications.

Each model was evaluated in its full-precision format (BF16 or FP16, depending on availability) as a baseline, alongside at least three quantized versions. The quantization formats were selected to represent a broad spectrum of deployment needs:

GGUF: Includes Q3_K_M, Q4_K_M, Q5_K_M, and Q8_0. These are specifically optimized for CPU/GPU inference using llama.cpp, making them ideal for edge or low-power scenarios. GGUFs are also frequently used in real-time desktop apps and on-device AI assistants.

GPTQ: Widely adopted in GPU inference setups, GPTQ formats (especially INT4 and INT8) are supported by frameworks like AutoGPTQ, vLLM, and ExLlama. They are known for strong accuracy retention at lower bit-widths.

AWQ: A newer quantization approach optimized for weight-only compression with minimal accuracy drop. AWQ performs especially well in latency-sensitive environments, such as web inference and edge deployment.

BF16: Used as the reference format in our analysis. While not deployable in every environment, BF16 (or FP16) provides a reliable upper bound on model performance.

Each model was tested on quant formats based on availability and compatibility. In some cases, certain formats were not included due to the absence of official weights or conversion tools at the time of evaluation. Where possible, we ensured a minimum of three quant levels per model to enable cross-format comparisons.
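For context, the sketch below shows how these formats are commonly loaded for inference: GPTQ/AWQ checkpoints through the standard Transformers API (with the matching backend installed), and GGUF files through llama.cpp bindings. The repository and file names are placeholders rather than the exact artifacts used in this study.

```python
# Illustrative loading paths; repository and file names below are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llama_cpp import Llama

# GPTQ / AWQ checkpoints load through the standard Transformers API,
# provided the matching backend (auto-gptq or autoawq) is installed.
gptq_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct-GPTQ-Int8",   # placeholder repo ID
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct-GPTQ-Int8")

# GGUF files (Q4_K_M, Q5_K_M, Q8_0, ...) are served via llama.cpp bindings instead.
gguf_model = Llama(
    model_path="qwen2.5-7b-instruct-q5_k_m.gguf",  # placeholder local file
    n_ctx=4096,
)
```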

Additionally, we intentionally selected models with varied specialization:

Qwen2.5-Instruct models emphasize reasoning and multilingual capabilities.

Qwen2.5-Coder models are tuned for programming tasks.

DeepSeek-R1-Distill models are distilled variants optimized for efficiency while preserving instruction-following ability.

Mistral-7B-Instruct focuses on lightweight instruction following.

LLaMA 3.3 represents the current high-end instruction-tuned frontier.

This diversity allows us to study how quantization impacts not just average accuracy, but performance across domains like math, science, programming, and complex reasoning.

In summary, the selected model-format pairs offer a balanced view of today’s quantization landscape, spanning high-accuracy, high-efficiency, and general-purpose deployment contexts. This sets the stage for a detailed benchmark-driven analysis in the next sections.

Benchmark Suite and Task Alignment

To move beyond raw accuracy scores and toward practical recommendations, each benchmark used in our evaluation has been mapped to a corresponding real-world use case. Rather than treating benchmarks as abstract metrics, we treat them as representatives of specific application domains, enabling task-level insight into how quantization affects different types of workloads.

Our evaluation includes the following five major benchmarks:

MMLU (Massive Multitask Language Understanding)

Domain: General knowledge, factual reasoning

Mapped Use Case: Research assistants, legal Q&A, enterprise knowledge tools

Description: MMLU covers 57 diverse subjects, from law and biology to history and philosophy. High performance here suggests strong factual grounding and broad world knowledge. Since these are common requirements in enterprise and academic support systems, MMLU is a strong proxy for knowledge work applications.

Quantization Insight: Small degradations in this benchmark can significantly affect trustworthiness in professional deployments.

GSM8K (Grade School Math Word Problems)

Domain: Arithmetic reasoning, multi-step problem solving

Mapped Use Case: Financial assistants, mathematical modeling, budgeting tools

Description: GSM8K requires models to perform multi-step arithmetic with natural language inputs. This mimics the type of step-by-step numeric reasoning found in applications like tax estimators, calculators, and fintech assistants.

Quantization Insight: Formats with aggressive compression (e.g., INT4 or Q3_K_M) are often expected to hurt GSM8K first, since numeric consistency is highly sensitive to precision; our results later in this post put that expectation to the test.

BBH (BIG-Bench Hard)

Domain: Multistep logic, compositional reasoning

Mapped Use Case: Autonomous agents, planning systems, multi-hop assistants

Description: BBH is a challenging reasoning benchmark designed to stress-test LLMs on tasks that involve logic chains, rule-based inferences, and deep context memory. It reflects use cases like automated reasoning agents, long-chain planning, or tools like AutoGPT/RAG pipelines.

Quantization Insight: BBH reveals robustness under high-cognitive-load tasks. Some models retain performance well here even in INT4, while others collapse rapidly, showing format brittleness.

IFEval (Instruction Following Evaluation)

Domain: Instruction tuning, agent control, precision task execution

Mapped Use Case: Workflow automation, LLM-based agents, chatbot task handlers

Description: IFEval evaluates how well a model can follow multi-turn instructions with constraints. This aligns directly with LLMs used in RPA (robotic process automation), customer support flows, and orchestrated multi-step tools.

Quantization Insight: AWQ and some GGUF formats introduce decoding variability, which can harm instruction precision, an issue especially relevant for deterministic agent behavior.

C-Eval

Domain: Multilingual education, Chinese knowledge tasks

Mapped Use Case: Edtech in Asia, cross-lingual reasoning tools, enterprise localization

Description: C-Eval focuses on knowledge and reasoning questions posed in Chinese, across subjects like science, law, and literature. Its inclusion reflects the need for LLMs to perform reliably in multilingual contexts.

Quantization Insight: Performance drops here indicate quantization’s effect on tokenizer alignment and language-specific embeddings, which matter in non-English deployments.

Evaluation Protocol and Inference Settings

A key part of our evaluation lies not just in what we measure, but in how we measure it. To ensure our results reflect real-world usage patterns, we adopted a dual-mode inference strategy capturing both deterministic accuracy and response robustness under natural sampling conditions.

Zero-Shot Deterministic Decoding (Temperature = 0)

In the first setup, all models were evaluated using temperature 0 decoding, ensuring deterministic outputs. This mirrors use cases like legal document drafting, data validation, or backend agents where consistency and predictability are critical. Running all formats with identical prompts under this condition allows for a clean comparison of base-level accuracy across quantization methods.

Multi-Pass Inference with Sampling (Temperature = 0.3)

To complement the deterministic setup, we also evaluated open-weight models using temperature 0.3 decoding, averaging results across three independent passes per prompt. This simulates typical chatbot or agentic deployment scenarios, where models run with mild randomness to improve fluency or creativity.

Multi-pass inference allows us to capture an important second metric: robustness. Some quantized models may return high accuracy under deterministic decoding but break down under sampling due to increased variance, especially at lower bit-widths (e.g., INT4 or Q3_K_M). Measuring this variance helps us identify which formats remain stable in production-like environments.
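A minimal sketch of this dual-mode protocol is shown below. It assumes a Transformers-style `generate` API and a task-specific scorer (`is_correct`, `score_gsm8k`, and `gsm8k_dev` are hypothetical placeholders); the actual harness behind these results may differ, but the logic is the same: one greedy pass, then several sampled passes whose mean and spread capture robustness.

```python
import statistics

def answer(model, tokenizer, prompt: str, temperature: float) -> str:
    """Generate one completion: greedy when temperature == 0, sampled otherwise."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    gen_kwargs = dict(max_new_tokens=512, pad_token_id=tokenizer.eos_token_id)
    if temperature > 0:
        gen_kwargs.update(do_sample=True, temperature=temperature)
    else:
        gen_kwargs.update(do_sample=False)
    out = model.generate(**inputs, **gen_kwargs)
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

def evaluate(model, tokenizer, dataset, is_correct, temperature=0.0, passes=1):
    """Accuracy averaged over `passes` runs, plus the per-pass spread as a robustness signal."""
    scores = []
    for _ in range(passes):
        hits = sum(is_correct(example, answer(model, tokenizer, example["prompt"], temperature))
                   for example in dataset)
        scores.append(hits / len(dataset))
    spread = statistics.pstdev(scores) if passes > 1 else 0.0
    return statistics.mean(scores), spread

# Deterministic mode: temperature 0, single pass.
# acc_det, _ = evaluate(model, tokenizer, gsm8k_dev, score_gsm8k, temperature=0.0, passes=1)
# Sampled mode: temperature 0.3, averaged over three passes.
# acc_sampled, spread = evaluate(model, tokenizer, gsm8k_dev, score_gsm8k, temperature=0.3, passes=3)
```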

Why Robustness Matters

In real-world applications, especially those involving long contexts, open-ended generation, or human-in-the-loop interactions, models are rarely run at temperature 0. Evaluating only zero-shot accuracy risks missing brittleness that may emerge in deployed settings. By including both modes, we assess not just whether a quantized model can be accurate, but whether it can be trusted to stay accurate under light sampling noise.

Results and Observations

Quantization Accuracy Retention Analysis

To evaluate how different quantization schemes affect model performance, we used the Qwen2.5-7B-Instruct model as a case study and tested it across five representative benchmarks: BBH, MMLU, C-Eval, IFEval, and GSM8K. These benchmarks span domains such as multistep reasoning (BBH), factual knowledge (MMLU), multilingual academic tasks (C-Eval), instruction following (IFEval), and math reasoning (GSM8K).

Each quantization format, ranging from BF16 (baseline) to low-bit variants like Q4_K_M and INT4, was evaluated by comparing its raw accuracy to the full-precision reference. From this, we derived a retention score (% of baseline accuracy) to assess format stability under quantization pressure.
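Concretely, the retention score is just the quantized model's accuracy expressed as a percentage of its full-precision baseline; the numbers in the snippet below are placeholders, not measured values.

```python
def retention(quant_acc: float, bf16_acc: float) -> float:
    """Percentage of the BF16 baseline accuracy preserved after quantization."""
    return 100.0 * quant_acc / bf16_acc

# Hypothetical example: a quantized model scoring 0.70 against a 0.74 BF16 baseline.
print(f"{retention(0.70, 0.74):.1f}% retention")  # -> 94.6% retention
```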

Retention Heatmap Across Quantization Formats and Benchmarks for Qwen2.5-7B-Instruct

Quantization Effects Across Benchmarks

Gradual Degradation Across Quantization Levels

There is a clear, monotonic drop in accuracy as bit-width decreases:

  • BF16 → GPTQ-INT8 results in a minimal accuracy loss (usually <2%), suggesting that 8-bit quantization retains almost all meaningful representation power.
  • BF16 → GPTQ-INT4 introduces moderate degradation (~3–6%), which is acceptable in most general-purpose deployments but may not be suitable for edge-cases like legal or medical QA.
  • BF16 → Q4_K_M/Q4_K_S shows the steepest degradation, especially in instruction-heavy (IFEval) and multilingual (C-Eval) settings, with up to a 20% loss from baseline.

These findings are consistent with existing literature suggesting that reduced bit-width quantization impairs instruction-level coherence and factual retrieval.

Format-Specific Trends per Benchmark

BBH (Logical Reasoning)

Although BBH is a complex benchmark, accuracy on it degrades relatively smoothly across formats. Even in Q4_K_M, the model retains ~90% of its BF16 accuracy. This implies that quantization-induced degradation on BBH is less catastrophic, likely due to the structural consistency of logical patterns.

BBH Accuracy Across Models and Quantization Schemes

MMLU (General Knowledge)

MMLU is more sensitive to quantization, particularly in lower-bit formats. This suggests that factual retrieval tasks depend heavily on the precision of internal embeddings and attention weights, making GGUF formats below Q5_K_M risky for knowledge-intensive applications.

MMLU Accuracy Across Models and Quantization Schemes

C-Eval (Multilingual Academic Reasoning)

C-Eval results show a noticeable drop in all formats except GPTQ-INT8. Q4_K_M sees an almost 15–20% reduction in retention, indicating that tokenizer alignment and language-specific embeddings suffer under aggressive quantization. This is especially critical in localized deployments in Asia or multilingual enterprise systems.

C-Eval Accuracy Across Models and Quantization Schemes

IFEval (Instruction Following)

IFEval appears highly sensitive to quantization, especially at INT4 and GGUF Q4 levels. Models show more than 10% accuracy loss, and sometimes erratic behavior. This supports the hypothesis that instruction-following quality depends not only on token predictions but also on decoder alignment, which becomes unstable in very low-bit formats.

IFEval Accuracy Across Models and Quantization Schemes

GSM8K (Mathematical Reasoning)

Interestingly, GSM8K shows relatively high retention even in Q4_K_M and Q4_K_S, with ~84–87% of baseline accuracy. This implies that step-by-step arithmetic tasks are structurally resilient to quantization, especially in models with strong reasoning architectures like Qwen.

GSM8K Accuracy Across Models and Quantization Schemes

AWQ and GGUF: Compression vs Consistency

The behavior of AWQ is notable. In benchmarks like IFEval, it underperforms relative to GPTQ-INT4 even though both use 4-bit quantization. This suggests that AWQ's group-wise quantization may introduce non-determinism that disrupts instruction alignment, even when weight fidelity is preserved.

In contrast, GGUF formats like Q4_K_M are extremely lightweight and enable CPU deployment, but show degradation patterns that must be considered carefully. Especially in C-Eval and IFEval, Q4 formats introduce unacceptable losses for production-level deployments.

The sweet spot appears to be Q5_K_M or Q8_0, where we retain ~95–99% of the original performance, with substantial gains in inference speed and memory efficiency.

Average Accuracy Retention by Quantization Format

*Retention % indicates how much of the full-precision BF16 model's accuracy was preserved after quantization.

Benchmark-Wise Average Accuracy by Quantization Format

GPTQ-INT8 is the most stable quantization format across all tasks. It's ideal when accuracy matters more than speed or memory savings.

Q5_K_M and Q8_0 offer a great middle ground, balancing performance with portability.

Instruction-following and multilingual tasks (IFEval, C-Eval) are the most vulnerable to aggressive quantization.

AWQ, despite its efficiency, requires caution in contexts requiring high determinism.

Tasks like math reasoning (GSM8K) remain robust even under Q4 formats, making these a viable option in low-resource deployments.

In the next section, we connect these findings to concrete industry applications.

Task-Specific Recommendations

One of the core motivations behind this benchmark study is to answer a simple but often overlooked question:

Which quantized LLM format should I use for my specific task or deployment context?

Rather than treating benchmarks as abstract metrics, we mapped each benchmark to a real-world domain, as outlined in the Methodology. Using this mapping, we now distill our findings into task-specific recommendations, grounded in both accuracy retention and quantization stability.

Financial Reasoning and Math-Heavy Applications (e.g., calculators, tax advisors, fintech agents)

Primary Benchmark: GSM8K

Key Insight: GSM8K retains high accuracy even under aggressive quantization. Qwen2.5 models demonstrate remarkable stability in arithmetic reasoning, even at 4-bit precision.

Recommended Format:

  • Qwen2.5-32B + Q5_K_M for high-accuracy production
  • Qwen2.5-7B + Q4_K_M for edge or CPU-constrained environments
  • GPTQ-INT8 for balanced GPU deployment

Rationale: Tasks with structured numeric reasoning are surprisingly resilient to quantization, making them well-suited for low-bit formats without unacceptable performance loss.

Code Generation and Developer Tools (e.g., coding assistants, bug fixers, IDE copilot plugins)

Primary Benchmark: HumanEval (referenced externally), BBH (as proxy for logic)

Key Insight: Logical reasoning and token precision are critical for code generation. GPTQ formats (especially INT8 and INT4) maintain these well, while AWQ and lower GGUFs tend to destabilize completions.

Recommended Format:

  • Qwen2.5-Coder-32B + GPTQ-INT4 for high-throughput dev tools
  • Qwen2.5-Coder-7B + GPTQ-INT8 for smaller environments
  • Avoid Q4_K_M for code-specific inference

Rationale: Code generation demands token-level fidelity and syntactic precision, which is better preserved in GPTQ-based formats than in GGUF or AWQ.

Assistants and Instruction-Following Agents (e.g., workflow automation, RAG-based tools, chat agents)

Primary Benchmark: IFEval

Key Insight: Instruction-following accuracy degrades the fastest under low-bit quantization, especially AWQ and Q4_K_M. However, agents also require fast inference and responsiveness.

Recommended Format:

  • GPTQ-INT8 for high-precision deterministic agents
  • Q5_K_M for interactive assistants with latency needs
  • AWQ only if real-time performance outweighs instruction precision

Rationale: Agents operating in deterministic flows (e.g., form filling, document classification) require stable decoding and alignment, which lower-bit quantization disrupts. AWQ is usable but must be tested thoroughly per workflow.

Research and Enterprise Knowledge Tools (e.g., legal search, compliance engines, academic summarizers)

Primary Benchmark: MMLU, C-Eval

Key Insight: These applications require high factual integrity and multilingual understanding. C-Eval especially reveals fragility in token-level representations under quantization.

Recommended Format:

  • GPTQ-INT8 or Q8_0 for factual tasks in multilingual contexts
  • Qwen2.5-14B + Q5_K_M for full-context research assistant setups
  • Avoid AWQ or Q4_K_M for mission-critical factual work

Rationale: These use cases demand high confidence and recall, especially in multilingual content and domain-specific retrieval. Full precision is ideal, but GPTQ-INT8 offers a strong compromise.

Logic and Reasoning Pipelines (e.g., autonomous agents, planning models, AutoGPT-type chains)

Primary Benchmark: BBH

Key Insight: BBH is relatively robust to quantization, likely due to the internal structure of logic tasks. Even Q4_K_M maintains good retention (>90%).

Recommended Format:

  • Qwen2.5-7B + Q4_K_M for embedded logic agents
  • Qwen2.5-32B + AWQ for fast multistep agents at scale
  • GPTQ-INT8 for accuracy-sensitive reasoning

Rationale: Reasoning chains are structurally encoded, allowing models to retain capability under quantization, a promising insight for building lightweight decision-making agents.

Cross-Cutting Observation

Across all use cases, two general trends hold:

  1. Q5_K_M is consistently the safest GGUF format, retaining >95% accuracy while enabling fast, low-memory inference, making it the default recommendation when speed and fidelity must be balanced.
  2. AWQ is fast but unstable in high-instruction or multilingual domains, so its use should be limited to low-stakes or latency-prioritized tasks.
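For teams that want to bake these guidelines into deployment tooling, a simple lookup table is often enough. The sketch below is one illustrative way to encode the recommendations above; the domain keys and format labels are our own shorthand, and the defaults should be validated against your own data.

```python
# Rough starting points per domain, distilled from the recommendations above.
# These are defaults to validate against your own workloads, not hard rules.
RECOMMENDED_FORMATS = {
    "math_finance":        ["Q5_K_M", "GPTQ-INT8", "Q4_K_M"],   # GSM8K-style workloads
    "code_generation":     ["GPTQ-INT4", "GPTQ-INT8"],          # avoid Q4_K_M
    "instruction_agents":  ["GPTQ-INT8", "Q5_K_M"],             # AWQ only if latency-bound
    "knowledge_retrieval": ["GPTQ-INT8", "Q8_0", "Q5_K_M"],     # avoid AWQ / Q4_K_M
    "logic_reasoning":     ["Q4_K_M", "GPTQ-INT8", "AWQ"],      # BBH-style workloads
}

def pick_format(domain: str, prefer_speed: bool = False) -> str:
    """Return the most conservative option, or the lightest one when speed wins."""
    options = RECOMMENDED_FORMATS[domain]
    return options[-1] if prefer_speed else options[0]

print(pick_format("instruction_agents"))                # GPTQ-INT8
print(pick_format("math_finance", prefer_speed=True))   # Q4_K_M
```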

Comparative Model Analysis

While quantization formats show clear trends across tasks, each model family also exhibits unique behavior due to differences in architecture, pretraining objectives, instruction tuning, and tokenizer handling. In this section, we isolate per-model characteristics that emerged from the data, highlighting how well each model sustains performance under aggressive quantization and which domains it is best suited for.

Qwen2.5 Series: Consistent Across All Formats

Across both the Qwen2.5-Instruct and Qwen2.5-Coder models (7B, 14B, 32B), performance under quantization remained remarkably stable. The drop from BF16 to Q4_K_M was generally predictable and within tolerable margins, with Q5_K_M and GPTQ-INT8 retaining 95–98% of original accuracy across all benchmarks.

  • On STEM-heavy tasks like GSM8K and GPQA, Qwen2.5-32B retained high numeric reasoning even in Q4_K_M, making it suitable for quantized financial agents.
  • Instruction tuning, tested via IFEval, showed minimal degradation in GPTQ-INT8 and AWQ, confirming robustness in agent use cases.
  • The Coder variants, especially Qwen2.5-Coder-32B, held up impressively on the HumanEval and BBH benchmarks, with only marginal regression under GPTQ-INT4.

Conclusion:

Qwen2.5 is arguably the most quantization-tolerant model family in this study. Its multilingual grounding, strong pretraining corpus, and structured decoding make it an ideal base for quantized deployments in general-purpose assistants, agents, and coding tools.

DeepSeek-R1-Distill: Resilient in STEM + Instruction

The DeepSeek-R1-Distill family, particularly the Qwen-32B variant, exhibited strong resilience to quantization in STEM-oriented benchmarks. The accuracy gap between BF16 and quantized variants across MATH, GPQA, and MMLU was consistently under 1%, even in INT4 and AWQ formats.

Notably, DeepSeek also maintained high instruction-following capability under quantization, outperforming comparable models on IFEval when compressed. This is likely due to its distillation process, which emphasized alignment and instruction adherence, enabling smoother degradation curves.

Conclusion:

DeepSeek-R1-Distill models are ideal for education, tutoring, and technical reasoning agents, especially where instruction integrity must be preserved under resource constraints.

LLaMA 3.3: Powerful, but Fragile Under Low-Bit Quantization

While the LLaMA 3.3-70B-Instruct model delivers exceptional results in its full-precision format, it proved to be the most vulnerable to aggressive quantization. For example:

  • On MMLU, Q4 dropped performance by 7.8 points, and even Q8_0 showed noticeable drift.
  • LoRA finetuning, as noted in the original paper, does not reliably recover quantization-induced degradation, unlike prior LLaMA generations.

Given the model’s sheer size, these drops are particularly costly in deployment, intensifying memory and compute demands without the stability benefits observed in smaller peers.

Conclusion:

LLaMA 3.3 should be reserved for GPU-heavy, high-accuracy workloads and is not recommended for Q4 or AWQ quantization. If it must be used in quantized form, stick to GPTQ-INT8 and validate outputs thoroughly in mission-critical environments.

Mistral-7B-Instruct: The Lightweight Workhorse

Among all models evaluated, Mistral-7B-Instruct emerged as the most efficient in the 7B class. Even under 8-bit quantization, performance remained within 2% of BF16 across most tasks. Though more aggressive quantization (Q4_K_M, INT4) introduced greater variance (~8–10% drop), it remained usable for casual and lightweight deployments.

  • Mistral is not the best performer on any single benchmark, but its speed, simplicity, and compatibility make it ideal for prototyping and iterative builds.
  • On IFEval and BBH, the model shows robust reasoning, albeit with slightly noisier completions under temperature sampling.

Conclusion:

Mistral-7B is perfect for developers needing a compact, reasonably accurate LLM that performs acceptably across instruction-following and general Q&A. It is highly suitable for on-device agents, real-time chatbots, and initial production pilots.

Study Limitations and Future Directions

Despite covering a wide range of models, benchmarks, and quantization formats, this evaluation is not without limitations. We highlight them here both for transparency and to guide future iterations of this benchmark.

Missing Quantizations

Not all models were available in all quant formats. For instance, LLaMA 3.3 lacked AWQ and Q8_0 support at the time of testing, while some DeepSeek variants were missing GPTQ versions. Although interpolation between similar models helps fill interpretive gaps, direct empirical validation is always preferred.

Incomplete Speed Benchmarks

While accuracy and retention were central to this analysis, inference speed (tokens/sec) was not systematically benchmarked across formats or hardware profiles. Preliminary throughput numbers were observed (e.g., GGUF outperforming GPTQ on CPU), but more structured profiling, especially under batch loads, remains future work.
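When that profiling happens, a first-order measurement can be as simple as timing a fixed generation and dividing new tokens by wall-clock time, as in the rough sketch below; it assumes a Transformers-style `generate` API and ignores warm-up, prompt processing, and batching, all of which matter in serious benchmarks.

```python
import time

def tokens_per_second(model, tokenizer, prompt: str, max_new_tokens: int = 256) -> float:
    """Crude single-request decode throughput; ignores warm-up and batching effects."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    elapsed = time.perf_counter() - start
    generated = out.shape[1] - inputs["input_ids"].shape[1]   # newly generated tokens only
    return generated / elapsed
```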

Focused Benchmark Subset

This study prioritized five representative benchmarks: BBH, MMLU, C-Eval, IFEval, and GSM8K. While these span reasoning, factual knowledge, multilingual QA, and instruction-following, they do not fully cover all LLM evaluation axes. Notably, TruthfulQA, HumanEval, and MATH were excluded due to runtime constraints and will be integrated in a follow-up post.

Quantization-Aware Fine-Tuning (QLoRA, GPTQ-LoRA)

We evaluated zero-shot quantized performance. However, some formats like GPTQ can recover accuracy when paired with quantization-aware fine-tuning (e.g., GPTQ-LoRA). Exploring post-quantization adaptation is an important avenue for improving quality in production deployments.
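As a pointer for that future work, the sketch below shows the general shape of the QLoRA recipe: bitsandbytes 4-bit loading plus a LoRA adapter via PEFT. The model ID and hyperparameters are placeholders, and this setup was not evaluated in this study.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 loading via bitsandbytes (the QLoRA base configuration).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Placeholder model ID; any causal LM with bitsandbytes support works the same way.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Train only low-rank adapters on top of the frozen quantized weights.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # typical attention projections; adjust per model
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # only the adapter parameters are trainable
```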

Conclusion

This study explored the practical implications of quantizing large language models, not just as an academic curiosity but through the lens of real-world task deployment. By evaluating over a dozen quantized variants across four major model families and five critical benchmarks, we identified meaningful patterns that developers and ML teams can act on.

Key Takeaways:

  • Quantization is not one-size-fits-all. Format choice must align with task criticality and resource constraints.
  • Qwen2.5 models are the most stable across quant formats, making them excellent candidates for versatile deployment.
  • GPTQ-INT8 and GGUF Q5_K_M strike the best trade-offs between speed and retention, outperforming more aggressive options like AWQ or Q4_K_M in critical applications.
  • Instruction-following and multilingual tasks degrade faster under quantization, whereas math and logic tasks show higher resilience.
  • Model-specific traits matter: LLaMA 3.3 offers high raw power but struggles in low-bit formats, while Mistral-7B balances efficiency with reliability for smaller use cases.

Ready to deploy smarter, faster, and leaner LLMs?

Leverage our benchmark methodology and open-source tooling to evaluate your models under real-world constraints. Optimize with confidence, reduce costs, and maintain quality across production use cases. Have questions or want to discuss quantization strategies tailored to your product or organization? Schedule a call with our team; we're happy to collaborate on solutions that scale.
