Complete Ollama Models Guide 2025 - Every Model Explained

Practical Web Tools Team

What are the best Ollama models in 2025? Ollama now offers over 100 open-source AI models for local deployment, ranging from tiny 270M parameter models to massive 671B reasoning systems. The most popular choices are Llama 3.1 8B for general use (108M+ downloads), DeepSeek-R1 for advanced reasoning (75M+ downloads), and Gemma 3 for efficient multimodal tasks (28M+ downloads). This guide covers every model available on Ollama, helping you choose the right one for your specific needs.

Running AI locally has become essential for developers, researchers, and businesses who need privacy, cost control, and offline capability. With Ollama making local deployment as simple as running a single command, the only question remaining is which model to choose.

This comprehensive guide examines every model in the Ollama library, providing the technical details, performance characteristics, and practical recommendations you need to make informed decisions.

What Is Ollama and Why Does Model Selection Matter?

Ollama is an open-source platform that simplifies running large language models locally on your hardware. Instead of sending data to cloud APIs like OpenAI or Anthropic, you download models once and run them entirely on your machine. Your data never leaves your device.

The platform handles the complexity of model quantization, memory management, and optimization automatically. You run ollama run llama3.1 and start chatting within minutes.

Model selection matters because each model has different strengths:

  • Parameter count affects capability and memory requirements
  • Training focus determines whether models excel at code, reasoning, or conversation
  • Quantization level trades quality for speed and memory efficiency
  • Context window limits how much text the model can process at once
  • Architecture type (dense vs. Mixture-of-Experts) impacts efficiency and specialization

Choosing the wrong model wastes hardware resources or leaves performance on the table. This guide helps you match models to your actual needs.

The Meta Llama Family: The Foundation of Local AI

Meta's Llama models form the backbone of local AI. They are the most widely used, best supported, and most thoroughly tested models available. The December 2024 release of Llama 3.3 changed the conversation around open-source large language models, delivering performance comparable to much larger models at a fraction of the computational cost.

Llama 3.3 (70B Parameters)

Llama 3.3 is Meta's latest flagship model, released in December 2024. It offers performance comparable to the much larger Llama 3.1 405B while requiring only 43GB of storage, representing a major advancement in efficient model design.

Key Specifications:

  • Parameters: 70 billion
  • Context Window: 128K tokens
  • Size: 43GB
  • Languages: English, German, French, Italian, Portuguese, Hindi, Spanish, Thai
  • Downloads: 2.9 million
  • Training Data: 15 trillion tokens from public sources (7x larger than Llama 2)
  • License: Llama 3.3 Community License

Architecture Details:

Llama 3.3 is an auto-regressive language model using an optimized transformer architecture with several key innovations:

  • Grouped-Query Attention (GQA): Improves inference scalability and efficiency
  • 128K Vocabulary Tokenizer: Encodes language more efficiently than previous versions
  • Supervised Fine-Tuning (SFT): Aligns model behavior with human preferences
  • Reinforcement Learning with Human Feedback (RLHF): Ensures helpfulness and safety
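
Grouped-Query Attention deserves a quick illustration: several query heads share a single key/value head, which shrinks the KV cache and speeds up decoding. The toy sketch below (plain NumPy, no causal mask, illustrative head counts) only shows the head-sharing arithmetic, not Meta's implementation.

import numpy as np

def grouped_query_attention(q, k, v, n_q_heads=8, n_kv_heads=2):
    """Toy GQA. q: [seq, n_q_heads, d]; k, v: [seq, n_kv_heads, d].
    Each group of n_q_heads // n_kv_heads query heads shares one KV head."""
    group = n_q_heads // n_kv_heads
    d = q.shape[-1]
    out = np.zeros_like(q)
    for h in range(n_q_heads):
        kv = h // group                                    # shared KV head index
        scores = (q[:, h, :] @ k[:, kv, :].T) / np.sqrt(d)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[:, h, :] = w @ v[:, kv, :]
    return out

rng = np.random.default_rng(0)
seq, d = 4, 16
q = rng.normal(size=(seq, 8, d))
k = rng.normal(size=(seq, 2, d))
v = rng.normal(size=(seq, 2, d))
print(grouped_query_attention(q, k, v).shape)   # (4, 8, 16)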

Benchmark Performance:

  • MMLU Chat (0-shot, CoT): 86.0 (matches Llama 3.1 70B; competitive with Amazon Nova Pro at 85.9)
  • MMLU Pro (5-shot, CoT): 68.9 (improved over Llama 3.1 70B)
  • GPQA Diamond (0-shot, CoT): 50.5 (better than Llama 3.1 70B at 48.0)
  • HumanEval (0-shot): 88.4 (near Llama 3.1 405B at 89.0)
  • MBPP EvalPlus: 87.6 (slight improvement over Llama 3.1 70B at 86.0)
  • MATH (0-shot, CoT): 77.0 (major improvement over Llama 3.1 70B at 67.8)
  • MGSM (0-shot): 91.1 (substantial improvement over Llama 3.1 70B at 86.9)
  • IFEval: 92.1 (excellent instruction-following)

Inference Performance:

  • Achieves 276 tokens/second on Groq hardware (25 tokens/second faster than Llama 3.1 70B)
  • NVIDIA TensorRT-LLM with speculative decoding achieves up to 3.55x throughput speedup on HGX H200

Cost Efficiency:

  • Input tokens: $0.10 per million (vs. $1.00 for Llama 3.1 405B)
  • Output tokens: $0.40 per million (vs. $1.80 for Llama 3.1 405B)

Best For: Users who need maximum capability and have RTX 4090 or Apple Silicon with 64GB+ memory. This model approaches GPT-4 quality for many tasks while running locally.

Hardware Requirements: Minimum 64GB RAM or 24GB VRAM with CPU offloading. Runs well on M2 Max or M3 Max MacBooks.

Llama 3.2 (1B and 3B Parameters)

Llama 3.2 represents Meta's push into efficient, edge-deployable models. Released in September 2024, these are designed for devices with limited resources and represent a new era of on-device AI.

Key Specifications:

  • Parameters: 1B (1.3GB) or 3B (2.0GB)
  • Context Window: 128K tokens
  • Languages: 8 officially supported (English, German, French, Italian, Portuguese, Hindi, Spanish, Thai)
  • Training Data: Up to 9 trillion tokens from public sources
  • Downloads: 51 million

Architecture & Training Innovation:

The 1B and 3B models were created using two innovative techniques:

  1. Pruning: Started with Llama 3.1 8B and systematically removed less critical network components
  2. Knowledge Distillation: Used logits from Llama 3.1 8B and 70B as token-level targets during training

This approach allowed Meta to create models that retain much of the capability of larger models in a fraction of the size.
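
To make the distillation step concrete, here is a minimal sketch of a token-level distillation loss, in which the student is pushed to match the teacher's output distribution at every position. The temperature and the use of plain KL divergence are assumptions for illustration; Meta has not published the recipe at this level of detail.

import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between teacher and student token distributions,
    averaged over positions. Shapes: [positions, vocab_size]."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = np.sum(p_teacher * (np.log(p_teacher + 1e-9) - np.log(p_student + 1e-9)), axis=-1)
    return float(kl.mean())

# Toy usage: a 4-token sequence over a 10-word vocabulary
rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 10))
student = rng.normal(size=(4, 10))
print(distillation_loss(student, teacher))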

Edge Deployment Features:

  • Compatible with Qualcomm, MediaTek, and ARM processors
  • Designed for mobile and IoT applications
  • Instantaneous local processing without cloud latency
  • Complete data privacy with no cloud transmission
  • Works on devices with as little as 8GB RAM

Benchmark Performance:

  • MMLU (5-shot): Llama 3.2 3B 63.4 vs. Gemma 2B IT 57.8 and Phi 3.5-mini IT 69.0 (not reported for 1B)
  • IFEval: Llama 3.2 3B 77.4 vs. Gemma 2B IT 61.9 and Phi 3.5-mini IT 59.2 (not reported for 1B)
  • TLDR9 (summarization): Llama 3.2 1B 16.8, Llama 3.2 3B 19.0
  • BFCL V2 (tool use): Llama 3.2 1B 25.7, Llama 3.2 3B 67.0

The 3B model outperforms Gemma 2 2.6B and Phi 3.5-mini on instruction-following, summarization, and tool use benchmarks. Notably, Llama 3.2 3B significantly outperformed the original GPT-4 on the MATH benchmark.

Best For: Mobile development, IoT applications, and situations where you need AI on resource-constrained devices. Also excellent for rapid prototyping when speed matters more than maximum quality.

Hardware Requirements: Runs on any modern hardware. Even laptops with 8GB RAM handle these models comfortably.

Llama 3.2-Vision (11B and 90B Parameters)

The vision variants are Meta's first Llama models to support vision tasks, adding powerful image understanding capabilities to the Llama architecture through a novel adapter-based approach.

Key Specifications:

  • Parameters: 11B (7.8GB) or 90B (55GB)
  • Context Window: 128K tokens
  • Capabilities: Image reasoning, captioning, visual question answering, OCR, object detection
  • Input: Text and images (up to 1120x1120 pixels)
  • Training Data: 6 billion image-text pairs

Architecture Innovation:

Llama 3.2 Vision introduces a unique architecture combining:

  • Base Language Model: Llama 3.1 8B (for 11B) or Llama 3.1 70B (for 90B)
  • Vision Encoder: Separately trained image processing component
  • Cross-Attention Adapters: Connect image representations to the language model

The key innovation is that the language model parameters remained frozen during vision adapter training, preserving all text-only capabilities. This means Llama 3.2-Vision serves as a drop-in replacement for Llama 3.1 for text tasks.

Image Processing Capabilities:

  • High-resolution support up to 1120x1120 pixels
  • Object classification and identification
  • Image-to-text transcription (including handwriting) via OCR
  • Chart and graph understanding
  • Document analysis and data extraction
  • Contextual visual Q&A
  • Image comparison

Grouped-Query Attention (GQA): All models support GQA for faster inference, particularly beneficial for the larger 90B model.

Best For: Applications requiring image analysis, such as document processing, visual content moderation, image-based research assistance, and visual accessibility tools. The 11B variant is the sweet spot for most users.

Limitations: Image+text combinations only support English. Text-only tasks support the full 8-language set.

Hardware Requirements:

  • 11B: 16GB VRAM or 32GB RAM
  • 90B: 64GB+ VRAM or distributed setup

Llama 3.1 (8B, 70B, and 405B Parameters)

Llama 3.1 remains the workhorse of local AI, with the 8B version being the most downloaded model on Ollama at over 108 million downloads. The 405B variant was the first openly available model to rival GPT-4 and Claude 3 Opus in capability.

Key Specifications:

  • Sizes: 8B (4.9GB), 70B (43GB), 405B (243GB)
  • Context Window: 128K tokens
  • Capabilities: Tool use, multilingual, long-form summarization, coding
  • Training Data: 15 trillion tokens

Architectural Improvements over Llama 3:

  • Extended context window from 8K to 128K tokens
  • Improved tokenizer efficiency
  • Enhanced multilingual capabilities
  • Native tool use and function calling support

Best For:

  • 8B: Everyday professional work, document summarization, code generation, content drafting. The best balance of capability and accessibility.
  • 70B: Complex analysis, detailed reasoning, high-stakes professional applications
  • 405B: Research and enterprise applications requiring maximum capability. First open model to truly compete with GPT-4.

Hardware Requirements:

  • 8B: 8GB VRAM or 16GB RAM
  • 70B: 64GB RAM or distributed GPU setup
  • 405B: Multiple high-end GPUs or specialized infrastructure (typically 8x 80GB GPUs)

Llama 3 (8B and 70B Parameters)

The previous generation remains useful for applications optimized for its architecture, though Llama 3.1 is recommended for new projects.

Key Specifications:

  • Sizes: 8B (4.7GB), 70B (40GB)
  • Context Window: 8K tokens
  • Downloads: 13.2 million

Best For: Legacy compatibility or when the shorter 8K context window is sufficient. For new projects, Llama 3.1 is generally recommended due to its larger context window and improved capabilities.

Llama 2 (7B, 13B, and 70B Parameters)

The foundation that started the open-source AI revolution in July 2023.

Key Specifications:

  • Sizes: 7B (3.8GB), 13B (7.4GB), 70B (39GB)
  • Context Window: 4K tokens
  • Training: 2 trillion tokens
  • Downloads: 4.9 million

Best For: Research comparisons, fine-tuning base models, or applications where you have existing Llama 2 infrastructure. Historical significance as the model that democratized large language model access.

Llama 2 Uncensored (7B and 70B Parameters)

A variant of Llama 2 with safety guardrails removed, created using Eric Hartford's uncensoring methodology.

Key Specifications:

  • Sizes: 7B (3.8GB), 70B (39GB)
  • Context Window: 2K tokens
  • Downloads: 1.5 million

Best For: Research purposes, creative writing without restrictions, or applications where you need the model to engage with topics the standard version refuses.

Caution: Use responsibly. The lack of guardrails means the model will attempt to comply with any request. Not recommended for production applications without additional safety measures.

DeepSeek Models: The Reasoning Revolution

DeepSeek has emerged as a major force in open-source AI, particularly with reasoning-focused models. Their January 2025 release of DeepSeek-R1 demonstrated that reinforcement learning alone could produce emergent reasoning capabilities rivaling frontier closed-source models—at a fraction of the training cost.

DeepSeek-R1 (1.5B to 671B Parameters)

DeepSeek-R1 is a family of open reasoning models that approach the performance of OpenAI's o1 and Google's Gemini 2.5 Pro. The full model represents one of the most significant open-source AI releases of 2025.

Key Specifications:

  • Sizes: 1.5B (1.1GB), 7B (4.7GB), 8B (5.2GB), 14B (9.0GB), 32B (20GB), 70B (43GB), 671B (404GB)
  • Context Window: 128K-160K tokens
  • Downloads: 75.2 million
  • License: MIT (fully permissive)
  • Training Cost: Approximately $5.6 million (significantly lower than competing models)

Architecture: Mixture of Experts (MoE)

DeepSeek-R1 leverages a sophisticated MoE framework:

  • Total Parameters: 671 billion
  • Active Parameters: Only 37 billion per inference (5.5% activation rate)
  • Experts per Layer: 256 routed experts
  • Selected Experts: 8 routed experts per token

Key architectural innovations include:

  1. Multi-Head Latent Attention (MLA): Dramatically reduces KV cache size, a common bottleneck in transformers. Enables faster inference and longer text generation.

  2. Expert Routing Mechanism: Lightweight gating network assigns probability distributions over experts. Top-ranked experts process queries in parallel.

  3. Multi-Token Prediction (MTP): Improves generation efficiency.
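
The routing mechanism in point 2 can be sketched in a few lines: a lightweight gate scores every expert for each token, and only the top-k experts run. The snippet below is a generic top-k routing illustration, not DeepSeek's implementation (which also uses shared experts, load-balancing objectives, and MLA).

import numpy as np

def route_tokens(token_states, gate_weights, k=8):
    """Toy top-k MoE routing.
    token_states: [num_tokens, hidden], gate_weights: [hidden, num_experts].
    Returns the indices of the k selected experts per token and their
    renormalized routing weights."""
    scores = token_states @ gate_weights                       # [tokens, experts]
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    top_k = np.argsort(-probs, axis=-1)[:, :k]                 # selected experts
    top_p = np.take_along_axis(probs, top_k, axis=-1)
    top_p /= top_p.sum(axis=-1, keepdims=True)                 # renormalize
    return top_k, top_p

tokens = np.random.default_rng(1).normal(size=(5, 64))         # 5 tokens
gates = np.random.default_rng(2).normal(size=(64, 256))        # 256 experts per layer
experts, weights = route_tokens(tokens, gates, k=8)
print(experts.shape, weights.shape)                            # (5, 8) (5, 8)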

Revolutionary Training Methodology:

DeepSeek-R1's training process represents a breakthrough in AI development:

  1. DeepSeek-R1-Zero: Trained via large-scale reinforcement learning (RL) without supervised fine-tuning. Remarkably, powerful reasoning behaviors emerged naturally from pure RL.

  2. Group Relative Policy Optimization (GRPO): A novel RL algorithm from the DeepSeekMath paper. Built on PPO (Proximal Policy Optimization), GRPO enhances mathematical reasoning while reducing memory consumption.

  3. Multi-Stage Pipeline:

    • Stage 1: Pure RL to discover reasoning patterns
    • Stage 2: Supervised Fine-Tuning (SFT) on synthesized reasoning data
    • Stage 3: Second RL phase for helpfulness and harmlessness

Distilled Models:

The smaller models (1.5B-70B) are distilled from the full 671B model, demonstrating that reasoning patterns from larger models can effectively transfer to smaller ones. This makes advanced reasoning accessible on consumer hardware.

Benchmark Performance:

DeepSeek-R1 matches or exceeds frontier models on reasoning benchmarks:

  • Competitive with OpenAI o1 on mathematical reasoning
  • Approaches GPT-4 Turbo on code generation
  • Exceeds many closed models on logic and scientific analysis

Hardware Requirements:

  • 7B-8B: 8GB VRAM
  • 14B: 12GB VRAM
  • 32B: 24GB VRAM
  • 70B: 48GB+ VRAM or large RAM
  • 671B: Minimum 800GB HBM in FP8 format; requires 64-way expert parallelism across multiple GPUs

Best For: Mathematical reasoning, programming challenges, logical problem-solving, scientific analysis. The 14B-32B range offers the best balance of capability and hardware requirements for most users.

DeepSeek-Coder (1.3B to 33B Parameters)

A coding-focused model trained on 87% code and 13% natural language, optimized for programming tasks.

Key Specifications:

  • Sizes: 1.3B (776MB), 6.7B (3.8GB), 33B (19GB)
  • Context Window: 16K tokens
  • Training: 2 trillion tokens
  • Downloads: 2.4 million

Best For: Code completion, code generation, programming assistance, and technical documentation. Excellent for developers who need a dedicated coding assistant.

DeepSeek-Coder-V2 (16B and 236B Parameters)

An advanced Mixture-of-Experts coding model that achieves GPT-4 Turbo-level performance on code tasks—the first open model to reach this milestone.

Key Specifications:

  • Sizes: 16B (8.9GB), 236B (133GB)
  • Active Parameters: 2.4B (16B model), 21B (236B model)
  • Context Window: Up to 160K tokens
  • Architecture: Mixture-of-Experts with Multi-Head Latent Attention
  • Programming Languages: 338 supported
  • Training Data: 10.2 trillion tokens (60% code, 10% mathematics)
  • Downloads: 1.3 million

Architecture Innovations:

  1. MoE Efficiency: The 236B model uses only 21B active parameters per inference, achieving high performance without prohibitive compute costs.

  2. Multi-Head Latent Attention (MLA): Reduces KV cache size dramatically, enabling faster inference and longer context handling.

Benchmark Performance:

  • HumanEval: 90.2% (new state of the art)
  • MBPP: 76.2% (new state of the art)
  • MATH: 75.7% (near GPT-4o at 76.6%)

Hardware Requirements:

  • 16B (Lite): Single GPU with 40GB VRAM in BF16
  • 236B (Full): 8x 80GB GPUs for BF16 inference

Best For: Professional development environments, code review automation, and complex programming tasks requiring maximum accuracy.

Google Gemma Family: Efficiency Meets Capability

Google's Gemma models leverage technology from the Gemini family in compact, efficient packages. The March 2025 release of Gemma 3 established new standards for what's possible on a single GPU.

Gemma 3 (270M to 27B Parameters)

Gemma 3 is Google's latest and most capable model family that runs on a single GPU, bringing Gemini-class capabilities to local deployment.

Key Specifications:

  • Sizes: 270M (text only), 1B, 4B, 12B, 27B
  • Context Window: 32K tokens (1B), 128K tokens (4B and larger)
  • Languages: 35+ out-of-the-box, 140+ pretrained support
  • Multimodal: 4B and larger process both text and images
  • Downloads: 28.9 million
  • Training Data: 14T tokens (27B), 12T tokens (12B), 4T tokens (4B), 2T tokens (1B)

Architecture Innovations:

Gemma 3 introduces several architectural improvements:

  1. Interleaved Attention Blocks: Each block contains 5 local attention layers (sliding window of 1024) and 1 global attention layer. This captures both short and long-range dependencies efficiently.

  2. Enhanced Positional Encoding: Upgraded RoPE (Rotary Positional Embedding) with base frequency increased from 10K to 1M for global layers, maintaining 10K for local layers.

  3. Improved Normalization: QK-norm for stable attention scores, replacing soft-capping from Gemma 2. Uses Grouped-Query Attention (GQA) with both post-norm and pre-norm RMSNorm.

  4. Memory Efficiency: Architectural changes reduce KV cache overhead during long-context inference compared to global-only attention.
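
The interleaving described in point 1 is easy to picture: most layers use sliding-window (local) attention and every sixth layer attends globally. The toy function below just enumerates that per-layer pattern using the figures quoted above; it is illustrative, not Google's code.

def gemma3_layer_pattern(num_layers, local_per_global=5, window=1024):
    """Label each layer as local (sliding-window) or global attention,
    following the 5:1 interleaving described above."""
    pattern = []
    for i in range(num_layers):
        if (i + 1) % (local_per_global + 1) == 0:
            pattern.append(("global", None))
        else:
            pattern.append(("local", window))
    return pattern

for idx, layer in enumerate(gemma3_layer_pattern(12)):
    print(idx, layer)   # layers 5 and 11 come out as global, the rest local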

Vision Integration (4B+):

  • Vision Encoder: Based on SigLIP for processing images
  • Pan & Scan Algorithm: Adaptively crops and resizes images to handle different aspect ratios
  • Fixed Processing Size: Vision encoder operates on 896x896 square images

Benchmark Performance:

  • MMLU-Pro: 67.5 (strong general knowledge)
  • LiveCodeBench: 29.7 (competitive coding)
  • Bird-SQL: 54.4 (database queries)
  • GPQA Diamond: 42.4 (graduate-level reasoning)
  • MATH: 69.0 (mathematical ability)
  • FACTS Grounding: 74.9 (factual accuracy)
  • MMMU: 64.9 (multimodal understanding)
  • LM Arena Elo: 1338 (top 10 overall, March 2025)

The 27B model outperforms Llama 3.1 405B, DeepSeek-V3, and o3-mini in preliminary human preference evaluations on LMArena.

Additional Features:

  • Function calling and structured output support
  • Official quantized versions available
  • Runs efficiently on workstations, laptops, and smartphones

Hardware Requirements:

  • 270M-1B: Any modern hardware
  • 4B: 6GB VRAM
  • 12B: 12GB VRAM
  • 27B: 20GB+ VRAM

Best For: Multilingual applications, multimodal projects, and situations where you need strong performance with reasonable hardware. The 12B variant is particularly efficient for its capability level.

Gemma 2 (2B, 9B, and 27B Parameters)

The previous generation remains excellent for many applications, offering proven reliability and broad compatibility.

Key Specifications:

  • Sizes: 2B (1.6GB), 9B (5.4GB), 27B (16GB)
  • Context Window: 8K tokens
  • Downloads: 12.3 million

The 27B variant delivers "performance surpassing models more than twice its size" according to Google's benchmarks.

Best For: Creative text generation, chatbots, content summarization, NLP research, and language learning applications where Gemma 3's longer context isn't needed.

Gemma (2B and 7B Parameters)

The original Gemma release from February 2024, lightweight but capable.

Key Specifications:

  • Sizes: 2B (1.7GB), 7B (5.0GB)
  • Context Window: 8K tokens
  • Training: Web documents, code, mathematics

Best For: Edge deployments, resource-constrained environments, and applications needing a small but capable model with Google's quality standards.

CodeGemma (2B and 7B Parameters)

Google's code-specialized variant optimized for IDE integration and code completion.

Key Specifications:

  • Sizes: 2B (1.6GB), 7B (5.0GB)
  • Context Window: 8K tokens
  • Languages: Python, JavaScript, Java, Kotlin, C++, C#, Rust, Go, and others
  • Training: 500 billion tokens including code and mathematics
  • Fill-in-the-Middle: Supported for code completion

Best For: IDE integration, code completion, fill-in-the-middle tasks, and coding assistant applications.

Alibaba Qwen Family: Multilingual Excellence

Qwen models from Alibaba excel at multilingual tasks and offer excellent performance across the capability spectrum. The April 2025 release of Qwen3 introduced revolutionary hybrid reasoning capabilities.

Qwen3 (0.6B to 235B Parameters)

The latest Qwen generation provides both dense and Mixture-of-Experts variants with groundbreaking hybrid reasoning modes.

Key Specifications:

  • Dense Models: 0.6B, 1.7B, 4B, 8B (default), 14B, 32B
  • MoE Models: 30B-A3B (30B total, 3B active), 235B-A22B (235B total, 22B active)
  • Context Window: 32K-128K tokens
  • Languages: 119 languages and dialects
  • Training Data: 36 trillion tokens
  • License: Apache 2.0

Architecture: Dense and MoE Variants

Dense models use traditional transformer architecture where all parameters contribute during inference.

MoE models feature:

  • 128 expert FFNs per layer
  • 8 experts selected per token
  • Extended 128K context support

Revolutionary Hybrid Reasoning Modes:

Qwen3's most significant innovation is unifying two reasoning approaches in one model:

  1. Thinking Mode: The model reasons step-by-step before delivering answers. Ideal for complex problems requiring deeper thought.

  2. Non-Thinking Mode: Quick, near-instant responses for simpler questions where speed matters more than depth.

This eliminates the need to switch between chat-optimized models (like GPT-4o) and dedicated reasoning models (like QwQ-32B). Users can even set a "thinking budget" to balance computational effort against response speed.
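
As a rough illustration, Qwen's documentation describes soft switches (/think and /no_think) that can be appended to a prompt to toggle the two modes. The sketch below sends both variants to a local Ollama server via its REST API; treat the tag handling as an assumption about Qwen3's chat template rather than a guaranteed Ollama feature.

import requests  # assumes a local Ollama server and that `ollama pull qwen3` has been run

def ask(prompt, thinking=True, model="qwen3"):
    # "/think" and "/no_think" are Qwen3's documented soft switches for
    # toggling step-by-step reasoning; behavior may vary by build and template.
    tag = "/think" if thinking else "/no_think"
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": f"{prompt} {tag}", "stream": False},
        timeout=300,
    )
    return resp.json()["response"]

print(ask("Summarize the rules of chess in two sentences.", thinking=False))
print(ask("A train leaves at 3:40pm and arrives at 6:05pm. How long is the trip?", thinking=True))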

Training Process:

Three-stage pretraining:

  1. Stage 1: 30+ trillion tokens at 4K context for basic language skills
  2. Stage 2: Additional 5 trillion tokens emphasizing STEM, coding, and reasoning
  3. Stage 3: High-quality long-context data extending to 32K tokens

Benchmark Performance:

The flagship Qwen3-235B-A22B competes with:

  • OpenAI o1 and o3-mini
  • DeepSeek-R1
  • Google Gemini-2.5-Pro
  • Grok-3

Remarkably, Qwen3-30B-A3B outperforms QwQ-32B despite having 10x fewer activated parameters. Even Qwen3-4B rivals Qwen2.5-72B-Instruct performance.

Best For: Multilingual applications, agent development, creative writing, role-playing, and multi-turn dialogue systems. Excellent for applications that need both quick responses and deep reasoning.

Qwen3-Coder (30B and 480B Parameters)

Alibaba's latest coding models optimized for agentic and coding tasks.

Key Specifications:

  • Sizes: 30B (19GB), 480B (varies)
  • Optimization: Long code contexts
  • Downloads: 1.6 million

Best For: Complex software development, large codebase navigation, and autonomous coding agents.

Qwen3-VL (2B to 235B Parameters)

The most powerful vision-language model in the Qwen family.

Key Specifications:

  • Size Range: 2B to 235B
  • Capabilities: Visual understanding, document analysis, multimodal reasoning
  • Downloads: 881K

Best For: Document processing, visual question answering, and applications requiring both image and text understanding.

Qwen2.5-Coder (0.5B to 32B Parameters)

The state-of-the-art open-source coding model, matching GPT-4o on code repair benchmarks.

Key Specifications:

  • Sizes: 0.5B (398MB), 1.5B, 3B, 7B, 14B, 32B (20GB)
  • Context Window: 128K tokens
  • Programming Languages: 92 supported
  • Training Data: 5.5 trillion tokens
  • Downloads: 9.5 million

Architecture:

Built on Qwen2.5 architecture with:

  • 32B Model: 5,120 hidden size, 40 query heads, 8 key-value heads, 27,648 intermediate size

Benchmark Performance:

  • Aider (code repair): 73.7 (comparable to GPT-4o, 4th overall)
  • MdEval (multi-language repair): 75.2 (#1 among open-source models)
  • McEval (40+ languages): 65.9 (excellent cross-language support)

The model achieves state-of-the-art performance across 10+ benchmarks including code generation, completion, reasoning, and repair.

Best For: Professional development, code generation, code reasoning, and code fixing tasks. The best open-source coding model available.

Qwen2 (0.5B to 72B Parameters)

The previous generation with excellent multilingual support for 29 languages.

Key Specifications:

  • Sizes: 0.5B (352MB), 1.5B (935MB), 7B (4.4GB), 72B (41GB)
  • Context Window: 32K-128K tokens
  • Languages: 29 including major European, Asian, and Middle Eastern languages

Best For: Multilingual chatbots, translation, and cross-lingual applications.

CodeQwen (7B Parameters)

An earlier code-specialized Qwen model with exceptional context length.

Key Specifications:

  • Size: 7B (4.2GB)
  • Context Window: 64K tokens
  • Training: 3 trillion tokens of code data
  • Languages: 92 coding languages

Best For: Long-context code understanding, Text-to-SQL, and bug fixing.

Mistral AI Models: French Excellence

Mistral AI, based in Paris, has produced some of the most efficient and capable open-source models. Their innovative use of Mixture-of-Experts and Sliding Window Attention has influenced the entire field.

Mistral (7B Parameters)

The original Mistral model that proved smaller models could outperform much larger ones through architectural innovation.

Key Specifications:

  • Size: 7B (4.4GB)
  • Context Window: 32K tokens
  • License: Apache 2.0
  • Downloads: 23.6 million

Architecture Innovations:

  • Sliding Window Attention: Trained with 8K context, fixed cache size, theoretical attention span of 128K tokens
  • Grouped Query Attention (GQA): Faster inference and smaller cache
  • Byte-fallback BPE Tokenizer: No out-of-vocabulary tokens
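
A toy attention mask makes the sliding-window idea concrete: each position attends only to itself and a fixed number of preceding positions, so the cache stays bounded while information still propagates across layers. This is a conceptual sketch, not Mistral's actual kernel.

import numpy as np

def sliding_window_mask(seq_len, window=4096):
    """Boolean attention mask: position i may attend to positions j with
    i - window < j <= i (causal and limited to the sliding window)."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

print(sliding_window_mask(6, window=3).astype(int))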

Outperforms Llama 2 13B on all benchmarks and approaches CodeLlama 7B on code tasks.

Hardware Requirements: 24GB RAM and single GPU

Best For: General-purpose applications, chatbots, and situations where you need reliable performance with moderate resources.

Mixtral 8x7B and 8x22B (47B and 141B Total Parameters)

Mistral's groundbreaking Mixture-of-Experts models that use only a fraction of their parameters for each inference.

Key Specifications:

  • Total Parameters: 47B (8x7B), 141B (8x22B)
  • Active Parameters: 13B (8x7B), 39B (8x22B)
  • Size: 26GB (8x7B), 80GB (8x22B)
  • Context Window: 32K tokens (8x7B), 64K tokens (8x22B)
  • Downloads: 1.6 million (8x7B); not listed for 8x22B

Architecture:

Mixtral shares Mistral 7B's architecture with one key difference: each layer contains 8 feedforward blocks (experts) instead of one. A router network selects which 2 experts process each token.

Key features:

  • Sliding Window Attention with broader context support
  • Grouped Query Attention for efficient inference
  • Byte-fallback BPE Tokenizer

Performance:

  • 8x7B: Outperforms Llama 2 70B on most benchmarks with 6x faster inference. Matches or outperforms GPT-3.5 on standard benchmarks.
  • 8x22B: Outperforms ChatGPT 3.5 on MMLU and WinoGrande. Achieves 90.8% on GSM8K (math) and 44.6% on MATH.

Resource Requirements:

  • 8x7B: 64GB RAM, dual GPUs recommended
  • 8x22B: ~90GB VRAM in half-precision, 5.3x slower than 7B, 2.1x slower than 8x7B

Languages: English, French, Italian, German, Spanish (native fluency)

Best For: Applications requiring high capability with better efficiency than pure dense models. Excellent for multilingual European applications.

Microsoft Phi Family: Small But Mighty

Microsoft's Phi models prove that careful training on high-quality synthetic data can create remarkably capable small models. The Phi series represents a different philosophy: quality over quantity in training data.

Phi-4 (14B Parameters)

The latest Phi model, released in December 2024, trained on synthetic datasets and high-quality filtered data with a focus on reasoning.

Key Specifications:

  • Size: 14B (9.1GB)
  • Context Window: 16K tokens
  • Focus: Reasoning and logic
  • Training Data: 16 billion tokens (8.3 billion unique)
  • Downloads: 6.7 million

Training Innovation: Synthetic Data First

Phi-4 represents a paradigm shift in training methodology:

  1. Synthetic Data Generation: GPT-4o rewrote web text, code, scientific papers, and books as exercises, discussions, Q&A pairs, and structured reasoning tasks.

  2. Feedback Loop: GPT-4o critiqued its own outputs and generated improvements.

  3. 50 Dataset Types: Different seeds and multi-stage prompting procedures covering diverse topics, skills, and interaction types. Total: ~400B unweighted tokens.

Phi-4 substantially surpasses its teacher model (GPT-4o) on STEM-focused QA capabilities, demonstrating that synthetic data can produce emergent capabilities beyond the teacher.

Architecture:

Dense decoder-only Transformer with minimal changes from Phi-3:

  • Modified RoPE base frequency to support the extended 16K context
  • Optimized for memory/compute-constrained environments

Best For: Edge deployment, real-time applications, and situations requiring strong reasoning in a compact package.

Phi-4-Reasoning (14B Parameters)

A fine-tuned variant specifically optimized for complex reasoning tasks through supervised fine-tuning and reinforcement learning.

Key Specifications:

  • Size: 14B (11GB)
  • Context Window: 32K tokens
  • Training: SFT + Reinforcement Learning
  • RL Training: Only ~6,400 math-focused problems
  • Downloads: 916K

Training Approach:

  1. Curated Prompts: 1.4M prompts focused on "boundary" cases at the edge of Phi-4's baseline capabilities. Emphasized multi-step reasoning over factual recall.

  2. Synthetic Responses: Generated using o3-mini in high-reasoning mode.

  3. Structured Reasoning: Special <think> and </think> tokens separate intermediate reasoning from final answers, promoting transparency and coherence.
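
When a model like this runs locally, the reasoning trace can be separated from the final answer with a simple parse. A small sketch follows, assuming the response actually contains the <think> tags described above (re-templated or distilled builds may differ).

import re

def split_reasoning(text):
    """Split a response into (reasoning traces, final answer) based on
    <think>...</think> tags."""
    thoughts = re.findall(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return thoughts, answer

sample = "<think>2 + 2 is 4, times 10 is 40.</think>The answer is 40."
print(split_reasoning(sample))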

Benchmark Performance:

Despite only 14B parameters, Phi-4-Reasoning:

  • Outperforms DeepSeek-R1 Distill Llama 70B (5x larger)
  • Approaches full DeepSeek-R1 (671B) on AIME 2025
  • Excels on GPQA-Diamond (graduate-level science)
  • Strong on LiveCodeBench (competitive coding)
  • Generalizes to NP-hard problems (3SAT, TSP)

Best For: Mathematical reasoning, scientific analysis, complex problem-solving, and coding tasks. Exceptional reasoning capability for its size.

Phi-3 (3.8B and 14B Parameters)

The previous generation with excellent efficiency and the first Phi model to achieve widespread adoption.

Key Specifications:

  • Sizes: Mini 3.8B (2.2GB), Medium 14B (7.9GB)
  • Context Window: 128K tokens
  • Training: 3.3 trillion tokens

Best For: Quick prototyping, mobile applications, and situations where Phi-4 is too resource-intensive.

Phi-2 (2.7B Parameters)

Microsoft's earlier small model demonstrating that 2.7B parameters can achieve remarkable capability.

Key Specifications:

  • Size: 2.7B (1.6GB)
  • Context Window: 2K tokens
  • Capabilities: Common-sense reasoning, language understanding

Best For: Extremely constrained environments, quick experiments, and applications where even Phi-3 is too large.

Coding-Specialized Models

Beyond the coding variants of general models, Ollama offers several dedicated code models optimized specifically for software development tasks.

CodeLlama (7B to 70B Parameters)

Meta's code-specialized version of Llama 2, offering specialized variants for different use cases.

Key Specifications:

  • Sizes: 7B (3.8GB), 13B (7.4GB), 34B (19GB), 70B (39GB)
  • Context Window: 16K (2K for 70B)
  • Languages: Python, C++, Java, PHP, TypeScript, C#, Bash
  • Variants: Base, Instruct, Python-specialized
  • Training: 500B tokens (1T for 70B)

Fill-in-the-Middle (FIM) Support:

Important: Infilling is only available in 7B and 13B base models. The 34B and 70B models were trained without the infilling objective.

Use the <FILL_ME> token for code completion in the middle of files.
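
For reference, the Hugging Face build of CodeLlama expands <FILL_ME> into the model's infill prompt format automatically. Below is a minimal sketch assuming the transformers library and the codellama/CodeLlama-7b-hf checkpoint (a multi-gigabyte download); how infill prompts are passed through Ollama depends on the specific model template.

# Fill-in-the-middle sketch with the Hugging Face CodeLlama checkpoint.
# Assumes: pip install transformers torch, plus enough RAM/VRAM for a 7B model.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "codellama/CodeLlama-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

prompt = '''def remove_non_ascii(s: str) -> str:
    """ <FILL_ME>
    return result
'''
inputs = tokenizer(prompt, return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=128)
# Everything after the prompt tokens is the generated infill
infill = tokenizer.batch_decode(outputs[:, inputs.shape[1]:], skip_special_tokens=True)[0]
print(prompt.replace("<FILL_ME>", infill))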

Performance:

  • 34B: 53.7% HumanEval, 56.2% MBPP (comparable to ChatGPT at release)
  • 70B: Highest capability but 2K context limitation
  • 7B/13B: Best for real-time completion due to low latency

Best For: Code completion, generation, review, and fill-in-the-middle tasks. Choose size based on your latency requirements.

StarCoder2 (3B, 7B, and 15B Parameters)

Next-generation open code models from BigCode with full transparency about training data.

Key Specifications:

  • Sizes: 3B (1.7GB), 7B (4.0GB), 15B (9.1GB)
  • Context Window: 16K tokens
  • Languages: 17 (3B, 7B) to 600+ (15B)
  • Training: 3-4 trillion tokens from The Stack v2
  • License: BigCode OpenRAIL-M v1

Architecture:

  • Grouped Query Attention
  • Sliding window attention (4,096 tokens)
  • Fill-in-the-Middle objective training

Training Data (The Stack v2):

  • 67.5 terabytes of code data (4x larger than original StarCoder)
  • Full transparency with SoftWare Heritage persistent IDentifiers (SWHIDs)

Performance:

  • StarCoder2-3B: matches the original StarCoder-15B
  • StarCoder2-15B: matches 33B+ models and outperforms CodeLlama-34B

StarCoder2-15B outperforms DeepSeekCoder-33B on math, code reasoning, and low-resource languages.

Note: StarCoder2-7B underperforms relative to 3B and 15B for unknown reasons.

Best For: Code completion, generation, and applications where transparency about training data matters (compliance, licensing concerns).

WizardCoder (7B and 33B Parameters)

State-of-the-art code generation using innovative Evol-Instruct techniques.

Key Specifications:

  • Sizes: 7B (3.8GB), 33B (19GB)
  • Context Window: 16K tokens
  • Base: Code Llama and DeepSeek Coder

Best For: Advanced code generation tasks requiring high accuracy.

Stable Code 3B (3B Parameters)

Stability AI's efficient code completion model optimized for real-time IDE use.

Key Specifications:

  • Size: 3B (1.6GB)
  • Context Window: 16K tokens
  • Languages: 18 programming languages
  • Feature: Fill-in-the-Middle capability

Best For: IDE integration, real-time code completion, and applications requiring fast inference.

Granite Code (3B to 34B Parameters)

IBM's enterprise-focused decoder-only code models with strong compliance and licensing guarantees.

Key Specifications:

  • Sizes: 3B (2.0GB), 8B (4.6GB), 20B (12GB), 34B (19GB)
  • Context Window: 8K-128K tokens
  • Capabilities: Code generation, explanation, fixing
  • Variants: Base, Instruct, Accelerator

Architecture:

  • Transformer decoder with pre-normalization
  • Multi-Query Attention for efficient inference
  • GELU activation in MLP blocks
  • LayerNorm for activation normalization

Two-Phase Training:

  1. Phase 1: 3-4 trillion tokens from 116 programming languages
  2. Phase 2: 500 billion additional tokens mixing code and natural language for improved reasoning

34B Model Creation (Depth Upscaling):

IBM created the 34B model through depth upscaling:

  1. Remove final 8 layers from first 20B checkpoint
  2. Remove first 8 layers from second 20B checkpoint
  3. Merge to create 88-layer model
  4. Continue training on 1.4T tokens

Performance:

Granite models consistently outperform equivalent-size CodeLlama. Even Granite-3B-Code-Instruct surpasses CodeLlama-34B-Instruct.

Enterprise Features:

  • Training data collected per IBM AI ethics principles
  • IBM legal team guidance for trustworthy enterprise use
  • Available on watsonx.ai and RHEL AI
  • Accelerator versions for reduced latency

Best For: Enterprise environments requiring IBM support, compliance guarantees, and licensing clarity.

Magicoder (7B Parameters)

Code models trained using the innovative OSS-Instruct methodology for reduced training bias.

Key Specifications:

  • Size: 7B (3.8GB)
  • Context Window: 16K tokens
  • Training: 75K synthetic instructions generated from open-source code

Best For: Diverse, realistic code generation with reduced training bias compared to models trained on curated instruction sets.

SQL-Specialized Models

For database work, these specialized models convert natural language to SQL with high accuracy.

SQLCoder (7B and 15B Parameters)

Fine-tuned on StarCoder specifically for SQL generation, slightly outperforming GPT-3.5-turbo on natural language to SQL tasks.

Key Specifications:

  • Sizes: 7B (4.1GB), 15B (9.0GB)
  • Context Window: 8K-32K tokens

Best For: Database querying, business intelligence, and SQL generation from natural language descriptions.

DuckDB-NSQL (7B Parameters)

Specialized for DuckDB SQL generation, optimized for analytics workloads.

Key Specifications:

  • Size: 7B (3.8GB)
  • Context Window: 16K tokens
  • Base: Llama-2 7B with SQL-specific training

Best For: DuckDB-specific applications, analytics workloads, and data engineering tasks.

Vision-Language Models

These models combine text and image understanding for multimodal applications.

LLaVA (7B to 34B Parameters)

Large Language and Vision Assistant, one of the most influential open-source vision-language models.

Key Specifications:

  • Sizes: 7B (4.7GB), 13B (8.0GB), 34B (20GB)
  • Context Window: 4K-32K tokens
  • Capabilities: Visual reasoning, OCR, image captioning
  • Downloads: 12.3 million

Architecture (LLaVA 1.6/LLaVA-NeXT):

  • Vision Encoder: CLIP-ViT-L
  • Vision-Language Connector: MLP (introduced in v1.5 as an upgrade over the original linear projection)
  • Resolution: Up to 672x672 (4x more pixels than v1.5)
  • Aspect Ratios: Supports 672x672, 336x1344, 1344x336

Key Improvements in v1.6:

  1. Enhanced OCR: Replaced TextCaps with DocVQA and SynDog-EN training data
  2. Chart Understanding: Added ChartQA, DVQA, AI2D for diagram comprehension
  3. Better Visual Reasoning: Improved zero-shot performance

Training Efficiency:

LLaVA 1.6 maintains minimalist design:

  • 32 GPUs for ~1 day
  • 1.3M training samples
  • Reuses pretrained connector from v1.5
  • 100-1000x lower compute cost than competing models

Performance: Catches up to Gemini Pro and outperforms Qwen-VL-Plus on selected benchmarks.

2025 Development (LLaVA-Mini):

LLaVA-Mini achieves comparable performance using only 1 vision token instead of 576 (0.17% of original), offering 77% FLOPs reduction and significantly lower GPU memory.

Best For: General visual understanding, document analysis, and multimodal conversations.

LLaVA-Llama3 (8B Parameters)

LLaVA fine-tuned from Llama 3 Instruct with improved benchmark scores.

Key Specifications:

  • Size: 8B (5.5GB)
  • Context Window: 8K tokens
  • Downloads: 2.1 million

Best For: Users who want LLaVA capabilities with Llama 3's improved language understanding.

BakLLaVA (7B Parameters)

Mistral 7B augmented with LLaVA architecture, combining Mistral's efficiency with vision capabilities.

Key Specifications:

  • Size: 7B (4.7GB)
  • Context Window: 32K tokens
  • Downloads: 373K

Best For: Visual understanding with Mistral's efficient architecture and longer context.

MiniCPM-V (8B Parameters)

Efficient multimodal model from OpenBMB, designed to run on edge devices while outperforming much larger models.

Key Specifications:

  • Size: 8B (5.5GB)
  • Architecture: SigLip-400M + Qwen2-7B
  • Context Window: 32K tokens
  • Resolution: Up to 1.8 million pixels (e.g., 1344x1344)
  • Languages: English, Chinese, German, French, Italian, Korean

Token Efficiency:

MiniCPM-V 2.6 produces only 640 tokens when processing a 1.8M pixel image—75% fewer than most models. This improves:

  • Inference speed
  • First-token latency
  • Memory usage
  • Power consumption

Performance:

  • OpenCompass: 65.2 (surpasses GPT-4o mini and Claude 3.5 Sonnet on single-image)
  • State-of-the-art on OCRBench, surpassing GPT-4V and Gemini 1.5 Pro
  • Supports real-time video understanding on edge devices like iPad

2025 Updates:

  • MiniCPM-o 2.6 (January 2025): Adds real-time speech-to-speech conversation and multimodal live streaming. OpenCompass: 70.2
  • MiniCPM-V 4.5: Outperforms GPT-4o-latest, Gemini-2.0 Pro, and Qwen2.5-VL 72B in vision-language capabilities

Best For: High-resolution image analysis, OCR-heavy applications, edge deployment, and efficient multimodal deployment.

Moondream (1.8B Parameters)

Tiny vision-language model designed specifically for edge deployment.

Key Specifications:

  • Size: 1.86B (1.7GB), under 5GB memory required
  • Context Window: ~1000 tokens
  • Base: SigLIP + Phi-1.5 weights
  • Downloads: 472K

Capabilities:

  • Visual question answering
  • Image captioning
  • Object detection (including document layout detection)
  • UI understanding
  • Zero-shot detection (outperforms o3-mini, SmolVLM 2.0, Claude 3 Opus)

2025 Improvements (June 2025 Release):

  • Reinforcement Learning fine-tuning across 55 vision-language tasks
  • Better OCR for documents and tables
  • ScreenSpot F1@0.5: 60.3 (up from 53.3)
  • DocVQA: 79.3 (up from 76.5)
  • TextVQA: 76.3 (up from 74.6)
  • CountBenchQA: 86.4 (up from 80)
  • COCO object detection: 51.2 (up from 30.5)

Edge Deployment:

  • Runs on Raspberry Pi and single-board computers
  • Much faster than larger models like Qwen2.5-VL on edge hardware
  • Designed for quick multimodal tasks on constrained hardware

Best For: Edge devices, mobile applications, and situations requiring vision capabilities with minimal resources.

Embedding Models

Embedding models convert text to numerical vectors for semantic search, retrieval-augmented generation (RAG), and similarity matching.

nomic-embed-text

High-performance text embedding that surpasses OpenAI's ada-002 and text-embedding-3-small, with full reproducibility.

Key Specifications:

  • Parameters: 137 million
  • Size: 274MB
  • Context Window: 8192 tokens (industry-leading for open models)
  • Downloads: 48.7 million
  • License: Apache 2.0

Architecture (nomic-bert-2048):

BERT base with key modifications:

  • Rotary Positional Embeddings (RoPE): Replaces absolute encodings, enables context extrapolation
  • SwiGLU Activation: Replaces GeLU for improved performance
  • Flash Attention: Optimized attention computation
  • Vocabulary Size: Multiple of 64 for efficiency

Training Pipeline:

  1. Stage 1: Self-supervised MLM objective (BERT-style)
  2. Stage 2: Contrastive training with web-scale unsupervised data
  3. Stage 3: Contrastive fine-tuning with 1.6M curated paired samples

Training data: 235 million text pairs (fully disclosed)

Performance:

  • Outperforms text-embedding-ada-002 on short-context MTEB
  • Outperforms text-embedding-3-small on MTEB
  • Outperforms jina-embeddings-v2-base-en on long context (LoCo, Jina benchmarks)
  • Best performing 100M parameter class unsupervised model

Nomic Embed v1.5:

  • Adds Matryoshka Representation Learning for adjustable embedding dimensions
  • Now multimodal: nomic-embed-vision-v1.5 aligned to text embedding space

Best For: Semantic search, similarity matching, RAG applications, and any task requiring high-quality text embeddings.
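
A minimal semantic-search sketch using Ollama's embeddings endpoint, assuming a local server and that nomic-embed-text has already been pulled: embed a query and a few documents, then rank the documents by cosine similarity.

import requests
import numpy as np

def embed(text, model="nomic-embed-text"):
    # Ollama's embeddings endpoint; returns one vector per call.
    r = requests.post("http://localhost:11434/api/embeddings",
                      json={"model": model, "prompt": text}, timeout=60)
    return np.array(r.json()["embedding"])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

docs = ["Ollama runs models locally.",
        "The sky appears blue because of Rayleigh scattering.",
        "Quantization trades quality for memory."]
query = embed("Why is the sky blue?")
ranked = sorted(docs, key=lambda d: cosine(query, embed(d)), reverse=True)
print(ranked[0])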

mxbai-embed-large

State-of-the-art large embedding model from mixedbread.ai.

Key Specifications:

  • Size: 335M (670MB)
  • Context Window: 512 tokens
  • Downloads: 6 million

Achieves top performance among BERT-large models on MTEB benchmark, outperforming OpenAI's commercial embedding.

Best For: High-accuracy embedding applications where quality matters more than speed or context length.

BGE-M3

Versatile multilingual embedding model from BAAI (Beijing Academy of Artificial Intelligence), supporting three retrieval methods in one model.

Key Specifications:

  • Size: 567M (1.2GB)
  • Context Window: 8K tokens
  • Languages: 100+ languages
  • Capabilities: Dense, multi-vector, and sparse retrieval
  • Downloads: 3 million

M3 = Multi-Multi-Multi:

  1. Multi-linguality: 100+ languages
  2. Multi-granularity: Up to 8192 tokens input
  3. Multi-functionality: Three retrieval methods unified

Architecture:

Based on XLM-RoBERTa-large (24 layers, 1024 hidden, 16 heads) with RetroMAE enhancements. Core model: ~550M parameters.

Three Retrieval Methods:

  1. Dense Retrieval: Normalized [CLS] token hidden state
  2. Sparse Retrieval: Linear layer + ReLU on hidden states (outperforms BM25)
  3. Multi-vector (ColBERT-style): Fine-grained query-passage interactions

Hybrid Scoring: s_rank = w1·s_dense + w2·s_lex + w3·s_mul
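
In code, the hybrid score is just a weighted sum of the three per-method relevance scores. The weights below are illustrative placeholders, not BGE-M3's defaults.

def hybrid_score(s_dense, s_lex, s_mul, w=(0.4, 0.3, 0.3)):
    """Weighted combination of dense, sparse (lexical), and multi-vector scores,
    as in the s_rank formula above. Weights are example values."""
    w1, w2, w3 = w
    return w1 * s_dense + w2 * s_lex + w3 * s_mul

# Example: one candidate passage scored by all three retrieval heads
print(hybrid_score(s_dense=0.82, s_lex=0.55, s_mul=0.78))  # 0.727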

Training:

  • Pre-trained on ~1.2 billion unsupervised text pairs
  • Fine-tuned on English, Chinese, and multilingual retrieval datasets
  • Novel self-knowledge distillation approach

Performance:

  • MIRACL (18 languages): nDCG@10 = 70.0 (highest among multilingual embedders)
  • Outperforms mE5 (~65.4 average)
  • Sparse representations outperform BM25 across all tested languages

2025 Updates:

  • Available on NVIDIA NIM and IONOS Cloud for production deployment
  • BGE-VL released for multimodal embedding (MIT license)

Best For: Multilingual retrieval, cross-lingual search, applications requiring variable text lengths, and hybrid retrieval systems.

all-minilm

Lightweight embedding model for resource-constrained environments.

Key Specifications:

  • Sizes: 46MB and 67MB variants
  • Context Window: 512 tokens
  • Downloads: 2.1 million

Best For: Quick prototyping, edge deployment, and applications where embedding model size matters.

Snowflake Arctic Embed

Retrieval-optimized embeddings from Snowflake, designed for production RAG pipelines.

Key Specifications:

  • Sizes: 22M (46MB), 33M (67MB), 110M (219MB), 137M (274MB), 335M (669MB)
  • Context Window: 512-2K tokens

Best For: Retrieval-focused applications, search systems, and production RAG pipelines.

Enterprise and Specialized Models

Command R (35B Parameters)

Cohere's model optimized for RAG and tool integration, designed for enterprise-scale deployments.

Key Specifications:

  • Size: 35B (19GB)
  • Context Window: 128K tokens
  • Languages: 10+ languages (English, French, Spanish, Italian, German, Portuguese, Japanese, Korean, Chinese, Arabic)

Architecture:

Auto-regressive transformer with:

  • Supervised Fine-Tuning (SFT)
  • Preference training for human alignment

RAG Capabilities:

Command R is specifically designed for retrieval-augmented generation:

  • Grounded summarization with source citations
  • Verifiable outputs with grounding spans
  • Optimized for working with Cohere's Embed and Rerank models
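
A bare-bones local RAG loop looks like the sketch below: retrieve the most relevant snippets with an embedding model, then ask the chat model to answer from those snippets only. The model tags, prompt wording, and single-vector retrieval are simplifying assumptions; Command R's own grounding and citation template is more elaborate.

import requests
import numpy as np

OLLAMA = "http://localhost:11434"  # assumes a local Ollama server

def embed(text, model="nomic-embed-text"):
    r = requests.post(f"{OLLAMA}/api/embeddings", json={"model": model, "prompt": text}, timeout=60)
    return np.array(r.json()["embedding"])

def generate(prompt, model="command-r"):  # model tag is an assumption
    r = requests.post(f"{OLLAMA}/api/generate",
                      json={"model": model, "prompt": prompt, "stream": False}, timeout=300)
    return r.json()["response"]

docs = [
    "Refunds are available within 30 days of purchase.",
    "Support is open Monday to Friday, 9am to 5pm CET.",
    "Enterprise plans include a dedicated account manager.",
]
doc_vecs = [embed(d) for d in docs]

def answer(question, top_k=2):
    q = embed(question)
    scores = [float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v))) for v in doc_vecs]
    context = "\n".join(docs[i] for i in np.argsort(scores)[::-1][:top_k])
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return generate(prompt)

print(answer("When can I get a refund?"))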

Tool Use:

  • Single-step: Multiple tools called simultaneously in one step
  • Multi-step: Sequential tool calls using previous results
  • Enables dynamic actions based on external information

Best For: Enterprise RAG applications, tool-using agents, and production chatbots requiring high throughput.

Command R+ (104B Parameters)

Cohere's flagship model with advanced multi-step reasoning and RAG capabilities.

Key Specifications:

  • Size: 104B
  • Context Window: 128K tokens
  • Languages: 10+ languages

August 2024 Update:

  • 50% higher throughput
  • 25% lower latencies
  • Same hardware footprint

Best For: Complex RAG workflows, multi-step tool use, and enterprise applications requiring maximum capability.

Aya (8B and 35B Parameters)

Cohere's multilingual model supporting 23 languages with strong cross-lingual performance.

Key Specifications:

  • Sizes: 8B (4.8GB), 35B (20GB)
  • Context Window: 8K tokens
  • Downloads: 213K

Best For: Multilingual applications requiring strong cross-lingual performance.

Solar (10.7B Parameters)

Upstage's efficient model using innovative Depth Up-Scaling, outperforming models up to 30B parameters.

Key Specifications:

  • Size: 10.7B (6.1GB)
  • Context Window: 4K tokens
  • Base: Llama 2 architecture with Mistral weights

Outperforms Mixtral 8x7B on H6 benchmarks despite having fewer total parameters.

Best For: Single-turn conversations and applications where efficiency matters.

Nemotron (70B Parameters)

NVIDIA-customized Llama 3.1 for enhanced response quality, optimized for the NVIDIA ecosystem.

Key Specifications:

  • Size: 70B (43GB)
  • Context Window: 128K tokens
  • Training: RLHF with REINFORCE algorithm

Best For: Enterprise applications requiring NVIDIA ecosystem integration and high-quality responses.

InternLM2 (1.8B to 20B Parameters)

Shanghai AI Lab's model with outstanding reasoning and tool utilization capabilities.

Key Specifications:

  • Sizes: 1.8B (1.1GB), 7B (4.5GB), 20B (11GB)
  • Context Window: 32K-256K tokens

Best For: Mathematical reasoning, tool utilization, and web browsing applications.

Yi (6B to 34B Parameters)

01.ai's bilingual English-Chinese models trained on 3 trillion tokens.

Key Specifications:

  • Sizes: 6B (3.5GB), 9B (5.0GB), 34B (19GB)
  • Context Window: 4K tokens
  • Training: 3 trillion tokens

Best For: English-Chinese bilingual applications and research.

Community and Fine-Tuned Models

OpenHermes (7B Parameters)

Teknium's state-of-the-art fine-tune on Mistral using carefully curated open datasets.

Key Specifications:

  • Size: 7B (4.1GB)
  • Context Window: 32K tokens
  • Training: 1,000,000 entries primarily from GPT-4

Training Data:

  • ~1 million dialogue entries, primarily GPT-4 generated
  • 7-14% programming instructions
  • Converted to ShareGPT format, then ChatML via axolotl
  • Extensive filtering of public datasets

Training Approach:

  1. Supervised fine-tuning on multi-turn conversations
  2. Preference data rated by GPT-4
  3. Distilled Direct Preference Optimization (dDPO)

Interesting Finding: Including 7-14% code instructions boosted non-code benchmarks (TruthfulQA, AGIEval, GPT4All) while slightly reducing BigBench.

Performance:

  • GPT4All average: 73.12
  • AGIEval: 43.07
  • TruthfulQA: 53.04
  • HumanEval pass@1: 50.7%
  • Matches larger 70B models on certain benchmarks

Best For: Multi-turn conversations, coding tasks, and applications requiring strong instruction-following.

Dolphin-Mixtral (8x7B and 8x22B)

Uncensored fine-tune of Mixtral optimized for coding and unrestricted responses.

Key Specifications:

  • Sizes: 8x7B (26GB), 8x22B (80GB)
  • Context Window: 32K-64K tokens
  • Downloads: 799K

Best For: Uncensored coding assistance and creative applications.

Zephyr (7B and 141B Parameters)

HuggingFace's helpful assistant models, optimized for user assistance.

Key Specifications:

  • Sizes: 7B (4.1GB), 141B (80GB)
  • Context Window: 32K-64K tokens
  • Downloads: 338K

Best For: Helpful, conversational applications prioritizing user assistance.

OpenChat (7B Parameters)

C-RLFT trained model that surpasses ChatGPT on various benchmarks.

Key Specifications:

  • Size: 7B (4.1GB)
  • Context Window: 8K tokens
  • Downloads: 253K

Best For: Chat applications requiring strong open-source performance.

Nous-Hermes 2 (10.7B and 34B Parameters)

Nous Research's scientific and coding-focused models.

Key Specifications:

  • Sizes: 10.7B (6.1GB), 34B (19GB)
  • Context Window: 4K tokens
  • Downloads: 196K

Best For: Scientific discussion, coding tasks, and research applications.

Samantha-Mistral (7B Parameters)

Eric Hartford's companion assistant trained on philosophy and psychology.

Key Specifications:

  • Size: 7B (4.1GB)
  • Context Window: 32K tokens
  • Downloads: 159K

Best For: Conversational AI emphasizing personal development and relationship coaching.

Vicuna (7B to 33B Parameters)

LMSYS's chat assistant trained on ShareGPT conversations.

Key Specifications:

  • Sizes: 7B (3.8GB), 13B (7.4GB), 33B (18GB)
  • Context Window: 2K-16K tokens

Best For: General chat applications and fine-tuning experiments.

Orca-Mini (3B to 70B Parameters)

Llama-based models trained using Orca methodology for learning complex reasoning patterns.

Key Specifications:

  • Sizes: 3B (2.0GB), 7B (3.8GB), 13B (7.4GB), 70B (39GB)
  • Context Window: Various

Best For: Entry-level hardware deployments and learning complex reasoning patterns.

Neural Chat (7B Parameters)

Intel's Mistral-based model for high-performance chatbots, optimized for Intel hardware.

Key Specifications:

  • Size: 7B (4.1GB)
  • Context Window: 32K tokens
  • Downloads: 198K

Best For: Chatbot applications optimized for Intel hardware.

TinyLlama (1.1B Parameters)

Compact Llama trained on 3 trillion tokens, demonstrating that tiny models can be surprisingly capable.

Key Specifications:

  • Size: 1.1B (638MB)
  • Context Window: 2K tokens
  • Downloads: 3.2 million

Best For: Extremely constrained environments and minimal footprint deployments.

EverythingLM (13B Parameters)

Uncensored Llama 2 with extended 16K context.

Key Specifications:

  • Size: 13B (7.4GB)
  • Context Window: 16K tokens
  • Downloads: 91K

Best For: Extended context applications without content restrictions.

Notux (8x7B Parameters)

Optimized Mixtral variant with improved fine-tuning.

Key Specifications:

  • Size: 8x7B (26GB)
  • Context Window: 32K tokens

Best For: Users wanting improved Mixtral performance through fine-tuning.

XWinLM (7B and 13B Parameters)

Llama 2-based model with competitive benchmark performance.

Key Specifications:

  • Sizes: 7B (3.8GB), 13B (7.4GB)
  • Context Window: 4K tokens
  • Downloads: 143K

Best For: General chat and alternative to base Llama 2.

Domain-Specific Models

Meditron (7B and 70B Parameters)

Medical-specialized model from EPFL, designed for healthcare applications.

Key Specifications:

  • Sizes: 7B (3.8GB), 70B (39GB)
  • Context Window: 2K-4K tokens

Outperforms Llama 2, GPT-3.5, and Flan-PaLM on many medical reasoning tasks.

Best For: Medical question answering, differential diagnosis support, and health information (with appropriate clinical oversight).

Important: Not a substitute for professional medical advice. Requires clinical oversight for any healthcare applications.

MedLlama2 (7B Parameters)

Llama 2 fine-tuned on MedQA dataset for medical question-answering.

Key Specifications:

  • Size: 7B (3.8GB)
  • Context Window: 4K tokens
  • Downloads: 114K

Best For: Medical question-answering and research (not for clinical use).

Wizard-Math (7B to 70B Parameters)

Mathematical reasoning specialist optimized for problem-solving and computational tasks.

Key Specifications:

  • Sizes: 7B (4.1GB), 13B (7.4GB), 70B (39GB)
  • Context Window: 2K-32K tokens
  • Downloads: 164K

Best For: Mathematical problem-solving, tutoring applications, and computational reasoning.

FunctionGemma (270M Parameters)

Google's Gemma 3 variant fine-tuned for function calling, enabling reliable tool use in agents.

Key Specifications:

  • Size: 270M
  • Specialization: Tool and function calling
  • Downloads: 13K

Best For: Agent development and applications requiring reliable function calling.

Multilingual Models

StableLM2 (1.6B and 12B Parameters)

Stability AI's multilingual model optimized for European languages.

Key Specifications:

  • Sizes: 1.6B (983MB), 12B (7.0GB)
  • Context Window: 4K tokens
  • Languages: English, Spanish, German, Italian, French, Portuguese, Dutch
  • Downloads: 179K

Best For: Multilingual European applications with moderate resource requirements.

Falcon (7B to 180B Parameters)

Technology Innovation Institute's multilingual models with massive scale options.

Key Specifications:

  • Sizes: 7B (4.2GB), 40B (24GB), 180B (101GB)
  • Context Window: 2K tokens

The 180B variant performs between GPT-3.5 and GPT-4 levels on many benchmarks.

Best For: High-capability multilingual applications and research.

Hardware Requirements Quick Reference

  • 1-3B models: 4GB VRAM or 8GB RAM (any modern GPU)
  • 7-8B models: 8GB VRAM or 16GB RAM (RTX 3060, RTX 4060)
  • 13-14B models: 12GB VRAM or 24GB RAM (RTX 3060 12GB, RTX 4070)
  • 32-34B models: 24GB VRAM or 48GB RAM (RTX 4090, A6000)
  • 70B models: 48GB+ VRAM or 64GB+ RAM (multiple GPUs, Apple Silicon)
  • 100B+ models: specialized hardware, 128GB+ RAM (enterprise infrastructure)

Apple Silicon Recommendations:

  • M1/M2 (16GB): 7-8B models comfortably
  • M2 Pro/M3 Pro (32GB): Up to 32B models, 70B with slow speed
  • M3 Max (128GB): 70B models at usable speeds

Quantization Impact:

  • Q4 (4-bit): 75% size reduction, minimal quality loss
  • Q8 (8-bit): Higher quality, more memory
  • Q2-Q3: Maximum compression, noticeable quality degradation
  • Recommended: Q4_K_M for best balance
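
A quick back-of-the-envelope calculation ties these numbers together: memory is roughly parameter count times bits per weight, plus overhead for embeddings, metadata, and runtime buffers. The 20% overhead factor below is an assumption, but it lands close to the published sizes (Llama 3.1 8B at Q4 is listed as 4.9GB above).

def approx_size_gb(params_billions, bits_per_weight, overhead=1.2):
    """Rough model-footprint estimate: params * bits / 8, with ~20% overhead
    for embeddings, metadata, and runtime buffers (overhead is an assumption)."""
    return params_billions * bits_per_weight / 8 * overhead

for bits in (16, 8, 4):
    print(f"8B model at {bits}-bit: ~{approx_size_gb(8, bits):.1f} GB")
# ~19.2 GB at FP16, ~9.6 GB at Q8, ~4.8 GB at Q4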

How to Choose the Right Model

For General Chat and Assistance

  • Budget hardware: Llama 3.2 3B, Phi-3 Mini, Gemma 2 2B
  • Standard hardware: Llama 3.1 8B, Mistral 7B, Gemma 3 12B
  • High-end hardware: Llama 3.3 70B, Qwen3 32B

For Coding and Development

  • Quick completions: Stable Code 3B, CodeGemma 2B
  • General coding: Qwen2.5-Coder 7B, DeepSeek-Coder 6.7B
  • Maximum quality: Qwen2.5-Coder 32B, DeepSeek-Coder-V2 16B

For Reasoning and Analysis

  • Efficient reasoning: Phi-4-Reasoning, DeepSeek-R1 14B
  • Maximum capability: DeepSeek-R1 70B, Qwen3 32B

For Image Understanding

  • Lightweight: Moondream, LLaVA 7B
  • Balanced: MiniCPM-V, Gemma 3 12B
  • Maximum capability: Llama 3.2-Vision 90B, Qwen3-VL

For Multilingual Applications

  • European languages: Mixtral 8x7B, StableLM2
  • Asian languages: Qwen3, Yi
  • 100+ languages: BGE-M3, Qwen2

For Embeddings and RAG

  • Standard embedding: nomic-embed-text, all-minilm
  • High-quality embedding: mxbai-embed-large, BGE-M3
  • RAG systems: Command R with your embedding choice

Getting Started with Ollama

Installing Ollama and running your first model takes just a few minutes:

  1. Install Ollama: Download from ollama.com for Windows, Mac, or Linux
  2. Pull a model: ollama pull llama3.1
  3. Start chatting: ollama run llama3.1

For integration with applications, Ollama provides a REST API at localhost:11434:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Why is the sky blue?"
}'
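
The same endpoint is easy to call from code. Here is a small Python sketch using the requests library; with "stream": false the server returns a single JSON object instead of a stream of chunks.

import requests

def generate(prompt, model="llama3.1", host="http://localhost:11434"):
    """Call Ollama's /api/generate endpoint and return the full response text."""
    resp = requests.post(
        f"{host}/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]

print(generate("Why is the sky blue?"))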

Using Local AI on Practical Web Tools

If you want to experience local AI without any setup, try our AI Chat feature. It connects to your local Ollama installation, providing a polished interface while keeping all processing on your machine. Your prompts never touch our servers, maintaining complete privacy.

The interface works with any Ollama model. Simply select your preferred model and start chatting. Combined with our privacy-focused file conversion tools, you can build complete local workflows without sending sensitive data to the cloud.

Frequently Asked Questions

What is the best Ollama model for beginners?

Start with Llama 3.1 8B. It runs on most hardware (8GB VRAM or 16GB RAM), provides excellent quality across diverse tasks, and has the largest community support. Once comfortable, explore specialized models based on your specific needs.

How much VRAM do I need for Ollama?

For 7-8B models, 8GB VRAM is sufficient. For 13-14B models, aim for 12GB. For 32B+ models, you need 24GB or more. Alternatively, models can run from system RAM at reduced speed; plan for roughly twice as much RAM as the VRAM figure.

What is the fastest Ollama model?

The fastest capable models are Llama 3.2 1B and Phi-3 Mini, which generate 100+ tokens per second on modest hardware. For usable quality, Llama 3.1 8B at 40-70 tokens per second on modern GPUs offers the best speed/quality balance.

Which Ollama model is best for coding?

Qwen2.5-Coder 32B offers the best quality, matching GPT-4o on code repair benchmarks. For smaller hardware, Qwen2.5-Coder 7B or DeepSeek-Coder 6.7B provide excellent results. StarCoder2 15B offers transparency about training data.

Can Ollama models process images?

Yes. Llama 3.2-Vision, LLaVA, MiniCPM-V, BakLLaVA, Moondream, and Gemma 3 (4B+) all process images. MiniCPM-V and LLaVA 1.6 offer the best image understanding for their size.

What is the difference between quantization levels?

Q4 uses 4 bits per parameter, reducing model size by 75% with minimal quality loss. Q8 uses 8 bits for higher quality but more memory. Q2-Q3 saves more memory but noticeably degrades quality. For most uses, Q4_K_M is the sweet spot.

How do I choose between Llama, Mistral, and Qwen?

Llama has the largest ecosystem and broadest support. Mistral offers excellent efficiency and European language performance. Qwen excels at multilingual tasks (especially Asian languages) and provides strong coding variants. Try each for your specific task.

Are these models safe to use?

Most models include safety training. However, "uncensored" variants (Llama 2 Uncensored, Dolphin-Mixtral) have guardrails removed and should be used responsibly. Always implement appropriate safeguards for production applications.

How do Ollama models compare to ChatGPT?

Llama 3.1 70B and DeepSeek-R1 70B approach GPT-4 quality for many tasks. For everyday use, Llama 3.1 8B competes with GPT-3.5. The gap has narrowed significantly, though frontier models still lead on the most complex reasoning.

Can I fine-tune Ollama models?

Ollama itself runs pre-existing models. For fine-tuning, use the base models from HuggingFace with tools like Axolotl or PEFT, then import the fine-tuned weights into Ollama.

What is the best model for mathematical reasoning?

DeepSeek-R1 and Phi-4-Reasoning lead in mathematical reasoning. Phi-4-Reasoning is remarkable for its size, matching much larger models on math olympiad problems. For maximum capability, DeepSeek-R1 70B or the full 671B model approach frontier performance.

Which models have the longest context windows?

Qwen3 supports up to 256K tokens. Llama 3.1/3.2/3.3 and DeepSeek-R1 support 128K. Gemma 3 (4B+) supports 128K. For embedding, BGE-M3 and nomic-embed-text support 8K tokens.

Conclusion

The Ollama model library offers something for every use case, from tiny 270M parameter edge models to massive 671B reasoning systems. The key is matching model capabilities to your actual needs rather than always choosing the largest option.

For most users, starting with Llama 3.1 8B provides an excellent foundation. As you identify specific needs—whether coding, reasoning, multilingual support, or image understanding—explore the specialized models in those categories.

Local AI has reached a maturity where quality rivals cloud APIs for many tasks, while offering complete privacy, zero ongoing costs, and offline capability. With Ollama making deployment trivial, the only barrier is choosing your first model.

Start experimenting today with our AI Chat feature, which connects seamlessly to your local Ollama installation for a polished, private AI experience.


Model information current as of December 2025. Download counts and specifications updated regularly by Ollama. Always check ollama.com/library for the latest models and versions.
