Complete Ollama Models Guide 2025 - Every Model Explained | Practical Web Tools
What are the best Ollama models in 2025? Ollama now offers over 100 open-source AI models for local deployment, ranging from tiny 270M parameter models to massive 671B reasoning systems. The most popular choices are Llama 3.1 8B for general use (108M+ downloads), DeepSeek-R1 for advanced reasoning (75M+ downloads), and Gemma 3 for efficient multimodal tasks (28M+ downloads). This guide covers every model available on Ollama, helping you choose the right one for your specific needs.
Running AI locally has become essential for developers, researchers, and businesses who need privacy, cost control, and offline capability. With Ollama making local deployment as simple as running a single command, the only question remaining is which model to choose.
This comprehensive guide examines every model in the Ollama library, providing the technical details, performance characteristics, and practical recommendations you need to make informed decisions.
What Is Ollama and Why Does Model Selection Matter?
Ollama is an open-source platform that simplifies running large language models locally on your hardware. Instead of sending data to cloud APIs like OpenAI or Anthropic, you download models once and run them entirely on your machine. Your data never leaves your device.
The platform handles the complexity of model quantization, memory management, and optimization automatically. You run `ollama run llama3.1` and start chatting within minutes.
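For orientation, here is the basic CLI workflow; the model name is just an example from the Ollama library, and any other tag works the same way:

```bash
# Download a model without starting a chat session
ollama pull llama3.1

# Start an interactive chat (pulls the model first if it is missing)
ollama run llama3.1

# See which models are already on disk, and remove ones you no longer need
ollama list
ollama rm llama3.1
```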
Model selection matters because each model has different strengths:
- Parameter count affects capability and memory requirements
- Training focus determines whether models excel at code, reasoning, or conversation
- Quantization level trades quality for speed and memory efficiency
- Context window limits how much text the model can process at once
- Architecture type (dense vs. Mixture-of-Experts) impacts efficiency and specialization
Choosing the wrong model wastes hardware resources or leaves performance on the table. This guide helps you match models to your actual needs.
The Meta Llama Family: The Foundation of Local AI
Meta's Llama models form the backbone of local AI. They are the most widely used, best supported, and most thoroughly tested models available. The December 2024 release of Llama 3.3 changed the conversation around open-source large language models, delivering performance comparable to much larger models at a fraction of the computational cost.
Llama 3.3 (70B Parameters)
Llama 3.3 is Meta's latest flagship model, released in December 2024. It offers performance comparable to the much larger Llama 3.1 405B while requiring only 43GB of storage, representing a major advancement in efficient model design.
Key Specifications:
- Parameters: 70 billion
- Context Window: 128K tokens
- Size: 43GB
- Languages: English, German, French, Italian, Portuguese, Hindi, Spanish, Thai
- Downloads: 2.9 million
- Training Data: 15 trillion tokens from public sources (7x larger than Llama 2)
- License: Llama 3.3 Community License
Architecture Details:
Llama 3.3 is an auto-regressive language model using an optimized transformer architecture with several key innovations:
- Grouped-Query Attention (GQA): Improves inference scalability and efficiency
- 128K Vocabulary Tokenizer: Encodes language more efficiently than previous versions
- Supervised Fine-Tuning (SFT): Aligns model behavior with human preferences
- Reinforcement Learning with Human Feedback (RLHF): Ensures helpfulness and safety
Benchmark Performance:
| Benchmark | Llama 3.3 70B | Comparison |
|---|---|---|
| MMLU Chat (0-shot, CoT) | 86.0 | Matches Llama 3.1 70B, competitive with Amazon Nova Pro (85.9) |
| MMLU PRO (5-shot, CoT) | 68.9 | Improved over Llama 3.1 70B |
| GPQA Diamond (0-shot, CoT) | 50.5 | Better than Llama 3.1 70B (48.0) |
| HumanEval (0-shot) | 88.4 | Near Llama 3.1 405B (89.0) |
| MBPP EvalPlus | 87.6 | Slight improvement over Llama 3.1 70B (86.0) |
| MATH (0-shot, CoT) | 77.0 | Major improvement over Llama 3.1 70B (67.8) |
| MGSM (0-shot) | 91.1 | Substantial improvement over Llama 3.1 70B (86.9) |
| IFEval | 92.1 | Excellent instruction-following |
Inference Performance:
- Achieves 276 tokens/second on Groq hardware (25 tokens/second faster than Llama 3.1 70B)
- NVIDIA TensorRT-LLM with speculative decoding achieves up to 3.55x throughput speedup on HGX H200
Cost Efficiency:
- Input tokens: $0.10 per million (vs. $1.00 for Llama 3.1 405B)
- Output tokens: $0.40 per million (vs. $1.80 for Llama 3.1 405B)
Best For: Users who need maximum capability and have RTX 4090 or Apple Silicon with 64GB+ memory. This model approaches GPT-4 quality for many tasks while running locally.
Hardware Requirements: Minimum 64GB RAM or 24GB VRAM with CPU offloading. Runs well on M2 Max or M3 Max MacBooks.
Llama 3.2 (1B and 3B Parameters)
Llama 3.2 represents Meta's push into efficient, edge-deployable models. Released in September 2024, these are designed for devices with limited resources and represent a new era of on-device AI.
Key Specifications:
- Parameters: 1B (1.3GB) or 3B (2.0GB)
- Context Window: 128K tokens
- Languages: 8 officially supported (English, German, French, Italian, Portuguese, Hindi, Spanish, Thai)
- Training Data: Up to 9 trillion tokens from public sources
- Downloads: 51 million
Architecture & Training Innovation:
The 1B and 3B models were created using two innovative techniques:
- Pruning: Started with Llama 3.1 8B and systematically removed less critical network components
- Knowledge Distillation: Used logits from Llama 3.1 8B and 70B as token-level targets during training
This approach allowed Meta to create models that retain much of the capability of larger models in a fraction of the size.
Edge Deployment Features:
- Compatible with Qualcomm, MediaTek, and ARM processors
- Designed for mobile and IoT applications
- Instantaneous local processing without cloud latency
- Complete data privacy with no cloud transmission
- Works on devices with as little as 8GB RAM
Benchmark Performance:
| Benchmark | Llama 3.2 1B | Llama 3.2 3B | Gemma 2B IT | Phi 3.5-mini IT |
|---|---|---|---|---|
| MMLU (5-shot) | - | 63.4 | 57.8 | 69.0 |
| IFEval | - | 77.4 | 61.9 | 59.2 |
| TLDR9 (summarization) | 16.8 | 19.0 | - | - |
| BFCL V2 (tool use) | 25.7 | 67.0 | - | - |
The 3B model outperforms Gemma 2 2.6B and Phi 3.5-mini on instruction-following, summarization, and tool use benchmarks. Notably, Llama 3.2 3B significantly outperformed the original GPT-4 on the MATH benchmark.
Best For: Mobile development, IoT applications, and situations where you need AI on resource-constrained devices. Also excellent for rapid prototyping when speed matters more than maximum quality.
Hardware Requirements: Runs on any modern hardware. Even laptops with 8GB RAM handle these models comfortably.
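As a quick illustration, the smaller variants are selected with size tags; the tags below match the Ollama library at the time of writing but may change:

```bash
# 1B variant: ~1.3GB download, comfortable on CPUs and 8GB-RAM laptops
ollama run llama3.2:1b "Summarize: local models keep data on the device."

# 3B variant: noticeably better quality, still lightweight
ollama run llama3.2:3b
```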
Llama 3.2-Vision (11B and 90B Parameters)
The vision variants are Meta's first Llama models to support vision tasks, adding powerful image understanding capabilities to the Llama architecture through a novel adapter-based approach.
Key Specifications:
- Parameters: 11B (7.8GB) or 90B (55GB)
- Context Window: 128K tokens
- Capabilities: Image reasoning, captioning, visual question answering, OCR, object detection
- Input: Text and images (up to 1120x1120 pixels)
- Training Data: 6 billion image-text pairs
Architecture Innovation:
Llama 3.2 Vision introduces a unique architecture combining:
- Base Language Model: Llama 3.1 8B (for 11B) or Llama 3.1 70B (for 90B)
- Vision Encoder: Separately trained image processing component
- Cross-Attention Adapters: Connect image representations to the language model
The key innovation is that the language model parameters remained frozen during vision adapter training, preserving all text-only capabilities. This means Llama 3.2-Vision serves as a drop-in replacement for Llama 3.1 for text tasks.
Image Processing Capabilities:
- High-resolution support up to 1120x1120 pixels
- Object classification and identification
- Image-to-text transcription (including handwriting) via OCR
- Chart and graph understanding
- Document analysis and data extraction
- Contextual visual Q&A
- Image comparison
Grouped-Query Attention (GQA): All models support GQA for faster inference, particularly beneficial for the larger 90B model.
Best For: Applications requiring image analysis: document processing, visual content moderation, image-based research assistance, visual accessibility tools. The 11B variant is the sweet spot for most users.
Limitations: Image+text combinations only support English. Text-only tasks support the full 8-language set.
Hardware Requirements:
- 11B: 16GB VRAM or 32GB RAM
- 90B: 64GB+ VRAM or distributed setup
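A rough sketch of sending an image through the REST API: vision models accept base64-encoded images in an images array. The tag and file name below are illustrative.

```bash
# Encode a local image (GNU coreutils; on macOS use `base64 -i photo.jpg`)
IMG=$(base64 -w0 photo.jpg)

# Ask the 11B vision model to describe and transcribe it
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2-vision",
  "prompt": "Describe this image and transcribe any text in it.",
  "images": ["'"$IMG"'"],
  "stream": false
}'
```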
Llama 3.1 (8B, 70B, and 405B Parameters)
Llama 3.1 remains the workhorse of local AI, with the 8B version being the most downloaded model on Ollama at over 108 million downloads. The 405B variant was the first openly available model to rival GPT-4 and Claude 3 Opus in capability.
Key Specifications:
- Sizes: 8B (4.9GB), 70B (43GB), 405B (243GB)
- Context Window: 128K tokens
- Capabilities: Tool use, multilingual, long-form summarization, coding
- Training Data: 15 trillion tokens
Architectural Improvements over Llama 3:
- Extended context window from 8K to 128K tokens
- Improved tokenizer efficiency
- Enhanced multilingual capabilities
- Native tool use and function calling support
Best For:
- 8B: Everyday professional work, document summarization, code generation, content drafting. The best balance of capability and accessibility.
- 70B: Complex analysis, detailed reasoning, high-stakes professional applications
- 405B: Research and enterprise applications requiring maximum capability. First open model to truly compete with GPT-4.
Hardware Requirements:
- 8B: 8GB VRAM or 16GB RAM
- 70B: 64GB RAM or distributed GPU setup
- 405B: Multiple high-end GPUs or specialized infrastructure (typically 8x 80GB GPUs)
Llama 3 (8B and 70B Parameters)
The previous generation remains useful for applications optimized for its architecture, though Llama 3.1 is recommended for new projects.
Key Specifications:
- Sizes: 8B (4.7GB), 70B (40GB)
- Context Window: 8K tokens
- Downloads: 13.2 million
Best For: Legacy compatibility or when the shorter 8K context window is sufficient. For new projects, Llama 3.1 is generally the better choice thanks to its larger context window and improved capabilities.
Llama 2 (7B, 13B, and 70B Parameters)
The foundation that started the open-source AI revolution in July 2023.
Key Specifications:
- Sizes: 7B (3.8GB), 13B (7.4GB), 70B (39GB)
- Context Window: 4K tokens
- Training: 2 trillion tokens
- Downloads: 4.9 million
Best For: Research comparisons, fine-tuning base models, or applications where you have existing Llama 2 infrastructure. Historical significance as the model that democratized large language model access.
Llama 2 Uncensored (7B and 70B Parameters)
A variant of Llama 2 with safety guardrails removed, created using Eric Hartford's uncensoring methodology.
Key Specifications:
- Sizes: 7B (3.8GB), 70B (39GB)
- Context Window: 2K tokens
- Downloads: 1.5 million
Best For: Research purposes, creative writing without restrictions, or applications where you need the model to engage with topics the standard version refuses.
Caution: Use responsibly. The lack of guardrails means the model will attempt to comply with any request. Not recommended for production applications without additional safety measures.
DeepSeek Models: The Reasoning Revolution
DeepSeek has emerged as a major force in open-source AI, particularly with reasoning-focused models. Their January 2025 release of DeepSeek-R1 demonstrated that reinforcement learning alone could produce emergent reasoning capabilities rivaling frontier closed-source models—at a fraction of the training cost.
DeepSeek-R1 (1.5B to 671B Parameters)
DeepSeek-R1 is a family of open reasoning models that approach the performance of OpenAI's o1 and Google's Gemini 2.5 Pro. The full model represents one of the most significant open-source AI releases of 2025.
Key Specifications:
- Sizes: 1.5B (1.1GB), 7B (4.7GB), 8B (5.2GB), 14B (9.0GB), 32B (20GB), 70B (43GB), 671B (404GB)
- Context Window: 128K-160K tokens
- Downloads: 75.2 million
- License: MIT (fully permissive)
- Training Cost: Approximately $5.6 million (significantly lower than competing models)
Architecture: Mixture of Experts (MoE)
DeepSeek-R1 leverages a sophisticated MoE framework:
- Total Parameters: 671 billion
- Active Parameters: Only 37 billion per inference (5.5% activation rate)
- Experts per Layer: 256 routed experts
- Selected Experts: 8 routed experts per token
Key architectural innovations include:
- Multi-Head Latent Attention (MLA): Dramatically reduces KV cache size, a common bottleneck in transformers, enabling faster inference and longer text generation.
- Expert Routing Mechanism: A lightweight gating network assigns probability distributions over experts; the top-ranked experts process each query in parallel.
- Multi-Token Prediction (MTP): Improves generation efficiency.
Revolutionary Training Methodology:
DeepSeek-R1's training process represents a breakthrough in AI development:
- DeepSeek-R1-Zero: Trained via large-scale reinforcement learning (RL) without supervised fine-tuning. Remarkably, powerful reasoning behaviors emerged naturally from pure RL.
- Group Relative Policy Optimization (GRPO): A novel RL algorithm from the DeepSeekMath paper. Built on PPO (Proximal Policy Optimization), GRPO enhances mathematical reasoning while reducing memory consumption.
- Multi-Stage Pipeline:
  - Stage 1: Pure RL to discover reasoning patterns
  - Stage 2: Supervised Fine-Tuning (SFT) on synthesized reasoning data
  - Stage 3: Second RL phase for helpfulness and harmlessness
Distilled Models:
The smaller models (1.5B-70B) are distilled from the full 671B model, demonstrating that reasoning patterns from larger models can effectively transfer to smaller ones. This makes advanced reasoning accessible on consumer hardware.
Benchmark Performance:
DeepSeek-R1 matches or exceeds frontier models on reasoning benchmarks:
- Competitive with OpenAI o1 on mathematical reasoning
- Approaches GPT-4 Turbo on code generation
- Exceeds many closed models on logic and scientific analysis
Hardware Requirements:
- 7B-8B: 8GB VRAM
- 14B: 12GB VRAM
- 32B: 24GB VRAM
- 70B: 48GB+ VRAM or large RAM
- 671B: Minimum 800GB HBM in FP8 format; requires 64-way expert parallelism across multiple GPUs
Best For: Mathematical reasoning, programming challenges, logical problem-solving, scientific analysis. The 14B-32B range offers the best balance of capability and hardware requirements for most users.
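As a minimal sketch, the distilled variants are pulled by size tag; the R1 family prints its reasoning trace before the final answer:

```bash
# 14B distilled variant: a good capability/VRAM balance on a 12GB GPU
ollama run deepseek-r1:14b "A train leaves at 3:40pm and arrives at 6:05pm. How long is the trip?"

# The reply begins with the model's chain of thought wrapped in
# <think>...</think>, followed by the final answer.
```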
DeepSeek-Coder (1.3B to 33B Parameters)
A coding-focused model trained on 87% code and 13% natural language, optimized for programming tasks.
Key Specifications:
- Sizes: 1.3B (776MB), 6.7B (3.8GB), 33B (19GB)
- Context Window: 16K tokens
- Training: 2 trillion tokens
- Downloads: 2.4 million
Best For: Code completion, code generation, programming assistance, and technical documentation. Excellent for developers who need a dedicated coding assistant.
DeepSeek-Coder-V2 (16B and 236B Parameters)
An advanced Mixture-of-Experts coding model that achieves GPT-4 Turbo-level performance on code tasks—the first open model to reach this milestone.
Key Specifications:
- Sizes: 16B (8.9GB), 236B (133GB)
- Active Parameters: 2.4B (16B model), 21B (236B model)
- Context Window: Up to 160K tokens
- Architecture: Mixture-of-Experts with Multi-Head Latent Attention
- Programming Languages: 338 supported
- Training Data: 10.2 trillion tokens (60% code, 10% mathematics)
- Downloads: 1.3 million
Architecture Innovations:
- MoE Efficiency: The 236B model uses only 21B active parameters per inference, achieving high performance without prohibitive compute costs.
- Multi-Head Latent Attention (MLA): Reduces KV cache size dramatically, enabling faster inference and longer context handling.
Benchmark Performance:
| Benchmark | DeepSeek-Coder-V2 236B | Comparison |
|---|---|---|
| HumanEval | 90.2% | New state-of-the-art |
| MBPP | 76.2% | New state-of-the-art |
| MATH | 75.7% | Near GPT-4o (76.6%) |
Hardware Requirements:
- 16B (Lite): Single GPU with 40GB VRAM in BF16
- 236B (Full): 8x 80GB GPUs for BF16 inference
Best For: Professional development environments, code review automation, and complex programming tasks requiring maximum accuracy.
Google Gemma Family: Efficiency Meets Capability
Google's Gemma models leverage technology from the Gemini family in compact, efficient packages. The March 2025 release of Gemma 3 established new standards for what's possible on a single GPU.
Gemma 3 (270M to 27B Parameters)
Gemma 3 is Google's latest and most capable model family that runs on a single GPU, bringing Gemini-class capabilities to local deployment.
Key Specifications:
- Sizes: 270M (text only), 1B, 4B, 12B, 27B
- Context Window: 32K tokens (1B), 128K tokens (4B and larger)
- Languages: 35+ out-of-the-box, 140+ pretrained support
- Multimodal: 4B and larger process both text and images
- Downloads: 28.9 million
- Training Data: 14T tokens (27B), 12T tokens (12B), 4T tokens (4B), 2T tokens (1B)
Architecture Innovations:
Gemma 3 introduces several architectural improvements:
- Interleaved Attention Blocks: Each block contains 5 local attention layers (sliding window of 1024) and 1 global attention layer, capturing both short- and long-range dependencies efficiently.
- Enhanced Positional Encoding: Upgraded RoPE (Rotary Positional Embedding) with the base frequency increased from 10K to 1M for global layers, while local layers keep 10K.
- Improved Normalization: QK-norm for stable attention scores, replacing the soft-capping used in Gemma 2. Uses Grouped-Query Attention (GQA) with both post-norm and pre-norm RMSNorm.
- Memory Efficiency: The architectural changes reduce KV cache overhead during long-context inference compared to global-only attention.
Vision Integration (4B+):
- Vision Encoder: Based on SigLIP for processing images
- Pan & Scan Algorithm: Adaptively crops and resizes images to handle different aspect ratios
- Fixed Processing Size: Vision encoder operates on 896x896 square images
Benchmark Performance:
| Benchmark | Gemma 3 27B | Notes |
|---|---|---|
| MMLU-Pro | 67.5 | Strong general knowledge |
| LiveCodeBench | 29.7 | Competitive coding |
| Bird-SQL | 54.4 | Database queries |
| GPQA Diamond | 42.4 | Graduate-level reasoning |
| MATH | 69.0 | Mathematical ability |
| FACTS Grounding | 74.9 | Factual accuracy |
| MMMU | 64.9 | Multimodal understanding |
| LM Arena Elo | 1338 | Top 10 overall (March 2025) |
The 27B model outperforms Llama 3.1 405B, DeepSeek-V3, and o3-mini in preliminary human preference evaluations on LMArena.
Additional Features:
- Function calling and structured output support
- Official quantized versions available
- Runs efficiently on workstations, laptops, and smartphones
Hardware Requirements:
- 270M-1B: Any modern hardware
- 4B: 6GB VRAM
- 12B: 12GB VRAM
- 27B: 20GB+ VRAM
Best For: Multilingual applications, multimodal projects, and situations where you need strong performance with reasonable hardware. The 12B variant is particularly efficient for its capability level.
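To take advantage of the structured output support, Ollama's format parameter can constrain a response to valid JSON. This sketch assumes the gemma3:12b tag and a recent Ollama build (newer builds also accept a full JSON Schema in format):

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "gemma3:12b",
  "prompt": "List three renewable energy sources as a JSON array of objects with name and description fields.",
  "format": "json",
  "stream": false
}'
```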
Gemma 2 (2B, 9B, and 27B Parameters)
The previous generation remains excellent for many applications, offering proven reliability and broad compatibility.
Key Specifications:
- Sizes: 2B (1.6GB), 9B (5.4GB), 27B (16GB)
- Context Window: 8K tokens
- Downloads: 12.3 million
The 27B variant delivers "performance surpassing models more than twice its size" according to Google's benchmarks.
Best For: Creative text generation, chatbots, content summarization, NLP research, and language learning applications where Gemma 3's longer context isn't needed.
Gemma (2B and 7B Parameters)
The original Gemma release from February 2024, lightweight but capable.
Key Specifications:
- Sizes: 2B (1.7GB), 7B (5.0GB)
- Context Window: 8K tokens
- Training: Web documents, code, mathematics
Best For: Edge deployments, resource-constrained environments, and applications needing a small but capable model with Google's quality standards.
CodeGemma (2B and 7B Parameters)
Google's code-specialized variant optimized for IDE integration and code completion.
Key Specifications:
- Sizes: 2B (1.6GB), 7B (5.0GB)
- Context Window: 8K tokens
- Languages: Python, JavaScript, Java, Kotlin, C++, C#, Rust, Go, and others
- Training: 500 billion tokens including code and mathematics
- Fill-in-the-Middle: Supported for code completion
Best For: IDE integration, code completion, fill-in-the-middle tasks, and coding assistant applications.
Alibaba Qwen Family: Multilingual Excellence
Qwen models from Alibaba excel at multilingual tasks and offer excellent performance across the capability spectrum. The April 2025 release of Qwen3 introduced revolutionary hybrid reasoning capabilities.
Qwen3 (0.6B to 235B Parameters)
The latest Qwen generation provides both dense and Mixture-of-Experts variants with groundbreaking hybrid reasoning modes.
Key Specifications:
- Dense Models: 0.6B, 1.7B, 4B, 8B (default), 14B, 32B
- MoE Models: 30B-A3B (30B total, 3B active), 235B-A22B (235B total, 22B active)
- Context Window: 32K-128K tokens
- Languages: 119 languages and dialects
- Training Data: 36 trillion tokens
- License: Apache 2.0
Architecture: Dense and MoE Variants
Dense models use traditional transformer architecture where all parameters contribute during inference.
MoE models feature:
- 128 expert FFNs per layer
- 8 experts selected per token
- Extended 128K context support
Revolutionary Hybrid Reasoning Modes:
Qwen3's most significant innovation is unifying two reasoning approaches in one model:
- Thinking Mode: The model reasons step-by-step before delivering answers. Ideal for complex problems requiring deeper thought.
- Non-Thinking Mode: Quick, near-instant responses for simpler questions where speed matters more than depth.
This eliminates the need to switch between chat-optimized models (like GPT-4o) and dedicated reasoning models (like QwQ-32B). Users can even set a "thinking budget" to balance computational effort against response speed.
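Qwen documents /think and /no_think soft switches that can be appended to a prompt to toggle the mode per turn. Whether they take effect depends on the chat template the Ollama qwen3 tags ship with, so treat this as an illustrative sketch:

```bash
# Quick factual answer: suppress the reasoning trace
ollama run qwen3:8b "What is the capital of Australia? /no_think"

# Harder problem: let the model think step-by-step before answering
ollama run qwen3:8b "If 3 painters finish a wall in 6 hours, how long do 4 painters take? /think"
```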
Training Process:
Three-stage pretraining:
- Stage 1: 30+ trillion tokens at 4K context for basic language skills
- Stage 2: Additional 5 trillion tokens emphasizing STEM, coding, and reasoning
- Stage 3: High-quality long-context data extending to 32K tokens
Benchmark Performance:
The flagship Qwen3-235B-A22B competes with:
- OpenAI o1 and o3-mini
- DeepSeek-R1
- Google Gemini-2.5-Pro
- Grok-3
Remarkably, Qwen3-30B-A3B outperforms QwQ-32B despite having 10x fewer activated parameters. Even Qwen3-4B rivals Qwen2.5-72B-Instruct performance.
Best For: Multilingual applications, agent development, creative writing, role-playing, and multi-turn dialogue systems. Excellent for applications that need both quick responses and deep reasoning.
Qwen3-Coder (30B and 480B Parameters)
Alibaba's latest coding models optimized for agentic and coding tasks.
Key Specifications:
- Sizes: 30B (19GB), 480B (varies)
- Optimization: Long code contexts
- Downloads: 1.6 million
Best For: Complex software development, large codebase navigation, and autonomous coding agents.
Qwen3-VL (2B to 235B Parameters)
The most powerful vision-language model in the Qwen family.
Key Specifications:
- Size Range: 2B to 235B
- Capabilities: Visual understanding, document analysis, multimodal reasoning
- Downloads: 881K
Best For: Document processing, visual question answering, and applications requiring both image and text understanding.
Qwen2.5-Coder (0.5B to 32B Parameters)
The state-of-the-art open-source coding model, matching GPT-4o on code repair benchmarks.
Key Specifications:
- Sizes: 0.5B (398MB), 1.5B, 3B, 7B, 14B, 32B (20GB)
- Context Window: 128K tokens
- Programming Languages: 92 supported
- Training Data: 5.5 trillion tokens
- Downloads: 9.5 million
Architecture:
Built on Qwen2.5 architecture with:
- 32B Model: 5,120 hidden size, 40 query heads, 8 key-value heads, 27,648 intermediate size
Benchmark Performance:
| Benchmark | Qwen2.5-Coder 32B | Notes |
|---|---|---|
| Aider (code repair) | 73.7 | Comparable to GPT-4o, 4th overall |
| MdEval (multi-language repair) | 75.2 | #1 among open-source |
| McEval (40+ languages) | 65.9 | Excellent cross-language support |
The model achieves state-of-the-art performance across 10+ benchmarks including code generation, completion, reasoning, and repair.
Best For: Professional development, code generation, code reasoning, and code fixing tasks. The best open-source coding model available.
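A minimal example of using the coder model one-off from the CLI and over the REST API (the 7B tag is the mid-range option; swap in a larger tag if your hardware allows):

```bash
# One-shot generation from the command line
ollama run qwen2.5-coder:7b "Write a bash function that prints the largest file in a directory."

# The same request over the REST API, useful for editor or CI integration
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5-coder:7b",
  "prompt": "Refactor this function to handle empty input:\n\ndef avg(xs): return sum(xs) / len(xs)",
  "stream": false
}'
```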
Qwen2 (0.5B to 72B Parameters)
The previous generation with excellent multilingual support for 29 languages.
Key Specifications:
- Sizes: 0.5B (352MB), 1.5B (935MB), 7B (4.4GB), 72B (41GB)
- Context Window: 32K-128K tokens
- Languages: 29 including major European, Asian, and Middle Eastern languages
Best For: Multilingual chatbots, translation, and cross-lingual applications.
CodeQwen (7B Parameters)
An earlier code-specialized Qwen model with exceptional context length.
Key Specifications:
- Size: 7B (4.2GB)
- Context Window: 64K tokens
- Training: 3 trillion tokens of code data
- Languages: 92 coding languages
Best For: Long-context code understanding, Text-to-SQL, and bug fixing.
Mistral AI Models: French Excellence
Mistral AI, based in Paris, has produced some of the most efficient and capable open-source models. Their innovative use of Mixture-of-Experts and Sliding Window Attention has influenced the entire field.
Mistral (7B Parameters)
The original Mistral model that proved smaller models could outperform much larger ones through architectural innovation.
Key Specifications:
- Size: 7B (4.4GB)
- Context Window: 32K tokens
- License: Apache 2.0
- Downloads: 23.6 million
Architecture Innovations:
- Sliding Window Attention: Trained with 8K context, fixed cache size, theoretical attention span of 128K tokens
- Grouped Query Attention (GQA): Faster inference and smaller cache
- Byte-fallback BPE Tokenizer: No out-of-vocabulary tokens
Outperforms Llama 2 13B on all benchmarks and approaches CodeLlama 7B on code tasks.
Hardware Requirements: 24GB RAM and single GPU
Best For: General-purpose applications, chatbots, and situations where you need reliable performance with moderate resources.
Mixtral 8x7B and 8x22B (47B and 141B Total Parameters)
Mistral's groundbreaking Mixture-of-Experts models that use only a fraction of their parameters for each inference.
Key Specifications:
| Specification | Mixtral 8x7B | Mixtral 8x22B |
|---|---|---|
| Total Parameters | 47B | 141B |
| Active Parameters | 13B | 39B |
| Size | 26GB | 80GB |
| Context Window | 32K tokens | 64K tokens |
| Downloads | 1.6 million | - |
Architecture:
Mixtral shares Mistral 7B's architecture with one key difference: each layer contains 8 feedforward blocks (experts) instead of one. A router network selects which 2 experts process each token.
Key features:
- Sliding Window Attention with broader context support
- Grouped Query Attention for efficient inference
- Byte-fallback BPE Tokenizer
Performance:
- 8x7B: Outperforms Llama 2 70B on most benchmarks with 6x faster inference. Matches or outperforms GPT-3.5 on standard benchmarks.
- 8x22B: Outperforms ChatGPT 3.5 on MMLU and WinoGrande. Achieves 90.8% on GSM8K (math) and 44.6% on MATH.
Resource Requirements:
- 8x7B: 64GB RAM, dual GPUs recommended
- 8x22B: roughly 90GB VRAM when quantized; about 5.3x slower than Mistral 7B and 2.1x slower than 8x7B
Languages: English, French, Italian, German, Spanish (native fluency)
Best For: Applications requiring high capability with better efficiency than pure dense models. Excellent for multilingual European applications.
Microsoft Phi Family: Small But Mighty
Microsoft's Phi models prove that careful training on high-quality synthetic data can create remarkably capable small models. The Phi series represents a different philosophy: quality over quantity in training data.
Phi-4 (14B Parameters)
The latest Phi model, released in December 2024, trained on synthetic datasets and high-quality filtered data with a focus on reasoning.
Key Specifications:
- Size: 14B (9.1GB)
- Context Window: 16K tokens
- Focus: Reasoning and logic
- Training Data: Approximately 10 trillion tokens, including ~400B synthetic tokens
- Downloads: 6.7 million
Training Innovation: Synthetic Data First
Phi-4 represents a paradigm shift in training methodology:
- Synthetic Data Generation: GPT-4o rewrote web text, code, scientific papers, and books as exercises, discussions, Q&A pairs, and structured reasoning tasks.
- Feedback Loop: GPT-4o critiqued its own outputs and generated improvements.
- 50 Dataset Types: Different seeds and multi-stage prompting procedures covering diverse topics, skills, and interaction types. Total: ~400B unweighted tokens.
Phi-4 substantially surpasses its teacher model (GPT-4o) on STEM-focused QA capabilities, demonstrating that synthetic data can produce emergent capabilities beyond the teacher.
Architecture:
Dense decoder-only Transformer with minimal changes from Phi-3:
- Modified RoPE base frequency for 32K context support
- Optimized for memory/compute-constrained environments
Best For: Edge deployment, real-time applications, and situations requiring strong reasoning in a compact package.
Phi-4-Reasoning (14B Parameters)
A fine-tuned variant specifically optimized for complex reasoning tasks through supervised fine-tuning and reinforcement learning.
Key Specifications:
- Size: 14B (11GB)
- Context Window: 32K tokens
- Training: SFT + Reinforcement Learning
- RL Training: Only ~6,400 math-focused problems
- Downloads: 916K
Training Approach:
- Curated Prompts: 1.4M prompts focused on "boundary" cases at the edge of Phi-4's baseline capabilities, emphasizing multi-step reasoning over factual recall.
- Synthetic Responses: Generated using o3-mini in high-reasoning mode.
- Structured Reasoning: Special <think> and </think> tokens separate intermediate reasoning from final answers, promoting transparency and coherence.
Benchmark Performance:
Despite only 14B parameters, Phi-4-Reasoning:
- Outperforms DeepSeek-R1 Distill Llama 70B (5x larger)
- Approaches full DeepSeek-R1 (671B) on AIME 2025
- Excels on GPQA-Diamond (graduate-level science)
- Strong on LiveCodeBench (competitive coding)
- Generalizes to NP-hard problems (3SAT, TSP)
Best For: Mathematical reasoning, scientific analysis, complex problem-solving, and coding tasks. Exceptional reasoning capability for its size.
Phi-3 (3.8B and 14B Parameters)
The previous generation with excellent efficiency and the first Phi model to achieve widespread adoption.
Key Specifications:
- Sizes: Mini 3.8B (2.2GB), Medium 14B (7.9GB)
- Context Window: 128K tokens
- Training: 3.3 trillion tokens
Best For: Quick prototyping, mobile applications, and situations where Phi-4 is too resource-intensive.
Phi-2 (2.7B Parameters)
Microsoft's earlier small model demonstrating that 2.7B parameters can achieve remarkable capability.
Key Specifications:
- Size: 2.7B (1.6GB)
- Context Window: 2K tokens
- Capabilities: Common-sense reasoning, language understanding
Best For: Extremely constrained environments, quick experiments, and applications where even Phi-3 is too large.
Coding-Specialized Models
Beyond the coding variants of general models, Ollama offers several dedicated code models optimized specifically for software development tasks.
CodeLlama (7B to 70B Parameters)
Meta's code-specialized version of Llama 2, offering specialized variants for different use cases.
Key Specifications:
- Sizes: 7B (3.8GB), 13B (7.4GB), 34B (19GB), 70B (39GB)
- Context Window: 16K (2K for 70B)
- Languages: Python, C++, Java, PHP, TypeScript, C#, Bash
- Variants: Base, Instruct, Python-specialized
- Training: 500B tokens (1T for 70B)
Fill-in-the-Middle (FIM) Support:
Important: Infilling is only available in 7B and 13B base models. The 34B and 70B models were trained without the infilling objective.
Use the <FILL_ME> placeholder (expanded by Hugging Face tokenizers into the model's prefix/suffix/middle format) for code completion in the middle of files.
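When calling a base codellama tag through Ollama, one way to do infilling is to send Meta's underlying <PRE>/<SUF>/<MID> prompt format with raw mode, which bypasses the chat template. The tag name, token spacing, and option values below are illustrative rather than canonical:

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "codellama:7b-code",
  "prompt": "<PRE> def remove_non_ascii(s):\n    \"\"\" <SUF>\n    return result <MID>",
  "raw": true,
  "stream": false,
  "options": { "num_predict": 64 }
}'
```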
Performance:
- 34B: 53.7% HumanEval, 56.2% MBPP (comparable to ChatGPT at release)
- 70B: Highest capability but 2K context limitation
- 7B/13B: Best for real-time completion due to low latency
Best For: Code completion, generation, review, and fill-in-the-middle tasks. Choose size based on your latency requirements.
StarCoder2 (3B, 7B, and 15B Parameters)
Next-generation open code models from BigCode with full transparency about training data.
Key Specifications:
- Sizes: 3B (1.7GB), 7B (4.0GB), 15B (9.1GB)
- Context Window: 16K tokens
- Languages: 17 (3B, 7B) to 600+ (15B)
- Training: 3-4 trillion tokens from The Stack v2
- License: BigCode OpenRAIL-M v1
Architecture:
- Grouped Query Attention
- Sliding window attention (4,096 tokens)
- Fill-in-the-Middle objective training
Training Data (The Stack v2):
- 67.5 terabytes of code data (4x larger than original StarCoder)
- Full transparency with SoftWare Heritage persistent IDentifiers (SWHIDs)
Performance:
| Model | Comparison |
|---|---|
| StarCoder2-3B | Matches original StarCoder-15B |
| StarCoder2-15B | Matches 33B+ models, outperforms CodeLlama-34B |
StarCoder2-15B outperforms DeepSeekCoder-33B on math, code reasoning, and low-resource languages.
Note: StarCoder2-7B underperforms relative to 3B and 15B for unknown reasons.
Best For: Code completion, generation, and applications where transparency about training data matters (compliance, licensing concerns).
WizardCoder (7B and 33B Parameters)
State-of-the-art code generation using innovative Evol-Instruct techniques.
Key Specifications:
- Sizes: 7B (3.8GB), 33B (19GB)
- Context Window: 16K tokens
- Base: Code Llama and DeepSeek Coder
Best For: Advanced code generation tasks requiring high accuracy.
Stable Code 3B (3B Parameters)
Stability AI's efficient code completion model optimized for real-time IDE use.
Key Specifications:
- Size: 3B (1.6GB)
- Context Window: 16K tokens
- Languages: 18 programming languages
- Feature: Fill-in-the-Middle capability
Best For: IDE integration, real-time code completion, and applications requiring fast inference.
Granite Code (3B to 34B Parameters)
IBM's enterprise-focused decoder-only code models with strong compliance and licensing guarantees.
Key Specifications:
- Sizes: 3B (2.0GB), 8B (4.6GB), 20B (12GB), 34B (19GB)
- Context Window: 8K-128K tokens
- Capabilities: Code generation, explanation, fixing
- Variants: Base, Instruct, Accelerator
Architecture:
- Transformer decoder with pre-normalization
- Multi-Query Attention for efficient inference
- GELU activation in MLP blocks
- LayerNorm for activation normalization
Two-Phase Training:
- Phase 1: 3-4 trillion tokens from 116 programming languages
- Phase 2: 500 billion additional tokens mixing code and natural language for improved reasoning
34B Model Creation (Depth Upscaling):
IBM created the 34B model through depth upscaling:
- Remove final 8 layers from first 20B checkpoint
- Remove first 8 layers from second 20B checkpoint
- Merge to create 88-layer model
- Continue training on 1.4T tokens
Performance:
Granite models consistently outperform equivalent-size CodeLlama. Even Granite-3B-Code-Instruct surpasses CodeLlama-34B-Instruct.
Enterprise Features:
- Training data collected per IBM AI ethics principles
- IBM legal team guidance for trustworthy enterprise use
- Available on watsonx.ai and RHEL AI
- Accelerator versions for reduced latency
Best For: Enterprise environments requiring IBM support, compliance guarantees, and licensing clarity.
Magicoder (7B Parameters)
Code models trained using the innovative OSS-Instruct methodology for reduced training bias.
Key Specifications:
- Size: 7B (3.8GB)
- Context Window: 16K tokens
- Training: 75K synthetic instructions generated from open-source code
Best For: Diverse, realistic code generation with reduced training bias compared to models trained on curated instruction sets.
SQL-Specialized Models
For database work, these specialized models convert natural language to SQL with high accuracy.
SQLCoder (7B and 15B Parameters)
Fine-tuned on StarCoder specifically for SQL generation, slightly outperforming GPT-3.5-turbo on natural language to SQL tasks.
Key Specifications:
- Sizes: 7B (4.1GB), 15B (9.0GB)
- Context Window: 8K-32K tokens
Best For: Database querying, business intelligence, and SQL generation from natural language descriptions.
DuckDB-NSQL (7B Parameters)
Specialized for DuckDB SQL generation, optimized for analytics workloads.
Key Specifications:
- Size: 7B (3.8GB)
- Context Window: 16K tokens
- Base: Llama-2 7B with SQL-specific training
Best For: DuckDB-specific applications, analytics workloads, and data engineering tasks.
Vision-Language Models
These models combine text and image understanding for multimodal applications.
LLaVA (7B to 34B Parameters)
Large Language and Vision Assistant, one of the most influential open-source vision-language models.
Key Specifications:
- Sizes: 7B (4.7GB), 13B (8.0GB), 34B (20GB)
- Context Window: 4K-32K tokens
- Capabilities: Visual reasoning, OCR, image captioning
- Downloads: 12.3 million
Architecture (LLaVA 1.6/LLaVA-NeXT):
- Vision Encoder: CLIP-ViT-L
- Vision-Language Connector: MLP (upgraded from linear projection in v1.5)
- Resolution: Up to 672x672 (4x more pixels than v1.5)
- Aspect Ratios: Supports 672x672, 336x1344, 1344x336
Key Improvements in v1.6:
- Enhanced OCR: Replaced TextCaps with DocVQA and SynDog-EN training data
- Chart Understanding: Added ChartQA, DVQA, AI2D for diagram comprehension
- Better Visual Reasoning: Improved zero-shot performance
Training Efficiency:
LLaVA 1.6 maintains minimalist design:
- 32 GPUs for ~1 day
- 1.3M training samples
- Reuses pretrained connector from v1.5
- 100-1000x lower compute cost than competing models
Performance: Catches up to Gemini Pro and outperforms Qwen-VL-Plus on selected benchmarks.
2025 Development (LLaVA-Mini):
LLaVA-Mini achieves comparable performance using only 1 vision token instead of 576 (0.17% of original), offering 77% FLOPs reduction and significantly lower GPU memory.
Best For: General visual understanding, document analysis, and multimodal conversations.
LLaVA-Llama3 (8B Parameters)
LLaVA fine-tuned from Llama 3 Instruct with improved benchmark scores.
Key Specifications:
- Size: 8B (5.5GB)
- Context Window: 8K tokens
- Downloads: 2.1 million
Best For: Users who want LLaVA capabilities with Llama 3's improved language understanding.
BakLLaVA (7B Parameters)
Mistral 7B augmented with LLaVA architecture, combining Mistral's efficiency with vision capabilities.
Key Specifications:
- Size: 7B (4.7GB)
- Context Window: 32K tokens
- Downloads: 373K
Best For: Visual understanding with Mistral's efficient architecture and longer context.
MiniCPM-V (8B Parameters)
Efficient multimodal model from OpenBMB, designed to run on edge devices while outperforming much larger models.
Key Specifications:
- Size: 8B (5.5GB)
- Architecture: SigLip-400M + Qwen2-7B
- Context Window: 32K tokens
- Resolution: Up to 1.8 million pixels (e.g., 1344x1344)
- Languages: English, Chinese, German, French, Italian, Korean
Token Efficiency:
MiniCPM-V 2.6 produces only 640 tokens when processing a 1.8M pixel image—75% fewer than most models. This improves:
- Inference speed
- First-token latency
- Memory usage
- Power consumption
Performance:
- OpenCompass: 65.2 (surpasses GPT-4o mini and Claude 3.5 Sonnet on single-image)
- State-of-the-art on OCRBench, surpassing GPT-4V and Gemini 1.5 Pro
- Supports real-time video understanding on edge devices like iPad
2025 Updates:
- MiniCPM-o 2.6 (January 2025): Adds real-time speech-to-speech conversation and multimodal live streaming. OpenCompass: 70.2
- MiniCPM-V 4.5: Outperforms GPT-4o-latest, Gemini-2.0 Pro, and Qwen2.5-VL 72B in vision-language capabilities
Best For: High-resolution image analysis, OCR-heavy applications, edge deployment, and efficient multimodal deployment.
Moondream (1.8B Parameters)
Tiny vision-language model designed specifically for edge deployment.
Key Specifications:
- Size: 1.86B (1.7GB), under 5GB memory required
- Context Window: ~1000 tokens
- Base: SigLIP + Phi-1.5 weights
- Downloads: 472K
Capabilities:
- Visual question answering
- Image captioning
- Object detection (including document layout detection)
- UI understanding
- Zero-shot detection (outperforms o3-mini, SmolVLM 2.0, Claude 3 Opus)
2025 Improvements (June 2025 Release):
- Reinforcement Learning fine-tuning across 55 vision-language tasks
- Better OCR for documents and tables
- ScreenSpot (UI grounding): 60.3 (up from 53.3)
- DocVQA: 79.3 (up from 76.5)
- TextVQA: 76.3 (up from 74.6)
- CountBenchQA: 86.4 (up from 80)
- COCO object detection: 51.2 (up from 30.5)
Edge Deployment:
- Runs on Raspberry Pi and single-board computers
- Much faster than larger models like Qwen2.5-VL on edge hardware
- Designed for quick multimodal tasks on constrained hardware
Best For: Edge devices, mobile applications, and situations requiring vision capabilities with minimal resources.
Embedding Models
Embedding models convert text to numerical vectors for semantic search, retrieval-augmented generation (RAG), and similarity matching.
nomic-embed-text
High-performance text embedding that surpasses OpenAI's ada-002 and text-embedding-3-small, with full reproducibility.
Key Specifications:
- Parameters: 137 million
- Size: 274MB
- Context Window: 8192 tokens (industry-leading for open models)
- Downloads: 48.7 million
- License: Apache 2.0
Architecture (nomic-bert-2048):
BERT base with key modifications:
- Rotary Positional Embeddings (RoPE): Replaces absolute encodings, enables context extrapolation
- SwiGLU Activation: Replaces GeLU for improved performance
- Flash Attention: Optimized attention computation
- Vocabulary Size: Multiple of 64 for efficiency
Training Pipeline:
- Stage 1: Self-supervised MLM objective (BERT-style)
- Stage 2: Contrastive training with web-scale unsupervised data
- Stage 3: Contrastive fine-tuning with 1.6M curated paired samples
Training data: 235 million text pairs (fully disclosed)
Performance:
- Outperforms text-embedding-ada-002 on short-context MTEB
- Outperforms text-embedding-3-small on MTEB
- Outperforms jina-embeddings-v2-base-en on long context (LoCo, Jina benchmarks)
- Best performing 100M parameter class unsupervised model
Nomic Embed v1.5:
- Adds Matryoshka Representation Learning for adjustable embedding dimensions
- Now multimodal: nomic-embed-vision-v1.5 aligned to text embedding space
Best For: Semantic search, similarity matching, RAG applications, and any task requiring high-quality text embeddings.
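A minimal sketch of generating vectors for a RAG pipeline through Ollama's embeddings endpoint (newer builds also expose /api/embed with an input field):

```bash
curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "How do I reset my password?"
}'
# The response contains an "embedding" array of floats; store it in your
# vector database and compare with cosine similarity at query time.
```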
mxbai-embed-large
State-of-the-art large embedding model from mixedbread.ai.
Key Specifications:
- Size: 335M (670MB)
- Context Window: 512 tokens
- Downloads: 6 million
Achieves top performance among BERT-large models on MTEB benchmark, outperforming OpenAI's commercial embedding.
Best For: High-accuracy embedding applications where quality matters more than speed or context length.
BGE-M3
Versatile multilingual embedding model from BAAI (Beijing Academy of Artificial Intelligence), supporting three retrieval methods in one model.
Key Specifications:
- Size: 567M (1.2GB)
- Context Window: 8K tokens
- Languages: 100+ languages
- Capabilities: Dense, multi-vector, and sparse retrieval
- Downloads: 3 million
M3 = Multi-Multi-Multi:
- Multi-linguality: 100+ languages
- Multi-granularity: Up to 8192 tokens input
- Multi-functionality: Three retrieval methods unified
Architecture:
Based on XLM-RoBERTa-large (24 layers, 1024 hidden, 16 heads) with RetroMAE enhancements. Core model: ~550M parameters.
Three Retrieval Methods:
- Dense Retrieval: Normalized [CLS] token hidden state
- Sparse Retrieval: Linear layer + ReLU on hidden states (outperforms BM25)
- Multi-vector (ColBERT-style): Fine-grained query-passage interactions
Hybrid Scoring: s_rank = w1·s_dense + w2·s_lex + w3·s_mul
Training:
- Pre-trained on ~1.2 billion unsupervised text pairs
- Fine-tuned on English, Chinese, and multilingual retrieval datasets
- Novel self-knowledge distillation approach
Performance:
- MIRACL (18 languages): nDCG@10 = 70.0 (highest among multilingual embedders)
- Outperforms mE5 (~65.4 average)
- Sparse representations outperform BM25 across all tested languages
2025 Updates:
- Available on NVIDIA NIM and IONOS Cloud for production deployment
- BGE-VL released for multimodal embedding (MIT license)
Best For: Multilingual retrieval, cross-lingual search, applications requiring variable text lengths, and hybrid retrieval systems.
all-minilm
Lightweight embedding model for resource-constrained environments.
Key Specifications:
- Sizes: 46MB and 67MB variants
- Context Window: 512 tokens
- Downloads: 2.1 million
Best For: Quick prototyping, edge deployment, and applications where embedding model size matters.
Snowflake Arctic Embed
Retrieval-optimized embeddings from Snowflake, designed for production RAG pipelines.
Key Specifications:
- Sizes: 22M (46MB), 33M (67MB), 110M (219MB), 137M (274MB), 335M (669MB)
- Context Window: 512-2K tokens
Best For: Retrieval-focused applications, search systems, and production RAG pipelines.
Enterprise and Specialized Models
Command R (35B Parameters)
Cohere's model optimized for RAG and tool integration, designed for enterprise-scale deployments.
Key Specifications:
- Size: 35B (19GB)
- Context Window: 128K tokens
- Languages: 10+ languages (English, French, Spanish, Italian, German, Portuguese, Japanese, Korean, Chinese, Arabic)
Architecture:
Auto-regressive transformer with:
- Supervised Fine-Tuning (SFT)
- Preference training for human alignment
RAG Capabilities:
Command R is specifically designed for retrieval-augmented generation:
- Grounded summarization with source citations
- Verifiable outputs with grounding spans
- Optimized for working with Cohere's Embed and Rerank models
Tool Use:
- Single-step: Multiple tools called simultaneously in one step
- Multi-step: Sequential tool calls using previous results
- Enables dynamic actions based on external information
Best For: Enterprise RAG applications, tool-using agents, and production chatbots requiring high throughput.
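A sketch of single-step tool use through Ollama's chat endpoint, assuming the command-r tag's template supports the tools parameter; the get_current_weather function is a hypothetical tool you would implement yourself:

```bash
curl http://localhost:11434/api/chat -d '{
  "model": "command-r",
  "messages": [
    { "role": "user", "content": "What is the weather in Toronto right now?" }
  ],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_current_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": { "city": { "type": "string" } },
        "required": ["city"]
      }
    }
  }],
  "stream": false
}'
# If the model decides to call the tool, the reply's message.tool_calls
# field carries the function name and arguments for your code to execute.
```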
Command R+ (104B Parameters)
Cohere's flagship model with advanced multi-step reasoning and RAG capabilities.
Key Specifications:
- Size: 104B
- Context Window: 128K tokens
- Languages: 10+ languages
August 2024 Update:
- 50% higher throughput
- 25% lower latencies
- Same hardware footprint
Best For: Complex RAG workflows, multi-step tool use, and enterprise applications requiring maximum capability.
Aya (8B and 35B Parameters)
Cohere's multilingual model supporting 23 languages with strong cross-lingual performance.
Key Specifications:
- Sizes: 8B (4.8GB), 35B (20GB)
- Context Window: 8K tokens
- Downloads: 213K
Best For: Multilingual applications requiring strong cross-lingual performance.
Solar (10.7B Parameters)
Upstage's efficient model using innovative Depth Up-Scaling, outperforming models up to 30B parameters.
Key Specifications:
- Size: 10.7B (6.1GB)
- Context Window: 4K tokens
- Base: Llama 2 architecture with Mistral weights
Outperforms Mixtral 8x7B on H6 benchmarks despite having fewer total parameters.
Best For: Single-turn conversations and applications where efficiency matters.
Nemotron (70B Parameters)
NVIDIA-customized Llama 3.1 for enhanced response quality, optimized for the NVIDIA ecosystem.
Key Specifications:
- Size: 70B (43GB)
- Context Window: 128K tokens
- Training: RLHF with REINFORCE algorithm
Best For: Enterprise applications requiring NVIDIA ecosystem integration and high-quality responses.
InternLM2 (1.8B to 20B Parameters)
Shanghai AI Lab's model with outstanding reasoning and tool utilization capabilities.
Key Specifications:
- Sizes: 1.8B (1.1GB), 7B (4.5GB), 20B (11GB)
- Context Window: 32K-256K tokens
Best For: Mathematical reasoning, tool utilization, and web browsing applications.
Yi (6B to 34B Parameters)
01.ai's bilingual English-Chinese models trained on 3 trillion tokens.
Key Specifications:
- Sizes: 6B (3.5GB), 9B (5.0GB), 34B (19GB)
- Context Window: 4K tokens
- Training: 3 trillion tokens
Best For: English-Chinese bilingual applications and research.
Community and Fine-Tuned Models
OpenHermes (7B Parameters)
Teknium's state-of-the-art fine-tune on Mistral using carefully curated open datasets.
Key Specifications:
- Size: 7B (4.1GB)
- Context Window: 32K tokens
- Training: 1,000,000 entries primarily from GPT-4
Training Data:
- ~1 million dialogue entries, primarily GPT-4 generated
- 7-14% programming instructions
- Converted to ShareGPT format, then ChatML via axolotl
- Extensive filtering of public datasets
Training Approach:
- Supervised fine-tuning on multi-turn conversations
- Preference data rated by GPT-4
- Distilled Direct Preference Optimization (dDPO)
Interesting Finding: Including 7-14% code instructions boosted non-code benchmarks (TruthfulQA, AGIEval, GPT4All) while slightly reducing BigBench.
Performance:
- GPT4All average: 73.12
- AGIEval: 43.07
- TruthfulQA: 53.04
- HumanEval pass@1: 50.7%
- Matches larger 70B models on certain benchmarks
Best For: Multi-turn conversations, coding tasks, and applications requiring strong instruction-following.
Dolphin-Mixtral (8x7B and 8x22B)
Uncensored fine-tune of Mixtral optimized for coding and unrestricted responses.
Key Specifications:
- Sizes: 8x7B (26GB), 8x22B (80GB)
- Context Window: 32K-64K tokens
- Downloads: 799K
Best For: Uncensored coding assistance and creative applications.
Zephyr (7B and 141B Parameters)
HuggingFace's helpful assistant models, optimized for user assistance.
Key Specifications:
- Sizes: 7B (4.1GB), 141B (80GB)
- Context Window: 32K-64K tokens
- Downloads: 338K
Best For: Helpful, conversational applications prioritizing user assistance.
OpenChat (7B Parameters)
C-RLFT trained model that surpasses ChatGPT on various benchmarks.
Key Specifications:
- Size: 7B (4.1GB)
- Context Window: 8K tokens
- Downloads: 253K
Best For: Chat applications requiring strong open-source performance.
Nous-Hermes 2 (10.7B and 34B Parameters)
Nous Research's scientific and coding-focused models.
Key Specifications:
- Sizes: 10.7B (6.1GB), 34B (19GB)
- Context Window: 4K tokens
- Downloads: 196K
Best For: Scientific discussion, coding tasks, and research applications.
Samantha-Mistral (7B Parameters)
Eric Hartford's companion assistant trained on philosophy and psychology.
Key Specifications:
- Size: 7B (4.1GB)
- Context Window: 32K tokens
- Downloads: 159K
Best For: Conversational AI emphasizing personal development and relationship coaching.
Vicuna (7B to 33B Parameters)
LMSYS's chat assistant trained on ShareGPT conversations.
Key Specifications:
- Sizes: 7B (3.8GB), 13B (7.4GB), 33B (18GB)
- Context Window: 2K-16K tokens
Best For: General chat applications and fine-tuning experiments.
Orca-Mini (3B to 70B Parameters)
Llama-based models trained using Orca methodology for learning complex reasoning patterns.
Key Specifications:
- Sizes: 3B (2.0GB), 7B (3.8GB), 13B (7.4GB), 70B (39GB)
- Context Window: Various
Best For: Entry-level hardware deployments and learning complex reasoning patterns.
Neural Chat (7B Parameters)
Intel's Mistral-based model for high-performance chatbots, optimized for Intel hardware.
Key Specifications:
- Size: 7B (4.1GB)
- Context Window: 32K tokens
- Downloads: 198K
Best For: Chatbot applications optimized for Intel hardware.
TinyLlama (1.1B Parameters)
Compact Llama trained on 3 trillion tokens, demonstrating that tiny models can be surprisingly capable.
Key Specifications:
- Size: 1.1B (638MB)
- Context Window: 2K tokens
- Downloads: 3.2 million
Best For: Extremely constrained environments and minimal footprint deployments.
EverythingLM (13B Parameters)
Uncensored Llama 2 with extended 16K context.
Key Specifications:
- Size: 13B (7.4GB)
- Context Window: 16K tokens
- Downloads: 91K
Best For: Extended context applications without content restrictions.
Notux (8x7B Parameters)
Optimized Mixtral variant with improved fine-tuning.
Key Specifications:
- Size: 8x7B (26GB)
- Context Window: 32K tokens
Best For: Users wanting improved Mixtral performance through fine-tuning.
XWinLM (7B and 13B Parameters)
Llama 2-based model with competitive benchmark performance.
Key Specifications:
- Sizes: 7B (3.8GB), 13B (7.4GB)
- Context Window: 4K tokens
- Downloads: 143K
Best For: General chat and alternative to base Llama 2.
Domain-Specific Models
Meditron (7B and 70B Parameters)
Medical-specialized model from EPFL, designed for healthcare applications.
Key Specifications:
- Sizes: 7B (3.8GB), 70B (39GB)
- Context Window: 2K-4K tokens
Outperforms Llama 2, GPT-3.5, and Flan-PaLM on many medical reasoning tasks.
Best For: Medical question answering, differential diagnosis support, and health information (with appropriate clinical oversight).
Important: Not a substitute for professional medical advice. Requires clinical oversight for any healthcare applications.
MedLlama2 (7B Parameters)
Llama 2 fine-tuned on MedQA dataset for medical question-answering.
Key Specifications:
- Size: 7B (3.8GB)
- Context Window: 4K tokens
- Downloads: 114K
Best For: Medical question-answering and research (not for clinical use).
Wizard-Math (7B to 70B Parameters)
Mathematical reasoning specialist optimized for problem-solving and computational tasks.
Key Specifications:
- Sizes: 7B (4.1GB), 13B (7.4GB), 70B (39GB)
- Context Window: 2K-32K tokens
- Downloads: 164K
Best For: Mathematical problem-solving, tutoring applications, and computational reasoning.
FunctionGemma (270M Parameters)
Google's Gemma 3 variant fine-tuned for function calling, enabling reliable tool use in agents.
Key Specifications:
- Size: 270M
- Specialization: Tool and function calling
- Downloads: 13K
Best For: Agent development and applications requiring reliable function calling.
Multilingual Models
StableLM2 (1.6B and 12B Parameters)
Stability AI's multilingual model optimized for European languages.
Key Specifications:
- Sizes: 1.6B (983MB), 12B (7.0GB)
- Context Window: 4K tokens
- Languages: English, Spanish, German, Italian, French, Portuguese, Dutch
- Downloads: 179K
Best For: Multilingual European applications with moderate resource requirements.
Falcon (7B to 180B Parameters)
Technology Innovation Institute's multilingual models with massive scale options.
Key Specifications:
- Sizes: 7B (4.2GB), 40B (24GB), 180B (101GB)
- Context Window: 2K tokens
The 180B variant performs between GPT-3.5 and GPT-4 levels on many benchmarks.
Best For: High-capability multilingual applications and research.
Hardware Requirements Quick Reference
| Model Category | VRAM Needed | RAM Alternative | Best GPUs |
|---|---|---|---|
| 1-3B models | 4GB | 8GB | Any modern GPU |
| 7-8B models | 8GB | 16GB | RTX 3060, RTX 4060 |
| 13-14B models | 12GB | 24GB | RTX 3060 12GB, RTX 4070 |
| 32-34B models | 24GB | 48GB | RTX 4090, A6000 |
| 70B models | 48GB+ | 64GB+ | Multiple GPUs, Apple Silicon |
| 100B+ models | Specialized | 128GB+ | Enterprise infrastructure |
Apple Silicon Recommendations:
- M1/M2 (16GB): 7-8B models comfortably
- M2 Pro/M3 Pro (32GB): Up to 32B models, 70B with slow speed
- M3 Max (128GB): 70B models at usable speeds
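A practical way to check whether a model actually fits your GPU is to load it and see where Ollama placed it:

```bash
# Load a model, then inspect how it was split between GPU and CPU memory
ollama run llama3.1:8b "hello" > /dev/null
ollama ps
# The PROCESSOR column reads something like "100% GPU" when the model fits
# entirely in VRAM, or a CPU/GPU split when it spills into system RAM.
```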
Quantization Impact:
- Q4 (4-bit): 75% size reduction, minimal quality loss
- Q8 (8-bit): Higher quality, more memory
- Q2-Q3: Maximum compression, noticeable quality degradation
- Recommended: Q4_K_M for best balance
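Most library models publish several quantization tags alongside the default; the exact names vary by model, so the tags below are examples rather than a guarantee:

```bash
# The default tag is typically a 4-bit build (Q4_K_M or similar)
ollama pull llama3.1:8b

# Explicitly request a higher-precision 8-bit build: larger and slower, but sharper
ollama pull llama3.1:8b-instruct-q8_0

# Inspect a downloaded model's quantization, parameter count, and context length
ollama show llama3.1:8b
```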
How to Choose the Right Model
For General Chat and Assistance
- Budget hardware: Llama 3.2 3B, Phi-3 Mini, Gemma 2 2B
- Standard hardware: Llama 3.1 8B, Mistral 7B, Gemma 3 12B
- High-end hardware: Llama 3.3 70B, Qwen3 32B
For Coding and Development
- Quick completions: Stable Code 3B, CodeGemma 2B
- General coding: Qwen2.5-Coder 7B, DeepSeek-Coder 6.7B
- Maximum quality: Qwen2.5-Coder 32B, DeepSeek-Coder-V2 16B
For Reasoning and Analysis
- Efficient reasoning: Phi-4-Reasoning, DeepSeek-R1 14B
- Maximum capability: DeepSeek-R1 70B, Qwen3 32B
For Image Understanding
- Lightweight: Moondream, LLaVA 7B
- Balanced: MiniCPM-V, Gemma 3 12B
- Maximum capability: Llama 3.2-Vision 90B, Qwen3-VL
For Multilingual Applications
- European languages: Mixtral 8x7B, StableLM2
- Asian languages: Qwen3, Yi
- 100+ languages: BGE-M3, Qwen2
For RAG and Search
- Standard embedding: nomic-embed-text, all-minilm
- High-quality embedding: mxbai-embed-large, BGE-M3
- RAG systems: Command R with your embedding choice
Getting Started with Ollama
Installing Ollama and running your first model takes just a few minutes:
- Install Ollama: Download from ollama.com for Windows, Mac, or Linux
- Pull a model: `ollama pull llama3.1`
- Start chatting: `ollama run llama3.1`
For integration with applications, Ollama provides a REST API at localhost:11434:
curl http://localhost:11434/api/generate -d '{
"model": "llama3.1",
"prompt": "Why is the sky blue?"
}'
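The chat endpoint works the same way but keeps conversation history in a messages array, which most application integrations will want; setting stream to false returns a single JSON response instead of a token stream:

```bash
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1",
  "messages": [
    { "role": "system", "content": "You are a concise assistant." },
    { "role": "user", "content": "Give me three reasons to run models locally." }
  ],
  "stream": false
}'
```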
Using Local AI on Practical Web Tools
If you want to experience local AI without any setup, try our AI Chat feature. It connects to your local Ollama installation, providing a polished interface while keeping all processing on your machine. Your prompts never touch our servers, maintaining complete privacy.
The interface works with any Ollama model. Simply select your preferred model and start chatting. Combined with our privacy-focused file conversion tools, you can build complete local workflows without sending sensitive data to the cloud.
Frequently Asked Questions
What is the best Ollama model for beginners?
Start with Llama 3.1 8B. It runs on most hardware (8GB VRAM or 16GB RAM), provides excellent quality across diverse tasks, and has the largest community support. Once comfortable, explore specialized models based on your specific needs.
How much VRAM do I need for Ollama?
For 7-8B models, 8GB VRAM is sufficient. For 13-14B models, aim for 12GB. For 32B+ models, you need 24GB or more. Alternatively, models can run in system RAM at reduced speed, roughly doubling the memory requirement.
What is the fastest Ollama model?
The fastest capable models are Llama 3.2 1B and Phi-3 Mini, which generate 100+ tokens per second on modest hardware. For usable quality, Llama 3.1 8B at 40-70 tokens per second on modern GPUs offers the best speed/quality balance.
Which Ollama model is best for coding?
Qwen2.5-Coder 32B offers the best quality, matching GPT-4o on code repair benchmarks. For smaller hardware, Qwen2.5-Coder 7B or DeepSeek-Coder 6.7B provide excellent results. StarCoder2 15B offers transparency about training data.
Can Ollama models process images?
Yes. Llama 3.2-Vision, LLaVA, MiniCPM-V, BakLLaVA, Moondream, and Gemma 3 (4B+) all process images. MiniCPM-V and LLaVA 1.6 offer the best image understanding for their size.
What is the difference between quantization levels?
Q4 uses 4 bits per parameter, reducing model size by 75% with minimal quality loss. Q8 uses 8 bits for higher quality but more memory. Q2-Q3 saves more memory but noticeably degrades quality. For most uses, Q4_K_M is the sweet spot.
How do I choose between Llama, Mistral, and Qwen?
Llama has the largest ecosystem and broadest support. Mistral offers excellent efficiency and European language performance. Qwen excels at multilingual tasks (especially Asian languages) and provides strong coding variants. Try each for your specific task.
Are these models safe to use?
Most models include safety training. However, "uncensored" variants (Llama 2 Uncensored, Dolphin-Mixtral) have guardrails removed and should be used responsibly. Always implement appropriate safeguards for production applications.
How do Ollama models compare to ChatGPT?
Llama 3.1 70B and DeepSeek-R1 70B approach GPT-4 quality for many tasks. For everyday use, Llama 3.1 8B competes with GPT-3.5. The gap has narrowed significantly, though frontier models still lead on the most complex reasoning.
Can I fine-tune Ollama models?
Ollama itself runs pre-existing models. For fine-tuning, use the base models from HuggingFace with tools like Axolotl or PEFT, then import the fine-tuned weights into Ollama.
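As a sketch of that last step, a fine-tuned model exported to GGUF can be wrapped in a Modelfile and registered under a local name (the file names here are placeholders):

```bash
# Modelfile pointing at your fine-tuned GGUF weights
cat > Modelfile <<'EOF'
FROM ./my-finetuned-model.gguf
PARAMETER temperature 0.7
SYSTEM "You are a support assistant for internal documentation."
EOF

# Register it with Ollama, then run it like any library model
ollama create my-finetuned -f Modelfile
ollama run my-finetuned
```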
What is the best model for mathematical reasoning?
DeepSeek-R1 and Phi-4-Reasoning lead in mathematical reasoning. Phi-4-Reasoning is remarkable for its size, matching much larger models on math olympiad problems. For maximum capability, DeepSeek-R1 70B or the full 671B model approach frontier performance.
Which models have the longest context windows?
InternLM2 supports up to 256K tokens. Qwen3, Llama 3.1/3.2/3.3, DeepSeek-R1, and Gemma 3 (4B and larger) support 128K. For embeddings, BGE-M3 and nomic-embed-text support 8K tokens.
Conclusion
The Ollama model library offers something for every use case, from tiny 270M parameter edge models to massive 671B reasoning systems. The key is matching model capabilities to your actual needs rather than always choosing the largest option.
For most users, starting with Llama 3.1 8B provides an excellent foundation. As you identify specific needs—whether coding, reasoning, multilingual support, or image understanding—explore the specialized models in those categories.
Local AI has reached a maturity where quality rivals cloud APIs for many tasks, while offering complete privacy, zero ongoing costs, and offline capability. With Ollama making deployment trivial, the only barrier is choosing your first model.
Start experimenting today with our AI Chat feature, which connects seamlessly to your local Ollama installation for a polished, private AI experience.
Model information current as of December 2025. Download counts and specifications updated regularly by Ollama. Always check ollama.com/library for the latest models and versions.