
If you want to experiment with local LLMs through Ollama without sending a single token to a cloud provider, this guide is for you. It targets developers who already build with APIs and want a private, offline-friendly way to run models like Llama 3.2, Gemma, and Qwen on their own hardware. By the end, you will have Ollama installed, a model answering questions in your terminal, and a working REST API you can call from Python or Node.
Running models locally solves real problems. You avoid per-token billing during development, you keep sensitive data on your own machine, and you can keep working on a plane with no connection. The trade-off is that local hardware caps how large a model you can run, so part of this tutorial covers picking a model that actually fits your laptop.
What Is Ollama?
Ollama is an open-source tool that downloads, manages, and serves large language models on your own machine. It wraps the llama.cpp inference engine behind a simple CLI and a local REST API, so you can run a quantized model with one command instead of compiling C++ or wiring up Python dependencies. It runs on macOS, Windows, and Linux, with GPU acceleration where available.
Think of Ollama as Docker for language models. You pull a model from a registry, run it, and it stays cached locally for next time. The project handles the messy parts: downloading the right quantized weights, loading them onto your GPU or CPU, and exposing a consistent API that mirrors the OpenAI chat format.
Prerequisites
You do not need much to get started. However, the model you choose has to fit in memory, so hardware matters more than software here.
- RAM or VRAM: At least 8 GB of free memory for small models (1B–8B parameters). 16 GB is comfortable, and 32 GB+ opens up larger models.
- Disk space: Each model is a few gigabytes. Budget 5–10 GB to start, more if you collect several.
- OS: macOS 14 (Sonoma) or later, Windows 10/11, or a modern Linux distribution.
- Optional GPU: An NVIDIA GPU (with CUDA) or an Apple Silicon Mac dramatically speeds up inference. CPU-only works but runs slower.
A rough rule of thumb: a 7B–8B model at 4-bit quantization needs about 5–6 GB of memory, while a 70B model needs roughly 40 GB or more. If you only have 16 GB, stick to models in the 1B–14B range.
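The rule of thumb can be turned into a quick back-of-the-envelope calculation. The sketch below multiplies the weight size by an assumed 1.2 overhead factor for runtime buffers and context; that factor and the function names are illustrative assumptions, not something Ollama reports.

```python
def estimate_memory_gb(params_billions: float, bits: int = 4, overhead: float = 1.2) -> float:
    """Rough memory estimate: weight bytes times an assumed overhead factor."""
    weight_gb = params_billions * bits / 8  # 1B parameters at 8 bits is about 1 GB
    return weight_gb * overhead


def fits(params_billions: float, free_gb: float, bits: int = 4) -> bool:
    """True if the estimated footprint fits in the given free memory."""
    return estimate_memory_gb(params_billions, bits) <= free_gb


for size in (3, 8, 14, 70):
    print(f"{size}B at 4-bit: ~{estimate_memory_gb(size):.1f} GB")
```

With these assumptions an 8B model comes out near 5 GB and a 70B model near 42 GB, in the same ballpark as the figures above. Treat the result as a first filter and confirm against the actual download size.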
Step 1: Install Ollama
On macOS and Linux, the fastest path is the official install script. Run this in your terminal.
# Download and run the official installer
curl -fsSL https://ollama.com/install.sh | sh
# Verify the install
ollama --version
# Expected output: ollama version is 0.x.x
On Windows, download the installer from the official site and run it. The installer sets up Ollama as a background service that starts automatically. For macOS, you can alternatively grab the DMG and drag the app to Applications.
After installation, Ollama runs a local server in the background. You can confirm it is listening by checking the default port.
# The Ollama server listens on port 11434 by default
curl http://localhost:11434
# Expected output: Ollama is running
If you see “Ollama is running”, the server is up and ready to serve models. Should that command fail, start the server manually with ollama serve in a separate terminal.
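If you want the same check inside a script, a small standard-library helper is enough. This is a minimal sketch; the function name `is_ollama_up` is hypothetical, and it simply looks for the banner text shown above.

```python
import urllib.error
import urllib.request


def is_ollama_up(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if the server root answers with the 'Ollama is running' banner."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return resp.status == 200 and "Ollama is running" in body
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or DNS failure: treat as "not running"
        return False
```

A setup script could call `is_ollama_up()` first and print a hint to run ollama serve when it returns False.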
Step 2: Run Your First Model
The ollama run command pulls a model if you do not have it yet, then drops you into an interactive chat. Start with Llama 3.2 3B, which is small enough to run almost anywhere.
# Pulls the model on first run, then starts an interactive session
ollama run llama3.2
The first run downloads roughly 2 GB of weights, so give it a minute or more on a slow connection. Once the download finishes, you get a prompt where you can type questions directly.
>>> Explain what a vector database does in two sentences.
A vector database stores high-dimensional embeddings and indexes them for
fast similarity search. It lets applications find semantically related items
by comparing vector distances rather than exact keyword matches.
>>> /bye
Type /bye or press Ctrl+D to exit. That session you just ran used zero cloud calls. The model loaded into memory, answered, and stayed cached on disk for next time.
To pull a model without starting a chat, use ollama pull. This is useful in setup scripts where you want the download to happen ahead of time.
# Download a model without entering interactive mode
ollama pull qwen2.5:7b
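For a setup script, one possible approach is to compare the models you want against the output of ollama list and pull only what is missing. The sketch below assumes the listing has a header row followed by one model per line with the name in the first column; the helper names are hypothetical.

```python
import subprocess


def normalize(name: str) -> str:
    """Add the implicit ':latest' tag when a model name has no tag."""
    return name if ":" in name else f"{name}:latest"


def installed_models(list_output: str) -> set[str]:
    """Parse `ollama list` text: skip the header row, take the first column."""
    rows = list_output.strip().splitlines()[1:]
    return {row.split()[0] for row in rows if row.strip()}


def missing_models(wanted: list[str], list_output: str) -> list[str]:
    """Models from `wanted` that do not appear in the `ollama list` output."""
    have = installed_models(list_output)
    return [m for m in wanted if normalize(m) not in have]


def pull_missing(wanted: list[str]) -> None:
    """Run `ollama pull` for each model that is not downloaded yet."""
    listing = subprocess.run(
        ["ollama", "list"], capture_output=True, text=True, check=True
    ).stdout
    for model in missing_models(wanted, listing):
        subprocess.run(["ollama", "pull", model], check=True)
```

Calling `pull_missing(["llama3.2", "qwen2.5:7b"])` would then download only the models that are not already cached.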
Step 3: Manage Your Models
Once you have a few models, you will want to see what is installed, what is loaded, and how to clean up. These commands cover the daily workflow.
# List all downloaded models with size and modified date
ollama list
# Show models currently loaded in memory (RAM/VRAM)
ollama ps
# Remove a model you no longer need to free disk space
ollama rm qwen2.5:7b
# Copy a model under a new name (useful before customizing)
ollama cp llama3.2 my-llama
The ollama ps command matters more than it looks. Ollama keeps a model in memory for five minutes after its last use by default, then unloads it to free resources. If a request feels slow, the model may have unloaded and needs to reload from disk first.
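If reload latency is a problem, the API accepts a `keep_alive` field that extends the window, and a request with a model name but no prompt loads the model without generating anything. The sketch below warms a model ahead of time; the helper names and the "30m" value are assumptions chosen for illustration.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"


def preload_payload(model: str, keep_alive: str = "30m") -> dict:
    """Request body with no prompt: asks the server to load the model only."""
    return {"model": model, "keep_alive": keep_alive}


def preload(model: str, keep_alive: str = "30m") -> None:
    """Load `model` into memory ahead of the first real request."""
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(preload_payload(model, keep_alive)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        resp.read()  # drain the response; the body carries no generated text
```

Calling `preload("llama3.2")` at application startup moves the load cost out of the first user-facing request. A longer keep-alive holds memory for longer, so pick a value that matches how often you actually call the model.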
Step 4: Call the Ollama REST API
The interactive shell is handy for testing, but the real value comes from the API. Ollama exposes a REST endpoint on localhost:11434 that any language can call. Here is the core chat endpoint.
# Send a chat request and get the full response as a single JSON object
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    { "role": "user", "content": "Write a one-line git command to undo the last commit." }
  ],
  "stream": false
}'
The response is a JSON object containing the assistant message. By default the API streams tokens as they generate; setting "stream": false returns the full response at once, which is simpler for scripts.
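When you leave streaming on, the body arrives as one JSON object per line, each carrying a fragment of the assistant message, with the last object marked "done": true. A minimal sketch of reassembling that stream follows; the sample lines are hand-written for illustration, not captured output.

```python
import json
from typing import Iterable, Iterator


def iter_tokens(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from a newline-delimited JSON chat stream."""
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines
        chunk = json.loads(raw)
        yield chunk.get("message", {}).get("content", "")
        if chunk.get("done"):
            break


# Hypothetical sample of what a streamed reply might look like on the wire
sample = [
    b'{"message": {"role": "assistant", "content": "git reset "}, "done": false}\n',
    b'{"message": {"role": "assistant", "content": "--soft HEAD~1"}, "done": false}\n',
    b'{"message": {"role": "assistant", "content": ""}, "done": true}\n',
]
print("".join(iter_tokens(sample)))
```

Because `urllib.request.urlopen` returns an object you can iterate line by line, the same function should work on a live response, printing fragments as they arrive.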
For most application code, you will want a proper client. The official Python library wraps the API cleanly.
# pip install ollama
import ollama


def summarize(text: str) -> str:
    """Send text to a local model and return a short summary."""
    response = ollama.chat(
        model="llama3.2",
        messages=[
            {"role": "system", "content": "You summarize text in one sentence."},
            {"role": "user", "content": text},
        ],
    )
    return response["message"]["content"]


if __name__ == "__main__":
    article = "Ollama lets developers run language models locally..."
    print(summarize(article))
This code calls a model running entirely on your machine. The system message sets behavior, and the user message carries the input. Because there is no network round-trip to a provider, latency depends on your hardware, the model size, and the prompt length rather than on a remote service.
Using the OpenAI-Compatible Endpoint
Ollama also exposes an OpenAI-compatible endpoint. This means you can point existing OpenAI SDK code at your local server by changing two lines, which makes migrating prototypes painless.
# pip install openai
from openai import OpenAI
# Point the standard OpenAI client at the local Ollama server
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the SDK but ignored by Ollama
)

completion = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Name three uses for local LLMs."}],
)
print(completion.choices[0].message.content)
Because the interface matches OpenAI, the same pattern works with most frameworks that accept a custom base URL. If you already route requests through a gateway, you can register Ollama as one more backend. For a deeper look at unifying providers behind a single interface, see our guide on setting up LiteLLM as a unified LLM gateway.
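One way to keep a prototype portable is to choose the client settings from the environment, so the same code runs against Ollama locally and a hosted endpoint elsewhere. This is a sketch under assumptions: the `USE_LOCAL_LLM` variable and the helper name are hypothetical, not part of either SDK.

```python
import os


def openai_client_kwargs(env=None) -> dict:
    """Pick base_url and api_key for the OpenAI SDK from environment variables."""
    env = os.environ if env is None else env
    if env.get("USE_LOCAL_LLM", "1") == "1":
        # Local Ollama: the key is a placeholder the server ignores
        return {"base_url": "http://localhost:11434/v1", "api_key": "ollama"}
    # Hosted: leave base_url unset so the SDK uses its default endpoint
    return {"api_key": env["OPENAI_API_KEY"]}
```

You would then build the client with `OpenAI(**openai_client_kwargs())`. The model name still has to match the backend, so it is worth making that configurable as well.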
Step 5: Customize a Model With a Modelfile
A Modelfile lets you bake a system prompt and parameters into a reusable model. Instead of repeating the same setup in every request, you define it once. This resembles a Dockerfile in structure.
# File: Modelfile
FROM llama3.2
# Lower temperature for more deterministic answers
PARAMETER temperature 0.3
# Set a larger context window (in tokens)
PARAMETER num_ctx 8192
# Bake in a persistent system prompt
SYSTEM """
You are a senior backend engineer. Answer concisely with code when relevant.
Prefer production patterns over toy examples.
"""
Build and run the custom model with two commands. From then on, every session inherits the system prompt and parameters.
# Create a named model from the Modelfile
ollama create backend-helper -f Modelfile
# Run it like any other model
ollama run backend-helper
The num_ctx parameter controls how many tokens the model can consider at once. Larger context uses more memory, so raise it only when your prompts genuinely need the room.
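To see why context costs memory, it helps to estimate the key-value cache, which grows linearly with num_ctx. The sketch below uses the standard per-token formula (keys plus values, per layer, per KV head) with 16-bit values; the architecture numbers are assumed, illustrative values for a 3B-class model, not figures published in this guide.

```python
def kv_cache_gib(num_ctx, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV-cache size: keys and values for every layer and token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * num_ctx / 2**30


# Assumed, illustrative architecture values (roughly a 3B-class model)
ARCH = dict(n_layers=28, n_kv_heads=8, head_dim=128)

for ctx in (2048, 8192, 32768):
    print(f"num_ctx={ctx}: ~{kv_cache_gib(ctx, **ARCH):.2f} GiB of KV cache")
```

Under these assumptions, 8192 tokens of context costs just under 0.9 GiB on top of the weights, and quadrupling the context quadruples that figure. Larger models have more layers and heads, so the same num_ctx costs more.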
Choosing the Right Model for Your Hardware
Model selection is where most beginners stumble. Pulling a 70B model onto a 16 GB laptop leads to painfully slow responses or outright failures. Match the model to your available memory instead.
| Model | Parameters | Approx. memory (4-bit) | Good for |
|---|---|---|---|
| llama3.2:1b | 1B | ~1.5 GB | Quick tasks, low-end machines |
| llama3.2:3b | 3B | ~2.5 GB | General chat, fast responses |
| qwen2.5:7b | 7B | ~5 GB | Coding, reasoning, 16 GB laptops |
| gemma2:9b | 9B | ~6.5 GB | Strong general quality |
| qwen2.5:14b | 14B | ~9 GB | Better reasoning, 32 GB machines |
| llama3.3:70b | 70B | ~40 GB | High quality, workstation/server only |
The part after the colon is the tag, which usually indicates parameter count. Tags can also specify a variant and quantization level, such as llama3.2:3b-instruct-q4_K_M. Lower-bit quantization (q4) shrinks the model and speeds it up at a small cost to quality, while higher-bit quantization (q8) keeps more quality but uses more memory.
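If you handle model references in scripts, a small parser makes the naming scheme concrete. This is a heuristic sketch based only on the tag shapes shown in this guide; the `ModelRef` type and the regular expressions are assumptions, and unusual tags may not match.

```python
import re
from typing import NamedTuple, Optional


class ModelRef(NamedTuple):
    name: str
    size: Optional[str]
    quant: Optional[str]


def parse_model_ref(ref: str) -> ModelRef:
    """Heuristically split 'name:tag' into name, size, and quantization."""
    name, _, tag = ref.partition(":")
    size = quant = None
    for part in tag.split("-"):
        if re.fullmatch(r"\d+(\.\d+)?[bm]", part, flags=re.IGNORECASE):
            size = part  # e.g. "3b" or "7b"
        elif re.fullmatch(r"(q\d\w*|fp16|f16)", part, flags=re.IGNORECASE):
            quant = part  # e.g. "q4_K_M"
    return ModelRef(name, size, quant)


print(parse_model_ref("llama3.2:3b-instruct-q4_K_M"))
```

A reference with no tag at all, such as llama3.2, yields no size or quantization here; Ollama resolves it to whatever the model's default tag points at.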
For most laptop use, a 4-bit quantized 7B model hits the sweet spot between quality and speed. If you want a focused tutorial on the open-source models behind these tags, our walkthrough on building a documentation chatbot with open-source LLMs goes deeper on model behavior.
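The table can also be expressed as a small lookup that picks the largest model fitting your free memory. The sketch below copies the approximate 4-bit figures from the table and reserves an assumed 2 GB of headroom for the operating system and context; both the headroom value and the function name are illustrative.

```python
# (tag, approximate 4-bit memory in GB), copied from the table above
MODELS = [
    ("llama3.2:1b", 1.5),
    ("llama3.2:3b", 2.5),
    ("qwen2.5:7b", 5.0),
    ("gemma2:9b", 6.5),
    ("qwen2.5:14b", 9.0),
    ("llama3.3:70b", 40.0),
]


def pick_model(free_gb: float, headroom_gb: float = 2.0):
    """Largest listed model that fits after reserving headroom, else None."""
    budget = free_gb - headroom_gb
    fitting = [(mem, tag) for tag, mem in MODELS if mem <= budget]
    return max(fitting)[1] if fitting else None


for free in (4, 8, 16, 64):
    print(f"{free} GB free -> {pick_model(free)}")
```

Note that `free_gb` means memory actually available, not total installed RAM. A 16 GB laptop with a browser and an IDE open may have far less than 16 GB to spare.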
Real-World Scenario: A Private Code Review Helper
Consider a small backend team that cannot send proprietary code to a third-party API for compliance reasons. Over the course of a sprint, they want a local assistant that reviews diffs and flags obvious issues, running on each developer’s laptop rather than a shared cloud account.
With Ollama, the setup stays simple. Each developer pulls a 7B coding model such as qwen2.5-coder:7b, wraps it in a Modelfile with a review-focused system prompt, and calls it from a Git pre-commit hook. Because inference runs locally, no source code leaves the machine, which satisfies the compliance constraint without a vendor security review.
The trade-off is honest: a 7B local model will not match a frontier cloud model on subtle bugs. In practice, teams use it as a fast first pass for style issues, missing error handling, and obvious mistakes, then reserve human review for logic. The win is privacy and zero marginal cost, not state-of-the-art accuracy. For teams weighing local models against retrieval-based approaches, our comparison of fine-tuning versus RAG covers when each makes sense.
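A pre-commit review along these lines might look like the sketch below. It is one possible shape, not the team's actual hook: it passes the system prompt in the request instead of baking it into a Modelfile, and the prompt wording, the `MAX_DIFF_CHARS` cap, and the function names are all assumptions.

```python
import json
import subprocess
import urllib.request

MODEL = "qwen2.5-coder:7b"
MAX_DIFF_CHARS = 12_000  # assumed cap so the prompt stays within the context window


def build_messages(diff: str) -> list[dict]:
    """Wrap a (possibly truncated) diff in a review-focused chat prompt."""
    return [
        {"role": "system", "content": "You review diffs. Flag style issues, "
                                      "missing error handling, and obvious bugs."},
        {"role": "user", "content": diff[:MAX_DIFF_CHARS]},
    ]


def review_staged_changes() -> str:
    """Send the staged diff to the local model and return its comments."""
    diff = subprocess.run(
        ["git", "diff", "--cached"], capture_output=True, text=True, check=True
    ).stdout
    if not diff.strip():
        return "No staged changes."
    body = {"model": MODEL, "messages": build_messages(diff), "stream": False}
    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.loads(resp.read())["message"]["content"]
```

A hook script would print the result of `review_staged_changes()` and exit with status 0, so the review stays advisory and never blocks a commit on a model's opinion.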
When to Use Ollama Local LLMs
- You need data privacy and cannot send prompts to a cloud provider
- You want to prototype without accumulating per-token API costs
- You work offline or in environments with restricted network access
- You are testing prompts and want fast, free iteration
- You need a small or mid-sized model and have at least 8–16 GB of memory
When NOT to Use Ollama Local LLMs
- You require the absolute best reasoning quality that only large frontier models provide
- Your laptop has under 8 GB of free memory and cannot fit a usable model
- You need to serve high-concurrency production traffic (look at vLLM or a hosted API instead)
- Your task depends on the latest closed models unavailable as open weights
Common Mistakes With Ollama Local LLMs
- Pulling a model too large for your RAM, causing swapping and extreme slowness
- Forgetting that the first request after idle reloads the model from disk, which adds latency
- Leaving num_ctx at a huge value and running out of memory on long prompts
- Assuming local model quality matches frontier APIs and over-trusting the output
- Running CPU-only on a machine with a capable GPU because drivers were not installed
How Ollama Compares to Cloud Inference
Local inference and cloud APIs solve different problems. Ollama wins on privacy, cost during development, and offline access. Cloud APIs win on raw model quality and on serving many concurrent users without managing hardware. Many teams use both: Ollama for local development and sensitive workloads, and a hosted API for production scale.
If your bottleneck is inference speed rather than privacy, a specialized cloud provider may be faster than your laptop. Our breakdown of the fastest LLM inference with Groq shows how purpose-built hardware changes the latency math. For building retrieval features on top of any model, local or hosted, start with our guide on building RAG from scratch.
Conclusion
Ollama makes running local LLMs genuinely a five-minute task: install, ollama run llama3.2, and you have a private model answering in your terminal and over a REST API. The keys to a good experience are matching the model size to your hardware and remembering that local quality trades off against frontier cloud models. Start by pulling a 4-bit 7B model today and pointing your existing OpenAI client at localhost:11434 to see how far Ollama local LLMs take your next prototype. From there, explore wrapping it behind a gateway with LiteLLM or layering retrieval on top with a vector database.