Ollama: Running Local LLMs on Your Laptop in 5 Minutes

If you want to experiment with local LLMs through Ollama without sending a single token to a cloud provider, this guide is for you. It targets developers who already build with APIs and want a private, offline-friendly way to run models like Llama 3.2, Gemma, and Qwen on their own hardware. By the end, you will have Ollama installed, a model answering questions in your terminal, and a working REST API you can call from Python or Node.

Running models locally solves real problems. You avoid per-token billing during development, you keep sensitive data on your own machine, and you can keep working on a plane with no connection. The trade-off is that local hardware caps how large a model you can run, so part of this tutorial covers picking a model that actually fits your laptop.

What Is Ollama?

Ollama is an open-source tool that downloads, manages, and serves large language models on your own machine. It wraps the llama.cpp inference engine behind a simple CLI and a local REST API, so you can run a quantized model with one command instead of compiling C++ or wiring up Python dependencies. It runs on macOS, Windows, and Linux, with GPU acceleration where available.

Think of Ollama as Docker for language models. You pull a model from a registry, run it, and it stays cached locally for next time. The project handles the messy parts: downloading the right quantized weights, loading them onto your GPU or CPU, and exposing a consistent API that mirrors the OpenAI chat format.

Prerequisites

You do not need much to get started. However, the model you choose has to fit in memory, so hardware matters more than software here.

RAM or VRAM: At least 8 GB of free memory for small models (1B–8B parameters). 16 GB is comfortable, and 32 GB+ opens up larger models.
Disk space: Each model is a few gigabytes. Budget 5–10 GB to start, more if you collect several.
OS: macOS 14 (Sonoma) or later, Windows 10/11, or a modern Linux distribution.
Optional GPU: An NVIDIA GPU (with CUDA) or an Apple Silicon Mac dramatically speeds up inference. CPU-only works but runs slower.

A rough rule of thumb: a 7B–8B model at 4-bit quantization needs about 5–6 GB of memory, while a 70B model needs roughly 40 GB or more. If you only have 16 GB, stick to models in the 1B–14B range.
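
The rule of thumb above can be turned into a quick back-of-the-envelope calculation. The sketch below assumes the weights dominate memory use and adds a flat 20% overhead for the KV cache and runtime buffers. That overhead factor is an illustrative assumption, not a figure published by Ollama, and estimate_memory_gb is a hypothetical helper name.

```python
def estimate_memory_gb(params_billions: float, bits: int = 4,
                       overhead: float = 1.2) -> float:
    """Rough memory estimate in GB for a quantized model.

    Weights take parameters * bits / 8 bytes. `overhead` is an assumed
    multiplier for the KV cache and runtime buffers, not an Ollama figure.
    """
    weight_gb = params_billions * bits / 8  # 1B parameters at 8 bits is ~1 GB
    return round(weight_gb * overhead, 1)


if __name__ == "__main__":
    for size in (3, 8, 14, 70):
        print(f"{size}B at 4-bit: ~{estimate_memory_gb(size)} GB")
```

Under these assumptions an 8B model lands just under 5 GB and a 70B model around 42 GB, which is in line with the figures above. Long contexts push the real number higher.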

Step 1: Install Ollama

On macOS and Linux, the fastest path is the official install script. Run this in your terminal.

# Download and run the official installer
curl -fsSL https://ollama.com/install.sh | sh

# Verify the install
ollama --version
# Expected output: ollama version is 0.x.x

On Windows, download the installer from the official site and run it. The installer sets up Ollama as a background service that starts automatically. For macOS, you can alternatively grab the DMG and drag the app to Applications.

After installation, Ollama runs a local server in the background. You can confirm it is listening by checking the default port.

# The Ollama server listens on port 11434 by default
curl http://localhost:11434
# Expected output: Ollama is running

If you see “Ollama is running”, the server is up and ready to serve models. Should that command fail, start the server manually with ollama serve in a separate terminal.
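
The same check is handy inside scripts that should fail fast when the server is down. Here is a minimal sketch using only the Python standard library; server_is_up is a hypothetical helper, and it only confirms that something answers HTTP 200 on the default port, not that a model is loaded.

```python
import urllib.error
import urllib.request


def server_is_up(base_url: str = "http://localhost:11434",
                 timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers 200 at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status
        return False


if __name__ == "__main__":
    print("Ollama reachable:", server_is_up())
```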

Step 2: Run Your First Model

The ollama run command pulls a model if you do not have it yet, then drops you into an interactive chat. Start with Llama 3.2 3B, which is small enough to run almost anywhere.

# Pulls the model on first run, then starts an interactive session
ollama run llama3.2

The first run downloads roughly 2 GB for the default 3B model, so give it a minute or two. Once the download finishes, you get a prompt where you can type questions directly.

>>> Explain what a vector database does in two sentences.
A vector database stores high-dimensional embeddings and indexes them for
fast similarity search. It lets applications find semantically related items
by comparing vector distances rather than exact keyword matches.

>>> /bye

Type /bye or press Ctrl+D to exit. That session you just ran used zero cloud calls. The model loaded into memory, answered, and stayed cached on disk for next time.

To pull a model without starting a chat, use ollama pull. This is useful in setup scripts where you want the download to happen ahead of time.

# Download a model without entering interactive mode
ollama pull qwen2.5:7b
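
For a setup script, the pull step can be wrapped in a few lines of Python. This is a sketch that shells out to the ollama pull command shown above; pull_command and pull_models are hypothetical helper names, and the run argument exists only so the function can be exercised without the CLI installed.

```python
import subprocess
from typing import Callable, Iterable, List


def pull_command(model: str) -> List[str]:
    """Build the argument list for `ollama pull <model>`."""
    return ["ollama", "pull", model]


def pull_models(models: Iterable[str],
                run: Callable = subprocess.run) -> List[str]:
    """Pull each model in turn and return the names that were pulled.

    With the default runner, a failed pull raises CalledProcessError,
    which stops the script at the first problem.
    """
    pulled = []
    for model in models:
        run(pull_command(model), check=True)
        pulled.append(model)
    return pulled
```

Calling pull_models(["llama3.2", "qwen2.5:7b"]) at the start of a provisioning script downloads both models before anything else depends on them.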

Step 3: Manage Your Models

Once you have a few models, you will want to see what is installed, what is loaded, and how to clean up. These commands cover the daily workflow.

# List all downloaded models with size and modified date
ollama list

# Show models currently loaded in memory (RAM/VRAM)
ollama ps

# Remove a model you no longer need to free disk space
ollama rm qwen2.5:7b

# Copy a model under a new name (useful before customizing)
ollama cp llama3.2 my-llama

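The ollama ps command matters more than it looks. By default, Ollama keeps a model in memory for about five minutes after its last use, then unloads it to free resources. If a request feels slow, the model may have unloaded and needs to reload from disk first. You can change the window with the keep_alive request parameter or the OLLAMA_KEEP_ALIVE environment variable.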
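
How long a model stays loaded can also be controlled per request. The sketch below assumes two behaviors of the /api/chat endpoint: a keep_alive field that sets the retention time, and an empty messages list that loads the model without generating text. Check both against the API reference for your Ollama version. The helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default local address


def keep_alive_payload(model: str, keep_alive) -> dict:
    """Build a body that loads `model` and sets how long it stays loaded.

    keep_alive takes a duration string such as "30m"; 0 asks Ollama to
    unload the model right away.
    """
    return {"model": model, "messages": [], "stream": False,
            "keep_alive": keep_alive}


def send(payload: dict, base_url: str = OLLAMA_URL) -> dict:
    """POST a JSON payload to /api/chat and return the decoded reply."""
    req = urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())
```

Calling send(keep_alive_payload("llama3.2", "30m")) before a work session warms the model up, and send(keep_alive_payload("llama3.2", 0)) frees the memory when you are done.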

Step 4: Call the Ollama REST API

The interactive shell is handy for testing, but the real value comes from the API. Ollama exposes a REST endpoint on localhost:11434 that any language can call. Here is the core chat endpoint.

# Send a chat request and get the full response in one JSON object (no streaming)
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    { "role": "user", "content": "Write a one-line git command to undo the last commit." }
  ],
  "stream": false
}'

The response is a JSON object containing the assistant message. By default the API streams tokens as they are generated; setting "stream": false returns the full response at once, which is simpler for scripts.
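
When you do want streaming, the response arrives as one JSON object per line. The sketch below uses only the standard library and assumes each line carries a message.content fragment plus a done flag, as the API emits in streaming mode; iter_content and stream_chat are hypothetical helper names.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def iter_content(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from a streamed /api/chat response."""
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines between objects
        chunk = json.loads(raw)
        yield chunk.get("message", {}).get("content", "")
        if chunk.get("done"):
            break  # the final object marks the end of the reply


def stream_chat(prompt: str, model: str = "llama3.2",
                base_url: str = "http://localhost:11434") -> str:
    """Print fragments as they arrive and return the full reply."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/api/chat", data=body,
        headers={"Content-Type": "application/json"},
    )
    parts = []
    with urllib.request.urlopen(req) as resp:
        for piece in iter_content(resp):  # the response iterates line by line
            print(piece, end="", flush=True)
            parts.append(piece)
    print()
    return "".join(parts)
```

Keeping the parsing in iter_content separate from the network call makes it easy to test against canned lines without a running server.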

For most application code, you will want a proper client. The official Python library wraps the API cleanly.

# pip install ollama
import ollama

def summarize(text: str) -> str:
    """Send text to a local model and return a short summary."""
    response = ollama.chat(
        model="llama3.2",
        messages=[
            {"role": "system", "content": "You summarize text in one sentence."},
            {"role": "user", "content": text},
        ],
    )
    return response["message"]["content"]

if __name__ == "__main__":
    article = "Ollama lets developers run language models locally..."
    print(summarize(article))

This code calls a model running entirely on your machine. The system message sets behavior, and the user message carries the input. Because there is no network round-trip to a provider, latency depends on your hardware, the model size, and the length of the prompt and reply.

Using the OpenAI-Compatible Endpoint

Ollama also exposes an OpenAI-compatible endpoint. This means you can point existing OpenAI SDK code at your local server by changing two lines, which makes migrating prototypes painless.

# pip install openai
from openai import OpenAI

# Point the standard OpenAI client at the local Ollama server
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the SDK but ignored by Ollama
)

completion = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Name three uses for local LLMs."}],
)

print(completion.choices[0].message.content)

Because the interface matches OpenAI, the same pattern works with most frameworks that accept a custom base URL. If you already route requests through a gateway, you can register Ollama as one more backend. For a deeper look at unifying providers behind a single interface, see our guide on setting up LiteLLM as a unified LLM gateway.
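
One way to keep a prototype portable is to choose the client settings from the environment, so the same code can target Ollama or a hosted API. This is a sketch under stated assumptions: LLM_BACKEND is a hypothetical variable name invented for this example, while OPENAI_API_KEY is the variable the OpenAI SDK conventionally reads.

```python
import os
from typing import Mapping, Optional


def client_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return keyword arguments for OpenAI(...).

    LLM_BACKEND is a hypothetical switch for this sketch: "local" (the
    default) targets Ollama, any other value targets the hosted API.
    """
    env = os.environ if env is None else env
    if env.get("LLM_BACKEND", "local") == "local":
        # Ollama ignores the key, but the SDK requires a non-empty value.
        return {"base_url": "http://localhost:11434/v1", "api_key": "ollama"}
    # Hosted: omit base_url so the SDK falls back to its default endpoint.
    return {"api_key": env["OPENAI_API_KEY"]}
```

With the openai package installed, client = OpenAI(**client_settings()) then works unchanged in both modes. Model names still differ between backends, so the model argument needs the same kind of switch.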

Step 5: Customize a Model With a Modelfile

A Modelfile lets you bake a system prompt and parameters into a reusable model. Instead of repeating the same setup in every request, you define it once. This resembles a Dockerfile in structure.

# File: Modelfile
FROM llama3.2

# Lower temperature for more deterministic answers
PARAMETER temperature 0.3

# Set a larger context window (in tokens)
PARAMETER num_ctx 8192

# Bake in a persistent system prompt
SYSTEM """
You are a senior backend engineer. Answer concisely with code when relevant.
Prefer production patterns over toy examples.
"""

Build and run the custom model with two commands. From then on, every session inherits the system prompt and parameters.

# Create a named model from the Modelfile
ollama create backend-helper -f Modelfile

# Run it like any other model
ollama run backend-helper

The num_ctx parameter controls how many tokens the model can consider at once. Larger context uses more memory, so raise it only when your prompts genuinely need the room.
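
If you would rather not create a named model, the same parameters can be sent with a single request through the options field of the chat endpoint. The sketch below builds such a request body; chat_payload is a hypothetical helper, and the values mirror the Modelfile above.

```python
import json


def chat_payload(prompt: str, model: str = "llama3.2",
                 temperature: float = 0.3, num_ctx: int = 8192) -> dict:
    """Build an /api/chat body that sets parameters for this request only."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "options": {"temperature": temperature, "num_ctx": num_ctx},
    }


if __name__ == "__main__":
    # Show the JSON that would be POSTed to http://localhost:11434/api/chat
    print(json.dumps(chat_payload("Explain connection pooling."), indent=2))
```

The official Python library accepts the same dictionary through its options argument, for example ollama.chat(model="llama3.2", messages=[...], options={"num_ctx": 8192}). A Modelfile is still the better fit when many callers should share one configuration.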

Choosing the Right Model for Your Hardware

Model selection is where most beginners stumble. Pulling a 70B model onto a 16 GB laptop leads to painfully slow responses or outright failures. Match the model to your available memory instead.

Model	Parameters	Approx. memory (4-bit)	Good for
llama3.2:1b	1B	~1.5 GB	Quick tasks, low-end machines
llama3.2:3b	3B	~2.5 GB	General chat, fast responses
qwen2.5:7b	7B	~5 GB	Coding, reasoning, 16 GB laptops
gemma2:9b	9B	~6.5 GB	Strong general quality
qwen2.5:14b	14B	~9 GB	Better reasoning, 32 GB machines
llama3.3:70b	70B	~40 GB	High quality, workstation/server only

The text after the colon is the tag, which usually indicates parameter count. Tags can also specify quantization level, such as llama3.2:3b-instruct-q4_K_M. Lower-bit quantization (q4) shrinks the model and speeds it up at a small cost to quality, while higher precision (q8) keeps more quality but uses more memory.
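
To make the naming concrete, here is a small parser that splits a model reference into its parts. It is only a sketch: the quantization check (a final hyphen-separated segment starting with "q" or "fp") is a heuristic assumed for this example rather than an official naming rule, and ModelRef and parse_model_ref are hypothetical names.

```python
from typing import NamedTuple, Optional


class ModelRef(NamedTuple):
    name: str
    tag: str
    quantization: Optional[str]


def parse_model_ref(ref: str) -> ModelRef:
    """Split 'name:tag' and guess the quantization suffix, if any."""
    name, _, tag = ref.partition(":")
    tag = tag or "latest"  # a bare name resolves to the 'latest' tag
    last = tag.rsplit("-", 1)[-1]
    is_quant = "-" in tag and last.lower().startswith(("q", "fp"))
    return ModelRef(name, tag, last if is_quant else None)


if __name__ == "__main__":
    for ref in ("llama3.2", "qwen2.5:7b", "llama3.2:3b-instruct-q4_K_M"):
        print(parse_model_ref(ref))
```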

For most laptop use, a 4-bit quantized 7B model hits the sweet spot between quality and speed. If you want a focused tutorial on the open-source models behind these tags, our walkthrough on building a documentation chatbot with open-source LLMs goes deeper on model behavior.
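
The table can double as a lookup for a quick sizing helper. The sketch below picks the largest listed model that fits a memory budget. The 50% headroom factor is an assumption made for this example, meant to leave room for the operating system, other applications, and the context window, and largest_model_that_fits is a hypothetical name.

```python
from typing import Optional

# (model, approx. memory in GB at 4-bit), copied from the table above
MODELS = [
    ("llama3.2:1b", 1.5),
    ("llama3.2:3b", 2.5),
    ("qwen2.5:7b", 5.0),
    ("gemma2:9b", 6.5),
    ("qwen2.5:14b", 9.0),
    ("llama3.3:70b", 40.0),
]


def largest_model_that_fits(total_gb: float,
                            headroom: float = 0.5) -> Optional[str]:
    """Return the largest table entry that fits in total_gb * headroom.

    headroom=0.5 is an assumed safety margin, not an Ollama guideline.
    Returns None when nothing in the table fits.
    """
    budget = total_gb * headroom
    fitting = [name for name, need in MODELS if need <= budget]
    return fitting[-1] if fitting else None


if __name__ == "__main__":
    for ram in (8, 16, 32, 96):
        print(f"{ram} GB -> {largest_model_that_fits(ram)}")
```

Treat the result as a starting point. If responses are slow or the machine starts swapping, drop down one size.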

Real-World Scenario: A Private Code Review Helper

Consider a small backend team that cannot send proprietary code to a third-party API for compliance reasons. Over the course of a sprint, they want a local assistant that reviews diffs and flags obvious issues, running on each developer’s laptop rather than a shared cloud account.

With Ollama, the setup stays simple. Each developer pulls a 7B coding model such as qwen2.5-coder:7b, wraps it in a Modelfile with a review-focused system prompt, and calls it from a Git pre-commit hook. Because inference runs locally, no source code leaves the machine, which satisfies the compliance constraint without a vendor security review.

The trade-off is real: a 7B local model will not match a frontier cloud model on subtle bugs. In practice, teams use it as a fast first pass for style issues, missing error handling, and obvious mistakes, then reserve human review for logic. The win is privacy and zero marginal cost, not state-of-the-art accuracy. For teams weighing local models against retrieval-based approaches, our comparison of fine-tuning versus RAG covers when each makes sense.
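
A pre-commit hook for this scenario could look roughly like the sketch below. It is illustrative, not the team's actual setup: it sends the system prompt with each request instead of baking it into a Modelfile, the prompt wording and function names are invented for this example, and it never blocks the commit.

```python
import json
import subprocess
import urllib.request

# Hypothetical review prompt, written for this sketch
REVIEW_PROMPT = (
    "You review code diffs. List style issues, missing error handling, "
    "and obvious mistakes as short bullet points."
)


def staged_diff() -> str:
    """Return the diff of staged changes, i.e. what is about to be committed."""
    result = subprocess.run(["git", "diff", "--cached"],
                            capture_output=True, text=True, check=True)
    return result.stdout


def build_review_request(diff: str, model: str = "qwen2.5-coder:7b") -> dict:
    """Build the /api/chat body for reviewing one diff."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": REVIEW_PROMPT},
            {"role": "user", "content": diff},
        ],
        "stream": False,
    }


def review(diff: str, base_url: str = "http://localhost:11434") -> str:
    """Send the diff to the local server and return the review text."""
    body = json.dumps(build_review_request(diff)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/api/chat", data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.loads(resp.read())["message"]["content"]


def main() -> int:
    """Print a review of the staged diff; always return 0 (advisory only)."""
    diff = staged_diff()
    if diff.strip():
        print(review(diff))
    return 0
```

Saved as an executable .git/hooks/pre-commit script that ends with raise SystemExit(main()), this prints the review before each commit. Large diffs can exceed the context window, so a real hook would likely truncate or split the input.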

When to Use Ollama Local LLMs

You need data privacy and cannot send prompts to a cloud provider
You want to prototype without accumulating per-token API costs
You work offline or in environments with restricted network access
You are testing prompts and want fast, free iteration
You need a small or mid-sized model and have at least 8–16 GB of memory

When NOT to Use Ollama Local LLMs

You require the absolute best reasoning quality that only large frontier models provide
Your laptop has under 8 GB of free memory and cannot fit a usable model
You need to serve high-concurrency production traffic (look at vLLM or a hosted API instead)
Your task depends on the latest closed models unavailable as open weights

Common Mistakes With Ollama Local LLMs

Pulling a model too large for your RAM, causing swapping and extreme slowness
Forgetting that the first request after idle reloads the model from disk, which adds latency
Leaving num_ctx at a huge value and running out of memory on long prompts
Assuming local model quality matches frontier APIs and over-trusting the output
Running CPU-only on a machine with a capable GPU because drivers were not installed

How Ollama Compares to Cloud Inference

Local inference and cloud APIs solve different problems. Ollama wins on privacy, cost during development, and offline access. Cloud APIs win on raw model quality and on serving many concurrent users without managing hardware. Many teams use both: Ollama for local development and sensitive workloads, and a hosted API for production scale.

If your bottleneck is inference speed rather than privacy, a specialized cloud provider may be faster than your laptop. Our breakdown of the fastest LLM inference with Groq shows how purpose-built hardware changes the latency math. For building retrieval features on top of any model, local or hosted, start with our guide on building RAG from scratch.

Conclusion

Ollama makes running local LLMs genuinely a five-minute task: install, ollama run llama3.2, and you have a private model answering in your terminal and over a REST API. The keys to a good experience are matching the model size to your hardware and remembering that local quality trades off against frontier cloud models. Start by pulling a 4-bit 7B model today and pointing your existing OpenAI client at http://localhost:11434/v1 to see how far local LLMs with Ollama take your next prototype. From there, explore wrapping it behind a gateway with LiteLLM or layering retrieval on top with a vector database.
