Llama.cpp: Running Quantized LLMs on CPU-Only Machines

If you want to run a capable language model on a laptop or a cheap cloud box with no GPU, llama.cpp is the tool that makes it practical. It loads quantized models that fit in ordinary system RAM, runs inference on the CPU at usable speeds, and exposes an OpenAI-compatible server so your existing code barely changes. This tutorial walks through building llama.cpp, downloading a GGUF model, running your first prompt, serving an API, and tuning throughput so the numbers actually work for production-adjacent workloads.

This guide is for backend and full-stack developers who need local inference without buying a GPU. By the end, you will have a working CPU inference stack, a clear mental model of how quantization trades quality for speed, and a decision framework for when this approach fits and when it does not.

What Is Llama.cpp?

Llama.cpp is an open-source C/C++ inference engine that runs large language models efficiently on commodity hardware, including CPU-only machines. It loads models in the GGUF format, applies aggressive quantization to shrink memory use, and ships a built-in HTTP server. Because the core is plain C++ with no Python runtime, startup is fast and the dependency footprint stays small.

The project started as a way to run Meta’s LLaMA weights on a MacBook, and it has since grown into the foundation under many higher-level tools. If you have used Ollama for local LLMs or LM Studio’s desktop GUI, you have already used llama.cpp indirectly — both wrap this engine. Going direct gives you more control over threads, batching, and memory, which matters when you push for throughput.

Why CPU Inference Is Viable Now

A few years ago, running a useful model without a GPU was a non-starter. Two things changed that. First, quantization improved dramatically, so a 7-billion-parameter model that needs about 28GB at 32-bit precision (14GB at 16-bit) now runs in under 5GB at 4 bits with minimal quality loss. Second, llama.cpp added hand-tuned SIMD kernels (AVX2, AVX-512, ARM NEON) that squeeze real performance out of ordinary CPU cores.

The result is that a modern laptop can generate text at conversational speed for small models. Specifically, an 8B model quantized to 4 bits typically produces several tokens per second on a recent multi-core CPU. That is slow compared to a datacenter GPU, but fine for background jobs, batch processing, or low-traffic internal tools. For high-throughput serving you still want GPUs and a dedicated serving system such as vLLM, but plenty of workloads never reach that scale.
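The memory figures above follow from simple arithmetic: parameter count times bits per weight. The sketch below reproduces them. The 4.85 bits-per-weight value for a 4-bit K-quant is an assumed effective average used for illustration, not a number taken from llama.cpp, and the function covers weights only (no KV cache or runtime overhead).

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate memory for the weights alone, in decimal gigabytes."""
    total_bits = params_billions * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


# 7B parameters at three precisions.
fp32_gb = weight_memory_gb(7, 32)    # 32-bit floats
fp16_gb = weight_memory_gb(7, 16)    # 16-bit floats
q4_gb = weight_memory_gb(7, 4.85)    # ASSUMPTION: effective bits for a 4-bit K-quant

print(f"FP32: {fp32_gb:.1f} GB, FP16: {fp16_gb:.1f} GB, 4-bit: {q4_gb:.1f} GB")
```

Actual usage is somewhat higher than the weights alone, which is why the prerequisites below suggest leaving headroom.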

Prerequisites

Before you build llama.cpp, make sure you have the basics in place. The build is lightweight, so the requirements are modest.

  • A C++ toolchain: gcc/g++ or clang on Linux/macOS, or the MSVC build tools on Windows
  • cmake 3.14 or newer for the build configuration
  • git to clone the repository
  • Roughly 8GB of free RAM for a 7B–8B model at 4-bit quantization (more headroom is better)
  • Python 3.8+ only if you plan to convert your own models to GGUF (optional)

On Ubuntu, you can install the toolchain in one command:

# Install the build toolchain and cmake on Debian/Ubuntu
sudo apt update
sudo apt install -y build-essential cmake git

On macOS, the Xcode command line tools plus Homebrew cover everything:

# Install command line tools, then cmake via Homebrew
xcode-select --install
brew install cmake git

Step 1: Build Llama.cpp From Source

Cloning and building takes a couple of minutes. Building from source matters here because, by default, the build targets the CPU of the machine it runs on and enables the matching SIMD kernels, which directly affects inference speed.

# Clone the repository and enter it
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Configure a release build; llama.cpp auto-detects AVX/NEON support
cmake -B build -DCMAKE_BUILD_TYPE=Release

# Compile using all available cores
cmake --build build --config Release -j $(nproc)

On macOS, replace $(nproc) with $(sysctl -n hw.ncpu). When the build finishes, the binaries land in build/bin/. The two you will use most are llama-cli for one-off prompts and llama-server for the HTTP API.

A quick sanity check confirms the build produced a working binary:

# Print the version and build information
./build/bin/llama-cli --version

To see which CPU features were enabled, check the system_info line that llama.cpp prints when it loads a model (for example, during the first prompt in Step 3). Look for flags like AVX2 = 1 or NEON = 1 in that output. If those are zero or missing on a CPU that supports them, your build did not pick up the right kernels, and inference will be noticeably slower.
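If you want to check those flags in a script rather than by eye, a small parser is enough. This is a minimal sketch that assumes the features appear as NAME = 0 or NAME = 1 pairs. The sample string is made up for illustration and is not captured llama.cpp output, and the list of features to require is a hypothetical choice for an x86 machine.

```python
import re


def parse_cpu_features(system_info):
    """Extract uppercase 'NAME = 0/1' pairs from a system_info-style line."""
    pairs = re.findall(r"\b([A-Z][A-Z0-9_]*)\s*=\s*([01])\b", system_info)
    return {name: flag == "1" for name, flag in pairs}


# ASSUMPTION: illustrative sample, not real log output.
sample = "n_threads = 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0"
features = parse_cpu_features(sample)

# Hypothetical requirement list; adjust to what your CPU actually supports.
required = ("AVX2", "FMA")
missing = [name for name in required if not features.get(name, False)]
print("missing kernels:", missing or "none")
```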

Step 2: Download a Quantized GGUF Model

Llama.cpp runs models in the GGUF format, and the community publishes thousands of pre-quantized GGUF files on Hugging Face. You do not need to convert anything yourself for popular models. The fastest way to fetch one is the Hugging Face CLI.

# Install the Hugging Face CLI
pip install -U "huggingface_hub[cli]"

# Download a 4-bit quantized 8B model (single GGUF file)
hf download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  --include "Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf" \
  --local-dir ./models

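The Q4_K_M suffix tells you the quantization level. That naming scheme looks cryptic at first, so here is how to read it: the number is the nominal bits per weight, and the letters describe the variant. Q4_K_M means 4-bit, K-quant, medium: a balanced default that most people should start with. The effective average is a little higher than the nominal figure because K-quants keep some tensors at higher precision, which is why the 8B file lands near 4.9 GB rather than 4 GB.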

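| Quant  | Bits/weight | Approx. size (8B model) | Quality         | Use when               |
|--------|-------------|-------------------------|-----------------|------------------------|
| Q8_0   | 8           | ~8.5 GB                 | Near-lossless   | You have RAM to spare  |
| Q6_K   | 6           | ~6.6 GB                 | Excellent       | Quality matters most   |
| Q4_K_M | 4           | ~4.9 GB                 | Very good       | Default recommendation |
| Q3_K_M | 3           | ~4.0 GB                 | Noticeable loss | RAM is tight           |
| Q2_K   | 2           | ~3.2 GB                 | Degraded        | Last resort            |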
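
The table can be turned into a small helper that picks the largest quant fitting a given amount of free RAM. This is only a sketch: the sizes are copied from the table for an 8B model, and the 2 GB default headroom is an assumed value standing in for context and runtime overhead.

```python
# Approximate file sizes in GB for an 8B model, copied from the table above.
QUANT_SIZES_GB = {
    "Q8_0": 8.5,
    "Q6_K": 6.6,
    "Q4_K_M": 4.9,
    "Q3_K_M": 4.0,
    "Q2_K": 3.2,
}


def pick_quant(free_ram_gb, headroom_gb=2.0):
    """Return the largest quant that fits, or None if nothing does.

    headroom_gb is an ASSUMED reserve for the KV cache and the OS.
    """
    budget = free_ram_gb - headroom_gb
    fitting = {name: size for name, size in QUANT_SIZES_GB.items() if size <= budget}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


for ram in (4, 8, 16):
    print(f"{ram} GB free ->", pick_quant(ram))
```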

How GGUF compares to other formats like AWQ and GPTQ deserves its own discussion, but for CPU inference GGUF is the format you want.

Step 3: Run Your First Prompt

With a model downloaded, you can generate text immediately. The llama-cli binary handles single prompts and interactive chat.

# Run a single prompt against the downloaded model
./build/bin/llama-cli \
  -m ./models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  -p "Explain what a connection pool is in two sentences." \
  -n 128

The -n 128 flag caps the response at 128 tokens, which keeps test runs short. As the model loads, llama.cpp prints memory details, and after generation it prints timing statistics. The line to watch is the eval speed, reported in tokens per second, which is your real generation throughput. The separate prompt eval figure measures how fast the input was processed.

For an interactive session that keeps context across turns, switch to conversation mode:

# Start an interactive chat session with a system prompt
./build/bin/llama-cli \
  -m ./models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  -cnv \
  -p "You are a concise senior backend engineer."

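The -cnv flag enables conversation mode and applies the model’s built-in chat template, so you do not have to format prompts by hand. Type a message, press enter, and the model responds while remembering the conversation. Flag behavior has shifted between releases: recent builds may turn conversation mode on automatically when the model ships a chat template and may offer a separate flag for the system prompt, so check llama-cli --help for your version.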
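
To make "applies the chat template" concrete, here is roughly what the formatting step does for a Llama 3 style instruct model. This is a simplified sketch of the published Llama 3 prompt layout, written by hand for illustration. The real template is stored in the GGUF metadata and may add details this version leaves out, and the function name is hypothetical.

```python
def render_llama3_prompt(messages):
    """Render chat messages into a simplified Llama 3 style prompt string."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    # Leave an open assistant header so the model writes the next turn.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


prompt = render_llama3_prompt([
    {"role": "system", "content": "You are a concise senior backend engineer."},
    {"role": "user", "content": "What is a connection pool?"},
])
print(prompt)
```

An instruction-tuned model that receives plain text without these markers tends to answer worse, which is why letting llama.cpp apply the template matters.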

Step 4: Serve an OpenAI-Compatible API

The real power for application developers is the built-in server. It speaks the OpenAI chat completions protocol, which means any client library or existing code written for the OpenAI API works against your local model with only a base URL change.

# Launch the HTTP server on port 8080
./build/bin/llama-server \
  -m ./models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -c 4096 \
  -t 8

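The -c 4096 flag sets the context window to 4096 tokens, and -t 8 assigns 8 threads. Note that --host 0.0.0.0 listens on every network interface; use 127.0.0.1 if only local processes should reach the server. Once it starts, the server exposes a chat UI at http://localhost:8080 plus the API endpoints. Now point a standard OpenAI client at it: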

from openai import OpenAI

# The api_key is required by the client but ignored by llama-server
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="local-model",  # name is ignored; the loaded GGUF is used
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function to check if a string is a palindrome."},
    ],
    temperature=0.2,
)

print(response.choices[0].message.content)

This is the pattern that makes llama.cpp genuinely useful in production code. You develop against a cloud API, then swap the base_url to your local server for offline work, cost-sensitive batch jobs, or environments where data cannot leave the machine. The application logic does not change at all.
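
One way to make that swap without touching application code is to read the endpoint from the environment. The sketch below is an assumption about how you might wire it: the variable names LLM_BASE_URL and LLM_API_KEY are hypothetical, and the fallback URL is the OpenAI API's standard base URL.

```python
import os


def resolve_llm_endpoint(env=None):
    """Build OpenAI client keyword arguments from environment variables.

    LLM_BASE_URL and LLM_API_KEY are HYPOTHETICAL names chosen for this example.
    """
    env = os.environ if env is None else env
    return {
        "base_url": env.get("LLM_BASE_URL", "https://api.openai.com/v1"),
        # llama-server ignores the key unless it was started with one.
        "api_key": env.get("LLM_API_KEY", "not-needed"),
    }


# Local run: export LLM_BASE_URL=http://localhost:8080/v1
local = resolve_llm_endpoint({"LLM_BASE_URL": "http://localhost:8080/v1"})
print(local["base_url"])

# In application code (requires the openai package):
#   client = OpenAI(**resolve_llm_endpoint())
```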

How to Tune CPU Throughput

Default settings rarely give you the best speed. Because CPU inference is bound by memory bandwidth and core count, a few flags make a measurable difference. Here are the levers that matter, roughly in order of impact.

  1. Set thread count to physical cores, not logical ones. Use -t equal to your physical core count. Hyper-threading usually hurts here, so an 8-core CPU should use -t 8, not 16.
  2. Match the quantization to your RAM. A model that spills out of RAM and into swap will crawl. Pick the largest quant that leaves a few gigabytes of headroom.
  3. Keep the context window realistic. Larger -c values reserve more memory for the KV cache. Set it to what your prompts actually need, not the maximum.
  4. Use batch flags for prompt processing. The -b (batch) and -ub (micro-batch) flags control how many prompt tokens are processed per step, which affects how quickly long prompts are ingested and matters for RAG-style inputs.
  5. Pin to performance cores on hybrid CPUs. On chips with efficiency and performance cores, restricting threads to the fast cores avoids the scheduler bouncing work onto slow ones.
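
Item 3 above says larger -c values reserve more memory for the KV cache. A rough estimate shows how much. The defaults below are assumptions: they reflect the published Llama 3.1 8B architecture (32 layers, 8 key/value heads, head dimension 128) and a 16-bit cache, and they are not read from the model file.

```python
def kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2):
    """Estimate KV cache size: keys and values for every layer and position.

    Defaults are ASSUMED values for Llama 3.1 8B with a 16-bit cache.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return per_token * n_ctx


for n_ctx in (4096, 32768, 131072):
    mib = kv_cache_bytes(n_ctx) / 2**20
    print(f"-c {n_ctx}: about {mib:.0f} MiB of KV cache")
```

Under these assumptions the cache grows linearly with the context size, so an oversized -c value can cost more memory than stepping up a quantization level.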

A practical starting point on an 8-core desktop looks like this:

# Tuned server: physical-core threads, modest context, explicit batch size
./build/bin/llama-server \
  -m ./models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  -t 8 \
  -c 4096 \
  -b 512 \
  --port 8080

After changing flags, send a fixed test prompt and compare the eval tokens-per-second the server logs. Treat tuning as an experiment: change one variable, measure, repeat. Guessing wastes more time than measuring does.
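
A fixed-prompt measurement can also be scripted from the client side. This is a minimal sketch that assumes the server from Step 4 is running on localhost:8080 and that its response includes an OpenAI-style usage.completion_tokens field. The function names and the test prompt are illustrative. Because it times the whole request, the result includes prompt processing and will read lower than the eval speed in the server log.

```python
import json
import time
import urllib.request


def tokens_per_second(completion_tokens, elapsed_seconds):
    """Convert a token count and wall-clock time into a throughput figure."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return completion_tokens / elapsed_seconds


def time_fixed_prompt(base_url="http://localhost:8080/v1", max_tokens=128):
    """Send one fixed prompt and return end-to-end generated tokens per second."""
    payload = json.dumps({
        "model": "local-model",
        "messages": [{"role": "user", "content": "Explain what a connection pool is."}],
        "max_tokens": max_tokens,
        "temperature": 0,
    }).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    elapsed = time.perf_counter() - start
    # ASSUMPTION: the server reports token usage in the OpenAI response shape.
    return tokens_per_second(body["usage"]["completion_tokens"], elapsed)


# With the server running, call time_fixed_prompt() once per flag setting
# and compare the results.
```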

A Real-World Scenario: An Internal Document Classifier

Consider a small team building an internal tool that classifies support tickets into categories. The volume is modest (a few thousand tickets a day, processed in nightly batches), and the data is sensitive enough that sending it to a third-party API raises compliance questions. A dedicated GPU server would sit idle most of the day yet still bill by the hour.

In this situation, a CPU-only llama.cpp deployment on an existing application server is a strong fit. An 8B model at Q4_K_M can handle short classification prompts in roughly a second to a few seconds each, depending on prompt length and hardware, and because the work runs as an overnight batch, raw throughput matters less than cost and data control. The trade-off is real and worth naming: latency per request is higher than a GPU would deliver, and a sudden spike to real-time interactive use would force a rethink. For a bounded, predictable batch workload, though, the team avoids both per-token API fees and idle GPU costs while keeping data on infrastructure they already own.
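
Whether a nightly batch fits its window is easy to estimate before committing to the design. The sketch below is a back-of-the-envelope calculation. Every number in the example call is a hypothetical placeholder, so replace the two speeds with the prompt eval and eval rates you measure on your own hardware.

```python
def batch_hours(n_tickets, prompt_tokens, output_tokens, prompt_tps, gen_tps):
    """Estimate sequential batch duration in hours.

    prompt_tps: measured prompt-processing speed (tokens per second)
    gen_tps:    measured generation speed (tokens per second)
    """
    seconds_per_ticket = prompt_tokens / prompt_tps + output_tokens / gen_tps
    return n_tickets * seconds_per_ticket / 3600


# HYPOTHETICAL inputs: 3000 tickets, 200-token prompts, 8-token labels,
# 50 tok/s prompt processing, 8 tok/s generation.
estimate = batch_hours(3000, 200, 8, prompt_tps=50, gen_tps=8)
print(f"Estimated batch duration: {estimate:.1f} hours")
```

Keeping the output to a short label matters here, since generation is the slower phase on a CPU.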

When to Use Llama.cpp

  • You need local or offline inference on machines without a GPU
  • Data privacy or compliance rules prevent sending text to external APIs
  • Your workload is batch-oriented or low-traffic, where per-request latency is not critical
  • You want a small, dependency-light binary rather than a heavy Python serving stack
  • You are prototyping on a laptop and want the same engine you can deploy to a server

When NOT to Use Llama.cpp

  • You need high-throughput, low-latency serving for many concurrent users (use a GPU with vLLM instead)
  • Your use case demands the very largest frontier models that will not fit in system RAM
  • You want the fastest possible hosted inference with zero infrastructure (a service like Groq’s API wins on raw speed)
  • Your team prefers a managed desktop GUI over command-line tooling — reach for LM Studio or Ollama

Common Mistakes with Llama.cpp

  • Using a prebuilt binary that misses your CPU’s AVX-512 or NEON kernels, leaving performance on the table
  • Setting thread count to logical cores, which oversubscribes the CPU and slows generation
  • Picking a quant that overflows RAM into swap, turning a fast model into an unusable one
  • Forgetting the -cnv flag or chat template, so an instruction-tuned model receives raw, unformatted prompts
  • Assuming CPU throughput scales to production traffic without load-testing realistic concurrency first

Converting Your Own Model to GGUF

Most of the time you download a ready-made GGUF, but occasionally you need to convert a fine-tuned model or one that nobody has quantized yet. Llama.cpp ships a conversion script for exactly this.

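# Install the converter's Python dependencies (run from the llama.cpp repo root)
pip install -r requirements.txt

# Convert a Hugging Face model directory to GGUF (FP16 first)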
python convert_hf_to_gguf.py ./my-finetuned-model \
  --outfile ./models/my-model-f16.gguf \
  --outtype f16

# Quantize the FP16 GGUF down to 4-bit
./build/bin/llama-quantize \
  ./models/my-model-f16.gguf \
  ./models/my-model-Q4_K_M.gguf \
  Q4_K_M

The two-step flow is deliberate: you first export 16-bit (FP16) weights to GGUF, then quantize to your target level. Keeping the FP16 file lets you re-quantize to different levels later without re-converting. This is the same pipeline that produces the community GGUF files you download, so the output is fully compatible with the server and CLI.
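
Because the FP16 file is reusable, producing several quantization levels is just a loop over the same llama-quantize invocation shown above. This sketch only builds the command lines. The helper name and the output naming convention are illustrative choices, and the binary path assumes the build layout from Step 1.

```python
def quantize_commands(f16_gguf, levels, quantize_bin="./build/bin/llama-quantize"):
    """Return one llama-quantize argument list per target level."""
    base = f16_gguf[:-len(".gguf")] if f16_gguf.endswith(".gguf") else f16_gguf
    if base.endswith("-f16"):
        base = base[:-len("-f16")]
    return [
        [quantize_bin, f16_gguf, f"{base}-{level}.gguf", level]
        for level in levels
    ]


commands = quantize_commands("./models/my-model-f16.gguf", ["Q4_K_M", "Q6_K"])
for command in commands:
    print(" ".join(command))

# To execute them for real:
#   import subprocess
#   for command in commands:
#       subprocess.run(command, check=True)
```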

Conclusion

Llama.cpp turns a CPU-only machine into a capable inference server: build it from source so it uses your CPU’s fast kernels, download a Q4_K_M GGUF model, run prompts through llama-cli, and serve an OpenAI-compatible API with llama-server. The decision comes down to workload shape — for offline, private, or batch inference where latency is not the top constraint, llama.cpp is hard to beat on cost and simplicity. For high-concurrency real-time serving, pair a GPU with a dedicated engine instead.

Start by serving the 8B model from this guide and pointing one of your existing OpenAI-based scripts at the local endpoint. Once that works, explore how the same models run through a managed wrapper in our Ollama local LLMs guide, or compare the GPU serving path in our vLLM tutorial.
