
LM Studio: Run and Test Local LLMs With a GUI

If you want to run open models on your own machine but the command line feels like friction, LM Studio is built for you. It is a desktop application that lets you download, chat with, and test local LLMs through a clean graphical interface, then expose them through an OpenAI-compatible API when you are ready to build. This tutorial is aimed at intermediate developers and covers installing LM Studio, picking a model that fits your hardware, tuning the settings that actually matter, and connecting the local server to real Python and JavaScript code.

Local models matter because they keep data on your machine, cost nothing per token, and work offline. However, getting started usually means wrestling with quantization formats, GPU flags, and terminal output. LM Studio removes most of that overhead, which makes it one of the fastest ways to go from “I want to try Llama 3” to a working chat window.

What Is LM Studio?

LM Studio is a free desktop app for macOS, Windows, and Linux that runs open-weight large language models entirely on your own hardware. It provides a model browser backed by Hugging Face, a chat interface for testing, and a built-in server that mimics the OpenAI API. As a result, you can experiment visually and then call the same model from code without changing your client library.

Under the hood, LM Studio uses the llama.cpp and MLX engines to run quantized models efficiently on consumer GPUs, Apple Silicon, and even CPU-only machines. You do not interact with those engines directly. Instead, the GUI handles model loading, memory allocation, and hardware acceleration for you.

Why Use a GUI for Local LLMs?

A graphical interface lowers the barrier to experimentation. For instance, comparing two models usually means downloading both, loading each, and sending the same prompt to see which answers better. In a terminal-only tool, that involves several commands and manual bookkeeping. In LM Studio, it is a few clicks.

The GUI also surfaces information that command-line tools hide. You can see how much VRAM a model needs before loading it, watch tokens-per-second in real time, and adjust the context length with a slider instead of a flag. Consequently, you learn what your hardware can handle without trial-and-error crashes.

That said, a GUI is not a replacement for automation. Once you settle on a model and configuration, you will want to script against it. LM Studio anticipates this by shipping both a local API server and a command-line companion, so the GUI becomes your testing surface rather than your only interface.

Installing LM Studio

Installation is straightforward across all three platforms. First, download the installer for your operating system from the official LM Studio website. The app ships as a standard .dmg on macOS, an .exe installer on Windows, and an AppImage on Linux.

After installing, launch the app. On first run, LM Studio detects your hardware and reports your available GPU, VRAM, and system RAM in the bottom status bar. Pay attention to those numbers, because they determine which models you can realistically run.

Hardware guidelines help set expectations:

  • 8 GB RAM or VRAM: 3B to 7B parameter models at 4-bit quantization
  • 16 GB: 7B to 13B models comfortably, or a 7B model at higher precision
  • 24 GB or more: 13B to 34B models, with room for longer context windows
  • Apple Silicon (M-series): unified memory is shared between the CPU and GPU, so a 16 GB Mac behaves roughly like a 16 GB GPU, minus whatever macOS and other apps are using

These ranges assume quantized weights, which is the default for local inference. A 7B model at 16-bit precision needs roughly 14 GB for its weights alone (7 billion parameters × 2 bytes each), whereas the same model at 4-bit fits in about 4 to 5 GB with only a modest quality loss.
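
The arithmetic behind those figures is easy to reproduce. The sketch below is a rough estimate of weight memory only; the function name is ours, and it ignores the context cache and runtime buffers, which is part of why a real 4-bit file lands nearer 4 to 5 GB than 3.5 GB.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Compare the same 7B model at three precisions.
for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"7B at {label}: ~{weight_memory_gb(7, bits):.1f} GB")
```

Swap in your own parameter count to sanity-check a download before you start it.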

Downloading Your First Model

Open the search tab (the magnifying glass icon) to browse models. LM Studio pulls listings directly from Hugging Face, so you can search by name such as Llama 3.1 8B or Qwen2.5 7B Instruct. Each result shows multiple quantization options, and this is where new users get stuck.

Quantization trades precision for size and speed. The format you see most often is GGUF, with labels like Q4_K_M or Q8_0. The number indicates the approximate bits per weight, while the suffix describes the quantization method and variant. For a balanced starting point, choose a Q4_K_M build, which keeps quality high while cutting memory use dramatically.

Here is how to read the common options:

  • Q3_K_S / Q3_K_M: smallest, noticeable quality drop, use only when memory is tight
  • Q4_K_M: the recommended default, strong balance of quality and size
  • Q5_K_M: slightly better quality, modestly larger
  • Q6_K / Q8_0: near-full quality, best when you have spare memory
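
If you ever need to sort or filter model files by quantization in a script, the labels can be split mechanically. This is a minimal sketch that only handles the label shapes listed above; the regular expression and the field names are our own, not something LM Studio or GGUF defines.

```python
import re

# Matches labels such as Q4_K_M, Q6_K, or Q8_0.
QUANT_LABEL = re.compile(r"^Q(?P<bits>\d+)_(?P<scheme>[A-Z0-9]+)(?:_(?P<variant>[A-Z]+))?$")


def parse_quant(label: str) -> dict:
    """Split a quantization label into nominal bits, scheme, and optional variant."""
    match = QUANT_LABEL.match(label.upper())
    if match is None:
        raise ValueError(f"unrecognized quantization label: {label!r}")
    return {
        "bits": int(match["bits"]),
        "scheme": match["scheme"],
        "variant": match["variant"],  # None when the label has no variant suffix
    }


labels = ["Q8_0", "Q4_K_M", "Q3_K_S", "Q6_K"]
# Order the options from smallest to largest nominal bit width.
print(sorted(labels, key=lambda name: parse_quant(name)["bits"]))
```

The nominal bit count is enough for ordering, but treat it as approximate when estimating file size.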

LM Studio marks each download with a green, yellow, or red indicator based on whether it fits your hardware. While you are learning, stick to green-labeled files. Click download, and the model lands in your local model directory, ready to load.

If quantization formats still feel opaque, the broader concept of trading model size for inference cost is the same one explored in our guide on Ollama for local LLMs, which uses the same GGUF ecosystem.

Chatting With a Model in the GUI

Switch to the chat tab and click the model selector at the top. Choose the model you downloaded, and LM Studio loads it into memory. Loading takes a few seconds to a minute depending on size, and the status bar shows progress.

Once loaded, type a prompt and press enter. You will see the response stream token by token, along with a live tokens-per-second counter. This number is your single best signal for whether a model is practical on your hardware. Anything above 15 to 20 tokens per second feels conversational, whereas single-digit speeds make iteration painful.

The chat panel also exposes a system prompt field. Use it to set the model’s behavior, such as instructing it to answer concisely or to respond only in JSON. Because local models vary widely in instruction-following ability, testing system prompts in the GUI before you hardcode them into an app saves significant debugging time later.
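
When you ask for JSON only, it helps to know how you will handle replies that do not comply. The helper below is a hypothetical sketch: it tolerates one common deviation, a reply wrapped in a Markdown code fence, and lets anything else fail loudly so you notice it during testing.

```python
import json

FENCE = "`" * 3  # a Markdown code fence marker


def parse_json_reply(reply: str):
    """Parse a model reply as JSON, tolerating a surrounding Markdown code fence."""
    text = reply.strip()
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]  # drop the opening fence and any language tag
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]  # drop the closing fence
        text = "\n".join(lines)
    return json.loads(text)  # raises json.JSONDecodeError on non-JSON replies


print(parse_json_reply('{"label": "invoice", "confidence": 0.9}'))
```

If a model trips this parser often in the chat tab, that is a sign to tighten the system prompt or try a different model before writing application code around it.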

Understanding Model Settings That Matter

LM Studio exposes inference settings in a side panel, and a few of them have an outsized effect on results. Rather than tweaking everything, focus on the handful that change behavior meaningfully.

Context length controls how much text the model can consider at once. Longer context uses more memory, so LM Studio defaults to a conservative value. If you plan to feed in long documents, raise this setting, but watch your VRAM usage climb as you do.
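
To see why memory climbs with context, you can estimate the size of the attention cache that grows with every token. The sketch below uses the standard key/value cache formula; the layer and head counts are the published Llama 3.1 8B configuration, and the 16-bit cache precision is an assumption, since a runtime may store the cache differently.

```python
def kv_cache_bytes(
    context_tokens: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # assumes a 16-bit cache
) -> int:
    """Estimate key/value cache size: one key and one value vector per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_tokens


# Assumed model shape: 32 layers, 8 key/value heads, head dimension 128.
for context in (4096, 8192, 32768):
    gib = kv_cache_bytes(context, n_layers=32, n_kv_heads=8, head_dim=128) / 2**30
    print(f"{context:>6} tokens: ~{gib:.1f} GiB of cache")
```

Under these assumptions the cache costs about 1 GiB per 8,192 tokens, on top of the weights themselves.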

GPU offload decides how many model layers run on the GPU versus the CPU. Maxing out GPU layers gives the fastest inference, provided the layers fit in VRAM. When a model is slightly too large, lowering the offload count lets it run at reduced speed instead of failing to load.
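
A back-of-the-envelope calculation gives a reasonable starting value for the offload slider. This sketch assumes every layer is the same size and reserves a fixed slice of VRAM for the context cache and buffers; both are simplifications, and the 1.5 GB reserve is a guess you should adjust.

```python
import math


def layers_that_fit(
    model_gb: float,
    n_layers: int,
    free_vram_gb: float,
    reserve_gb: float = 1.5,  # assumed headroom for cache and buffers
) -> int:
    """Estimate how many equally sized layers fit in the VRAM left after the reserve."""
    usable_gb = max(free_vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb * n_layers / model_gb))


# Hypothetical case: an 8 GB model file with 40 layers on an 8 GB card.
print(layers_that_fit(model_gb=8.0, n_layers=40, free_vram_gb=8.0))
```

Start from the estimate, then nudge the slider down if loading fails or up if VRAM is left unused.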

Temperature governs randomness. Lower values near 0.2 produce focused, nearly deterministic output suited to extraction and coding tasks. Higher values near 0.8 encourage creative variation. For testing reproducibility, keep temperature low so you can compare prompts fairly.
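
The effect is easier to see with numbers. Temperature divides the model's raw scores (logits) before they are turned into probabilities, so a low value sharpens the distribution and a high value flattens it. The three logits below are made up purely for illustration.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities after scaling them by 1 / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for temperature in (0.2, 0.8):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At 0.2 the top token takes more than 99% of the probability mass; at 0.8 it drops to roughly 69%, leaving real room for the alternatives.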

These settings mirror the parameters you would pass through any inference API. Consequently, the intuition you build in the GUI transfers directly when you move to code.

Running LM Studio as a Local API Server

The feature that makes LM Studio genuinely useful for development is its built-in server. Navigate to the developer tab (the terminal icon) and click “Start Server.” By default, it listens on http://localhost:1234 and exposes endpoints that match the OpenAI API specification.

This compatibility is the key detail. Because the endpoints mirror OpenAI’s, any library or tool that already talks to OpenAI can point at LM Studio with a single base-URL change. You get local, private, zero-cost inference while keeping the client code you already know.

Verify the server with a quick request:

# Confirm the server is up and list loaded models
curl http://localhost:1234/v1/models

# Expected output (abbreviated):
# {
#   "data": [
#     { "id": "llama-3.1-8b-instruct", "object": "model" }
#   ]
# }

The server keeps the currently loaded model in memory and serves requests against it. If you load a different model in the chat tab, the server switches to it automatically, which makes side-by-side testing painless.
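
The same check is handy from Python, for example at the start of a test script. This sketch uses only the standard library and assumes the default address shown above; the function names are ours.

```python
import json
from urllib.request import urlopen


def model_ids(payload: dict) -> list[str]:
    """Pull the model ids out of a /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(base_url: str = "http://localhost:1234/v1", timeout: float = 5.0) -> list[str]:
    """Ask the local server which models it is serving."""
    with urlopen(f"{base_url}/models", timeout=timeout) as response:
        return model_ids(json.load(response))


# Parsing the abbreviated response shown above:
sample = {"data": [{"id": "llama-3.1-8b-instruct", "object": "model"}]}
print(model_ids(sample))
```

Call fetch_model_ids() with the server running; if it is not, urlopen raises a URLError, which is a clear signal that the server was never started.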

Connecting Your Code to LM Studio

Because the server speaks the OpenAI dialect, the official OpenAI SDKs work without modification. You only change the base_url and supply a placeholder API key, since local inference needs no real credentials.

Here is a complete Python example using the openai package:

from openai import OpenAI

# Point the standard OpenAI client at the local LM Studio server.
# The api_key is required by the SDK but ignored by LM Studio.
client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="lm-studio",
)

def summarize(text: str) -> str:
    response = client.chat.completions.create(
        model="llama-3.1-8b-instruct",  # must match the loaded model id
        messages=[
            {"role": "system", "content": "Summarize in two sentences."},
            {"role": "user", "content": text},
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    article = "LM Studio runs open models locally through a desktop GUI..."
    print(summarize(article))

The same pattern applies in JavaScript or TypeScript. You install the openai package, override the base URL, and call the chat completions endpoint exactly as you would against the hosted API:

import OpenAI from "openai";

// The base URL change is the only difference from a cloud setup.
const client = new OpenAI({
  baseURL: "http://localhost:1234/v1",
  apiKey: "lm-studio", // placeholder, not validated locally
});

async function extractKeywords(input: string): Promise<string> {
  const completion = await client.chat.completions.create({
    model: "llama-3.1-8b-instruct",
    messages: [
      { role: "system", content: "Return five keywords as a comma-separated list." },
      { role: "user", content: input },
    ],
    temperature: 0.3,
  });

  return completion.choices[0].message.content ?? "";
}

extractKeywords("LM Studio exposes an OpenAI-compatible local server.").then(console.log);

Streaming works too. Pass stream=True in Python or stream: true in JavaScript, and the server returns server-sent events in the same shape OpenAI uses. As a result, you can prototype against LM Studio for free and switch to a hosted provider in production by changing the base URL, the API key, and the model name.
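
A streaming call in Python might look like the following sketch. It assumes the openai package (version 1 or later) and reuses the model id from the earlier examples; the helper names are ours. The OpenAI import sits inside the function only so that the chunk-joining helper can be reused and tested without the package installed.

```python
def collect_stream(chunks) -> str:
    """Print streamed text as it arrives and return the full reply."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks carry no choices
        delta = chunk.choices[0].delta.content
        if delta:  # content can be None, e.g. on the final chunk
            print(delta, end="", flush=True)
            parts.append(delta)
    print()
    return "".join(parts)


def stream_reply(prompt: str, model: str = "llama-3.1-8b-instruct") -> str:
    """Send one prompt to the local server and stream the answer."""
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    return collect_stream(stream)
```

With the server running, stream_reply("Explain quantization in one sentence.") prints tokens as they arrive and returns the assembled text.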

If you route requests through a gateway, LM Studio slots in cleanly. Our walkthrough on setting up LiteLLM shows how to treat a local OpenAI-compatible endpoint as just another model in a unified router.

LM Studio vs Ollama: Which Should You Use?

Both tools run local GGUF models, but they target different workflows. LM Studio leads with a polished GUI, while Ollama is command-line first and built for scripting and servers.

Feature                  | LM Studio                     | Ollama
-------------------------|-------------------------------|-----------------------------
Primary interface        | Graphical desktop app         | Command line
OpenAI-compatible API    | Yes                           | Yes
Model discovery          | Built-in Hugging Face browser | ollama pull from registry
Best for                 | Testing, comparing, learning  | Automation, headless servers
Visual settings tuning   | Yes                           | No
Scriptable everywhere    | Limited                       | Excellent

In practice, many developers use both. You reach for LM Studio when evaluating a new model and want to see its behavior, speed, and memory footprint at a glance. Then you deploy with Ollama on a server where a GUI would only get in the way. For the command-line side of this workflow, see our full guide on running local LLMs with Ollama.

When you do need raw inference speed beyond what a laptop offers, a hosted accelerator becomes the better choice. Our overview of the Groq API for fast LLM inference covers when cloud hardware outpaces anything local.

When to Use LM Studio

  • You want to test and compare open models visually before committing to one
  • Your data is sensitive and must never leave your machine
  • You are learning how quantization, context length, and GPU offload affect results
  • You need a free, offline development environment that mirrors the OpenAI API
  • You prototype LLM features and want zero per-token cost during iteration

When NOT to Use LM Studio

  • You are deploying to a headless production server, where Ollama or vLLM fits better
  • You need to serve many concurrent users at scale, which desktop inference cannot handle
  • Your workflow is fully automated and a GUI adds no value
  • You require the absolute fastest inference and have access to cloud GPUs
  • You need frontier-model quality that small local models cannot match yet

Common Mistakes with LM Studio

  • Downloading a model too large for your VRAM, then wondering why it loads slowly or fails
  • Ignoring the green/yellow/red fit indicators and choosing the highest-quality quantization by default
  • Forgetting to start the local server before pointing application code at localhost:1234
  • Mismatching the model id in code with the actually loaded model, which returns confusing errors
  • Leaving context length at maximum, which silently consumes memory you needed for the model itself
  • Comparing models at different temperatures and drawing conclusions from noise rather than capability
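
The model id mismatch is cheap to catch before you send a real request. The sketch below is a hypothetical preflight helper: give it the id your code uses and the list of ids the server reports, and it fails early with a suggestion instead of a confusing API error.

```python
import difflib


def check_model_id(requested: str, available: list[str]) -> str:
    """Return the requested id if the server reports it, else raise with a hint."""
    if requested in available:
        return requested
    suggestions = difflib.get_close_matches(requested, available, n=1)
    hint = f" Did you mean {suggestions[0]!r}?" if suggestions else ""
    raise ValueError(f"model {requested!r} is not loaded.{hint}")


loaded = ["llama-3.1-8b-instruct"]  # in practice, the ids returned by /v1/models
print(check_model_id("llama-3.1-8b-instruct", loaded))
```

Run it once at startup so a typo in a config file surfaces immediately.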

A Realistic Testing Scenario

Consider a small team building an internal document classifier. They cannot send customer contracts to a third-party API for compliance reasons, so a local model is the only option. Rather than guessing which open model performs best, an engineer loads three candidates into LM Studio over an afternoon and runs the same set of ten representative contracts through each.

The GUI makes the trade-offs visible immediately. A 13B model produces the most accurate labels but runs at eight tokens per second on the team’s 16 GB hardware, which feels sluggish. A 7B model at Q4_K_M hits 30 tokens per second with only a small accuracy drop on their test set. Because the team values throughput for a batch job, they pick the 7B model, then move it behind a scripted Ollama deployment for the nightly run. The whole evaluation takes hours instead of days, precisely because the visual feedback removed the guesswork.
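
Once the shortlist is clear, the same comparison can be scripted so it is repeatable. The harness below is a sketch under stated assumptions: classify is a hypothetical callable you supply that sends one document to the named model and returns the predicted label plus the number of completion tokens, and the labels and model id are invented for the example.

```python
import time
from typing import Callable


def evaluate(
    model: str,
    cases: list[tuple[str, str]],
    classify: Callable[[str, str], tuple[str, int]],
) -> dict:
    """Score one model on labeled cases and measure generation throughput."""
    if not cases:
        raise ValueError("need at least one test case")
    correct = 0
    tokens = 0
    started = time.perf_counter()
    for text, expected in cases:
        label, completion_tokens = classify(model, text)
        correct += int(label.strip().lower() == expected.lower())
        tokens += completion_tokens
    elapsed = time.perf_counter() - started
    return {
        "model": model,
        "accuracy": correct / len(cases),
        "tokens_per_second": tokens / elapsed if elapsed > 0 else float("inf"),
    }


def fake_classify(model: str, text: str) -> tuple[str, int]:
    """Stand-in for a real model call so the harness can run without a server."""
    return ("nda" if "confidential" in text else "lease", 4)


cases = [("This confidential agreement...", "nda"), ("The tenant shall...", "lease")]
print(evaluate("example-7b", cases, fake_classify))
```

For a real run, replace fake_classify with a function that calls the chat completions endpoint and reads the token count from the response's usage field. Keep temperature fixed across models so the comparison stays fair.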

This pattern, testing locally before deciding, also informs bigger architecture questions. If your use case leans on private documents, the choice between adapting a model and retrieving context is worth understanding, which our comparison of fine-tuning vs RAG breaks down in detail.

Conclusion and Next Steps

LM Studio is the most approachable way to run and test local LLMs, because it pairs a clear GUI with an OpenAI-compatible server that drops straight into existing code. Start by downloading a Q4_K_M build of a 7B model, chat with it to gauge speed and quality, then start the local server and call it from a few lines of Python or JavaScript. From there, you can decide whether a local model meets your needs or whether a hosted provider is the better fit.

For the next step, wire LM Studio into a real project using the same patterns you would apply to a cloud model in our guide on building apps with the OpenAI API. The client code is identical, so everything you build against LM Studio carries over to a hosted provider once you swap the base URL, the API key, and the model name.
