Build a Local LLM App

This is the hands-on companion to Cloud vs Local Models: run a model on your own machine with Ollama or LM Studio, then talk to it from a tiny app you write yourself. No API keys, no cloud, works offline.

The key idea: one API, two servers

Both Ollama and LM Studio expose an OpenAI-compatible HTTP API on localhost. That means your app code is identical for either one - only the base URL, port, and model name change. Anything that can speak to the OpenAI API can speak to a local model.

	Ollama	LM Studio
Default base URL	`http://localhost:11434/v1`	`http://localhost:1234/v1`
Native API (also)	`http://localhost:11434/api/chat`	-
API key	any non-empty string (ignored)	any non-empty string (ignored)
Interface	CLI-first	GUI-first (plus an `lms` CLI)
Get a model	`ollama pull <model>`	search & download in the app

Pick whichever you prefer; the app code below works with both.
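
To make the "one API, two servers" point concrete, here is a minimal sketch of a config lookup. The SERVERS map and the chatCompletionsUrl helper are hypothetical names introduced for illustration; the URLs are the defaults from the table above, and the LM Studio model id is a placeholder you would replace with your own.

```javascript
// Hypothetical helper: everything that differs between the two servers
// lives in this one map, so the request code never has to change.
const SERVERS = {
    ollama: {baseUrl: 'http://localhost:11434/v1', model: 'llama3.2'},
    lmstudio: {baseUrl: 'http://localhost:1234/v1', model: 'your-loaded-model-id'},
};

// Build the chat completions URL for the chosen server.
function chatCompletionsUrl(serverName) {
    const server = SERVERS[serverName];
    if (!server) throw new Error(`Unknown server: ${serverName}`);
    return `${server.baseUrl}/chat/completions`;
}

console.log(chatCompletionsUrl('ollama'));   // http://localhost:11434/v1/chat/completions
console.log(chatCompletionsUrl('lmstudio')); // http://localhost:1234/v1/chat/completions
```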

Option A: Ollama

Ollama is the simplest way to run open-weights models from the terminal.

Install it from ollama.com/download.
Pull and run a model. Start with a small one that fits your RAM/VRAM:

ollama pull llama3.2        # ~2 GB, a good small default
ollama run llama3.2         # interactive chat in the terminal
ollama list                 # see what you have installed

The API is already running. Whenever Ollama is running, its server listens on http://localhost:11434. To start it manually (e.g. on a headless box), run ollama serve.
Smoke-test it with curl - both the OpenAI-compatible and native endpoints work:

# OpenAI-compatible endpoint
curl http://localhost:11434/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
}'

# Ollama native endpoint
curl http://localhost:11434/api/chat -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "stream": false
}'
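
The two endpoints return the reply in different places. The sketch below assumes non-streaming responses and the usual shapes: choices[0].message.content for the OpenAI-compatible endpoint and message.content for Ollama's native /api/chat. The extractReply helper is a hypothetical name, and the sample objects are trimmed-down illustrations rather than real server output.

```javascript
// Hypothetical helper: pull the assistant's text out of either response shape.
function extractReply(json) {
    // OpenAI-compatible shape: {choices: [{message: {content}}]}
    const openAiStyle = json.choices?.[0]?.message?.content;
    if (openAiStyle !== undefined) return openAiStyle;
    // Ollama native /api/chat shape (with "stream": false): {message: {content}}
    return json.message?.content;
}

// Trimmed-down sample responses, for illustration only.
console.log(extractReply({choices: [{message: {role: 'assistant', content: 'Hello!'}}]})); // prints "Hello!"
console.log(extractReply({message: {role: 'assistant', content: 'Hello!'}, done: true}));  // prints "Hello!"
```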

If you need to pick a model for your hardware, see quantization and open-weights models - the rule of thumb is "the largest model that fits in your VRAM", with some headroom left for the context window.
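
As a rough back-of-the-envelope check of that rule, the weights alone need about (parameter count x bits per weight / 8) bytes. The sketch below assumes this approximation and ignores the context cache and runtime overhead, so treat the result as a lower bound. estimateWeightsGB is a hypothetical helper, not part of Ollama or LM Studio.

```javascript
// Hypothetical helper: approximate size of the model weights in GB.
// paramsBillions * 1e9 parameters * (bitsPerWeight / 8) bytes each, divided by 1e9.
function estimateWeightsGB(paramsBillions, bitsPerWeight) {
    return (paramsBillions * bitsPerWeight) / 8;
}

console.log(estimateWeightsGB(3, 4));  // 1.5  (a 3B model at 4-bit quantization)
console.log(estimateWeightsGB(8, 16)); // 16   (an 8B model at 16-bit precision)
```

Real quantized files are usually somewhat larger than this estimate, because common quantization formats spend more than the nominal bits per weight.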

Option B: LM Studio

LM Studio is a desktop GUI for discovering, downloading, and serving models.

Download it from lmstudio.ai.
Find and download a model in the app's search tab (it suggests quantizations that fit your machine).
Start the local server. Open the Developer / Local Server tab and start it. It defaults to http://localhost:1234 and exposes the OpenAI-compatible API at http://localhost:1234/v1.
Enable CORS in the server settings if you plan to call it from a browser app.
Smoke-test it (use the exact model id shown in the server panel, or GET /v1/models to list them):

curl http://localhost:1234/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "your-loaded-model-id",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
}'
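
Listing the model ids can also be scripted. This sketch assumes the server answers GET /v1/models with the OpenAI-style list shape ({data: [{id: ...}]}). listModelIds is a hypothetical helper, and its optional fetchImpl parameter exists only so the function can be exercised without a running server.

```javascript
// Hypothetical helper: return the ids reported by GET {baseUrl}/models.
// fetchImpl defaults to the global fetch (Node 18+ or a browser).
async function listModelIds(baseUrl, fetchImpl = fetch) {
    const res = await fetchImpl(`${baseUrl}/models`);
    if (!res.ok) throw new Error(`Request failed: ${res.status}`);
    const body = await res.json();
    // Assumed OpenAI-style shape: {data: [{id: 'model-id'}, ...]}
    return (body.data ?? []).map((model) => model.id);
}

// Usage against a running LM Studio server (uncomment to try it):
// listModelIds('http://localhost:1234/v1').then((ids) => console.log(ids));
```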

Write a simple app

Three versions of the same thing, in increasing order of polish. They all hit the OpenAI-compatible endpoint, so switching between Ollama and LM Studio is a one-line change.

1. A Node.js CLI (zero dependencies, streaming)

The most reliable starting point - no CORS, no build step. Requires Node 18+ (built-in fetch). Save as chat.mjs and run node chat.mjs "your question".

// chat.mjs - a tiny streaming chat client for a local LLM.
// Ollama:   node chat.mjs "hello"
// LM Studio: LLM_BASE_URL=http://localhost:1234/v1 LLM_MODEL=your-model node chat.mjs "hello"

const BASE_URL = process.env.LLM_BASE_URL ?? 'http://localhost:11434/v1';
const MODEL = process.env.LLM_MODEL ?? 'llama3.2';
const prompt = process.argv.slice(2).join(' ') || 'Explain what a local LLM is in two sentences.';

const res = await fetch(`${BASE_URL}/chat/completions`, {
    method: 'POST',
    headers: {'Content-Type': 'application/json'},
    body: JSON.stringify({
        model: MODEL,
        stream: true,
        messages: [
            {role: 'system', content: 'You are a concise, helpful assistant.'},
            {role: 'user', content: prompt},
        ],
    }),
});

if (!res.ok) {
    throw new Error(`Request failed: ${res.status} ${await res.text()}`);
}

const reader = res.body.getReader();
const decoder = new TextDecoder();
let buffer = '';

for (;;) {
    const {done, value} = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, {stream: true});
    const lines = buffer.split('\n');
    buffer = lines.pop() ?? '';
    for (const line of lines) {
        const trimmed = line.trim();
        if (!trimmed.startsWith('data:')) continue;
        const data = trimmed.slice(5).trim();
        if (data === '[DONE]') continue;
        const token = JSON.parse(data).choices?.[0]?.delta?.content;
        if (token) process.stdout.write(token);
    }
}
process.stdout.write('\n');

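The streaming response uses server-sent events (SSE): each event is a line of the form data: {json}, and the stream ends with data: [DONE]. A network chunk can end in the middle of a line, so we buffer the partial line until the rest arrives and pull choices[0].delta.content out of each complete one.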
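
The buffering logic is easier to see in isolation. The following sketch repeats the loop from chat.mjs as a standalone function and feeds it a simulated stream whose first event is split across two chunks. createSseTokenParser is a hypothetical name, and the chunks are hand-written examples, not captured server output.

```javascript
// Hypothetical helper: returns a function that accepts decoded text chunks
// and calls onToken for every content token found in complete SSE lines.
function createSseTokenParser(onToken) {
    let buffer = '';
    return function push(chunk) {
        buffer += chunk;
        const lines = buffer.split('\n');
        buffer = lines.pop() ?? ''; // keep the trailing partial line for the next chunk
        for (const line of lines) {
            const trimmed = line.trim();
            if (!trimmed.startsWith('data:')) continue; // skip blank separator lines
            const data = trimmed.slice(5).trim();
            if (data === '[DONE]') continue;            // end-of-stream marker
            const token = JSON.parse(data).choices?.[0]?.delta?.content;
            if (token) onToken(token);
        }
    };
}

// Simulated stream: the first event arrives in two pieces.
const tokens = [];
const push = createSseTokenParser((token) => tokens.push(token));
push('data: {"choices":[{"delta":{"con');
push('tent":"Hel"}}]}\n\ndata: {"choices":[{"delta":{"content":"lo"}}]}\n\n');
push('data: [DONE]\n\n');
console.log(tokens.join('')); // prints "Hello"
```

Without the buffer, the first chunk would be passed to JSON.parse as an incomplete line and throw.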

2. A single-file browser app (streaming)

Save as index.html. It is a complete, dependency-free chat UI. See the CORS note below before running it in a browser.

<!doctype html>
<html lang="en">
<head>
    <meta charset="utf-8" />
    <title>Local LLM Chat</title>
    <style>
        body { font-family: system-ui, sans-serif; max-width: 40rem; margin: 2rem auto; padding: 0 1rem; }
        #log { white-space: pre-wrap; border: 1px solid #ccc; border-radius: 8px; padding: 1rem; min-height: 8rem; }
        form { display: flex; gap: .5rem; margin-top: 1rem; }
        input { flex: 1; padding: .5rem; }
    </style>
</head>
<body>
    <h1>Local LLM Chat</h1>
    <div id="log"></div>
    <form id="form">
        <input id="input" autocomplete="off" placeholder="Ask something..." />
        <button>Send</button>
    </form>
    <script type="module">
        const BASE_URL = 'http://localhost:11434/v1'; // LM Studio: http://localhost:1234/v1
        const MODEL = 'llama3.2';                      // LM Studio: your loaded model id

        const log = document.getElementById('log');
        const form = document.getElementById('form');
        const input = document.getElementById('input');

        form.addEventListener('submit', async (event) => {
            event.preventDefault();
            const question = input.value.trim();
            if (!question) return;
            input.value = '';
            log.textContent += `\nYou: ${question}\nAI: `;

            const res = await fetch(`${BASE_URL}/chat/completions`, {
                method: 'POST',
                headers: {'Content-Type': 'application/json'},
                body: JSON.stringify({
                    model: MODEL,
                    stream: true,
                    messages: [{role: 'user', content: question}],
                }),
            });

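            // Surface HTTP errors (e.g. an unknown model name) instead of showing nothing.
            if (!res.ok) {
                log.textContent += `[error ${res.status}: ${await res.text()}]`;
                return;
            }

            const reader = res.body.getReader();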
            const decoder = new TextDecoder();
            let buffer = '';

            for (;;) {
                const {done, value} = await reader.read();
                if (done) break;
                buffer += decoder.decode(value, {stream: true});
                const lines = buffer.split('\n');
                buffer = lines.pop() ?? '';
                for (const line of lines) {
                    const trimmed = line.trim();
                    if (!trimmed.startsWith('data:')) continue;
                    const data = trimmed.slice(5).trim();
                    if (data === '[DONE]') continue;
                    const token = JSON.parse(data).choices?.[0]?.delta?.content;
                    if (token) log.textContent += token;
                }
            }
        });
    </script>
</body>
</html>

3. Using the official OpenAI SDK

If you already use the OpenAI SDK, just point it at the local base URL and pass a local model name - the rest of your code stays the same.

import OpenAI from 'openai';

const client = new OpenAI({
    baseURL: 'http://localhost:11434/v1', // LM Studio: http://localhost:1234/v1
    apiKey: 'ollama',                      // any non-empty string; local servers ignore it
});

const completion = await client.chat.completions.create({
    model: 'llama3.2',
    messages: [{role: 'user', content: 'Give me one tip for running LLMs locally.'}],
});

console.log(completion.choices[0].message.content);

The Python SDK works the same way:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Give me one tip for running LLMs locally."}],
)
print(resp.choices[0].message.content)

Tuning the behavior

System prompt - prepend a {role: 'system', content: '...'} message to set persona and rules.
Temperature - add "temperature": 0.2 to the request body. Values near 0 give more deterministic output; higher values give more variety. See temperature.
Swap the model - change one string. Pull another with ollama pull <model> or load another in LM Studio.
Streaming on/off - set "stream": false to get the whole response in one JSON object instead of token-by-token.
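
The knobs above all live in the request body. As a minimal sketch, the hypothetical buildChatRequest helper below assembles a body with an optional system prompt, an optional temperature, and the stream flag. The field names (model, messages, temperature, stream) are the ones used elsewhere on this page; the helper and its defaults are assumptions for illustration.

```javascript
// Hypothetical helper: assemble a chat completions request body.
function buildChatRequest({model = 'llama3.2', prompt, system, temperature, stream = true}) {
    const messages = [];
    // The system message, when present, goes first.
    if (system) messages.push({role: 'system', content: system});
    messages.push({role: 'user', content: prompt});

    const body = {model, stream, messages};
    // Only send temperature when the caller set it, so the server default applies otherwise.
    if (temperature !== undefined) body.temperature = temperature;
    return body;
}

const requestBody = buildChatRequest({
    prompt: 'Name three uses for a local LLM.',
    system: 'You are a concise, helpful assistant.',
    temperature: 0.2,
    stream: false,
});
console.log(JSON.stringify(requestBody, null, 2));
```

Pass the result to JSON.stringify as the body of the fetch call in either app.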

CORS and the browser

CORS is enforced by browsers, so the Node CLI never runs into it. A browser app served from a different origin than the model server (or opened directly from a file:// URL) makes a cross-origin request, which the browser blocks unless the server explicitly allows that origin:

Ollama - set the OLLAMA_ORIGINS environment variable before starting it, e.g. OLLAMA_ORIGINS='*' ollama serve (or list specific origins). Restart Ollama after changing it.
LM Studio - toggle CORS on in the Local Server settings.

The cleanest setup is to serve the HTML from a tiny static server (for example npx serve) and allow that exact origin, rather than opening the file directly.
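
To check whether a given setup counts as cross-origin, you can compare origins (scheme, host, and port) yourself. This is a sketch using the standard URL class; isCrossOrigin is a hypothetical helper, and port 3000 is only an example of where a static server might serve the page.

```javascript
// Hypothetical helper: true when a page at pageUrl calling apiUrl would be
// treated as cross-origin (scheme, host, or port differ).
function isCrossOrigin(pageUrl, apiUrl) {
    const pageOrigin = new URL(pageUrl).origin;
    const apiOrigin = new URL(apiUrl).origin;
    // Opaque origins (such as file:// pages) serialize as "null" and never match.
    if (pageOrigin === 'null' || apiOrigin === 'null') return true;
    return pageOrigin !== apiOrigin;
}

console.log(isCrossOrigin('http://localhost:3000/index.html', 'http://localhost:11434/v1')); // true (different port)
console.log(isCrossOrigin('file:///home/me/index.html', 'http://localhost:11434/v1'));       // true (file:// page)
console.log(isCrossOrigin('http://localhost:11434/app/', 'http://localhost:11434/v1'));      // false (same origin)
```

In the first case, the origin to allow on the model server would be http://localhost:3000.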

Troubleshooting

Symptom	Likely cause	Fix
`connection refused` / fetch fails	Server not running or wrong port	Start Ollama (`ollama serve`) or the LM Studio server; check `11434` vs `1234`
`model not found` / 404	Model not pulled or wrong name	`ollama pull llama3.2`; in LM Studio use the exact loaded id; check `GET /v1/models`
CORS error in the browser console	Cross-origin request blocked	Set `OLLAMA_ORIGINS` / enable LM Studio CORS, or use the Node version
Very slow, or out of memory	Model too big for your VRAM	Pick a smaller or more quantized model (e.g. a 3B at Q4)
Empty or cut-off output	Streaming parse error or output length limit	Verify the SSE parsing loop; raise `max_tokens` in the request body
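
Some of these symptoms can be detected in code. The sketch below is a hypothetical suggestFix helper that maps a few signals to rows of the table. It assumes that Node's built-in fetch reports a refused connection as an error whose cause.code is ECONNREFUSED, and that the server marks truncated output with the OpenAI-style finish_reason of "length"; verify both against your own setup.

```javascript
// Hypothetical helper: map a few observable signals to a suggested fix.
// causeCode: err.cause?.code from a failed fetch (assumed Node behaviour)
// status: HTTP status of the response
// finishReason: choices[0].finish_reason from a completed response
function suggestFix({causeCode, status, finishReason} = {}) {
    if (causeCode === 'ECONNREFUSED') {
        return 'Server not running or wrong port: start it and check 11434 vs 1234.';
    }
    if (status === 404) {
        return 'Model not found: pull or load it, then check GET /v1/models for the exact id.';
    }
    if (finishReason === 'length') {
        return 'Output hit the length limit: raise max_tokens.';
    }
    return 'No known issue detected.';
}

console.log(suggestFix({causeCode: 'ECONNREFUSED'}));
console.log(suggestFix({status: 404}));
console.log(suggestFix({finishReason: 'length'}));
```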
