
Cloud vs Local Models

One of the first architectural decisions for any AI application is where the model runs. The two poles are managed cloud platforms (you call an API, the provider hosts everything) and local / self-hosted open-weights models (you download the weights and run them on your own hardware). Most serious systems end up somewhere on the spectrum between them.

The core trade-off

| Dimension | Enterprise cloud | Local / self-hosted |
|---|---|---|
| Model quality ceiling | Highest (frontier: Claude, GPT, Gemini) | Strong open-weights, usually a step behind frontier |
| Data residency / privacy | Leaves your network (subject to contracts) | Stays on your hardware; works offline |
| Setup effort | Minutes (API key) | Hardware + ops; more for production serving |
| Cost model | Per token / per API call; scales to zero | Hardware capex or fixed GPU rental; cheap per token at volume |
| Scaling | Provider's problem | Your problem (GPUs, autoscaling) |
| Customization | Prompting, RAG, light fine-tuning | Full control: any model, fine-tune, quantize |
| Best when | Bursty traffic, fast time-to-value, need managed services | Privacy/offline needs, high constant volume, full control |

Enterprise cloud options

The cloud platforms are largely ways to access foundation models via API, plus managed RAG, agent, and guardrail services on top.

  • Amazon Bedrock -- one API for models from Anthropic (Claude), Meta (Llama), Mistral, Cohere, and Amazon (Nova/Titan), plus Knowledge Bases (managed RAG), Agents, Guardrails, and Flows. Serverless, per-token pricing.
  • Amazon SageMaker AI -- build, train, and fully control custom models; own the entire ML lifecycle (feature store, training jobs, model registry, monitoring). Serverful, per-hour pricing.
  • Azure AI / Microsoft Foundry -- Microsoft's managed AI platform and agent service, integrated with the Azure ecosystem and OpenAI models.
  • Google Vertex AI -- Gemini models, Agent Builder, and Vector Search on Google Cloud.

Bedrock vs SageMaker (the most common AWS question)

A useful one-liner: Bedrock = use pre-trained models as-is or with light fine-tuning and get managed RAG/agents/guardrails. SageMaker AI = build, train, and fully control custom models.

| | Bedrock | SageMaker AI |
|---|---|---|
| Primary purpose | Consume + lightly adapt foundation models | Build, train, deploy custom models |
| Target users | Developers, product engineers | Data scientists, ML engineers |
| Infrastructure | Serverless | Serverful (granular compute) |
| Pricing | Per token / API call | Per compute hour + storage |
| Model ownership | Provider-managed | Your models, your weights |

Many production AI systems on AWS use both: Bedrock Agents + Knowledge Bases + Guardrails for the conversational layer, with SageMaker endpoints serving custom domain models (fraud, recommendations) that the agents call as tools. Rule of thumb on cost: bursty or unpredictable traffic favors serverless (Bedrock), because you pay nothing while idle; constant high volume favors dedicated capacity (SageMaker endpoints, or self-hosting), because a fixed hourly cost is spread over more tokens.
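The cost rule of thumb can be made concrete with a break-even calculation. The sketch below is illustrative only: the two prices are hypothetical placeholders, not quotes from any provider, and the function names are invented for this example. It also assumes a single dedicated instance can actually serve the volume in question.

```python
def monthly_cost_per_token(tokens_per_month: float, usd_per_million_tokens: float) -> float:
    """Serverless-style cost: you pay only for the tokens you process."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


def monthly_cost_dedicated(usd_per_hour: float, hours_per_month: float = 730.0) -> float:
    """Dedicated-capacity cost: you pay for every hour, busy or idle."""
    return usd_per_hour * hours_per_month


def break_even_tokens(usd_per_million_tokens: float, usd_per_hour: float,
                      hours_per_month: float = 730.0) -> float:
    """Monthly token volume above which dedicated capacity becomes cheaper."""
    fixed = monthly_cost_dedicated(usd_per_hour, hours_per_month)
    return fixed / usd_per_million_tokens * 1_000_000


# Hypothetical prices, for illustration only.
PRICE_PER_MILLION_TOKENS = 3.00  # USD (assumed)
PRICE_PER_INSTANCE_HOUR = 2.00   # USD (assumed)

threshold = break_even_tokens(PRICE_PER_MILLION_TOKENS, PRICE_PER_INSTANCE_HOUR)
print(f"Dedicated capacity wins above ~{threshold / 1e6:.0f}M tokens/month")
```

Below the threshold, per-token pricing is cheaper; above it, the fixed hourly cost wins. A real comparison would also account for instance throughput limits and operations time.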

Local model usage

Running open-weights models (Llama, Mistral, Qwen, DeepSeek, Phi, Gemma) yourself keeps data on your hardware, works offline, and removes per-token cost -- at the price of owning the hardware and operations. The tooling has matured to the point where a laptop can run useful models.

| Tool | What it is | Best for |
|---|---|---|
| Ollama | Simple CLI + local API server for open-weights models | Easiest local setup; dev and scripting |
| LM Studio | Desktop GUI for discovering, downloading, and chatting with models | Non-CLI users; quick experimentation |
| llama.cpp | High-performance C/C++ inference engine (GGUF format) | Maximum control; runs on CPU and modest GPUs |
| GPT4All | Desktop app + ecosystem for local chat | Friendly offline assistant |
| OpenWebUI | Self-hosted web chat UI (often paired with Ollama) | A private ChatGPT-style interface |
| vLLM | High-throughput GPU serving engine | Production self-hosting at scale |

The practical limit is VRAM: the model weights, plus the KV cache and activations, have to fit in GPU memory. Pick the largest parameter count that fits; quantization is what raises that ceiling.
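As a rough sizing aid, weight memory is approximately parameter count times bits per weight divided by 8. The sketch below applies that to a list of candidate sizes. The 20% headroom reserved for the KV cache and activations is an assumption chosen for illustration, and the function names are hypothetical.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def largest_model_that_fits(vram_gb, candidates_billions,
                            bits_per_weight=4.0, headroom_fraction=0.2):
    """Return the largest candidate (in billions of parameters) whose weights
    fit after reserving headroom, or None if nothing fits."""
    budget = vram_gb * (1 - headroom_fraction)  # assumed reserve for KV cache etc.
    fitting = [p for p in candidates_billions
               if weights_gb(p, bits_per_weight) <= budget]
    return max(fitting) if fitting else None


sizes = [3.8, 7, 14, 32, 70]  # candidate model sizes, billions of parameters
print(largest_model_that_fits(12, sizes, bits_per_weight=4))   # 4-bit weights
print(largest_model_that_fits(12, sizes, bits_per_weight=16))  # FP16 weights
```

On the same 12 GB card, the 4-bit budget admits a much larger model than the FP16 budget. Long contexts need more headroom than the fraction assumed here.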

Quantization and QLoRA make local models practical

Quantization replaces model weights with lower-precision approximations to cut memory use and, in some cases, speed up inference. Modern 4-bit formats (NF4 with double quantization) typically lose little quality on most tasks while shrinking the footprint dramatically -- a 3.8B model drops from ~15 GB at FP32 (4 bytes per parameter) to ~2.2 GB at 4-bit, which is what lets it fit on a consumer GPU.
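To show what "lower-precision approximation" means in practice, here is a minimal sketch of blockwise quantization in pure Python. It uses plain symmetric absmax rounding onto a uniform 4-bit grid, which is a simpler stand-in and not NF4 itself (NF4 places its 16 levels at quantiles of a normal distribution). The block size of 64 and the weight distribution are assumptions for illustration.

```python
import random


def quantize_block(block, bits=4):
    """Symmetric absmax quantization: map floats to small signed integers
    plus one floating-point scale shared by the whole block."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    scale = max(abs(w) for w in block) / qmax or 1.0  # avoid a zero scale
    codes = [round(w / scale) for w in block]
    return codes, scale


def dequantize_block(codes, scale):
    """Recover approximate weights from integer codes and the block scale."""
    return [c * scale for c in codes]


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(64)]  # one block of fake weights
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)

max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(f"scale={scale:.5f}  max abs error={max_err:.5f}")
```

Each weight is stored as a 4-bit code instead of a 32-bit float, at the cost of one scale per block and a rounding error of at most half a quantization step.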

QLoRA builds on this: quantize the base model to 4-bit, freeze it, and train small LoRA adapters on top. It is the standard recipe for fine-tuning a moderate-size LLM on a single consumer GPU -- domain adaptation without renting a cluster. See RAG vs fine-tuning for when adaptation is worth it at all (usually: RAG for facts, fine-tune for style).
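The reason the adapters are "small" follows from simple arithmetic: for a weight matrix of shape d_out x d_in, LoRA trains two low-rank factors, B (d_out x r) and A (r x d_in), instead of the full matrix. The sketch below counts parameters for one layer; the 3072 x 3072 shape and rank 16 are hypothetical values chosen for illustration.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Parameters in the frozen base weight matrix."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters LoRA adds: B is d_out x rank, A is rank x d_in."""
    return rank * (d_out + d_in)


# Hypothetical layer shape and rank, for illustration only.
D_OUT, D_IN, RANK = 3072, 3072, 16

base = full_params(D_OUT, D_IN)
adapter = lora_params(D_OUT, D_IN, RANK)
print(f"base: {base:,}  adapter: {adapter:,}  ratio: {adapter / base:.2%}")
```

Under these assumed shapes the adapter is about 1% of the layer it modifies, so gradients and optimizer state are needed only for that small fraction while the 4-bit base stays frozen.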

Worked local setups
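One hedged example of a local setup: calling a model served by Ollama from Python using only the standard library. This assumes Ollama is running on its default port (11434) and that a model has already been pulled; the model name `llama3.2` is an assumption, so substitute whatever you have installed.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(prompt: str, model: str = "llama3.2") -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server.
    The default model name is an assumption; use one you have pulled."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "llama3.2", timeout: float = 120.0) -> str:
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the server up (for example after `ollama pull llama3.2`), `generate("Explain quantization in one sentence.")` returns the model's reply as a string. Nothing leaves the machine, which is the point of the local setup.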

A practical default

  • Prototyping or low/bursty volume, no strict data-residency need -- start with a managed cloud API. Fastest to value, frontier quality, scales to zero.
  • Sensitive data, offline requirements, or heavy local experimentation -- run open-weights models locally with Ollama or LM Studio; quantize to fit your hardware.
  • High, constant production throughput -- self-host with vLLM (or dedicated cloud endpoints) for predictable cost control.
  • Hybrid is normal -- frontier cloud models for hard reasoning, smaller local/self-hosted models on hot, latency-sensitive, or privacy-sensitive paths.
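The defaults above can be written down as a small decision function. This is a sketch of the list's logic, not a policy engine: the `Workload` fields and the order in which the rules are checked are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of one workload or request path."""
    sensitive_data: bool = False
    needs_offline: bool = False
    high_constant_volume: bool = False


def recommend(w: Workload) -> str:
    """Map a workload to the default deployment suggested in the list above."""
    must_stay_local = w.sensitive_data or w.needs_offline
    if w.high_constant_volume:
        if must_stay_local:
            return "self-host with vLLM"
        return "self-host with vLLM or use dedicated cloud endpoints"
    if must_stay_local:
        return "run open-weights models locally (Ollama or LM Studio), quantized to fit"
    return "start with a managed cloud API"


print(recommend(Workload()))
print(recommend(Workload(sensitive_data=True)))
print(recommend(Workload(high_constant_volume=True)))
```

A hybrid system would call something like `recommend` per request path rather than once for the whole application, sending hard reasoning to a frontier cloud model and hot or privacy-sensitive paths to a local one.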
