
Run Local LLMs on Mac (Apple Silicon)

Use Ollama and Open WebUI to run quantized language models locally on M1/M2/M3 Macs with Metal acceleration.

Hardware Requirements

Apple Silicon Macs use unified memory — the same physical RAM serves both CPU and GPU. This means a 16 GB M2 Mac can run a 7–9B model with acceptable speed, because the GPU does not need a separate VRAM pool.

Minimum requirements:

| Spec | Minimum | Recommended |
| --- | --- | --- |
| Chip | M1 | M2 / M3 |
| RAM | 8 GB | 16 GB |
| Storage | 50 GB free | 100 GB free |
| macOS | 13 Ventura | 14 Sonoma+ |

8 GB RAM reality check: macOS keeps ~2–3 GB for itself, which leaves ~5–6 GB usable. 3B models at Q4_K_M run comfortably; a 7B model at Q4_K_M (about 4 GB of weights) is borderline, so expect slower token generation and possible swapping.

16 GB: Run 7B–13B models comfortably. The sweet spot for most users.

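32 GB+: Run models around 30B comfortably. Llama 3.3 70B at Q2_K (roughly 26 GB of weights) is a tight fit on a 32 GB machine and leaves very little headroom for context or other apps.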
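
The sizing guidance above reduces to simple arithmetic: quantized weights take roughly parameters × bits-per-weight / 8 bytes, and the OS reserve comes off the top. The sketch below is a rough estimate only. The 4.5 bits-per-weight figure is back-solved from this guide's "~4 GB for a 7B model at Q4_K_M", and the 1 GB allowance for context and runtime buffers is an assumption for illustration, not a measured value.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float, ram_gb: float,
         os_reserve_gb: float = 3.0, overhead_gb: float = 1.0) -> bool:
    """True if weights plus a fixed overhead fit in the RAM left after the OS reserve.

    os_reserve_gb follows the guide's ~2-3 GB figure; overhead_gb is an assumed
    allowance for the context window and runtime buffers.
    """
    usable_gb = ram_gb - os_reserve_gb
    return weights_gb(params_billion, bits_per_weight) + overhead_gb <= usable_gb


print(fits(3, 4.5, ram_gb=8))    # True: a 3B model on an 8 GB Mac
print(fits(13, 4.5, ram_gb=8))   # False: a 13B model on the same machine
```

Treat a True result as "worth trying", not a guarantee: long contexts and other running apps eat into the same pool.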

Intel Macs are outside the scope of this guide: Ollama's Metal acceleration targets Apple Silicon, so on Intel hardware inference falls back to the much slower CPU path.


Install Ollama

Homebrew is the simplest path:

brew install ollama

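Or download the macOS app from https://ollama.com/download (no Homebrew required). The install.sh one-liner was written for Linux, and older versions of the script exit with an error on macOS, so prefer the app download if this fails:

curl -fsSL https://ollama.com/install.sh | sh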

Verify installation:

ollama --version
# ollama version 0.3.x

Start the Ollama server. The macOS app starts it automatically in the background. With a Homebrew install, run brew services start ollama to get a background service, or start it manually in a terminal:

ollama serve
# Listening on 127.0.0.1:11434
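
To check from a script that the server is actually listening, a plain HTTP GET against the root URL is enough, since Ollama answers it with a short status message. A minimal sketch using only the standard library; the function name and the 2-second timeout are illustrative choices, not part of Ollama.

```python
import urllib.error
import urllib.request


def ollama_is_up(base_url: str = "http://127.0.0.1:11434", timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers 200 at base_url, False otherwise."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or a non-2xx response.
        return False


print("Ollama reachable:", ollama_is_up())
```

Note that this only proves something is answering HTTP on that port; it does not confirm that any model is downloaded or loaded.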

Pull Your First Model

ollama pull llama3.2:3b

This downloads the 3B-parameter Llama 3.2 model (Q4_K_M, ~2 GB). It is a quick download and small enough to generate text quickly on any Apple Silicon Mac.

Run it immediately in the terminal:

ollama run llama3.2:3b
>>> Tell me about Apple Silicon.

Press Ctrl+D or type /bye to exit the interactive session.


Quantization Tradeoffs

GGUF quantization reduces model size at the cost of some quality loss, which grows as the bit width shrinks. Ollama uses llama.cpp under the hood and pulls pre-quantized models from the Ollama library.

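| Format | Size (7B model) | Quality | RAM needed | Speed (M2) |
| --- | --- | --- | --- | --- |
| F16 | ~14 GB | Best | 18+ GB | Slow |
| Q8_0 | ~7 GB | Excellent | 10 GB | Good |
| Q4_K_M | ~4 GB | Good | 6 GB | Fast |
| Q3_K_M | ~3.3 GB | Fair | 5 GB | Very fast |
| Q2_K | ~2.5 GB | Acceptable | 4 GB | Very fast |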

Recommendation: Use Q4_K_M by default. It is the best tradeoff for most workloads. Choose Q8_0 if you have 16+ GB and need higher accuracy.

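Ollama's default tags are usually Q4_K_M builds, so that is what you get when you pull without naming a quantization. To choose another, append the full tag after the colon (Llama 3.2 ships in 1B and 3B sizes only, so there is no 8b tag):

ollama pull llama3.2:3b-instruct-q8_0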
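
The name:tag convention is easy to handle in scripts. The sketch below splits a model reference and guesses the quantization from the tag suffix. It is a simplification for plain library names such as the ones in this guide: both helper names are hypothetical, the suffix pattern is an assumption based on common tag names, and references that include a registry host or namespace are not handled.

```python
import re
from typing import Optional, Tuple

# Assumed pattern for quantization suffixes such as q8_0, q4_K_M, fp16.
QUANT_SUFFIX = re.compile(r"(q\d+_[0-9A-Za-z_]+|fp16|f16)$")


def split_model_ref(ref: str) -> Tuple[str, str]:
    """Split 'name:tag' into (name, tag); a missing tag means 'latest'."""
    name, sep, tag = ref.partition(":")
    return name, (tag if sep and tag else "latest")


def quant_from_tag(tag: str) -> Optional[str]:
    """Return the quantization suffix of a tag, or None if the tag has none."""
    match = QUANT_SUFFIX.search(tag)
    return match.group(1) if match else None


name, tag = split_model_ref("llama3.2:3b-instruct-q8_0")
print(name, tag, quant_from_tag(tag))
```

A None result means the tag does not spell out its quantization; ollama show on the pulled model reports what was actually downloaded.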

Model Recommendations by RAM

8 GB RAM

ollama pull llama3.2:3b       # general purpose, fast
ollama pull phi4-mini         # Microsoft Phi-4 Mini, strong at coding
ollama pull gemma3:4b         # Google Gemma 3, good instruction following

16 GB RAM

ollama pull llama3.1:8b       # strong general-purpose 8B model (Llama 3.3 ships only as 70B)
ollama pull qwen2.5:7b        # strong multilingual and coding
ollama pull mistral-nemo:12b  # Mistral NeMo 12B, excellent for instruction
ollama pull deepseek-r1:8b    # reasoning-focused, CoT outputs

32 GB RAM

ollama pull llama3.3:70b-instruct-q2_K   # Llama 3.3 70B at Q2_K; the plain 70b tag is a larger Q4 build
ollama pull qwen2.5:32b       # excellent coding model
ollama pull deepseek-r1:32b   # reasoning-focused distilled model, strong on math and coding benchmarks

Metal / MPS Acceleration

Ollama runs inference on the GPU through Apple's Metal API (via llama.cpp's Metal backend) automatically on Apple Silicon. No configuration is needed.

To confirm Metal is active, check the Ollama server log:

# In the Ollama server terminal, look for:
# llm_load_tensors: offloading 32 repeating layers to GPU
# llm_load_tensors: offloaded 33/33 layers to GPU

When all layers are offloaded to GPU, inference is fastest. If only partial offload occurs (due to RAM pressure), performance drops significantly.
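
If you capture the server output to a file, the offload check can be automated. The sketch below looks for the "offloaded N/M layers to GPU" phrase shown above; the exact log wording may differ between Ollama versions, so treat the pattern as an assumption to adjust against your own logs.

```python
import re

# Matches the "offloaded 33/33 layers to GPU" phrase, whatever prefix precedes it.
OFFLOAD_RE = re.compile(r"offloaded (\d+)/(\d+) layers to GPU")


def offload_ratio(log_text: str):
    """Return (offloaded, total) from the last matching log line, or None."""
    matches = OFFLOAD_RE.findall(log_text)
    if not matches:
        return None
    offloaded, total = matches[-1]
    return int(offloaded), int(total)


def fully_offloaded(log_text: str) -> bool:
    """True only if a matching line was found and every layer is on the GPU."""
    ratio = offload_ratio(log_text)
    return ratio is not None and ratio[0] == ratio[1]


sample = "llm_load_tensors: offloaded 33/33 layers to GPU"
print(fully_offloaded(sample))  # True
```

The last match is used because each model load writes a fresh line, and the most recent load is usually the one of interest.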

To give the model the best chance of full GPU offload, close other memory-heavy apps before loading it:

# list the ten processes using the most memory
top -l 1 -o mem -n 10

Open WebUI

Open WebUI provides a ChatGPT-like interface for Ollama running locally.

Install with Docker

Ensure Docker Desktop for Mac is installed and running, then:

docker run -d \
  --name open-webui \
  -p 3000:8080 \
  -v open-webui:/app/backend/data \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Open http://localhost:3000 in your browser. First run will prompt you to create an admin account (local only, no external registration).

Select a Model

In Open WebUI: click the model dropdown at the top → select any model you have pulled in Ollama. If you do not see models, verify Ollama is running (ollama list).
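
The dropdown is populated from Ollama's model list, which you can also query directly over its HTTP API at /api/tags. A small sketch using only the standard library; the function names are illustrative, and parsing is kept separate from the network call so it can be checked without a running server.

```python
import json
import urllib.request


def model_names(tags_payload: dict) -> list:
    """Extract model names from a parsed /api/tags response."""
    return [model["name"] for model in tags_payload.get("models", [])]


def fetch_model_names(base_url: str = "http://localhost:11434") -> list:
    """Ask a running Ollama server which models it has available locally."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return model_names(json.load(resp))


# Example with a hand-written payload shaped like the server's response:
print(model_names({"models": [{"name": "llama3.2:3b"}]}))  # ['llama3.2:3b']
```

If fetch_model_names() returns an empty list, Ollama is running but no models have been pulled yet. If it raises a connection error, the server is not reachable at that address.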

Install Without Docker

pip install open-webui
open-webui serve

This avoids Docker but requires Python 3.11, the version the Open WebUI documentation targets.


API Access

Ollama exposes an OpenAI-compatible API at http://localhost:11434/v1. Any OpenAI SDK or tool that supports a custom base URL works:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"   # required by the SDK, value is ignored
)

response = client.chat.completions.create(
    model="llama3.2:3b",
    messages=[{"role": "user", "content": "What is 2 + 2?"}]
)
print(response.choices[0].message.content)
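
If you would rather not install the OpenAI SDK, the same endpoint accepts a plain JSON POST. The following is a minimal standard-library sketch; the helper names and the 120-second timeout are illustrative choices, and it assumes the llama3.2:3b model pulled earlier in this guide.

```python
import json
import urllib.request


def build_chat_request(prompt: str, model: str = "llama3.2:3b",
                       base_url: str = "http://localhost:11434/v1") -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, **kwargs) -> str:
    """Send the prompt to a running Ollama server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs), timeout=120) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

Calling chat("What is 2 + 2?") with the server running should return the model's answer as a string. The first call can be slow while the model loads into memory.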

Managing Models

# list downloaded models
ollama list

# remove a model (frees disk space)
ollama rm llama3.2:3b

# show model details
ollama show llama3.1:8b

# copy/create a custom model variant
ollama create my-model -f Modelfile

A minimal Modelfile for customizing system prompt:

FROM llama3.1:8b
SYSTEM "You are a concise assistant. Answer in under 3 sentences."
PARAMETER temperature 0.7
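
When you maintain several variants, generating the Modelfile from a script keeps them consistent. The sketch below renders the three directives used above and then shells out to ollama create. The helper names are hypothetical, the base model is the llama3.2:3b pulled earlier, and the triple-quoted SYSTEM form is used so that prompts containing double quotes stay valid.

```python
import subprocess
from pathlib import Path


def render_modelfile(base: str, system: str, **params) -> str:
    """Render a minimal Modelfile with FROM, SYSTEM, and PARAMETER lines."""
    lines = [f"FROM {base}", f'SYSTEM """{system}"""']
    lines += [f"PARAMETER {key} {value}" for key, value in params.items()]
    return "\n".join(lines) + "\n"


def create_model(name: str, modelfile_text: str, path: str = "Modelfile") -> None:
    """Write the Modelfile to disk and register it with `ollama create`."""
    Path(path).write_text(modelfile_text, encoding="utf-8")
    subprocess.run(["ollama", "create", name, "-f", path], check=True)


text = render_modelfile("llama3.2:3b",
                        "You are a concise assistant. Answer in under 3 sentences.",
                        temperature=0.7)
print(text)
```

Calling create_model("my-model", text) then has the same effect as running ollama create my-model -f Modelfile by hand.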
