
Local AI Coding with Ollama
Set up a local LLM on your laptop and use it for AI-assisted coding in VS Code, completely free and private.
Overview
Two main approaches for local AI coding:
• Direct Ollama integration (some tools) — Tools like Cline can connect to Ollama directly and use the models running on your machine.
• OpenAI-compatible API (any tool) — Point any tool that supports the OpenAI API standard at Ollama's /v1 endpoint; you can also expose the server so the API is reachable from other machines.
Part 1: Install Ollama
Install using brew
brew install ollama
Start the server
ollama serve
Ollama runs at http://localhost:11434 by default.
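To confirm the server is reachable before wiring up any tools, you can hit the root endpoint, which responds with a short status message. A minimal sketch using Python's standard library (adjust the host/port if you changed the defaults):
import urllib.request

try:
    with urllib.request.urlopen("http://localhost:11434", timeout=5) as resp:
        print(resp.read().decode())  # short status text if the server is up
except OSError as err:
    print(f"Ollama does not appear to be running: {err}")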
Part 2: Pull Coding Models
The newest open models perform significantly better than earlier local models, especially on agentic and complex coding tasks (competitive with closed models like GPT-4o).
Recommended models
# Best for Cline / agentic tasks (requires 16GB+ RAM/VRAM)
ollama pull gpt-oss:20b

# Best for pure code generation
ollama pull qwen2.5-coder:7b

# Maximum performance (requires 24GB+ RAM/VRAM)
ollama pull qwen2.5-coder:32b

# Fast general-purpose model
ollama pull ministral-3:8b
Check installed models
ollama list
Hardware guidelines
Treat the figures below as rough lower bounds, and always check the model's Ollama page for its specific memory requirements.
• 8GB → 3B - 7B models (e.g., qwen2.5-coder:7b)
• 16GB → 7B - 20B models (e.g., gpt-oss:20b)
• 24GB+ → 20B - 32B+ models (e.g., qwen2.5-coder:32b)
Part 3: Cline Setup
Cline is an autonomous coding agent that can create files, run terminal commands, and make multi-file changes.
Install
- Open VS Code
- Go to Extensions (Cmd+Shift+X / Ctrl+Shift+X)
- Search "Cline" and install
Configure for Ollama
- Click the Cline icon in the sidebar
- Open settings (gear icon)
- Set:
  - API Provider: Ollama
  - Base URL: http://localhost:11434
  - Model ID: Select your model (e.g., gpt-oss:20b)
Enable Compact Prompt (important!)
Local models have limited context. Enable compact prompts to reduce token usage by ~90%:
Cline Settings → Features → Use Compact Prompt → Toggle ON
Best models for Cline
Cline relies heavily on reliable tool-calling and reasoning.
# 16GB+ RAM
ollama pull gpt-oss:20b

# 8-16GB RAM
ollama pull qwen2.5-coder:7b
Tips for local models
• Keep tasks small and focused — local models work best with clear, scoped prompts
• Use compact prompts — essential for staying within context limits
• Be patient — responses are slower than cloud APIs
• If the model plans but doesn't execute — try a larger model or simpler prompts
Part 4: OpenAI-Compatible API
Ollama exposes an OpenAI-compatible API at /v1. This lets you use Ollama with any tool that supports custom OpenAI endpoints—Cursor, Aider, your own apps, etc.
Endpoint
http://localhost:11434/v1
Test it
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-oss:20b",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Python
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama" # Required but ignored
)
response = client.chat.completions.create(
model="qwen2.5-coder:7b",
messages=[{"role": "user", "content": "Write a Python hello world"}]
)
print(response.choices[0].message.content)
JavaScript
import OpenAI from 'openai'
const client = new OpenAI({
baseURL: 'http://localhost:11434/v1',
apiKey: 'ollama'
})
const response = await client.chat.completions.create({
model: 'ministral-3:8b',
messages: [{ role: 'user', content: 'Write a JS hello world' }]
})
console.log(response.choices[0].message.content)
Environment variables for easy switching
# Development (local Ollama)
OPENAI_API_KEY=ollama
OPENAI_BASE_URL=http://localhost:11434/v1
MODEL=gpt-oss:20b

# Production (OpenAI)
OPENAI_API_KEY=sk-xxx
OPENAI_BASE_URL=https://api.openai.com/v1
MODEL=gpt-4o
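With those variables set, application code doesn't need to hard-code either endpoint. A minimal Python sketch that reads the same variables as above (the fallback values are assumptions for local development):
import os
from openai import OpenAI

# Fall back to local Ollama defaults when the variables are not set.
client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY", "ollama"),
    base_url=os.environ.get("OPENAI_BASE_URL", "http://localhost:11434/v1"),
)
model = os.environ.get("MODEL", "gpt-oss:20b")

response = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Say hello"}],
)
print(response.choices[0].message.content)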
Supported endpoints
• /v1/chat/completions — Chat
• /v1/completions — Legacy completions
• /v1/embeddings — Embeddings (use models like embeddinggemma:300m)
• /v1/models — List models
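The embeddings endpoint works through the same OpenAI client. A minimal sketch, assuming you have already pulled an embedding model such as embeddinggemma:300m:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# One vector is returned per input string.
result = client.embeddings.create(
    model="embeddinggemma:300m",
    input=["def add(a, b): return a + b", "function add(a, b) { return a + b }"],
)
print(len(result.data), "vectors of dimension", len(result.data[0].embedding))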
Tools that work with this
• Cursor — Set custom API base in settings
• Aider — aider --openai-api-base http://localhost:11434/v1
• Open WebUI — Point to Ollama endpoint
• LangChain / LlamaIndex — Use OpenAI client with custom base
• Any OpenAI-compatible client
Part 5: Expose Ollama Remotely
By default, Ollama only listens on localhost. To access from other machines:
Set environment variable
export OLLAMA_HOST=0.0.0.0
ollama serve
Or permanently (add to ~/.zshrc or ~/.bashrc):
export OLLAMA_HOST=0.0.0.0
Connect from another machine
http://<your-ip>:11434
Use this IP in your tools instead of localhost.
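The earlier client examples work unchanged against the remote host; only the base URL changes. A minimal sketch with a placeholder address (replace 192.168.1.50 with your machine's actual IP):
from openai import OpenAI

# Same client as before, pointed at the remote Ollama host instead of localhost.
client = OpenAI(base_url="http://192.168.1.50:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="qwen2.5-coder:7b",
    messages=[{"role": "user", "content": "ping"}],
)
print(response.choices[0].message.content)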
Part 6: Increase Context Length
Default context is often 4096-8192 tokens. For larger codebases, create a custom model:
Create a Modelfile
cat > Modelfile << 'EOF'
FROM gpt-oss:20b
PARAMETER num_ctx 65536
EOF
Build the custom model
ollama create gpt-oss-64k -f Modelfile
Use it
Select gpt-oss-64k in Cline or your tools.
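To confirm the custom model was registered, you can list models through the OpenAI-compatible endpoint (ollama list gives the same information on the command line). A minimal sketch:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# gpt-oss-64k should appear alongside the base models.
for model in client.models.list():
    print(model.id)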
Quick Reference
Commands
• ollama serve — Start server
• ollama list — Show installed models
• ollama pull <model> — Download a model
• ollama rm <model> — Delete a model
• ollama run <model> — Chat in terminal
• ollama show <model> — Show model info
URLs
• Ollama API — http://localhost:11434
• OpenAI-compatible — http://localhost:11434/v1
Model recommendations
• Cline (Primary) — gpt-oss:20b (20B)
• Cline (Lighter) — qwen2.5-coder:7b (7B)
• Max Performance — qwen2.5-coder:32b (32B)
• Embeddings — embeddinggemma:300m (300M)
Troubleshooting
"Connection refused"
Make sure Ollama is running:
ollama serve
Slow responses
• Use a smaller model
• Close other GPU-intensive apps
• Enable compact prompts in Cline
Model not showing in Cline
- Pull it first: ollama pull <model>
- Restart VS Code
Out of memory
• Use a smaller model (e.g., switch from gpt-oss:20b to qwen2.5-coder:7b)
• Reduce context length
• Set OLLAMA_NUM_PARALLEL=1 to limit concurrent requests
Summary
- Install Ollama and pull a coding model (gpt-oss:20b or qwen2.5-coder:7b)
- Use Cline in VS Code for autonomous coding tasks
- Use the OpenAI-compatible API (/v1) with any other tool
- Enable compact prompts in Cline for better local performance
- Keep tasks focused — local models work best with clear, scoped prompts
Happy vibecoding! 🚀