
Local AI Coding with Ollama
Set up a local LLM on your laptop and use it for AI-assisted coding in VS Code, completely free and private.
Overview
Two main approaches for local AI coding:
• Direct Ollama integration (some tools) — Tools like Cline can connect to Ollama directly and use the models running on your machine.
• OpenAI-compatible API (any tool) — Point any tool that supports the OpenAI API standard at Ollama's /v1 endpoint; you can also expose the server so the API is reachable from other machines.
Part 1: Install Ollama
Install using brew
brew install ollama
Start the server
ollama serve
Ollama runs at http://localhost:11434 by default.
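To confirm the server is reachable before wiring up any tools, you can hit the root endpoint, which responds with a short status message. A minimal sketch using Python's standard library (adjust the host/port if you changed the defaults):
import urllib.request

try:
    with urllib.request.urlopen("http://localhost:11434", timeout=5) as resp:
        print(resp.read().decode())  # short status text if the server is up
except OSError as err:
    print(f"Ollama does not appear to be running: {err}")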
Part 2: Pull Coding Models
The newest open models perform significantly better than earlier local models, especially on agentic and complex coding tasks (competitive with closed models like GPT-4o).
Recommended models
# Best for Cline / agentic tasks (requires 16GB+ RAM/VRAM)
ollama pull gpt-oss:20b

# Best for pure code generation
ollama pull qwen2.5-coder:7b

# Maximum performance (requires 24GB+ RAM/VRAM)
ollama pull qwen2.5-coder:32b

# Fast general-purpose model
ollama pull ministral-3:8b
Check installed models
ollama list
Hardware guidelines
Treat the figures below as rough lower bounds, and always check the model's Ollama page for its specific memory requirements.
• 8GB → 3B - 7B models (e.g., qwen2.5-coder:7b)
• 16GB → 7B - 20B models (e.g., gpt-oss:20b)
• 24GB+ → 20B - 32B+ models (e.g., qwen2.5-coder:32b)
Part 3: Cline Setup
Cline is an autonomous coding agent that can create files, run terminal commands, and make multi-file changes.
Install
- Open VS Code
- Go to Extensions (Cmd+Shift+X / Ctrl+Shift+X)
- Search "Cline" and install
Configure for Ollama
- Click the Cline icon in the sidebar
- Open settings (gear icon)
- Set:
  - API Provider: Ollama
  - Base URL: http://localhost:11434
  - Model ID: Select your model (e.g., gpt-oss:20b)
Enable Compact Prompt (important!)
Local models have limited context. Enable compact prompts to reduce token usage by ~90%:
Cline Settings → Features → Use Compact Prompt → Toggle ON
Best models for Cline
Cline relies heavily on reliable tool-calling and reasoning.
# 16GB+ RAM
ollama pull gpt-oss:20b

# 8-16GB RAM
ollama pull qwen2.5-coder:7b
Tips for local models
• Keep tasks small and focused — local models work best with clear, scoped prompts
• Use compact prompts — essential for staying within context limits
• Be patient — responses are slower than cloud APIs
• If the model plans but doesn't execute — try a larger model or simpler prompts
Part 4: OpenAI-Compatible API
Ollama exposes an OpenAI-compatible API at /v1. This lets you use Ollama with any tool that supports custom OpenAI endpoints—Cursor, Aider, your own apps, etc.
Endpoint
http://localhost:11434/v1
Test it
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-oss:20b",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Python
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama" # Required but ignored
)
response = client.chat.completions.create(
model="qwen2.5-coder:7b",
messages=[{"role": "user", "content": "Write a Python hello world"}]
)
print(response.choices[0].message.content)
JavaScript
import OpenAI from 'openai'
const client = new OpenAI({
baseURL: 'http://localhost:11434/v1',
apiKey: 'ollama'
})
const response = await client.chat.completions.create({
model: 'ministral-3:8b',
messages: [{ role: 'user', content: 'Write a JS hello world' }]
})
console.log(response.choices[0].message.content)
Environment variables for easy switching
# Development (local Ollama)
OPENAI_API_KEY=ollama
OPENAI_BASE_URL=http://localhost:11434/v1
MODEL=gpt-oss:20b

# Production (OpenAI)
OPENAI_API_KEY=sk-xxx
OPENAI_BASE_URL=https://api.openai.com/v1
MODEL=gpt-4o
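With those variables set, application code doesn't need to hard-code either endpoint. A minimal Python sketch that reads the same variables as above (the fallback values are assumptions for local development):
import os
from openai import OpenAI

# Fall back to local Ollama defaults when the variables are not set.
client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY", "ollama"),
    base_url=os.environ.get("OPENAI_BASE_URL", "http://localhost:11434/v1"),
)
model = os.environ.get("MODEL", "gpt-oss:20b")

response = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Say hello"}],
)
print(response.choices[0].message.content)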
Supported endpoints
• /v1/chat/completions — Chat
• /v1/completions — Legacy completions
• /v1/embeddings — Embeddings (use models like embeddinggemma:300m)
• /v1/models — List models
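The embeddings endpoint works through the same OpenAI client. A minimal sketch, assuming you have already pulled an embedding model such as embeddinggemma:300m:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# One vector is returned per input string.
result = client.embeddings.create(
    model="embeddinggemma:300m",
    input=["def add(a, b): return a + b", "function add(a, b) { return a + b }"],
)
print(len(result.data), "vectors of dimension", len(result.data[0].embedding))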
Tools that work with this
• Cursor — Set custom API base in settings
• Aider — aider --openai-api-base http://localhost:11434/v1
• Open WebUI — Point to Ollama endpoint
• LangChain / LlamaIndex — Use OpenAI client with custom base
• Any OpenAI-compatible client
Part 5: Expose Ollama Remotely
By default, Ollama only listens on localhost. To access from other machines:
Set environment variable
export OLLAMA_HOST=0.0.0.0
ollama serve
Or permanently (add to ~/.zshrc or ~/.bashrc):
export OLLAMA_HOST=0.0.0.0
Connect from another machine
http://<your-ip>:11434
Use this IP in your tools instead of localhost.
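The earlier client examples work unchanged against the remote host; only the base URL changes. A minimal sketch with a placeholder address (replace 192.168.1.50 with your machine's actual IP):
from openai import OpenAI

# Same client as before, pointed at the remote Ollama host instead of localhost.
client = OpenAI(base_url="http://192.168.1.50:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="qwen2.5-coder:7b",
    messages=[{"role": "user", "content": "ping"}],
)
print(response.choices[0].message.content)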
Part 6: Increase Context Length
Default context is often 4096-8192 tokens. For larger codebases, create a custom model:
Create a Modelfile
cat > Modelfile << 'EOF'
FROM gpt-oss:20b
PARAMETER num_ctx 65536
EOF
Build the custom model
ollama create gpt-oss-64k -f Modelfile
Use it
Select gpt-oss-64k in Cline or your tools.
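To confirm the custom model was registered, you can list models through the OpenAI-compatible endpoint (ollama list gives the same information on the command line). A minimal sketch:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# gpt-oss-64k should appear alongside the base models.
for model in client.models.list():
    print(model.id)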
Quick Reference
Commands
• ollama serve — Start server
• ollama list — Show installed models
• ollama pull <model> — Download a model
• ollama rm <model> — Delete a model
• ollama run <model> — Chat in terminal
• ollama show <model> — Show model info
URLs
• Ollama API — http://localhost:11434
• OpenAI-compatible — http://localhost:11434/v1
Model recommendations
• Cline (Primary) — gpt-oss:20b (20B)
• Cline (Lighter) — qwen2.5-coder:7b (7B)
• Max Performance — qwen2.5-coder:32b (32B)
• Embeddings — embeddinggemma:300m (300M)
Troubleshooting
"Connection refused"
Make sure Ollama is running:
ollama serve
Slow responses
• Use a smaller model
• Close other GPU-intensive apps
• Enable compact prompts in Cline
Model not showing in Cline
- Pull it first: ollama pull <model>
- Restart VS Code
Out of memory
• Use a smaller model (e.g., switch from gpt-oss:20b to qwen2.5-coder:7b)
• Reduce context length
• Set OLLAMA_NUM_PARALLEL=1 to limit concurrent requests
Summary
- Install Ollama and pull a coding model (gpt-oss:20b or qwen2.5-coder:7b)
- Use Cline in VS Code for autonomous coding tasks
- Use the OpenAI-compatible API (/v1) with any other tool
- Enable compact prompts in Cline for better local performance
- Keep tasks focused — local models work best with clear, scoped prompts
Happy vibecoding! 🚀