
Running Local Models on a VM
>>> Set up a GPU-enabled virtual machine to run Hugging Face models for text generation, image generation, and more.
Overview
Run open-source AI models from Hugging Face on your own GPU-enabled virtual machine. This gives you full control over the environment, no per-request API costs (you pay only for the VM itself), and the ability to run models without the content restrictions of hosted APIs.
• Set up a GPU VM with CUDA and Python
• Install Hugging Face libraries (transformers, diffusers)
• Load and run text and image generation models
• Optimize for different GPU memory sizes
Part 1: VM Setup
Choose a Cloud Provider
• Lambda Labs — Simple GPU rentals, good pricing
• RunPod — Pay-per-hour GPU instances
• Vast.ai — Marketplace for GPU rentals
• AWS — EC2 P4d/G5 instances
• GCP — A100/V100 instances
• Azure — NC/ND series VMs
Recommended GPU Specs
• 16GB VRAM — Run 7B parameter models, basic image generation
• 24GB VRAM — Run 13B models, high-res image generation
• 40GB+ VRAM — Run 30B+ models, batch generation
Verify GPU Access
Once your VM is running, verify GPU access:
# Check NVIDIA driver and GPU
nvidia-smi

# Check CUDA version
nvcc --version
You should see your GPU model and VRAM listed.
Part 2: Install Dependencies
Python Environment
# Update system packages
sudo apt update && sudo apt upgrade -y

# Install Python and pip (if not present)
sudo apt install python3 python3-pip -y

# Verify Python version (3.8+ required)
python3 --version
Install PyTorch with CUDA
# Install PyTorch with CUDA support
pip install torch torchvision --index-url \
  https://download.pytorch.org/whl/cu121

# Verify CUDA is available
python3 -c "import torch; print(torch.cuda.is_available())"
Should print True if CUDA is properly configured.
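If you also want to confirm which GPU PyTorch sees and how much VRAM it has (useful for matching the recommended specs in Part 1), a quick optional check is a few lines of Python; it assumes only that PyTorch was installed with CUDA support as above:
import torch

# Print the detected GPU and its total memory so you can compare it
# against the VRAM recommendations above.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}")
    print(f"VRAM: {props.total_memory / 1e9:.1f} GB")
else:
    print("No CUDA device detected")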
Install Hugging Face Libraries
# Core libraries
pip install transformers accelerate

# For image generation models
pip install diffusers

# Common dependencies
pip install sentencepiece protobuf pillow
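A simple sanity check, not required but handy before moving on, is to import each library and print its version:
import transformers
import diffusers
import accelerate

# Confirm the libraries import cleanly and show which versions are installed
print("transformers:", transformers.__version__)
print("diffusers:", diffusers.__version__)
print("accelerate:", accelerate.__version__)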
Part 3: Running Text Models
Load a Text Generation Model
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "mistralai/Mistral-7B-Instruct-v0.2"
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
# Generate text
prompt = "Explain quantum computing in simple terms:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,   # sampling must be enabled for temperature to take effect
    temperature=0.7
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Using a Pipeline (Simpler API)
from transformers import pipeline
import torch

generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

response = generator(
    "Write a haiku about coding:",
    max_new_tokens=50
)
print(response[0]["generated_text"])
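Instruction-tuned models such as Mistral-7B-Instruct expect prompts in their chat format. Rather than hand-writing the [INST] tags, you can let the tokenizer apply its chat template. This is a minimal sketch that assumes the model and tokenizer loaded earlier in this part are still in scope:
# Format a user message with the model's own chat template, then generate
# as before (model and tokenizer come from the example above).
messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to("cuda")

outputs = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))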
Part 4: Running Image Models
Load an Image Generation Model
import torch
from diffusers import DiffusionPipeline
# Load the model
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.bfloat16
)
pipe = pipe.to("cuda")
# Generate an image
prompt = "A serene mountain landscape at sunset, detailed, 8k"
image = pipe(prompt).images[0]
# Save the image
image.save("output.png")
print("Image saved!")Adjusting Generation Parameters
image = pipe(
    prompt,
    num_inference_steps=50,   # More steps = better quality, slower generation
    guidance_scale=7.5,       # Prompt adherence (typically 7-15)
    height=1024,
    width=1024,
    generator=torch.Generator("cuda").manual_seed(42)   # reproducible output
).images[0]
Part 5: Memory Optimization
For Limited VRAM (< 24GB)
# For text models: load in 8-bit (requires bitsandbytes, installed below)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,
    device_map="auto"
)

# For image models: enable CPU offloading
pipe.enable_model_cpu_offload()

# Enable attention slicing
pipe.enable_attention_slicing()

# For very large images
pipe.enable_vae_tiling()
4-bit Quantization (Even Lower VRAM)
pip install bitsandbytes
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"
)
Clear GPU Memory
import torch
import gc

# Delete model and clear cache
del model
gc.collect()
torch.cuda.empty_cache()
Part 6: Model Management
Pre-download Models
# Install HF CLI
pip install huggingface_hub

# Download a model
huggingface-cli download mistralai/Mistral-7B-Instruct-v0.2

# Set custom cache directory
export HF_HOME=/path/to/cache
Check Cache Location
Models are cached in ~/.cache/huggingface/hub/ by default. Large models can be 10-50GB+.
du -sh ~/.cache/huggingface/hub/
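You can also inspect the cache from Python using huggingface_hub's cache utilities. A small sketch (huggingface_hub is already installed as a dependency of transformers):
from huggingface_hub import scan_cache_dir

# Summarize what is in the local Hugging Face cache and how large each repo is
cache_info = scan_cache_dir()
print(f"Total cache size: {cache_info.size_on_disk / 1e9:.1f} GB")
for repo in cache_info.repos:
    print(f"{repo.repo_id}: {repo.size_on_disk / 1e9:.1f} GB")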
Part 7: Running as a Service
Simple API with Flask
# pip install flask
from flask import Flask, request, jsonify
from transformers import pipeline
import torch
app = Flask(__name__)
# Load model once at startup
generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
@app.route("/generate", methods=["POST"])
def generate():
    data = request.json
    prompt = data.get("prompt", "")
    result = generator(prompt, max_new_tokens=256)
    return jsonify({"response": result[0]["generated_text"]})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
Run in Background
# Using nohup
nohup python3 api.py > output.log 2>&1 &

# Or with screen
screen -S model-api
python3 api.py
# Ctrl+A, D to detach
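Once the service is running, you can test it from the same VM. A minimal sketch using only the Python standard library, assuming the API above is listening on port 5000:
import json
import urllib.request

# Send a prompt to the local /generate endpoint and print the response
payload = json.dumps({"prompt": "Write a haiku about coding:"}).encode("utf-8")
req = urllib.request.Request(
    "http://localhost:5000/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])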
Monitoring
Watch GPU Usage
# Real-time GPU monitoring
watch -n 1 nvidia-smi

# Or one-time check
nvidia-smi
Check Memory Usage
python3 -c "
import torch
print(f'Allocated: {torch.cuda.memory_allocated()/1e9:.1f}GB')
print(f'Reserved: {torch.cuda.memory_reserved()/1e9:.1f}GB')
"Troubleshooting
Out of Memory
• Use a smaller model variant (7B instead of 13B)
• Enable 8-bit or 4-bit quantization
• Use device_map="auto" to spread layers across CPU and GPU (see the sketch after this list)
• Enable CPU offloading for diffusion models
• Reduce batch size or image resolution
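If the model almost fits, one option is to cap how much VRAM automatic layer placement may use and let the rest spill to CPU RAM. A sketch of the device_map="auto" approach with accelerate-style max_memory limits; the specific limits here are placeholders to tune for your hardware:
from transformers import AutoModelForCausalLM

# Let accelerate place layers automatically, but cap GPU 0 at ~14 GiB
# and allow overflow onto CPU RAM (values are illustrative, not prescriptive).
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    device_map="auto",
    max_memory={0: "14GiB", "cpu": "30GiB"},
)
Layers offloaded to CPU run much slower, so treat this as a fallback rather than a performance optimization.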
CUDA Not Available
# Check if CUDA is detected
python3 -c "import torch; print(torch.cuda.is_available())"

# Check CUDA version PyTorch was built with
python3 -c "import torch; print(torch.version.cuda)"

# Reinstall PyTorch if needed
pip uninstall torch -y
pip install torch --index-url https://download.pytorch.org/whl/cu121
Slow Model Download
• Pre-download models before running scripts (see the Python sketch after this list)
• Set HF_HOME to a fast SSD location
• Use huggingface-cli download for reliable downloads
• Check your internet connection speed
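Besides the CLI shown in Part 6, you can pre-download from Python with huggingface_hub's snapshot_download; the cache path below is a placeholder:
from huggingface_hub import snapshot_download

# Download the full model repository ahead of time so scripts start instantly.
# cache_dir is optional; point it at a fast SSD if you use a custom cache.
snapshot_download(
    "mistralai/Mistral-7B-Instruct-v0.2",
    cache_dir="/path/to/cache",  # optional, defaults to HF_HOME / ~/.cache
)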
Import Errors
# Reinstall with compatible versions
pip uninstall transformers diffusers -y
pip install transformers diffusers accelerate

# Install missing tokenizer dependencies
pip install sentencepiece protobuf
Popular Models to Try
Text Generation
• mistralai/Mistral-7B-Instruct-v0.2 — Fast, capable 7B model
• meta-llama/Llama-2-13b-chat-hf — Meta's chat model (gated; requires accepting the license on Hugging Face)
• codellama/CodeLlama-7b-hf — Code-focused model
• Qwen/Qwen2.5-7B-Instruct — Strong multilingual model
Image Generation
• stabilityai/stable-diffusion-xl-base-1.0 — High quality images
• runwayml/stable-diffusion-v1-5 — Classic, well-supported
• black-forest-labs/FLUX.1-schnell — Fast generation
Quick Reference
• nvidia-smi — Check GPU status
• watch -n 1 nvidia-smi — Monitor GPU in real-time
• huggingface-cli download <model> — Pre-download model
• torch.cuda.empty_cache() — Clear GPU memory
• transformers — Text models (LLMs, BERT, etc.)
• diffusers — Image generation models
• accelerate — Multi-GPU and mixed precision
• bitsandbytes — Quantization (8-bit, 4-bit)
Summary
• Set up your VM with NVIDIA drivers and CUDA
• Install PyTorch with CUDA support
• Install Hugging Face libraries (transformers, diffusers)
• Use quantization if you have limited VRAM
• Monitor GPU usage with nvidia-smi
• Pre-download models for faster iteration
Happy model running! 🚀