
Running Local Models on a VM
>>> Set up a GPU-enabled virtual machine to run Hugging Face models for text generation, image generation, and more.
Overview
Run open-source AI models from Hugging Face on your own GPU-enabled virtual machine. This gives you full control over the environment, no per-request API costs (you pay only for the VM itself), and the ability to run models without the content restrictions of hosted APIs.
• Set up a GPU VM with CUDA and Python
• Install Hugging Face libraries (transformers, diffusers)
• Load and run text and image generation models
• Optimize for different GPU memory sizes
Part 1: VM Setup
Choose a Cloud Provider
• Lambda Labs — Simple GPU rentals, good pricing
• RunPod — Pay-per-hour GPU instances
• Vast.ai — Marketplace for GPU rentals
• AWS — EC2 P4d/G5 instances
• GCP — A100/V100 instances
• Azure — NC/ND series VMs
Recommended GPU Specs
• 16GB VRAM — Run 7B parameter models, basic image generation
• 24GB VRAM — Run 13B models, high-res image generation
• 40GB+ VRAM — Run 30B+ models, batch generation
Verify GPU Access
Once your VM is running, verify GPU access:
# Check NVIDIA driver and GPU
nvidia-smi

# Check CUDA version
nvcc --version
You should see your GPU model and VRAM listed.
Part 2: Install Dependencies
Python Environment
# Update system packages
sudo apt update && sudo apt upgrade -y

# Install Python and pip (if not present)
sudo apt install python3 python3-pip -y

# Verify Python version (3.8+ required)
python3 --version
Install PyTorch with CUDA
# Install PyTorch with CUDA support
pip install torch torchvision --index-url \
  https://download.pytorch.org/whl/cu121

# Verify CUDA is available
python3 -c "import torch; print(torch.cuda.is_available())"
Should print True if CUDA is properly configured.
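If you also want to confirm which GPU PyTorch sees and how much VRAM it has (useful for matching the recommended specs in Part 1), a quick optional check is a few lines of Python; it assumes only that PyTorch was installed with CUDA support as above:
import torch

# Print the detected GPU and its total memory so you can compare it
# against the VRAM recommendations above.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}")
    print(f"VRAM: {props.total_memory / 1e9:.1f} GB")
else:
    print("No CUDA device detected")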
Install Hugging Face Libraries
# Core libraries
pip install transformers accelerate

# For image generation models
pip install diffusers

# Common dependencies
pip install sentencepiece protobuf pillow
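A simple sanity check, not required but handy before moving on, is to import each library and print its version:
import transformers
import diffusers
import accelerate

# Confirm the libraries import cleanly and show which versions are installed
print("transformers:", transformers.__version__)
print("diffusers:", diffusers.__version__)
print("accelerate:", accelerate.__version__)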
Part 3: Running Text Models
Load a Text Generation Model
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "mistralai/Mistral-7B-Instruct-v0.2"
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
# Generate text
prompt = "Explain quantum computing in simple terms:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,   # sampling must be enabled for temperature to take effect
    temperature=0.7
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Using a Pipeline (Simpler API)
from transformers import pipeline
import torch

generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

response = generator(
    "Write a haiku about coding:",
    max_new_tokens=50
)
print(response[0]["generated_text"])
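Instruction-tuned models such as Mistral-7B-Instruct expect prompts in their chat format. Rather than hand-writing the [INST] tags, you can let the tokenizer apply its chat template. This is a minimal sketch that assumes the model and tokenizer loaded earlier in this part are still in scope:
# Format a user message with the model's own chat template, then generate
# as before (model and tokenizer come from the example above).
messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to("cuda")

outputs = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))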
Part 4: Running Image Models
Load an Image Generation Model
import torch
from diffusers import DiffusionPipeline
# Load the model
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.bfloat16
)
pipe = pipe.to("cuda")
# Generate an image
prompt = "A serene mountain landscape at sunset, detailed, 8k"
image = pipe(prompt).images[0]
# Save the image
image.save("output.png")
print("Image saved!")Adjusting Generation Parameters
image = pipe(
    prompt,
    num_inference_steps=50,   # More steps = better quality, slower generation
    guidance_scale=7.5,       # Prompt adherence (typically 7-15)
    height=1024,
    width=1024,
    generator=torch.Generator("cuda").manual_seed(42)   # reproducible output
).images[0]
Part 5: Memory Optimization
For Limited VRAM (< 24GB)
# For text models: load in 8-bit (requires bitsandbytes, installed below)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,
    device_map="auto"
)

# For image models: enable CPU offloading
pipe.enable_model_cpu_offload()

# Enable attention slicing
pipe.enable_attention_slicing()

# For very large images
pipe.enable_vae_tiling()
4-bit Quantization (Even Lower VRAM)
pip install bitsandbytes
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"
)
Clear GPU Memory
import torch
import gc

# Delete model and clear cache
del model
gc.collect()
torch.cuda.empty_cache()
Part 6: Model Management
Pre-download Models
# Install HF CLI
pip install huggingface_hub

# Download a model
huggingface-cli download mistralai/Mistral-7B-Instruct-v0.2

# Set custom cache directory
export HF_HOME=/path/to/cache
Check Cache Location
Models are cached in ~/.cache/huggingface/hub/ by default. Large models can be 10-50GB+.
du -sh ~/.cache/huggingface/hub/
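You can also inspect the cache from Python using huggingface_hub's cache utilities. A small sketch (huggingface_hub is already installed as a dependency of transformers):
from huggingface_hub import scan_cache_dir

# Summarize what is in the local Hugging Face cache and how large each repo is
cache_info = scan_cache_dir()
print(f"Total cache size: {cache_info.size_on_disk / 1e9:.1f} GB")
for repo in cache_info.repos:
    print(f"{repo.repo_id}: {repo.size_on_disk / 1e9:.1f} GB")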
Part 7: Running as a Service
Simple API with Flask
# pip install flask
from flask import Flask, request, jsonify
from transformers import pipeline
import torch
app = Flask(__name__)
# Load model once at startup
generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
@app.route("/generate", methods=["POST"])
def generate():
    data = request.json
    prompt = data.get("prompt", "")
    result = generator(prompt, max_new_tokens=256)
    return jsonify({"response": result[0]["generated_text"]})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
Run in Background
# Using nohup
nohup python3 api.py > output.log 2>&1 &

# Or with screen
screen -S model-api
python3 api.py
# Ctrl+A, D to detach
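Once the service is running, you can test it from the same VM. A minimal sketch using only the Python standard library, assuming the API above is listening on port 5000:
import json
import urllib.request

# Send a prompt to the local /generate endpoint and print the response
payload = json.dumps({"prompt": "Write a haiku about coding:"}).encode("utf-8")
req = urllib.request.Request(
    "http://localhost:5000/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])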
Monitoring
Watch GPU Usage
# Real-time GPU monitoring
watch -n 1 nvidia-smi

# Or one-time check
nvidia-smi
Check Memory Usage
python3 -c "
import torch
print(f'Allocated: {torch.cuda.memory_allocated()/1e9:.1f}GB')
print(f'Reserved: {torch.cuda.memory_reserved()/1e9:.1f}GB')
"Troubleshooting
Out of Memory
• Use a smaller model variant (7B instead of 13B)
• Enable 8-bit or 4-bit quantization
• Use device_map="auto" to spread layers across CPU and GPU (see the sketch after this list)
• Enable CPU offloading for diffusion models
• Reduce batch size or image resolution
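If the model almost fits, one option is to cap how much VRAM automatic layer placement may use and let the rest spill to CPU RAM. A sketch of the device_map="auto" approach with accelerate-style max_memory limits; the specific limits here are placeholders to tune for your hardware:
from transformers import AutoModelForCausalLM

# Let accelerate place layers automatically, but cap GPU 0 at ~14 GiB
# and allow overflow onto CPU RAM (values are illustrative, not prescriptive).
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    device_map="auto",
    max_memory={0: "14GiB", "cpu": "30GiB"},
)
Layers offloaded to CPU run much slower, so treat this as a fallback rather than a performance optimization.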
CUDA Not Available
# Check if CUDA is detected
python3 -c "import torch; print(torch.cuda.is_available())"

# Check CUDA version PyTorch was built with
python3 -c "import torch; print(torch.version.cuda)"

# Reinstall PyTorch if needed
pip uninstall torch -y
pip install torch --index-url https://download.pytorch.org/whl/cu121
Slow Model Download
• Pre-download models before running scripts (see the Python sketch after this list)
• Set HF_HOME to a fast SSD location
• Use huggingface-cli download for reliable downloads
• Check your internet connection speed
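Besides the CLI shown in Part 6, you can pre-download from Python with huggingface_hub's snapshot_download; the cache path below is a placeholder:
from huggingface_hub import snapshot_download

# Download the full model repository ahead of time so scripts start instantly.
# cache_dir is optional; point it at a fast SSD if you use a custom cache.
snapshot_download(
    "mistralai/Mistral-7B-Instruct-v0.2",
    cache_dir="/path/to/cache",  # optional, defaults to HF_HOME / ~/.cache
)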
Import Errors
# Reinstall with compatible versions
pip uninstall transformers diffusers -y
pip install transformers diffusers accelerate

# Install missing tokenizer dependencies
pip install sentencepiece protobuf
Popular Models to Try
Text Generation
• mistralai/Mistral-7B-Instruct-v0.2 — Fast, capable 7B model
• meta-llama/Llama-2-13b-chat-hf — Meta's chat model (gated; requires accepting the license on Hugging Face)
• codellama/CodeLlama-7b-hf — Code-focused model
• Qwen/Qwen2.5-7B-Instruct — Strong multilingual model
Image Generation
• stabilityai/stable-diffusion-xl-base-1.0 — High quality images
• runwayml/stable-diffusion-v1-5 — Classic, well-supported
• black-forest-labs/FLUX.1-schnell — Fast generation
Quick Reference
• nvidia-smi — Check GPU status
• watch -n 1 nvidia-smi — Monitor GPU in real-time
• huggingface-cli download <model> — Pre-download model
• torch.cuda.empty_cache() — Clear GPU memory
• transformers — Text models (LLMs, BERT, etc.)
• diffusers — Image generation models
• accelerate — Multi-GPU and mixed precision
• bitsandbytes — Quantization (8-bit, 4-bit)
Summary
• Set up your VM with NVIDIA drivers and CUDA
• Install PyTorch with CUDA support
• Install Hugging Face libraries (transformers, diffusers)
• Use quantization if you have limited VRAM
• Monitor GPU usage with nvidia-smi
• Pre-download models for faster iteration
Happy model running! 🚀