Local LLM in Linux

Overview

Play a bit with local LLM models on a budget laptop while keeping a stable Linux base system; in my case, Fedora Linux.

Current hardware

  • CPU: AMD Ryzen 5 Pro 7535U
  • GPU: AMD Radeon 660M (integrated)
  • RAM: 32 GB (previously 16 GB)

ROCm

The official ROCm drivers for AMD GPUs on Linux work fine for local LLMs, but with lower performance than Vulkan. I leave here the steps I followed to try them.

Distrobox

Distrobox is ideal for testing or playing with Linux distros because it is lightweight and very well integrated with the host machine.

sudo dnf install distrobox

Install Ubuntu 22.04

distrobox create \
--name rocm-ubuntu \
--image ubuntu:22.04 \
--additional-flags "--device /dev/kfd --device /dev/dri"

Install ROCm and PyTorch for ROCm

distrobox enter rocm-ubuntu # add --verbose if you have any issue during first boot
sudo apt update
sudo apt install -y \
  wget \
  gnupg \
  ca-certificates \
  software-properties-common \
  lsb-release
sudo apt install -y python3-pip python3-setuptools python3-wheel
wget https://repo.radeon.com/amdgpu-install/7.2/ubuntu/jammy/amdgpu-install_7.2.70200-1_all.deb
sudo dpkg -i amdgpu-install_7.2.70200-1_all.deb
sudo amdgpu-install -y --usecase=graphics,rocm --no-dkms
export HSA_OVERRIDE_GFX_VERSION=10.3.0 # remember to add this in your .bashrc or .zshrc!
mkdir -p ~/pip_tmp
export TMPDIR=~/pip_tmp
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm7.1 --no-cache-dir
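
To make the HSA_OVERRIDE_GFX_VERSION override from above survive new shells, a minimal sketch (assuming bash; use ~/.zshrc instead if you run zsh):

echo 'export HSA_OVERRIDE_GFX_VERSION=10.3.0' >> ~/.bashrc
source ~/.bashrc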

Testing the GPU with ROCm

You can run rocminfo and/or run this Python script:

import torch

try:
    print(f"ROCm Version: {torch.version.hip}")
    if torch.cuda.is_available():
        # Try to actually move data to the GPU (This fails if binaries are missing)
        x = torch.tensor([1.0, 2.0, 3.0]).cuda()
        print(f"✅ SUCCESS! Tensor created on: {torch.cuda.get_device_name(0)}")
        print(x)
    else:
        print("❌ No GPU detected by PyTorch.")
except Exception as e:
    print(f"❌ CRASHED: {e}")

Check the current llama.cpp issue when running on AMD APUs (like my laptop's) with ROCm. This is why I prefer the Vulkan approach.

Vulkan

Install Vulkan and llama.cpp

Ubuntu

wget -qO - https://packages.lunarg.com/lunarg-signing-key-pub.asc | sudo apt-key add -
sudo wget -qO /etc/apt/sources.list.d/lunarg-vulkan-jammy.list https://packages.lunarg.com/vulkan/lunarg-vulkan-jammy.list
sudo apt update
sudo apt install vulkan-sdk

Fedora

sudo dnf install git make gcc-c++ vulkan-headers vulkan-loader-devel libshaderc-devel glslc glslang cmake ninja-build 

Now test it with: vulkaninfo | head -20
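
If the full output is too noisy, a shorter check (assuming your vulkaninfo build supports --summary, as recent releases do):

vulkaninfo --summary | grep -iE 'deviceName|driverName'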

Install local LLMs

I tried Ollama and llama.cpp. The latter gets higher performance, but I leave the steps for both.

Ollama

Download and install the Ollama binary and libs

curl -fsSL https://ollama.com/install.sh | sh
ollama serve

In another terminal, test a basic model.

distrobox enter rocm-ubuntu
ollama run llama3.2
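
To see whether the model actually landed on the GPU or fell back to CPU, Ollama's process listing shows the split:

ollama ps   # the PROCESSOR column reports how much of the model sits on GPU vs CPU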

Compile ollama with ROCm [not recommended]

export ROCM_PATH=/opt/rocm-7.2.0
export HIP_PATH=${ROCM_PATH}
export HSA_PATH=${ROCM_PATH}
export LD_LIBRARY_PATH=${ROCM_PATH}/lib:${ROCM_PATH}/lib64:$LD_LIBRARY_PATH
rm -rf build-rocm # if you have tried it before
cmake -B build-rocm \
  -S . \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS=gfx1030 \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_PREFIX_PATH=${ROCM_PATH} \
  -DCMAKE_HIP_COMPILER=${ROCM_PATH}/llvm/bin/clang++
cmake --build build-rocm -j$(nproc)

llama.cpp

Download source code and build it:

git clone --depth 1 https://github.com/ggml-org/llama.cpp.git # skip if you already cloned it during the ROCm attempt
cd llama.cpp
make clean
# rm -rf build-vulkan # if you built it previously
cmake -B build-vulkan \
  -S . \
  -DGGML_VULKAN=ON \
  -DGGML_CUDA=OFF \
  -DGGML_HIP=OFF \
  -DCMAKE_BUILD_TYPE=Release \
  -G Ninja
cmake --build build-vulkan -j$(nproc)

A basic WebUI is also available, useful for including documents, images, or files in general.

~/src/llama.cpp/build-vulkan/bin/llama-server \
  -m ~/models/qwen2.5-coder-7b-instruct-q4_k_m.gguf \
  -ngl -1 \
  -c 0 \
  --jinja \
  --mlock \
  --host 127.0.0.1 --port 8033
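
With the server running, the WebUI is reachable at http://127.0.0.1:8033 in a browser, and the same port exposes llama-server's OpenAI-compatible API. A minimal sketch against it (port and model as above; the model name in the request is only informational here since a single model is loaded):

curl -s http://127.0.0.1:8033/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-coder-7b-instruct",
    "messages": [{"role": "user", "content": "Write a Python hello world."}],
    "max_tokens": 128
  }'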

GPU monitoring

In other terminals you can monitor the hardware resources:

watch -n 1 rocm-smi # subterminal 1
radeontop # subterminal 2

LLM performance benchmarks

Common code-related prompt:

export PROMPT_AI="Write a Python function that loads a YAML file and validates required keys."
  • ollama:
ollama serve
ollama pull hf.co/Qwen/Qwen2.5-Coder-7B-Instruct-GGUF:Q4_K_M
curl -s http://localhost:11434/api/generate -d "{
  \"model\": \"qwen2.5-coder:7b-instruct-q4_K_M\",
  \"prompt\": \"$PROMPT_AI\",
  \"options\": {\"num_ctx\": 8192, \"num_gpu\": 99},
  \"stream\": true
}"
  • llama.cpp
build-vulkan/bin/llama-cli \
  -m ~/models/qwen2.5-coder-7b-instruct-q4_k_m.gguf \
  -ngl -1 \
  -c 8192 \
  -t 6 \
  -n 512 \
  --color \
  -p "$PROMPT_AI"

IDE integration

Visual Studio Code

References
