
How to install LLAMA CPP with CUDA (on Windows)

As LLMs such as OpenAI's GPT have become very popular, there have been many attempts to run LLMs in a local environment. The most famous LLMs that we can install locally are the LLaMA models. However, running LLMs requires a lot of computing power, even just for generating text. Therefore we need GPUs to speed up generation.

Recently, a C/C++ port of the LLaMA model, llama.cpp, was developed. Since it is written in C/C++, a high-performance language, it could run even faster than ChatGPT when paired with a high-performance computing platform.

Although I don't have such a high-performance computing platform, I tried installing some llama.cpp models with GPU support enabled.

Zephyr 7B

Zephyr 7B is a fine-tuned version of Mistral 7B, and it shows great performance on extraction, coding, STEM, and writing compared to other open models of similar size. The llama.cpp team introduced a new file format called GGUF for its models. The repo below contains the model in GGUF format, and this is the model I installed.

https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF
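For example, one way to fetch a GGUF file from that repo is with the huggingface_hub package. This is a minimal sketch; the quantization filename is an assumption on my part, so check the repo's file list and pick whichever quantization fits your GPU memory.

# a minimal sketch: download one GGUF quantization from the repo above
# (pip install huggingface_hub first)
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="TheBloke/zephyr-7B-beta-GGUF",
    filename="zephyr-7b-beta.Q4_K_M.gguf",  # assumed filename; verify on the repo page
)
print(model_path)  # local path to the downloaded model file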

To use llama.cpp from Python, the llama-cpp-python package should be installed. But to use the GPU, we must set an environment variable first. Make sure there are no spaces and no quotation marks ("" or '') when setting the environment variable.

Since I use Anaconda, I ran the commands below in the Anaconda Prompt to install llama-cpp-python.

# on anaconda prompt!
set CMAKE_ARGS=-DLLAMA_CUBLAS=on
pip install llama-cpp-python

# if you somehow fail and need to re-install, run the command below.
# it ignores previously downloaded files and re-installs with fresh ones.
pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir --verbose

Running the commands above showed no errors, but you still have to check whether GPU support was actually compiled in. When you actually run the model (with the verbose option set to True), you can observe startup logs, and BLAS must be set to 1 there. Otherwise the LLaMA model will not use the GPU.
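Here is a minimal sketch of that check, assuming the GGUF file downloaded earlier; the prompt string and layer count are just illustrative values.

# a minimal sketch: load the model with verbose=True and offload layers to the GPU
from llama_cpp import Llama

llm = Llama(
    model_path="zephyr-7b-beta.Q4_K_M.gguf",  # path from the download step (assumed filename)
    n_gpu_layers=-1,   # offload all layers to the GPU; lower this if you run out of VRAM
    verbose=True,      # prints a system info line -- look for "BLAS = 1" in it
)

output = llm("Q: What is the capital of France? A:", max_tokens=32)
print(output["choices"][0]["text"])

If the log shows BLAS = 0 instead, the package was built without CUDA; re-run the force-reinstall command above with CMAKE_ARGS still set.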
