


Run GPTQ, GGML, GGUF… One Library to rule them ALL!

by Fabio Matricardi | Artificial Corner

Created by the author and Leonardo.ai

Quantization and hardware resources are tightly intertwined: so why should it be so complicated to run a Large Language Model on my computer? In this article I will explore with you a single library able to handle all the quantized models, plus a few tricks to make it work with any LLM.

NOTE: because of the astonishing benchmarks of Mistral-7b and Zephyr-7b, I will use these two models for the examples. In previous articles I already discussed how to run Llama2-based models, Vicuna, WizardLM and Orca_mini.

We will cover the following points:

- Quantization: what and why?
- The 2 main quantization formats: GGML/GGUF and GPTQ
- What do you need to start?
- The basics: 8 lines of code to run them all
- Prompt templates
- Tokenizer tips
- Mac M1/M2 Issues
- Conclusions

All the examples are also in the GitHub repo for this article. You can code along the way or use the Google Colab notebook from there.
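To give a taste of where we are heading, here is a minimal sketch of loading and running a quantized model. The article's own step-by-step code comes later; the library (ctransformers), the Hugging Face repo, the file name and the prompt format below are my assumptions for illustration, not necessarily the exact choices made in the rest of the article.

```python
# Minimal sketch (assumptions: ctransformers as the loading library,
# TheBloke's Mistral-7B-Instruct GGUF repo as the quantized checkpoint).
# pip install ctransformers   (ctransformers[gptq] adds experimental GPTQ support)
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.1-GGUF",           # repo with pre-quantized files
    model_file="mistral-7b-instruct-v0.1.Q4_K_M.gguf",  # 4-bit quantized weights, roughly 4 GB
    model_type="mistral",                               # "llama" also works: same architecture family
    gpu_layers=0,                                       # CPU only; raise to offload layers to a GPU
)

# Mistral-Instruct expects the [INST] ... [/INST] prompt template
print(llm("[INST] Explain quantization in one sentence. [/INST]", max_new_tokens=128))
```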

Buckle up and let’s start!

Quantization: what and why?

LLM quantization refers to the process of compressing a large language model (LLM) to make it more efficient in terms of memory usage and computational requirements.

Large language models such as GPT-3, Llama2, Falcon and many others can be massive, often consisting of billions or even trillions of parameters. This size poses challenges when it comes to running them on consumer hardware (which is what almost 99% of us have).

LLM quantization aims to address these challenges by reducing the model size while minimizing the impact on the model's performance. The technique reduces the precision of the neural network's parameters: instead of representing a parameter with a 32-bit floating-point number, it is quantized to a lower-precision representation, such as a 4-bit, 8-bit or 16-bit integer. This reduction in precision allows for more efficient storage and computation.
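As a back-of-the-envelope check: a 7-billion-parameter model stored in 16-bit floats needs about 14 GB just for the weights, while a 4-bit version fits in roughly 3.5–4 GB. The snippet below is only a toy illustration of the idea (simple symmetric 8-bit quantization of one random weight matrix with NumPy); it is not the scheme used by GPTQ or GGUF, which are considerably more sophisticated.

```python
import numpy as np

# Toy illustration of quantization: map float32 weights onto 8-bit integers
# plus a single scale factor, then reconstruct them approximately.
weights = np.random.randn(4096, 4096).astype(np.float32)  # one dummy weight matrix

scale = np.abs(weights).max() / 127                  # fit the value range into int8 [-127, 127]
q_weights = np.round(weights / scale).astype(np.int8)
restored = q_weights.astype(np.float32) * scale      # dequantize back to float

print(f"float32 size: {weights.nbytes / 1e6:.0f} MB")    # ~67 MB
print(f"int8 size:    {q_weights.nbytes / 1e6:.0f} MB")  # ~17 MB, a 4x reduction
print(f"mean abs reconstruction error: {np.abs(weights - restored).mean():.4f}")
```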
