Quantization and hardware resources are tangled together: but why should it be so complicated to run a Large Language Model on my computer? In this article I will explore with you a single library able to handle all the quantized models, and a few tricks to make it work with any LLM.
NOTE: because of the astonishing benchmarks of Mistral-7b and Zephyr-7b, I will refer to these two models in the examples. In previous articles I already discussed how to run llama2-based models, Vicuna, WizardLM and Orca_mini.
We will cover the following points:
- Quantization: what and why?
- The 2 main quantization formats: GGML/GGUF and GPTQ
- What do you need to start?
- The basics: 8 lines of code to run them all
- Prompt templates
- Tokenizer tips
- Mac M1/M2 issues
- Conclusions
All the examples are also in the GitHub Repo for this article. You can code along the way or use the Google Colab Notebook from there.
Buckle up and let’s start!
LLM quantization refers to the process of compressing a large language model (LLM) to make it more efficient in terms of memory usage and computational requirements.
Large language models, such as GPT-3, Llama2, Falcon and many others, can be massive, often consisting of billions or even trillions of parameters. This size poses challenges when it comes to using them on consumer hardware (which is what almost 99% of us have).
LLM quantization aims to address these challenges by reducing the model size while minimizing the impact on the model’s performance. This technique reduces the precision of the neural network parameters. For example, instead of representing a parameter with a floating-point number, it can be quantized to a fixed-point number with a lower precision, such as a 4-bit, 8-bit or 16-bit integer. This reduction in precision allows for more efficient storage and computation.
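To make the idea concrete, here is a minimal sketch of naive symmetric 8-bit quantization written with NumPy. This is purely illustrative and not how the real formats work: GGUF/GGML and GPTQ use block-wise schemes, calibration and mixed bit-widths, but the core idea of trading precision for size is the same.

```python
import numpy as np

# Fake FP32 weights standing in for one layer of a model
weights = np.random.randn(4, 4).astype(np.float32)

# Symmetric per-tensor quantization: map the value range onto int8 (-127..127)
scale = np.abs(weights).max() / 127
q_weights = np.round(weights / scale).astype(np.int8)  # stored in 1 byte instead of 4

# Dequantize to approximately recover the original values at inference time
dequantized = q_weights.astype(np.float32) * scale
print("max reconstruction error:", np.abs(weights - dequantized).max())
```

Each weight now takes 1 byte instead of 4, roughly a 4x reduction in memory, at the cost of a small rounding error per parameter.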