Quantization and hardware resources are tangled together: but why should it be so complicated to run a Large Language Model on my computer? In this article I will explore with you a single library able to handle all the quantized models, and a few tricks to make it work with any LLM.
NOTE: because of the astonishing benchmarks of Mistral-7b and Zephyr-7b, I will refer to these two models in the examples. In previous articles I already discussed how to run llama2-based models, Vicuna, WizardLM and Orca_mini.
We will cover the following points:
- Quantization: what and why?
- The 2 main quantization formats: GGML/GGUF and GPTQ
- What do you need to start?
- The basics: 8 lines of code to run them all
- Prompt templates
- Tokenizer tips
- Mac M1/M2 issues
- Conclusions
All the examples are also in the GitHub Repo for this article. You can code along the way or use the Google Colab Notebook from there.
Buckle up and let’s start!
LLM quantization refers to the process of compressing a large language model (LLM) to make it more efficient in terms of memory usage and computational requirements.
Large language models, such as GPT-3, Llama2, Falcon and many others, can be massive, often consisting of billions or even trillions of parameters. This size poses challenges when it comes to using them on consumer hardware (which is what almost 99% of us have).
LLM quantization aims to address these challenges by reducing the model size while minimizing the impact on the model’s performance. This technique reduces the precision of the neural network parameters. For example, instead of representing a parameter with a floating-point number, it can be quantized to a fixed-point number with a lower precision, such as a 4-bit, 8-bit or 16-bit integer. This reduction in precision allows for more efficient storage and computation.
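To make the idea concrete, here is a minimal sketch of naive symmetric 8-bit quantization written with NumPy. This is purely illustrative and not how the real formats work: GGUF/GGML and GPTQ use block-wise schemes, calibration and mixed bit-widths, but the core idea of trading precision for size is the same.

```python
import numpy as np

# Fake FP32 weights standing in for one layer of a model
weights = np.random.randn(4, 4).astype(np.float32)

# Symmetric per-tensor quantization: map the value range onto int8 (-127..127)
scale = np.abs(weights).max() / 127
q_weights = np.round(weights / scale).astype(np.int8)  # stored in 1 byte instead of 4

# Dequantize to approximately recover the original values at inference time
dequantized = q_weights.astype(np.float32) * scale
print("max reconstruction error:", np.abs(weights - dequantized).max())
```

Each weight now takes 1 byte instead of 4, roughly a 4x reduction in memory, at the cost of a small rounding error per parameter.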