`g4l` is a high-level Python library that lets you run language models locally through the llama.cpp Python bindings. It is a sister project to @gpt4free, which also provides access to AI models, but via internet-based external providers, as well as additional features such as text retrieval from documents.
Pull requests are welcome!
- GUI / playground
- Support for function calling & image models
- TTS / STT models
- Blog article creator (uses multiple queries to produce a high-quality blog article with efficient style prompting and context retrieval)
- Allow passing more arguments
- Improve compatibility / unit tests
- Native binding implementation / more low-level usage of `llama-cpp-python`
- Ability to finetune models on datasets / dataset generator
- Optimise for devices with low memory and compute (current minimum RAM is 8 GB and a GPU is preferred)
- Blog articles explaining usage and how LLMs work
- Better model list / optimised parameters
- Create custom local benchmarking
To use G4L, you need to have the llama.cpp Python bindings installed. You can install them using pip:
```sh
pip3 install -U llama-cpp-python
```
- Clone the G4L repository:

  ```sh
  git clone https://github.com/gpt4free/gpt4local
  ```

- Navigate to the cloned directory:

  ```sh
  cd gpt4local
  ```

- Install the required dependencies:

  ```sh
  pip install -r requirements.txt
  ```

- Download the desired models in the GGUF format from HuggingFace. You can find a variety of quantized `.gguf` models on TheBloke's page (see the download sketch below).
- Place the downloaded models in the `./models` folder.
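For example, you can fetch a quantized model straight into `./models` with the `huggingface_hub` package. This is only a sketch: the repository and file names below are illustrative assumptions, so substitute the model you actually want.

```python
from huggingface_hub import hf_hub_download

# Example only: download a quantized GGUF file into ./models.
# Replace repo_id and filename with the model you actually want to use.
hf_hub_download(
    repo_id   = "TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
    filename  = "mistral-7b-instruct-v0.2.Q4_0.gguf",
    local_dir = "./models",
)
```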
Some popular models include:
The models are available in different quantization levels, such as `q2_0`, `q4_0`, `q5_0`, and `q8_0`. Higher quantization 'bit counts' (4 bits or more) generally preserve more quality, whereas lower levels compress the model further, which can lead to a significant loss in quality. The standard quantization level is `q4_0`.
Keep in mind the memory requirements for different model sizes:
- 7b parameters: ~8 GB of RAM
- 13b parameters: ~16 GB of RAM
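As a rough sanity check, you can estimate a quantized model's on-disk size from its parameter count and bits per weight. The helper below is a minimal sketch: the effective bits-per-weight figures are approximations, and actual RAM usage is higher once the context (KV cache) and runtime buffers are allocated, which is why ~8 GB is the practical minimum for a 7b model.

```python
# Rough estimate of a quantized model's file size (a lower bound on RAM usage).
# The bits-per-weight values are approximations that include quantization scales.
EFFECTIVE_BITS_PER_WEIGHT = {
    "q4_0": 4.5,  # approximate
    "q5_0": 5.5,  # approximate
    "q8_0": 8.5,  # approximate
}

def estimated_size_gb(n_params: float, quant: str = "q4_0") -> float:
    """Approximate on-disk size of a quantized model, in gigabytes."""
    bits = EFFECTIVE_BITS_PER_WEIGHT[quant]
    return n_params * bits / 8 / 1e9

# A 7b model at q4_0 is roughly 4 GB on disk; a 13b model is roughly 7 GB.
print(f"7b  @ q4_0 ≈ {estimated_size_gb(7e9, 'q4_0'):.1f} GB")
print(f"13b @ q4_0 ≈ {estimated_size_gb(13e9, 'q4_0'):.1f} GB")
```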
According to chat.lmsys.org, the best models are:
- Best 7b model: `Mistral-7B-Instruct-v0.2`
- Best open-source model: `Qwen1.5-72B-Chat` (available here)
```python
from g4l.local import LocalEngine

engine = LocalEngine(
    gpu_layers = -1,  # use all GPU layers
    cores      = 0    # use all CPU cores
)

response = engine.chat.completions.create(
    model    = 'orca-mini-3b-gguf2-q4_0',
    messages = [{"role": "user", "content": "hi"}],
    stream   = True
)

for token in response:
    print(token.choices[0].delta.content)
```
Note: The `model` parameter must match the file name of the `.gguf` model you placed in `./models`, without the `.gguf` extension!
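If you are unsure which names are available, you can list the files in `./models` and strip the extension. This is just a convenience sketch based on the folder layout described above, not part of the G4L API.

```python
from pathlib import Path

# List the model names that can be passed as `model=`, following the
# "file name without the .gguf extension" rule described above.
models_dir = Path("./models")
available_models = [p.stem for p in models_dir.glob("*.gguf")]
print(available_models)  # e.g. ['orca-mini-3b-gguf2-q4_0', 'mistral-7b-instruct']
```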
```python
from g4l.local import LocalEngine, DocumentRetriever

engine = LocalEngine(
    gpu_layers = -1,  # use all GPU layers
    cores      = 0,   # use all CPU cores
    document_retriever = DocumentRetriever(
        files       = ['einstein-albert.pdf'],
        embed_model = 'SmartComponents/bge-micro-v2',  # https://huggingface.co/spaces/mteb/leaderboard
    )
)

response = engine.chat.completions.create(
    model    = 'mistral-7b-instruct',
    messages = [
        {
            "role": "user", "content": "how was einstein's work in the laboratory"
        }
    ],
    stream   = True
)

for token in response:
    print(token.choices[0].delta.content or "", end="", flush=True)
```
Note: The embedding model will be downloaded on first use, but it is small and lightweight.
G4L provides a `DocumentRetriever` class that allows you to retrieve relevant information from documents based on a query. Here's an example of how to use it:
```python
from g4l.local import DocumentRetriever

engine = DocumentRetriever(
    files       = ['einstein-albert.txt'],
    embed_model = 'SmartComponents/bge-micro-v2',  # https://huggingface.co/spaces/mteb/leaderboard
    verbose     = True,
)

retrieval_data = engine.retrieve('what inventions did he do')

for node_with_score in retrieval_data:
    node     = node_with_score.node
    score    = node_with_score.score
    text     = node.text
    metadata = node.metadata

    page_label = metadata['page_label']
    file_name  = metadata['file_name']

    print(f"Text: {text}")
    print(f"Score: {score}")
    print(f"Page Label: {page_label}")
    print(f"File Name: {file_name}")
    print("---")
```
You can also get a ready-to-go prompt for the language model using the `retrieve_for_llm` method:
```python
retrieval_data = engine.retrieve_for_llm('what inventions did he do')
print(retrieval_data)
```
The prompt template used by `retrieve_for_llm` is as follows:
```python
prompt = (f'Context information is below.\n'
          + '---------------------\n'
          + f'{context_batches}\n'
          + '---------------------\n'
          + 'Given the context information and not prior knowledge, answer the query.\n'
          + f'Query: {query_str}\n'
          + 'Answer: ')
```
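Since `retrieve_for_llm` already returns a fully formatted prompt, one way to tie retrieval and generation together is to pass that prompt as the user message. The sketch below simply combines the two APIs shown above; wiring them together this way is an assumption rather than documented G4L behaviour.

```python
from g4l.local import LocalEngine, DocumentRetriever

retriever = DocumentRetriever(
    files       = ['einstein-albert.txt'],
    embed_model = 'SmartComponents/bge-micro-v2',
)

engine = LocalEngine(gpu_layers=-1, cores=0)

# Build the context-augmented prompt and feed it to the model as a user message.
prompt = retriever.retrieve_for_llm('what inventions did he do')

response = engine.chat.completions.create(
    model    = 'mistral-7b-instruct',
    messages = [{"role": "user", "content": prompt}],
    stream   = True
)

for token in response:
    print(token.choices[0].delta.content or "", end="", flush=True)
```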
G4L provides several configuration options to customize the behavior of the `LocalEngine`. Here are some of the available options:

- `gpu_layers`: The number of layers to offload to the GPU. Use `-1` to offload all layers.
- `cores`: The number of CPU cores to use. Use `0` to use all available cores.
- `use_mmap`: Whether to use memory mapping for faster model loading. Default is `True`.
- `use_mlock`: Whether to lock the model in memory to prevent swapping. Default is `False`.
- `offload_kqv`: Whether to offload key, query, and value tensors to the GPU. Default is `True`.
- `context_window`: The maximum context window size. Default is `4900`.
You can pass these options when creating an instance of `LocalEngine`:
```python
engine = LocalEngine(
    gpu_layers     = -1,
    cores          = 0,
    use_mmap       = True,
    use_mlock      = False,
    offload_kqv    = True,
    context_window = 4900
)
```
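For machines near the 8 GB minimum, a more conservative configuration can help. The values below are illustrative assumptions (no GPU offloading and a smaller context window), not recommended defaults from the project.

```python
from g4l.local import LocalEngine

# Conservative settings for low-memory machines (illustrative values):
# keep everything on the CPU and shrink the context window to reduce
# the size of the KV cache.
engine = LocalEngine(
    gpu_layers     = 0,      # no GPU offloading
    cores          = 0,      # still use all CPU cores
    use_mmap       = True,   # memory-map weights instead of loading them eagerly
    use_mlock      = False,  # allow the OS to swap if it must
    offload_kqv    = False,  # keep the KV cache in system RAM
    context_window = 2048    # smaller context -> smaller KV cache
)
```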
Benchmark run on a 2022 MacBook Air M2 with 8 GB of RAM.

- PC: MacBook Air M2
- CPU/GPU: M2 chip
- Cores: all (8)
- GPU layers: all
- GPU offload: 100%

On battery (no power connected):

- Model: mistral-7b-instruct-v2
- Number of iterations: 5
- Average loading time: 1.85 s
- Average total tokens: 48.20
- Average total time: 5.34 s
- Average speed: 9.02 t/s

Plugged in (power connected):

- Model: mistral-7b-instruct-v2
- Number of iterations: 5
- Average loading time: 1.88 s
- Average total tokens: 317
- Average total time: 17.7 s
- Average speed: 17.9 t/s
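If you want to reproduce numbers like these on your own machine, a simple timing loop around the streaming API shown earlier is enough. The sketch below is not the project's benchmarking script: it counts streamed chunks as tokens, which is an approximation, and the model name and prompt are just examples.

```python
import time
from g4l.local import LocalEngine

engine = LocalEngine(gpu_layers=-1, cores=0)

iterations = 5
speeds = []

for _ in range(iterations):
    start = time.perf_counter()
    tokens = 0

    response = engine.chat.completions.create(
        model    = 'mistral-7b-instruct',  # example model name
        messages = [{"role": "user", "content": "Explain relativity in one paragraph."}],
        stream   = True
    )
    for chunk in response:
        if chunk.choices[0].delta.content:
            tokens += 1  # approximation: one streamed chunk ~= one token

    elapsed = time.perf_counter() - start
    speeds.append(tokens / elapsed)

print(f"Average speed: {sum(speeds) / len(speeds):.2f} t/s over {iterations} runs")
```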
- I have built G4L so that you can use language models through a familiar interface with a quick installation, while preserving maximum performance.
- Using the direct Python bindings, I was able to max out the performance by using 100% GPU, CPU, and RAM.
- I tried different third-party packages that wrap `llama.cpp`, like LM Studio, which still had great performance, but in my case reached ~7.83 tokens/s, in contrast to 9.02 t/s with the native llama.cpp Python bindings.