A Complete Beginners guide to localllm

Read Time:

A little history of AI and GPUs

Initially, GPUs were designed to accelerate graphics rendering for gaming and professional visualization. However, researchers and practitioners soon discovered that GPUs, with their highly parallel architecture, were exceptionally well-suited for the matrix and vector computations fundamental to machine learning and deep learning algorithms.

This realization marked a pivotal shift around the mid-2000s, as GPUs began to be harnessed for accelerating training times of neural networks, significantly impacting the field of artificial intelligence (AI). The adoption of GPUs for ML tasks has not only enabled the training of increasingly complex models on ever-larger datasets but has also driven advancements in GPU technology itself, tailored specifically for AI applications.

But Is investing in GPUs for ML justifiable?

On the surface, questioning the value of GPUs for machine learning might seem nonsensical. At their core, machine learning algorithms, especially those in deep learning, consist largely of arithmetic computations, with matrix multiplications being a prime example of operations that can be parallelized. Given that GPUs are engineered to excel at tasks involving parallel arithmetic computations, their compatibility with machine learning is apparent, making them a seemingly perfect pairing.

The prospect of having a high-cost GPU lying dormant during CPU-centric phases of the machine learning workflow does raise concerns about resource efficiency. While the reduction in latency offered by GPUs is attractive, the financial implications cannot be ignored.

Considering NVIDIA's data, the decision boils down to whether the benefits justify the cost, or if it's more practical to accept a sevenfold decrease in inference speed by relying solely on CPUs. This calculation hinges on specific project requirements, budget constraints, and performance targets, making it a nuanced decision rather than a clear-cut one.

Doing the math to run the llama2-70b model (that occupies 24GB). The minimum hourly cost to run on the NVDIA P100 would be $2.92 per hour. The monthly cost of which runs up to about $2100.  

Introducing localllm

localllm is a set of tools and libraries that provides easy access to quantized models from HuggingFace through a command-line utility.  It's pretty awesome for developers who want to dive into big language models (LLMs) but don't have a fancy GPU to work with. This whole setup gives you everything you need to get these LLMs running on just your CPU and whatever memory you've got, whether you're on Google Cloud Workstation or just your own computer. So, you don't need to worry about having a top-tier GPU to get into the game. It's all about making the power of LLMs accessible for everyone, wherever you're coding from.

- GPU-free LLM execution: No need for GPUs. localllm runs large language models using just CPU and memory, making it easier to add LLMs to your apps without losing out on performance.
- Boosted productivity: It's all about making things simpler. With localllm, you're working within the Google Cloud, which means less hassle with servers or external services. Focus on crafting cool apps, not on juggling hardware.
- Saves money: Forget about splurging on expensive GPUs. localllm lets you cut down costs by using CPUs instead, all within Google Cloud. This means more money in your pocket and a better investment in what matters.
- Keeps your data safe: Run LLMs on your local setup and keep your data close. This way, you dodge risks linked to sending data out or relying on outsiders, making your data security tight.
- Works smoothly with Google Cloud: localllm fits right into the Google Cloud world. From storage to machine learning APIs, it taps into Google Cloud's powers, letting you do more with less fuss.

Getting Started

First things first lets start by cloning localllm on github, installing the requirements and running TheBloke/Llama-2-13B-Ensemble-v5-GGUF. This will create an inference server that is working on port 8000.

git clone https://github.com/googlecloudplatform/localllm
cd localllm
pip3 install openai
pip3 install ./llm-tool/.
llm run TheBloke/Llama-2-13B-Ensemble-v5-GGUF 8000
Sure! Here's a haiku about cats:

Furry, playful friends
Purring, cuddling, sleeping tight
Cat naps in sunbeams 

If you're using models from TheBloke and you don't specify a filename, we can use the model with 4 bit medium quantization, or specify a filename explicitly. Here’s a list of more commands and helpful modules you could use for your workflow.

- List downloaded models
   - Use llm list to display all models that have been downloaded to your local machine. This command helps you keep track of available models without the need to manually search through your directories.
- List running models
   - Execute llm ps to see a list of currently running models. This is particularly useful for monitoring which models are active and managing system resources effectively.
- Start serving models
   - Start serving the default model from the repo. Download if not present.
       - Run llm run TheBloke/Llama-2-13B-Ensemble-v5-GGUF 8000 to serve the default model from its repository. If the model isn't already downloaded, this command will automatically retrieve it, ensuring you're working with the latest version.
   - Start serving a specific model. Download if not present.
       - Use llm run TheBloke/Llama-2-13B-Ensemble-v5-GGUF --filename llama-2-13b-ensemble-v5.Q4_K_S.gguf 8000 to initiate serving a specific model version. This is ideal when you need to utilize a particular model variant or configuration for your tasks.
- Stop serving models
   - Stop serving all models from the repo.
       - The command llm kill TheBloke/Llama-2-13B-Ensemble-v5-GGUF halts all currently running instances of models from the specified repository, freeing up resources or allowing you to switch to different models as needed.
   - Stop serving a specific model.
       - With llm kill TheBloke/Llama-2-13B-Ensemble-v5-GGUF --filename llama-2-13b-ensemble-v5.Q4_K_S.gguf, you can stop serving a specific model identified by its filename. This targeted approach helps manage specific models without affecting others.
- Download models
   - Download the default model from the repo.
       - llm pull TheBloke/Llama-2-13B-Ensemble-v5-GGUF will download the default model from its repository, ensuring you have the necessary files to start serving the model locally.
   - Download a specific model from the repo.
       - Use llm pull TheBloke/Llama-2-13B-Ensemble-v5-GGUF --filename llama-2-13b-ensemble-v5.Q4_K_S.gguf to download a specific model. This option is useful when you need a particular configuration or version of a model for your application.
- Remove models
   - Remove all models downloaded from the repo.
       - Executing llm rm TheBloke/Llama-2-13B-Ensemble-v5-GGUF deletes all models that have been downloaded from the specified repository from your local storage. This is useful for clearing space or removing models that are no longer needed.
   - Remove a specific model from the repo.
       - The command llm rm TheBloke/Llama-2-13B-Ensemble-v5-GGUF --filename llama-2-13b-ensemble-v5.Q4_K_S.gguf allows you to remove a specific model file, giving you the ability to manage your local model inventory more precisely.

These commands and their explanations provide a comprehensive guide to managing models with the `llm` tool, from listing and serving to downloading and removing them as needed.

Hosting on the Google Cloud Platform

To host a quantized LLM in a Google Cloud environment you can follow these steps. Ensure that you have a [GCP Project](https://cloud.google.com/docs/get-started) and have the `gcloud` CLI installed.

Once installed, go ahead and run

gcloud auth login

this will login your account onto the gcloud CLI and you should see something similar to this

Go ahead and copy this copybuild.yaml file  and keep it a new directory inside your pwd. Let’s call this directory localllm.

mkdir localllm
cd localllm
curl https://raw.githubusercontent.com/GoogleCloudPlatform/localllm/main/cloudbuild.yaml

Now we need to set up our environment variables.

Set the `PROJECT_ID` and `PROJECT_NUM` environment variables from your GCP project. You must modify the values.

export PROJECT_ID=<project-id>
export PROJECT_NUM=<project-num>

Set other needed environment variables. You can modify the values.

export REGION=us-central1
export LOCALLLM_REGISTRY=localllm-registry
export LOCALLLM_IMAGE_NAME=localllm
export LOCALLLM_CLUSTER=localllm-cluster
export LOCALLLM_WORKSTATION=localllm-workstation
export LOCALLLM_PORT=8000

Set the default project.

gcloud config set project $PROJECT_ID

Enable needed services.

gcloud services enable \
 cloudbuild.googleapis.com \
 workstations.googleapis.com \
 container.googleapis.com \
 containeranalysis.googleapis.com \
 containerscanning.googleapis.com \

Create an Artifact Registry repository for docker images.

gcloud artifacts repositories create $LOCALLLM_REGISTRY \
 --location=$REGION \

Build and push the image to Artifact Registry using Cloud Build. Details are in [cloudbuild.yaml](https://github.com/GoogleCloudPlatform/localllm/blob/main/cloudbuild.yaml).

gcloud builds submit . \

Configure a Cloud Workstation cluster.

gcloud workstations clusters create $LOCALLLM_CLUSTER \

Create a Cloud Workstation configuration. We suggest using a machine type of e2-standard-32 which has 32 vCPU, 16
core and 128 GB memory.

gcloud workstations configs create $LOCALLLM_WORKSTATION \
--region=$REGION \
--machine-type=e2-standard-32 \

Create a Cloud Workstation.

gcloud workstations create $LOCALLLM_WORKSTATION \

Grant access to the default Cloud Workstation service account.

gcloud artifacts repositories add-iam-policy-binding $LOCALLLM_REGISTRY \
 --location=$REGION \
 --member=serviceAccount:service-$PROJECT_NUM@gcp-sa-workstationsvm.iam.gserviceaccount.com \

Start the workstation.

gcloud workstations start $LOCALLLM_WORKSTATION \
 --cluster=$LOCALLLM_CLUSTER \

Connect to the workstation using ssh. Alternatively, you can connect to the workstation interactively  in the browser.

gcloud workstations ssh $LOCALLLM_WORKSTATION \
 --cluster=$LOCALLLM_CLUSTER \

Start serving the default model from the repo.

llm run TheBloke/Llama-2-13B-Ensemble-v5-GGUF $LOCALLLM_PORT

Get the hostname of the workstation using:

gcloud workstations describe $LOCALLLM_WORKSTATION \
 --cluster=$LOCALLLM_CLUSTER \

- Interact with the model by visiting the live OpenAPI documentation page: `https://$LOCALLLM_PORT-$LLM_HOSTNAME/docs`.

So, what exactly happened here? Why can we run these models without a GPU???

What is Quantization in LLMs?

Quantization is a technique used to reduce the precision of the numbers in a dataset, to decrease the computational complexity and the size of machine learning models, thereby making them more efficient for deployment, especially on devices with limited resources.

Let’s take an example of using 8-bit medium quantization. This process involves converting the 32-bit floating-point numbers, which are commonly used in the weights and activations of neural networks, down to just 8 bits.

This dramatic reduction in bit depth means that each number can only represent one of 32 different values, compared to the 4 billion+ values that a 32-bit number can represent. While this may seem like it would significantly degrade the model's performance, sophisticated algorithms are used to ensure that the most important information is retained.

This is achieved by strategically mapping the 32-bit floating point values to this much smaller set of 8-bit representations, often through a combination of clustering similar values together and trimming outliers. As a result, 8-bit medium quantization provides a balance between model efficiency and performance, enabling faster computation and lower memory usage without drastically sacrificing accuracy, making it a valuable tool for deploying models on edge devices or in environments where computational resources are constrained


The localllm approach aims to overcome the limitations of relying on remote servers or cloud-based GPU instances, addressing concerns related to latency, security, and dependency on third-party services. This article goes through getting started with localllm and hosting your first local model and generating an inference.

Need help with custom LLMs?

We specialize in AI integrations, enhancing text analysis, and ensuring AI decisions are clear. Concerned about complex AI systems or understanding AI decisions? Our CEO, Rohan, offers a free consultation to demystify how LLMs can revolutionize your business, making AI a reliable partner for your growth.

Take your AI strategy to the next level with us. Contact us for a conversation about transforming your business with LLMs.

Book an AI consultation

Looking to build AI solutions? Let's chat.

Schedule your consultation today - this not a sales call, feel free to come prepared with your technical queries.

You'll be meeting Rohan Sawant, the Founder.
Book a Call

Let us help you.

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.
Behind the Blog 👀
Srihari Unnikrishnan

Srihari is an NLP Engineer at Ionio. That's about it really.