
Open LLMs: Concepts, Parameters, Ecosystem, and Usage

· 17 min read
Anand Raja
Senior Software Engineer

What is an Open LLM?

An Open LLM (Large Language Model) is a language model whose architecture, weights, and often training code are openly available for anyone to use, modify, and run. Unlike proprietary LLMs (like OpenAI’s GPT-4 or Google Gemini), open LLMs can be run locally, fine-tuned, and integrated into custom workflows without vendor lock-in or sending your data to third-party servers.


Explanation:

  • Training Data → Preprocessing → LLM Model → Training → Trained Weights → Inference

Key Parameters & Concepts

GGUF (GPT-Generated Unified Format)

A modern, efficient file format for storing quantized LLM weights and metadata. GGUF is designed for compatibility with inference engines like llama.cpp, LM Studio, and Ollama, making it easy to run models on various hardware.

Note: GPT - Generative Pre-trained Transformer

Temperature

In LLMs, Temperature is a parameter that controls the randomness or creativity of the model’s output during text generation.

  • Low temperature (e.g., 0.2):

    • Makes the model more deterministic and focused.
    • Boosts the probability of the most likely tokens (e.g., a candidate with 45% probability might be boosted to 90%, while a less likely one drops from 12% to 3%).
    • Output is more predictable and repetitive.
  • High temperature (e.g., 1.0):

    • Makes the model more creative and varied.
    • Flattens the probability distribution (e.g., 45% and 12% might both become closer to 20%).
    • Output is more diverse, but can be less coherent.
  • In summary:

    • Temperature adjusts how much the model explores less likely options.
    • Lower values = safer, more predictable text.
    • Higher values = more surprising, creative, or diverse text.
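
The effect is easy to see with a few lines of Python. A minimal sketch of temperature-scaled softmax over raw logits (the logit values are made up for illustration):

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw logits to probabilities, scaled by temperature."""
    scaled = np.array(logits) / temperature
    exp = np.exp(scaled - np.max(scaled))  # subtract max for numerical stability
    return exp / exp.sum()

logits = [4.0, 2.7, 1.5, 0.3]  # made-up scores for four candidate tokens

print(softmax_with_temperature(logits, temperature=0.2))  # sharply peaked: near-deterministic
print(softmax_with_temperature(logits, temperature=1.0))  # flatter: more diverse sampling
```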

top_k

In LLMs, top_k is a parameter that limits the next-token sampling to the top K most likely tokens.

  • It restricts the number of candidate tokens considered at each generation step, discarding all candidates outside the top K.

  • Lower values (e.g., k=1) make the output more deterministic (always pick the most likely token).

  • Higher values allow more diversity, as the model can choose from a wider set of possible tokens.

  • For example, if k=5, only the 5 most likely tokens are considered for the next word; if k=1, only the single most likely token is chosen.

  • In summary:

    • top_k controls diversity and randomness in generated text by limiting how many token candidates are sampled from at each step.
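
A minimal sketch of top-k filtering over a toy probability distribution (the numbers are illustrative):

```python
import numpy as np

def top_k_filter(probs, k):
    """Keep only the k most likely tokens, then renormalize."""
    probs = np.array(probs, dtype=float)
    top_indices = np.argsort(probs)[::-1][:k]  # indices of the k highest probabilities
    filtered = np.zeros_like(probs)
    filtered[top_indices] = probs[top_indices]
    return filtered / filtered.sum()

probs = [0.45, 0.21, 0.12, 0.08, 0.08, 0.06]  # toy distribution over six tokens
print(top_k_filter(probs, k=3))  # only the three best candidates keep non-zero probability
print(top_k_filter(probs, k=1))  # greedy: the single best candidate gets probability 1.0
```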

top_p

In LLMs, top_p (also called nucleus sampling) is a parameter that controls how many candidate tokens are considered for each next word, based on their combined probabilities.

  • The model sorts all possible next tokens by probability.

  • It then selects the smallest set of tokens whose cumulative probability exceeds p (e.g., 0.9).

  • Only these tokens are considered for sampling; the rest are ignored.

  • How it works:

    • If top_p = 0.5: The smallest set of top-ranked tokens whose combined probability exceeds 50% is kept. For example, if "blue" (45%) and "visible" (21%) are the two most likely tokens, both are included (45% + 21% = 66% > 50%) and everything below them is discarded.
    • If top_p = 0.9: More tokens are included, up to the point where their combined probability exceeds 90%.
  • Summary:

    • Lower top_p: Output is more focused and deterministic.
    • Higher top_p: Output is more diverse and creative.
    • top_p balances randomness and coherence by limiting candidates based on their combined probabilities.
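
A sketch of nucleus (top_p) filtering over the same kind of toy distribution:

```python
import numpy as np

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability exceeds p."""
    probs = np.array(probs, dtype=float)
    order = np.argsort(probs)[::-1]              # token indices sorted by probability, descending
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # how many tokens are needed to pass the threshold
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

probs = [0.45, 0.21, 0.12, 0.08, 0.08, 0.06]
print(top_p_filter(probs, p=0.5))  # keeps the top 2 tokens (0.45 + 0.21 = 0.66 > 0.5)
print(top_p_filter(probs, p=0.9))  # keeps 5 tokens before the 90% threshold is crossed
```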

min_p

In LLMs, min_p is a parameter that sets a minimum probability threshold for token selection, filtering out unlikely tokens.

  • It discards candidate tokens whose probability is below the specified threshold.

  • In simple words: It ensures only tokens with a probability above min_p are considered for generation.

  • Example:

    • If min_p = 0.05, tokens with less than 5% probability are ignored.
    • If min_p = 0.5, only tokens with at least 50% probability are considered.
  • Summary:

    • min_p helps control output quality by removing very unlikely token candidates, making the output more focused and less random.
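
A sketch of the simple thresholding described above. Note that some implementations (llama.cpp, for example) apply min_p relative to the probability of the single most likely token rather than as an absolute cutoff; this toy version uses the absolute interpretation:

```python
import numpy as np

def min_p_filter(probs, min_p):
    """Drop tokens whose probability falls below min_p, then renormalize the rest."""
    probs = np.array(probs, dtype=float)
    filtered = np.where(probs >= min_p, probs, 0.0)
    return filtered / filtered.sum()

probs = [0.45, 0.21, 0.12, 0.08, 0.08, 0.06]
print(min_p_filter(probs, min_p=0.10))  # drops the three weakest candidates (8%, 8%, 6%)
print(min_p_filter(probs, min_p=0.30))  # only the strongest candidate (45%) survives
```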

repeat_penalty

In LLMs, repeat_penalty is a parameter that penalizes repeated tokens to reduce repetition in generated text.

  • When generating text, the model may sometimes repeat words or phrases.

  • The repeat_penalty parameter lowers the probability of tokens that have already appeared in the generated sequence.

  • Higher values (e.g., 1.2) make the model less likely to repeat itself.

  • A value of 1.0 applies no penalty, so repetition is left unaffected.

  • In summary:

    • repeat_penalty helps make the output more natural and less repetitive by discouraging the model from generating the same tokens multiple times.
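
One common scheme divides the positive logits of already-seen tokens by the penalty and multiplies the negative ones, pushing repeats down before sampling. A simplified sketch of that idea:

```python
import numpy as np

def apply_repeat_penalty(logits, generated_token_ids, penalty=1.2):
    """Penalize tokens that already appeared in the output (one common scheme)."""
    logits = np.array(logits, dtype=float)
    for token_id in set(generated_token_ids):
        if logits[token_id] > 0:
            logits[token_id] /= penalty  # shrink positive logits
        else:
            logits[token_id] *= penalty  # push negative logits further down
    return logits

logits = [2.0, 1.5, -0.5, 0.8]  # toy logits for a four-token vocabulary
already_generated = [0, 2]      # token ids 0 and 2 appeared earlier in the output
print(apply_repeat_penalty(logits, already_generated, penalty=1.2))
# token 0: 2.0 / 1.2 ≈ 1.67, token 2: -0.5 * 1.2 = -0.6; tokens 1 and 3 are unchanged
```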

Model size (params)

In LLMs, Model size refers to the total number of parameters (weights) in the model, often expressed in billions (e.g., 7B, 13B, 70B).

  • Larger models (more parameters) are generally more capable, can understand and generate more complex text, and perform better on a wider range of tasks.

  • Smaller models require less memory (RAM/VRAM), are faster to run, and can be used on less powerful hardware, but may be less accurate or creative.

  • Example:

    • A 7B model has 7 billion parameters.
    • A 70B model has 70 billion parameters and requires much more memory and compute.
  • In summary:

    • Model size is a key factor in the capability and resource requirements of an LLM.
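
The memory footprint of the weights follows from simple arithmetic: parameter count times bytes per parameter (this sketch ignores the extra memory needed for activations and the context cache):

```python
def approx_weight_memory_gb(params_billions, bytes_per_param):
    """Rough lower bound: memory needed just to hold the weights."""
    return params_billions * bytes_per_param  # billions of params * bytes/param = gigabytes

print(f"7B  @ float16 (2 bytes/param): ~{approx_weight_memory_gb(7, 2):.0f} GB")   # ~14 GB
print(f"70B @ float16 (2 bytes/param): ~{approx_weight_memory_gb(70, 2):.0f} GB")  # ~140 GB
```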

Weights

In LLMs, Weights are the learned values (parameters) in the neural network that define the model’s behavior.

  • During training, the model adjusts these weights to minimize prediction errors.

  • Each weight represents the strength of a connection between neurons in the network.

  • In open LLMs, these weights are provided for download and use, allowing you to run or fine-tune the model locally.

  • The total number of weights (parameters) determines the model’s size (e.g., 7B, 13B, 70B).

  • In summary:

    • Weights are the core data that encode all the knowledge and capabilities of an LLM.

Inference

In LLMs, Inference is the process of using a trained language model to generate text or predictions, as opposed to training the model.

  • During inference, you provide an input prompt to the model.

  • The model processes the prompt using its learned weights (parameters) to predict and generate the next tokens (words, subwords, or characters).

  • The output tokens are then converted back to human-readable text.

  • In summary:

    • Inference is how you interact with an LLM to get answers, completions, or other generated text, using the knowledge the model acquired during training.
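
As a concrete illustration, running inference against a local GGUF model with the llama-cpp-python bindings might look roughly like this (a sketch: the model path is a placeholder and exact options can vary between versions):

```python
# pip install llama-cpp-python
from llama_cpp import Llama

# The path is a placeholder -- point it at any GGUF file you have downloaded.
llm = Llama(model_path="./models/llama-3-8b-instruct.Q4_K_M.gguf", n_ctx=2048)

output = llm(
    "Explain what an open LLM is in one sentence.",
    max_tokens=128,
    temperature=0.7,
)
print(output["choices"][0]["text"])  # the generated completion
```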

Token & Tokenizer

In LLMs, Tokens are the basic units of text that the model processes. A token can be a word, part of a word (subword), or even a single character, depending on the tokenizer used.

  • Token: A chunk of text (word, subword, or character) processed by the model.

  • Tokenizer: A tool that converts text into tokens and back. It breaks down input text into tokens before feeding it to the model and reconstructs human-readable text from output tokens.

  • The number of tokens affects model input/output limits and cost (for API-based models).

  • Example:

    • The word "fantastic" might be a single token or split into "fan", "tas", "tic" depending on the tokenizer.
  • Try it:
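
A quick way to see tokenization in action is the Hugging Face transformers tokenizer (a sketch; "gpt2" is just one tokenizer you could load):

```python
# pip install transformers
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any tokenizer hosted on Hugging Face works

token_ids = tokenizer.encode("Open LLMs are fantastic")
print(token_ids)                                   # integer token IDs
print(tokenizer.convert_ids_to_tokens(token_ids))  # the corresponding token strings
print(tokenizer.decode(token_ids))                 # back to human-readable text
```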


Open LLM vs Proprietary LLM

| Aspect | Open LLMs | Proprietary LLMs |
|---|---|---|
| Access | Open weights & code | Closed, API-only |
| Customization | Can fine-tune & run locally | Limited or no customization |
| Cost | Free or lower cost | Pay-per-use/API fees |
| Privacy | Data stays local | Data sent to vendor servers |
| Examples | Llama, Gemma, DeepSeek, Qwen | GPT-4, GPT-4o, Gemini, Claude |

  • Llama (Meta)
    • The double "L" at the beginning is pronounced like a "Y" in Spanish, so it sounds like YAH-mah (not "LAH-mah").
    • In English, some people say LAH-mah but the correct Spanish pronunciation is "YAH-mah."
    • LLaMA Model Family:
      • 🦙 LLaMA (Large Language Model Meta AI)
      • 👁️ LLaVA (Large Language and Vision Assistant)
      • 💻 CodeLlama (Code-Specialized LLaMA)
  • Gemma (Google)
    • The "G" is pronounced like the English "J" (as in "gem" or "general").
    • The emphasis is on the first syllable: JEM-uh.
  • DeepSeek
  • Qwen (Alibaba)
  • Mistral
  • Phi (Microsoft)

Where to Find and Run Open LLMs

  • Hugging Face – The largest repository of open LLMs.
  • TheBloke – Community quantized models for easy local inference.
  • Open LLM Leaderboard – Compare open LLMs by benchmarks.
  • Groq – High-speed inference for open models via API.
  • Open WebUI – Self-hosted web interface for running and chatting with local models (works with Ollama and others).

Running Open LLMs Locally

  • LM Studio: GUI for running, managing, and chatting with local LLMs (supports GGUF).
    • Note: LM Studio uses llama.cpp under the hood for efficient inference.
  • Ollama: Simple CLI and API for running LLMs locally with one command.
    • Note: Ollama also uses llama.cpp behind the scenes.
  • llama.cpp: Fast CPU/GPU inference for GGUF models.
tip

Thanks to quantization (reducing model precision), many open LLMs can run efficiently on consumer hardware (even laptops and some phones).

You can choose model size and quantization level to match your hardware profile.


Running LLMs For Inference

Inference is the process of using an LLM (open or proprietary) to generate output from a prompt.

Explanation:

  • In simple words: Input Prompt -> Trained Model -> Output Response
  • User provides input (prompt/query).
  • Hosting system with GPU loads the trained model entirely into GPU VRAM.
  • The model is executed by the GPU for inference.
  • The output response is generated and returned to the user.
  • The diagram emphasizes that the model must be fully loaded into VRAM and executed by the GPU for efficient inference.
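
If a local Ollama server is running, this whole flow can be driven through its HTTP API. A rough sketch (it assumes the model has already been pulled with `ollama pull`):

```python
# Assumes `ollama serve` is running locally and a model such as llama3.2 has been pulled.
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2",
        "prompt": "Why does inference need the model loaded into (V)RAM?",
        "stream": False,
        "options": {"temperature": 0.7, "top_p": 0.9, "top_k": 40, "repeat_penalty": 1.1},
    },
    timeout=120,
)
print(response.json()["response"])  # the generated text
```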

System Requirements: VRAM, RAM, and Why RAM is Used

Explanation:

  • Input Prompt: The prompt is broken down into tokens (token IDs).
  • Trained Model: Each connection between neurons has a weight (parameter) stored as float32/float16, and every weight consumes memory (e.g., a 2B model uses ~4–8 GB).
  • Output Response: Output tokens are converted back to human-readable text.

So, will it run on my system?

Explanation:

  • The Original Trained Model uses float16/32 parameters and requires a lot of memory.
  • The Quantized Model transforms these parameters to Int4/8, drastically reducing memory usage.

Understanding Quantization

Quantization is a technique used to reduce the memory footprint and sometimes increase the inference speed of large language models (LLMs).

  • What is quantization?
    Quantization compresses the original model by converting its parameters (weights) from high-precision formats (like float32 or float16) to lower-precision formats (like int8, int4, etc.).

  • Why quantize?

    • Lower memory usage: Quantized models require much less RAM/VRAM, making it possible to run large models on consumer hardware.
    • Faster inference: Lower-precision arithmetic can be computed faster on modern hardware, sometimes speeding up inference.
    • Minimal accuracy loss: Well-designed quantization preserves most of the model’s accuracy while drastically reducing resource requirements.
  • How does it work?

    • The original weights (e.g., float32, 4 bytes each) are mapped to a smaller set of values (e.g., int4, 0.5 bytes each).
    • For example, a 7B parameter model in float16 might require ~14 GB, but in int4 quantization, it could fit in ~3.5 GB.
  • You don’t need to quantize yourself:
    Most open LLMs shared on Hugging Face, LM Studio, or Ollama are already available in quantized formats (like GGUF).

  • In summary

    • To make LLMs consume less (V)RAM, and potentially run faster, the original models are typically quantized (compressed).
    • Quantization makes it practical to run powerful LLMs on everyday hardware by shrinking model size and memory requirements, with little impact on output quality.
info

You don’t need to perform this quantization yourself! Models shared on Hugging Face, especially when usable via LM Studio or Ollama, are already available as quantized versions.
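
The core idea can be sketched in a few lines of numpy. This is a toy absmax int8 quantization, far simpler than the grouped/k-quant schemes used by real GGUF files, but it shows the principle:

```python
import numpy as np

weights = np.random.randn(5).astype(np.float32)  # pretend these are model weights (4 bytes each)

# Absmax quantization: map the float range onto the int8 range [-127, 127].
scale = np.abs(weights).max() / 127
quantized = np.round(weights / scale).astype(np.int8)  # 1 byte per weight instead of 4
dequantized = quantized.astype(np.float32) * scale     # approximate reconstruction at inference

print(weights)
print(quantized)
print(dequantized)  # close to the originals, at a quarter of the memory
```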

  • VRAM (Video RAM):
    If running on a GPU, VRAM is used to store the model and process data quickly. Larger models and higher batch sizes require more VRAM.

  • RAM (System Memory):
    If running on CPU, RAM is used to load the model and perform inference. Quantized models (e.g., GGUF) reduce RAM requirements, making it possible to run even large models on consumer hardware.

  • Why RAM over ROM:
    RAM is fast, temporary working memory used for active computation and data processing. ROM and other permanent storage (such as an SSD) only hold the model file; they are far too slow to serve the model in real time, so the weights must first be loaded into RAM (or VRAM).

  • Typical Requirements:

    • Small models (3B–7B): 4–8 GB RAM/VRAM
    • Medium models (13B): 8–16 GB RAM/VRAM
    • Large models (30B+): 16–32 GB RAM/VRAM or more
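
These ranges follow from the same arithmetic as before. A small helper makes the "will it fit?" check explicit (a sketch; the 20% overhead factor for context and activations is an assumption):

```python
def fits_in_memory(params_billions, bits_per_param, available_gb, overhead=1.2):
    """Estimate whether a model's weights (plus ~20% overhead) fit in available memory."""
    needed_gb = params_billions * bits_per_param / 8 * overhead
    return needed_gb, needed_gb <= available_gb

for bits in (16, 8, 4):  # float16, int8, int4
    needed, ok = fits_in_memory(13, bits, available_gb=16)
    print(f"13B @ {bits}-bit: ~{needed:.1f} GB -> {'fits' if ok else 'does not fit'} in 16 GB")
```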

Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) is a technique that enhances AI chatbots by allowing them to access and integrate external knowledge into their responses. This is achieved by combining the strengths of pre-trained large language models (LLMs) with information retrieval systems, enabling chatbots to answer questions more accurately and with more up-to-date information.

How RAG Works

  1. Document Preparation (Offline Process):

    • Documents are split into manageable chunks
    • Each chunk is converted into vector embeddings
    • These embeddings are stored in a vector database
  2. Query Processing (Real-time):

    • User submits a question
    • The query is converted to the same vector embedding format
    • A similarity search finds the most relevant document chunks
    • Top results are retrieved and formatted as context
  3. Augmented Generation:

    • The original query and retrieved context are combined into a prompt
    • This "knowledge-enhanced" prompt is sent to the LLM
    • The LLM generates a response grounded in the retrieved information

This approach enables the model to access knowledge beyond its training data, provide more accurate responses, and cite sources for its information.
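
A minimal end-to-end sketch of this flow, assuming the sentence-transformers package for embeddings and a plain in-memory list as the "vector database" (the model name and chunks are illustrative):

```python
# pip install sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # a small, commonly used embedding model

# 1. Document preparation (offline): split documents into chunks and embed them.
chunks = [
    "GGUF is a file format for quantized LLM weights.",
    "Quantization reduces memory usage by lowering weight precision.",
    "RAG retrieves relevant chunks and adds them to the prompt.",
]
chunk_vectors = embedder.encode(chunks, normalize_embeddings=True)

# 2. Query processing (real-time): embed the query and find the most similar chunks.
query = "How does quantization help me run models locally?"
query_vector = embedder.encode([query], normalize_embeddings=True)[0]
scores = chunk_vectors @ query_vector  # cosine similarity (vectors are normalized)
top_chunks = [chunks[i] for i in np.argsort(scores)[::-1][:2]]

# 3. Augmented generation: combine the retrieved context and the question into one prompt.
prompt = (
    "Answer using the context below.\n\n"
    "Context:\n" + "\n".join(top_chunks) +
    f"\n\nQuestion: {query}"
)
print(prompt)  # send this prompt to any local or hosted LLM
```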

Vector Embeddings vs. Weights in AI Models

Key Differences

| Aspect | Vector Embeddings | Model Weights |
|---|---|---|
| Definition | Representations of data (text, images, etc.) in vector space | Learned parameters that define a neural network's behavior |
| Purpose | Represent semantic meaning of content | Transform inputs through the neural network |
| Creation | Generated by applying a model to input data | Learned during model training through backpropagation |
| Usage | Used for similarity matching, clustering, search | Used for computation within the model itself |
| Location | External to the model (e.g., stored in vector databases) | Internal to the model architecture |
| Example | Word "king" → [0.2, -0.5, 0.8, ...] | Connection strength between neurons in layers 1 and 2 |

In the Context of LLMs

  • Weights (Parameters)

    • What they are: The learned values that define the LLM itself
    • How many: Typically billions (7B, 13B, 70B, etc.)
    • Storage: Stored in model files (e.g., GGUF format)
    • Memory usage: A major factor in RAM/VRAM requirements
    • Can be: Quantized to reduce size (e.g., from float16 to int4)
    • Example: The weights in Llama 3 define how it processes and generates text
  • Vector Embeddings

    • What they are: Numerical representations of text/data
    • How created: Generated by encoding models (which use weights)
    • Storage: Often stored in vector databases
    • Purpose in RAG: Allow semantic similarity search of content
    • Size: Typically hundreds or thousands of dimensions
    • Example: Converting a document chunk to a vector for retrieval

In summary:
Open LLMs empower you to run, fine-tune, and experiment with powerful language models locally or in the cloud, with full control over privacy, cost, and customization. Thanks to quantization and tools like LM Studio and Ollama (which use llama.cpp), hardware requirements are manageable for many users.