[NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
A local and uncensored AI entity.
An LLM RAG application with cross-encoder re-ranking for YouTube videos 🎥 (see the re-ranking sketch after this list)
Use your open-source local model from the terminal
A local AI search assistant (web or CLI) for Ollama and llama.cpp. Lightweight and easy to run, providing a Perplexity-like experience.
A lightweight Python tool that uses Optuna to tune llama.cpp flags toward optimal tok/s on your machine (see the Optuna sketch after this list)
A powerful shell driven by a locally running LLM (ideally Llama 3.x or Qwen 2.5)
Run your own local Large Language Model (LLM) chatbot through a local API server such as Ollama (see the chat-loop sketch after this list).
A Thunderbird mail client extension that summarizes received emails via a locally run LLM. In early development.
A narrative/roleplay engine with TCOD levels driven by unreliable narrators, both in the literal and the literary sense. Currently hooks into LM Studio and Gemini for responses, and allows overriding tasks to player control.
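
For the cross-encoder re-ranking entry above, here is a minimal, illustrative Python sketch. It assumes the sentence-transformers package; the model name, query, and candidate chunks are placeholders chosen for the example, not taken from the project itself.

    # Minimal cross-encoder re-ranking sketch (assumes the sentence-transformers
    # package; the model name and candidate chunks below are illustrative).
    from sentence_transformers import CrossEncoder

    query = "How does the video explain KV cache quantization?"
    candidates = [  # chunks from a first-stage retriever (e.g. BM25 or embeddings)
        "The speaker introduces per-channel quantization of keys...",
        "Sponsors of this episode include...",
        "Quantizing the KV cache reduces memory per token...",
    ]

    # A cross-encoder scores each (query, chunk) pair jointly, which is slower
    # than comparing precomputed embeddings but usually more accurate, so it is
    # typically applied only to the top handful of first-stage results.
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = reranker.predict([(query, c) for c in candidates])

    # Keep the highest-scoring chunks for the LLM prompt.
    ranked = sorted(zip(scores, candidates), reverse=True)
    for score, chunk in ranked[:2]:
        print(f"{score:.3f}  {chunk}")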
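For the Optuna flag-tuning entry, a rough sketch under stated assumptions: a llama-cli binary and a GGUF model on disk, and an illustrative flag set (-t threads, -b batch size). Timing the whole process includes model load time, so a real tool would parse llama.cpp's own timing output instead.

    # Minimal Optuna sketch for tuning llama.cpp flags toward tokens/second.
    # Assumptions: paths to a llama-cli binary and a GGUF model on your machine;
    # the flags tried here are illustrative, not the project's actual search space.
    import subprocess
    import time

    import optuna

    LLAMA_CLI = "./llama-cli"   # adjust to your build
    MODEL = "model.gguf"        # adjust to your model
    N_TOKENS = 64

    def objective(trial):
        threads = trial.suggest_int("threads", 1, 16)
        batch = trial.suggest_categorical("batch", [128, 256, 512, 1024])
        start = time.perf_counter()
        subprocess.run(
            [LLAMA_CLI, "-m", MODEL, "-p", "Hello", "-n", str(N_TOKENS),
             "-t", str(threads), "-b", str(batch)],
            check=True, capture_output=True,
        )
        elapsed = time.perf_counter() - start
        # Rough tok/s proxy: includes model load, so it understates true speed.
        return N_TOKENS / elapsed

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=20)
    print(study.best_params, study.best_value)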
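And for the Ollama-backed chatbot entry, a minimal chat loop against Ollama's HTTP API, assuming the server is running on its default port 11434 and a model such as llama3 has been pulled locally; the loop structure is a sketch, not the project's actual code.

    # Minimal local chatbot loop against Ollama's HTTP chat API (assumes Ollama
    # is running on the default port 11434 and llama3 has been pulled).
    import requests

    OLLAMA_URL = "http://localhost:11434/api/chat"
    history = []

    while True:
        user = input("you> ")
        if user in ("exit", "quit"):
            break
        history.append({"role": "user", "content": user})
        resp = requests.post(OLLAMA_URL, json={
            "model": "llama3",      # any locally pulled model tag
            "messages": history,    # full history gives the model context
            "stream": False,        # one JSON response instead of a stream
        })
        resp.raise_for_status()
        reply = resp.json()["message"]["content"]
        history.append({"role": "assistant", "content": reply})
        print("bot>", reply)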