A simplified, educational implementation of the Kimi-K2 architecture that demonstrates its key innovations over traditional transformer models.
nanoKimi showcases three revolutionary innovations from Kimi-K2:
- Mixture of Experts (MoE) - Sparse expert routing for efficient scaling
- Muon Optimizer - Specialized optimizer for MoE training
- Latent Attention - Enhanced context understanding through latent space projections
# Clone the repository
git clone https://github.com/dishant2009/nanoKimi.git
cd nanoKimi
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Install package in development mode
pip install -e .
# Train a small model for testing
python scripts/train.py --config configs/nano.yaml
# Train with custom settings
python scripts/train.py --config configs/small.yaml --max-steps 10000 --no-wandb
# Generate text with trained model
python scripts/generate.py --checkpoint ./checkpoints/nano/best_model.pt --prompt "The future of AI is"
# Interactive generation
python scripts/generate.py --checkpoint ./checkpoints/nano/best_model.pt --interactive
# Streaming generation
python scripts/generate.py --checkpoint ./checkpoints/nano/best_model.pt --stream --prompt "Once upon a time"
import torch
from nanokimi import KimiModel, KimiConfig, Generator
# Create model configuration
config = KimiConfig(
    n_layer=12,
    n_head=12,
    n_embd=768,
    vocab_size=50257,
    block_size=1024,
    moe={
        "use_moe": True,
        "num_experts": 8,
        "top_k": 2,
    },
)
# Create and use model
model = KimiModel(config)
generator = Generator(model)
# Generate text
text = generator.generate_text("The future of artificial intelligence", max_new_tokens=50)
print(text)
| Parameters | Layers | Embedding dim | Experts | Use case                   |
|------------|--------|---------------|---------|----------------------------|
| ~10M       | 6      | 384           | 4       | Development and testing    |
| ~124M      | 12     | 768           | 8       | Research and benchmarking  |
| ~350M      | 24     | 1024          | 16      | Real applications          |
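As a rough illustration, the ~10M preset can be instantiated through the same `KimiConfig` API used in the quick-start example above. The authoritative values live in `configs/nano.yaml`; the head count and block size below are assumptions made for the sketch:

```python
from nanokimi import KimiConfig, KimiModel

# Sketch of a nano-sized configuration (~10M parameters).
# n_head and block_size are assumed; see configs/nano.yaml for the real values.
nano_config = KimiConfig(
    n_layer=6,
    n_head=6,          # assumed (gives 64-dim heads with n_embd=384)
    n_embd=384,
    vocab_size=50257,  # GPT-2 tokenizer vocabulary
    block_size=256,    # assumed short context for fast iteration
    moe={
        "use_moe": True,
        "num_experts": 4,
        "top_k": 2,
    },
)

model = KimiModel(nano_config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```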
Input → Router → Top-K Experts → Weighted Combination → Output
           ↓
   Load Balancing Loss
- Sparse routing: Only top-k experts process each token
- Load balancing: Prevents expert collapse during training
- Efficiency: Better parameter utilization than dense models
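A minimal sketch of the routing step in plain PyTorch, assuming a Switch-Transformer-style auxiliary loss; this illustrates the idea rather than reproducing nanoKimi's exact MoE module:

```python
import torch
import torch.nn.functional as F

def top_k_routing(x, router_weights, num_experts=8, top_k=2):
    """Route each token to its top-k experts and return routing info.

    x:              (batch, seq, d_model) token activations
    router_weights: (d_model, num_experts) learned router projection
    """
    logits = x @ router_weights                       # (batch, seq, num_experts)
    probs = F.softmax(logits, dim=-1)
    topk_probs, topk_idx = probs.topk(top_k, dim=-1)  # keep only the top-k experts per token

    # Load-balancing loss (Switch-Transformer style): penalize uneven expert usage
    # by multiplying the fraction of tokens sent to each expert (top-1 assignment)
    # with the mean router probability for that expert.
    expert_fraction = F.one_hot(topk_idx[..., 0], num_experts).float().mean(dim=(0, 1))
    router_fraction = probs.mean(dim=(0, 1))
    lb_loss = num_experts * (expert_fraction * router_fraction).sum()

    # Each token's output is then the probability-weighted sum of its top-k expert outputs.
    return topk_probs, topk_idx, lb_loss
```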
Input → Cross-Attention → Latent Space → Self-Attention → Cross-Attention → Output
        (Input→Latent)                                     (Latent→Input)
- Latent tokens: Learnable tokens that compress information
- Enhanced context: Better long-range dependency modeling
- Efficiency: Reduced attention complexity for long sequences
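The diagram roughly corresponds to a Perceiver-style block: a small set of learnable latent tokens reads from the input with cross-attention, self-attention runs over the latents only, and a second cross-attention writes back to the input. A self-contained sketch (not nanoKimi's actual module) might look like this:

```python
import torch
import torch.nn as nn

class LatentAttentionBlock(nn.Module):
    """Sketch of latent attention: Input -> Latents -> Self-Attention -> Input."""

    def __init__(self, d_model=768, n_head=12, num_latents=64):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, d_model) * 0.02)
        self.read = nn.MultiheadAttention(d_model, n_head, batch_first=True)     # Input -> Latent
        self.process = nn.MultiheadAttention(d_model, n_head, batch_first=True)  # Latent self-attention
        self.write = nn.MultiheadAttention(d_model, n_head, batch_first=True)    # Latent -> Input

    def forward(self, x):                          # x: (batch, seq, d_model)
        lat = self.latents.expand(x.size(0), -1, -1)
        lat, _ = self.read(lat, x, x)              # latents attend to the full sequence
        lat, _ = self.process(lat, lat, lat)       # cheap self-attention over a few latents
        out, _ = self.write(x, lat, lat)           # tokens read back from the latents
        return x + out                             # residual connection
```

Because self-attention runs over only `num_latents` tokens, its cost no longer grows quadratically with the sequence length, which is the efficiency point in the list above.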
- Expert-aware: Different learning rates for expert vs. shared parameters
- Convergence: Specialized momentum handling for sparse gradients
- Memory efficient: Optimized for large sparse models
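Muon's exact update rule isn't spelled out here, but the expert-aware part can at least be illustrated with standard PyTorch parameter groups; the name matching and learning rates below are assumptions made for the sketch:

```python
import torch

def build_param_groups(model, shared_lr=3e-4, expert_lr=1e-3):
    """Split parameters into expert and shared groups with separate learning rates.

    Assumes expert parameters can be identified by name (e.g. 'expert' appears in
    the module path), which is an assumption about nanoKimi's module naming.
    """
    expert_params, shared_params = [], []
    for name, param in model.named_parameters():
        (expert_params if "expert" in name else shared_params).append(param)
    return [
        {"params": shared_params, "lr": shared_lr},
        {"params": expert_params, "lr": expert_lr},  # experts see sparse gradients, so a larger LR
    ]

# Usage with a stock optimizer as a stand-in for Muon:
# optimizer = torch.optim.AdamW(build_param_groups(model))
```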
- Random tokens for testing
- Fast iteration during development
- No external dependencies
- High-quality web text dataset
- Comparable to GPT-2 training data
- Requires preprocessing (see Data Preparation)
Add your own datasets by implementing the TokenDataset interface.
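A minimal sketch of such a dataset, assuming the interface boils down to a PyTorch `Dataset` yielding fixed-length blocks of input and target token IDs (the real `TokenDataset` signature may differ):

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class MyTokenDataset(Dataset):
    """Hypothetical TokenDataset-style wrapper around a pre-tokenized .bin file."""

    def __init__(self, token_file, block_size=1024):
        self.tokens = np.memmap(token_file, dtype=np.uint16, mode="r")
        self.block_size = block_size

    def __len__(self):
        return len(self.tokens) - self.block_size

    def __getitem__(self, idx):
        chunk = torch.from_numpy(
            self.tokens[idx : idx + self.block_size + 1].astype(np.int64)
        )
        return chunk[:-1], chunk[1:]  # input IDs and next-token targets
```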
# Download OpenWebText dataset
# Process raw text files
python -c "
from nanokimi.training.data import TextDataProcessor, DataConfig
config = DataConfig()
processor = TextDataProcessor(config)
processor.process_dataset('openwebtext', ['train.txt'], ['val.txt'])
"
from nanokimi.training.data import TextDataProcessor, DataConfig
config = DataConfig(tokenizer_name="gpt2", block_size=1024)
processor = TextDataProcessor(config)
# Process your text files
train_path, val_path = processor.process_dataset(
    "my_dataset",
    train_files=["train1.txt", "train2.txt"],
    val_files=["val.txt"],
)
Compare nanoKimi against nanoGPT and other baselines:
from nanokimi.benchmark import ModelEvaluator, NanoGPTComparison
evaluator = ModelEvaluator()
comparison = NanoGPTComparison()
# Evaluate perplexity
ppl = evaluator.evaluate_perplexity(model, dataset)
# Compare with nanoGPT
results = comparison.compare_models(nanokimi_model, nanogpt_model, dataset)
training:
  mixed_precision: true  # Enable FP16/BF16

training:
  compile_model: true  # Use torch.compile (PyTorch 2.0+)
# Automatic expert caching during inference
generator = Generator(model)
text = generator.generate_text(prompt, use_cache=True)
training:
  use_wandb: true
  wandb_project: "my-nanokimi-experiments"
  wandb_run_name: "experiment-1"
All metrics are also logged locally:
- Training loss and learning rate
- Expert utilization statistics
- Memory usage and tokens/second
- Validation perplexity
- Training Speed: 20%+ faster convergence with Muon optimizer
- Parameter Efficiency: Better performance per parameter with MoE
- Context Understanding: Enhanced long-range modeling with latent attention
- Scalability: Efficient scaling to larger models with sparse experts
- Memory Efficiency: Comparable memory usage despite architectural complexity
nanoKimi is designed for extensibility:
- Routing Strategies: Experiment with different expert routing algorithms
- Attention Patterns: Try various sparse attention mechanisms
- Multimodal: Extend to vision and other modalities
- Deployment: Optimize for edge devices and inference
- Training Guide - Detailed training instructions
- Architecture Deep Dive - Technical implementation details
- API Reference - Complete API documentation
- Benchmarking - Performance evaluation guide
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
# Install development dependencies
pip install -e ".[dev]"
# Run tests
pytest tests/
# Format code
black nanokimi/
flake8 nanokimi/
This project is licensed under the MIT License - see LICENSE for details.
- Moonshot AI - For the original Kimi-K2 innovations
- nanoGPT - For inspiration and educational approach
- Transformers Community - For the foundation this builds upon
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: [email protected]
Note: This is an educational implementation designed to demonstrate Kimi-K2 concepts. For production use, consider the full Kimi-K2 implementation.
If you find this project helpful, please consider giving it a star on GitHub!