AI Engineering Toolbox

A curated collection of open source tools for every stage of AI engineering, from data and evaluation to building, deploying, and monitoring AI systems
Author

Amit Chaudhary

Published

May 11, 2025

Modified

January 2, 2026

Data

Category Tool Remarks
Synthetic Data curator, datadreamer, distilabel, fabricator, promptwright
auto-label Generate annotated datasets
Pretraining pipelines datatrove, nemo-curator
Data Exploration lilac
turftopic, bertopic Topic Modeling
Deduplication semhash, rensa
Web Scraping trafilatura, crawl4ai
autoscraper Specify target and get scraping code
Language Detection fast-langdetect

Evals

Category Tool Remarks
Libraries athina-evals, deepeval, geval, openevals, promptfoo, inspect-ai
evaluate, torchmetrics, torcheval Off-the-shelf metrics
mir-eval
Agents agentevals
Benchmarks openai-evals, yet-another-applied-llm-benchmark, lm-evaluation-harness, openbench
llmtest-needleinahaystack Needle in a Haystack
agentbench, textarena Agents
open-llm-leaderboard, LMArena Arena
Diversity diversity
Hallucination lettucedetect, hallucination-leaderboard, selfcheckgpt
RAG ragas, ragchecker
auto-evaluator Generate synthetic QA pairs from docs
chunking-evaluation Evaluate chunking strategies
Text Generation bleurt, bertscore, moverscore
Deep Research search_evals

Workflow

Experiment Tracking

Category Tool Remarks
Libraries trulens, mlflow, wandb, mlop

Developer Tools

Category Tool Remarks
Coding Agents claude-code, codex, gemini-cli, qwen-code, mistral-vibe, amp, opencode, openhands
vibe-kanban, conductor Orchestration
ruler Rules
superpowers Skills
continuous-claude, claude-mem Context Management

Context Engineering

Prompt Engineering

Category Tool Remarks
Automatic Prompt Engineering dspy, textgrad, adalflow, zenbase, gepa
dspydantic Pydantic models
openevolve Evolutionary code optimization
Function Calling functionary
Structured Output guidance, instructor, jsonformer, lm-format-enforcer, outlines, xgrammar, lmql, fructose
json-repair Post-process broken JSON
llm-scraper Webpage to JSON
Memory mem0, letta, memobase, memary, langmem, memoripy
cognee Memory using knowledge graphs
Prompt Compression toon, llm-lingua
Rate Limiting backoff, tenacity, ratelimit
limits Rate-limit for own APIs
Code Interpreter gpt-code-interpreter, open-interpreter, codeinterpreter-api
Steering Vectors dialz, repeng
System Prompts llm-system-prompts, leaked-system-prompts, system-prompt-leaks, awesome-ai-system-prompts, cl4r1t4s, system-prompts-and-models-of-ai-tools
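
The Rate Limiting row above lists backoff and tenacity, which wrap calls in retry logic. As a rough illustration of the underlying pattern, and not of either library's API, here is a minimal standard-library sketch of retrying with exponential backoff and jitter. The names `retry_with_backoff` and `flaky_call` are hypothetical, and the delay values are assumptions chosen for the example.

```python
import random
import time


def retry_with_backoff(func, max_attempts=5, base_delay=0.01, max_delay=1.0, sleep=time.sleep):
    """Call func(); on exception, wait and retry with exponentially growing delays.

    Hypothetical helper for illustration only; not the API of backoff or tenacity.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            # Give up and re-raise once the final attempt has failed.
            if attempt == max_attempts - 1:
                raise
            # Delay doubles each attempt, capped at max_delay, with random jitter.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * random.uniform(0.5, 1.0))


# Usage: a stand-in for a rate-limited API call that fails twice, then succeeds.
calls = {"count": 0}


def flaky_call():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"


result = retry_with_backoff(flaky_call)
print(result, calls["count"])
```

In practice the listed libraries add features this sketch omits, such as retrying only on selected exception types, so treat it as a picture of the idea and not as a replacement.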

RAG

Category Tool Remarks
Libraries llama-index, verba, fastrag, haystack, ragbits
Chunking wtsplit, semchunk, chonkie, langchain-text-splitters, chonky
chonky-distilbert-base-multilingual-cased Multilingual Chunking
open-parse, doclayout-yolo Layout parsing visually
sparseprimingrepresentations SPR
Reranking rerankers, flashrank
pyversity Diversity re-ranking
Retrieval pyserini
bm25s Sparse retrieval
Embeddings fastembed, sentence-transformers
model2vec Static Vectors
embedding-atlas Visualize
Vector Index annoy, diskann, faiss, chroma, qdrant, pinecone, weaviate, milvus
pgvector, sqlite-vec SQL Extensions
simsimd Faster dot-product on CPUs
Late Interaction ragatouille, pylate Train ColBERT models
maxsim-cpu Speed-up max-sim for ColBERT
Graph RAG fast-graphrag, graphrag, nano-graphrag
OCR textract, deepseek-ocr
nougat Academic Documents
Document Understanding donut
table-transformer Table Extraction
marker, pdf2md, mineru, docext, docling PDF to markdown
RAG on internal docs onyx, xyne, pipeshub-ai
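
For readers unfamiliar with the "Sparse retrieval" remark next to bm25s above, the following is a minimal pure-Python sketch of Okapi BM25 scoring with the Lucene-style IDF variant. It is not the bm25s API; the function name `bm25_scores`, the whitespace tokenizer, the sample documents, and the defaults `k1=1.5` and `b=0.75` are assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25 (Lucene-style IDF).

    Hypothetical helper for illustration; tokenization is a plain whitespace split.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            # Rare terms get a higher inverse document frequency weight.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term frequency saturates via k1 and is normalized by document length via b.
            norm = tf + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "faiss is a library for vector similarity search",
    "bm25 is a ranking function for sparse retrieval",
    "gradio builds simple web demos",
]
scores = bm25_scores("sparse retrieval ranking", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

Documents that share no terms with the query score zero, which is the defining trait of sparse retrieval compared with the dense embedding approaches listed in the same table.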

Agents

Category Tool Remarks
Libraries autogen, crewai, langroid, openai-agents, pydantic-ai, marvin, metagpt, semantic-kernel
langgraph Graphs
smolagents Code-based agent
MCP fastmcp, enrichmcp
mcp-scan Scan for security vulnerabilities
Web Use browser-use
deer-flow Deep Research
magnitude Automated Testing
Computer Use ui-tars-desktop, cua
Code Execution Sandbox microsandbox, e2b, screenenv
Coding Agents opencode
Web Search API sonar

Finetuning

Category Tool Remarks
LLM Finetuning axolotl, unsloth, torchtune, peft, litgpt, llama-factory
onebitllms, matmulfreellm 1.58-bit LLMs
Model Merging mergekit, mergoo, mergenetic
Multi-modal VLMs maestro, nanovlm, mlx-vlm
smol-vision Recipes
llava Visual Instruction Tuning
RL art, trl, verl, openrlhf, atropos, retrain, nemo-rl, slime
verifiers Verifiers
search-and-learn MCTS
Distributed Training megatron-lm, deepspeed, yafsdp, nanotron, fairscale, colossalai
hivemind, psyche Decentralized Training
Self-instruct airoboros
Tokenization supertokenizer Train multi-word BPE
tokendagger faster tiktoken
Classification adaptive-classifier Continuous Learning
Abliteration heretic

Multimodality

Audio

Category Tool Remarks
Denoising denoiser
Voice Activity Detection silero-vad
General models seamless-communication
indicconformerasr, vistaar Indic Languages
Speech to Text faster-whisper, whisperx
Speaker Identification whisperlivekit
Text to Speech chatterbox
whisper-streaming Realtime

Vision

Category Tool Remarks
Facial recognition deepface
VLM CogVLM
Watermarking meta-seal

Deployment

Inference

Category Tool Remarks
Inference Servers ray, vllm, powerinfer, text-generation-inference, sglang, tensorrt-llm, ctranslate2, mlc-llm, deepspeed-mii, openllm, exllamav2, fastgen, tokasaurus
Batch Processing skypilot
GPU Snapshot inferx
Multi-token prediction medusa
Multi-LoRA inference s-lora, punica, lorax
KV Cache kvzip, lmcache
LLM Gateway litellm, portkey, any-llm, tensorzero
LLM Routing routellm, automix, openrouter, awesome-ai-model-routing, helicone
Model Cascade frugalgpt
Semantic Cache gptcache
Quantization llm-compressor, bitsandbytes, optimum, sinq
Embeddings text-embeddings-inference, infinity
Kernels liger-kernel
attorch Triton kernels
sparse_transformers Sparse kernels
Local Servers lmstudio, ollama, text-generation-webui, koboldcpp, llama.cpp
Frontend UI gradio, streamlit
copilot-kit, assistant-ui Chat UI components
agent-inbox Agent UI
Caching panza Caching for async functions

Monitoring

Category Tool Remarks
Observability openllmetry, phoenix, logfire
Guardrails openai-guardrails, giskard, langkit, garak, deepchecks, nemo-guardrails
rebuff Prompt Injection Detection
uqlm Uncertainty Quantification
Drift Detection ft-drift Detect drift in OpenAI messages
AI Detection binoculars