Machine Learning Toolbox
This page contains useful libraries I’ve found when working on Machine Learning projects.
The libraries are organized below by phases of a typical Machine Learning project.
Phase: Data
Data Annotation
Category | Tool | Remarks |
---|---|---|
General | superintendent, pigeon | Annotate in notebooks |
labelstudio | Open Source Data Labeling Tool | |
Image | makesense.ai, labelimg, via, cvat | |
Text | doccano, dataturks, brat | |
prodigy | Paid | |
chatio | Generate text datasets using DSL | |
Audio | audio-annotator, audino |
Data Collection
Importing Data
Category | Tool | Remarks |
---|---|---|
Prebuilt | openml, lineflow | |
rs_datasets | Recommendation Datasets | |
nlp | Python interface to NLP datasets | |
tensorflow_datasets | Access datasets in Tensorflow | |
hub | Prebuilt datasets for PyTorch and Tensorflow | |
Audio | pydub | |
Video | moviepy | Edit Videos |
pytube | Download YouTube videos | |
Image | py-image-dataset-generator, idt, jmd-imagescraper | Auto fetch images from web for certain search |
News | news-please, news-catcher | Scrape News |
pygooglenews | Google News | |
Lyrics | lyricsgenius | |
Email | talon | |
PDF | camelot, tabula-py, parsr, pdftotext, pdfplumber, pymupdf | |
grobid | Parse PDF into structured XML | |
PyPDF2 | Read and write PDF in Python | |
pdf2image | Convert PDF to image | |
Excel | openpyxl | |
Remote file | smart_open | |
Crawling | MechanicalSoup, libextract | |
pyppeteer | Chrome Automation | |
hext | DSL for extracting data from HTML | |
ratelimit | API rate limit decorator | |
Google Search | googlesearch | Parse google search results |
Google sheets | gspread | |
Google drive | gdown, pydrive | |
Python API | pydataset | |
Google Maps | geo-heatmap | |
Text to Speech | gtts | |
Database | blaze | Pandas and Numpy interface to databases |
twint, tweepy | Scrape Twitter | |
App Store | google-play-scraper | |
Wikipedia | wikipedia | Access data from wikipedia |
Arxiv | pyarxiv | Programmatic access to arxiv.org |
Google Ngrams | google-ngram-downloader | |
Machine Translation Corpus | mtdata | |
XML | xmltodict | Parse XML as python dictionary |
Data Augmentation
Category | Tool | Remarks |
---|---|---|
Text | nlpaug, noisemix, textattack, textaugment, niacin, SeaQuBe | |
fastent | Expand NER entity list | |
Image | imgaug, albumentations, augmentor, solt | |
Audio | audiomentations, muda | |
OCR data | TextRecognitionDataGenerator | |
Tabular data | deltapy | |
mockaroo | Generate synthetic user details | |
Automatic augmentation | deepaugment | Image |
Phase: Exploration
Data Preparation
Category | Tool | Remarks |
---|---|---|
Dataframe | cudf | Pandas on GPU |
Parallelize | pandarallel, swifter, modin | Parallelize pandas |
vaex | Pandas on huge data | |
numba | Parallelize numpy | |
Missing values | missingno | |
Split images into train/validation/test | split-folders | |
Class Imbalance | imblearn | |
Categorical encoding | category_encoders | |
Data Validation | pandera, pandas-profiling | Pandas |
Data Cleaning | pyjanitor | Janitor ported to python |
Parsing | pyparsing, parse | |
Weak Supervision | snorkel | |
Graph Sampling | little ball of fur |
Data Exploration
Category | Tool | Remarks |
---|---|---|
Explore Data | sweetviz, dataprep, quickda, visidata | Generate quick visualizations of data |
ipyplot | Plot images | |
Notebook Tools | nbdime | View Jupyter notebooks through CLI |
papermill | Parametrize notebooks | |
nbformat | Access notebooks programmatically | |
nbconvert | Convert notebooks to other formats | |
ipyleaflet | Maps in notebooks | |
ipycanvas | Draw diagrams in notebook | |
Relationship | ppscore | Predictive Power Score |
pdpbox | Partial Dependence Plot |
Feature Generation
Category | Tool | Remarks |
---|---|---|
Automatic feature engineering | featuretools, autopandas | |
tsfresh | Automatic feature engineering for time series | |
Metric learning | metric-learn, pytorch-metric-learning | |
Time series | python-holidays | List of holidays |
skits | Transformation for time-series data | |
catch22 | Pre-built features for time-series data | |
DAG based dataset generation | DFFML | |
Dimensionality reduction | fbpca, fitsne, trimap |
Phase: Modeling
Model Selection
Category | Tool | Remarks |
---|---|---|
Project Structure | cookiecutter-data-science | |
Find SOTA models | sotawhat, papers-with-code, codalab, nlpprogress, evalai, collectiveknowledge, sotabench | Benchmarks |
bert-related-papers | BERT Papers | |
acl-explorer | ACL Publications Explorer | |
survey-papers | Collection of survey papers | |
Pretrained models | modeldepot, pytorch-hub | General |
pretrained-models.pytorch, pytorchcv | Pre-trained ConvNets | |
pytorch-image-models | 200+ pretrained ConvNet backbones | |
huggingface-models, huggingface-pretrained | Transformer Models | |
huggingface-languages | Multi-lingual Models | |
model-forge, The Super Duper NLP Repo | Pre-trained NLP models by use case | |
AutoML | auto-sklearn, mljar-supervised, automl-gs, pycaret, evalml | |
lazypredict | Run all sklearn models at once | |
tpot | Genetic AutoML | |
autocat | Auto-generate text classification models in spacy | |
mindsdb, ludwig | Autogenerate ML code | |
Gradient Boosting | catboost, xgboost, ngboost | |
lightgbm, thundergbm | GPU Capable | |
Hidden Markov Models | hmmlearn | |
Genetic Programming | gplearn | |
Active Learning | modal | |
Support Vector Machines | thundersvm | Run SVM on GPU |
Rule based classifier | sklearn-expertsys | |
Probabilistic modeling | pomegranate, pymc3 | |
Graph Embedding and Community Detection | karateclub, python-louvain | |
Anomaly detection | adtk | |
Spiking Neural Network | norse | |
Fuzzy Learning | fylearn, scikit-fuzzy | |
Noisy Label Learning | cleanlab | |
Few Shot Learning | keras-fewshotlearning | |
Deep Clustering | deep-clustering-toolbox | |
Graph Neural Networks | spektral | GNN for Keras |
Contrastive Learning | contrastive-learner | |
Self-Supervised Learning | lightly | Implementations of SSL models |
Optimization | nevergrad | Gradient Free Optimization |
cvxpy | Convex Optimization |
Frameworks
Category | Tool | Remarks |
---|---|---|
Tensorflow | tensorflow-addons | |
tensorflow-text | Addons for NLP | |
tensorflow-wheels | Optimized wheels for Tensorflow | |
tf-sha-rnn | ||
keras-radam | RADAM optimizer | |
scikeras | Scikit-learn Wrapper for Keras | |
larq | Binarized neural networks | |
ktrain | FastAI like interface for keras | |
tavolo | Kaggle Tricks as Keras Layers | |
Pytorch | pytorch-summary | Keras-like summary |
skorch | Wrap pytorch in scikit-learn compatible API | |
pytorch-lightning | Lightweight wrapper for PyTorch | |
einops | Einstein Notation | |
kornia | Computer Vision Methods | |
torchcontrib | SOTA Building Blocks in PyTorch | |
pytorch-optimizer | Collection of optimizers | |
pytorch-block-sparse | Sparse matrix replacement for nn.Linear | |
pytorch-forecasting | Time series forecasting in PyTorch lightning | |
nonechucks | Drop corrupt data automatically in DataLoader | |
Scikit-learn | scikit-lego, iterative-stratification | |
tscv | Time-series cross-validation | |
iterstrat | Cross-validation for multi-label data | |
scikit-multilearn | Multi-label classification | |
Addons | mlxtend | Extra utilities not present in frameworks |
tensor-sensor | Visualize tensors |
Natural Language Processing
Category | Tool | Remarks |
---|---|---|
Libraries | spacy, nltk, corenlp, deeppavlov, kashgari, transformers, ernie, stanza, nlp-architect, spark-nlp, pytext, FARM | |
headliner, txt2txt | Sequence to sequence models | |
Nvidia NeMo | Toolkit for ASR, NLP and TTS | |
nlu | 1-line models for NLP | |
CPU-optimizations | turbo_transformers, onnx_transformers | |
Wrappers | fast-bert, simpletransformers | |
finetune | Scikit-learn like API for transformers | |
Preprocessing | textacy, texthero, textpipe | |
JamSpell, pyhunspell, pyspellchecker, cython_hunspell, hunspell-dictionaries, autocorrect (can add more languages), symspellpy, spello (train your own spelling correction), contextualSpellCheck, neuspell, nlprule | Spelling Correction | |
ekphrasis | Pre-processing for social media texts | |
contractions, pycontractions | Contraction Mapping | |
truecase | Fix casing | |
nnsplit, deepsegment, sentence-doctor, pysbd, sentence-splitter | Sentence Segmentation | |
wordninja | Probabilistic Word Segmentation | |
punctuator2 | Punctuation Restoration | |
stopwords-iso | Stopwords for all languages | |
language-check, langdetect, polyglot, pycld2, cld2, cld3, langid | Language Identification | |
neuralcoref | Coreference Resolution | |
inflect, lemminflect | Inflections | |
scrubadub | PII removal | |
ftfy, clean-text, text-unidecode | Fix Unicode Issues | |
fastpunct | Punctuation Restoration | |
pypostal, mordecai | Parse Street Addresses | |
python-phonenumbers | Parse phone numbers | |
numerizer, word2number | Parse natural language numbers | |
dateparser | Parse natural dates | |
emoji | Handle emoji | |
pyarabic | multilingual | |
Tokenization | sentencepiece, youtokentome, subword-nmt | |
sacremoses | Rule-based | |
jieba | Chinese Word Segmentation | |
kytea | Japanese word segmentation | |
Thesaurus | python-datamuse | |
Feature Generation | homer, textstat | Readability scores |
LexicalRichness | Lexical Richness Measure | |
Gibberish Detection | nostril, gibberish-detector | |
Paraphrasing | pegasus | Question Paraphrasing |
sentaugment | Paraphrase mining | |
Spacy Extensions | spacy-pattern-builder | Generate dependency matcher patterns automatically |
spacy_grammar | Rule-based grammar error detection | |
role-pattern-builder | Pattern based SRL | |
textpipeliner | Extract RDF triples | |
tenseflow | Convert tense of sentence | |
camphr | Wrapper to transformers, elmo, udify | |
spleno | Domain-specific lemmatization | |
spacy-udpipe | Use UDPipe from Spacy | |
Linguistics | nodebox_linguistics_extended | Verb Conjugation |
Morphology | unimorph | Morphology data for many languages |
Phonetics | epitran | Transliterate text into IPA |
allosaurus | Recognize phones for 2000 languages | |
Phonology | panphon | Generate phonological feature representations |
phoible | Database of segment inventories for 2186 languages | |
Typology | lang2vec | Compare typological features of languages |
Word Sense Disambiguation | pywsd | |
Embeddings | InferSent, embedding-as-service, bert-as-service, sent2vec, sense2vec, glove-python, fse | |
rank_bm25, BM25Transformer | BM25 | |
sentence-transformers, DeCLUTR | BERT sentence embeddings | |
conceptnet-numberbatch | Word embeddings trained with common-sense knowledge graph | |
word2vec-twitter | Word2vec trained on twitter | |
pymagnitude | Access word-embeddings programmatically | |
chakin | Download pre-trained word vectors | |
zeugma | Pretrained-word embeddings as scikit-learn transformers | |
starspace | Learn embeddings for anything | |
svd2vec | Learn embeddings from co-occurrence | |
all-but-the-top | Post-processing for word vectors | |
Cross-lingual Embeddings | muse, laserembeddings, xlm, LaBSE | |
transvec | Train mapping between monolingual embeddings | |
MuRIL | Embeddings for 17 indic languages with transliteration | |
BPEmb | Subword Embeddings in 275 Languages | |
piecelearn | Train own sub-word embeddings | |
Multilingual support | polyglot, trankit | |
inltk, indic_nlp | Indic Languages | |
cltk | NLP for latin and classic languages | |
Compact Models | mobilebert, distilbert, tinybert, BERT-of-Theseus-MNLI, MiniLM | |
Information Extraction | claucy | |
Knowledge | conceptnet-lite | |
stanford-openie | Knowledge Graphs | |
verbnet-parser | VerbNet parser | |
Domain-specific BERT | codebert | Code |
clinicalbert-mimicnotes, clinicalbert-discharge-summary | Clinical Domain | |
twitter-roberta-base | ||
scispacy | bio-medical data | |
Text Extraction | textract (Image, Audio, PDF) | |
Text Generation | gpt2client, textgenrnn, gpt-2-simple, aitextgen | GPT-2 |
markovify | Markov chains | |
Transliteration | wiktra | |
Machine Translation | MarianMT, Opus-MT, joeynmt, OpenNMT | |
googletrans, word2word, translate-python, deep_translator | Translation libraries | |
translators | Free calls to multiple translation APIs | |
giza++, fastalign, simalign | Word Alignment | |
Summarization | textrank, pytldr, bert-extractive-summarizer, sumy, fast-pagerank, sumeval | |
doc2query | Summarize document with queries | |
Question Generation | question-generation, questiongen.ai | Question Generation Pipeline for Transformers |
Keyword extraction | rake, pke, phrasemachine, keybert, word2phrase | |
pyate | Automated Term Extraction | |
Question Answering | haystack | Build end-to-end QA system |
mcQA | Multiple Choice Question Answering | |
TAPAS | Table Question Answering | |
Ranking | transformer-rankers | |
Search | elasticsearch-dsl | Wrapper for elastic search |
jina | production-level neural semantic search | |
meilisearch-python | ||
NLU | snips-nlu | |
Semantic parsing | quepy | |
Toxicity Detection | detoxify | |
Topic Modeling | gensim, guidedlda, enstop, top2vec, contextualized-topic-models, corex_topic, lda2vec, bertopic, tomotopy, ToModAPI | |
Code Switching | codeswitch | |
Clustering | kmodes, star-clustering, genieclust | |
spherecluster | K-means with cosine distance | |
kneed | Automatically find number of clusters from elbow curve | |
OptimalCluster | Automatically find optimal number of clusters | |
Metrics | seqeval | NER, POS tagging |
ranking-metrics | Metrics for Information Retrieval | |
String match | phrase-seeker, textsearch | |
jellyfish | Perform string and phonetic comparison | |
flashtext | Super-fast extract and replace keywords | |
pythonverbalexpressions | Verbally describe regex | |
commonregex | Ready-made regex for email/phone etc. | |
textdistance, editdistance, word-mover-distance | Text distances | |
wmd-relax | Word mover distance for spacy | |
fuzzywuzzy, spaczz, PolyFuzz, rapidfuzz, dedupe | Fuzzy Search | |
Sentiment | vaderSentiment | Rule based |
absa | Aspect Based Sentiment Analysis | |
Emotion Classification | distilroberta-finetuned, goemotion-pytorch | |
emosent-py | Sentiment scores for Emojis | |
Profanity detection | profanity-check | |
Visualization | stylecloud | Word Clouds |
scattertext | Compare word usage across segments | |
picture-text | Interactive tree-maps for hierarchical clustering | |
Named Entity Recognition (NER) | spaCy, Stanford NER, sklearn-crfsuite | |
med7 | Spacy NER for medical records | |
Entity Linking | dbpedia-spotlight, GENRE | |
Entity Matching | py_entitymatching, deepmatcher | |
Fill blanks | fitbert | |
Dictionary | vocabulary | |
Nearest neighbor | faiss, sparse_dot_topn | |
Knowledge Distillation | textbrewer, aquvitae | |
Language Model Scoring | lm-scorer, bertscore, kenlm, spacy_kenlm | |
Record Linking | fuzzymatcher | |
Cross-lingual transfer learning | langrank | Auto-select optimal transfer language |
Pronunciation | pronouncing | |
Dialogue System | ParlAI | |
Relation Extraction | OpenNRE |
Computer Vision
Category | Tool | Remarks |
---|---|---|
Image processing | scikit-image, imutils, opencv-wrapper, opencv-python | |
torchio | Medical Images | |
Segmentation Models | segmentation_models | Keras |
segmentation_models.pytorch | Segmentation models in PyTorch | |
High-level libraries | terran | Face detection, recognition, pose estimation |
Face recognition | face_recognition, mtcnn | |
face-alignment | Find facial landmarks | |
Facial-Expression-Recognition.Pytorch | Face Emotion | |
GANS | mimicry, imaginaire | |
Image Inpainting | GAN Image Inpainting | |
Face swapping | faceit, faceit-live, avatarify | |
Video summarization | videodigest | |
Semantic search over videos | scoper | |
OCR | keras-ocr, pytesseract, keras-craft, ocropy, doc2text | |
easyocr, kraken, PaddleOCR | Multilingual OCR | |
layout-parser, pdftabextract | OCR tables from document | |
Object detection | luminoth, detectron2, mmdetection | |
Image hashing | ImageHash |
Speech
Category | Tool | Remarks |
---|---|---|
Libraries | pyannotate, librosa, espnet | |
Speech Recognition | kaldi, speech_recognition, delta | |
Speech Synthesis | festvox, cmuflite | |
Feature Engineering | python_speech_features | Convert raw audio to features |
Diarization | resemblyzer | |
Source Separation | spleeter, nussl, open-unmix-pytorch, asteroid |
Recommendation System
Category | Tool | Remarks |
---|---|---|
Libraries | xlearn, DeepCTR | Factorization machines (FM), and field-aware factorization machines (FFM) |
lightfm, spotlight | Popular Recsys algos | |
tensorflow_recommenders | Recommendation System in Tensorflow | |
Collaborative Filtering | implicit | |
Scikit-learn like API | surprise | |
Recommendation System in Pytorch | CaseRecommender | |
Apriori algorithm | apyori | |
Metrics | rs_metrics |
Timeseries
Category | Tool | Remarks |
---|---|---|
Libraries | prophet, tslearn, pyts, seglearn, cesium, stumpy, darts | |
sktime | Scikit-learn like API | |
atspy | Automated time-series models | |
ARIMA models | pmdarima |
Hyperparameter Optimization
Category | Tool | Remarks |
---|---|---|
General | hyperopt, optuna, evol, talos | |
Keras | keras-tuner | |
Scikit-learn | hyperopt-sklearn, scikit-optimize | Bayesian Optimization |
sklearn-deap | Evolutionary algorithm | |
Parameter optimization | ParameterImportance |
Phase: Validation
Experiment Monitoring
Category | Tool | Remarks |
---|---|---|
MLOps | clearml, wandb, neptune.ai, replicate.ai | |
Experiment tracking | tensorboard, mlflow | |
Learning curve | lrcurve, livelossplot | Plot realtime learning curve in Keras |
Notification | knockknock | Get notified by slack/email |
jupyter-notify | Notify when task is completed in jupyter | |
apprise | Notify to any platform | |
Progress bar | fastprogress, tqdm | |
GPU Usage | gpumonitor, nvtop | |
jupyterlab-nvdashboard | See GPU Usage in jupyterlab |
Interpretability
Category | Tool | Remarks |
---|---|---|
Interpret models | eli5, lime, shap, alibi, tf-explain, treeinterpreter, pybreakdown, xai, lofo-importance, interpretML, shapash | |
exbert | Interpret BERT | |
bertviz | Explore self-attention in BERT | |
Interpret word2vec | word2viz, whatlies | |
Interpret NLP models | Language Interpretability Tool | |
Adversarial Attack | cleverhans | General |
foolbox | Image | |
triggers | NLP |
Visualization
Category | Tool | Remarks |
---|---|---|
Libraries | matplotlib, seaborn, pygal, plotly, plotnine | |
yellowbrick, scikit-plot | Visualization for scikit-learn | |
pyldavis | Visualize topic models | |
dtreeviz | Visualize decision tree | |
Interactive charts | bokeh | |
flourish-studio | Create interactive charts online | |
mpld3 | Matplotlib to D3 Converter | |
Model Visualization | netron, nn-svg | Architecture |
keract | Activation maps for keras | |
keras-vis | Visualize keras models | |
Styling | open-color | Color Schemes |
mplcyberpunk | Cyberpunk style for matplotlib | |
chart.xkcd | XKCD like charts | |
Generate graphs using markdown | mermaid | |
High dimensional visualization | umap | |
ivis | Ivis Algorithm | |
Animated charts | bar_chart_race | Bar chart race animation |
pandas_alive | Animated charts in pandas | |
Tree-map chart | squarify | |
3D charts | babyplots |
Phase: Production
Model Export
Category | Tool | Remarks |
---|---|---|
Cloud Storage | Zenodo, Github Releases, OneDrive, Google Drive, Dropbox, S3, mega, DAGsHub, huggingface-hub | |
Serialization | sklearn-porter, m2cgen | Transpile sklearn model to C, Java, JavaScript and others |
hummingbird | Convert ML models to PyTorch | |
cloudpickle, jsonpickle | Pickle extensions | |
Dependencies | pip-chill | pip freeze without dependencies |
pipreqs | Generate requirements.txt based on imports | |
conda-pack | Export conda for offline use | |
Benchmarking | torchprof | Profile pytorch layers |
scalene, pyinstrument | Profile python code | |
k6 | Load test API | |
ai-benchmark | Benchmark VM on 19 different models | |
Distributed training | horovod | |
Data Pipeline | pypeln |
Inference
Category | Tool | Remarks |
---|---|---|
Model Serving Frameworks | cortex, torchserve, ray-serve, bentoml | |
Dashboard | streamlit | Generate frontend with python |
gradio | Fast UI generation for prototyping | |
dash | React Dashboard using Python | |
voila | Convert Jupyter notebooks into dashboard | |
streamlit-drawable-canvas | Drawable Canvas for Streamlit | |
streamlit-terran-timeline | Show timeline of faces in videos | |
streamlit components | Collection of streamlit components | |
API Frameworks | flask | |
fastapi | Automatic Docs and Validation | |
Configuration Management | config, python-decouple | |
Data Validation | schema, jsonschema, cerberus, pydantic, marshmallow, validators | |
CORS | flask-cors | CORS in Flask |
Caching | cachetools, cachew (cache to local sqlite) | |
Authentication | pyjwt (JWT) | |
Task Queue | rq, schedule, huey | |
mlq | Queue ML Tasks in Flask | |
Job Scheduler | airflow | |
Database | flask-sqlalchemy, tinydb, flask-pymongo, odmantic | |
tortoise-orm | Asyncio ORM similar to Django | |
Logging | loguru | |
Testing | schemathesis | Automatic test generation from Swagger |
pytest-benchmark | Profile time in pytest | |
exdown | Extract code from markdown files | |
mktestdocs | Test code present in markdown files | |
Data Logging & Monitoring | whylogs |
Python libraries
Category | Tool | Remarks |
---|---|---|
Decorators | retrying (retry some function) | |
Subprocess | delegator.py | |
bloom filter | python-bloomfilter | |
Run python libraries in sandbox | pipx | |
Pretty print tables in CLI | tabulate | |
Leaflet maps from python | folium | |
Debugging | PySnooper | |
Date and Time | pendulum | |
Create interactive prompts | prompt-toolkit | |
Concurrent database | pickleshare | |
Async | tomorrow | |
Testing | crosshair (find failure cases for functions) | |
Virtual webcam | pyfakewebcam | |
CLI Formatting | rich | |
Control mouse and output device | pynput | |
Shell commands as functions | sh | |
Path-like interface to remote files | pathy | |
Standard Library Extension | ubelt | |
Improved doctest | xdoctest | |
Code to Maths | latexify-py, handcalcs | |
Multiprocessing | filelock | Lock files during access from multiple process |
Collections | bidict | Bidirectional dictionary |
munch | Dictionary with dot access |
Utilities
Category | Tool | Remarks |
---|---|---|
Database | mlab | Free 500 MB MongoDB |
Trade-off tools | egograph | Find alternatives to anything |
Data Visualization | flourish-studio | |
Linux | ripgrep | |
Colab | colab-cli | Manage Colab notebooks from the command line |
Drive | drive-cli | Use google drive similar to git |
Git | gitjk | Undo what you just did in git |