Machine Learning Toolbox

This page contains useful libraries I’ve found when working on Machine Learning projects.

The libraries are organized below by phases of a typical Machine Learning project.

Phase: Data

Data Annotation

Category Tool Remarks
General superintendent, pigeon Annotate in notebooks
  labelstudio Open Source Data Labeling Tool
Image makesense.ai, labelimg, via, cvat  
Text doccano, dataturks, brat  
  prodigy Paid
  chatito Generate text datasets using DSL
Audio audio-annotator, audino  

Data Collection

Category Tool Remarks
Curations datasetlist, UCI, Google Dataset Search, fastai-datasets  
  huggingface-datasets, The Big Bad NLP Database, nlp-datasets, nlp corpora NLP Datasets
  bifrost Vision Datasets
Words curse-words, badwords, LDNOOBW, 10K most common words, common-misspellings  
  wordlists Words organized by topic
  english-words A text file containing over 466k English words
  tf-idf-iif-top-100-wordlists Top 100 distinctive words for each language
  freeling Dictionary of words grouped by POS
Text Corpus project gutenberg, nlp-datasets, 1 trillion n-grams, litbank, BookCorpus, south-asian text corpus  
  opus, oscar (big multilingual corpus) Translation Parallel Text
  pile 825GB text corpus
  freebase Relation triples
  opensubtitles Movie subtitles parallel corpus
  lti-langid Language Identification Corpus for 1152 languages
  fandom-transcripts Movie and Series Transcripts
  cognet Cognates for 338 languages
  wold Loan words
Sentiment SST2, Amazon Reviews, Yelp Reviews, Movie Reviews, Food Reviews, Twitter Airline, GOP Debate, Sentiment Lexicons for 81 languages, SentiWordNet, Opinion Lexicon, Wordstat words, Emoticon Sentiment, socialsent  
Emotion NRC-Emotion-Lexicon-Wordlevel, ISEAR(17K), HappyDB, emotion-to-emoji-mapping  
  EmoTag1200 Emoji-Emotion scores
NLU Intents rasa-nlu-training-data  
N-grams google-book-ngrams, norvig-processed-ngrams  
Summarization curation-corpus  
Conversations conversational-datasets, cornell-movie-dialog-corpus, persona-chat, DialogDatasets  
Semantic Parsing wikisql, spider Text to SQL
  WebQuestions, ComplexWebQuestions Text to Knowledge Graph
  CoNaLa, CONCODE Text to program
  amrlib Parse AMR data
Image 1 million fake faces, flickr-faces, objectnet, YFCC100m, USPS, Animal Faces-HQ dataset (AFHQ)  
  tiny-images, SVHN, STL-10, imagenette, CIFAR-10 Small image datasets for quick experimentation
  omniglot, mini-imagenet One Shot Learning
Paraphrasing PPDB  
Audio audioset YouTube audio with labels
Speech voxforge, openslr, cmu wilderness, commonvoice  
Speech synthesis CMU Arctic  
Graphs Social Networks (Github, Facebook, Reddit)  
Handwriting iam-handwriting  
  text_renderer Generate synthetic OCR text

Importing Data

Category Tool Remarks
Prebuilt openml, lineflow  
  rs_datasets Recommendation Datasets
  nlp Python interface to NLP datasets
  tensorflow_datasets Access datasets in Tensorflow
  hub Prebuilt datasets for PyTorch and Tensorflow
Audio pydub  
Video moviepy Edit Videos
  pytube Download YouTube videos
Image py-image-dataset-generator, idt, jmd-imagescraper Auto fetch images from web for certain search
News news-please, news-catcher Scrape News
  pygooglenews Google News
Lyrics lyricsgenius  
Email talon  
PDF camelot, tabula-py, parsr, pdftotext, pdfplumber, pymupdf  
  grobid Parse PDF into structured XML
  PyPDF2 Read and write PDF in Python
  pdf2image Convert PDF to image
Excel openpyxl  
Remote file smart_open  
Crawling MechanicalSoup, libextract  
  pyppeteer Chrome Automation
  hext DSL for extracting data from HTML
  ratelimit API rate limit decorator
Google Search googlesearch Parse google search results
Google sheets gspread  
Google drive gdown, pydrive  
Python API pydataset  
Google Maps geo-heatmap  
Text to Speech gtts  
Database blaze Pandas and Numpy interface to databases
Twitter twint, tweepy Scrape Twitter
App Store google-play-scraper  
Wikipedia wikipedia Access data from wikipedia
Arxiv pyarxiv Programmatic access to arxiv.org
Google Ngrams google-ngram-downloader  
Machine Translation Corpus mtdata  
XML xmltodict Parse XML as python dictionary
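
To make the last row concrete, here is a minimal sketch of what "parse XML as a Python dictionary" means, written with the standard library only. It is not xmltodict's API or implementation; the helper names `element_to_dict` and `parse_xml` are hypothetical, and the sketch deliberately ignores XML attributes and mixed text/element content.

```python
import xml.etree.ElementTree as ET


def element_to_dict(element):
    """Recursively convert an ElementTree element into plain Python data."""
    children = list(element)
    if not children:
        # Leaf node: return its stripped text, or None when it is empty.
        text = element.text.strip() if element.text else ""
        return text or None
    result = {}
    for child in children:
        value = element_to_dict(child)
        if child.tag in result:
            # Repeated sibling tags are collected into a list.
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result


def parse_xml(xml_string):
    """Parse an XML string into a nested dict keyed by tag name."""
    root = ET.fromstring(xml_string)
    return {root.tag: element_to_dict(root)}


doc = parse_xml("<note><to>Ann</to><tag>a</tag><tag>b</tag></note>")
print(doc)  # {'note': {'to': 'Ann', 'tag': ['a', 'b']}}
```

For real documents with attributes, namespaces, or large files, a dedicated library such as the one listed above is likely the better choice.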

Data Augmentation

Category Tool Remarks
Text nlpaug, noisemix, textattack, textaugment, niacin, SeaQuBe  
  fastent Expand NER entity list
Image imgaug, albumentations, augmentor, solt  
Audio audiomentations, muda  
OCR data TextRecognitionDataGenerator  
Tabular data deltapy  
  mockaroo Generate synthetic user details
Automatic augmentation deepaugment Image

Phase: Exploration

Data Preparation

Category Tool Remarks
Dataframe cudf Pandas on GPU
Parallelize pandarallel, swifter, modin Parallelize pandas
  vaex Pandas on huge data
  numba Parallelize numpy
Missing values missingno  
Split images into train/validation/test split-folders  
Class Imbalance imblearn  
Categorical encoding category_encoders  
Data Validation pandera, pandas-profiling Pandas
Data Cleaning pyjanitor Janitor ported to python
Parsing pyparsing, parse  
Weak Supervision snorkel  
Graph Sampling little ball of fur  

Data Exploration

Category Tool Remarks
Explore Data sweetviz, dataprep, quickda, visidata Generate quick visualizations of data
  ipyplot Plot images
Notebook Tools nbdime View Jupyter notebooks through CLI
  papermill Parametrize notebooks
  nbformat Access notebooks programmatically
  nbconvert Convert notebooks to other formats
  ipyleaflet Maps in notebooks
  ipycanvas Draw diagrams in notebook
Relationship ppscore Predictive Power Score
  pdpbox Partial Dependence Plot

Feature Generation

Category Tool Remarks
Automatic feature engineering featuretools, autopandas  
  tsfresh Automatic feature engineering for time series
Metric learning metric-learn, pytorch-metric-learning  
Time series python-holidays List of holidays
  skits Transformation for time-series data
  catch22 Pre-built features for time-series data
DAG based dataset generation DFFML  
Dimensionality reduction fbpca, fitsne, trimap  

Phase: Modeling

Model Selection

Category Tool Remarks
Project Structure cookiecutter-data-science  
Find SOTA models sotawhat, papers-with-code, codalab, nlpprogress, evalai, collectiveknowledge, sotabench Benchmarks
  bert-related-papers BERT Papers
  acl-explorer ACL Publications Explorer
  survey-papers Collection of survey papers
Pretrained models modeldepot, pytorch-hub General
  pretrained-models.pytorch, pytorchcv Pre-trained ConvNets
  pytorch-image-models 200+ pretrained ConvNet backbones
  huggingface-models, huggingface-pretrained Transformer Models
  huggingface-languages Multi-lingual Models
  model-forge, The Super Duper NLP Repo Pre-trained NLP models by usecase
AutoML auto-sklearn, mljar-supervised, automl-gs, pycaret, evalml  
  lazypredict Run all sklearn models at once
  tpot Genetic AutoML
  autocat Auto-generate text classification models in spacy
  mindsdb, ludwig Autogenerate ML code
Gradient Boosting catboost, xgboost, ngboost  
  lightgbm, thundergbm GPU Capable
Hidden Markov Models hmmlearn  
Genetic Programming gplearn  
Active Learning modal  
Support Vector Machines thundersvm Run SVM on GPU
Rule based classifier sklearn-expertsys  
Probabilistic modeling pomegranate, pymc3  
Graph Embedding and Community Detection karateclub, python-louvain  
Anomaly detection adtk  
Spiking Neural Network norse  
Fuzzy Learning fylearn, scikit-fuzzy  
Noisy Label Learning cleanlab  
Few Shot Learning keras-fewshotlearning  
Deep Clustering deep-clustering-toolbox  
Graph Neural Networks spektral GNN for Keras
Contrastive Learning contrastive-learner  
Self-Supervised Learning lightly Implementations of SSL models
Optimization nevergrad Gradient Free Optimization
  cvxpy Convex Optimization

Frameworks

Category Tool Remarks
Tensorflow tensorflow-addons  
  tensorflow-text Addons for NLP
  tensorflow-wheels Optimized wheels for Tensorflow
  tf-sha-rnn  
  keras-radam RADAM optimizer
  scikeras Scikit-learn Wrapper for Keras
  larq Binarized neural networks
  ktrain FastAI like interface for keras
  tavolo Kaggle Tricks as Keras Layers
Pytorch pytorch-summary Keras-like summary
  skorch Wrap pytorch in scikit-learn compatible API
  pytorch-lightning Lightweight wrapper for PyTorch
  einops Einstein Notation
  kornia Computer Vision Methods
  torchcontrib SOTA Building Blocks in PyTorch
  pytorch-optimizer Collection of optimizers
  pytorch-block-sparse Sparse matrix replacement for nn.Linear
  pytorch-forecasting Time series forecasting in PyTorch lightning
  nonechucks Drop corrupt data automatically in DataLoader
Scikit-learn scikit-lego, iterative-stratification  
  tscv Time-series cross-validation
  iterstrat Cross-validation for multi-label data
  scikit-multilearn Multi-label classification
Addons mlxtend Extra utilities not present in frameworks
  tensor-sensor Visualize tensors

Natural Language Processing

Category Tool Remarks
Libraries spacy, nltk, corenlp, deeppavlov, kashgari, transformers, ernie, stanza, nlp-architect, spark-nlp, pytext, FARM  
  headliner, txt2txt Sequence to sequence models
  Nvidia NeMo Toolkit for ASR, NLP and TTS
  nlu 1-line models for NLP
CPU-optimizations turbo_transformers, onnx_transformers  
Wrappers fast-bert, simpletransformers  
  finetune Scikit-learn like API for transformers
Preprocessing textacy, texthero, textpipe  
  JamSpell, pyhunspell, pyspellchecker, cython_hunspell, hunspell-dictionaries, autocorrect (can add more languages), symspellpy, spello (train your own spelling correction), contextualSpellCheck, neuspell, nlprule Spelling Correction
  ekphrasis Pre-processing for social media texts
  contractions, pycontractions Contraction Mapping
  truecase Fix casing
  nnsplit, deepsegment, sentence-doctor, pysbd, sentence-splitter Sentence Segmentation
  wordninja Probabilistic Word Segmentation
  punctuator2 Punctuation Restoration
  stopwords-iso Stopwords for all languages
  language-check, langdetect, polyglot, pycld2, cld2, cld3, langid Language Identification
  neuralcoref Coreference Resolution
  inflect, lemminflect Inflections
  scrubadub PII removal
  ftfy, clean-text, text-unidecode Fix Unicode Issues
  fastpunct Punctuation Restoration
  pypostal, mordecai Parse Street Addresses
  python-phonenumbers Parse phone numbers
  numerizer, word2number Parse natural language numbers
  dateparser Parse natural dates
  emoji Handle emoji
  pyarabic multilingual
Tokenization sentencepiece, youtokentome, subword-nmt  
  sacremoses Rule-based
  jieba Chinese Word Segmentation
  kytea Japanese word segmentation
Thesaurus python-datamuse  
Feature Generation homer, textstat Readability scores
  LexicalRichness Lexical Richness Measure
Gibberish Detection nostril, gibberish-detector  
Paraphrasing pegasus Question Paraphrasing
  sentaugment Paraphrase mining
Spacy Extensions spacy-pattern-builder Generate dependency matcher patterns automatically
  spacy_grammar Rule-based grammar error detection
  role-pattern-builder Pattern based SRL
  textpipeliner Extract RDF triples
  tenseflow Convert tense of sentence
  camphr Wrapper to transformers, elmo, udify
  spleno Domain-specific lemmatization
  spacy-udpipe Use UDPipe from Spacy
Linguistics nodebox_linguistics_extended Verb Conjugation
Morphology unimorph Morphology data for many languages
Phonetics epitran Transliterate text into IPA
  allosaurus Recognize phones for 2000 languages
Phonology panphon Generate phonological feature representations
  phoible Database of segment inventories for 2186 languages
Typology lang2vec Compare typological features of languages
Word Sense Disambiguation pywsd  
Embeddings InferSent, embedding-as-service, bert-as-service, sent2vec, sense2vec, glove-python, fse  
  rank_bm25, BM25Transformer BM25
  sentence-transformers, DeCLUTR BERT sentence embeddings
  conceptnet-numberbatch Word embeddings trained with common-sense knowledge graph
  word2vec-twitter Word2vec trained on twitter
  pymagnitude Access word-embeddings programmatically
  chakin Download pre-trained word vectors
  zeugma Pretrained-word embeddings as scikit-learn transformers
  starspace Learn embeddings for anything
  svd2vec Learn embeddings from co-occurrence
  all-but-the-top Post-processing for word vectors
Cross-lingual Embeddings muse, laserembeddings, xlm, LaBSE  
  transvec Train mapping between monolingual embeddings
  MuRIL Embeddings for 17 Indic languages with transliteration
  BPEmb Subword Embeddings in 275 Languages
  piecelearn Train own sub-word embeddings
Multilingual support polyglot, trankit  
  inltk, indic_nlp Indic Languages
  cltk NLP for Latin and classical languages
Compact Models mobilebert, distilbert, tinybert, BERT-of-Theseus-MNLI, MiniLM  
Information Extraction claucy  
Knowledge conceptnet-lite  
  stanford-openie Knowledge Graphs
  verbnet-parser VerbNet parser
Domain-specific BERT codebert Code
  clinicalbert-mimicnotes, clinicalbert-discharge-summary Clinical Domain
  twitter-roberta-base twitter
  scispacy bio-medical data
Text Extraction textract (Image, Audio, PDF)  
Text Generation gpt2client, textgenrnn, gpt-2-simple, aitextgen GPT-2
  markovify Markov chains
Transliteration wiktra  
Machine Translation MarianMT, Opus-MT, joeynmt, OpenNMT  
  googletrans, word2word, translate-python, deep_translator Translation libraries
  translators Free calls to multiple translation APIs
  giza++, fastalign, simalign Word Alignment
Summarization textrank, pytldr, bert-extractive-summarizer, sumy, fast-pagerank, sumeval  
  doc2query Summarize document with queries
Question Generation question-generation, questiongen.ai Question Generation Pipeline for Transformers
Keyword extraction rake, pke, phrasemachine, keybert, word2phrase  
  pyate Automated Term Extraction
Question Answering haystack Build end-to-end QA system
  mcQA Multiple Choice Question Answering
  TAPAS Table Question Answering
Ranking transformer-rankers  
Search elasticsearch-dsl Wrapper for elastic search
  jina production-level neural semantic search
  meilisearch-python  
NLU snips-nlu  
Semantic parsing quepy  
Toxicity Detection detoxify  
Topic Modeling gensim, guidedlda, enstop, top2vec, contextualized-topic-models, corex_topic, lda2vec, bertopic, tomotopy, ToModAPI  
Code Switching codeswitch  
Clustering kmodes, star-clustering, genieclust  
  spherecluster K-means with cosine distance
  kneed Automatically find number of clusters from elbow curve
  OptimalCluster Automatically find optimal number of clusters
Metrics seqeval NER, POS tagging
  ranking-metrics Metrics for Information Retrieval
String match phrase-seeker, textsearch  
  jellyfish Perform string and phonetic comparison
  flashtext Super-fast extract and replace keywords
  pythonverbalexpressions Verbally describe regex
  commonregex Ready-made regex for email/phone etc.
  textdistance, editdistance, word-mover-distance Text distances
  wmd-relax Word mover distance for spacy
  fuzzywuzzy, spaczz, PolyFuzz, rapidfuzz, dedupe Fuzzy Search
Sentiment vaderSentiment Rule based
  absa Aspect Based Sentiment Analysis
Emotion Classification distilroberta-finetuned, goemotion-pytorch  
  emosent-py Sentiment scores for Emojis
Profanity detection profanity-check  
Visualization stylecloud Word Clouds
  scattertext Compare word usage across segments
  picture-text Interactive tree-maps for hierarchical clustering
Named Entity Recognition (NER) spaCy, Stanford NER, sklearn-crfsuite  
  med7 Spacy NER for medical records
Entity Linking dbpedia-spotlight, GENRE  
Entity Matching py_entitymatching, deepmatcher  
Fill blanks fitbert  
Dictionary vocabulary  
Nearest neighbor faiss, sparse_dot_topn  
Knowledge Distillation textbrewer, aquvitae  
Language Model Scoring lm-scorer, bertscore, kenlm, spacy_kenlm  
Record Linking fuzzymatcher  
Cross-lingual transfer learning langrank Auto-select optimal transfer language
Pronunciation pronouncing  
Dialogue System ParlAI  
Relation Extraction OpenNRE  
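
The "Text distances" row in the table above lists libraries for measures such as edit distance. As a rough illustration of what such a measure computes, the sketch below implements Levenshtein distance with the standard library only. It is not the API of textdistance or editdistance; the function names `levenshtein` and `similarity` are hypothetical, and the normalization by the longer string's length is just one common convention assumed here.

```python
def levenshtein(a, b):
    """Edit distance where insert, delete, and substitute each cost 1."""
    if len(a) < len(b):
        # Keep the shorter string in the inner loop to use less memory.
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Map the distance to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))  # 3
print(similarity("kitten", "sitting"))
```

The listed libraries are typically much faster than this pure-Python loop and cover many more measures, so the sketch is only meant to show the idea.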

Computer Vision

Category Tool Remarks
Image processing scikit-image, imutils, opencv-wrapper, opencv-python  
  torchio Medical Images
Segmentation Models segmentation_models Keras
  segmentation_models.pytorch Segmentation models in PyTorch
High-level libraries terran Face detection, recognition, pose estimation
Face recognition face_recognition, mtcnn  
  face-alignment Find facial landmarks
  Facial-Expression-Recognition.Pytorch Face Emotion
GANS mimicry, imaginaire  
Image Inpainting GAN Image Inpainting  
Face swapping faceit, faceit-live, avatarify  
Video summarization videodigest  
Semantic search over videos scoper  
OCR keras-ocr, pytesseract, keras-craft, ocropy, doc2text  
  easyocr, kraken, PaddleOCR Multilingual OCR
  layout-parser, pdftabextract OCR tables from document
Object detection luminoth, detectron2, mmdetection  
Image hashing ImageHash  

Speech

Category Tool Remarks
Libraries pyannote, librosa, espnet  
Speech Recognition kaldi, speech_recognition, delta  
Speech Synthesis festvox, cmuflite  
Feature Engineering python_speech_features Convert raw audio to features
Diarization resemblyzer  
Source Separation spleeter, nussl, open-unmix-pytorch, asteroid  

Recommendation System

Category Tool Remarks
Libraries xlearn, DeepCTR Factorization machines (FM), and field-aware factorization machines (FFM)
  lightfm, spotlight Popular Recsys algos
  tensorflow_recommenders Recommendation System in Tensorflow
Collaborative Filtering implicit  
Scikit-learn like API surprise  
Recommendation System in Pytorch CaseRecommender  
Apriori algorithm apyori  
Metrics rs_metrics  

Timeseries

Category Tool Remarks
Libraries prophet, tslearn, pyts, seglearn, cesium, stumpy, darts  
  sktime Scikit-learn like API
  atspy Automated time-series models
ARIMA models pmdarima  

Hyperparameter Optimization

Category Tool Remarks
General hyperopt, optuna, evol, talos  
Keras keras-tuner  
Scikit-learn hyperopt-sklearn, scikit-optimize Bayesian Optimization
  sklearn-deap Evolutionary algorithm
Parameter optimization ParameterImportance  

Phase: Validation

Experiment Monitoring

Category Tool Remarks
MLOps clearml, wandb, neptune.ai, replicate.ai  
Experiment tracking tensorboard, mlflow  
Learning curve lrcurve, livelossplot Plot realtime learning curve in Keras
Notification knockknock Get notified by slack/email
  jupyter-notify Notify when task is completed in jupyter
  apprise Notify to any platform
Progress bar fastprogress, tqdm  
GPU Usage gpumonitor, nvtop  
  jupyterlab-nvdashboard See GPU Usage in jupyterlab

Interpretability

Category Tool Remarks
Interpret models eli5, lime, shap, alibi, tf-explain, treeinterpreter, pybreakdown, xai, lofo-importance, interpretML, shapash  
  exbert Interpret BERT
  bertviz Explore self-attention in BERT
Interpret word2vec word2viz, whatlies  
Interpret NLP models Language Interpretability Tool  
Adversarial Attack cleverhans General
  foolbox Image
  triggers NLP

Visualization

Category Tool Remarks
Libraries matplotlib, seaborn, pygal, plotly, plotnine  
  yellowbrick, scikit-plot Visualization for scikit-learn
  pyldavis Visualize topic models
  dtreeviz Visualize decision tree
Interactive charts bokeh  
  flourish-studio Create interactive charts online
  mpld3 Matplotlib to D3 Converter
Model Visualization netron, nn-svg Architecture
  keract Activation maps for keras
  keras-vis Visualize keras models
Styling open-color Color Schemes
  mplcyberpunk Cyberpunk style for matplotlib
  chart.xkcd XKCD like charts
Generate graphs using markdown mermaid  
High dimensional visualization umap  
  ivis Ivis Algorithm
Animated charts bar_chart_race Bar chart race animation
  pandas_alive Animated charts in pandas
Tree-map chart squarify  
3D charts babyplots  

Phase: Production

Model Export

Category Tool Remarks
Cloud Storage Zenodo, Github Releases, OneDrive, Google Drive, Dropbox, S3, mega, DAGsHub, huggingface-hub  
Serialization sklearn-porter, m2cgen Transpile sklearn model to C, Java, JavaScript and others
  hummingbird Convert ML models to PyTorch
  cloudpickle, jsonpickle Pickle extensions
Dependencies pip-chill pip freeze without dependencies
  pipreqs Generate requirements.txt based on imports
  conda-pack Export conda for offline use
Benchmarking torchprof Profile pytorch layers
  scalene, pyinstrument Profile python code
  k6 Load test API
  ai-benchmark Benchmark VM on 19 different models
Distributed training horovod  
Data Pipeline pypeln  

Inference

Category Tool Remarks
Model Serving Frameworks cortex, torchserve, ray-serve, bentoml  
Dashboard streamlit Generate frontend with python
  gradio Fast UI generation for prototyping
  dash React Dashboard using Python
  voila Convert Jupyter notebooks into dashboard
  streamlit-drawable-canvas Drawable Canvas for Streamlit
  streamlit-terran-timeline Show timeline of faces in videos
  streamlit components Collection of streamlit components
API Frameworks flask  
  fastapi Automatic Docs and Validation
Configuration Management config, python-decouple  
Data Validation schema, jsonschema, cerberus, pydantic, marshmallow, validators  
CORS flask-cors CORS in Flask
Caching cachetools, cachew (cache to local sqlite)  
Authentication pyjwt (JWT)  
Task Queue rq, schedule, huey  
  mlq Queue ML Tasks in Flask
Job Scheduler airflow  
Database flask-sqlalchemy, tinydb, flask-pymongo, odmantic  
  tortoise-orm Asyncio ORM similar to Django
Logging loguru  
Testing schemathesis Automatic test generation from Swagger
  pytest-benchmark Profile time in pytest
  exdown Extract code from markdown files
  mktestdocs Test code present in markdown files
Data Logging & Monitoring whylogs  

Python libraries

Category Tool Remarks
Decorators retrying (retry some function)  
Subprocess delegator.py  
bloom filter python-bloomfilter  
Run python libraries in sandbox pipx  
Pretty print tables in CLI tabulate  
Leaflet maps from python folium  
Debugging PySnooper  
Date and Time pendulum  
Create interactive prompts prompt-toolkit  
Concurrent database pickleshare  
Async tomorrow  
Testing crosshair (find failure cases for functions)  
Virtual webcam pyfakewebcam  
CLI Formatting rich  
Control mouse and output device pynput  
Shell commands as functions sh  
Path-like interface to remote files pathy  
Standard Library Extension ubelt  
Improved doctest xdoctest  
Code to Maths latexify-py, handcalcs  
Multiprocessing filelock Lock files during access from multiple processes
Collections bidict Bidirectional dictionary
  munch Dictionary with dot access
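
For readers unfamiliar with the two "Collections" entries above, the following standard-library sketch shows the ideas behind a dictionary with dot access and a bidirectional dictionary. The classes `DotDict` and `BiDict` are hypothetical illustrations, not the APIs of munch or bidict, which offer far more features and stricter guarantees.

```python
class DotDict(dict):
    """Dictionary whose keys can also be read and written as attributes."""

    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        self[name] = value


class BiDict:
    """One-to-one mapping that can be looked up in both directions."""

    def __init__(self):
        self._forward = {}
        self._inverse = {}

    def __setitem__(self, key, value):
        # Refuse a value that already belongs to a different key.
        if value in self._inverse and self._inverse[value] != key:
            raise ValueError(f"value {value!r} is already mapped")
        # Drop the stale inverse entry when a key is reassigned.
        if key in self._forward:
            del self._inverse[self._forward[key]]
        self._forward[key] = value
        self._inverse[value] = key

    def __getitem__(self, key):
        return self._forward[key]

    def inverse(self, value):
        return self._inverse[value]


config = DotDict(lr=0.1)
config.epochs = 3
print(config.lr, config["epochs"])  # 0.1 3

labels = BiDict()
labels["cat"] = 0
labels["dog"] = 1
print(labels["dog"], labels.inverse(1))  # 1 dog
```

A typical use in ML code is keeping a label-to-index mapping that must be inverted when decoding predictions.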

Utilities

Category Tool Remarks
Database mlab Free 500 MB MongoDB
Trade-off tools egograph Find alternatives to anything
Data Visualization flourish-studio  
Linux ripgrep  
Colab colab-cli Manage Colab notebooks from the command line
Drive drive-cli Use google drive similar to git
Git gitjk Undo what you just did in git