Machine Learning Toolbox

This page contains useful libraries I’ve found when working on Machine Learning projects.

The libraries are organized below by phases of a typical Machine Learning project.

Phase: Data

Data Annotation

Category Tool Remarks
Image makesense.ai, labelimg  
Text doccano, dataturks, brat  
  prodigy Paid
  chatito Generate text datasets using DSL
Audio audio-annotator, audino  
General superintendent, pigeon Annotate in notebooks
  labelstudio Open Source Data Labeling Tool

Data Collection

Category Tool Remarks
Curations datasetlist, UCI, Google Dataset Search, fastai-datasets  
  huggingface-datasets, The Big Bad NLP Database, nlp-datasets NLP Datasets
  bifrost Vision Datasets
Words curse-words, badwords, LDNOOBW, 10K most common words, common-misspellings  
  wordlists Words organized by topic
  english-words A text file containing over 466k English words
Text Corpus project gutenberg, nlp-datasets, 1 trillion n-grams, litbank, BookCorpus, south-asian text corpus  
  opus, oscar (big multilingual corpus) Translation Parallel Text
  opensubtitles Movie subtitles parallel corpus
Sentiment SST2, Amazon Reviews, Yelp Reviews, Movie Reviews, Food Reviews, Twitter Airline, GOP Debate, Sentiment Lexicons for 81 languages, SentiWordNet, Opinion Lexicon, Wordstat words, Emoticon Sentiment, socialsent  
Emotion NRC-Emotion-Lexicon-Wordlevel, ISEAR(17K), HappyDB, emotion-to-emoji-mapping  
NLU Intents rasa-nlu-training-data  
N-grams google-book-ngrams  
Summarization curation-corpus  
Conversations conversational-datasets, cornell-movie-dialog-corpus, persona-chat  
Image 1 million fake faces, flickr-faces, objectnet, YFCC100m, USPS, Animal Faces-HQ dataset (AFHQ)  
  tiny-images, SVHN, STL-10, imagenette, CIFAR-10 Small image datasets for quick experimentation
  omniglot, mini-imagenet One Shot Learning
Paraphrasing PPDB  
Audio audioset YouTube audio with labels
Graphs Social Networks (Github, Facebook, Reddit)  
Handwriting iam-handwriting  
  text_renderer Generate synthetic OCR text

Importing Data

Category Tool Remarks
Prebuilt openml, lineflow  
  rs_datasets Recommendation Datasets
  nlp Python interface to NLP datasets
Audio pydub  
Video moviepy Edit Videos
  pytube Download YouTube videos
Image py-image-dataset-generator, idt, jmd-imagescraper Automatically fetch images from the web for a search query
News news-please, news-catcher Scrape news
  pygooglenews Google News
Lyrics lyricsgenius  
Email talon  
PDF camelot, tabula-py, parsr, pdftotext, pdfplumber, pymupdf  
Excel openpyxl  
Remote file smart_open  
Crawling MechanicalSoup, libextract  
  pyppeteer Chrome Automation
  hext DSL for extracting data from HTML
  ratelimit API rate limit decorator
Google sheets gspread  
Google drive gdown, pydrive  
Python API pydataset  
Google Maps geo-heatmap  
Text to Speech gtts  
Database blaze Pandas and Numpy interface to databases
Twitter twint, tweepy Scrape Twitter
App Store google-play-scraper  
Wikipedia wikipedia Access data from Wikipedia
Google Ngrams google-ngram-downloader  
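
When crawling or calling APIs with the tools above, it helps to throttle requests. The sketch below shows the idea behind a rate limit decorator such as ratelimit using only the standard library. It is an illustration under my own assumptions, not that library's API: the name `rate_limited` and its `clock` and `sleep` parameters are hypothetical.

```python
import time
from functools import wraps


def rate_limited(max_calls, period, clock=time.monotonic, sleep=time.sleep):
    """Allow at most `max_calls` calls per `period` seconds, blocking when over the limit.

    `clock` and `sleep` are injectable (hypothetical parameters) so the
    behaviour can be tested without real waiting.
    """
    def decorator(func):
        calls = []  # timestamps of calls inside the current window

        def drop_expired(now):
            # Forget calls that have fallen out of the sliding window.
            while calls and now - calls[0] >= period:
                calls.pop(0)

        @wraps(func)
        def wrapper(*args, **kwargs):
            now = clock()
            drop_expired(now)
            if len(calls) >= max_calls:
                # Wait until the oldest call leaves the window.
                sleep(period - (now - calls[0]))
                now = clock()
                drop_expired(now)
            calls.append(now)
            return func(*args, **kwargs)

        return wrapper
    return decorator


@rate_limited(max_calls=5, period=1.0)
def fetch(url):
    # Placeholder for a real request; returns the URL unchanged.
    return url
```

Calling `fetch` more than five times within a second makes the sixth call wait until the window frees up.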

Data Augmentation

Category Tool Remarks
Text nlpaug, noisemix, textattack, textaugment, niacin, SeaQuBe  
Image imgaug, albumentations, augmentor, solt  
Audio audiomentations, muda  
OCR data TextRecognitionDataGenerator  
Tabular data deltapy  
Automatic augmentation deepaugment Image

Phase: Exploration

Data Preparation

Category Tool Remarks
Dataframe cudf Pandas on GPU
Missing values missingno  
Split images into train/validation/test split-folders  
Class Imbalance imblearn  
Categorical encoding category_encoders  
Numerical data numerizer, word2number Parse natural-language numbers
Data Validation pandera, pandas-profiling Pandas
Data Cleaning pyjanitor Janitor ported to python
Parsing pyparsing, parse  
Natural date parser dateparser  
Unicode text-unidecode  
Emoji emoji  
Weak Supervision snorkel  
Graph Sampling little ball of fur  

Data Exploration

Category Tool Remarks
Explore Data sweetviz, dataprep, quickda Generate quick visualizations of data
Notebook Tools nbdime View Jupyter notebooks through CLI
  papermill Parametrize notebooks
  nbformat Access notebooks programmatically
  nbconvert Convert notebooks to other formats
  ipyleaflet Maps in notebooks
Relationship ppscore Predictive Power Score

Phase: Feature Engineering

Feature Generation

Category Tool Remarks
Automatic feature engineering featuretools, autopandas  
  tsfresh Automatic feature engineering for time series
Metric learning metric-learn, pytorch-metric-learning  
Time series python-holidays List of holidays
  skits Transformation for time-series data
  catch22 Pre-built features for time-series data
DAG based dataset generation DFFML  
Dimensionality reduction fbpca, fitsne, trimap  

Phase: Modeling

Model Selection

Category Tool Remarks
Find SOTA models sotawhat, papers-with-code  
  bert-related-papers BERT Papers
  acl-explorer ACL Publications Explorer
  survey-papers Collection of survey papers
Pretrained models modeldepot, pytorch-hub General
  pretrained-models.pytorch, pytorchcv Pre-trained ConvNets
  pytorch-image-models 200+ pretrained ConvNet backbones
  huggingface-models, huggingface-pretrained Transformer Models
  huggingface-languages Multi-lingual Models
  model-forge, The Super Duper NLP Repo Pre-trained NLP models by use case
AutoML auto-sklearn, mljar-supervised, automl-gs, pycaret, evalml  
  lazypredict Run all sklearn models at once
  tpot Genetic AutoML
  autocat Auto-generate text classification models in spacy
  mindsdb, ludwig Autogenerate ML code
Gradient Boosting catboost, ngboost  
  lightgbm, thundergbm GPU Capable
Hidden Markov Models hmmlearn  
Genetic Programming gplearn  
Active Learning modal  
Support Vector Machines thundersvm Run SVM on GPU
Rule based classifier sklearn-expertsys  
Probabilistic modeling pomegranate, pymc3  
Graph Embedding and Community Detection karateclub, python-louvain  
Anomaly detection adtk  
Spiking Neural Network norse  
Fuzzy Learning fylearn, scikit-fuzzy  
Noisy Label Learning cleanlab  
Few Shot Learning keras-fewshotlearning  
Deep Clustering deep-clustering-toolbox  
Graph Neural Networks spektral GNN for Keras
Contrastive Learning contrastive-learner  
Gradient Free Optimization nevergrad  

Natural Language Processing

Category Tool Remarks
Libraries spacy, nltk, corenlp, deeppavlov, kashgari, transformers, ernie, stanza  
  headliner, txt2txt Sequence to sequence models
  Nvidia NeMo Toolkit for ASR, NLP and TTS
Wrappers fast-bert, simpletransformers  
  finetune Scikit-learn like API for transformers
Preprocessing textacy  
  JamSpell, pyhunspell, pyspellchecker, cython_hunspell, hunspell-dictionaries, autocorrect (can add more languages), symspellpy, spello (train your own spelling correction), contextualSpellCheck, neuspell Spelling Correction
  contractions, pycontractions Contraction Mapping
  truecase Fix casing
  nnsplit, deepsegment, sentence-doctor, pysbd Sentence Segmentation
  wordninja Probabilistic Word Segmentation
  stopwords-iso Stopwords for all languages
  language-check, langdetect, polyglot, pycld2, cld2, cld3 Language Detection
  neuralcoref Coreference Resolution
  inflect, lemminflect Inflections
  scrubadub PII removal
  ftfy, clean-text Fix Unicode Issues
  fastpunct Punctuation Restoration
  pypostal, mordecai Parse Street Addresses
Tokenization sentencepiece, youtokentome, subword-nmt  
  sacremoses Rule-based
  jieba Chinese Word Segmentation
Paraphrasing pegasus Question Paraphrasing
  sentaugment Paraphrase mining
Spacy Extensions spacy-pattern-builder Generate dependency matcher patterns automatically
  spacy_grammar Rule-based grammar error detection
  role-pattern-builder Pattern based SRL
  textpipeliner Extract RDF triples
  tenseflow Convert tense of sentence
  camphr Wrapper to transformers, elmo, udify
  spleno Domain-specific lemmatization
Linguistics nodebox_linguistics_extended Verb Conjugation
Word Sense Disambiguation pywsd  
Embeddings InferSent, embedding-as-service, bert-as-service, sent2vec, sense2vec, BM25Transformer, glove-python, fse  
  sentence-transformers, DeCLUTR BERT sentence embeddings
  pymagnitude Access word-embeddings programmatically
  chakin Download pre-trained word vectors
  zeugma Pretrained-word embeddings as scikit-learn transformers
  starspace Learn embeddings for anything
Cross-lingual Embeddings muse, laserembeddings, xlm, LaBSE  
  MuRIL Embeddings for 17 indic languages with transliteration
  BPEmb Subword Embeddings in 275 Languages
  piecelearn Train own sub-word embeddings
Multilingual support polyglot  
  inltk, indic_nlp Indic Languages
Compact Models mobilebert, distilbert, tinybert, BERT-of-Theseus-MNLI, MiniLM  
Knowledge conceptnet-lite  
  stanford-openie Knowledge Graphs
  verbnet-parser VerbNet parser
Domain-specific BERT codebert Code
  clinicalbert-mimicnotes, clinicalbert-discharge-summary Clinical Domain
Scientific Domain scispacy Spacy for bio-medical data
Text Extraction textract (Image, Audio, PDF)  
Text Generation gp2client, textgenrnn, gpt-2-simple, aitextgen GPT-2
  markovify Markov chains
Machine Translation MarianMT, Opus-MT  
  googletrans, word2word, translate-python, deep_translator Translation libraries
  translators Free calls to multiple translation APIs
Summarization textrank, pytldr, bert-extractive-summarizer, sumy, fast-pagerank, sumeval  
  doc2query Summarize document with queries
Question Generation question-generation, questiongen.ai Question Generation Pipeline for Transformers
Keyword extraction rake, pke, phrasemachine  
  pyate Automated Term Extraction
Multiple Choice Question Answering mcQA  
Ranking transformer-rankers  
Search elasticsearch-dsl Wrapper for Elasticsearch
  jina Production-level neural semantic search
NLU snips-nlu  
Semantic parsing quepy  
Readability homer  
Topic Modeling guidedlda, enstop, top2vec, contextualized-topic-models, corex_topic, lda2vec, bertopic  
Clustering kmodes, star-clustering  
  spherecluster K-means with cosine distance
  kneed Automatically find number of clusters from elbow curve
  OptimalCluster Automatically find optimal number of clusters
Metrics seqeval NER, POS tagging
  ranking-metrics Metrics for Information Retrieval
String match phrase-seeker, textsearch  
  jellyfish Perform string and phonetic comparison
  flashtext Super-fast extract and replace keywords
  pythonverbalexpressions Verbally describe regex
  commonregex Ready-made regex for email/phone etc.
  textdistance, editdistance, word-mover-distance Text distances
  wmd-relax Word mover distance for spacy
  fuzzywuzzy, spaczz Fuzzy Search
Sentiment vaderSentiment Rule based
  absa Aspect Based Sentiment Analysis
Emotion Classification distilroberta-finetuned, goemotion-pytorch  
  emosent-py Sentiment scores for Emojis
Profanity detection profanity-check  
Visualization stylecloud Word Clouds
  scattertext Compare word usage across segments
  picture-text Interactive tree-maps for hierarchical clustering
Named Entity Recognition (NER) spaCy, Stanford NER, sklearn-crfsuite  
  med7 Spacy NER for medical records
Fill blanks fitbert  
Dictionary vocabulary  
Nearest neighbor faiss  
Knowledge Distillation textbrewer, aquvitae  
Language Model Scoring lm-scorer, bertscore, kenlm, spacy_kenlm  
Record Linking fuzzymatcher  
Cross-lingual transfer learning langrank Auto-select optimal transfer language
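
Several of the string-matching tools above (textdistance, editdistance) compute edit distances between strings. As a rough illustration of what such a distance measures, here is a standard-library sketch of the Levenshtein distance. The function name `levenshtein` is my own and is not taken from any of these libraries.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn `a` into `b`."""
    # prev[j] is the distance between the prefix of `a` seen so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep on a match
            ))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))
```

The dedicated libraries are the better choice in practice since they are optimized and offer many more distance measures; this sketch only shows the underlying dynamic-programming idea.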

Computer Vision

Category Tool Remarks
Image processing scikit-image, imutils, opencv-wrapper, opencv-python  
Segmentation Models segmentation_models Keras
High-level libraries terran Face detection, recognition, pose estimation
Face recognition face_recognition, mtcnn  
  face-alignment Find facial landmarks
  Facial-Expression-Recognition.Pytorch Face Emotion
GANs mimicry, imaginaire  
Image Inpainting GAN Image Inpainting  
Face swapping faceit, faceit-live, avatarify  
Video summarization videodigest  
Semantic search over videos scoper  
OCR keras-ocr, pytesseract, keras-craft  
  easyocr 40+ languages
Object detection luminoth, detectron2, mmdetection  
Image hashing ImageHash  

Audio

Category Tool Remarks
Library speech_recognition, pyannotate, librosa  
Diarization resemblyzer  
Source Separation spleeter, nussl, open-unmix-pytorch, asteroid  

Recommendation System

Category Tool Remarks
Libraries xlearn, DeepCTR Factorization machines (FM), and field-aware factorization machines (FFM)
  lightfm, spotlight Popular Recsys algos
  tensorflow_recommenders Recommendation System in Tensorflow
Collaborative Filtering implicit  
Scikit-learn like API surprise  
Recommendation System in Pytorch CaseRecommender  
Apriori algorithm apyori  
Metrics rs_metrics  

Timeseries

Category Tool Remarks
Libraries prophet, tslearn, pyts, seglearn, cesium, stumpy, darts  
  sktime Scikit-learn like API
  atspy Automated time-series models
ARIMA models pmdarima  

Framework extensions

Category Tool Remarks
Addons mlxtend Extra utilities not present in frameworks
  tensor-sensor Visualize tensors
Pytorch pytorch-summary Keras-like summary
  skorch Wrap pytorch in scikit-learn compatible API
  pytorch-lightning Lightweight wrapper for PyTorch
  einops Einstein Notation
  kornia Computer Vision Methods
  torchcontrib SOTA Building Blocks in PyTorch
  pytorch-optimizer Collection of optimizers
  pytorch-block-sparse Sparse matrix replacement for nn.Linear
Scikit-learn scikit-lego, iterative-stratification  
  tscv Time-series cross-validation
  iterstrat Cross-validation for multi-label data
  scikit-multilearn Multi-label classification
Keras tf-sha-rnn  
  keras-radam RADAM optimizer
  scikeras Scikit-learn Wrapper for Keras
  larq Binarized neural networks
  ktrain FastAI like interface for keras
  tavolo Kaggle Tricks as Keras Layers
Tensorflow tensorflow-addons  

Phase: Validation

Model Training Monitoring

Category Tool Remarks
Learning curve lrcurve, livelossplot Plot realtime learning curve in Keras
Notification knockknock Get notified by slack/email
  jupyter-notify Notify when task is completed in jupyter
  apprise Notify to any platform
Progress bar fastprogress, tqdm  
GPU Usage gpumonitor  
  jupyterlab-nvdashboard See GPU Usage in jupyterlab

Interpretability

Category Tool Remarks
Interpret models eli5, lime, shap, alibi, tf-explain, treeinterpreter, pybreakdown, xai, lofo-importance, interpretML  
  exbert Interpret BERT
  bertviz Explore self-attention in BERT
Interpret word2vec word2viz, whatlies  

Visualization

Category Tool Remarks
Libraries pygal, plotly, plotnine  
  yellowbrick, scikit-plot Visualization for scikit-learn
  pyldavis Visualize topic models
  dtreeviz Visualize decision tree
Interactive charts bokeh  
  flourish-studio Create interactive charts online
  mpld3 Matplotlib to D3 Converter
Model Visualization netron, nn-svg Architecture
  keract Activation maps for keras
  keras-vis Visualize keras models
Styling open-color Color Schemes
  mplcyberpunk Cyberpunk style for matplotlib
  chart.xkcd XKCD like charts
Generate graphs using markdown mermaid  
High dimensional visualization umap  
  ivis Ivis Algorithm
Bar chart race animation bar_chart_race  

Phase: Optimization

Hyperparameter Optimization

Category Tool Remarks
General hyperopt, optuna, evol, talos  
Keras keras-tuner  
Scikit-learn hyperopt-sklearn, scikit-optimize Bayesian Optimization
  sklearn-deap Evolutionary algorithm
Parameter optimization ParameterImportance  

Phase: Production

Model Serialization

Category Tool Remarks
Transpiling sklearn-porter, m2cgen Transpile sklearn model to C, Java, JavaScript and others
  hummingbird Convert ML models to PyTorch
Pickling extended cloudpickle, jsonpickle  
Dependencies pip-chill pip freeze without dependencies
  pipreqs Generate requirements.txt based on imports

Scalability

Category Tool Remarks
Parallelize Pandas pandarallel, swifter, modin  
Pandas on Huge data vaex  
Parallelize numpy operations numba  
Distributed training horovod  
Data Pipeline pypeln  

Benchmarking

Category Tool Remarks
Profile pytorch layers torchprof  
Profile python code scalene  
Load testing k6  
Monitor GPU usage nvtop  
Benchmark Machine ai-benchmark Benchmark latency on 19 different models

API

Category Tool Remarks
API Frameworks flask  
  fastapi Automatic Docs and Validation
Configuration Management config, python-decouple  
Data Validation schema, jsonschema, cerberus, pydantic, marshmallow, validators  
CORS flask-cors CORS in Flask
Caching cachetools, cachew (cache to local sqlite)  
Authentication pyjwt (JWT)  
Task Queue rq, schedule, huey  
  mlq Queue ML Tasks in Flask
Database flask-sqlalchemy, tinydb, flask-pymongo, odmantic  
Logging loguru  
Testing schemathesis Automatic test generation from Swagger
Environment Management conda-pack Export conda for offline use

Dashboard

Category Tool Remarks
Libraries streamlit Generate frontend with python
  gradio Fast UI generation for prototyping
  dash React Dashboard using Python
  voila Convert Jupyter notebooks into dashboard
streamlit streamlit-drawable-canvas Drawable Canvas for Streamlit
  streamlit-terran-timeline Show timeline of faces in videos

Testing

Category Tool Remarks
Generate images to fool model foolbox  
Generate phrases to fool NLP models triggers  
General cleverhans  
pytest pytest-benchmark Profile time in tests

Python libraries

Category Tool Remarks
Decorators retrying (retry some function)  
Subprocess delegator.py  
Bloom filter python-bloomfilter  
Run python libraries in sandbox pipx  
Pretty print tables in CLI tabulate  
Leaflet maps from python folium  
Debugging PySnooper  
Date and Time pendulum  
Create interactive prompts prompt-toolkit  
Concurrent database pickleshare  
Async tomorrow  
Testing crosshair (find failure cases for functions)  
Virtual webcam pyfakewebcam  
CLI Formatting rich  
Control and monitor mouse and keyboard pynput  
Shell commands as functions sh  
Standard Library Extension ubelt  
Improved doctest xdoctest  
Code to Maths latexify-py, handcalcs  
Multiprocessing filelock Lock files during access from multiple processes
Bidirectional dictionary bidict  
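
To show what a retry decorator like the one listed under Decorators does, here is a minimal standard-library sketch. It is not the retrying library's API: the name `retry` and the parameters `max_attempts`, `wait_seconds`, `exceptions`, and `sleep` are assumptions made for this example.

```python
import time
from functools import wraps


def retry(max_attempts=3, wait_seconds=0.0, exceptions=(Exception,), sleep=time.sleep):
    """Re-run the wrapped function when it raises one of `exceptions`.

    Gives up and re-raises after `max_attempts` failed attempts.
    `sleep` is injectable (hypothetical parameter) to make testing easy.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    sleep(wait_seconds)  # pause before the next attempt
        return wrapper
    return decorator


@retry(max_attempts=3, wait_seconds=0.0, exceptions=(ConnectionError,))
def load_remote_config():
    # Placeholder body; a real function would make a network call here.
    return {"status": "ok"}
```

Restricting `exceptions` to the errors you expect to be transient keeps genuine bugs from being silently retried.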

Utilities

Category Tool Remarks
Database mlab Free 500 MB MongoDB
Trade-off tools egograph Find alternatives to anything
Data Visualization flourish-studio  

Workflow

Category Tool Remarks
Linux ripgrep  
Colab colab-cli Manage Colab notebooks from the command line
Git gitjk Undo what you just did in git