Machine Learning Toolbox

This page contains useful libraries I’ve found when working on Machine Learning projects.

The libraries are organized below by phases of a typical Machine Learning project.

Phase: Data

Data Annotation

Category Tool Remarks
Image makesense.ai  
Text doccano, dataturks, brat  
  prodigy Paid
Audio audio-annotator, audiono  
General superintendent Label in notebooks
  labelstudio Open Source Data Labeling Tool

Data Collection

Category Tool Remarks
Curations datasetlist, UCI, Google Dataset Search, fastai-datasets  
  huggingface-datasets, The Big Bad NLP Database, nlp-datasets NLP Datasets
Words curse-words, badwords, LDNOOBW, 10K most common words, common-misspellings  
  wordlists Words organized by topic
  english-words A text file containing over 466k English words
Text Corpus project gutenberg, oscar (big multilingual corpus), nlp-datasets, 1 trillion n-grams, litbank, BookCorpus, south-asian text corpus  
Sentiment SST2, Amazon Reviews, Yelp Reviews, Movie Reviews, Food Reviews, Twitter Airline, GOP Debate, Sentiment Lexicons for 81 languages, SentiWordNet, Opinion Lexicon, Wordstat words, Emoticon Sentiment  
Emotion NRC-Emotion-Lexicon-Wordlevel, ISEAR(17K), HappyDB  
NLU Intents rasa-nlu-training-data  
N-grams google-book-ngrams  
Summarization curation-corpus  
Conversations conversational-datasets, cornell-movie-dialog-corpus  
Image 1 million fake faces, flickr-faces, objectnet, YFCC100m, USPS, Animal Faces-HQ dataset (AFHQ)  
  tiny-images, SVHN, STL-10, imagenette, CIFAR-10 Small image datasets for quick experimentation
  omniglot, mini-imagenet One Shot Learning
Paraphrasing PPDB  
Audio audioset YouTube audio with labels
Graphs Social Networks (Github, Facebook, Reddit)  
Handwriting iam-handwriting  

Importing Data

Category Tool Remarks
Prebuilt openml, lineflow  
  rs_datasets Recommendation Datasets
  nlp Python interface to NLP datasets
Audio pydub  
Video moviepy Edit Videos
  pytube Download YouTube videos
Image py-image-dataset-generator Auto fetch images from web for certain search
News news-please, news-catcher Scrape news
  pygooglenews Google News
Lyrics lyricsgenius  
Email talon  
PDF camelot, tabula-py, parsr, pdftotext, pdfplumber  
Excel openpyxl  
Remote file smart_open  
Crawling MechanicalSoup, libextract  
  pyppeteer Chrome Automation
Google sheets gspread  
Google drive gdown, pydrive  
Python API pydataset  
Google Maps geo-heatmap  
Text to Speech gtts  
Database blaze Pandas and Numpy interface to databases
Twitter twint, tweepy Scrape Twitter
App Store google-play-scraper  
Wikipedia wikipedia Access data from wikipedia

Data Augmentation

Category Tool Remarks
Text nlpaug, noisemix, textattack, textaugment, niacin  
Image imgaug, albumentations, augmentor, solt  
Audio audiomentations, muda  
OCR data TextRecognitionDataGenerator  
Tabular data deltapy  
Automatic augmentation deepaugment Image

Phase: Exploration

Data Preparation

Category Tool Remarks
Dataframe cudf Pandas on GPU
Missing values missingno  
Split images into train/validation/test split-folders  
Class Imbalance imblearn  
Categorical encoding category_encoders  
Numerical data numerizer Parse natural language numbers
Data Validation pandera, pandas-profiling Pandas
Data Cleaning pyjanitor Janitor ported to python
Parsing pyparsing, parse  
Natural date parser dateparser  
Unicode text-unidecode  
Emoji emoji  
Weak Supervision snorkel  
Graph Sampling little ball of fur  

Data Exploration

Category Tool Remarks
Explore Data sweetviz, dataprep, quickda Generate quick visualizations of data
Notebook Tools nbdime View Jupyter notebooks through CLI
  papermill Parametrize notebooks
  nbformat Access notebooks programmatically
  nbconvert Convert notebooks to other formats
Jupyter Extensions ipyleaflet Maps in notebooks
Relationship ppscore Predictive Power Score

Phase: Feature Engineering

Feature Generation

Category Tool Remarks
Automatic feature engineering featuretools, autopandas  
  tsfresh Automatic feature engineering for time series
Metric learning metric-learn, pytorch-metric-learning  
Time series python-holidays List of holidays
  skits Transformation for time-series data
  catch22 Pre-built features for time-series data
DAG based dataset generation DFFML  

Dimensionality reduction

Category Tool Remarks
Dimensionality reduction fbpca, fitsne  

Phase: Modeling

Model Selection

Category Tool Remarks
Find SOTA models sotawhat, papers-with-code  
  bert-related-papers BERT Papers
  acl-explorer ACL Publications Explorer
Pretrained models modeldepot, pytorch-hub General
  pretrained-models.pytorch Pre-trained ConvNets
  huggingface-models Transformer Models
  huggingface-languages Multi-lingual Models
AutoML auto-sklearn, mljar-supervised, automl-gs, pycaret  
  lazypredict Run all sklearn models at once
  tpot Genetic AutoML
  autocat Auto-generate text classification models in spacy
Autogenerate ML code mindsdb, ludwig  
ML from command line (or Python or HTTP) DFFML  
Gradient Boosting catboost, ngboost  
  lightgbm, thundergbm GPU Capable
Hidden Markov Models hmmlearn  
Genetic Programming gplearn  
Active Learning modal  
Support Vector Machines thundersvm Run SVM on GPU
Rule based classifier sklearn-expertsys  
Probabilistic modeling pomegranate, pymc3  
Graph Embedding and Community Detection karateclub, python-louvain  
Anomaly detection adtk  
Spiking Neural Network norse  
Fuzzy Learning fylearn, scikit-fuzzy  
Noisy Label Learning cleanlab  
Few Shot Learning keras-fewshotlearning  
Deep Clustering deep-clustering-toolbox  
Graph Neural Networks spektral GNN for Keras
Contrastive Learning contrastive-learner  
Gradient Free Optimization nevergrad  

Natural Language Processing

Category Tool Remarks
Libraries spacy, nltk, corenlp, deeppavlov, kashgari, camphr (spacy plugin for transformers, elmo, udify), transformers, ernie, stanza  
  headliner, txt2txt Sequence to sequence models
Wrappers fast-bert, simpletransformers  
  finetune Scikit-learn like API for transformers
Preprocessing textacy  
  JamSpell, pyhunspell, pyspellchecker, cython_hunspell, hunspell-dictionaries, autocorrect (can add more languages), symspellpy, spello (train your own spelling correction), contextualSpellCheck Spelling Correction
  contractions, pycontractions Contraction Mapping
  truecase Fix casing
  nnsplit, deepsegment Sentence Segmentation
  stopwords-iso Stopwords for all languages
  language-check Grammar Checking
  neuralcoref Coreference Resolution
  inflect, lemminflect Inflections
  scrubadub PII removal
  ftfy, clean-text Fix Unicode Issues
  fastpunct Punctuation Restoration
Tokenization sentencepiece, youtokentome, subword-nmt  
  jieba Chinese Word Segmentation
Embeddings InferSent, bert-as-service, sent2vec, sense2vec, BM25Transformer, glove-python, fse  
  sentence-transformers BERT sentence embeddings
  pymagnitude Access word-embeddings programmatically
  chakin Download pre-trained word vectors
  zeugma Pretrained-word embeddings as scikit-learn transformers
  starspace Learn embeddings for anything
Cross-lingual Embeddings muse, laserembeddings, xlm, LaBSE  
  BPEmb Subword Embeddings in 275 Languages
Multilingual support polyglot  
  inltk, indic_nlp Indic Languages
Knowledge conceptnet-lite  
  stanford-openie Knowledge Graphs
Domain-specific BERT codebert Code
  clinicalbert-mimicnotes, clinicalbert-discharge-summary Clinical Domain
Scientific Domain scispacy Spacy for bio-medical data
Text Extraction textract (Image, Audio, PDF)  
Text Generation gpt2client, textgenrnn, gpt-2-simple, aitextgen GPT-2
  markovify Markov chains
Machine Translation MarianMT  
  googletrans, word2word, translate-python Translation libraries
Summarization textrank, pytldr, bert-extractive-summarizer, sumy, fast-pagerank, sumeval  
Question Generation question-generation Question Generation Pipeline for Transformers
Keyword extraction rake, pke, phrasemachine  
  pyate Automated Term Extraction
Multiple Choice Question Answering mcQA  
Ranking transformer-rankers  
Search elasticsearch-dsl Wrapper for elastic search
  jina Production-level neural semantic search
NLU snips-nlu  
Semantic parsing quepy  
Readability homer  
Topic Modeling guidedlda, enstop, top2vec, contextualized-topic-models, corex_topic, lda2vec  
Clustering kmodes, star-clustering  
  spherecluster K-means with cosine distance
  kneed Automatically find number of clusters from elbow curve
  OptimalCluster Automatically find optimal number of clusters
Metrics seqeval NER, POS tagging
String match phrase-seeker, textsearch  
  jellyfish Perform string and phonetic comparison
  flashtext Super-fast extract and replace keywords
  pythonverbalexpressions Verbally describe regex
  commonregex Ready-made regex for email/phone etc.
  textdistance, editdistance, word-mover-distance Text distances
  wmd-relax Word mover distance for spacy
  fuzzywuzzy, spaczz Fuzzy Search
Sentiment vaderSentiment Rule based
  absa Aspect Based Sentiment Analysis
Emotion Classification distilroberta-finetuned, goemotion-pytorch  
Profanity detection profanity-check  
Visualization stylecloud Word Clouds
  scattertext Compare word usage across segments
Named Entity Recognition (NER) spaCy, Stanford NER, sklearn-crfsuite  
  med7 Spacy NER for medical records
Fill blanks fitbert  
Dictionary vocabulary  
Nearest neighbor faiss  
Knowledge Distillation textbrewer, aquvitae  
Language Model Scoring lm-scorer, bertscore, kenlm  
Record Linking fuzzymatcher  

Computer Vision

Category Tool Remarks
Pretrained models pytorchcv  
Image processing scikit-image, imutils  
Segmentation Models segmentation_models Keras
Face recognition face_recognition  
  face-alignment Find facial landmarks
GANs mimicry  
Face swapping faceit, faceit-live, avatarify  
Video summarization videodigest  
Semantic search over videos scoper  
OCR keras-ocr, pytesseract, keras-craft  
  easyocr 40+ languages
Object detection luminoth, detectron2, mmdetection  
Image hashing ImageHash  

Audio

Category Tool Remarks
Library speech_recognition, pyannote-audio, librosa  
Diarization resemblyzer  
Source Separation spleeter, nussl, open-unmix-pytorch, asteroid  

Recommendation System

Category Tool Remarks
Libraries xlearn, DeepCTR Factorization machines (FM), and field-aware factorization machines (FFM)
  lightfm, spotlight Popular Recsys algos
Collaborative Filtering implicit  
Scikit-learn like API surprise  
Recommendation System in Pytorch CaseRecommender  
Apriori algorithm apyori  
Metrics rs_metrics  

Timeseries

Category Tool Remarks
Libraries prophet, tslearn, pyts, seglearn, cesium, stumpy, darts  
  sktime Scikit-learn like API
  atspy Automated time-series models
ARIMA models pmdarima  

Framework extensions

Category Tool Remarks
Addons mlxtend Extra utilities not present in frameworks
Pytorch pytorch-summary Keras-like summary
  skorch Wrap pytorch in scikit-learn compatible API
  pytorch-lightning Lightweight wrapper for PyTorch
  einops Einstein Notation
  kornia Computer Vision Methods
  torchcontrib SOTA Building Blocks in PyTorch
  pytorch-optimizer Collection of optimizers
Scikit-learn scikit-lego, iterative-stratification  
  tscv Time-series cross-validation
  iterstrat Cross-validation for multi-label data
Keras tf-sha-rnn  
  keras-radam RADAM optimizer
  scikeras Scikit-learn Wrapper for Keras
  larq Binarized neural networks
  ktrain FastAI like interface for keras
  tavolo Kaggle Tricks as Keras Layers
Tensorflow tensorflow-addons  

Phase: Validation

Model Training Monitoring

Category Tool Remarks
Learning curve lrcurve, livelossplot Plot realtime learning curve in Keras
Notification knockknock Get notified by slack/email
  jupyter-notify Notify when task is completed in jupyter
Progress bar fastprogress, tqdm  
GPU Usage gpumonitor  
  jupyterlab-nvdashboard See GPU Usage in jupyterlab

Interpretability

Category Tool Remarks
Interpret models eli5, lime, shap, alibi, tf-explain, treeinterpreter, pybreakdown, xai, lofo-importance, interpretML  
Interpret BERT exbert  
  bertviz Explore self-attention in BERT
Interpret word2vec word2viz, whatlies  

Visualization

Category Tool Remarks
Libraries pygal, plotly, plotnine  
  yellowbrick, scikit-plot Visualization for scikit-learn
  pyldavis Visualize topic models
Interactive charts bokeh  
  flourish-studio Create interactive charts online
  mpld3 Matplotlib to D3 Converter
Model Visualization netron, nn-svg Architecture
  keract Activation maps for keras
  keras-vis Visualize keras models
Styling open-color Color Schemes
  mplcyberpunk Cyberpunk style for matplotlib
  chart.xkcd XKCD like charts
Generate graphs using markdown mermaid  
High dimensional visualization umap  
  ivis Ivis Algorithm
Bar chart race animation bar_chart_race  

Phase: Optimization

Hyperparameter Optimization

Category Tool Remarks
General hyperopt, optuna, evol, talos  
Keras keras-tuner  
Scikit-learn hyperopt-sklearn Bayesian Optimization
  sklearn-deap Evolutionary algorithm
Parameter optimization ParameterImportance  

Phase: Production

Model Serialization

Category Tool Remarks
Transpiling sklearn-porter, m2cgen Transpile sklearn model to C, Java, JavaScript and others
  hummingbird Convert ML models to PyTorch
Pickling extended cloudpickle, jsonpickle  

Scalability

Category Tool Remarks
Parallelize Pandas pandarallel, swifter, modin  
Pandas on Huge data vaex  
Parallelize numpy operations numba  
Distributed training horovod  

Benchmark

Category Tool Remarks
Profile pytorch layers torchprof  
Load testing k6  
Monitor GPU usage nvtop  

API

Category Tool Remarks
API Frameworks flask  
  fastapi Automatic Docs and Validation
Configuration Management config, python-decouple  
Data Validation schema, jsonschema, cerberus, pydantic, marshmallow, validators  
Enable CORS in Flask flask-cors  
Caching cachetools, cachew (cache to local sqlite)  
Authentication pyjwt (JWT)  
Task Queue rq, schedule, huey  
  mlq Queue ML Tasks in Flask
Database flask-sqlalchemy, tinydb, flask-pymongo  
Logging loguru  
Testing schemathesis Automatic test generation from Swagger

Dashboard

Category Tool Remarks
Libraries streamlit Generate frontend with python
  gradio Fast UI generation for prototyping
  dash React Dashboard using Python
  voila Convert Jupyter notebooks into dashboard

Adversarial testing

Category Tool Remarks
Generate images to fool model foolbox  
Generate phrases to fool NLP models triggers  
General cleverhans  

Python libraries

Category Tool Remarks
Decorators retrying (retry some function)  
Bloom filter python-bloomfilter  
Run python libraries in sandbox pipx  
Pretty print tables in CLI tabulate  
Leaflet maps from python folium  
Debugging PySnooper  
Date and Time pendulum  
Create interactive prompts prompt-toolkit  
Concurrent database pickleshare  
Async tomorrow  
Testing crosshair (find failure cases for functions)  
Virtual webcam pyfakewebcam  
CLI Formatting rich  
Control mouse and keyboard input pynput  
Shell commands as functions sh  
Standard Library Extension ubelt  
Improved doctest xdoctest  
Code to Maths latexify-py, handcalcs  
Multiprocessing filelock Lock files during access from multiple processes

Workflow

Category Tool Remarks
Linux ripgrep  
Colab colab-cli Manage Colab notebooks from the command line
Git gitjk Undo what you just did in git