Data Analysis

Foundational Data Science Libraries

NumPy

Fundamental Array Computing

The bedrock of numerical computing in Python. Provides the ndarray object that powers virtually every other data science library. Features:

  • N-dimensional arrays with vectorized operations
  • Broadcasting for operations on different shapes
  • Universal functions (ufuncs) for fast element-wise operations
  • Linear algebra (matrix multiplication, decompositions, eigenvalues)
  • Random number generation from statistical distributions
  • Fourier transforms for signal processing
  • Masked arrays for handling missing data

Essential because vectorized operations execute in compiled C code, making Python competitive for numerical computing. Without NumPy, Python would be impractical for most scientific computing tasks.
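
A minimal sketch of a few of the features above (vectorized arithmetic, broadcasting, masked arrays, random sampling). The array values and the -999 missing-value sentinel are made up for illustration.

```python
import numpy as np

# Vectorized arithmetic: one expression instead of a Python loop.
prices = np.array([10.0, 20.0, 30.0])
quantities = np.array([1, 2, 3])
revenue = prices * quantities            # element-wise product
total = revenue.sum()

# Broadcasting: subtract per-column means (shape (3,)) from a (2, 3) matrix.
matrix = np.array([[1.0, 2.0, 3.0],
                   [3.0, 4.0, 5.0]])
centered = matrix - matrix.mean(axis=0)

# Masked array: treat -999 as a missing-value sentinel (hypothetical convention).
readings = np.ma.masked_equal([4.0, -999.0, 6.0], -999.0)
mean_reading = readings.mean()           # the masked entry is ignored

# Seeded random generator for reproducible draws from a normal distribution.
rng = np.random.default_rng(seed=0)
samples = rng.normal(loc=0.0, scale=1.0, size=1000)
```

Each operation runs over whole arrays at once, which is where the speed advantage over explicit Python loops comes from.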


Pandas

Data Manipulation and Analysis

The Excel of Python but more powerful and scriptable. Builds on NumPy to provide labeled, tabular data structures. Features:

  • DataFrames and Series with intelligent indexing
  • Data cleaning (handling missing values, removing duplicates)
  • Data transformation (grouping, pivoting, melting, merging)
  • Time series support (dates, resampling, frequency conversion)
  • Data I/O (CSV, Excel, JSON, SQL, Parquet, HDF5)
  • Vectorized string operations
  • Categorical data for memory efficiency
  • Window functions (rolling, expanding, exponential weighted)
  • Database-like operations (group by, join, merge, pivot)

Essential because it handles the messiness of real-world data and provides intuitive tools for transforming raw data into analysis-ready format. Without pandas, you’d spend most of your time manually manipulating data with loops and lists.
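
As a rough illustration of the cleaning, grouping, and window features listed above, here is a small sketch. The column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with a duplicate row and a missing value.
raw = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south"],
    "date": pd.to_datetime(["2024-01-01", "2024-01-01", "2024-01-01",
                            "2024-01-02", "2024-01-03"]),
    "sales": [100.0, 100.0, 80.0, np.nan, 120.0],
})

clean = (
    raw.drop_duplicates()                      # remove the repeated north row
       .assign(sales=lambda d: d["sales"].fillna(d["sales"].mean()))
)

# Group-by aggregation: total sales per region.
totals = clean.groupby("region")["sales"].sum()

# Rolling window over the south series (mean of 2 consecutive observations).
south = clean[clean["region"] == "south"].set_index("date")["sales"]
rolling_mean = south.rolling(window=2).mean()
```

Filling with the column mean is only one of several imputation choices; the right one depends on the data.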


SciPy

Scientific and Technical Computing

Builds on NumPy to provide a comprehensive suite of algorithms for scientific computing. Features:

  • scipy.optimize: Optimization and root finding
  • scipy.integrate: Numerical integration and ODE solvers
  • scipy.interpolate: Interpolation and smoothing
  • scipy.linalg: Linear algebra (SVD, eigenvalues, decompositions)
  • scipy.stats: Statistical distributions and hypothesis tests
  • scipy.signal: Signal processing (filters, spectral analysis)
  • scipy.sparse: Sparse matrix operations
  • scipy.spatial: KD-trees, Delaunay triangulation, distance metrics
  • scipy.special: Special mathematical functions (Bessel, gamma, error)
  • scipy.ndimage: N-dimensional image processing
  • scipy.fft: Fast Fourier Transforms
  • scipy.cluster: Clustering algorithms
  • scipy.io: Reading MATLAB, NetCDF, WAV files

Essential because it contains battle-tested, production-quality algorithms that would be difficult to implement from scratch. SciPy provides the computational muscle for research and engineering applications.
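
A short sketch touching three of the submodules above: scipy.optimize, scipy.integrate, and scipy.stats. The objective function and the synthetic samples are invented for the example.

```python
import numpy as np
from scipy import integrate, optimize, stats

# scipy.optimize: minimize a simple quadratic with a known minimum at x = 3.
result = optimize.minimize(lambda x: (x[0] - 3.0) ** 2, x0=[0.0])
x_min = result.x[0]

# scipy.integrate: integrate sin(x) over [0, pi]; the exact value is 2.
area, abs_error = integrate.quad(np.sin, 0.0, np.pi)

# scipy.stats: two-sample t-test on synthetic data (made-up parameters).
rng = np.random.default_rng(seed=42)
group_a = rng.normal(loc=0.0, scale=1.0, size=200)
group_b = rng.normal(loc=1.0, scale=1.0, size=200)
t_stat, p_value = stats.ttest_ind(group_a, group_b)
```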


The Ecosystem Relationship

These three libraries form a layered foundation:

  1. NumPy provides the array data structure
  2. SciPy adds scientific algorithms on top of NumPy
  3. pandas adds labeled data structures and data manipulation tools

Typical Workflow:

  • Load raw data with pandas (CSV, Excel, database)
  • Clean, transform, and explore with pandas
  • Convert to NumPy arrays for scikit-learn or deep learning
  • Apply scientific algorithms from SciPy (optimization, interpolation, statistical tests)
  • Visualize results with Matplotlib/Seaborn
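
The workflow above might look roughly like this in miniature. The CSV content is a stand-in for a real file, the column names are hypothetical, and the plotting step is left out to keep the sketch short.

```python
import io

import pandas as pd
from scipy import stats

# Stand-in for a CSV file on disk (hypothetical columns and values).
csv_text = """hours,score
1,52
2,54
3,
4,58
5,60
"""

df = pd.read_csv(io.StringIO(csv_text))      # 1. load with pandas
df = df.dropna()                             # 2. clean: drop the incomplete row

x = df["hours"].to_numpy()                   # 3. convert to NumPy arrays
y = df["score"].to_numpy()

fit = stats.linregress(x, y)                 # 4. apply a SciPy routine
slope, intercept = fit.slope, fit.intercept
```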

Practical Recommendations:

  • NumPy: Always prefer vectorized operations over Python loops. Use broadcasting to avoid explicit loops. Master advanced indexing (boolean and fancy indexing).
  • Pandas: Use loc for selection instead of chained indexing. Prefer transform over apply for group operations. Use categorical data type for columns with few unique values.
  • SciPy: Use scipy.optimize for optimization problems, scipy.stats for statistical tests, scipy.spatial for distance calculations, and scipy.sparse for memory-efficient work with large matrices.
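
A few of these recommendations in code form. This is a sketch on a toy DataFrame with hypothetical column names, not a complete style guide.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "team": ["a", "a", "b", "b"],
    "points": [10, 20, 30, 50],
})

# loc: select rows and assign in one step, instead of the chained form
# df[df["points"] > 15]["flag"] = "high", which may write to a temporary copy.
df["flag"] = "low"
df.loc[df["points"] > 15, "flag"] = "high"

# transform: broadcast the group mean back to every row (same length as df).
df["team_mean"] = df.groupby("team")["points"].transform("mean")

# category dtype: compact storage for a column with few unique values.
df["team"] = df["team"].astype("category")

# NumPy boolean indexing instead of a loop with an if-statement.
values = np.array([3, -1, 4, -1, 5])
positives = values[values > 0]
```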

Core Data Science Libraries

Matplotlib & Seaborn

Visualization & Plotting

  • Matplotlib: The foundational plotting library. Provides fine-grained control over every aspect of figures, axes, and plots. Essential for creating publication-quality static, animated, and interactive visualizations.

  • Seaborn: Built on top of Matplotlib with a high-level interface for statistical graphics. Makes it easy to create beautiful, informative plots with minimal code. Excellent for visualizing distributions, relationships, and categorical data.

Scikit-learn

Machine Learning

The go-to library for classical machine learning. Provides consistent APIs for:

  • Supervised learning (regression, classification)
  • Unsupervised learning (clustering, dimensionality reduction)
  • Model selection and evaluation
  • Preprocessing and feature engineering
  • Pipeline construction

It’s production-ready, well-documented, and integrates seamlessly with NumPy and Pandas.
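
To show what the consistent API means in practice, here is a minimal sketch that chains preprocessing and a classifier into one pipeline on the bundled iris dataset. The model choice and split parameters are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Preprocessing and model share one fit/predict interface.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)       # mean accuracy on held-out data
```

Because the scaler is fitted inside the pipeline, it only ever sees training data, which avoids leaking test-set statistics into preprocessing.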


Statistical & Scientific Computing

Statsmodels

Statistical Modeling & Hypothesis Testing

Complementary to Scikit-learn. Focuses on:

  • Statistical tests (t-tests, ANOVA, chi-square)
  • Linear and generalized linear models
  • Time series analysis (ARIMA, VAR)
  • Regression diagnostics
  • R-style formula interface

Essential when you need p-values, confidence intervals, and rigorous statistical inference.

SciPy

Scientific & Technical Computing

Covered in detail under Foundational Data Science Libraries above. In brief, it provides:

  • Optimization algorithms
  • Integration and interpolation
  • Signal and image processing
  • Sparse matrix operations
  • Special functions (Bessel, gamma, etc.)

Advanced Machine Learning

XGBoost, LightGBM, CatBoost

Gradient Boosting Libraries

These are the workhorses for tabular data competitions and real-world applications:

  • XGBoost: The pioneer. Highly optimized, handles missing values, built-in regularization
  • LightGBM: Faster training, lower memory usage, leaf-wise growth
  • CatBoost: Handles categorical features natively, requires minimal preprocessing

All three are strong defaults for structured (tabular) data, where they often match or outperform other model families.
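
The core idea these libraries share can be sketched by hand: fit small trees to the residuals of the current prediction and add them up. The sketch below uses scikit-learn's DecisionTreeRegressor on synthetic data. It is a bare-bones illustration of gradient boosting for squared error, not how XGBoost, LightGBM, or CatBoost are implemented, and the learning rate and tree depth are arbitrary.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem (made-up data).
rng = np.random.default_rng(seed=0)
X = rng.uniform(-3.0, 3.0, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())       # start from a constant model
initial_mse = np.mean((y - prediction) ** 2)
trees = []

for _ in range(50):
    residuals = y - prediction               # negative gradient of squared error
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)
    prediction += learning_rate * tree.predict(X)   # shrunken update
    trees.append(tree)

final_mse = np.mean((y - prediction) ** 2)
```

The real libraries add regularization, histogram-based split finding, and missing-value handling on top of this loop.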


Deep Learning

TensorFlow & Keras

Deep Learning Framework

TensorFlow is a production-oriented framework. Keras, available within TensorFlow as tf.keras, provides a user-friendly API for building neural networks. Ideal for:

  • Computer vision (CNN)
  • Natural language processing (RNN, LSTM, Transformers)
  • Deployment to production (TensorFlow Serving, TensorFlow Lite)

PyTorch

Deep Learning Framework

More Pythonic and dynamic than TensorFlow. Preferred in research and increasingly in production. Features:

  • Dynamic computation graphs
  • Excellent debugging
  • Strong community and ecosystem (Hugging Face, PyTorch Lightning)
  • Great for research prototypes

Data Preparation & Feature Engineering

Feature-engine

Feature Engineering

Alternative to Scikit-learn’s preprocessing. Provides:

  • Missing value imputation (with more options)
  • Outlier capping
  • Feature encoding (with more categorical options)
  • Feature selection
  • Pipeline integration

Category Encoders

Categorical Variable Encoding

Extends Scikit-learn with more categorical encoding methods:

  • Target encoding
  • Weight of evidence
  • Leave-one-out encoding
  • Binary encoding
  • Hashing encoding

Essential when dealing with high-cardinality categorical variables.
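
To make target encoding concrete, here is a hand-rolled sketch of smoothed target (mean) encoding in plain pandas. It does not use the Category Encoders API, and the column names and the smoothing weight are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["x", "x", "x", "y", "z", "z"],
    "target": [1, 0, 1, 1, 0, 0],
})

global_mean = df["target"].mean()
per_city = df.groupby("city")["target"].agg(["mean", "count"])

# Hypothetical prior weight: larger values pull rare categories
# more strongly toward the global mean.
smoothing = 2.0
encoded_value = (
    (per_city["count"] * per_city["mean"] + smoothing * global_mean)
    / (per_city["count"] + smoothing)
)

# Replace each category with its smoothed target mean.
df["city_encoded"] = df["city"].map(encoded_value)
```

In practice the encoding should be fitted on training data only (ideally within cross-validation folds), since it uses the target and can otherwise leak information.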


Model Interpretation & Explainability

SHAP (SHapley Additive exPlanations)

Model Explainability

Game theory approach to explain machine learning models. Provides:

  • Global feature importance
  • Local explanations for individual predictions
  • Visualizations (summary plots, dependence plots, force plots)
  • Works across many model types (tree ensembles, linear models, neural networks)

LIME (Local Interpretable Model-agnostic Explanations)

Local Explanations

Explains individual predictions by approximating the model locally:

  • Perturbs input data
  • Trains an interpretable model locally
  • Great for debugging specific predictions
  • Works with any model type
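
The perturb-and-fit steps above can be sketched by hand with NumPy and scikit-learn. This is not the lime package's API. The black_box function is a stand-in for any model's predict function, and the kernel width and sample count are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def black_box(X):
    """Stand-in for any fitted model's predict function (hypothetical)."""
    return X[:, 0] ** 2 + 3.0 * X[:, 1]


rng = np.random.default_rng(seed=0)
instance = np.array([1.0, 0.0])              # the prediction to explain

# 1. Perturb the input around the instance.
perturbed = instance + rng.normal(scale=0.1, size=(500, 2))

# 2. Weight samples by proximity (Gaussian kernel on Euclidean distance).
distances = np.linalg.norm(perturbed - instance, axis=1)
weights = np.exp(-(distances ** 2) / (2 * 0.25 ** 2))

# 3. Fit an interpretable (linear) model to the black box's outputs.
surrogate = LinearRegression()
surrogate.fit(perturbed, black_box(perturbed), sample_weight=weights)
local_effects = surrogate.coef_              # per-feature local slopes
```

Near the point (1, 0) the true local slopes of this black box are 2 and 3, so the surrogate's coefficients should land close to those values.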

Eli5

Model Inspection

Lightweight library for inspecting models:

  • Feature importance
  • Prediction explanations
  • Scikit-learn compatibility
  • Useful for quick, preliminary analysis

Natural Language Processing

NLTK

Classical NLP

Comprehensive suite for traditional NLP:

  • Tokenization and stemming
  • POS tagging
  • Named entity recognition
  • Corpus access
  • Classic algorithms (Naive Bayes, MaxEnt)

spaCy

Modern NLP

Industrial-strength NLP with pretrained models:

  • Fast and efficient
  • Production-ready
  • Word vectors
  • Entity recognition
  • Dependency parsing
  • Easy integration

Transformers (Hugging Face)

State-of-the-art NLP

The go-to library for modern NLP:

  • Pretrained transformer models (BERT, GPT, T5)
  • Fine-tuning capabilities
  • Hundreds of models available
  • Unified API
  • Active community

Time Series Analysis

Prophet (Facebook)

Forecasting

Developed by Facebook, excellent for business forecasting:

  • Handles seasonality (daily, weekly, yearly)
  • Models holiday effects (built-in country calendars or user-supplied dates)
  • Robust to missing data
  • Intuitive parameter tuning

PyFlux

Time Series Modeling

Comprehensive time series library:

  • ARIMA, GARCH, VAR models
  • Bayesian approaches
  • Good for academic research
  • Puts more emphasis on Bayesian inference than Statsmodels

Darts

Modern Time Series

Combines classical and deep learning methods:

  • Multiple models (ARIMA, Prophet, LSTM, Transformers)
  • Backtesting capabilities
  • Ensembling
  • Scaling and preprocessing

Hyperparameter Optimization

Optuna

Hyperparameter Optimization

Modern, efficient optimization library:

  • Define-by-run API (very flexible)
  • Pruning of unpromising trials
  • Integration with major ML frameworks
  • Parallel execution support

Hyperopt

Hyperparameter Optimization

Established library with several algorithms:

  • Tree of Parzen Estimators
  • Random search
  • Adaptive algorithms
  • Distributed computation

Optuna vs. Hyperopt

  • Optuna is newer, with a more Pythonic define-by-run API and built-in pruning of trials
  • Hyperopt is battle-tested, especially in Spark environments

Big Data & Scaling

Dask

Parallel Computing

Scales pandas and NumPy to larger-than-memory datasets:

  • Dask DataFrames (like pandas)
  • Dask Arrays (like NumPy)
  • Task scheduling
  • Distributed computing
  • Integration with Scikit-learn

Modin

Fast Pandas

Drop-in replacement for pandas that uses parallelization:

  • Uses Ray or Dask as backend
  • Same API as pandas
  • Handles large datasets
  • Speeds up pandas operations

Vaex

Out-of-Core DataFrames

Memory-efficient dataframes for huge datasets:

  • Lazy evaluation
  • Millions of rows per second
  • Plotting without memory issues
  • Fast groupby and aggregation

Data Storage & Databases

SQLAlchemy

Database Abstraction

Essential for database interactions:

  • ORM and SQL expression language
  • Connection pooling
  • Multiple database support (PostgreSQL, MySQL, SQLite, etc.)
  • Integrates with Pandas

PySpark

Big Data Processing

Apache Spark’s Python API:

  • Distributed data processing
  • SQL-like operations
  • Machine Learning library (MLlib)
  • Graph processing
  • Streaming capabilities

HDF5 / PyTables

Hierarchical Data Storage

For storing large numerical datasets:

  • Efficient I/O
  • Metadata support
  • Compression
  • Useful for scientific computing

Deployment & Production

FastAPI

API Development

Modern, fast web framework for deploying models:

  • Automatic OpenAPI documentation
  • Async support
  • Pydantic for data validation
  • Lightweight and performant

Flask

Web Framework

Lightweight and flexible:

  • Simple to start
  • Wide ecosystem
  • Good for MVPs and prototypes

Streamlit

Data App Development

Minimal effort to create interactive data apps:

  • Pure Python
  • Live reloading
  • Widgets and components
  • Perfect for demos and dashboards

Gradio

ML Demos

Quickly create UIs for machine learning models:

  • Minimal code
  • Web-based interfaces
  • Shareable links
  • Great for showcasing models

Model Management & Workflow

MLflow

ML Lifecycle Management

End-to-end ML lifecycle:

  • Experiment tracking
  • Model registry
  • Packaging and deployment
  • Works with any library

DVC (Data Version Control)

Data Versioning

Version control for data and models:

  • Git integration
  • Data pipeline management
  • Reproducible experiments
  • Handles large files

PyCaret

Low-code ML

Automated machine learning:

  • Minimal code to build models
  • Automated feature engineering
  • Model comparison
  • Easy deployment

Data Collection & Web Scraping

BeautifulSoup

HTML/XML Parsing

For parsing HTML and XML documents:

  • Navigate parse trees
  • Search by tags, attributes
  • Simple and forgiving

Scrapy

Web Scraping Framework

Full-fledged scraping framework:

  • Concurrent scraping
  • Pipelines for processing
  • Export to multiple formats
  • Extensible architecture

Selenium

Browser Automation

For dynamic websites with JavaScript:

  • Controls browser programmatically
  • Simulates user interactions
  • Works with Chrome, Firefox, etc.
  • Handles complex AJAX

Miscellaneous Essentials

Click

Command Line Interface

Create command-line interfaces:

  • Decorator-based API
  • Nested commands
  • Parameter validation
  • Help generation

TQDM

Progress Bars

Visual feedback for long operations:

  • Progress bars in loops
  • Integration with Jupyter
  • Time estimation
  • Minimal overhead

Joblib

Serialization & Caching

Efficient serialization for large NumPy arrays:

  • Often faster than pickle for objects containing large NumPy arrays
  • Caching functions
  • Parallel execution
  • Used by Scikit-learn
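
A small sketch of two of these features, serialization and parallel execution. The file name is made up, and the thread-based backend is chosen only to keep the example lightweight.

```python
import os
import tempfile

import numpy as np
from joblib import Parallel, delayed, dump, load

# Serialization: persist a NumPy array to disk and read it back.
array = np.arange(100_000, dtype=np.float64)
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "array.joblib")   # hypothetical file name
    dump(array, path)
    restored = load(path)

# Parallel execution: map a function over inputs using two workers.
squares = Parallel(n_jobs=2, prefer="threads")(
    delayed(pow)(i, 2) for i in range(5)
)
```

For CPU-bound pure-Python functions, the default process-based backend is usually the better choice, since threads are limited by the global interpreter lock.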

Learning Path Recommendation

For Beginners (Start Here):

  1. NumPy & Pandas - Foundation
  2. Matplotlib & Seaborn - Visualization
  3. Scikit-learn - Machine learning basics
  4. Statsmodels - Statistical inference

For Intermediate (Expand Your Toolkit):

  5. XGBoost/LightGBM - Boosted trees
  6. SHAP & LIME - Model explainability
  7. Feature-engine - Better preprocessing
  8. Optuna - Hyperparameter tuning

For Advanced (Specialize):

  • NLP: spaCy + Transformers
  • Computer Vision: PyTorch/TensorFlow
  • Time Series: Prophet + Darts
  • Big Data: Dask + PySpark
  • Production: FastAPI + MLflow

My Top Picks for Immediate Focus

After mastering Pandas/NumPy/SciPy, prioritize:

  1. Scikit-learn - Absolute necessity
  2. Matplotlib & Seaborn - You can’t analyze without visualizing
  3. XGBoost or LightGBM - Most practical ML algorithm
  4. Statsmodels - For rigorous analysis
  5. SHAP - To explain your models
  6. Streamlit or FastAPI - To share your work

These six, combined with a Pandas/NumPy/SciPy foundation, cover the large majority of day-to-day data science work. The others become relevant based on specific projects (NLP, time series, deep learning, etc.).
