rish1614/Statistical-Learning

Statistical-Learning

Internship projects and notebooks on Statistical Learning under the guidance of Prof. Neelesh S. Upadhye, IIT Madras.

Statistical Learning Internship @ IIT Madras

Welcome to my official repository for the Summer Internship (May–July 2025) at the Department of Mathematics, IIT Madras under the guidance of Prof. Neelesh S. Upadhye.
This internship focuses on practical and theoretical aspects of Statistical Learning, involving data modeling, evaluation, and optimization techniques using real-world datasets.


About the Internship

  • Institute: Indian Institute of Technology, Madras
  • Mentor: Prof. Neelesh S. Upadhye
  • Duration: 18 May 2025 – 18 July 2025
  • Core Focus: Statistical Modeling, Machine Learning, Applied Regression, Model Evaluation
  • Base Environment: iitm-stats-learning/Conda_Env

Tasks Completed

Week 1: Boston Housing Dataset

  • Performed Exploratory Data Analysis (EDA)
  • Built OLS Regression Model to predict housing prices
  • Implemented 10-Fold Cross-Validation for evaluation
  • Visualized correlations, feature relationships, and residuals

Notebook: Rishabh_Shukla_Boston_HousingW1.ipynb
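
The OLS fit and 10-fold cross-validation listed above can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: it uses NumPy only, the data is a synthetic stand-in (the Boston Housing table is not loaded here), and the helper names `fit_ols`, `predict_ols`, and `kfold_rmse` are hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients, intercept first."""
    A = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def predict_ols(beta, X):
    return beta[0] + X @ beta[1:]

def kfold_rmse(X, y, k=10, seed=0):
    """Mean held-out RMSE over k shuffled folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        beta = fit_ols(X[train], y[train])
        resid = y[test] - predict_ols(beta, X[test])
        scores.append(np.sqrt(np.mean(resid ** 2)))
    return float(np.mean(scores))

# Synthetic stand-in for a housing table (assumption): 3 features, known coefficients.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 5.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

beta_hat = fit_ols(X, y)
cv_rmse = kfold_rmse(X, y, k=10)
print("coefficients:", beta_hat)
print("10-fold CV RMSE:", cv_rmse)
```

Each fold is held out exactly once, so the reported RMSE is computed only on rows the model did not see during fitting.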


Week 2: Bias-Variance & Model Tuning

  • Studied and visualized bias-variance tradeoff
  • Compared underfitting vs overfitting models
  • Introduced polynomial regression
  • Performed model validation using RMSE and R²

Notebook: Rishabh_Shukla_Heart_DiseaseW2.ipynb
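
One way to see the underfitting vs overfitting comparison is to fit polynomials of increasing degree and track training and test RMSE. The sketch below assumes a synthetic one-dimensional target (`true_fn`, a sine curve) and arbitrary degrees 1, 3, and 12; none of these choices come from the notebook.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def true_fn(x):
    # Hypothetical ground-truth curve used only for this illustration.
    return np.sin(3 * x)

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, 30)
x_test = rng.uniform(-1, 1, 200)
y_train = true_fn(x_train) + rng.normal(scale=0.2, size=x_train.size)
y_test = true_fn(x_test) + rng.normal(scale=0.2, size=x_test.size)

results = {}
for degree in (1, 3, 12):
    coefs = P.polyfit(x_train, y_train, degree)  # least-squares polynomial fit
    results[degree] = (
        rmse(y_train, P.polyval(x_train, coefs)),  # training error
        rmse(y_test, P.polyval(x_test, coefs)),    # held-out error
    )

for degree, (train_err, test_err) in results.items():
    print(f"degree={degree:2d}  train RMSE={train_err:.3f}  test RMSE={test_err:.3f}")
```

Training error can only fall as the degree grows, because each polynomial family contains the smaller ones. The test error is the quantity that exposes the tradeoff: a straight line is too rigid for this curve (high bias), while a very flexible fit tends to chase noise (high variance).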


Week 3: Model Selection & Regularization

  • Explored Model Selection: Subset Selection, Stepwise Selection, Dimension Reduction
  • Explored regularization techniques: Ridge, Lasso, and Elastic-Net
  • Conducted feature engineering and data preprocessing for model input optimization
  • Implemented Grid Search for hyperparameter tuning
  • Provided a markdown-based theoretical explanation of the bias-variance tradeoff
  • Performed cross-validation to select optimal model parameters

Notebook: Rishabh_Shukla_W3.ipynb
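
Grid search combined with cross-validation, as listed above, amounts to scoring every candidate hyperparameter on held-out folds and keeping the best one. A minimal sketch for the Ridge penalty λ follows; the grid values, the synthetic data, and the helper names `ridge_fit` and `cv_mse` are assumptions made for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge on centered data; the intercept is not penalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    coef = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    return coef, y_mean - x_mean @ coef

def cv_mse(X, y, lam, k=5, seed=0):
    """Mean held-out MSE over k folds for one value of lam."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    errs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coef, intercept = ridge_fit(X[train], y[train], lam)
        errs.append(np.mean((y[test] - (X[test] @ coef + intercept)) ** 2))
    return float(np.mean(errs))

# Synthetic data (assumption): 20 features, only the first 3 matter.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 20))
true_coef = np.zeros(20)
true_coef[:3] = [3.0, -2.0, 1.0]
y = X @ true_coef + rng.normal(scale=1.0, size=60)

grid = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
scores = {lam: cv_mse(X, y, lam) for lam in grid}
best_lam = min(scores, key=scores.get)
print("CV MSE per lambda:", scores)
print("selected lambda:", best_lam)
```

Using the same fold split (fixed `seed`) for every λ keeps the comparison fair, since each candidate is scored on identical held-out rows.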


Additional Task: In-Depth Ridge and Lasso Regression Algorithms

  • Implemented the Ridge and Lasso Regression algorithms from scratch using the normal equation and the coordinate descent method
  • Examined the effect of regularization strength (λ) on coefficient shrinkage using cross-validation
  • Compared model performance for Ridge vs Lasso based on evaluation metrics
  • Visualized coefficient paths and selected features under different regularization penalties
  • Interpreted results in context of model sparsity and overfitting control

Notebook: Rishabh_Shukla_ridge_lasso_W4.ipynb
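
The two solvers named above can be sketched as follows, under the assumption that Ridge uses the regularized normal equation and Lasso uses cyclic coordinate descent with soft-thresholding. The data is synthetic with no intercept, λ values are arbitrary, and the function names are hypothetical rather than taken from the notebook.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve the regularized normal equation (X^T X + lam * I) beta = X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly zero when |z| <= t."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize 0.5 * ||y - X beta||^2 + lam * ||beta||_1 one coordinate at a time."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: remove every feature's contribution except j's.
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
    return beta

# Synthetic data (assumption): 8 features, only features 0 and 3 are active.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
true_beta = np.array([4.0, 0.0, 0.0, -3.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_beta + rng.normal(scale=1.0, size=100)

beta_ridge = ridge_closed_form(X, y, lam=50.0)
beta_lasso = lasso_coordinate_descent(X, y, lam=50.0)
print("ridge:", np.round(beta_ridge, 3))
print("lasso:", np.round(beta_lasso, 3))
```

The soft-threshold step is what produces sparsity: a coefficient whose correlation with the partial residual stays below λ is set exactly to zero, whereas the Ridge solution shrinks coefficients without zeroing them.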


Week 4: Tree-Based Methods (Regression Tree, Random Forest, XGBoost)

  • Implemented Regression Trees from scratch including Recursive Binary Splitting and Cost-Complexity Pruning
  • Used cross-validation to tune the cost-complexity parameter α and prune the tree effectively
  • Visualized model structure and residual patterns with detailed diagnostic plots
  • Built Random Forest from scratch, including bootstrap sampling and feature bagging
  • Implemented XGBoost from scratch using sequential learning on residuals with shrinkage
  • Compared all models on performance metrics (RSS, MSE, R²)
  • Tuned hyperparameters using manual grid search for both Random Forest and XGBoost
  • Provided theory markdowns inspired by ISLR book and aligned implementation with those concepts

Notebook: Rishabh_Shukla_non_linear_w4.ipynb
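
As a rough illustration of two ideas above, the sketch below finds the RSS-minimizing split for a single feature (one step of recursive binary splitting) and then fits such stumps sequentially to residuals with shrinkage, in the style of the boosting algorithm for regression trees in ISLR. It is plain boosting on residuals, not the full XGBoost algorithm (no second-order gradients or regularized gain), and it is not the notebook's implementation. The data and function names are assumptions.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, left_mean, right_mean) minimizing RSS for one feature."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best = (np.inf, None, None, None)
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # no threshold separates equal values
        left, right = ys[:i], ys[i:]
        rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if rss < best[0]:
            best = (rss, (xs[i - 1] + xs[i]) / 2, left.mean(), right.mean())
    return best[1:]

def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return np.where(x <= threshold, left_mean, right_mean)

def boost(x, y, n_trees=200, shrinkage=0.1):
    """Fit stumps to the current residuals, adding a shrunken copy of each."""
    stumps, residual = [], y.copy()  # the model starts at zero, so residual = y
    for _ in range(n_trees):
        stump = best_split(x, residual)
        residual = residual - shrinkage * stump_predict(stump, x)
        stumps.append(stump)
    return stumps

def boost_predict(stumps, x, shrinkage=0.1):
    return shrinkage * sum(stump_predict(s, x) for s in stumps)

# Synthetic one-feature data (assumption).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

single = best_split(x, y)
stumps = boost(x, y)
mse_stump = float(np.mean((y - stump_predict(single, x)) ** 2))
mse_boost = float(np.mean((y - boost_predict(stumps, x)) ** 2))
print("single stump training MSE:", mse_stump)
print("boosted stumps training MSE:", mse_boost)
```

Shrinkage slows learning deliberately: each stump removes only a fraction of what it could explain, so later stumps can correct different regions of the input. These are training errors only; choosing `n_trees` and `shrinkage` still requires held-out data.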


Week 5: Capstone Project Kick-off & Model Comparison

  • Conducted a comprehensive comparison of all previously implemented models (OLS, Ridge, Lasso, Trees, XGBoost).
  • Initiated the capstone project: defined the problem statement, scope, and initial data pipeline.
  • Established project roles and responsibilities across Data, Modelling, Validation, MLOps, and Documentation.
  • Developed the first project module for data cleaning, including schema validation tests.
  • Created a project plan and set up a Kanban board for task management and tracking.
  • Submitted Pull Request #1 containing the initial data-cleaning module.

Notebooks: Rishabh_Shukla_capstone_W5.ipynb & Rishabh_Shukla_model_comparison_W5.ipynb


Week 6: Probabilistic Modeling with Bayesian Methods

  • Studied and implemented Bayesian linear and logistic regression from theoretical principles.
  • Compared different inference techniques, focusing on Variational Inference (VI) vs. MCMC.
  • Developed Bayesian models from scratch.
  • Generated and analyzed uncertainty plots to quantify model confidence in predictions.

Notebook: Rishabh_Shukla_bayesian_modeling.ipynb
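
For the linear case, the posterior is available in closed form when the prior and noise are Gaussian, which makes it a convenient reference point before turning to VI or MCMC. The sketch below assumes a zero-mean isotropic prior with precision `alpha` and a known noise precision `beta`; these symbols, the data, and the function names are illustrative assumptions, not the notebook's code.

```python
import numpy as np

def posterior(Phi, y, alpha, beta):
    """Gaussian posterior N(m, S) over weights: prior N(0, I/alpha), noise precision beta."""
    S_inv = alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi
    S = np.linalg.inv(S_inv)
    m = beta * S @ Phi.T @ y
    return m, S

def predictive(Phi_new, m, S, beta):
    """Predictive mean and variance (noise plus weight uncertainty) at new inputs."""
    mean = Phi_new @ m
    var = 1.0 / beta + np.einsum("ij,jk,ik->i", Phi_new, S, Phi_new)
    return mean, var

def design(x):
    """Design matrix with an intercept column and the raw input."""
    return np.column_stack([np.ones_like(x), x])

# Synthetic data (assumption): y = 0.5 + 2x + noise, observed only on [-1, 1].
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 25)
y = 0.5 + 2.0 * x + rng.normal(scale=0.2, size=25)

alpha, beta = 1.0, 1.0 / 0.2 ** 2
m, S = posterior(design(x), y, alpha, beta)

x_new = np.array([0.0, 3.0])  # one point inside the data range, one far outside
mean, var = predictive(design(x_new), m, S, beta)
print("posterior mean weights:", m)
print("predictive std at x=0 and x=3:", np.sqrt(var))
```

The predictive variance grows away from the observed inputs, which is the behaviour an uncertainty plot is meant to show: the model is less confident when extrapolating to x = 3 than when interpolating at x = 0.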


Week 7: Final Presentation Preparation

  • Synthesized all key learnings, model results, and project outcomes from the entire internship.
  • Developed a comprehensive slide deck covering the progression from foundational statistical learning theory to advanced Bayesian modeling.
  • Structured the presentation narrative to highlight the practical insights gained from applying different models to real-world datasets (Diamond Dataset).
  • Conducted rehearsals to refine the content and ensure a clear and effective final delivery.

Deliverable: Final_Presentation (PDF)


Week 8: Final Report Compilation

  • Authored the final internship report, documenting all work completed.
  • Detailed the theoretical concepts, from-scratch implementation methods, and empirical results for every model studied.
  • Formatted the entire report using LaTeX to meet academic standards, including all figures, tables, and citations.
  • Submitted the final report as the primary deliverable, summarizing the complete scope of the internship at IIT Madras.

Deliverable: Final_Report (PDF)


Core Implementation Scripts

In addition to the weekly notebooks, the core algorithms were implemented from scratch in the following Python scripts:

  • All_Regression_Models.py: Contains from-scratch Python implementations of all supervised regression models covered in this internship, including OLS, Ridge, Lasso, Regression Trees, Random Forest, and XGBoost.
  • Probabilistic_Models.py: Contains from-scratch implementations of Bayesian models and their inference algorithms, including Bayesian Linear/Logistic Regression with Laplace, VI, and MCMC samplers.

Reading Resources


Environment Setup

This project uses the Conda environment defined here:
🔗 iitm-stats-learning/Conda_Env

To Reproduce Locally:

git clone https://github.com/iitm-stats-learning/Conda_Env.git
cd Conda_Env
conda env create -f stats_learning.yml
conda activate stats-learning
jupyter notebook
