Internship projects and notebooks on Statistical Learning under the guidance of Prof. Neelesh S. Upadhye, IIT Madras.
Welcome to my official repository for the Summer Internship (May–July 2025) at the Department of Mathematics, IIT Madras under the guidance of Prof. Neelesh S. Upadhye.
This internship focuses on practical and theoretical aspects of Statistical Learning, involving data modeling, evaluation, and optimization techniques using real-world datasets.
- Institute: Indian Institute of Technology, Madras
- Mentor: Prof. Neelesh S. Upadhye
- Duration: 18 May 2025 – 18 July 2025
- Core Focus: Statistical Modeling, Machine Learning, Applied Regression, Model Evaluation
- Base Environment: iitm-stats-learning/Conda_Env
- Performed Exploratory Data Analysis (EDA)
- Built OLS Regression Model to predict housing prices
- Implemented 10-Fold Cross-Validation for evaluation
- Visualized correlations, feature relationships, and residuals
Notebook: Rishabh_Shukla_Boston_HousingW1.ipynb
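
The OLS fit and 10-fold cross-validation listed above can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's code: the helper names (`fit_ols`, `predict_ols`, `kfold_rmse`) and the generated dataset are assumptions made for the example.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients, intercept first."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def predict_ols(beta, X):
    return beta[0] + X @ beta[1:]

def kfold_rmse(X, y, k=10, seed=0):
    """Return the mean RMSE and the per-fold RMSEs over k shuffled folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        beta = fit_ols(X[train], y[train])
        resid = y[test] - predict_ols(beta, X[test])
        scores.append(float(np.sqrt(np.mean(resid ** 2))))
    return float(np.mean(scores)), scores

# Synthetic stand-in for a housing dataset (assumption, not the real data).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 2.0 + X @ np.array([1.5, -0.7, 0.3]) + rng.normal(scale=0.5, size=200)

mean_rmse, fold_rmse = kfold_rmse(X, y, k=10)
print(f"10-fold CV RMSE: {mean_rmse:.3f}")
```

Each observation is held out exactly once, so the averaged RMSE estimates out-of-sample error rather than training error.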
- Studied and visualized bias-variance tradeoff
- Compared underfitting vs overfitting models
- Introduced polynomial regression
- Performed model validation using RMSE and R²
Notebook: Rishabh_Shukla_Heart_DiseaseW2.ipynb
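
To make the underfitting vs overfitting comparison concrete, here is a small sketch that fits polynomials of increasing degree and reports RMSE and R². The data is synthetic (a noisy sine curve) and the degrees 1, 3, and 9 are arbitrary choices for illustration; none of this is taken from the notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

def true_fn(x):
    return np.sin(np.pi * x)

# Synthetic train/test data (assumption for the example).
rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, 60)
x_test = rng.uniform(-1, 1, 200)
y_train = true_fn(x_train) + rng.normal(scale=0.2, size=60)
y_test = true_fn(x_test) + rng.normal(scale=0.2, size=200)

results = {}
for degree in (1, 3, 9):
    coefs = np.polyfit(x_train, y_train, degree)
    results[degree] = {
        "train_rmse": rmse(y_train, np.polyval(coefs, x_train)),
        "test_rmse": rmse(y_test, np.polyval(coefs, x_test)),
        "test_r2": r2(y_test, np.polyval(coefs, x_test)),
    }

for degree, scores in results.items():
    print(degree, {name: round(value, 3) for name, value in scores.items()})
```

Training RMSE can only go down as the degree grows, because each model contains the previous one. Test RMSE need not follow, which is the practical signature of the bias-variance tradeoff.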
- Explored Model Selection: Subset Selection, Stepwise Selection, Dimension Reduction
- Explored regularization techniques: Ridge, Lasso, and Elastic-Net
- Conducted feature engineering and data preprocessing for model input optimization
- Implemented Grid Search for hyperparameter tuning
- Provided a markdown-based theoretical explanation of the bias-variance tradeoff
- Performed cross-validation to select optimal model parameters
Notebook: Rishabh_Shukla_W3.ipynb
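
Of the model selection methods listed above, forward stepwise selection is the easiest to sketch. The version below greedily adds the feature that most reduces the residual sum of squares (RSS). The function names and the synthetic data are assumptions for illustration, and choosing the final model size (by cross-validation, AIC, or similar) is left out.

```python
import numpy as np

def rss(X, y, cols):
    """RSS of an OLS fit on an intercept plus the given columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def forward_stepwise(X, y, max_features):
    """Greedy forward selection; returns [(selected_columns, rss), ...]."""
    selected, path = [], []
    remaining = list(range(X.shape[1]))
    for _ in range(max_features):
        best = min(remaining, key=lambda c: rss(X, y, selected + [c]))
        selected.append(best)
        remaining.remove(best)
        path.append((list(selected), rss(X, y, selected)))
    return path

# Synthetic data in which only columns 0 and 2 carry signal (assumption).
rng = np.random.default_rng(7)
X = rng.normal(size=(150, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=150)

path = forward_stepwise(X, y, max_features=3)
for cols, value in path:
    print(cols, round(value, 2))
```

RSS always falls as features are added, so the path alone cannot pick a model size; that is where a held-out criterion such as cross-validation comes in.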
- Implemented Ridge and Lasso Regression from scratch, using the normal equation (Ridge) and coordinate descent (Lasso)
- Examined the effect of regularization strength (λ) on coefficient shrinkage using cross-validation
- Compared model performance for Ridge vs Lasso based on evaluation metrics
- Visualized coefficient paths and selected features under different regularization penalties
- Interpreted results in context of model sparsity and overfitting control
Notebook: Rishabh_Shukla_ridge_lasso_W4.ipynb
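
The two from-scratch solvers mentioned above can be sketched in a few lines. This is an illustrative version under stated assumptions: features are standardized and the response is centered so no intercept is needed, Ridge minimizes ‖y − Xβ‖² + λ‖β‖², and Lasso minimizes (1/2n)‖y − Xβ‖² + λ‖β‖₁. The function names and data are hypothetical and are not taken from the notebook.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve the regularized normal equation (X'X + lam*I) beta = X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def soft_threshold(rho, lam):
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Cyclic coordinate descent with a fixed number of sweeps."""
    n, p = X.shape
    beta = np.zeros(p)
    z = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # Residual with feature j's current contribution added back.
            partial_resid = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_resid / n
            beta[j] = soft_threshold(rho, lam) / z[j]
    return beta

# Synthetic data: only features 0 and 2 are relevant (assumption).
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 0.0]) + rng.normal(scale=0.5, size=200)
y = y - y.mean()

ridge_weak = ridge_closed_form(X, y, lam=0.0)
ridge_strong = ridge_closed_form(X, y, lam=100.0)
lasso_beta = lasso_coordinate_descent(X, y, lam=0.3)
print("ridge (lam=100):", np.round(ridge_strong, 3))
print("lasso (lam=0.3):", np.round(lasso_beta, 3))
```

Ridge shrinks every coefficient toward zero without eliminating any, while the soft-thresholding step lets Lasso set irrelevant coefficients to exactly zero.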
- Implemented Regression Trees from scratch including Recursive Binary Splitting and Cost-Complexity Pruning
- Used cross-validation to tune the cost-complexity parameter α and prune the tree effectively
- Visualized model structure and residual patterns with detailed diagnostic plots
- Built Random Forest from scratch, including bootstrap sampling and feature bagging
- Implemented XGBoost from scratch using sequential learning on residuals with shrinkage
- Compared all models on performance metrics (RSS, MSE, R²)
- Tuned hyperparameters using manual grid search for both Random Forest and XGBoost
- Provided theory notes in markdown, inspired by the ISLR book, and aligned the implementation with those concepts
Notebook: Rishabh_Shukla_non_linear_w4.ipynb
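
Two ideas from the list above, choosing a binary split by RSS and learning sequentially on residuals with shrinkage, can be shown together in a small sketch. Note the simplifications: it uses a single feature, depth-1 trees (stumps), and plain squared-error boosting, so it omits the second-order gradients and regularization terms that distinguish XGBoost. All names and data are assumptions for illustration.

```python
import numpy as np

def best_split(x, y):
    """Threshold on one feature that minimizes the RSS of two leaf means.

    Assumes x contains at least two distinct values.
    """
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best_rss, best_stump = np.inf, None
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # no threshold separates equal values
        left, right = ys[:i], ys[i:]
        rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if rss < best_rss:
            threshold = (xs[i - 1] + xs[i]) / 2
            best_rss, best_stump = rss, (threshold, left.mean(), right.mean())
    return best_stump

def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def fit_boosted_stumps(x, y, n_rounds=100, learning_rate=0.1):
    """Fit each stump to the current residuals and add a shrunken copy."""
    base = float(y.mean())
    pred = np.full(len(y), base)
    stumps = []
    for _ in range(n_rounds):
        stump = best_split(x, y - pred)
        pred = pred + learning_rate * stump_predict(stump, x)
        stumps.append(stump)
    return base, learning_rate, stumps

def boosted_predict(model, x):
    base, learning_rate, stumps = model
    pred = np.full(len(x), base)
    for stump in stumps:
        pred = pred + learning_rate * stump_predict(stump, x)
    return pred

# Synthetic one-dimensional data (assumption).
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=200)

stump_mse = float(np.mean((y - stump_predict(best_split(x, y), x)) ** 2))
model = fit_boosted_stumps(x, y)
boosted_mse = float(np.mean((y - boosted_predict(model, x)) ** 2))
print(f"single stump MSE: {stump_mse:.3f}, boosted MSE: {boosted_mse:.3f}")
```

These are training errors only. In practice the number of rounds and the learning rate would be tuned on held-out data, as in the grid search described above.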
- Conducted a comprehensive comparison of all previously implemented models (OLS, Ridge, Lasso, Trees, XGBoost).
- Initiated the capstone project: defined the problem statement, scope, and initial data pipeline.
- Established project roles and responsibilities across Data, Modelling, Validation, MLOps, and Documentation.
- Developed the first project module for data cleaning, including schema validation tests.
- Created a project plan and set up a Kanban board for task management and tracking.
- Submitted Pull Request #1 containing the initial data-cleaning module.
Notebooks: Rishabh_Shukla_capstone_W5.ipynb & Rishabh_Shukla_model_comparison_W5.ipynb
- Studied and implemented Bayesian linear and logistic regression from theoretical principles.
- Compared different inference techniques, focusing on Variational Inference (VI) vs. MCMC.
- Developed Bayesian models from scratch.
- Generated and analyzed uncertainty plots to quantify model confidence in predictions.
Notebook: Rishabh_Shukla_bayesian_modeling.ipynb
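
As a reference point for the uncertainty plots mentioned above, the simplest Bayesian linear regression has a closed-form posterior. The sketch below assumes a zero-mean Gaussian prior on the coefficients and a known noise variance, so it needs neither VI nor MCMC. The function names, prior settings, and data are illustrative assumptions, not the notebook's implementation.

```python
import numpy as np

def posterior(X, y, noise_var, prior_var):
    """Gaussian posterior over coefficients under a N(0, prior_var * I) prior."""
    p = X.shape[1]
    precision = X.T @ X / noise_var + np.eye(p) / prior_var
    cov = np.linalg.inv(precision)
    mean = cov @ X.T @ y / noise_var
    return mean, cov

def predictive(X_new, mean, cov, noise_var):
    """Predictive mean and standard deviation for each row of X_new."""
    mu = X_new @ mean
    var = noise_var + np.einsum("ij,jk,ik->i", X_new, cov, X_new)
    return mu, np.sqrt(var)

# Synthetic data: y = 1 + 2x + noise, observed for x in [-1, 1] (assumption).
rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, 50)
X = np.column_stack([np.ones(50), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=50)

mean, cov = posterior(X, y, noise_var=0.09, prior_var=10.0)

# Compare a point inside the observed range (x = 0) with one far outside (x = 5).
X_new = np.array([[1.0, 0.0], [1.0, 5.0]])
mu, sd = predictive(X_new, mean, cov, noise_var=0.09)
print("posterior mean:", np.round(mean, 3))
print("predictive sd at x=0 and x=5:", np.round(sd, 3))
```

The predictive standard deviation grows away from the observed data, which is the behavior an uncertainty plot is meant to expose.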
- Synthesized all key learnings, model results, and project outcomes from the entire internship.
- Developed a comprehensive slide deck covering the progression from foundational statistical learning theory to advanced Bayesian modeling.
- Structured the presentation narrative to highlight the practical insights gained from applying different models to real-world datasets (Diamond Dataset).
- Conducted rehearsals to refine the content and ensure a clear and effective final delivery.
Deliverable: Final_Presentation (PDF)
- Authored the final internship report, documenting all work completed.
- Detailed the theoretical concepts, from-scratch implementation methods, and empirical results for every model studied.
- Formatted the entire report using LaTeX to meet academic standards, including all figures, tables, and citations.
- Submitted the final report as the primary deliverable, summarizing the complete scope of the internship at IIT Madras.
Deliverable: Final_Report (PDF)
In addition to the weekly notebooks, the core algorithms were implemented from scratch in the following Python scripts:
- All_Regression_Models.py: Contains from-scratch Python implementations of all supervised regression models covered in this internship, including OLS, Ridge, Lasso, Regression Trees, Random Forest, and XGBoost.
- Probabilistic_Models.py: Contains from-scratch implementations of Bayesian models and their inference algorithms, including Bayesian Linear/Logistic Regression with Laplace, VI, and MCMC samplers.
This project uses the Conda environment defined here:
🔗 iitm-stats-learning/Conda_Env
```bash
git clone https://github.com/iitm-stats-learning/Conda_Env.git
cd Conda_Env
conda env create -f stats_learning.yml
conda activate stats-learning
jupyter notebook
```