Supervised Learning

Regression vs Classification

Binary vs Multi-Class vs Multi-Label classification

Linear Regression

Models a linear relationship between a dependent (outcome/target) variable and one or more independent (predictor) variables.

Assumptions / Conditions

| Assumption | Description |
| --- | --- |
| Linearity | Relationship between features and target is linear |
| Homoscedasticity | Constant variance of errors |
| No Multicollinearity | Features are not highly correlated |
| Independence | Observations are independent |
| Normality | Residuals are normally distributed |

Target

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \epsilon

Predicted

\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \dots + \hat{\beta}_n x_n

MSE (Cost function)

J(\theta) = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2

Code Sample

from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression

# Create dummy data
X, y = make_regression(n_samples=100, n_features=1, noise=15)

# Fit the model
model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)

Loss Functions

  • Mean Squared Error (MSE)

  • Mean Absolute Error (MAE)

Evaluation Metrics

  • R² Score

  • Root Mean Squared Error

  • MAE
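
A minimal sketch of computing these metrics with sklearn.metrics, reusing the y and y_pred variables from the code sample above:

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Compare predictions against the true targets from the code sample above
mse = mean_squared_error(y, y_pred)   # Mean Squared Error
rmse = np.sqrt(mse)                   # Root Mean Squared Error
mae = mean_absolute_error(y, y_pred)  # Mean Absolute Error
r2 = r2_score(y, y_pred)              # R² Score
print(f"RMSE={rmse:.2f}  MAE={mae:.2f}  R²={r2:.3f}")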

Optimization Techniques

  • Gradient Descent

  • Normal Equation (Analytical Solution)

  • L1 (Lasso) & L2 (Ridge) Regularization
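
A minimal NumPy sketch of the normal equation (the analytical solution), assuming the X and y from the code sample above; it is illustrative only and skips the numerical-stability tricks a library solver would use:

import numpy as np

# Normal equation: beta = (X^T X)^(-1) X^T y
# Prepend a column of ones so beta[0] acts as the intercept (beta_0)
X_b = np.c_[np.ones((X.shape[0], 1)), X]
beta = np.linalg.inv(X_b.T @ X_b) @ (X_b.T @ y)
print("Intercept:", beta[0], " Coefficient(s):", beta[1:])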

Common Issues

| Issue | Description |
| --- | --- |
| Overfitting | Especially with too many features |
| Underfitting | When model is too simple |
| Outliers | Can distort the line |
| Collinearity | Leads to unstable coefficients |
| Assumption Violations | Leads to incorrect inferences |

Further Reading: https://mlu-explain.github.io/linear-regression/


Logistic Regression

Predicts the probability of categorical outcomes (classes) based on input features.

It models the relationship using the logit (log-odds) function instead of a straight line.
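
Concretely, the predicted probability is the sigmoid of the same linear combination of inputs used in linear regression:

\hat{p} = \sigma(z) = \frac{1}{1 + e^{-z}}, \quad z = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n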

Common application: binary classification, where the outcome belongs to one of two classes.

Also extended to multi-class and multi-label classification, e.g. via OneVsRestClassifier (see the sketch below).
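
A minimal sketch of the one-vs-rest extension, wrapping LogisticRegression for a three-class toy dataset (the dataset parameters are arbitrary, chosen only so make_classification can generate three classes):

from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Three-class toy data (n_informative raised so three classes are possible)
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=42)

# One binary logistic regression is fitted per class (that class vs the rest)
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, y)
print(ovr.predict(X[:5]))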

Assumptions / Conditions

| Assumption | Description |
| --- | --- |
| Binary/Multinomial Outcome | The dependent variable is categorical |
| Linearity of Logit | Linear relationship between predictors and log-odds |
| No Multicollinearity | Features should not be highly correlated |
| Large Sample Size | Especially for rare outcomes |
| Independent Observations | Each observation is assumed to be independent |

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Create dummy data
X, y = make_classification(n_samples=100, n_features=2, n_classes=2, random_state=42)

# Fit the model
model = LogisticRegression()
model.fit(X, y)
y_pred = model.predict(X)

Loss Functions

  • Binary Cross Entropy (Log Loss)

  • Categorical Cross Entropy (for multi-class)

Evaluation Metrics

  • Accuracy

  • Precision, Recall, F1 Score

  • ROC-AUC

  • Log Loss
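
A minimal sketch of these metrics with sklearn.metrics, reusing model, X, and y from the snippet above; ROC-AUC and log loss are computed from predicted probabilities rather than hard labels:

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, log_loss)

y_pred = model.predict(X)
y_proba = model.predict_proba(X)[:, 1]  # probability of the positive class

print("Accuracy :", accuracy_score(y, y_pred))
print("Precision:", precision_score(y, y_pred))
print("Recall   :", recall_score(y, y_pred))
print("F1 Score :", f1_score(y, y_pred))
print("ROC-AUC  :", roc_auc_score(y, y_proba))
print("Log Loss :", log_loss(y, y_proba))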

Optimization Techniques

  • Gradient Descent

  • L2 Regularization (Ridge-like)

  • L1 Regularization (Lasso-like)

Common Issues

| Issue | Description |
| --- | --- |
| Overfitting | Happens with high dimensional feature space |
| Non-linearity | Poor fit if decision boundary isn’t linear |
| Imbalanced Data | Skews performance, needs resampling |
| Multicollinearity | Makes coefficients unstable |

Further Reading: https://mlu-explain.github.io/logistic-regression/


Decision Tree

A decision tree is used for both classification and regression tasks.

It splits the data into subsets based on the values of input features, forming a tree structure that is used to make decisions.

Each node represents a decision.

Each branch represents an outcome of that decision.

Each leaf node represents a final decision, i.e. a classification (or a predicted value for regression).

Structure: Root Node -> Branches -> Internal Nodes -> Branches -> ... -> Leaf Nodes

Assumptions / Conditions

| Assumption | Description |
| --- | --- |
| Features are Independent | Each feature is considered separately for splitting. |
| Sufficient Data | Pruning or constraints are required to avoid overfitting. |
| Data is Clean | Noisy data may lead to overly complex trees. |
| Recursive Binary Splits | Splits are made recursively, usually into two branches. |

Key concepts (refer to the sklearn documentation for details):

| Concept | Description | How It Works / Why It Matters |
| --- | --- | --- |
| Entropy | Measures disorder/uncertainty in the dataset. | High entropy = more mixed classes. We want to reduce entropy with every split to get purer subsets. |
| Information Gain | How much entropy decreases after splitting. | Used to decide which feature to split on; the feature that reduces uncertainty the most is chosen. |
| Gini Impurity | Measures the chance of misclassifying a sample. | Lower Gini = purer split. Fast to compute and works well in practice. |
| Gain Ratio | Adjusts Information Gain by penalizing high-cardinality features. | Prevents bias towards features with many unique values. |
| Chi-Square | Statistical test to see if a split is meaningful. | Bigger chi-square value = more significant split. |
| Max Depth | Maximum depth of the tree. | Shallow trees generalize better, avoid overfitting. |
| Min Samples Split | Minimum samples required to split a node. | Prevents splits on small, noisy sets. |
| Min Samples Leaf | Minimum samples required to be at a leaf node. | Smooths predictions and reduces variance. |
| Max Features | Number of features to consider when looking for the best split. | Adds randomness, helps in ensemble methods like Random Forest. |
| Pruning (Pre or Post) | Removes unnecessary branches. | Reduces model complexity and improves generalization by trimming low-information parts of the tree. |
| Overfitting | When the tree memorizes training data. | Deep trees with few constraints tend to overfit. Fix using pruning or regularization. |
| Underfitting | When the tree is too simple to learn from the data. | Usually caused by too shallow trees or too strict constraints. |
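
A small NumPy sketch of entropy, Gini impurity, and information gain for one hypothetical split (the class counts are made up purely for illustration):

import numpy as np

def entropy(counts):
    p = np.array(counts) / np.sum(counts)
    p = p[p > 0]                      # drop zero-probability classes to avoid log(0)
    return -np.sum(p * np.log2(p))

def gini(counts):
    p = np.array(counts) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

# Hypothetical parent node: 10 samples of class A, 10 of class B
parent = [10, 10]
# Hypothetical candidate split -> left child [8, 2], right child [2, 8]
left, right = [8, 2], [2, 8]

n = sum(parent)
weighted_child_entropy = (sum(left) / n) * entropy(left) + (sum(right) / n) * entropy(right)
information_gain = entropy(parent) - weighted_child_entropy

print("Parent entropy  :", entropy(parent))   # 1.0 = maximally mixed
print("Information gain:", information_gain)  # how much the split reduces entropy
print("Gini (left)     :", gini(left))        # lower = purer child node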

from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

# Load dataset
X, y = load_iris(return_X_y=True)

# Fit the model
model = DecisionTreeClassifier(max_depth=3)
model.fit(X, y)

# Visualize
plt.figure(figsize=(12, 8))
plot_tree(model, filled=True, feature_names=["Sepal Length", "Sepal Width", "Petal Length", "Petal Width"])
plt.show()

Loss Functions

  • Gini Impurity

  • Entropy (for classification)

  • MSE / MAE (for regression)

Evaluation Metrics

| Classification | Regression |
| --- | --- |
| Accuracy | R² Score |
| Precision, Recall, F1 | RMSE, MAE |
| Confusion Matrix | MSE |

Optimization Techniques

  • Pruning (pre-pruning, post-pruning)

  • Max Depth, Min Samples Split/Leaf

  • Feature Selection Criteria

Common Issues

| Issue | Description |
| --- | --- |
| Overfitting | Deep trees may fit noise in training data |
| Instability | Small changes in data can result in a very different tree |
| Bias towards features with more levels | May prefer variables with many categories |
| Low Generalization | Can perform poorly on unseen data without pruning |

Further Reading: https://mlu-explain.github.io/decision-tree/


Ensemble Learning/Methods

The idea of combining multiple models to achieve better performance.

By aggregating the predictions of different models, we can reduce variance and/or bias and improve generalization.

Types: Bagging, Boosting, Stacking

When to use Bagging vs Boosting

Bagging: When you have a high-variance model that is overfitting the training data.

Boosting: When you have a model with high bias (and/or high variance) that you need to improve; a comparison sketch follows below.
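
A minimal comparison sketch using scikit-learn's BaggingClassifier and GradientBoostingClassifier as representative bagging and boosting implementations (XGBoost below is another boosting option); the dataset is a toy one generated only for the comparison:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Bagging: high-variance trees (the default base estimator is a decision tree)
# trained on bootstrap samples, then averaged to reduce variance
bagging = BaggingClassifier(n_estimators=50, random_state=42)

# Boosting: shallow trees added sequentially, each correcting the previous ensemble's errors
boosting = GradientBoostingClassifier(n_estimators=50, max_depth=2, random_state=42)

print("Bagging :", cross_val_score(bagging, X, y, cv=5).mean())
print("Boosting:", cross_val_score(boosting, X, y, cv=5).mean())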

Random Forest

A Random Forest combines multiple decision trees to make predictions, using bagging (bootstrap aggregating).

Each tree is trained on a random subset of the data (with replacement).

At each split, a random subset of features is selected.

For classification, the final output is the majority vote.

For regression, the final output is the average prediction.

Assumptions / Conditions

| Assumption / Condition | Description |
| --- | --- |
| Independence of features | Features should be informative and not highly correlated |
| Sufficient data | More data = better diversity in trees |
| No need for feature scaling | Random Forest is scale-invariant |

Cost Function / Objective

Random Forest has no global cost function like MSE or cross-entropy. Each tree is built to minimize impurity (Gini or Entropy), and the final prediction is based on ensemble voting or averaging.

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the model
model = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

Common Issues

| Issue | Description |
| --- | --- |
| Overfitting | With too many deep trees on noisy data |
| Interpretability | Hard to interpret full forest compared to a single tree |
| Bias in Imbalanced Datasets | Tends to favor majority class |
| Large Size | Can be slow or memory intensive with too many estimators |

XGBoost (Extreme Gradient Boosting)

XGBoost is a fast, regularized, and scalable implementation of Gradient Boosted Decision Trees (GBDT). It builds trees sequentially, where each tree tries to correct the errors made by the previous ones.
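
In the standard GBDT notation, the prediction after boosting round t adds a new tree f_t, scaled by the learning rate \eta, to the previous ensemble's prediction:

\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + \eta \, f_t(x_i)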

Assumptions / Conditions

| Assumption / Condition | Description |
| --- | --- |
| Additive Model | New trees correct residuals of previous ensemble |
| No need for feature scaling | Works well with raw or normalized data |
| Clean data preferred | Sensitive to outliers and noise |
| Independent features help | Redundant features may affect performance |

import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train the model
model = xgb.XGBClassifier(eval_metric='logloss', max_depth=3, n_estimators=100)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

Common Issues

| Issue | Description |
| --- | --- |
| Overfitting | Happens with too many trees or no regularization |
| High memory usage | Can be computationally expensive on large datasets |
| Sensitive to noisy labels | Can overfit if data is not clean |
| Harder to interpret | Compared to single decision trees |

Further Reading: Official XGBoost Docs (https://xgboost.readthedocs.io/)

K-Nearest Neighbour

kNN is a non-parametric, instance-based model: it assumes that similar data points are located close to each other in feature space.

The label or value of a new data point is inferred from its closest neighbors.

It is based on a distance metric such as Euclidean, Manhattan, Minkowski, or Hamming distance, as shown below.
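
For reference, the two most commonly used of these metrics, for points x and z with n features:

Euclidean: d(x, z) = \sqrt{\sum_{i=1}^{n} (x_i - z_i)^2}

Manhattan: d(x, z) = \sum_{i=1}^{n} |x_i - z_i|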

Assumptions / Conditions

| Assumption / Condition | Description |
| --- | --- |
| Instance-based learning | No explicit training; relies on storing and comparing instances. |
| Local similarity assumption | Assumes that similar points exist in close proximity in the feature space. |
| Sensitivity to feature scale | Distance-based metrics require features to be on similar scales. |
| Impact of irrelevant features | Irrelevant or noisy features can distort distance calculations. |

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Feature scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train kNN model
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Predict and evaluate
y_pred = knn.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

Common Issues

| Issue | Description |
| --- | --- |
| Poor scalability | Slow prediction with large datasets (must compare to all training points). |
| Curse of dimensionality | High-dimensional data can weaken the meaning of "closeness". |
| Requires feature normalization | Performance highly depends on the proper scaling of features. |
