Supervised Learning
Regression vs Classification
Binary vs Multi-Class vs Multi-Label classification
Linear Regression
Models a linear relationship between the dependent (outcome/target) variable and one or more independent (predictor) variables.
Assumptions / Conditions
Linearity
Relationship between features and target is linear
Homoscedasticity
Constant variance of errors
No Multicollinearity
Features are not highly correlated
Independence
Observations are independent
Normality
Residuals are normally distributed
MSE (Cost function)
MSE = (1/n) Σ (yᵢ - ŷᵢ)², where yᵢ is the target and ŷᵢ is the predicted value.
Code Sample
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
# Create dummy data
X, y = make_regression(n_samples=100, n_features=1, noise=15)
# Fit the model
model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)
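As a quick check, the fit can be scored with the loss functions and evaluation metrics listed in the next sections. A minimal sketch continuing the code sample above (scored on the training data for brevity; in practice use a held-out test set):
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# Compare predictions against the true targets
print("MSE:", mean_squared_error(y, y_pred))
print("MAE:", mean_absolute_error(y, y_pred))
print("R²:", r2_score(y, y_pred))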
Loss Functions
Mean Squared Error (MSE)
Mean Absolute Error (MAE)
Evaluation Metrics
R² Score
Root Mean Squared Error (RMSE)
Mean Absolute Error (MAE)
Optimization Techniques
Gradient Descent
Normal Equation (Analytical Solution; see the sketch after this list)
L1 (Lasso) & L2 (Ridge) Regularization
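A minimal NumPy sketch of the normal equation and batch gradient descent on the MSE cost, assuming X and y from the code sample above (the learning rate and iteration count are illustrative):
import numpy as np
# Add a bias column so the intercept is learned along with the weights
Xb = np.c_[np.ones((X.shape[0], 1)), X]
# Normal equation: theta = (X^T X)^(-1) X^T y (pinv used for numerical stability)
theta_ne = np.linalg.pinv(Xb.T @ Xb) @ Xb.T @ y
# Batch gradient descent on the MSE cost
theta_gd = np.zeros(Xb.shape[1])
lr = 0.1  # illustrative learning rate
for _ in range(1000):
    gradient = (2 / len(y)) * Xb.T @ (Xb @ theta_gd - y)
    theta_gd -= lr * gradient
print("Normal equation:", theta_ne)
print("Gradient descent:", theta_gd)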
Common Issues
Overfitting
Especially with too many features
Underfitting
When model is too simple
Outliers
Can distort the line
Collinearity
Leads to unstable coefficients
Assumption Violations
Leads to incorrect inferences
Further Reading: https://mlu-explain.github.io/linear-regression/
Logistic Regression
Predicts the probability that an observation belongs to a given class (category), based on the input features.
It models the relationship using the logit (log-odds) function instead of a straight line.
Common application: the outcome belongs to one of two classes (binary classification).
Also extended to predict multiple classes/labels (e.g. via OneVsRestClassifier): multi-class classification, multi-label classification.
Assumptions / Conditions
Binary/Multinomial Outcome
The dependent variable is categorical
Linearity of Logit
Linear relationship between predictors and log-odds
No Multicollinearity
Features should not be highly correlated
Large Sample Size
Especially for rare outcomes
Independent Observations
Each observation is assumed to be independent
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
# Create dummy data
X, y = make_classification(n_samples=100, n_features=2, n_informative=2, n_redundant=0, n_classes=2, random_state=42)  # n_informative + n_redundant must not exceed n_features
# Fit the model
model = LogisticRegression()
model.fit(X, y)
y_pred = model.predict(X)
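Because logistic regression outputs class probabilities, predict_proba can be inspected alongside the usual metrics. A minimal sketch continuing the code above (scored on the training data for brevity):
from sklearn.metrics import accuracy_score, roc_auc_score, log_loss, classification_report
# Probability of the positive class for each sample
y_proba = model.predict_proba(X)[:, 1]
print("Accuracy:", accuracy_score(y, y_pred))
print("ROC-AUC:", roc_auc_score(y, y_proba))
print("Log Loss:", log_loss(y, y_proba))
print(classification_report(y, y_pred))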
Loss Functions
Binary Cross Entropy (Log Loss)
Categorical Cross Entropy (for multi-class)
Evaluation Metrics
Accuracy
Precision, Recall, F1 Score
ROC-AUC
Log Loss
Optimization Techniques
Gradient Descent
L2 Regularization (Ridge-like)
L1 Regularization (Lasso-like; see the sketch after this list)
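A minimal sketch of how these penalties are chosen in scikit-learn, reusing X and y from the code sample above (the C values are illustrative; L1 needs a compatible solver such as liblinear or saga):
# L2 is the default penalty; C is the inverse of the regularization strength
l2_model = LogisticRegression(penalty='l2', C=0.5).fit(X, y)
# L1 shrinks some coefficients exactly to zero (sparse solutions)
l1_model = LogisticRegression(penalty='l1', solver='liblinear', C=0.5).fit(X, y)
print("L2 coefficients:", l2_model.coef_)
print("L1 coefficients:", l1_model.coef_)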
Common Issues
Overfitting
Happens in high-dimensional feature spaces
Non-linearity
Poor fit if decision boundary isn’t linear
Imbalanced Data
Skews performance, needs resampling
Multicollinearity
Makes coefficients unstable
Further Reading: https://mlu-explain.github.io/logistic-regression/
Decision Tree
A decision tree is used for both classification and regression tasks.
It splits the data into subsets based on the values of input features, forming a tree structure that is used to make decisions.
Each node represents a decision.
Each branch represents an outcome of that decision.
Each leaf node represents a final decision, i.e. the classification.
Structure: Root Node -> Branches -> Internal Nodes -> Branches -> ... -> Leaf Nodes
Assumptions / Conditions
Features are Independent
Each feature is considered separately for splitting.
Sufficient Data
Enough samples are needed for reliable splits; otherwise pruning or constraints are required to avoid overfitting.
Data is Clean
Noisy data may lead to overly complex trees.
Recursive Binary Splits
Splits are made recursively, usually into two branches.
Key concepts (refer to the sklearn documentation; a worked sketch of the splitting criteria appears after the Chi-Square entry below)
Entropy
Measures disorder/uncertainty in the dataset.
High entropy = more mixed classes. We want to reduce entropy with every split to get purer subsets.
Information Gain
How much entropy decreases after splitting.
Used to decide which feature to split on; the feature that reduces uncertainty the most is chosen.
Gini Impurity
Measures the chance of misclassifying a sample.
Lower Gini = purer split. Fast to compute and works well in practice.
Gain ratio
Adjusts Information Gain by penalizing high-cardinality features.
Prevents bias towards features with many unique values.
Chi-Square
Statistical test to see if split is meaningful.
Bigger chi-square value = more significant split.
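A small worked sketch of these splitting criteria on a toy label split (plain NumPy; the class counts are made up for illustration):
import numpy as np

def entropy(labels):
    # H = -sum(p * log2(p)) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # G = 1 - sum(p^2) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

# Toy parent node: 5 positives and 5 negatives, split into a pure left
# child (4 positives) and a mixed right child (1 positive, 5 negatives)
parent = np.array([1] * 5 + [0] * 5)
left = np.array([1] * 4)
right = np.array([1] * 1 + [0] * 5)
# Information gain = parent entropy - weighted average child entropy
gain = entropy(parent) \
    - (len(left) / len(parent)) * entropy(left) \
    - (len(right) / len(parent)) * entropy(right)
print("Parent entropy:", entropy(parent), "Parent Gini:", gini(parent))
print("Information gain of this split:", gain)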
Max depth
Maximum depth of the tree.
Shallow trees generalize better, avoid overfitting.
Min samples split
Minimum samples required to split a node.
Prevents splits on small, noisy sets.
Min samples leaf
Minimum samples required to be at a leaf node.
Smooths predictions and reduces variance.
Max Features
Number of features to consider when looking for the best split.
Adds randomness, helps in ensemble methods like Random Forest.
Pruning (Post or Pre)
Removes unnecessary branches.
Reduces model complexity and improves generalization by trimming low-information parts of the tree.
Overfitting
When the tree memorizes training data.
Deep trees with few constraints tend to overfit. Fix using pruning or regularization.
Underfitting
When the tree is too simple to learn from the data.
Usually caused by too shallow trees or too strict constraints.
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
# Load dataset
X, y = load_iris(return_X_y=True)
# Fit the model
model = DecisionTreeClassifier(max_depth=3)
model.fit(X, y)
# Visualize
plt.figure(figsize=(12, 8))
plot_tree(model, filled=True, feature_names=["Sepal Length", "Sepal Width", "Petal Length", "Petal Width"])
plt.show()
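Post-pruning is available in scikit-learn through cost-complexity pruning. A minimal sketch continuing the example above (the ccp_alpha value is illustrative, not tuned):
# Candidate alpha values along the cost-complexity pruning path
path = model.cost_complexity_pruning_path(X, y)
print("Candidate ccp_alphas:", path.ccp_alphas)
# Refit with an illustrative alpha to trim low-information branches
pruned = DecisionTreeClassifier(max_depth=3, ccp_alpha=0.01)
pruned.fit(X, y)
print("Nodes before/after pruning:", model.tree_.node_count, pruned.tree_.node_count)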
Loss Functions
Gini Impurity
Entropy (for classification)
MSE / MAE (for regression)
Evaluation Metrics
Accuracy
R² Score
Precision, Recall, F1
RMSE, MAE
Confusion Matrix
MSE
Optimization Techniques
Pruning (pre-pruning, post-pruning)
Max Depth, Min Samples Split/Leaf
Feature Selection Criteria
Common Issues
Overfitting
Deep trees may fit noise in training data
Instability
Small changes in data can result in a very different tree
Bias towards features with more levels
May prefer variables with many categories
Low Generalization
Can perform poorly on unseen data without pruning
Further Reading: https://mlu-explain.github.io/decision-tree/
Ensemble Learning/Methods
Idea: combining multiple models leads to better performance.
By aggregating the predictions of different models, we can reduce variance and/or bias and improve generalization, often outperforming any single model.
Types: Bagging, Boosting, Stacking (see the sketch below)

When to use Bagging vs Boosting
Bagging: When you have a high-variance model that is overfitting the training data.
Boosting: When you have a model with high bias (and possibly high variance) whose performance you need to improve.
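A minimal sketch of how the three types map onto scikit-learn estimators (the base models and settings are illustrative, not a recommendation):
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=42)
# Bagging: parallel trees on bootstrap samples (reduces variance);
# the default base estimator is a decision tree
bagging = BaggingClassifier(n_estimators=50, random_state=42)
# Boosting: sequential trees, each correcting the previous ones (reduces bias)
boosting = GradientBoostingClassifier(n_estimators=50, random_state=42)
# Stacking: a meta-model learns how to combine the base models' predictions
stacking = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()), ("lr", LogisticRegression())],
    final_estimator=LogisticRegression(),
)
for name, clf in [("bagging", bagging), ("boosting", boosting), ("stacking", stacking)]:
    print(name, clf.fit(X, y).score(X, y))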
Random Forest
Random Forest combines multiple decision trees to make predictions (Bagging: Bootstrap Aggregating).
Each tree is trained on a random subset of the data (with replacement).
At each split, a random subset of features is selected.
For classification, the final output is the majority vote.
For regression, the final output is the average prediction.
Assumptions / Conditions
Independence of features
Features should be informative and not highly correlated
Sufficient data
More data = better diversity in trees
No need for feature scaling
Random Forest is scale-invariant
Cost Function / Objective
Random Forest has no global cost function like MSE or cross-entropy. Each tree is built to minimize impurity (Gini or Entropy), and the final prediction is based on ensemble voting or averaging.
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train the model
model = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42)
model.fit(X_train, y_train)
# Predict and evaluate
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
Common Issues
Overfitting
Deep trees fit to noisy data can still overfit, even in an ensemble
Interpretability
Hard to interpret full forest compared to a single tree
Bias in Imbalanced Datasets
Tends to favor majority class
Large Size
Can be slow or memory intensive with too many estimators
Further Reading: https://mlu-explain.github.io/random-forest/
XGBoost (Extreme Gradient Boosting)
XGBoost is a fast, regularized, and scalable implementation of Gradient Boosted Decision Trees (GBDT). It builds trees sequentially, where each tree tries to correct the errors made by the previous ones.
Assumptions / Conditions
Additive Model
New trees correct residuals of previous ensemble
No need for feature scaling
Works well with raw or normalized data
Clean data preferred
Sensitive to outliers and noise
Independent features help
Redundant features may affect performance
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load dataset
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train the model
model = xgb.XGBClassifier(eval_metric='logloss', max_depth=3, n_estimators=100)  # use_label_encoder is deprecated and no longer needed
model.fit(X_train, y_train)
# Predict and evaluate
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
Common Issues
Overfitting
Happens with too many trees or no regularization
High memory usage
Can be computationally expensive on large datasets
Sensitive to noisy labels
Can overfit if data is not clean
Harder to interpret
Compared to single decision trees
Further Reading: Official XGBoost Docs (https://xgboost.readthedocs.io/)
K-Nearest Neighbour
kNN is a non-parametric model; it assumes that similar data points are located close to each other in feature space.
The label or value of a new data point is inferred from its closest neighbors.
Distance can be measured with Euclidean, Manhattan, Minkowski, or Hamming distance.
Assumptions / Conditions
Instance-based learning
No explicit training; relies on storing and comparing instances.
Local similarity assumption
Assumes that similar points exist in close proximity in the feature space.
Sensitivity to feature scale
Distance-based metrics require features to be on similar scales.
Impact of irrelevant features
Irrelevant or noisy features can distort distance calculations.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# Load dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Feature scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Train kNN model
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
# Predict and evaluate
y_pred = knn.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
Common Issues
Poor scalability
Slow prediction with large datasets (must compare to all training points).
Curse of dimensionality
High-dimensional data can weaken the meaning of "closeness".
Requires feature normalization
Performance highly depends on the proper scaling of features.