Machine Learning Interview Questions
Master your Machine Learning interviews with comprehensive questions and code examples covering fundamentals, algorithms, model evaluation, and advanced production techniques for freshers and experienced professionals.
I. Beginner Level
1. What is Machine Learning?
Machine Learning (ML) is a subset of Artificial Intelligence that enables computer systems to learn from data and improve their performance on tasks without being explicitly programmed. Instead of writing rules manually, ML algorithms discover patterns and build models from training data.
Example (training a simple model with scikit-learn):
from sklearn.linear_model import LinearRegression
import numpy as np

# Training data: house size (sqft) -> price
X = np.array([[500], [800], [1200], [1500], [2000]])
y = np.array([150000, 200000, 280000, 320000, 400000])

model = LinearRegression()
model.fit(X, y)

# Predict price for a 1000 sqft house
print(model.predict([[1000]]))  # a single-element array with the predicted price
Machine Learning powers applications like spam detection, recommendation engines, fraud detection, image recognition, and many other real-world systems.
2. What are the main types of Machine Learning?
Machine Learning is broadly categorized into three main types based on how models learn from data.
Supervised Learning: The model is trained on labeled data (input-output pairs). Examples: linear regression, decision trees, SVM.
Unsupervised Learning: The model finds hidden patterns in unlabeled data. Examples: K-Means clustering, PCA, autoencoders.
Reinforcement Learning: An agent learns to make decisions by receiving rewards or penalties from the environment. Examples: game-playing AI, robotics.
A fourth category, semi-supervised learning, combines labeled and unlabeled data and is growing in importance for real-world datasets where labeling is expensive.
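The supervised and unsupervised paradigms can be contrasted directly on the same toy data. A minimal sketch with scikit-learn (reinforcement learning is omitted since it requires an environment loop; the data and cluster ids here are illustrative, and cluster ids are arbitrary):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])  # labels available -> supervised

# Supervised: learn the mapping X -> y from labeled pairs
clf = LogisticRegression().fit(X, y)
print(clf.predict([[2.5], [11.5]]))  # [0 1]

# Unsupervised: discover the same grouping without any labels
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(km.labels_)  # two clusters; which id maps to which group is arbitrary
```

The same structure is recovered either way; the difference is whether the correct answers were provided during training.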
3. What is supervised learning? Give an example.
Supervised learning is a type of ML where the model is trained on a dataset that contains both input features and the correct output labels. The model learns to map inputs to outputs by minimizing prediction errors on the training data.
Example (email spam classification):
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

# Training data: emails and labels (1=spam, 0=not spam)
emails = ["Win a free iPhone now", "Meeting at 3pm", "Claim your prize", "Project update"]
labels = [1, 0, 1, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

model = MultinomialNB()
model.fit(X, labels)

# Predict
new_email = vectorizer.transform(["Free prize waiting for you"])
print(model.predict(new_email))  # Output: [1] (spam)
Supervised learning is the most widely used ML paradigm and forms the foundation of most production ML systems.
4. What is unsupervised learning? Give an example.
Unsupervised learning is a type of ML where the model is trained on data that has no labels. The algorithm must discover the underlying structure, patterns, or groupings in the data on its own without any guidance.
Example (customer segmentation with K-Means):
from sklearn.cluster import KMeans
import numpy as np

# Customer data: [age, spending_score]
X = np.array([[25, 80], [30, 60], [45, 20], [50, 10], [22, 90], [35, 50]])

kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(X)

print(kmeans.labels_)  # e.g., [1, 1, 0, 0, 1, 1]
# Cluster 1: young high-spenders, Cluster 0: older low-spenders
Unsupervised learning is used when labeled data is expensive or unavailable, and is common in exploratory data analysis and dimensionality reduction.
5. What is a label in machine learning?
A label (also called the target variable or output variable) is the value the model is trained to predict. In supervised learning, each training example consists of input features and a corresponding label that represents the correct answer.
Example (identifying features vs. labels):
import pandas as pd

data = pd.DataFrame({
    "age": [25, 30, 45],               # feature
    "salary": [50000, 70000, 90000],   # feature
    "purchased": [0, 1, 1]             # label (target)
})

X = data[["age", "salary"]]  # features
y = data["purchased"]        # label
print(X)
print(y)
Labels can be continuous values (for regression) or discrete categories (for classification). High-quality labels are critical: noisy or incorrect labels directly degrade model performance.
6. What is a feature in machine learning?
A feature is an individual measurable input variable used to make predictions. Features are the columns of your dataset that the model uses to learn patterns. The quality and relevance of features directly determine model performance — this is why feature engineering is so important.
Example (feature types):
import pandas as pd

df = pd.DataFrame({
    "square_footage": [1200, 1500, 900],           # numerical feature
    "location": ["urban", "suburban", "rural"],    # categorical feature
    "has_garage": [True, False, True],             # binary feature
    "price": [300000, 250000, 180000]              # label (target)
})

print(df.dtypes)
print(df.head())
Features can be numerical, categorical, binary, text, or image-based. Choosing the right features and transforming them appropriately is called feature engineering.
7. What is a train-test split and why is it important?
A train-test split is the practice of dividing a dataset into two parts: a training set used to train the model, and a test set used to evaluate how well the model generalizes to unseen data. Without this split, you cannot reliably measure real-world performance.
Example:
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Train size: {len(X_train)}")  # 120
print(f"Test size: {len(X_test)}")    # 30
A typical split is 80/20 or 70/30 (train/test). For smaller datasets, k-fold cross-validation is preferred over a simple split.
8. What is overfitting and how do you prevent it?
Overfitting occurs when a model learns the training data too well — including its noise and outliers — resulting in poor generalization to new, unseen data. An overfit model has very high training accuracy but significantly lower test accuracy.
Example (detecting overfitting by comparing train vs. test accuracy):
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Overfit model (no depth limit)
model = DecisionTreeClassifier()  # unlimited depth
model.fit(X_train, y_train)
print(f"Train acc: {model.score(X_train, y_train):.2f}")  # 1.00 (memorized)
print(f"Test acc: {model.score(X_test, y_test):.2f}")     # ~0.93 (drop)

# Fix: limit depth (regularization)
model_reg = DecisionTreeClassifier(max_depth=3)
model_reg.fit(X_train, y_train)
print(f"Regularized test acc: {model_reg.score(X_test, y_test):.2f}")  # better generalization
To prevent overfitting:
Use more training data to expose the model to more diverse examples.
Apply regularization (L1/L2, max_depth, min_samples_leaf).
Use cross-validation to detect the gap between training and validation performance.
The train-test accuracy gap is the primary signal of overfitting. Always monitor both metrics during model development.
9. What is underfitting in machine learning?
Underfitting happens when a model is too simple to capture the underlying patterns in the data, resulting in poor performance on both training and test sets. An underfit model has high bias and low variance.
Example (linear model underfitting non-linear data):
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

# Non-linear data: y = x^2 + noise
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = X.ravel() ** 2 + np.random.randn(100) * 0.5

# Underfit: linear model on non-linear data
linear = LinearRegression().fit(X, y)
print(f"Linear MSE: {mean_squared_error(y, linear.predict(X)):.2f}")  # high error

# Fix: polynomial features
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
better = LinearRegression().fit(X_poly, y)
print(f"Poly MSE: {mean_squared_error(y, better.predict(X_poly)):.2f}")  # low error
To fix underfitting: use a more complex model, add more features, increase polynomial degree, or reduce regularization strength.
10. What is linear regression?
Linear regression is a supervised learning algorithm that models the relationship between one or more input features and a continuous output variable by fitting a straight line (or hyperplane) to the data. It minimizes the sum of squared errors between predictions and actual values.
Example:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Years of experience -> Salary
X = np.array([[1], [2], [3], [5], [7], [10]])
y = np.array([40000, 50000, 60000, 80000, 100000, 130000])

model = LinearRegression()
model.fit(X, y)

print(f"Coefficient: {model.coef_[0]:.2f}")  # slope
print(f"Intercept: {model.intercept_:.2f}")  # bias
print(f"R2 Score: {r2_score(y, model.predict(X)):.2f}")
print(f"Predict 8yrs: {model.predict([[8]])[0]:.0f}")
Linear regression assumes a linear relationship between inputs and output, that residuals are normally distributed, and that features are not highly correlated (no multicollinearity).
11. What is logistic regression and when is it used?
Logistic regression is a supervised learning algorithm used for classification problems. Despite its name, it is a classifier, not a regressor. It models the probability that an input belongs to a given class using the sigmoid function, which maps any real number to a value between 0 and 1.
Example (binary classification — pass/fail):
from sklearn.linear_model import LogisticRegression
import numpy as np

# Hours studied -> Pass (1) or Fail (0)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

print(model.predict([[3.5]]))        # likely 0 (fail)
print(model.predict_proba([[5.5]]))  # probability of each class
Logistic regression is widely used for binary classification tasks such as spam detection, disease diagnosis, and credit risk assessment.
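The sigmoid mapping can also be verified by hand: for a fitted binary logistic regression, the class-1 probability is the sigmoid of the linear score. A short sketch reusing the same pass/fail data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    # maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(X, y)

z = model.decision_function([[5.5]])            # linear score w*x + b
p_manual = sigmoid(z)[0]                        # apply sigmoid by hand
p_sklearn = model.predict_proba([[5.5]])[0, 1]  # class-1 probability
print(p_manual, p_sklearn)                      # the two values match
```

This is why logistic regression outputs are interpretable as calibrated-looking probabilities rather than raw scores.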
12. What is a decision tree in machine learning?
A decision tree is a supervised learning algorithm that makes predictions by learning a hierarchical sequence of if-else decision rules from the data. It splits the data at each node based on the feature that provides the best information gain or lowest Gini impurity.
Example:
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X, y)

# Print the learned tree rules
print(export_text(model, feature_names=load_iris().feature_names))
print(f"Accuracy: {model.score(X, y):.2f}")
Decision trees are highly interpretable but prone to overfitting. Limiting max_depth or using ensemble methods like Random Forest significantly improves robustness.
13. What is accuracy in machine learning and when is it misleading?
Accuracy is the ratio of correctly predicted samples to the total number of samples. It is one of the most commonly reported classification metrics, but it can be highly misleading when the dataset is imbalanced.
Example (imbalanced dataset — accuracy is misleading):
from sklearn.metrics import accuracy_score, classification_report
import numpy as np

# 950 negatives, 50 positives (imbalanced)
y_true = np.array([0] * 950 + [1] * 50)

# Dumb model: always predict 0
y_pred = np.zeros(1000, dtype=int)

print(f"Accuracy: {accuracy_score(y_true, y_pred):.2f}")  # 0.95 -- looks great!
print(classification_report(y_true, y_pred))              # recall for class 1 = 0.00
On imbalanced datasets, always prefer precision, recall, F1-score, or AUC-ROC over accuracy as the primary evaluation metric.
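A quick sketch of how those alternative metrics expose the same always-negative model that accuracy rewarded:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

y_true = np.array([0] * 950 + [1] * 50)
y_pred = np.zeros(1000, dtype=int)   # always-negative model

print(f"Accuracy: {accuracy_score(y_true, y_pred):.2f}")             # 0.95
print(f"F1:       {f1_score(y_true, y_pred, zero_division=0):.2f}")  # 0.00
# AUC uses ranking; constant predictions have no ranking ability
print(f"AUC:      {roc_auc_score(y_true, y_pred):.2f}")              # 0.50
```

F1 collapses to 0 because no positives are ever found, and AUC sits at the coin-flip baseline of 0.5, both telling the true story that accuracy hides.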
14. What is a loss function in machine learning?
A loss function (also called a cost function or objective function) measures how far the model's predictions are from the actual values. During training, the algorithm minimizes the loss function to improve the model's accuracy.
Example (common loss functions implemented in Python):
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 6.0])

# Mean Squared Error (regression)
mse = np.mean((y_true - y_pred) ** 2)
print(f"MSE: {mse:.4f}")

# Mean Absolute Error (regression)
mae = np.mean(np.abs(y_true - y_pred))
print(f"MAE: {mae:.4f}")

# Binary Cross-Entropy (classification)
y_true_cls = np.array([1, 0, 1, 1])
y_pred_proba = np.array([0.9, 0.1, 0.8, 0.7])
bce = -np.mean(y_true_cls * np.log(y_pred_proba) + (1 - y_true_cls) * np.log(1 - y_pred_proba))
print(f"BCE: {bce:.4f}")
MSE is used for regression, binary cross-entropy for binary classification, and categorical cross-entropy for multi-class classification. Choosing the right loss function is critical for model training.
15. What is gradient descent?
Gradient descent is an optimization algorithm used to minimize the loss function by iteratively updating model parameters in the direction opposite to the gradient (slope) of the loss. The size of each step is controlled by the learning rate.
Example (gradient descent from scratch):
import numpy as np

# Simple linear regression via gradient descent
np.random.seed(42)
X = np.random.randn(100)
y = 3 * X + 2 + np.random.randn(100) * 0.5  # true: w=3, b=2

w, b = 0.0, 0.0
lr = 0.01

for epoch in range(1000):
    y_pred = w * X + b
    loss = np.mean((y_pred - y) ** 2)
    dw = np.mean(2 * (y_pred - y) * X)
    db = np.mean(2 * (y_pred - y))
    w -= lr * dw
    b -= lr * db

print(f"Learned w={w:.3f}, b={b:.3f}")  # close to w=3, b=2
Variants include Stochastic Gradient Descent (one sample at a time), Mini-Batch SGD (small batches), and adaptive optimizers like Adam and RMSProp which adjust the learning rate per parameter.
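The mini-batch variant differs only in that each update uses a small random batch rather than the full dataset. A sketch on the same kind of synthetic data (the learning rate and batch size here are illustrative choices):

```python
import numpy as np

np.random.seed(42)
X = np.random.randn(200)
y = 3 * X + 2 + np.random.randn(200) * 0.5  # true: w=3, b=2

w, b = 0.0, 0.0
lr, batch_size = 0.05, 16

for epoch in range(200):
    idx = np.random.permutation(len(X))          # shuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        y_pred = w * Xb + b
        # Gradients computed on the batch only, not the full dataset
        w -= lr * np.mean(2 * (y_pred - yb) * Xb)
        b -= lr * np.mean(2 * (y_pred - yb))

print(f"Learned w={w:.3f}, b={b:.3f}")  # close to w=3, b=2
```

Mini-batches trade a little gradient noise for far more parameter updates per pass over the data, which usually speeds up convergence on large datasets.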
16. What is normalization and why is it needed in ML?
Normalization (also called feature scaling) is the process of transforming feature values to a common scale so that no single feature dominates others during model training. Many ML algorithms are sensitive to the scale of input features.
Example (Min-Max scaling and Standardization):
from sklearn.preprocessing import MinMaxScaler, StandardScaler
import numpy as np

X = np.array([[1000, 1], [2000, 2], [3000, 3], [4000, 4]])

# Min-Max Normalization: scales to [0, 1]
minmax = MinMaxScaler()
print("MinMax:\n", minmax.fit_transform(X))

# Standardization: mean=0, std=1
standard = StandardScaler()
print("Standardized:\n", standard.fit_transform(X))
Use StandardScaler for algorithms that assume Gaussian distributions (SVM, PCA, logistic regression). Use MinMaxScaler when you need values in a fixed range (e.g., for neural networks). Always fit the scaler on training data only and transform both train and test sets.
17. What is a hyperparameter in machine learning?
A hyperparameter is a configuration setting for a machine learning algorithm that is set before training begins and is not learned from the data. Unlike model parameters (like weights), hyperparameters control the structure and training process of the model.
Example (common hyperparameters):
from sklearn.ensemble import RandomForestClassifier

# All of these are hyperparameters (set before training)
model = RandomForestClassifier(
    n_estimators=100,      # number of trees
    max_depth=5,           # maximum depth of each tree
    min_samples_split=4,   # min samples to split a node
    max_features='sqrt',   # features to consider per split
    random_state=42
)
# After fit, model.feature_importances_ is a LEARNED quantity;
# for linear models, the learned parameters are model.coef_
Finding the best hyperparameters is called hyperparameter tuning and is done using techniques like GridSearchCV, RandomizedSearchCV, or Bayesian optimization.
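A minimal GridSearchCV sketch (the grid values below are illustrative, chosen small so it runs quickly on the iris data):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 5, None],
}

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,                 # 5-fold cross-validation per combination
    scoring="accuracy",
)
grid.fit(X, y)

print("Best params:", grid.best_params_)
print(f"Best CV accuracy: {grid.best_score_:.3f}")
```

GridSearchCV exhaustively tries every combination; RandomizedSearchCV samples the grid instead, which scales better when the search space is large.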
18. What is the K-Nearest Neighbors (KNN) algorithm?
K-Nearest Neighbors (KNN) is a simple, non-parametric algorithm that classifies a data point based on the majority class of its k nearest neighbors in the feature space. It uses a distance metric (usually Euclidean) to find neighbors. KNN has no explicit training phase — it memorizes the training data.
Example:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(f"KNN Accuracy: {knn.score(X_test, y_test):.2f}")
KNN is simple and effective for small datasets but becomes slow for large datasets due to its O(n) prediction complexity. Always normalize features before using KNN.
19. What is the Naive Bayes classifier?
Naive Bayes is a probabilistic classifier based on Bayes' theorem with the 'naive' assumption that all features are conditionally independent given the class label. Despite this strong assumption, it works surprisingly well for text classification and spam filtering.
Example (Gaussian Naive Bayes):
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

gnb = GaussianNB()
gnb.fit(X_train, y_train)

print(f"Naive Bayes Accuracy: {gnb.score(X_test, y_test):.2f}")
print(f"Class priors: {gnb.class_prior_}")
Variants include GaussianNB for continuous features, MultinomialNB for text data, and BernoulliNB for binary features. It is fast and works well for high-dimensional data like text.
20. What is the difference between classification and regression?
Classification and regression are both supervised learning tasks, but they differ in the type of output they predict.
Classification: Predicts a discrete class label. Output is one of a finite set of categories. Examples: spam/not spam, cat/dog, digit 0-9.
Regression: Predicts a continuous numerical value. Output can be any real number. Examples: house price, temperature, stock value.
Example:
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.datasets import load_iris, load_diabetes

# Classification example
X_cls, y_cls = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=200).fit(X_cls, y_cls)
print("Classification predictions:", clf.predict(X_cls[:5]))  # [0, 0, 0, 0, 0]

# Regression example
X_reg, y_reg = load_diabetes(return_X_y=True)
reg = LinearRegression().fit(X_reg, y_reg)
print("Regression predictions:", reg.predict(X_reg[:3]).round(1))  # [206.1, 68.1, 176.9]
The type of label determines whether it is a classification or regression problem. Metrics also differ: use accuracy/F1 for classification and MSE/R2 for regression.
II. Intermediate Level
1. What is the bias-variance tradeoff?
The bias-variance tradeoff describes a fundamental tension in ML model design. Bias is error from incorrect assumptions; high bias causes underfitting. Variance is error from sensitivity to fluctuations in training data; high variance causes overfitting. Reducing one typically increases the other.
Example (visualizing bias-variance with learning curves):
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
import numpy as np

X, y = load_iris(return_X_y=True)

for depth in [1, 5, None]:  # high bias, balanced, high variance
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    train_sizes, train_scores, val_scores = learning_curve(model, X, y, cv=5)
    # Report the mean score at the largest training size (last row)
    print(f"Depth={depth} | Train={train_scores[-1].mean():.2f} | Val={val_scores[-1].mean():.2f}")
# depth=1: low train, low val (high bias / underfit)
# depth=None: high train, lower val (high variance / overfit)
# depth=5: balanced
The goal is to find the sweet spot: a model complex enough to learn the signal but constrained enough not to memorize the noise. Regularization, cross-validation, and ensemble methods help achieve this balance.
2. What is cross-validation and how does k-fold work?
Cross-validation is a resampling technique used to evaluate model performance more reliably than a single train-test split. K-fold cross-validation divides the data into k equal parts. The model trains on k-1 folds and validates on the remaining fold, repeating k times.
Example:
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Stratified k-fold preserves class distribution in each fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf, scoring='f1')

print(f"F1 per fold: {scores.round(3)}")
print(f"Mean F1: {scores.mean():.3f} ± {scores.std():.3f}")
Always use StratifiedKFold for classification to ensure class proportions are maintained in each fold. The mean and standard deviation of scores give a reliable estimate of model performance.
3. Explain precision, recall, and F1-score with examples.
Precision, recall, and F1-score are classification metrics especially important for imbalanced datasets. They provide a more complete picture than accuracy alone.
Example:
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report
import numpy as np

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 0])

print(f"Precision: {precision_score(y_true, y_pred):.2f}")  # TP / (TP + FP)
print(f"Recall:    {recall_score(y_true, y_pred):.2f}")     # TP / (TP + FN)
print(f"F1-Score:  {f1_score(y_true, y_pred):.2f}")         # harmonic mean of P and R
print()
print(classification_report(y_true, y_pred))
High precision needed: Spam detection (don't flag legitimate emails as spam).
High recall needed: Cancer detection (don't miss a true positive).
F1 balances both and is ideal when you need a single metric for an imbalanced dataset.
The choice between precision and recall depends on the cost of false positives vs. false negatives in your specific problem domain.
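That tradeoff can also be steered after training by moving the decision threshold on predicted probabilities rather than retraining. A sketch on the breast-cancer dataset (the thresholds shown are illustrative):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features so logistic regression converges cleanly
scaler = StandardScaler()
model = LogisticRegression(max_iter=1000).fit(scaler.fit_transform(X_train), y_train)
proba = model.predict_proba(scaler.transform(X_test))[:, 1]

scores = {}
for threshold in [0.3, 0.5, 0.7]:
    y_pred = (proba >= threshold).astype(int)  # move the decision cutoff
    scores[threshold] = (precision_score(y_test, y_pred), recall_score(y_test, y_pred))
    p, r = scores[threshold]
    print(f"threshold={threshold} | precision={p:.2f} | recall={r:.2f}")
# Lowering the threshold flags more positives: recall rises, precision tends to fall
```

For a cancer screen you would pick a low threshold (favoring recall); for spam filtering, a high one (favoring precision).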
4. What is a confusion matrix? How do you compute it?
A confusion matrix is a table that visualizes the performance of a classification model by showing counts of true positives, true negatives, false positives, and false negatives. It is the foundation for computing all other classification metrics.
Example:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
# [[TN, FP],
#  [FN, TP]]

tn, fp, fn, tp = cm.ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
From the confusion matrix you can derive accuracy, precision, recall, F1-score, specificity, and the false positive rate, giving a complete picture of classifier performance.
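Those derivations follow directly from the four cell counts. A self-contained sketch on a small hand-made example:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy    = (tp + tn) / (tp + tn + fp + fn)
precision   = tp / (tp + fp)
recall      = tp / (tp + fn)   # also called sensitivity
specificity = tn / (tn + fp)
f1          = 2 * precision * recall / (precision + recall)

print(f"acc={accuracy:.2f} prec={precision:.2f} rec={recall:.2f} "
      f"spec={specificity:.2f} f1={f1:.2f}")
```

Being able to write these ratios from the four cells without looking them up is a common interview checkpoint.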
5. How does Random Forest work and what makes it powerful?
Random Forest is an ensemble learning method that builds multiple decision trees during training and aggregates their predictions. It introduces two key sources of randomness: bootstrap sampling (bagging) and random feature selection at each split. This diversity among trees reduces overfitting dramatically.
Example (with feature importance):
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
import pandas as pd

X, y = load_iris(return_X_y=True)
feature_names = load_iris().feature_names

model = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
model.fit(X, y)

# Feature importance
imp = pd.Series(model.feature_importances_, index=feature_names).sort_values(ascending=False)
print("Feature Importances:")
print(imp)

# Out-of-bag score: a free validation estimate from the bootstrap samples
print(f"\nOOB Score: {model.oob_score_:.3f}")
Random Forest provides built-in feature importance, handles missing values reasonably, and requires little preprocessing, making it one of the most versatile and reliable out-of-the-box ML algorithms.
6. What is gradient boosting and how does it differ from bagging?
Gradient boosting is an ensemble technique that builds models sequentially, where each new model corrects the residual errors of the previous one by fitting to the negative gradient of the loss function. Unlike bagging (Random Forest), boosting trains trees in sequence, not in parallel.
Bagging (Random Forest): Trains trees independently in parallel on random subsets. Reduces variance. Better when individual trees overfit.
Boosting (GBM, XGBoost): Trains trees sequentially, each correcting previous errors. Reduces bias. Usually achieves higher accuracy but is slower and more prone to overfitting on noisy data.
Example:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gbm.fit(X_train, y_train)
print(f"GBM Accuracy: {gbm.score(X_test, y_test):.3f}")
Gradient boosting is the algorithm behind XGBoost, LightGBM, and CatBoost, the dominant algorithms in structured/tabular data competitions.
7. How does a Support Vector Machine (SVM) work?
SVM finds the optimal hyperplane that separates classes with the maximum margin. Data points closest to the hyperplane are called support vectors and directly define the decision boundary. The kernel trick maps data to higher dimensions to handle non-linearly separable problems.
Example (comparing linear vs. RBF kernel):
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

for kernel in ['linear', 'rbf', 'poly']:
    svm = SVC(kernel=kernel, C=1.0)
    svm.fit(X_train_s, y_train)
    print(f"Kernel={kernel:6s} | Accuracy={svm.score(X_test_s, y_test):.3f} | Support vectors={sum(svm.n_support_)}")
SVM works best with normalized features. Use a linear kernel for high-dimensional text data and an RBF kernel for most other tasks. The C parameter controls the bias-variance tradeoff.
8. How does K-Means clustering work? Write the algorithm.
K-Means partitions data into k clusters by iteratively assigning each point to its nearest centroid and recomputing centroids as the mean of assigned points. It converges when assignments no longer change.
Example (K-Means with sklearn and the Elbow method):
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

# Elbow method to find optimal k
inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, random_state=42, n_init='auto')
    km.fit(X)
    inertias.append(km.inertia_)
    print(f"k={k} | inertia={km.inertia_:.1f}")

# Best model with k=4
best = KMeans(n_clusters=4, random_state=42, n_init='auto').fit(X)
print(f"\nCluster labels sample: {best.labels_[:10]}")
print(f"Centroids:\n{best.cluster_centers_.round(2)}")
K-Means assumes spherical clusters and is sensitive to outliers and initialization. Use the Elbow method or Silhouette score to choose k. KMeans++ initialization (default) improves convergence.
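Since the question asks to write the algorithm, here is a from-scratch sketch of the assign/update loop. It uses random data-point initialization instead of k-means++ and adds an empty-cluster guard; both are implementation choices of this sketch, not part of the canonical algorithm:

```python
import numpy as np
from sklearn.datasets import make_blobs

def kmeans(X, k, n_iters=100, seed=42):
    rng = np.random.default_rng(seed)
    # Initialize centroids as k distinct random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its points;
        # keep the old centroid if a cluster ends up empty
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # converged: assignments stable
            break
        centroids = new_centroids
    return labels, centroids

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)
labels, centroids = kmeans(X, k=4)
print("Cluster sizes:", np.bincount(labels, minlength=4))
```

The two alternating steps (assign, then re-average) are exactly what sklearn's KMeans runs internally, minus the smarter k-means++ seeding and multiple restarts.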
9. How does PCA reduce dimensionality?
PCA (Principal Component Analysis) is a linear dimensionality reduction technique that projects data onto a lower-dimensional space defined by the directions (principal components) of maximum variance. It uses eigendecomposition of the covariance matrix to find these directions.
Example (reducing Iris from 4D to 2D):
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
import numpy as np

X, y = load_iris(return_X_y=True)

# Always standardize before PCA
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(f"Original shape: {X_scaled.shape}")  # (150, 4)
print(f"Reduced shape: {X_reduced.shape}")  # (150, 2)
print(f"Explained variance ratio: {pca.explained_variance_ratio_.round(3)}")
print(f"Total variance retained: {pca.explained_variance_ratio_.sum():.2%}")
PCA is unsupervised, so it maximizes variance regardless of class labels. For classification-aware reduction, Linear Discriminant Analysis (LDA) is preferred.
10. What is regularization? Explain L1 (Lasso) and L2 (Ridge) with code.
Regularization adds a penalty term to the loss function to discourage the model from learning overly large weights, which helps prevent overfitting. L1 adds the sum of absolute weights; L2 adds the sum of squared weights.
Example:
from sklearn.linear_model import Ridge, Lasso, LinearRegression
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = [
    ("Linear", LinearRegression()),
    ("Ridge (L2)", Ridge(alpha=1.0)),
    ("Lasso (L1)", Lasso(alpha=0.1)),
]
for name, model in models:
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    nonzero = (model.coef_ != 0).sum()
    print(f"{name:12s} | MSE={mse:.1f} | Non-zero coefs={nonzero}")
# Lasso drives some coefficients to exactly 0 (feature selection)
Use L1 when you suspect only a few features are relevant (sparse solution). Use L2 when all features contribute and you want to shrink their influence. ElasticNet combines both penalties.
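ElasticNet mixes the two penalties through its l1_ratio parameter. A quick sketch on the same diabetes data (alpha and l1_ratio values are illustrative):

```python
from sklearn.linear_model import ElasticNet
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# l1_ratio=0 -> pure L2 (Ridge-like); l1_ratio=1 -> pure L1 (Lasso-like)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
enet.fit(X_train, y_train)

mse = mean_squared_error(y_test, enet.predict(X_test))
nonzero = (enet.coef_ != 0).sum()
print(f"ElasticNet | MSE={mse:.1f} | Non-zero coefs={nonzero}")
```

ElasticNet is useful when features are correlated: pure Lasso tends to pick one of a correlated group arbitrarily, while the L2 component spreads weight across the group.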
11. How do you handle missing data in a machine learning dataset?
Missing data is one of the most common real-world data issues. Handling it correctly prevents biased models and errors during training. The right approach depends on the amount and type of missingness.
Example (imputation strategies):
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

df = pd.DataFrame({
    'age': [25, np.nan, 35, 45, np.nan],
    'salary': [50000, 60000, np.nan, 80000, 90000],
    'city': ['NY', 'LA', np.nan, 'NY', 'LA']
})

# Strategy 1: Mean imputation (numerical)
num_imputer = SimpleImputer(strategy='mean')
print("Mean imputed:\n", num_imputer.fit_transform(df[['age', 'salary']]))

# Strategy 2: Mode imputation (categorical)
cat_imputer = SimpleImputer(strategy='most_frequent')
print("Mode imputed:", cat_imputer.fit_transform(df[['city']]).ravel())

# Strategy 3: KNN imputation (uses neighbor values)
knn_imp = KNNImputer(n_neighbors=2)
print("KNN imputed:\n", knn_imp.fit_transform(df[['age', 'salary']]))
Never impute using statistics computed on the test set; always fit imputers on training data only. For large amounts of missing data, consider adding a binary indicator column to flag which values were imputed.
12. What is feature selection and what techniques are used?
Feature selection is the process of choosing the most relevant subset of features for a model. It reduces overfitting, improves accuracy, and decreases training time by removing redundant and irrelevant features.
Example (filter, wrapper, and embedded methods):
from sklearn.feature_selection import SelectKBest, f_classif, RFE, SelectFromModel
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
names = load_iris().feature_names

# Filter: select top 2 features by ANOVA F-score
filter_sel = SelectKBest(f_classif, k=2).fit(X, y)
print("Filter selected:", [names[i] for i in filter_sel.get_support(indices=True)])

# Wrapper: Recursive Feature Elimination
rfe = RFE(RandomForestClassifier(n_estimators=10, random_state=42), n_features_to_select=2)
rfe.fit(X, y)
print("RFE selected:", [names[i] for i in rfe.get_support(indices=True)])

# Embedded: model-based selection via Random Forest feature importances
sfm = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42))
sfm.fit(X, y)
print("Embedded selected:", [names[i] for i in sfm.get_support(indices=True)])

Filter methods are fastest, wrapper methods are most thorough but expensive, and embedded methods (like tree feature importance or Lasso) are a good practical middle ground.
13. How do you handle an imbalanced dataset?
An imbalanced dataset has a significant disparity in the number of samples per class, causing models to be biased toward the majority class. Several techniques can address this issue.
Example (SMOTE oversampling and class_weight):
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Approach 1: class_weight='balanced' (built in to most sklearn estimators)
model_balanced = RandomForestClassifier(class_weight='balanced', random_state=42)
model_balanced.fit(X_train, y_train)
print("class_weight='balanced':")
print(classification_report(y_test, model_balanced.predict(X_test)))

# Approach 2: SMOTE (Synthetic Minority Oversampling Technique)
smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X_train, y_train)
model_smote = RandomForestClassifier(random_state=42).fit(X_res, y_res)
print("After SMOTE:")
print(classification_report(y_test, model_smote.predict(X_test)))

Always evaluate imbalanced models using F1, precision-recall AUC, or the Matthews Correlation Coefficient rather than accuracy. SMOTE should only be applied to the training set, never to test data.
14. What is the ROC curve and AUC score?
The ROC (Receiver Operating Characteristic) curve plots the True Positive Rate (recall) against the False Positive Rate at various classification thresholds. The AUC (Area Under the Curve) is a single number summarizing the model's ability to discriminate between classes — 1.0 is perfect, 0.5 is random.
Example:
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]

auc = roc_auc_score(y_test, y_proba)
print(f"AUC-ROC: {auc:.4f}")

fpr, tpr, thresholds = roc_curve(y_test, y_proba)
best_threshold = thresholds[np.argmax(tpr - fpr)]  # Youden's J statistic
print(f"Threshold maximizing TPR - FPR: {best_threshold:.3f}")

AUC-ROC is threshold-independent and works well for binary classification on imbalanced datasets. For multi-class problems, use the average of one-vs-rest AUC scores.
15. What are ensemble methods in machine learning?
Ensemble methods combine multiple base models to produce a stronger predictive model. The key principle is that diverse, slightly imperfect models can collectively outperform any individual model.
Bagging: Trains models independently on bootstrap samples and averages results (Random Forest). Reduces variance.
Boosting: Trains models sequentially, each correcting the previous errors (XGBoost, AdaBoost). Reduces bias.
Stacking: Uses predictions of base models as features for a meta-learner. Can combine very different model types.
Voting: Aggregates predictions from multiple classifiers by majority vote (hard) or averaged probabilities (soft).
Ensemble methods consistently win machine learning competitions and are widely used in industry for their robustness and accuracy.
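The voting strategy above can be sketched with scikit-learn's VotingClassifier; the base models and dataset here are illustrative choices:

```python
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Soft voting averages predicted probabilities across diverse models
vote = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(max_iter=5000)),
        ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
        ('nb', GaussianNB())
    ],
    voting='soft'
)
vote.fit(X_train, y_train)
print(f"Voting ensemble accuracy: {vote.score(X_test, y_test):.3f}")
```

Hard voting (voting='hard') takes the majority class instead; soft voting usually works better when the base models produce calibrated probabilities.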
16. What is one-hot encoding and when do you use it?
One-hot encoding converts a categorical variable into a set of binary columns — one per category — where only the column corresponding to the present category has a value of 1 and all others are 0. It prevents the model from assuming ordinal relationships between categories.
Example:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'color': ['red', 'green', 'blue', 'green', 'red']})

# pandas get_dummies
print("pd.get_dummies:")
print(pd.get_dummies(df, drop_first=True))  # drop_first avoids multicollinearity

# sklearn OneHotEncoder
enc = OneHotEncoder(sparse_output=False, drop='first')
result = enc.fit_transform(df[['color']])
print("\nSklearn OHE:")
print(result)
print("Categories:", enc.categories_)

Use one-hot encoding for nominal categories with low cardinality. For high-cardinality categoricals (e.g., city with 10,000 values), prefer target encoding or embedding layers.
17. What is feature engineering and give practical examples.
Feature engineering is the process of using domain knowledge to create, transform, or combine raw features into more informative representations for a model. Good feature engineering often has a bigger impact on performance than choosing a more complex algorithm.
Example (datetime and interaction features):
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'purchase_date': pd.to_datetime(['2024-01-15', '2024-07-20', '2024-12-01']),
    'price': [100, 250, 80],
    'quantity': [2, 1, 5]
})

# Datetime features
df['month'] = df['purchase_date'].dt.month
df['day_of_week'] = df['purchase_date'].dt.dayofweek
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)

# Interaction feature
df['total_value'] = df['price'] * df['quantity']

# Log transform for skewed features
df['log_price'] = np.log1p(df['price'])

print(df[['month', 'is_weekend', 'total_value', 'log_price']])

Common feature engineering techniques include log transformations for skewed data, polynomial features, interaction terms, date decomposition, binning, and ratio features.
18. What is a learning rate and how does it affect training?
The learning rate is a hyperparameter that controls how large a step the optimizer takes when updating model weights during gradient descent. It is one of the most important hyperparameters to tune in any ML or deep learning model.
Example (effect of different learning rates):
import numpy as np

def gradient_descent(lr, n_steps=50):
    w = 10.0  # start far from the optimum (w=0)
    losses = []
    for _ in range(n_steps):
        loss = w ** 2  # f(w) = w^2, minimum at w=0
        grad = 2 * w   # df/dw = 2w
        w -= lr * grad
        losses.append(loss)
    return w, losses[-1]

for lr in [0.001, 0.1, 0.5, 1.1]:
    final_w, final_loss = gradient_descent(lr)
    status = "diverged!" if final_loss > 1e6 else f"w={final_w:.4f}"
    print(f"lr={lr} -> {status}, final_loss={final_loss:.4f}")

Too small a learning rate causes slow convergence; too large causes divergence. Learning rate schedulers (cosine annealing, step decay) and adaptive optimizers (Adam) help manage this.
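As a sketch of how a scheduler helps, step decay can be added to a toy quadratic descent like the one above; the decay factor and interval here are arbitrary choices:

```python
def gradient_descent_with_decay(lr0=0.8, decay=0.5, decay_every=10, n_steps=50):
    """Minimize f(w) = w^2 with a step-decay learning rate schedule."""
    w = 10.0
    lr = lr0
    for step in range(n_steps):
        if step > 0 and step % decay_every == 0:
            lr *= decay  # step decay: halve the rate every decay_every steps
        w -= lr * 2 * w  # gradient of f(w) = w^2 is 2w
    return w

final_w = gradient_descent_with_decay()
print(f"final w = {final_w:.6f}")  # converges close to the optimum w = 0
```

The large initial rate makes fast early progress; shrinking it later avoids the oscillation a fixed large rate would cause.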
19. What is early stopping in model training?
Early stopping is a regularization technique that halts training when the model's performance on a validation set stops improving — preventing the model from continuing to overfit the training data. It is widely used in gradient boosting and neural network training.
Example (early stopping with XGBoost):
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = xgb.XGBClassifier(
    n_estimators=1000,         # high upper bound
    learning_rate=0.05,
    early_stopping_rounds=20,  # stop if no improvement for 20 rounds
    eval_metric='logloss',
    random_state=42
)

model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=False
)

print(f"Best iteration: {model.best_iteration}")
print(f"Validation accuracy: {model.score(X_val, y_val):.3f}")

Early stopping is nearly free regularization — beyond choosing the patience (early_stopping_rounds), it requires no additional tuning and often significantly reduces overfitting in boosting and neural network models.
20. How do you perform hyperparameter tuning using GridSearchCV?
GridSearchCV exhaustively searches over a specified hyperparameter grid, training and evaluating a model for every combination using cross-validation. It automatically selects the best combination based on a scoring metric.
Example:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from scipy.stats import randint

X, y = load_iris(return_X_y=True)

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 3, 5],
    'min_samples_split': [2, 5]
}

# Grid Search (exhaustive)
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X, y)
print(f"Best params: {grid.best_params_}")
print(f"Best CV accuracy: {grid.best_score_:.3f}")

# Random Search (faster alternative for large grids)
param_dist = {'n_estimators': randint(50, 300), 'max_depth': [None, 3, 5, 10]}
rand = RandomizedSearchCV(RandomForestClassifier(random_state=42), param_dist, n_iter=20, cv=5, random_state=42)
rand.fit(X, y)
print(f"RandomizedSearch best: {rand.best_params_}")

Use GridSearchCV for small grids and RandomizedSearchCV for large ones. For even greater efficiency, consider Bayesian optimization with Optuna or Hyperopt.
III. Advanced
1. What is XGBoost and how does it differ from LightGBM?
XGBoost and LightGBM are both high-performance gradient boosting frameworks but differ in how they build trees. XGBoost uses level-wise tree growth (splits all nodes at the same depth), while LightGBM uses leaf-wise growth (splits the leaf with the largest loss reduction), which is more accurate but can overfit on small datasets.
Example (comparison):
import xgboost as xgb
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import time

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for name, model in [
    ("XGBoost", xgb.XGBClassifier(n_estimators=200, learning_rate=0.05, random_state=42, verbosity=0)),
    ("LightGBM", lgb.LGBMClassifier(n_estimators=200, learning_rate=0.05, random_state=42, verbose=-1))
]:
    t0 = time.time()
    model.fit(X_train, y_train)
    t1 = time.time()
    print(f"{name:8s} | Acc={model.score(X_test, y_test):.3f} | Time={t1-t0:.2f}s")
# LightGBM is usually faster on larger datasets

LightGBM is generally faster and more memory-efficient for large datasets. XGBoost has been around longer and has a larger community. CatBoost is a strong choice when you have many categorical features.
2. What is stacking in ensemble learning? How is it implemented?
Stacking (stacked generalization) is an ensemble technique where the predictions of multiple base models (level-0) are used as features for a meta-learner (level-1). The meta-learner learns how to best combine the base model predictions. Out-of-fold predictions are used to avoid data leakage.
Example:
from sklearn.ensemble import StackingClassifier, RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Level-0: diverse base models
base_models = [
    ('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
    ('gbm', GradientBoostingClassifier(n_estimators=50, random_state=42)),
    ('svm', SVC(probability=True, random_state=42))
]

# Level-1: meta-learner
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(),
    cv=5,  # out-of-fold predictions
    stack_method='predict_proba'
)
stack.fit(X_train, y_train)
print(f"Stacking Accuracy: {stack.score(X_test, y_test):.3f}")

Stacking is powerful because diverse base models capture different patterns, and the meta-learner learns to trust each one differently. It often outperforms individual models in competitions.
3. How do you create a custom loss function in scikit-learn or XGBoost?
Custom loss functions allow you to optimize directly for the business metric that matters, rather than a standard surrogate like MSE or log loss. In XGBoost, you provide the gradient (first derivative) and hessian (second derivative) of your loss.
Example (custom asymmetric loss — penalize underestimation more than overestimation):
import xgboost as xgb
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

def asymmetric_loss(y_pred, dtrain):
    """Penalize underestimation 3x more than overestimation."""
    y_true = dtrain.get_label()
    residual = y_pred - y_true
    alpha = 3.0  # weight for underestimation (residual < 0)
    grad = np.where(residual < 0, 2 * alpha * residual, 2 * residual)
    hess = np.where(residual < 0, 2 * alpha, 2.0)
    return grad, hess

dtrain = xgb.DMatrix(X_train, label=y_train)
model_custom = xgb.train({'eta': 0.1}, dtrain, num_boost_round=100, obj=asymmetric_loss)
preds = model_custom.predict(xgb.DMatrix(X_test))
print(f"Custom loss predictions sample: {preds[:5].round(1)}")

Custom losses are powerful when the cost of different types of errors is asymmetric — for example, in finance where underestimating risk is much more costly than overestimating it.
4. What are SHAP values and how do they explain model predictions?
SHAP (SHapley Additive exPlanations) values are a game-theory-based method for explaining individual model predictions. Each feature receives a SHAP value representing its contribution (positive or negative) to shifting the prediction from the model's average output.
Example:
import shap
import pandas as pd
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
feature_names = load_breast_cancer().feature_names
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = xgb.XGBClassifier(random_state=42, verbosity=0).fit(X_train, y_train)

# Compute SHAP values
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Print the top 3 features for the first prediction
shap_df = pd.Series(shap_values[0], index=feature_names).abs().sort_values(ascending=False)
print("Top SHAP contributions for sample 0:")
print(shap_df.head(3))

SHAP provides both global (feature importance across all samples) and local (per-sample) explanations, making it the gold standard for ML interpretability.
5. How do you approach a time series forecasting problem in ML?
Time series forecasting requires special treatment because observations are ordered in time and the standard i.i.d. assumption is violated. You must avoid future data leakage and respect temporal ordering in validation.
Example (lag features + TimeSeriesSplit):
import pandas as pd
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

np.random.seed(42)
dates = pd.date_range('2020-01-01', periods=365, freq='D')
values = np.cumsum(np.random.randn(365)) + 100

df = pd.DataFrame({'date': dates, 'value': values})

# Create lag features (past observations as inputs)
for lag in [1, 7, 14, 30]:
    df[f'lag_{lag}'] = df['value'].shift(lag)
df['rolling_mean_7'] = df['value'].shift(1).rolling(7).mean()
df = df.dropna()

X = df.drop(columns=['date', 'value'])
y = df['value']

# Time-aware cross-validation (no shuffling!)
tscv = TimeSeriesSplit(n_splits=5)
scores = []
for train_idx, val_idx in tscv.split(X):
    X_tr, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_tr, y_val = y.iloc[train_idx], y.iloc[val_idx]
    m = GradientBoostingRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)
    scores.append(mean_absolute_error(y_val, m.predict(X_val)))
print(f"CV MAE: {np.mean(scores):.3f} ± {np.std(scores):.3f}")

Key principles for time series ML: always use TimeSeriesSplit, create lag and rolling window features, and never include future information in your training features.
6. What is a Pipeline in scikit-learn and why is it useful?
A Pipeline chains multiple preprocessing and modeling steps into a single object. It ensures that transformations are applied consistently during training and inference, prevents data leakage in cross-validation, and makes the code cleaner and more reproducible.
Example (full pipeline with preprocessing and model):
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import pandas as pd
import numpy as np

# Synthetic dataset with mixed types
np.random.seed(42)
df = pd.DataFrame({
    'age': np.random.randint(20, 60, 200),
    'salary': np.random.randint(30000, 150000, 200),
    'city': np.random.choice(['NY', 'LA', 'Chicago'], 200),
    'approved': np.random.randint(0, 2, 200)
})

X = df.drop('approved', axis=1)
y = df['approved']

numeric_features = ['age', 'salary']
categorical_features = ['city']

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(drop='first'), categorical_features)
])

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')
print(f"Pipeline CV Accuracy: {scores.mean():.3f} ± {scores.std():.3f}")

Pipelines are essential for production ML systems. They prevent the most common source of data leakage — fitting preprocessors on the full dataset before cross-validation — and make model deployment simpler since the entire pipeline can be serialized as one object.
7. What is target encoding and when do you prefer it over one-hot encoding?
Target encoding replaces a categorical value with the mean of the target variable for that category. It handles high-cardinality categoricals (e.g., zip codes, user IDs) without exploding the feature space as one-hot encoding would.
Example (target encoding with smoothing to prevent overfitting):
import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame({
    'city': np.random.choice(['NY', 'LA', 'Chicago', 'Houston', 'Phoenix'], 500),
    'purchased': np.random.binomial(1, 0.4, 500)
})

# Compute target mean per category (with global mean smoothing)
global_mean = df['purchased'].mean()
smoothing = 10  # controls how much we trust global vs. local mean

stats = df.groupby('city')['purchased'].agg(['mean', 'count'])
stats['encoded'] = (
    stats['mean'] * stats['count'] + global_mean * smoothing
) / (stats['count'] + smoothing)

df['city_encoded'] = df['city'].map(stats['encoded'])
print(stats[['mean', 'count', 'encoded']].round(3))
print("\nEncoded city sample:")
print(df[['city', 'city_encoded']].drop_duplicates())

Always apply smoothing to target encoding to prevent overfitting for categories with few samples. In cross-validation, encode using only the training fold to avoid leakage. The category_encoders library provides a robust implementation.
8. What is the curse of dimensionality and how do you address it?
The curse of dimensionality refers to the exponential increase in data required to maintain statistical significance as the number of features grows. In high-dimensional spaces, data becomes sparse, distances lose meaning, and models overfit more easily.
Example (how KNN breaks down in high dimensions):
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# KNN accuracy degrades as dimensions increase
for n_features in [2, 10, 50, 200, 500]:
    X, y = make_classification(
        n_samples=500, n_features=n_features,
        n_informative=2, n_redundant=0, random_state=42
    )
    knn = KNeighborsClassifier(n_neighbors=5)
    score = cross_val_score(knn, X, y, cv=5).mean()
    print(f"Dims={n_features:4d} | KNN accuracy={score:.3f}")
# Accuracy drops significantly as irrelevant dimensions grow

Address the curse of dimensionality with feature selection, PCA/dimensionality reduction, regularization, or collecting more data. Distance-based algorithms (KNN, SVM with RBF) suffer most.
9. What are Gaussian Mixture Models (GMM) and how do they differ from K-Means?
A Gaussian Mixture Model is a probabilistic clustering model that assumes data is generated from a mixture of several Gaussian distributions with unknown parameters. Unlike K-Means, GMM uses soft assignments — each point has a probability of belonging to each cluster — and can model elliptical cluster shapes.
Example:
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, true_labels = make_blobs(n_samples=300, centers=3, cluster_std=[1.0, 2.5, 0.5], random_state=42)

gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=42)
gmm.fit(X)

labels = gmm.predict(X)
probs = gmm.predict_proba(X)  # soft assignments

print(f"Cluster means:\n{gmm.means_.round(2)}")
print(f"\nSample probabilities (first 3 points):\n{probs[:3].round(3)}")
print(f"BIC (lower=better): {gmm.bic(X):.1f}")

GMMs are more flexible than K-Means but slower. Use BIC or AIC to select the number of components. GMMs also serve as a density estimation tool for anomaly detection.
10. What is DBSCAN clustering and what are its advantages?
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups points that are closely packed together and marks points in low-density regions as outliers. It does not require specifying the number of clusters in advance and can find arbitrarily shaped clusters.
Example:
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Non-spherical clusters: make_moons
X, y = make_moons(n_samples=300, noise=0.05, random_state=42)
X = StandardScaler().fit_transform(X)

db = DBSCAN(eps=0.3, min_samples=5)
labels = db.fit_predict(X)

print(f"Clusters found: {len(set(labels)) - (1 if -1 in labels else 0)}")
print(f"Noise points: {(labels == -1).sum()}")
print(f"Unique labels: {sorted(set(labels))}")  # -1 = noise

DBSCAN excels at detecting arbitrarily shaped clusters and identifying outliers automatically. The two hyperparameters eps (neighborhood radius) and min_samples (minimum density) must be chosen carefully, often using a k-distance graph.
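A sketch of the k-distance heuristic for choosing eps: sort each point's distance to its k-th nearest neighbor and look for the elbow in the curve. The largest-jump elbow detection below is a crude illustration, not a standard API — in practice the curve is usually inspected visually.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
X = StandardScaler().fit_transform(X)

k = 5  # match DBSCAN's min_samples
nn = NearestNeighbors(n_neighbors=k).fit(X)
distances, _ = nn.kneighbors(X)  # column 0 is the point itself (distance 0)

# Sorted distance to the k-th neighbor across all points
k_dist = np.sort(distances[:, -1])

# Crude elbow: index of the largest jump in the sorted curve
elbow_idx = np.argmax(np.diff(k_dist))
print(f"Suggested eps near: {k_dist[elbow_idx]:.3f}")
```

Points left of the elbow lie in dense regions; the distance at the elbow is a reasonable starting eps.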
11. What is isotonic regression and when is it used?
Isotonic regression fits a non-decreasing (monotonic) step function to data. It is commonly used for probability calibration — converting raw model scores into well-calibrated probabilities — and for scenarios where you know the relationship between input and output must be monotonic.
Example (calibrating classifier probabilities):
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import numpy as np

X, y = make_classification(n_samples=2000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Uncalibrated model
gbm = GradientBoostingClassifier(n_estimators=50, random_state=42).fit(X_train, y_train)

# Calibrate using isotonic regression
cal_iso = CalibratedClassifierCV(GradientBoostingClassifier(n_estimators=50, random_state=42),
                                 method='isotonic', cv=5)
cal_iso.fit(X_train, y_train)

# Compare calibration
raw_proba = gbm.predict_proba(X_test)[:, 1]
cal_proba = cal_iso.predict_proba(X_test)[:, 1]

frac_pos_raw, mean_pred_raw = calibration_curve(y_test, raw_proba, n_bins=10)
frac_pos_cal, mean_pred_cal = calibration_curve(y_test, cal_proba, n_bins=10)

print(f"Raw calibration error: {np.mean(np.abs(frac_pos_raw - mean_pred_raw)):.4f}")
print(f"Isotonic calibration error: {np.mean(np.abs(frac_pos_cal - mean_pred_cal)):.4f}")

Isotonic regression is better than Platt scaling (sigmoid calibration) when you have enough data and need a flexible calibration. Use it when predicted probabilities are used for business decisions or downstream ranking.
12. What is Bayesian optimization for hyperparameter tuning?
Bayesian optimization builds a probabilistic surrogate model (usually a Gaussian Process or Tree-structured Parzen Estimator) of the objective function (e.g., validation accuracy) and uses it to intelligently select the next hyperparameter configuration to evaluate, focusing on promising regions.
Example (Optuna):
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
optuna.logging.set_verbosity(optuna.logging.WARNING)

def objective(trial):
    n_estimators = trial.suggest_int('n_estimators', 50, 300)
    max_depth = trial.suggest_int('max_depth', 2, 10)
    min_samples = trial.suggest_int('min_samples_split', 2, 10)
    model = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        min_samples_split=min_samples,
        random_state=42
    )
    return cross_val_score(model, X, y, cv=5, scoring='accuracy').mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)

print(f"Best params: {study.best_params}")
print(f"Best CV accuracy: {study.best_value:.3f}")

Bayesian optimization requires far fewer evaluations than Grid or Random Search to find good hyperparameters, making it ideal when each model evaluation is expensive.
13. What is classifier calibration and how do you calibrate probabilities?
A well-calibrated classifier produces predicted probabilities that match the true frequencies of outcomes. For example, if a model predicts 80% probability for 100 samples, about 80 of them should actually be positive. Many models (like SVMs and Gradient Boosting) are not natively well-calibrated.
Example (Platt scaling vs. isotonic calibration):
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import brier_score_loss

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# SVM doesn't output probabilities natively
svm_sigmoid = CalibratedClassifierCV(SVC(), method='sigmoid', cv=5)   # Platt scaling
svm_isotonic = CalibratedClassifierCV(SVC(), method='isotonic', cv=5)  # Isotonic

for name, model in [("Sigmoid (Platt)", svm_sigmoid), ("Isotonic", svm_isotonic)]:
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    brier = brier_score_loss(y_test, proba)  # lower is better
    print(f"{name:20s} | Brier Score={brier:.4f}")

Use Platt scaling for small datasets and isotonic regression for larger ones. Calibrated probabilities are essential in applications where the probability itself drives business decisions, like credit risk or medical screening.
14. What is concept drift and how do you detect and handle it?
Concept drift occurs when the statistical relationship between input features and the target variable changes over time, causing a model's performance to degrade in production. It is one of the most common reasons deployed ML models fail silently.
Example (simulating and detecting drift with the Kolmogorov-Smirnov test):

import numpy as np
from scipy.stats import ks_2samp

np.random.seed(42)

# Simulate production feature distribution shifting over time
train_feature = np.random.normal(loc=0, scale=1, size=1000)
production_no_drift = np.random.normal(loc=0, scale=1, size=500)
production_with_drift = np.random.normal(loc=2, scale=1.5, size=500)  # shifted mean

# Kolmogorov-Smirnov test to detect distribution shift
stat_ok, p_ok = ks_2samp(train_feature, production_no_drift)
stat_drift, p_drift = ks_2samp(train_feature, production_with_drift)

print(f"No drift   - KS stat={stat_ok:.3f}, p={p_ok:.3f} -> {'DRIFT' if p_ok < 0.05 else 'OK'}")
print(f"With drift - KS stat={stat_drift:.3f}, p={p_drift:.3f} -> {'DRIFT' if p_drift < 0.05 else 'OK'}")

Handle concept drift by monitoring feature distributions (KS test, PSI), tracking prediction score distributions, setting up automated retraining pipelines, and using online or sliding-window training approaches.
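The PSI (Population Stability Index) mentioned above compares binned feature distributions between training and production. A minimal hand-rolled sketch — the 10-bin choice and the common 0.2 alert threshold are industry conventions, not library defaults:

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10):
    """PSI between a reference (training) sample and a new (production) sample."""
    # Bin edges from the reference distribution's quantiles
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch values outside the training range
    exp_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    act_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Clip to avoid division by zero and log(0)
    exp_pct = np.clip(exp_pct, 1e-6, None)
    act_pct = np.clip(act_pct, 1e-6, None)
    return np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct))

np.random.seed(42)
train = np.random.normal(0, 1, 1000)

psi_same = population_stability_index(train, np.random.normal(0, 1, 500))
psi_drift = population_stability_index(train, np.random.normal(2, 1.5, 500))
print(f"PSI (no drift):   {psi_same:.3f}")
print(f"PSI (with drift): {psi_drift:.3f}")
```

A common rule of thumb: PSI below 0.1 is stable, 0.1 to 0.2 warrants watching, and above 0.2 signals significant drift.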
15. What is multi-label classification and how is it implemented?
Multi-label classification is a task where each sample can belong to multiple classes simultaneously. For example, a news article can be tagged as both 'sports' and 'politics'. This is different from multi-class classification where each sample belongs to exactly one class.
Example:
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import hamming_loss, f1_score

X, y = make_multilabel_classification(n_samples=1000, n_features=20, n_classes=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Binary relevance: one independent binary classifier per label
model = MultiOutputClassifier(RandomForestClassifier(n_estimators=100, random_state=42))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(f"Hamming Loss: {hamming_loss(y_test, y_pred):.4f}")  # fraction of wrong labels
print(f"F1 (micro): {f1_score(y_test, y_pred, average='micro'):.3f}")
print(f"F1 (samples): {f1_score(y_test, y_pred, average='samples'):.3f}")
Use Hamming Loss and sample-averaged F1 for multi-label evaluation. Other approaches include classifier chains and label powerset, which capture label correlations that independent per-label classifiers ignore.
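The classifier-chain approach mentioned above is available directly in scikit-learn. A minimal sketch on the same kind of synthetic data (logistic regression as the base estimator is an illustrative choice):

```python
from sklearn.multioutput import ClassifierChain
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

X, y = make_multilabel_classification(n_samples=1000, n_features=20, n_classes=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Each classifier in the chain sees the original features plus the predictions
# for all earlier labels, so label correlations are modeled explicitly
chain = ClassifierChain(LogisticRegression(max_iter=1000), order='random', random_state=42)
chain.fit(X_train, y_train)
y_pred = chain.predict(X_test)

print(f"Classifier chain F1 (micro): {f1_score(y_test, y_pred, average='micro'):.3f}")
```

Because the chain order matters, ensembling several chains with different random orders is a common refinement.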
16. What are common techniques for anomaly detection in ML?
Anomaly detection identifies data points that deviate significantly from normal behavior. It is widely used for fraud detection, network intrusion detection, and manufacturing quality control.
Example (Isolation Forest and Local Outlier Factor):
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.datasets import make_blobs
import numpy as np

np.random.seed(42)
X_normal, _ = make_blobs(n_samples=300, centers=1, cluster_std=0.5, random_state=42)
X_outliers = np.random.uniform(low=-10, high=10, size=(20, 2))
X = np.vstack([X_normal, X_outliers])

# Isolation Forest: anomaly score based on isolation depth
iforest = IsolationForest(contamination=0.06, random_state=42)
labels_if = iforest.fit_predict(X)  # -1 = anomaly, 1 = normal

# Local Outlier Factor: density-based
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.06)
labels_lof = lof.fit_predict(X)

print(f"Isolation Forest anomalies: {(labels_if == -1).sum()}")
print(f"LOF anomalies: {(labels_lof == -1).sum()}")
Isolation Forest is fast and works well for high-dimensional data. LOF is better for datasets with clusters of different densities. One-class SVM and autoencoders are alternatives for complex anomaly patterns.
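The One-Class SVM alternative can be sketched on similar synthetic data. Note that, unlike the two detectors above, it is typically trained on normal data only; the `nu` value here is an illustrative choice that bounds the fraction of training points treated as outliers:

```python
from sklearn.svm import OneClassSVM
from sklearn.datasets import make_blobs
import numpy as np

np.random.seed(42)
X_normal, _ = make_blobs(n_samples=300, centers=1, cluster_std=0.5, random_state=42)
X_outliers = np.random.uniform(low=-10, high=10, size=(20, 2))

# Fit a boundary around the normal data only, then flag points outside it
ocsvm = OneClassSVM(kernel='rbf', nu=0.05, gamma='scale')
ocsvm.fit(X_normal)

print("Outliers flagged:", (ocsvm.predict(X_outliers) == -1).sum(), "/ 20")
print("Normals flagged: ", (ocsvm.predict(X_normal) == -1).sum(), "/ 300")
```

Training on clean normal data makes One-Class SVM a natural fit for novelty detection, where anomalies are absent from the training set by design.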
17. What is semi-supervised learning and when is it used?
Semi-supervised learning combines a small amount of labeled data with a large amount of unlabeled data during training. It is used when labeling data is expensive or time-consuming but unlabeled data is abundant — for example, medical image annotation or sentiment analysis.
Example (Label Spreading):
from sklearn.semi_supervised import LabelSpreading
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Simulate scarce labels: only 10% of training labels are known
y_semi = y_train.copy()
rng = np.random.RandomState(42)
mask = rng.rand(len(y_train)) > 0.10  # 90% unlabeled
y_semi[mask] = -1  # -1 means unlabeled in sklearn's semi-supervised API

label_spread = LabelSpreading(kernel='rbf', alpha=0.2)
label_spread.fit(X_train, y_semi)
y_pred = label_spread.predict(X_test)

print(f"Labeled samples used: {(y_semi != -1).sum()} / {len(y_semi)}")
print(f"Label Spreading Accuracy: {accuracy_score(y_test, y_pred):.3f}")
Semi-supervised learning is powerful in domains like NLP and computer vision, where pre-training on large unlabeled corpora followed by fine-tuning on labeled data (as in BERT) is the dominant paradigm.
18. What is federated learning and how does it preserve privacy?
Federated learning is a distributed ML approach where model training happens locally on each device or data silo, and only model updates (gradients), not raw data, are shared with a central server. This allows learning from sensitive data without centralizing it.
Each client trains locally and sends model updates to the server.
The server aggregates updates using FedAvg (weighted averaging of the clients' model parameters).
Differential privacy adds noise to gradients before sharing to further protect individual records.
Example (simulating FedAvg locally):
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, random_state=42)

# Split data across 3 simulated clients
client_data = [(X[i::3], y[i::3]) for i in range(3)]

def train_local(X_local, y_local, global_coef, global_intercept):
    # Warm-start each client from the current global parameters
    model = SGDClassifier(loss='log_loss', max_iter=5, random_state=42)
    model.coef_ = global_coef.copy()
    model.intercept_ = global_intercept.copy()
    model.classes_ = np.array([0, 1])
    model.partial_fit(X_local, y_local, classes=[0, 1])
    return model.coef_, model.intercept_

# Initialize global model
global_model = SGDClassifier(loss='log_loss', random_state=42)
global_model.fit(X, y)  # just to set coef_ shape

# One round of FedAvg
client_coefs, client_intercepts = [], []
for Xc, yc in client_data:
    c, i = train_local(Xc, yc, global_model.coef_, global_model.intercept_)
    client_coefs.append(c)
    client_intercepts.append(i)

global_model.coef_ = np.mean(client_coefs, axis=0)
global_model.intercept_ = np.mean(client_intercepts, axis=0)
print(f"FedAvg round accuracy: {global_model.score(X, y):.3f}")
Federated learning is used by Google (Gboard keyboard prediction) and healthcare institutions to train models on sensitive patient data without violating privacy regulations like GDPR and HIPAA.
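The differential-privacy step from the bullet list above (clip each client update, then add calibrated noise before sharing) can be sketched as follows. The `clip_norm` and `noise_std` values are illustrative; a real deployment calibrates the noise to a formal (epsilon, delta) privacy budget:

```python
import numpy as np

def privatize_update(update, clip_norm=1.0, noise_std=0.1, rng=None):
    """Clip an update vector to a bounded L2 norm, then add Gaussian noise.
    This is the core recipe of DP-SGD-style federated updates (a sketch)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / (norm + 1e-12))  # bound each client's influence
    return clipped + rng.normal(0, noise_std, size=update.shape)

rng = np.random.default_rng(42)
raw_update = rng.normal(0, 5, size=10)  # a client's raw model update
private_update = privatize_update(raw_update, clip_norm=1.0, noise_std=0.1, rng=rng)

print(f"Raw update norm:     {np.linalg.norm(raw_update):.2f}")
print(f"Private update norm: {np.linalg.norm(private_update):.2f}")  # close to clip_norm
```

Clipping bounds any single client's influence on the aggregate, and the noise masks individual contributions, which is what gives the formal privacy guarantee.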
19. What are counterfactual explanations in interpretable ML?
Counterfactual explanations answer the question: 'What is the minimum change to the input features that would flip the model's prediction?' They are highly actionable because they tell users what to change, not just why a decision was made.
Example (manual counterfactual search for a loan denial):
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import numpy as np

X, y = make_classification(n_samples=500, n_features=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)

# Applicant denied (prediction=0); find the minimal change to get approved (prediction=1)
instance = X_test[0].copy()
print(f"Original prediction: {model.predict(instance.reshape(1, -1))[0]}")

# Simple greedy counterfactual: perturb feature 0 until the prediction flips
for delta in np.arange(0, 5, 0.1):
    modified = instance.copy()
    modified[0] += delta
    if model.predict(modified.reshape(1, -1))[0] == 1:
        print(f"Counterfactual found: increase feature 0 by {delta:.2f}")
        print(f"New prediction: {model.predict(modified.reshape(1, -1))[0]}")
        break
Counterfactual explanations are essential for algorithmic fairness, regulatory compliance (GDPR Article 22 right to explanation), and user-facing explanations in credit, healthcare, and insurance decisions.
20. How do you design a robust ML pipeline for production?
A production ML pipeline must be reproducible, scalable, monitored, and easily retrainable. It goes far beyond model training and includes data validation, feature engineering, model serving, and monitoring infrastructure.
Data ingestion & validation: Check schema, detect missing values and outliers (Great Expectations, Pandera).
Feature engineering pipeline: Use sklearn Pipelines or feature stores (Feast, Tecton) for consistent transformations.
Experiment tracking: Log all runs with MLflow, Weights & Biases, or Neptune.
Model serving: Deploy as REST API with FastAPI, BentoML, or Triton Inference Server.
Monitoring: Track prediction distribution, data drift, and business KPIs; trigger retraining on degradation.
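The feature-engineering bullet above can be sketched with a scikit-learn Pipeline. The column names here are hypothetical stand-ins for raw production data; the point is that one serialized object carries both preprocessing and model, so training and serving apply identical transformations:

```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np

# Toy frame standing in for raw production data (hypothetical columns)
df = pd.DataFrame({
    "income": [50000, 62000, np.nan, 48000, 91000, 75000],
    "age": [25, 34, 41, np.nan, 52, 29],
    "segment": ["a", "b", "a", "c", "b", "a"],
})
y = np.array([0, 1, 0, 0, 1, 1])

numeric = Pipeline([("impute", SimpleImputer(strategy="median")),
                    ("scale", StandardScaler())])
categorical = Pipeline([("onehot", OneHotEncoder(handle_unknown="ignore"))])

preprocess = ColumnTransformer([
    ("num", numeric, ["income", "age"]),
    ("cat", categorical, ["segment"]),
])

# One object to fit, serialize (joblib.dump), and serve, avoiding train/serve skew
clf = Pipeline([("prep", preprocess), ("model", RandomForestClassifier(random_state=42))])
clf.fit(df, y)
print(clf.predict(df))
```

Serializing this single pipeline as `model.pkl` is exactly what makes the serving example below safe to run on raw feature vectors.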
Example (model serving with FastAPI):
# app.py — serve a trained sklearn model as a REST API
from fastapi import FastAPI
from pydantic import BaseModel
import joblib
import numpy as np

app = FastAPI()
model = joblib.load("model.pkl")  # pre-trained sklearn pipeline

class PredictRequest(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(request: PredictRequest):
    X = np.array(request.features).reshape(1, -1)
    prediction = model.predict(X)[0]
    probability = model.predict_proba(X)[0].tolist()
    return {"prediction": int(prediction), "probability": probability}

# Run with: uvicorn app:app --reload
A robust production pipeline treats ML as software engineering, with versioning, testing, CI/CD, and observability built in from the start.
21. What is online learning (incremental learning) in machine learning?
Online learning (incremental learning) is a training paradigm where a model is updated continuously as new data arrives, one sample or mini-batch at a time, without retraining from scratch. It is essential when data streams continuously or when the full dataset is too large to fit in memory.
Example (incremental learning with partial_fit):
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
import numpy as np

np.random.seed(42)
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)

model = SGDClassifier(loss='log_loss', random_state=42)

# Simulate streaming: update the model with each mini-batch of 100 samples
batch_size = 100
for start in range(0, 4000, batch_size):
    X_batch = X[start:start + batch_size]
    y_batch = y[start:start + batch_size]
    model.partial_fit(X_batch, y_batch, classes=[0, 1])  # incremental update

# Evaluate on a held-out test set
y_pred = model.predict(X[4000:])
print(f"Online learning accuracy: {accuracy_score(y[4000:], y_pred):.3f}")
Online learning algorithms (SGD-based models, the River library) are essential for real-time recommendation systems, fraud detection, and any application where the data distribution changes continuously.
22. What is matrix factorization and how is it used in recommendation systems?
Matrix factorization decomposes a user-item interaction matrix (e.g., movie ratings) into two lower-rank matrices — user embeddings and item embeddings — whose dot product approximates the original matrix. The model can then predict unseen user-item interactions.
Example (matrix factorization with SGD from scratch):
import numpy as np

np.random.seed(42)
# User-item rating matrix (0 = unrated)
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 4, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
])

n_users, n_items = R.shape
n_factors = 2
lr = 0.01
lambda_reg = 0.1

P = np.random.rand(n_users, n_factors)  # user embeddings
Q = np.random.rand(n_items, n_factors)  # item embeddings

for epoch in range(5000):
    for u in range(n_users):
        for i in range(n_items):
            if R[u, i] > 0:
                e = R[u, i] - P[u] @ Q[i]
                P[u] += lr * (e * Q[i] - lambda_reg * P[u])
                Q[i] += lr * (e * P[u] - lambda_reg * Q[i])

R_pred = P @ Q.T
print("Reconstructed matrix (rounded):")
print(R_pred.round(1))
Matrix factorization is the foundation of collaborative filtering (used by Netflix, Spotify, and Amazon). Modern implementations use Alternating Least Squares (ALS) or neural collaborative filtering for scale.
23. What is conformal prediction and how does it provide uncertainty estimates?
Conformal prediction is a framework for producing statistically valid prediction intervals or sets with guaranteed coverage. Unlike standard ML models that produce point predictions, conformal predictors output a prediction set and guarantee that the true label is included with a user-specified probability (e.g., 95%).
Example (prediction intervals with MAPIE):
from mapie.regression import MapieRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
import numpy as np

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

base_model = GradientBoostingRegressor(n_estimators=100, random_state=42)
mapie = MapieRegressor(base_model, method='plus', cv=5)
mapie.fit(X_train, y_train)

# 90% prediction intervals
y_pred, y_intervals = mapie.predict(X_test, alpha=0.10)

coverage = np.mean((y_test >= y_intervals[:, 0, 0]) & (y_test <= y_intervals[:, 1, 0]))
print(f"Coverage (target=0.90): {coverage:.3f}")
print(f"Average interval width: {np.mean(y_intervals[:, 1, 0] - y_intervals[:, 0, 0]):.1f}")
Conformal prediction is model-agnostic and distribution-free, providing rigorous uncertainty quantification without assumptions about the data-generating process, which makes it ideal for safety-critical ML applications.
24. What advanced techniques exist for extreme class imbalance beyond SMOTE?
When class imbalance is extreme (e.g., 1 positive per 10,000 negatives), standard approaches like SMOTE are insufficient. Advanced techniques are needed to handle such scenarios in fraud detection, rare disease diagnosis, and anomaly detection.
Example (ADASYN + threshold optimization):
from imblearn.over_sampling import ADASYN
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, precision_recall_curve
import numpy as np

X, y = make_classification(n_samples=5000, weights=[0.99, 0.01], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# ADASYN: adaptive synthetic sampling (focuses on harder minority examples)
adasyn = ADASYN(random_state=42)
X_res, y_res = adasyn.fit_resample(X_train, y_train)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_res, y_res)
y_proba = model.predict_proba(X_test)[:, 1]

# Optimize the decision threshold for best F1; drop the last precision/recall
# point, which has no corresponding threshold
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
f1_scores = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-9)
best_thresh = thresholds[np.argmax(f1_scores)]
y_pred_opt = (y_proba >= best_thresh).astype(int)

print(f"Default threshold F1: {f1_score(y_test, model.predict(X_test)):.4f}")
print(f"Optimal threshold ({best_thresh:.3f}) F1: {f1_score(y_test, y_pred_opt):.4f}")
For extreme imbalance, combine ADASYN or cost-sensitive learning with threshold optimization, use anomaly detection approaches (Isolation Forest), and always evaluate with precision-recall AUC rather than accuracy or standard ROC-AUC.
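The cost-sensitive alternative mentioned above can be as simple as scikit-learn's `class_weight='balanced'`, which reweights the loss by inverse class frequency instead of resampling. A minimal sketch comparing it against an unweighted baseline with PR-AUC:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import average_precision_score

X, y = make_classification(n_samples=5000, weights=[0.99, 0.01], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Unweighted baseline
plain = RandomForestClassifier(n_estimators=100, random_state=42)
plain.fit(X_train, y_train)

# Cost-sensitive: minority-class errors are penalized more heavily
weighted = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42)
weighted.fit(X_train, y_train)

for name, m in [("plain", plain), ("balanced", weighted)]:
    ap = average_precision_score(y_test, m.predict_proba(X_test)[:, 1])
    print(f"{name:>8} PR-AUC: {ap:.3f}")
```

Cost-sensitive learning avoids the synthetic-sample artifacts that oversamplers can introduce, and it composes naturally with the threshold optimization shown earlier.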
25. How do you monitor and maintain ML models in production?
Model monitoring is the practice of continuously tracking a deployed ML model's health, performance, and data quality after it has been released to production. Without monitoring, model degradation goes unnoticed and can cause significant business damage.
Model performance metrics: Track accuracy, F1, or custom business KPIs on labeled production data (when labels are available).
Data drift detection: Monitor input feature distributions using KS test, PSI, or Wasserstein distance.
Prediction drift: Track shifts in the distribution of model output scores or class probabilities.
Infrastructure metrics: Latency, throughput, memory usage, and error rates.
Example (lightweight drift monitor):
import numpy as np
from scipy.stats import ks_2samp

class SimpleDriftMonitor:
    def __init__(self, reference_data: np.ndarray, threshold: float = 0.05):
        self.reference = reference_data
        self.threshold = threshold

    def check(self, new_data: np.ndarray, feature_names=None) -> dict:
        alerts = {}
        for i in range(new_data.shape[1]):
            stat, p_value = ks_2samp(self.reference[:, i], new_data[:, i])
            fname = feature_names[i] if feature_names else f"feature_{i}"
            if p_value < self.threshold:
                alerts[fname] = {"ks_stat": round(stat, 4), "p_value": round(p_value, 4)}
        return alerts

# Simulate reference vs. production data
np.random.seed(42)
reference = np.random.randn(1000, 5)
no_drift = np.random.randn(200, 5)
with_drift = np.random.randn(200, 5)
with_drift[:, 2] += 3  # feature 2 has shifted

monitor = SimpleDriftMonitor(reference)
print("No drift alerts: ", monitor.check(no_drift))
print("With drift alerts:", monitor.check(with_drift))
Set up automated alerts that trigger retraining pipelines when drift is detected. Tools like Evidently AI, WhyLogs, Arize, and NannyML provide production-grade ML monitoring with dashboards and alerting out of the box.