Machine Learning Interview Questions
Master your Machine Learning interviews with comprehensive questions and code examples covering fundamentals, algorithms, model evaluation, and advanced production techniques for freshers and experienced professionals.
I. Beginner Level
1. What is Machine Learning?
Machine Learning (ML) is a subset of Artificial Intelligence that enables computer systems to learn from data and improve their performance on tasks without being explicitly programmed. Instead of writing rules manually, ML algorithms discover patterns and build models from training data.
Example (training a simple model with scikit-learn):
from sklearn.linear_model import LinearRegression
import numpy as np

# Training data: house size (sqft) -> price
X = np.array([[500], [800], [1200], [1500], [2000]])
y = np.array([150000, 200000, 280000, 320000, 400000])

model = LinearRegression()
model.fit(X, y)

# Predict price for a 1000 sqft house
print(model.predict([[1000]]))  # a single-element array with the predicted price
Machine Learning powers applications like spam detection, recommendation engines, fraud detection, image recognition, and many other real-world systems.
2. What are the main types of Machine Learning?
Machine Learning is broadly categorized into three main types based on how models learn from data.
Supervised Learning: The model is trained on labeled data (input-output pairs). Examples: linear regression, decision trees, SVM.
Unsupervised Learning: The model finds hidden patterns in unlabeled data. Examples: K-Means clustering, PCA, autoencoders.
Reinforcement Learning: An agent learns to make decisions by receiving rewards or penalties from the environment. Examples: game-playing AI, robotics.
A fourth category, semi-supervised learning, combines labeled and unlabeled data and is growing in importance for real-world datasets where labeling is expensive.
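The supervised and unsupervised paradigms can be contrasted directly on the same toy data. A minimal sketch with scikit-learn (reinforcement learning is omitted since it requires an environment loop; the data and cluster ids here are illustrative, and cluster ids are arbitrary):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])  # labels available -> supervised

# Supervised: learn the mapping X -> y from labeled pairs
clf = LogisticRegression().fit(X, y)
print(clf.predict([[2.5], [11.5]]))  # [0 1]

# Unsupervised: discover the same grouping without any labels
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(km.labels_)  # two clusters; which id maps to which group is arbitrary
```

The same structure is recovered either way; the difference is whether the correct answers were provided during training.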
3. What is supervised learning? Give an example.
Supervised learning is a type of ML where the model is trained on a dataset that contains both input features and the correct output labels. The model learns to map inputs to outputs by minimizing prediction errors on the training data.
Example (email spam classification):
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

# Training data: emails and labels (1=spam, 0=not spam)
emails = ["Win a free iPhone now", "Meeting at 3pm", "Claim your prize", "Project update"]
labels = [1, 0, 1, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

model = MultinomialNB()
model.fit(X, labels)

# Predict
new_email = vectorizer.transform(["Free prize waiting for you"])
print(model.predict(new_email))  # Output: [1] (spam)
Supervised learning is the most widely used ML paradigm and forms the foundation of most production ML systems.
4. What is unsupervised learning? Give an example.
Unsupervised learning is a type of ML where the model is trained on data that has no labels. The algorithm must discover the underlying structure, patterns, or groupings in the data on its own without any guidance.
Example (customer segmentation with K-Means):
from sklearn.cluster import KMeans
import numpy as np

# Customer data: [age, spending_score]
X = np.array([[25, 80], [30, 60], [45, 20], [50, 10], [22, 90], [35, 50]])

kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(X)

print(kmeans.labels_)  # e.g., [1, 1, 0, 0, 1, 1]
# Cluster 1: young high-spenders, Cluster 0: older low-spenders
Unsupervised learning is used when labeled data is expensive or unavailable, and is common in exploratory data analysis and dimensionality reduction.
5. What is a label in machine learning?
A label (also called the target variable or output variable) is the value the model is trained to predict. In supervised learning, each training example consists of input features and a corresponding label that represents the correct answer.
Example (identifying features vs. labels):
import pandas as pd

data = pd.DataFrame({
    "age": [25, 30, 45],               # feature
    "salary": [50000, 70000, 90000],   # feature
    "purchased": [0, 1, 1]             # label (target)
})

X = data[["age", "salary"]]  # features
y = data["purchased"]        # label
print(X)
print(y)
Labels can be continuous values (for regression) or discrete categories (for classification). High-quality labels are critical: noisy or incorrect labels directly degrade model performance.
6. What is a feature in machine learning?
A feature is an individual measurable input variable used to make predictions. Features are the columns of your dataset that the model uses to learn patterns. The quality and relevance of features directly determine model performance — this is why feature engineering is so important.
Example (feature types):
import pandas as pd

df = pd.DataFrame({
    "square_footage": [1200, 1500, 900],           # numerical feature
    "location": ["urban", "suburban", "rural"],    # categorical feature
    "has_garage": [True, False, True],             # binary feature
    "price": [300000, 250000, 180000]              # label (target)
})

print(df.dtypes)
print(df.head())
Features can be numerical, categorical, binary, text, or image-based. Choosing the right features and transforming them appropriately is called feature engineering.
7. What is a train-test split and why is it important?
A train-test split is the practice of dividing a dataset into two parts: a training set used to train the model, and a test set used to evaluate how well the model generalizes to unseen data. Without this split, you cannot reliably measure real-world performance.
Example:
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Train size: {len(X_train)}")  # 120
print(f"Test size: {len(X_test)}")    # 30
A typical split is 80/20 or 70/30 (train/test). For smaller datasets, k-fold cross-validation is preferred over a simple split.
8. What is overfitting and how do you prevent it?
Overfitting occurs when a model learns the training data too well — including its noise and outliers — resulting in poor generalization to new, unseen data. An overfit model has very high training accuracy but significantly lower test accuracy.
Example (detecting overfitting by comparing train vs. test accuracy):
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Overfit model (no depth limit)
model = DecisionTreeClassifier()  # unlimited depth
model.fit(X_train, y_train)
print(f"Train acc: {model.score(X_train, y_train):.2f}")  # 1.00 (memorized)
print(f"Test acc: {model.score(X_test, y_test):.2f}")     # ~0.93 (drop)

# Fix: limit depth (regularization)
model_reg = DecisionTreeClassifier(max_depth=3)
model_reg.fit(X_train, y_train)
print(f"Regularized test acc: {model_reg.score(X_test, y_test):.2f}")  # better generalization
To prevent overfitting:
Use more training data to expose the model to more diverse examples.
Apply regularization (L1/L2, max_depth, min_samples_leaf).
Use cross-validation to detect the gap between training and validation performance.
The train-test accuracy gap is the primary signal of overfitting. Always monitor both metrics during model development.
9. What is underfitting in machine learning?
Underfitting happens when a model is too simple to capture the underlying patterns in the data, resulting in poor performance on both training and test sets. An underfit model has high bias and low variance.
Example (linear model underfitting non-linear data):
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

# Non-linear data: y = x^2 + noise
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = X.ravel() ** 2 + np.random.randn(100) * 0.5

# Underfit: linear model on non-linear data
linear = LinearRegression().fit(X, y)
print(f"Linear MSE: {mean_squared_error(y, linear.predict(X)):.2f}")  # high error

# Fix: polynomial features
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
better = LinearRegression().fit(X_poly, y)
print(f"Poly MSE: {mean_squared_error(y, better.predict(X_poly)):.2f}")  # low error
To fix underfitting: use a more complex model, add more features, increase polynomial degree, or reduce regularization strength.
10. What is linear regression?
Linear regression is a supervised learning algorithm that models the relationship between one or more input features and a continuous output variable by fitting a straight line (or hyperplane) to the data. It minimizes the sum of squared errors between predictions and actual values.
Example:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Years of experience -> Salary
X = np.array([[1], [2], [3], [5], [7], [10]])
y = np.array([40000, 50000, 60000, 80000, 100000, 130000])

model = LinearRegression()
model.fit(X, y)

print(f"Coefficient: {model.coef_[0]:.2f}")  # slope
print(f"Intercept: {model.intercept_:.2f}")  # bias
print(f"R2 Score: {r2_score(y, model.predict(X)):.2f}")
print(f"Predict 8yrs: {model.predict([[8]])[0]:.0f}")
Linear regression assumes a linear relationship between inputs and output, that residuals are normally distributed, and that features are not highly correlated (no multicollinearity).
11. What is logistic regression and when is it used?
Logistic regression is a supervised learning algorithm used for classification problems. Despite its name, it is a classifier, not a regressor. It models the probability that an input belongs to a given class using the sigmoid function, which maps any real number to a value between 0 and 1.
Example (binary classification — pass/fail):
from sklearn.linear_model import LogisticRegression
import numpy as np

# Hours studied -> Pass (1) or Fail (0)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

print(model.predict([[3.5]]))        # likely 0 (fail)
print(model.predict_proba([[5.5]]))  # probability of each class
Logistic regression is widely used for binary classification tasks such as spam detection, disease diagnosis, and credit risk assessment.
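The sigmoid mapping can also be verified by hand: for a fitted binary logistic regression, the class-1 probability is the sigmoid of the linear score. A short sketch reusing the same pass/fail data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    # maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(X, y)

z = model.decision_function([[5.5]])            # linear score w*x + b
p_manual = sigmoid(z)[0]                        # apply sigmoid by hand
p_sklearn = model.predict_proba([[5.5]])[0, 1]  # class-1 probability
print(p_manual, p_sklearn)                      # the two values match
```

This is why logistic regression outputs are interpretable as calibrated-looking probabilities rather than raw scores.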
12. What is a decision tree in machine learning?
A decision tree is a supervised learning algorithm that makes predictions by learning a hierarchical sequence of if-else decision rules from the data. It splits the data at each node based on the feature that provides the best information gain or lowest Gini impurity.
Example:
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X, y)

# Print the learned tree rules
print(export_text(model, feature_names=load_iris().feature_names))
print(f"Accuracy: {model.score(X, y):.2f}")
Decision trees are highly interpretable but prone to overfitting. Limiting max_depth or using ensemble methods like Random Forest significantly improves robustness.
13. What is accuracy in machine learning and when is it misleading?
Accuracy is the ratio of correctly predicted samples to the total number of samples. It is one of the most commonly reported classification metrics, but it can be highly misleading when the dataset is imbalanced.
Example (imbalanced dataset — accuracy is misleading):
from sklearn.metrics import accuracy_score, classification_report
import numpy as np

# 950 negatives, 50 positives (imbalanced)
y_true = np.array([0] * 950 + [1] * 50)

# Dumb model: always predict 0
y_pred = np.zeros(1000, dtype=int)

print(f"Accuracy: {accuracy_score(y_true, y_pred):.2f}")  # 0.95 -- looks great!
print(classification_report(y_true, y_pred))              # recall for class 1 = 0.00
On imbalanced datasets, always prefer precision, recall, F1-score, or AUC-ROC over accuracy as the primary evaluation metric.
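A quick sketch of how those alternative metrics expose the same always-negative model that accuracy rewarded:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

y_true = np.array([0] * 950 + [1] * 50)
y_pred = np.zeros(1000, dtype=int)   # always-negative model

print(f"Accuracy: {accuracy_score(y_true, y_pred):.2f}")             # 0.95
print(f"F1:       {f1_score(y_true, y_pred, zero_division=0):.2f}")  # 0.00
# AUC uses ranking; constant predictions have no ranking ability
print(f"AUC:      {roc_auc_score(y_true, y_pred):.2f}")              # 0.50
```

F1 collapses to 0 because no positives are ever found, and AUC sits at the coin-flip baseline of 0.5, both telling the true story that accuracy hides.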
14. What is a loss function in machine learning?
A loss function (also called a cost function or objective function) measures how far the model's predictions are from the actual values. During training, the algorithm minimizes the loss function to improve the model's accuracy.
Example (common loss functions implemented in Python):
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 6.0])

# Mean Squared Error (regression)
mse = np.mean((y_true - y_pred) ** 2)
print(f"MSE: {mse:.4f}")

# Mean Absolute Error (regression)
mae = np.mean(np.abs(y_true - y_pred))
print(f"MAE: {mae:.4f}")

# Binary Cross-Entropy (classification)
y_true_cls = np.array([1, 0, 1, 1])
y_pred_proba = np.array([0.9, 0.1, 0.8, 0.7])
bce = -np.mean(y_true_cls * np.log(y_pred_proba) + (1 - y_true_cls) * np.log(1 - y_pred_proba))
print(f"BCE: {bce:.4f}")
MSE is used for regression, binary cross-entropy for binary classification, and categorical cross-entropy for multi-class classification. Choosing the right loss function is critical for model training.
15. What is gradient descent?
Gradient descent is an optimization algorithm used to minimize the loss function by iteratively updating model parameters in the direction opposite to the gradient (slope) of the loss. The size of each step is controlled by the learning rate.
Example (gradient descent from scratch):
import numpy as np

# Simple linear regression via gradient descent
np.random.seed(42)
X = np.random.randn(100)
y = 3 * X + 2 + np.random.randn(100) * 0.5  # true: w=3, b=2

w, b = 0.0, 0.0
lr = 0.01

for epoch in range(1000):
    y_pred = w * X + b
    loss = np.mean((y_pred - y) ** 2)
    dw = np.mean(2 * (y_pred - y) * X)
    db = np.mean(2 * (y_pred - y))
    w -= lr * dw
    b -= lr * db

print(f"Learned w={w:.3f}, b={b:.3f}")  # close to w=3, b=2
Variants include Stochastic Gradient Descent (one sample at a time), Mini-Batch SGD (small batches), and adaptive optimizers like Adam and RMSProp which adjust the learning rate per parameter.
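The mini-batch variant differs only in that each update uses a small random batch rather than the full dataset. A sketch on the same kind of synthetic data (the learning rate and batch size here are illustrative choices):

```python
import numpy as np

np.random.seed(42)
X = np.random.randn(200)
y = 3 * X + 2 + np.random.randn(200) * 0.5  # true: w=3, b=2

w, b = 0.0, 0.0
lr, batch_size = 0.05, 16

for epoch in range(200):
    idx = np.random.permutation(len(X))          # shuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        y_pred = w * Xb + b
        # Gradients computed on the batch only, not the full dataset
        w -= lr * np.mean(2 * (y_pred - yb) * Xb)
        b -= lr * np.mean(2 * (y_pred - yb))

print(f"Learned w={w:.3f}, b={b:.3f}")  # close to w=3, b=2
```

Mini-batches trade a little gradient noise for far more parameter updates per pass over the data, which usually speeds up convergence on large datasets.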
16. What is normalization and why is it needed in ML?
Normalization (also called feature scaling) is the process of transforming feature values to a common scale so that no single feature dominates others during model training. Many ML algorithms are sensitive to the scale of input features.
Example (Min-Max scaling and Standardization):
from sklearn.preprocessing import MinMaxScaler, StandardScaler
import numpy as np

X = np.array([[1000, 1], [2000, 2], [3000, 3], [4000, 4]])

# Min-Max Normalization: scales to [0, 1]
minmax = MinMaxScaler()
print("MinMax:\n", minmax.fit_transform(X))

# Standardization: mean=0, std=1
standard = StandardScaler()
print("Standardized:\n", standard.fit_transform(X))
Use StandardScaler for algorithms that assume Gaussian distributions (SVM, PCA, logistic regression). Use MinMaxScaler when you need values in a fixed range (e.g., for neural networks). Always fit the scaler on training data only and transform both train and test sets.
17. What is a hyperparameter in machine learning?
A hyperparameter is a configuration setting for a machine learning algorithm that is set before training begins and is not learned from the data. Unlike model parameters (like weights), hyperparameters control the structure and training process of the model.
Example (common hyperparameters):
from sklearn.ensemble import RandomForestClassifier

# All of these are hyperparameters (set before training)
model = RandomForestClassifier(
    n_estimators=100,      # number of trees
    max_depth=5,           # maximum depth of each tree
    min_samples_split=4,   # min samples to split a node
    max_features='sqrt',   # features to consider per split
    random_state=42
)
# After fit, model.feature_importances_ is a LEARNED quantity;
# for linear models, the learned parameters are model.coef_
Finding the best hyperparameters is called hyperparameter tuning and is done using techniques like GridSearchCV, RandomizedSearchCV, or Bayesian optimization.
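A minimal GridSearchCV sketch (the grid values below are illustrative, chosen small so it runs quickly on the iris data):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 5, None],
}

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,                 # 5-fold cross-validation per combination
    scoring="accuracy",
)
grid.fit(X, y)

print("Best params:", grid.best_params_)
print(f"Best CV accuracy: {grid.best_score_:.3f}")
```

GridSearchCV exhaustively tries every combination; RandomizedSearchCV samples the grid instead, which scales better when the search space is large.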
18. What is the K-Nearest Neighbors (KNN) algorithm?
K-Nearest Neighbors (KNN) is a simple, non-parametric algorithm that classifies a data point based on the majority class of its k nearest neighbors in the feature space. It uses a distance metric (usually Euclidean) to find neighbors. KNN has no explicit training phase — it memorizes the training data.
Example:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(f"KNN Accuracy: {knn.score(X_test, y_test):.2f}")
KNN is simple and effective for small datasets but becomes slow for large datasets due to its O(n) prediction complexity. Always normalize features before using KNN.
19. What is the Naive Bayes classifier?
Naive Bayes is a probabilistic classifier based on Bayes' theorem with the 'naive' assumption that all features are conditionally independent given the class label. Despite this strong assumption, it works surprisingly well for text classification and spam filtering.
Example (Gaussian Naive Bayes):
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

gnb = GaussianNB()
gnb.fit(X_train, y_train)

print(f"Naive Bayes Accuracy: {gnb.score(X_test, y_test):.2f}")
print(f"Class priors: {gnb.class_prior_}")
Variants include GaussianNB for continuous features, MultinomialNB for text data, and BernoulliNB for binary features. It is fast and works well for high-dimensional data like text.
20. What is the difference between classification and regression?
Classification and regression are both supervised learning tasks, but they differ in the type of output they predict.
Classification: Predicts a discrete class label. Output is one of a finite set of categories. Examples: spam/not spam, cat/dog, digit 0-9.
Regression: Predicts a continuous numerical value. Output can be any real number. Examples: house price, temperature, stock value.
Example:
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.datasets import load_iris, load_diabetes

# Classification example
X_cls, y_cls = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=200).fit(X_cls, y_cls)
print("Classification predictions:", clf.predict(X_cls[:5]))  # [0, 0, 0, 0, 0]

# Regression example
X_reg, y_reg = load_diabetes(return_X_y=True)
reg = LinearRegression().fit(X_reg, y_reg)
print("Regression predictions:", reg.predict(X_reg[:3]).round(1))  # [206.1, 68.1, 176.9]
The type of label determines whether it is a classification or regression problem. Metrics also differ: use accuracy/F1 for classification and MSE/R2 for regression.
II. Intermediate Level
1. What is the bias-variance tradeoff?
The bias-variance tradeoff describes a fundamental tension in ML model design. Bias is error from incorrect assumptions; high bias causes underfitting. Variance is error from sensitivity to fluctuations in training data; high variance causes overfitting. Reducing one typically increases the other.
Example (visualizing bias-variance with learning curves):
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
import numpy as np

X, y = load_iris(return_X_y=True)

for depth in [1, 5, None]:  # high bias, balanced, high variance
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    train_sizes, train_scores, val_scores = learning_curve(model, X, y, cv=5)
    # Report the mean score at the largest training size (last row)
    print(f"Depth={depth} | Train={train_scores[-1].mean():.2f} | Val={val_scores[-1].mean():.2f}")
# depth=1: low train, low val (high bias / underfit)
# depth=None: high train, lower val (high variance / overfit)
# depth=5: balanced
The goal is to find the sweet spot: a model complex enough to learn the signal but constrained enough not to memorize the noise. Regularization, cross-validation, and ensemble methods help achieve this balance.
2. What is cross-validation and how does k-fold work?
Cross-validation is a resampling technique used to evaluate model performance more reliably than a single train-test split. K-fold cross-validation divides the data into k equal parts. The model trains on k-1 folds and validates on the remaining fold, repeating k times.
Example:
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Stratified k-fold preserves class distribution in each fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf, scoring='f1')

print(f"F1 per fold: {scores.round(3)}")
print(f"Mean F1: {scores.mean():.3f} ± {scores.std():.3f}")
Always use StratifiedKFold for classification to ensure class proportions are maintained in each fold. The mean and standard deviation of scores give a reliable estimate of model performance.
3. Explain precision, recall, and F1-score with examples.
Precision, recall, and F1-score are classification metrics especially important for imbalanced datasets. They provide a more complete picture than accuracy alone.
Example:
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report
import numpy as np

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 0])

print(f"Precision: {precision_score(y_true, y_pred):.2f}")  # TP / (TP + FP)
print(f"Recall:    {recall_score(y_true, y_pred):.2f}")     # TP / (TP + FN)
print(f"F1-Score:  {f1_score(y_true, y_pred):.2f}")         # harmonic mean of P and R
print()
print(classification_report(y_true, y_pred))
High precision needed: Spam detection (don't flag legitimate emails as spam).
High recall needed: Cancer detection (don't miss a true positive).
F1 balances both and is ideal when you need a single metric for an imbalanced dataset.
The choice between precision and recall depends on the cost of false positives vs. false negatives in your specific problem domain.
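That tradeoff can also be steered after training by moving the decision threshold on predicted probabilities rather than retraining. A sketch on the breast-cancer dataset (the thresholds shown are illustrative):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features so logistic regression converges cleanly
scaler = StandardScaler()
model = LogisticRegression(max_iter=1000).fit(scaler.fit_transform(X_train), y_train)
proba = model.predict_proba(scaler.transform(X_test))[:, 1]

scores = {}
for threshold in [0.3, 0.5, 0.7]:
    y_pred = (proba >= threshold).astype(int)  # move the decision cutoff
    scores[threshold] = (precision_score(y_test, y_pred), recall_score(y_test, y_pred))
    p, r = scores[threshold]
    print(f"threshold={threshold} | precision={p:.2f} | recall={r:.2f}")
# Lowering the threshold flags more positives: recall rises, precision tends to fall
```

For a cancer screen you would pick a low threshold (favoring recall); for spam filtering, a high one (favoring precision).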
4. What is a confusion matrix? How do you compute it?
A confusion matrix is a table that visualizes the performance of a classification model by showing counts of true positives, true negatives, false positives, and false negatives. It is the foundation for computing all other classification metrics.
Example:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
# [[TN, FP],
#  [FN, TP]]

tn, fp, fn, tp = cm.ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
From the confusion matrix you can derive accuracy, precision, recall, F1-score, specificity, and the false positive rate, giving a complete picture of classifier performance.
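Those derivations follow directly from the four cell counts. A self-contained sketch on a small hand-made example:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy    = (tp + tn) / (tp + tn + fp + fn)
precision   = tp / (tp + fp)
recall      = tp / (tp + fn)   # also called sensitivity
specificity = tn / (tn + fp)
f1          = 2 * precision * recall / (precision + recall)

print(f"acc={accuracy:.2f} prec={precision:.2f} rec={recall:.2f} "
      f"spec={specificity:.2f} f1={f1:.2f}")
```

Being able to write these ratios from the four cells without looking them up is a common interview checkpoint.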
5. How does Random Forest work and what makes it powerful?
Random Forest is an ensemble learning method that builds multiple decision trees during training and aggregates their predictions. It introduces two key sources of randomness: bootstrap sampling (bagging) and random feature selection at each split. This diversity among trees reduces overfitting dramatically.
Example (with feature importance):
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
import pandas as pd

X, y = load_iris(return_X_y=True)
feature_names = load_iris().feature_names

model = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
model.fit(X, y)

# Feature importance
imp = pd.Series(model.feature_importances_, index=feature_names).sort_values(ascending=False)
print("Feature Importances:")
print(imp)

# Out-of-bag score: a free validation estimate from the bootstrap samples
print(f"\nOOB Score: {model.oob_score_:.3f}")
Random Forest provides built-in feature importance, handles missing values reasonably, and requires little preprocessing, making it one of the most versatile and reliable out-of-the-box ML algorithms.
6. What is gradient boosting and how does it differ from bagging?
Gradient boosting is an ensemble technique that builds models sequentially, where each new model corrects the residual errors of the previous one by fitting to the negative gradient of the loss function. Unlike bagging (Random Forest), boosting trains trees in sequence, not in parallel.
Bagging (Random Forest): Trains trees independently in parallel on random subsets. Reduces variance. Better when individual trees overfit.
Boosting (GBM, XGBoost): Trains trees sequentially, each correcting previous errors. Reduces bias. Usually achieves higher accuracy but is slower and more prone to overfitting on noisy data.
Example:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gbm.fit(X_train, y_train)
print(f"GBM Accuracy: {gbm.score(X_test, y_test):.3f}")
Gradient boosting is the algorithm behind XGBoost, LightGBM, and CatBoost, the dominant algorithms in structured/tabular data competitions.
7. How does a Support Vector Machine (SVM) work?
SVM finds the optimal hyperplane that separates classes with the maximum margin. Data points closest to the hyperplane are called support vectors and directly define the decision boundary. The kernel trick maps data to higher dimensions to handle non-linearly separable problems.
Example (comparing linear vs. RBF kernel):
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

for kernel in ['linear', 'rbf', 'poly']:
    svm = SVC(kernel=kernel, C=1.0)
    svm.fit(X_train_s, y_train)
    print(f"Kernel={kernel:6s} | Accuracy={svm.score(X_test_s, y_test):.3f} | Support vectors={sum(svm.n_support_)}")
SVM works best with normalized features. Use a linear kernel for high-dimensional text data and an RBF kernel for most other tasks. The C parameter controls the bias-variance tradeoff.
8. How does K-Means clustering work? Write the algorithm.
K-Means partitions data into k clusters by iteratively assigning each point to its nearest centroid and recomputing centroids as the mean of assigned points. It converges when assignments no longer change.
Example (K-Means with sklearn and the Elbow method):
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

# Elbow method to find optimal k
inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, random_state=42, n_init='auto')
    km.fit(X)
    inertias.append(km.inertia_)
    print(f"k={k} | inertia={km.inertia_:.1f}")

# Best model with k=4
best = KMeans(n_clusters=4, random_state=42, n_init='auto').fit(X)
print(f"\nCluster labels sample: {best.labels_[:10]}")
print(f"Centroids:\n{best.cluster_centers_.round(2)}")
K-Means assumes spherical clusters and is sensitive to outliers and initialization. Use the Elbow method or Silhouette score to choose k. KMeans++ initialization (default) improves convergence.
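Since the question asks to write the algorithm, here is a from-scratch sketch of the assign/update loop. It uses random data-point initialization instead of k-means++ and adds an empty-cluster guard; both are implementation choices of this sketch, not part of the canonical algorithm:

```python
import numpy as np
from sklearn.datasets import make_blobs

def kmeans(X, k, n_iters=100, seed=42):
    rng = np.random.default_rng(seed)
    # Initialize centroids as k distinct random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its points;
        # keep the old centroid if a cluster ends up empty
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # converged: assignments stable
            break
        centroids = new_centroids
    return labels, centroids

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)
labels, centroids = kmeans(X, k=4)
print("Cluster sizes:", np.bincount(labels, minlength=4))
```

The two alternating steps (assign, then re-average) are exactly what sklearn's KMeans runs internally, minus the smarter k-means++ seeding and multiple restarts.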
9. How does PCA reduce dimensionality?
PCA (Principal Component Analysis) is a linear dimensionality reduction technique that projects data onto a lower-dimensional space defined by the directions (principal components) of maximum variance. It uses eigendecomposition of the covariance matrix to find these directions.
Example (reducing Iris from 4D to 2D):
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
import numpy as np

X, y = load_iris(return_X_y=True)

# Always standardize before PCA
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(f"Original shape: {X_scaled.shape}")  # (150, 4)
print(f"Reduced shape: {X_reduced.shape}")  # (150, 2)
print(f"Explained variance ratio: {pca.explained_variance_ratio_.round(3)}")
print(f"Total variance retained: {pca.explained_variance_ratio_.sum():.2%}")
PCA is unsupervised, so it maximizes variance regardless of class labels. For classification-aware reduction, Linear Discriminant Analysis (LDA) is preferred.
10. What is regularization? Explain L1 (Lasso) and L2 (Ridge) with code.
Regularization adds a penalty term to the loss function to discourage the model from learning overly large weights, which helps prevent overfitting. L1 adds the sum of absolute weights; L2 adds the sum of squared weights.
Example:
from sklearn.linear_model import Ridge, Lasso, LinearRegression
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = [
    ("Linear", LinearRegression()),
    ("Ridge (L2)", Ridge(alpha=1.0)),
    ("Lasso (L1)", Lasso(alpha=0.1)),
]
for name, model in models:
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    nonzero = (model.coef_ != 0).sum()
    print(f"{name:12s} | MSE={mse:.1f} | Non-zero coefs={nonzero}")
# Lasso drives some coefficients to exactly 0 (feature selection)
Use L1 when you suspect only a few features are relevant (sparse solution). Use L2 when all features contribute and you want to shrink their influence. ElasticNet combines both penalties.
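ElasticNet mixes the two penalties through its l1_ratio parameter. A quick sketch on the same diabetes data (alpha and l1_ratio values are illustrative):

```python
from sklearn.linear_model import ElasticNet
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# l1_ratio=0 -> pure L2 (Ridge-like); l1_ratio=1 -> pure L1 (Lasso-like)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
enet.fit(X_train, y_train)

mse = mean_squared_error(y_test, enet.predict(X_test))
nonzero = (enet.coef_ != 0).sum()
print(f"ElasticNet | MSE={mse:.1f} | Non-zero coefs={nonzero}")
```

ElasticNet is useful when features are correlated: pure Lasso tends to pick one of a correlated group arbitrarily, while the L2 component spreads weight across the group.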
11. How do you handle missing data in a machine learning dataset?
Missing data is one of the most common real-world data issues. Handling it correctly prevents biased models and errors during training. The right approach depends on the amount and type of missingness.
Example (imputation strategies):
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

df = pd.DataFrame({
    'age': [25, np.nan, 35, 45, np.nan],
    'salary': [50000, 60000, np.nan, 80000, 90000],
    'city': ['NY', 'LA', np.nan, 'NY', 'LA']
})

# Strategy 1: Mean imputation (numerical)
num_imputer = SimpleImputer(strategy='mean')
print("Mean imputed:\n", num_imputer.fit_transform(df[['age', 'salary']]))

# Strategy 2: Mode imputation (categorical)
cat_imputer = SimpleImputer(strategy='most_frequent')
print("Mode imputed:", cat_imputer.fit_transform(df[['city']]).ravel())

# Strategy 3: KNN imputation (uses neighbor values)
knn_imp = KNNImputer(n_neighbors=2)
print("KNN imputed:\n", knn_imp.fit_transform(df[['age', 'salary']]))
Never impute using statistics computed on the test set; always fit imputers on training data only. For large amounts of missing data, consider adding a binary indicator column to flag which values were imputed.
12. What is feature selection and what techniques are used?
Feature selection is the process of choosing the most relevant subset of features for a model. It reduces overfitting, improves accuracy, and decreases training time by removing redundant and irrelevant features.
Example (filter, wrapper, and embedded methods):
from sklearn.feature_selection import SelectKBest, f_classif, RFE, SelectFromModel
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
names = load_iris().feature_names

# Filter: select top 2 features by ANOVA F-score
filter_sel = SelectKBest(f_classif, k=2).fit(X, y)
print("Filter selected:", [names[i] for i in filter_sel.get_support(indices=True)])

# Wrapper: Recursive Feature Elimination
rfe = RFE(RandomForestClassifier(n_estimators=10, random_state=42), n_features_to_select=2)
rfe.fit(X, y)
print("RFE selected:", [names[i] for i in rfe.get_support(indices=True)])

# Embedded: model-based selection via Random Forest feature importances
sfm = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42))
sfm.fit(X, y)
print("Embedded selected:", [names[i] for i in sfm.get_support(indices=True)])

Filter methods are fastest, wrapper methods are most thorough but expensive, and embedded methods (like tree feature importance or Lasso) are a good practical middle ground.
13. How do you handle an imbalanced dataset?
An imbalanced dataset has a significant disparity in the number of samples per class, causing models to be biased toward the majority class. Several techniques can address this issue.
Example (SMOTE oversampling and class_weight):
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Approach 1: class_weight='balanced' (built in to most sklearn estimators)
model_balanced = RandomForestClassifier(class_weight='balanced', random_state=42)
model_balanced.fit(X_train, y_train)
print("class_weight='balanced':")
print(classification_report(y_test, model_balanced.predict(X_test)))

# Approach 2: SMOTE (Synthetic Minority Oversampling Technique)
smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X_train, y_train)
model_smote = RandomForestClassifier(random_state=42).fit(X_res, y_res)
print("After SMOTE:")
print(classification_report(y_test, model_smote.predict(X_test)))

Always evaluate imbalanced models using F1, precision-recall AUC, or the Matthews Correlation Coefficient rather than accuracy. SMOTE should only be applied to the training set, never to test data.
14. What is the ROC curve and AUC score?
The ROC (Receiver Operating Characteristic) curve plots the True Positive Rate (recall) against the False Positive Rate at various classification thresholds. The AUC (Area Under the Curve) is a single number summarizing the model's ability to discriminate between classes — 1.0 is perfect, 0.5 is random.
Example:
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]

auc = roc_auc_score(y_test, y_proba)
print(f"AUC-ROC: {auc:.4f}")

fpr, tpr, thresholds = roc_curve(y_test, y_proba)
best_threshold = thresholds[np.argmax(tpr - fpr)]  # Youden's J statistic
print(f"Threshold maximizing TPR - FPR: {best_threshold:.3f}")

AUC-ROC is threshold-independent and works well for binary classification on imbalanced datasets. For multi-class problems, use the average of one-vs-rest AUC scores.
15. What are ensemble methods in machine learning?
Ensemble methods combine multiple base models to produce a stronger predictive model. The key principle is that diverse, slightly imperfect models can collectively outperform any individual model.
Bagging: Trains models independently on bootstrap samples and averages results (Random Forest). Reduces variance.
Boosting: Trains models sequentially, each correcting the previous errors (XGBoost, AdaBoost). Reduces bias.
Stacking: Uses predictions of base models as features for a meta-learner. Can combine very different model types.
Voting: Aggregates predictions from multiple classifiers by majority vote (hard) or averaged probabilities (soft).
Ensemble methods consistently win machine learning competitions and are widely used in industry for their robustness and accuracy.
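The voting strategy above can be sketched with scikit-learn's VotingClassifier; the base models and dataset here are illustrative choices:

```python
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Soft voting averages predicted probabilities across diverse models
vote = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(max_iter=5000)),
        ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
        ('nb', GaussianNB())
    ],
    voting='soft'
)
vote.fit(X_train, y_train)
print(f"Voting ensemble accuracy: {vote.score(X_test, y_test):.3f}")
```

Hard voting (voting='hard') takes the majority class instead; soft voting usually works better when the base models produce calibrated probabilities.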
16. What is one-hot encoding and when do you use it?
One-hot encoding converts a categorical variable into a set of binary columns — one per category — where only the column corresponding to the present category has a value of 1 and all others are 0. It prevents the model from assuming ordinal relationships between categories.
Example:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'color': ['red', 'green', 'blue', 'green', 'red']})

# pandas get_dummies
print("pd.get_dummies:")
print(pd.get_dummies(df, drop_first=True))  # drop_first avoids multicollinearity

# sklearn OneHotEncoder
enc = OneHotEncoder(sparse_output=False, drop='first')
result = enc.fit_transform(df[['color']])
print("\nSklearn OHE:")
print(result)
print("Categories:", enc.categories_)

Use one-hot encoding for nominal categories with low cardinality. For high-cardinality categoricals (e.g., city with 10,000 values), prefer target encoding or embedding layers.
17. What is feature engineering and give practical examples.
Feature engineering is the process of using domain knowledge to create, transform, or combine raw features into more informative representations for a model. Good feature engineering often has a bigger impact on performance than choosing a more complex algorithm.
Example (datetime and interaction features):
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'purchase_date': pd.to_datetime(['2024-01-15', '2024-07-20', '2024-12-01']),
    'price': [100, 250, 80],
    'quantity': [2, 1, 5]
})

# Datetime features
df['month'] = df['purchase_date'].dt.month
df['day_of_week'] = df['purchase_date'].dt.dayofweek
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)

# Interaction feature
df['total_value'] = df['price'] * df['quantity']

# Log transform for skewed features
df['log_price'] = np.log1p(df['price'])

print(df[['month', 'is_weekend', 'total_value', 'log_price']])

Common feature engineering techniques include log transformations for skewed data, polynomial features, interaction terms, date decomposition, binning, and ratio features.
18. What is a learning rate and how does it affect training?
The learning rate is a hyperparameter that controls how large a step the optimizer takes when updating model weights during gradient descent. It is one of the most important hyperparameters to tune in any ML or deep learning model.
Example (effect of different learning rates):
import numpy as np

def gradient_descent(lr, n_steps=50):
    w = 10.0  # start far from the optimum (w=0)
    losses = []
    for _ in range(n_steps):
        loss = w ** 2  # f(w) = w^2, minimum at w=0
        grad = 2 * w   # df/dw = 2w
        w -= lr * grad
        losses.append(loss)
    return w, losses[-1]

for lr in [0.001, 0.1, 0.5, 1.1]:
    final_w, final_loss = gradient_descent(lr)
    status = "diverged!" if final_loss > 1e6 else f"w={final_w:.4f}"
    print(f"lr={lr} -> {status}, final_loss={final_loss:.4f}")

Too small a learning rate causes slow convergence; too large causes divergence. Learning rate schedulers (cosine annealing, step decay) and adaptive optimizers (Adam) help manage this.
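As a sketch of how a scheduler helps, step decay can be added to a toy quadratic descent like the one above; the decay factor and interval here are arbitrary choices:

```python
def gradient_descent_with_decay(lr0=0.8, decay=0.5, decay_every=10, n_steps=50):
    """Minimize f(w) = w^2 with a step-decay learning rate schedule."""
    w = 10.0
    lr = lr0
    for step in range(n_steps):
        if step > 0 and step % decay_every == 0:
            lr *= decay  # step decay: halve the rate every decay_every steps
        w -= lr * 2 * w  # gradient of f(w) = w^2 is 2w
    return w

final_w = gradient_descent_with_decay()
print(f"final w = {final_w:.6f}")  # converges close to the optimum w = 0
```

The large initial rate makes fast early progress; shrinking it later avoids the oscillation a fixed large rate would cause.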
19. What is early stopping in model training?
Early stopping is a regularization technique that halts training when the model's performance on a validation set stops improving — preventing the model from continuing to overfit the training data. It is widely used in gradient boosting and neural network training.
Example (early stopping with XGBoost):
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = xgb.XGBClassifier(
    n_estimators=1000,         # high upper bound
    learning_rate=0.05,
    early_stopping_rounds=20,  # stop if no improvement for 20 rounds
    eval_metric='logloss',
    random_state=42
)

model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=False
)

print(f"Best iteration: {model.best_iteration}")
print(f"Validation accuracy: {model.score(X_val, y_val):.3f}")

Early stopping is nearly free regularization — beyond choosing the patience (early_stopping_rounds), it requires no additional tuning and often significantly reduces overfitting in boosting and neural network models.
20. How do you perform hyperparameter tuning using GridSearchCV?
GridSearchCV exhaustively searches over a specified hyperparameter grid, training and evaluating a model for every combination using cross-validation. It automatically selects the best combination based on a scoring metric.
Example:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from scipy.stats import randint

X, y = load_iris(return_X_y=True)

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 3, 5],
    'min_samples_split': [2, 5]
}

# Grid Search (exhaustive)
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X, y)
print(f"Best params: {grid.best_params_}")
print(f"Best CV accuracy: {grid.best_score_:.3f}")

# Random Search (faster alternative for large grids)
param_dist = {'n_estimators': randint(50, 300), 'max_depth': [None, 3, 5, 10]}
rand = RandomizedSearchCV(RandomForestClassifier(random_state=42), param_dist, n_iter=20, cv=5, random_state=42)
rand.fit(X, y)
print(f"RandomizedSearch best: {rand.best_params_}")

Use GridSearchCV for small grids and RandomizedSearchCV for large ones. For even greater efficiency, consider Bayesian optimization with Optuna or Hyperopt.
III. Advanced
1. What is XGBoost and how does it differ from LightGBM?
XGBoost and LightGBM are both high-performance gradient boosting frameworks but differ in how they build trees. XGBoost uses level-wise tree growth (splits all nodes at the same depth), while LightGBM uses leaf-wise growth (splits the leaf with the largest loss reduction), which is more accurate but can overfit on small datasets.
Example (comparison):
import xgboost as xgb
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import time

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for name, model in [
    ("XGBoost", xgb.XGBClassifier(n_estimators=200, learning_rate=0.05, random_state=42, verbosity=0)),
    ("LightGBM", lgb.LGBMClassifier(n_estimators=200, learning_rate=0.05, random_state=42, verbose=-1))
]:
    t0 = time.time()
    model.fit(X_train, y_train)
    t1 = time.time()
    print(f"{name:8s} | Acc={model.score(X_test, y_test):.3f} | Time={t1-t0:.2f}s")
# LightGBM is usually faster on larger datasets

LightGBM is generally faster and more memory-efficient for large datasets. XGBoost has been around longer and has a larger community. CatBoost is a strong choice when you have many categorical features.
2. What is stacking in ensemble learning? How is it implemented?
Stacking (stacked generalization) is an ensemble technique where the predictions of multiple base models (level-0) are used as features for a meta-learner (level-1). The meta-learner learns how to best combine the base model predictions. Out-of-fold predictions are used to avoid data leakage.
Example:
from sklearn.ensemble import StackingClassifier, RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Level-0: diverse base models
base_models = [
    ('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
    ('gbm', GradientBoostingClassifier(n_estimators=50, random_state=42)),
    ('svm', SVC(probability=True, random_state=42))
]

# Level-1: meta-learner
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(),
    cv=5,  # out-of-fold predictions
    stack_method='predict_proba'
)
stack.fit(X_train, y_train)
print(f"Stacking Accuracy: {stack.score(X_test, y_test):.3f}")

Stacking is powerful because diverse base models capture different patterns, and the meta-learner learns to trust each one differently. It often outperforms individual models in competitions.
3. How do you create a custom loss function in scikit-learn or XGBoost?
Custom loss functions allow you to optimize directly for the business metric that matters, rather than a standard surrogate like MSE or log loss. In XGBoost, you provide the gradient (first derivative) and hessian (second derivative) of your loss.
Example (custom asymmetric loss — penalize underestimation more than overestimation):
import xgboost as xgb
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

def asymmetric_loss(y_pred, dtrain):
    """Penalize underestimation 3x more than overestimation."""
    y_true = dtrain.get_label()
    residual = y_pred - y_true
    alpha = 3.0  # weight for underestimation (residual < 0)
    grad = np.where(residual < 0, 2 * alpha * residual, 2 * residual)
    hess = np.where(residual < 0, 2 * alpha, 2.0)
    return grad, hess

dtrain = xgb.DMatrix(X_train, label=y_train)
model_custom = xgb.train({'eta': 0.1}, dtrain, num_boost_round=100, obj=asymmetric_loss)
preds = model_custom.predict(xgb.DMatrix(X_test))
print(f"Custom loss predictions sample: {preds[:5].round(1)}")

Custom losses are powerful when the cost of different types of errors is asymmetric — for example, in finance where underestimating risk is much more costly than overestimating it.
4. What are SHAP values and how do they explain model predictions?
SHAP (SHapley Additive exPlanations) values are a game-theory-based method for explaining individual model predictions. Each feature receives a SHAP value representing its contribution (positive or negative) to shifting the prediction from the model's average output.
Example:
import shap
import pandas as pd
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
feature_names = load_breast_cancer().feature_names
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = xgb.XGBClassifier(random_state=42, verbosity=0).fit(X_train, y_train)

# Compute SHAP values
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Print the top 3 features for the first prediction
shap_df = pd.Series(shap_values[0], index=feature_names).abs().sort_values(ascending=False)
print("Top SHAP contributions for sample 0:")
print(shap_df.head(3))

SHAP provides both global (feature importance across all samples) and local (per-sample) explanations, making it the gold standard for ML interpretability.
5. How do you approach a time series forecasting problem in ML?
Time series forecasting requires special treatment because observations are ordered in time and the standard i.i.d. assumption is violated. You must avoid future data leakage and respect temporal ordering in validation.
Example (lag features + TimeSeriesSplit):
import pandas as pd
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

np.random.seed(42)
dates = pd.date_range('2020-01-01', periods=365, freq='D')
values = np.cumsum(np.random.randn(365)) + 100

df = pd.DataFrame({'date': dates, 'value': values})

# Create lag features (past observations as inputs)
for lag in [1, 7, 14, 30]:
    df[f'lag_{lag}'] = df['value'].shift(lag)
df['rolling_mean_7'] = df['value'].shift(1).rolling(7).mean()
df = df.dropna()

X = df.drop(columns=['date', 'value'])
y = df['value']

# Time-aware cross-validation (no shuffling!)
tscv = TimeSeriesSplit(n_splits=5)
scores = []
for train_idx, val_idx in tscv.split(X):
    X_tr, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_tr, y_val = y.iloc[train_idx], y.iloc[val_idx]
    m = GradientBoostingRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)
    scores.append(mean_absolute_error(y_val, m.predict(X_val)))
print(f"CV MAE: {np.mean(scores):.3f} ± {np.std(scores):.3f}")

Key principles for time series ML: always use TimeSeriesSplit, create lag and rolling window features, and never include future information in your training features.
6. What is a Pipeline in scikit-learn and why is it useful?
A Pipeline chains multiple preprocessing and modeling steps into a single object. It ensures that transformations are applied consistently during training and inference, prevents data leakage in cross-validation, and makes the code cleaner and more reproducible.
Example (full pipeline with preprocessing and model):
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import pandas as pd
import numpy as np

# Synthetic dataset with mixed types
np.random.seed(42)
df = pd.DataFrame({
    'age': np.random.randint(20, 60, 200),
    'salary': np.random.randint(30000, 150000, 200),
    'city': np.random.choice(['NY', 'LA', 'Chicago'], 200),
    'approved': np.random.randint(0, 2, 200)
})

X = df.drop('approved', axis=1)
y = df['approved']

numeric_features = ['age', 'salary']
categorical_features = ['city']

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(drop='first'), categorical_features)
])

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')
print(f"Pipeline CV Accuracy: {scores.mean():.3f} ± {scores.std():.3f}")

Pipelines are essential for production ML systems. They prevent the most common source of data leakage — fitting preprocessors on the full dataset before cross-validation — and make model deployment simpler since the entire pipeline can be serialized as one object.
7. What is target encoding and when do you prefer it over one-hot encoding?
Target encoding replaces a categorical value with the mean of the target variable for that category. It handles high-cardinality categoricals (e.g., zip codes, user IDs) without exploding the feature space as one-hot encoding would.
Example (target encoding with smoothing to prevent overfitting):
import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame({
    'city': np.random.choice(['NY', 'LA', 'Chicago', 'Houston', 'Phoenix'], 500),
    'purchased': np.random.binomial(1, 0.4, 500)
})

# Compute target mean per category (with global mean smoothing)
global_mean = df['purchased'].mean()
smoothing = 10  # controls how much we trust global vs. local mean

stats = df.groupby('city')['purchased'].agg(['mean', 'count'])
stats['encoded'] = (
    stats['mean'] * stats['count'] + global_mean * smoothing
) / (stats['count'] + smoothing)

df['city_encoded'] = df['city'].map(stats['encoded'])
print(stats[['mean', 'count', 'encoded']].round(3))
print("\nEncoded city sample:")
print(df[['city', 'city_encoded']].drop_duplicates())

Always apply smoothing to target encoding to prevent overfitting for categories with few samples. In cross-validation, encode using only the training fold to avoid leakage. The category_encoders library provides a robust implementation.
8. What is the curse of dimensionality and how do you address it?
The curse of dimensionality refers to the exponential increase in data required to maintain statistical significance as the number of features grows. In high-dimensional spaces, data becomes sparse, distances lose meaning, and models overfit more easily.
Example (how KNN breaks down in high dimensions):
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# KNN accuracy degrades as dimensions increase
for n_features in [2, 10, 50, 200, 500]:
    X, y = make_classification(
        n_samples=500, n_features=n_features,
        n_informative=2, n_redundant=0, random_state=42
    )
    knn = KNeighborsClassifier(n_neighbors=5)
    score = cross_val_score(knn, X, y, cv=5).mean()
    print(f"Dims={n_features:4d} | KNN accuracy={score:.3f}")
# Accuracy drops significantly as irrelevant dimensions grow

Address the curse of dimensionality with feature selection, PCA/dimensionality reduction, regularization, or collecting more data. Distance-based algorithms (KNN, SVM with RBF) suffer most.
9. What are Gaussian Mixture Models (GMM) and how do they differ from K-Means?
A Gaussian Mixture Model is a probabilistic clustering model that assumes data is generated from a mixture of several Gaussian distributions with unknown parameters. Unlike K-Means, GMM uses soft assignments — each point has a probability of belonging to each cluster — and can model elliptical cluster shapes.
Example:
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, true_labels = make_blobs(n_samples=300, centers=3, cluster_std=[1.0, 2.5, 0.5], random_state=42)

gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=42)
gmm.fit(X)

labels = gmm.predict(X)
probs = gmm.predict_proba(X)  # soft assignments

print(f"Cluster means:\n{gmm.means_.round(2)}")
print(f"\nSample probabilities (first 3 points):\n{probs[:3].round(3)}")
print(f"BIC (lower=better): {gmm.bic(X):.1f}")

GMMs are more flexible than K-Means but slower. Use BIC or AIC to select the number of components. GMMs also serve as a density estimation tool for anomaly detection.
10. What is DBSCAN clustering and what are its advantages?
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups points that are closely packed together and marks points in low-density regions as outliers. It does not require specifying the number of clusters in advance and can find arbitrarily shaped clusters.
Example:
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Non-spherical clusters: make_moons
X, y = make_moons(n_samples=300, noise=0.05, random_state=42)
X = StandardScaler().fit_transform(X)

db = DBSCAN(eps=0.3, min_samples=5)
labels = db.fit_predict(X)

print(f"Clusters found: {len(set(labels)) - (1 if -1 in labels else 0)}")
print(f"Noise points: {(labels == -1).sum()}")
print(f"Unique labels: {sorted(set(labels))}")  # -1 = noise

DBSCAN excels at detecting arbitrarily shaped clusters and identifying outliers automatically. The two hyperparameters eps (neighborhood radius) and min_samples (minimum density) must be chosen carefully, often using a k-distance graph.
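A sketch of the k-distance heuristic for choosing eps: sort each point's distance to its k-th nearest neighbor and look for the elbow in the curve. The largest-jump elbow detection below is a crude illustration, not a standard API — in practice the curve is usually inspected visually.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
X = StandardScaler().fit_transform(X)

k = 5  # match DBSCAN's min_samples
nn = NearestNeighbors(n_neighbors=k).fit(X)
distances, _ = nn.kneighbors(X)  # column 0 is the point itself (distance 0)

# Sorted distance to the k-th neighbor across all points
k_dist = np.sort(distances[:, -1])

# Crude elbow: index of the largest jump in the sorted curve
elbow_idx = np.argmax(np.diff(k_dist))
print(f"Suggested eps near: {k_dist[elbow_idx]:.3f}")
```

Points left of the elbow lie in dense regions; the distance at the elbow is a reasonable starting eps.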
11. What is isotonic regression and when is it used?
Isotonic regression fits a non-decreasing (monotonic) step function to data. It is commonly used for probability calibration — converting raw model scores into well-calibrated probabilities — and for scenarios where you know the relationship between input and output must be monotonic.
Example (calibrating classifier probabilities):
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import numpy as np

X, y = make_classification(n_samples=2000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Uncalibrated model
gbm = GradientBoostingClassifier(n_estimators=50, random_state=42).fit(X_train, y_train)

# Calibrate using isotonic regression
cal_iso = CalibratedClassifierCV(GradientBoostingClassifier(n_estimators=50, random_state=42),
                                 method='isotonic', cv=5)
cal_iso.fit(X_train, y_train)

# Compare calibration
raw_proba = gbm.predict_proba(X_test)[:, 1]
cal_proba = cal_iso.predict_proba(X_test)[:, 1]

frac_pos_raw, mean_pred_raw = calibration_curve(y_test, raw_proba, n_bins=10)
frac_pos_cal, mean_pred_cal = calibration_curve(y_test, cal_proba, n_bins=10)

print(f"Raw calibration error: {np.mean(np.abs(frac_pos_raw - mean_pred_raw)):.4f}")
print(f"Isotonic calibration error: {np.mean(np.abs(frac_pos_cal - mean_pred_cal)):.4f}")

Isotonic regression is better than Platt scaling (sigmoid calibration) when you have enough data and need a flexible calibration. Use it when predicted probabilities are used for business decisions or downstream ranking.
12. What is Bayesian optimization for hyperparameter tuning?
Bayesian optimization builds a probabilistic surrogate model (usually a Gaussian Process or Tree-structured Parzen Estimator) of the objective function (e.g., validation accuracy) and uses it to intelligently select the next hyperparameter configuration to evaluate, focusing on promising regions.
Example (Optuna):
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
optuna.logging.set_verbosity(optuna.logging.WARNING)

def objective(trial):
    n_estimators = trial.suggest_int('n_estimators', 50, 300)
    max_depth = trial.suggest_int('max_depth', 2, 10)
    min_samples = trial.suggest_int('min_samples_split', 2, 10)
    model = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        min_samples_split=min_samples,
        random_state=42
    )
    return cross_val_score(model, X, y, cv=5, scoring='accuracy').mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)

print(f"Best params: {study.best_params}")
print(f"Best CV accuracy: {study.best_value:.3f}")

Bayesian optimization requires far fewer evaluations than Grid or Random Search to find good hyperparameters, making it ideal when each model evaluation is expensive.
13. What is classifier calibration and how do you calibrate probabilities?
A well-calibrated classifier produces predicted probabilities that match the true frequencies of outcomes. For example, if a model predicts 80% probability for 100 samples, about 80 of them should actually be positive. Many models (like SVMs and Gradient Boosting) are not natively well-calibrated.
Example (Platt scaling vs. isotonic calibration):
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import brier_score_loss

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# SVM doesn't output probabilities natively
svm_sigmoid = CalibratedClassifierCV(SVC(), method='sigmoid', cv=5)   # Platt scaling
svm_isotonic = CalibratedClassifierCV(SVC(), method='isotonic', cv=5)  # Isotonic

for name, model in [("Sigmoid (Platt)", svm_sigmoid), ("Isotonic", svm_isotonic)]:
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    brier = brier_score_loss(y_test, proba)  # lower is better
    print(f"{name:20s} | Brier Score={brier:.4f}")

Use Platt scaling for small datasets and isotonic regression for larger ones. Calibrated probabilities are essential in applications where the probability itself drives business decisions, like credit risk or medical screening.
14. What is concept drift and how do you detect and handle it?
Concept drift occurs when the statistical relationship between input features and the target variable changes over time, causing a model's performance to degrade in production. It is one of the most common reasons deployed ML models fail silently.
Example (simulating and detecting drift with the Kolmogorov-Smirnov test):

import numpy as np
from scipy.stats import ks_2samp

np.random.seed(42)

# Simulate production feature distribution shifting over time
train_feature = np.random.normal(loc=0, scale=1, size=1000)
production_no_drift = np.random.normal(loc=0, scale=1, size=500)
production_with_drift = np.random.normal(loc=2, scale=1.5, size=500)  # shifted mean

# Kolmogorov-Smirnov test to detect distribution shift
stat_ok, p_ok = ks_2samp(train_feature, production_no_drift)
stat_drift, p_drift = ks_2samp(train_feature, production_with_drift)

print(f"No drift   - KS stat={stat_ok:.3f}, p={p_ok:.3f} -> {'DRIFT' if p_ok < 0.05 else 'OK'}")
print(f"With drift - KS stat={stat_drift:.3f}, p={p_drift:.3f} -> {'DRIFT' if p_drift < 0.05 else 'OK'}")

Handle concept drift by monitoring feature distributions (KS test, PSI), tracking prediction score distributions, setting up automated retraining pipelines, and using online or sliding-window training approaches.
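The PSI (Population Stability Index) mentioned above compares binned feature distributions between training and production. A minimal hand-rolled sketch — the 10-bin choice and the common 0.2 alert threshold are industry conventions, not library defaults:

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10):
    """PSI between a reference (training) sample and a new (production) sample."""
    # Bin edges from the reference distribution's quantiles
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch values outside the training range
    exp_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    act_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Clip to avoid division by zero and log(0)
    exp_pct = np.clip(exp_pct, 1e-6, None)
    act_pct = np.clip(act_pct, 1e-6, None)
    return np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct))

np.random.seed(42)
train = np.random.normal(0, 1, 1000)

psi_same = population_stability_index(train, np.random.normal(0, 1, 500))
psi_drift = population_stability_index(train, np.random.normal(2, 1.5, 500))
print(f"PSI (no drift):   {psi_same:.3f}")
print(f"PSI (with drift): {psi_drift:.3f}")
```

A common rule of thumb: PSI below 0.1 is stable, 0.1 to 0.2 warrants watching, and above 0.2 signals significant drift.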
15. What is multi-label classification and how is it implemented?
Multi-label classification is a task where each sample can belong to multiple classes simultaneously. For example, a news article can be tagged as both 'sports' and 'politics'. This is different from multi-class classification where each sample belongs to exactly one class.
Example:
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import hamming_loss, f1_score

X, y = make_multilabel_classification(n_samples=1000, n_features=20, n_classes=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Binary relevance: one independent binary classifier per label
model = MultiOutputClassifier(RandomForestClassifier(n_estimators=100, random_state=42))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(f"Hamming Loss: {hamming_loss(y_test, y_pred):.4f}")  # fraction of wrong labels
print(f"F1 (micro): {f1_score(y_test, y_pred, average='micro'):.3f}")
print(f"F1 (samples): {f1_score(y_test, y_pred, average='samples'):.3f}")
Use Hamming Loss and sample-averaged F1 for multi-label evaluation. Other approaches include classifier chains and label powerset, which capture label correlations that independent per-label classifiers ignore.
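The classifier-chain approach mentioned above is available directly in scikit-learn. A minimal sketch on the same kind of synthetic data (logistic regression as the base estimator is an illustrative choice):

```python
from sklearn.multioutput import ClassifierChain
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

X, y = make_multilabel_classification(n_samples=1000, n_features=20, n_classes=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Each classifier in the chain sees the original features plus the predictions
# for all earlier labels, so label correlations are modeled explicitly
chain = ClassifierChain(LogisticRegression(max_iter=1000), order='random', random_state=42)
chain.fit(X_train, y_train)
y_pred = chain.predict(X_test)

print(f"Classifier chain F1 (micro): {f1_score(y_test, y_pred, average='micro'):.3f}")
```

Because the chain order matters, ensembling several chains with different random orders is a common refinement.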
16. What are common techniques for anomaly detection in ML?
Anomaly detection identifies data points that deviate significantly from normal behavior. It is widely used for fraud detection, network intrusion detection, and manufacturing quality control.
Example (Isolation Forest and Local Outlier Factor):
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.datasets import make_blobs
import numpy as np

np.random.seed(42)
X_normal, _ = make_blobs(n_samples=300, centers=1, cluster_std=0.5, random_state=42)
X_outliers = np.random.uniform(low=-10, high=10, size=(20, 2))
X = np.vstack([X_normal, X_outliers])

# Isolation Forest: anomaly score based on isolation depth
iforest = IsolationForest(contamination=0.06, random_state=42)
labels_if = iforest.fit_predict(X)  # -1 = anomaly, 1 = normal

# Local Outlier Factor: density-based
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.06)
labels_lof = lof.fit_predict(X)

print(f"Isolation Forest anomalies: {(labels_if == -1).sum()}")
print(f"LOF anomalies: {(labels_lof == -1).sum()}")
Isolation Forest is fast and works well for high-dimensional data. LOF is better for datasets with clusters of different densities. One-class SVM and autoencoders are alternatives for complex anomaly patterns.
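The One-Class SVM alternative can be sketched on similar synthetic data. Note that, unlike the two detectors above, it is typically trained on normal data only; the `nu` value here is an illustrative choice that bounds the fraction of training points treated as outliers:

```python
from sklearn.svm import OneClassSVM
from sklearn.datasets import make_blobs
import numpy as np

np.random.seed(42)
X_normal, _ = make_blobs(n_samples=300, centers=1, cluster_std=0.5, random_state=42)
X_outliers = np.random.uniform(low=-10, high=10, size=(20, 2))

# Fit a boundary around the normal data only, then flag points outside it
ocsvm = OneClassSVM(kernel='rbf', nu=0.05, gamma='scale')
ocsvm.fit(X_normal)

print("Outliers flagged:", (ocsvm.predict(X_outliers) == -1).sum(), "/ 20")
print("Normals flagged: ", (ocsvm.predict(X_normal) == -1).sum(), "/ 300")
```

Training on clean normal data makes One-Class SVM a natural fit for novelty detection, where anomalies are absent from the training set by design.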
17. What is semi-supervised learning and when is it used?
Semi-supervised learning combines a small amount of labeled data with a large amount of unlabeled data during training. It is used when labeling data is expensive or time-consuming but unlabeled data is abundant — for example, medical image annotation or sentiment analysis.
Example (Label Spreading):
from sklearn.semi_supervised import LabelSpreading
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Simulate scarce labels: only 10% of training labels are known
y_semi = y_train.copy()
rng = np.random.RandomState(42)
mask = rng.rand(len(y_train)) > 0.10  # 90% unlabeled
y_semi[mask] = -1  # -1 means unlabeled in sklearn's semi-supervised API

label_spread = LabelSpreading(kernel='rbf', alpha=0.2)
label_spread.fit(X_train, y_semi)
y_pred = label_spread.predict(X_test)

print(f"Labeled samples used: {(y_semi != -1).sum()} / {len(y_semi)}")
print(f"Label Spreading Accuracy: {accuracy_score(y_test, y_pred):.3f}")
Semi-supervised learning is powerful in domains like NLP and computer vision, where pre-training on large unlabeled corpora followed by fine-tuning on labeled data (as in BERT) is the dominant paradigm.
18. What is federated learning and how does it preserve privacy?
Federated learning is a distributed ML approach where model training happens locally on each device or data silo, and only model updates (gradients), not raw data, are shared with a central server. This allows learning from sensitive data without centralizing it.
Each client trains locally and sends model updates to the server.
The server aggregates updates using FedAvg (weighted averaging of the clients' model parameters).
Differential privacy adds noise to gradients before sharing to further protect individual records.
Example (simulating FedAvg locally):
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, random_state=42)

# Split data across 3 simulated clients
client_data = [(X[i::3], y[i::3]) for i in range(3)]

def train_local(X_local, y_local, global_coef, global_intercept):
    # Warm-start each client from the current global parameters
    model = SGDClassifier(loss='log_loss', max_iter=5, random_state=42)
    model.coef_ = global_coef.copy()
    model.intercept_ = global_intercept.copy()
    model.classes_ = np.array([0, 1])
    model.partial_fit(X_local, y_local, classes=[0, 1])
    return model.coef_, model.intercept_

# Initialize global model
global_model = SGDClassifier(loss='log_loss', random_state=42)
global_model.fit(X, y)  # just to set coef_ shape

# One round of FedAvg
client_coefs, client_intercepts = [], []
for Xc, yc in client_data:
    c, i = train_local(Xc, yc, global_model.coef_, global_model.intercept_)
    client_coefs.append(c)
    client_intercepts.append(i)

global_model.coef_ = np.mean(client_coefs, axis=0)
global_model.intercept_ = np.mean(client_intercepts, axis=0)
print(f"FedAvg round accuracy: {global_model.score(X, y):.3f}")
Federated learning is used by Google (Gboard keyboard prediction) and healthcare institutions to train models on sensitive patient data without violating privacy regulations like GDPR and HIPAA.
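The differential-privacy step from the bullet list above (clip each client update, then add calibrated noise before sharing) can be sketched as follows. The `clip_norm` and `noise_std` values are illustrative; a real deployment calibrates the noise to a formal (epsilon, delta) privacy budget:

```python
import numpy as np

def privatize_update(update, clip_norm=1.0, noise_std=0.1, rng=None):
    """Clip an update vector to a bounded L2 norm, then add Gaussian noise.
    This is the core recipe of DP-SGD-style federated updates (a sketch)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / (norm + 1e-12))  # bound each client's influence
    return clipped + rng.normal(0, noise_std, size=update.shape)

rng = np.random.default_rng(42)
raw_update = rng.normal(0, 5, size=10)  # a client's raw model update
private_update = privatize_update(raw_update, clip_norm=1.0, noise_std=0.1, rng=rng)

print(f"Raw update norm:     {np.linalg.norm(raw_update):.2f}")
print(f"Private update norm: {np.linalg.norm(private_update):.2f}")  # close to clip_norm
```

Clipping bounds any single client's influence on the aggregate, and the noise masks individual contributions, which is what gives the formal privacy guarantee.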
19. What are counterfactual explanations in interpretable ML?
Counterfactual explanations answer the question: 'What is the minimum change to the input features that would flip the model's prediction?' They are highly actionable because they tell users what to change, not just why a decision was made.
Example (manual counterfactual search for a loan denial):
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import numpy as np

X, y = make_classification(n_samples=500, n_features=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)

# Applicant denied (prediction=0); find the minimal change to get approved (prediction=1)
instance = X_test[0].copy()
print(f"Original prediction: {model.predict(instance.reshape(1, -1))[0]}")

# Simple greedy counterfactual: perturb feature 0 until the prediction flips
for delta in np.arange(0, 5, 0.1):
    modified = instance.copy()
    modified[0] += delta
    if model.predict(modified.reshape(1, -1))[0] == 1:
        print(f"Counterfactual found: increase feature 0 by {delta:.2f}")
        print(f"New prediction: {model.predict(modified.reshape(1, -1))[0]}")
        break
Counterfactual explanations are essential for algorithmic fairness, regulatory compliance (GDPR Article 22 right to explanation), and user-facing explanations in credit, healthcare, and insurance decisions.
20. How do you design a robust ML pipeline for production?
A production ML pipeline must be reproducible, scalable, monitored, and easily retrainable. It goes far beyond model training and includes data validation, feature engineering, model serving, and monitoring infrastructure.
Data ingestion & validation: Check schema, detect missing values and outliers (Great Expectations, Pandera).
Feature engineering pipeline: Use sklearn Pipelines or feature stores (Feast, Tecton) for consistent transformations.
Experiment tracking: Log all runs with MLflow, Weights & Biases, or Neptune.
Model serving: Deploy as REST API with FastAPI, BentoML, or Triton Inference Server.
Monitoring: Track prediction distribution, data drift, and business KPIs; trigger retraining on degradation.
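The feature-engineering bullet above can be sketched with a scikit-learn Pipeline. The column names here are hypothetical stand-ins for raw production data; the point is that one serialized object carries both preprocessing and model, so training and serving apply identical transformations:

```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np

# Toy frame standing in for raw production data (hypothetical columns)
df = pd.DataFrame({
    "income": [50000, 62000, np.nan, 48000, 91000, 75000],
    "age": [25, 34, 41, np.nan, 52, 29],
    "segment": ["a", "b", "a", "c", "b", "a"],
})
y = np.array([0, 1, 0, 0, 1, 1])

numeric = Pipeline([("impute", SimpleImputer(strategy="median")),
                    ("scale", StandardScaler())])
categorical = Pipeline([("onehot", OneHotEncoder(handle_unknown="ignore"))])

preprocess = ColumnTransformer([
    ("num", numeric, ["income", "age"]),
    ("cat", categorical, ["segment"]),
])

# One object to fit, serialize (joblib.dump), and serve, avoiding train/serve skew
clf = Pipeline([("prep", preprocess), ("model", RandomForestClassifier(random_state=42))])
clf.fit(df, y)
print(clf.predict(df))
```

Serializing this single pipeline as `model.pkl` is exactly what makes the serving example below safe to run on raw feature vectors.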
Example (model serving with FastAPI):
# app.py — serve a trained sklearn model as a REST API
from fastapi import FastAPI
from pydantic import BaseModel
import joblib
import numpy as np

app = FastAPI()
model = joblib.load("model.pkl")  # pre-trained sklearn pipeline

class PredictRequest(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(request: PredictRequest):
    X = np.array(request.features).reshape(1, -1)
    prediction = model.predict(X)[0]
    probability = model.predict_proba(X)[0].tolist()
    return {"prediction": int(prediction), "probability": probability}

# Run with: uvicorn app:app --reload
A robust production pipeline treats ML as software engineering, with versioning, testing, CI/CD, and observability built in from the start.
21. What is online learning (incremental learning) in machine learning?
Online learning (incremental learning) is a training paradigm where a model is updated continuously as new data arrives, one sample or mini-batch at a time, without retraining from scratch. It is essential when data streams continuously or when the full dataset is too large to fit in memory.
Example (incremental learning with partial_fit):
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
import numpy as np

np.random.seed(42)
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)

model = SGDClassifier(loss='log_loss', random_state=42)

# Simulate streaming: update the model with each mini-batch of 100 samples
batch_size = 100
for start in range(0, 4000, batch_size):
    X_batch = X[start:start + batch_size]
    y_batch = y[start:start + batch_size]
    model.partial_fit(X_batch, y_batch, classes=[0, 1])  # incremental update

# Evaluate on a held-out test set
y_pred = model.predict(X[4000:])
print(f"Online learning accuracy: {accuracy_score(y[4000:], y_pred):.3f}")
Online learning algorithms (SGD-based models, the River library) are essential for real-time recommendation systems, fraud detection, and any application where the data distribution changes continuously.
22. What is matrix factorization and how is it used in recommendation systems?
Matrix factorization decomposes a user-item interaction matrix (e.g., movie ratings) into two lower-rank matrices — user embeddings and item embeddings — whose dot product approximates the original matrix. The model can then predict unseen user-item interactions.
Example (matrix factorization with SGD from scratch):
import numpy as np

np.random.seed(42)
# User-item rating matrix (0 = unrated)
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 4, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
])

n_users, n_items = R.shape
n_factors = 2
lr = 0.01
lambda_reg = 0.1

P = np.random.rand(n_users, n_factors)  # user embeddings
Q = np.random.rand(n_items, n_factors)  # item embeddings

for epoch in range(5000):
    for u in range(n_users):
        for i in range(n_items):
            if R[u, i] > 0:
                e = R[u, i] - P[u] @ Q[i]
                P[u] += lr * (e * Q[i] - lambda_reg * P[u])
                Q[i] += lr * (e * P[u] - lambda_reg * Q[i])

R_pred = P @ Q.T
print("Reconstructed matrix (rounded):")
print(R_pred.round(1))
Matrix factorization is the foundation of collaborative filtering (used by Netflix, Spotify, and Amazon). Modern implementations use Alternating Least Squares (ALS) or neural collaborative filtering for scale.
23. What is conformal prediction and how does it provide uncertainty estimates?
Conformal prediction is a framework for producing statistically valid prediction intervals or sets with guaranteed coverage. Unlike standard ML models that produce point predictions, conformal predictors output a prediction set and guarantee that the true label is included with a user-specified probability (e.g., 95%).
Example (prediction intervals with MAPIE):
from mapie.regression import MapieRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
import numpy as np

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

base_model = GradientBoostingRegressor(n_estimators=100, random_state=42)
mapie = MapieRegressor(base_model, method='plus', cv=5)
mapie.fit(X_train, y_train)

# 90% prediction intervals
y_pred, y_intervals = mapie.predict(X_test, alpha=0.10)

coverage = np.mean((y_test >= y_intervals[:, 0, 0]) & (y_test <= y_intervals[:, 1, 0]))
print(f"Coverage (target=0.90): {coverage:.3f}")
print(f"Average interval width: {np.mean(y_intervals[:, 1, 0] - y_intervals[:, 0, 0]):.1f}")
Conformal prediction is model-agnostic and distribution-free, providing rigorous uncertainty quantification without assumptions about the data-generating process, which makes it ideal for safety-critical ML applications.
24. What advanced techniques exist for extreme class imbalance beyond SMOTE?
When class imbalance is extreme (e.g., 1 positive per 10,000 negatives), standard approaches like SMOTE are insufficient. Advanced techniques are needed to handle such scenarios in fraud detection, rare disease diagnosis, and anomaly detection.
Example (ADASYN + threshold optimization):
from imblearn.over_sampling import ADASYN
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, precision_recall_curve
import numpy as np

X, y = make_classification(n_samples=5000, weights=[0.99, 0.01], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# ADASYN: adaptive synthetic sampling (focuses on harder minority examples)
adasyn = ADASYN(random_state=42)
X_res, y_res = adasyn.fit_resample(X_train, y_train)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_res, y_res)
y_proba = model.predict_proba(X_test)[:, 1]

# Optimize the decision threshold for best F1; drop the last precision/recall
# point, which has no corresponding threshold
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
f1_scores = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-9)
best_thresh = thresholds[np.argmax(f1_scores)]
y_pred_opt = (y_proba >= best_thresh).astype(int)

print(f"Default threshold F1: {f1_score(y_test, model.predict(X_test)):.4f}")
print(f"Optimal threshold ({best_thresh:.3f}) F1: {f1_score(y_test, y_pred_opt):.4f}")
For extreme imbalance, combine ADASYN or cost-sensitive learning with threshold optimization, use anomaly detection approaches (Isolation Forest), and always evaluate with precision-recall AUC rather than accuracy or standard ROC-AUC.
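The cost-sensitive alternative mentioned above can be as simple as scikit-learn's `class_weight='balanced'`, which reweights the loss by inverse class frequency instead of resampling. A minimal sketch comparing it against an unweighted baseline with PR-AUC:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import average_precision_score

X, y = make_classification(n_samples=5000, weights=[0.99, 0.01], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Unweighted baseline
plain = RandomForestClassifier(n_estimators=100, random_state=42)
plain.fit(X_train, y_train)

# Cost-sensitive: minority-class errors are penalized more heavily
weighted = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42)
weighted.fit(X_train, y_train)

for name, m in [("plain", plain), ("balanced", weighted)]:
    ap = average_precision_score(y_test, m.predict_proba(X_test)[:, 1])
    print(f"{name:>8} PR-AUC: {ap:.3f}")
```

Cost-sensitive learning avoids the synthetic-sample artifacts that oversamplers can introduce, and it composes naturally with the threshold optimization shown earlier.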
25. How do you monitor and maintain ML models in production?
Model monitoring is the practice of continuously tracking a deployed ML model's health, performance, and data quality after it has been released to production. Without monitoring, model degradation goes unnoticed and can cause significant business damage.
Model performance metrics: Track accuracy, F1, or custom business KPIs on labeled production data (when labels are available).
Data drift detection: Monitor input feature distributions using KS test, PSI, or Wasserstein distance.
Prediction drift: Track shifts in the distribution of model output scores or class probabilities.
Infrastructure metrics: Latency, throughput, memory usage, and error rates.
Example (lightweight drift monitor):
import numpy as np
from scipy.stats import ks_2samp

class SimpleDriftMonitor:
    def __init__(self, reference_data: np.ndarray, threshold: float = 0.05):
        self.reference = reference_data
        self.threshold = threshold

    def check(self, new_data: np.ndarray, feature_names=None) -> dict:
        alerts = {}
        for i in range(new_data.shape[1]):
            stat, p_value = ks_2samp(self.reference[:, i], new_data[:, i])
            fname = feature_names[i] if feature_names else f"feature_{i}"
            if p_value < self.threshold:
                alerts[fname] = {"ks_stat": round(stat, 4), "p_value": round(p_value, 4)}
        return alerts

# Simulate reference vs. production data
np.random.seed(42)
reference = np.random.randn(1000, 5)
no_drift = np.random.randn(200, 5)
with_drift = np.random.randn(200, 5)
with_drift[:, 2] += 3  # feature 2 has shifted

monitor = SimpleDriftMonitor(reference)
print("No drift alerts: ", monitor.check(no_drift))
print("With drift alerts:", monitor.check(with_drift))
Set up automated alerts that trigger retraining pipelines when drift is detected. Tools like Evidently AI, WhyLogs, Arize, and NannyML provide production-grade ML monitoring with dashboards and alerting out of the box.