Index

Sr. No. | Practicals | Date | Pg. No. | Sign.

1. Data Pre-processing and Exploration
   a. Load a CSV dataset. Handle missing values, inconsistent formatting, and outliers.
   b. Load a dataset, calculate descriptive summary statistics, create visualizations using different graphs, and identify potential features and target variables. Note: Explore Univariate and Bivariate graphs (Matplotlib) and Seaborn for visualization.
   c. Create or explore datasets to use all pre-processing routines like label encoding, scaling, and binarization.
2. Testing Hypothesis
   a. Implement and demonstrate the FIND-S algorithm for finding the most specific hypothesis based on a given set of training data samples. Read the training data from a .CSV file and generate the final specific hypothesis. (Create your dataset)
3. Linear Models
   a. Simple Linear Regression: Fit a linear regression model on a dataset. Interpret coefficients, make predictions, and evaluate performance using metrics like R-squared and MSE.
   b. Multiple Linear Regression: Extend linear regression to multiple features. Handle feature selection and potential multicollinearity.
   c. Regularized Linear Models (Ridge, Lasso, ElasticNet): Implement regression variants like LASSO and Ridge on any generated dataset.
4. Discriminative Models
   a. Logistic Regression: Perform binary classification using logistic regression. Calculate accuracy, precision, recall, and understand the ROC curve.
   b. Implement and demonstrate the k-nearest Neighbor algorithm. Read the training data from a .CSV file and build the model to classify a test sample. Print both correct and wrong predictions.
   c. Build a decision tree classifier or regressor. Control hyperparameters like tree depth to avoid overfitting. Visualize the tree.
   d. Implement a Support Vector Machine for any relevant dataset.
   e. Train a random forest ensemble. Experiment with the number of trees and feature sampling. Compare performance to a single decision tree.
   f. Implement a gradient boosting machine (e.g., XGBoost). Tune hyperparameters and explore feature importance.
5. Generative Models
   a. Implement and demonstrate the working of a Naive Bayesian classifier using a sample dataset. Build the model to classify a test sample.
   b. Implement Hidden Markov Models using hmmlearn.
6. Probabilistic Models
   a. Implement Bayesian Linear Regression to explore prior and posterior distributions.
   b. Implement Gaussian Mixture Models for density estimation and unsupervised clustering.
7. Model Evaluation and Hyperparameter Tuning
   a. Implement cross-validation techniques (k-fold, stratified, etc.) for robust model evaluation.
   b. Systematically explore combinations of hyperparameters to optimize model performance (use grid and randomized search).
8. Bayesian Learning
   a. Implement Bayesian Learning using inferences.
9. Deep Generative Models
   a. Set up a generator network to produce samples and a discriminator network to distinguish between real and generated data. (Use a simple small dataset)
10. Develop an API to deploy your model and perform predictions.

PRACTICAL NO: 1
Data Pre-processing and Exploration

a. Load a CSV dataset. Handle missing values, inconsistent formatting, and outliers.
Code:

# Step 1: Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Step 2: Load dataset (Titanic dataset from seaborn)
df = sns.load_dataset("titanic")

# Display first 5 rows
print("First 5 rows of dataset:")
print(df.head())

# Step 3: Handle Missing Values
print("\nMissing values before cleaning:")
print(df.isnull().sum())

# Fill missing 'age' with median
df['age'] = df['age'].fillna(df['age'].median())

# Fill missing 'embarked' with mode (most common value)
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])

# Drop column with too many missing values (deck)
df = df.drop(columns=['deck'])

print("\nMissing values after cleaning:")
print(df.isnull().sum())

# Step 4: Inconsistent Formatting Example
df['sex'] = df['sex'].str.lower()  # make all lowercase

# Step 5: Handle Outliers using IQR method for "fare"
Q1 = df['fare'].quantile(0.25)
Q3 = df['fare'].quantile(0.75)
IQR = Q3 - Q1

# keep only rows within 1.5*IQR
df = df[(df['fare'] >= Q1 - 1.5*IQR) & (df['fare'] <= Q3 + 1.5*IQR)]

print("\nData after removing outliers in fare:")
print(df['fare'].describe())

Output:

b. Load a dataset, calculate descriptive summary statistics, create visualizations using different graphs, and identify potential features and target variables.
Note: Explore Univariate and Bivariate graphs (Matplotlib) and Seaborn for visualization.

Code:

# Step 1: Descriptive Statistics
print("\nSummary Statistics:")
print(df.describe(include="all"))

# Step 2: Univariate Visualization
plt.figure(figsize=(12,5))

plt.subplot(1,2,1)
sns.histplot(df['age'], bins=20, kde=True, color="blue")
plt.title("Age Distribution")

plt.subplot(1,2,2)
sns.countplot(x="sex", data=df, palette="Set2")
plt.title("Gender Count")

plt.show()

# Step 3: Bivariate Visualization
plt.figure(figsize=(12,5))

plt.subplot(1,2,1)
sns.boxplot(x="sex", y="age", data=df)
plt.title("Age vs Gender")

plt.subplot(1,2,2)
sns.barplot(x="class", y="fare", data=df, estimator=np.mean, ci=None)
plt.title("Average Fare by Class")

plt.show()

# Step 4: Correlation Heatmap
plt.figure(figsize=(8,5))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.title("Correlation Heatmap")
plt.show()

# Step 5: Identify Features & Target
# Suppose we want to predict "survived"
features = df.drop(columns=['survived'])
target = df['survived']

print("\nFeatures shape:", features.shape)
print("Target shape:", target.shape)

Output:
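The plots above use Seaborn throughout; the note also asks for plain Matplotlib graphs. A minimal sketch of one univariate and one bivariate Matplotlib plot, assuming the same cleaned df from part (a) is still in memory (the column choices here are only illustrative):

# Sketch (not part of the original listing): pure-Matplotlib univariate and bivariate plots
plt.hist(df['fare'], bins=20, color="green", edgecolor="black")   # univariate: fare histogram
plt.title("Fare Distribution (Matplotlib)")
plt.xlabel("Fare")
plt.ylabel("Count")
plt.show()

plt.scatter(df['age'], df['fare'], alpha=0.5)                     # bivariate: age vs fare
plt.title("Age vs Fare (Matplotlib)")
plt.xlabel("Age")
plt.ylabel("Fare")
plt.show()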
c. Create or explore datasets to use all pre-processing routines like label encoding, scaling, and binarization.

Code:

from sklearn.preprocessing import LabelEncoder, StandardScaler, Binarizer

# Step 1: Label Encoding (convert categorical -> numbers)
df_encoded = df.copy()
le = LabelEncoder()
df_encoded['sex'] = le.fit_transform(df_encoded['sex'])
df_encoded['embarked'] = le.fit_transform(df_encoded['embarked'])
df_encoded['class'] = le.fit_transform(df_encoded['class'])

print("\nAfter Label Encoding:")
print(df_encoded[['sex','embarked','class']].head())

# Step 2: Scaling numerical features
scaler = StandardScaler()
df_encoded[['age','fare']] = scaler.fit_transform(df_encoded[['age','fare']])

print("\nAfter Scaling (age, fare):")
print(df_encoded[['age','fare']].head())

# Step 3: Binarization (convert numerical to 0/1)
# Note: 'age' was standardized above, so threshold=0 flags ages above the mean, not adults in years
binarizer = Binarizer(threshold=0)
df_encoded['is_adult'] = binarizer.fit_transform(df_encoded[['age']])

print("\nAfter Binarization (is_adult):")
print(df_encoded[['age','is_adult']].head())

Output:

PRACTICAL NO: 2
Testing Hypothesis

a. Implement and demonstrate the FIND-S algorithm for finding the most specific hypothesis based on a given set of training data samples. Read the training data from a .CSV file and generate the final specific hypothesis. (Create your dataset)

Code:

import pandas as pd

# -----------------------------
# 1. CREATE AND SAVE DATASET
# -----------------------------
data = {
    "Sky": ["Sunny", "Sunny", "Rainy", "Sunny"],
    "AirTemp": ["Warm", "Warm", "Cold", "Warm"],
    "Humidity": ["Normal", "High", "High", "High"],
    "Wind": ["Strong", "Strong", "Strong", "Strong"],
    "Water": ["Warm", "Warm", "Warm", "Cool"],
    "Forecast": ["Same", "Same", "Change", "Change"],
    "EnjoySport": ["Yes", "Yes", "No", "Yes"]
}
df = pd.DataFrame(data)
df.to_csv("training_data.csv", index=False)

# -----------------------------
# 2. LOAD CSV FILE
# -----------------------------
data = pd.read_csv("training_data.csv")
print("Training Dataset:\n")
print(data, "\n")

# -----------------------------
# 3. FIND-S ALGORITHM
# -----------------------------
def find_s_algorithm(df):
    # Filter positive examples
    positive_examples = df[df['EnjoySport'] == 'Yes'].drop('EnjoySport', axis=1).values

    # Initialize hypothesis with first positive sample
    hypothesis = positive_examples[0].copy()

    # Compare and generalize
    for sample in positive_examples[1:]:
        for i in range(len(hypothesis)):
            if sample[i] != hypothesis[i]:
                hypothesis[i] = "?"

    return hypothesis

# Run algorithm
final_hypothesis = find_s_algorithm(data)

# -----------------------------
# 4. DISPLAY FINAL HYPOTHESIS
# -----------------------------
print("Final Most Specific Hypothesis:")
print(final_hypothesis)

Output:

PRACTICAL NO: 3
Linear Models

a. Simple Linear Regression: Fit a linear regression model on a dataset. Interpret coefficients, make predictions, and evaluate performance using metrics like R-squared and MSE.

Code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# -----------------------------
# 1. CREATE SAMPLE DATASET
# -----------------------------
# Example: Study Hours vs Marks
data = {
    "Hours": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Marks": [35, 40, 50, 55, 65, 70, 75, 80, 85, 90]
}
df = pd.DataFrame(data)

# Prepare features (X) and target (y)
X = df[["Hours"]]
y = df["Marks"]

# -----------------------------
# 2. TRAIN LINEAR REGRESSION
# -----------------------------
model = LinearRegression()
model.fit(X, y)

# -----------------------------
# 3. EXTRACT COEFFICIENTS
# -----------------------------
slope = model.coef_[0]
intercept = model.intercept_

print("==== Linear Regression Model ====")
print(f"Slope (Coefficient): {slope}")
print(f"Intercept: {intercept}")

# -----------------------------
# 4. MAKE PREDICTIONS
# -----------------------------
y_pred = model.predict(X)
print("\nPredicted Marks:\n", y_pred)

# -----------------------------
# 5. EVALUATE MODEL
# -----------------------------
mse = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y, y_pred)

print("\n==== Model Evaluation ====")
print(f"MSE (Mean Squared Error): {mse}")
print(f"RMSE (Root MSE): {rmse}")
print(f"R² Score: {r2}")

# -----------------------------
# 6. PLOT RESULT
# -----------------------------
plt.scatter(X, y)      # actual data
plt.plot(X, y_pred)    # regression line
plt.title("Simple Linear Regression")
plt.xlabel("Study Hours")
plt.ylabel("Marks")
plt.grid()
plt.show()

Output:
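The metrics above are computed on the same samples used to fit the model, so R² and MSE are optimistic. A minimal sketch of a held-out evaluation, assuming the X, y and imports from the listing above are still in memory:

# Sketch (not part of the original listing): evaluate on a held-out test split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

holdout_model = LinearRegression().fit(X_train, y_train)
test_pred = holdout_model.predict(X_test)

print("Test MSE:", mean_squared_error(y_test, test_pred))
print("Test R²:", r2_score(y_test, test_pred))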
b. Multiple Linear Regression: Extend linear regression to multiple features. Handle feature selection and potential multicollinearity.

Code:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# -----------------------------
# 1. Small Dataset
# -----------------------------
data = {
    "Area": [800, 1000, 1200, 1500, 1800],
    "Bedrooms": [2, 2, 3, 3, 4],
    "Age": [10, 8, 5, 3, 2],
    "Price": [50, 60, 72, 85, 95]
}
df = pd.DataFrame(data)

X = df[["Area", "Bedrooms", "Age"]]
y = df["Price"]

# -----------------------------
# 2. Train Model
# -----------------------------
model = LinearRegression()
model.fit(X, y)

# -----------------------------
# 3. Output Coefficients
# -----------------------------
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)

# -----------------------------
# 4. Predictions
# -----------------------------
y_pred = model.predict(X)
print("\nPredicted Prices:", y_pred)

# -----------------------------
# 5. Evaluation
# -----------------------------
mse = mean_squared_error(y, y_pred)
r2 = r2_score(y, y_pred)
print("\nMSE:", mse)
print("R²:", r2)

Output:
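The listing above fits on all three features but does not actually check for multicollinearity, which the task asks to handle. A minimal sketch of two common diagnostics, assuming the same X; variance_inflation_factor comes from statsmodels, which is assumed to be installed:

# Sketch (not part of the original listing): multicollinearity diagnostics
from statsmodels.stats.outliers_influence import variance_inflation_factor  # assumes statsmodels is available

# 1. Pairwise feature correlations (values close to +1 or -1 suggest redundant features)
print(X.corr())

# 2. Variance Inflation Factor for each feature (VIF > 10 is a common rule of thumb)
for i, col in enumerate(X.columns):
    print(col, "VIF:", variance_inflation_factor(X.values, i))

A simple form of feature selection is then to drop one feature from any highly correlated pair and refit the model.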
c. Regularized Linear Models (Ridge, Lasso, ElasticNet): Implement regression variants like LASSO and Ridge on any generated dataset.

Code:

from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.metrics import r2_score
import numpy as np

# Dataset (very small)
X = np.array([[1],[2],[3],[4],[5]])
y = np.array([3,5,7,9,11])   # y = 2x + 1

# Ridge
ridge = Ridge(alpha=1).fit(X, y)
print("Ridge Coef:", ridge.coef_, "R²:", r2_score(y, ridge.predict(X)))

# Lasso
lasso = Lasso(alpha=0.1).fit(X, y)
print("Lasso Coef:", lasso.coef_, "R²:", r2_score(y, lasso.predict(X)))

# ElasticNet
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("ElasticNet Coef:", elastic.coef_, "R²:", r2_score(y, elastic.predict(X)))

Output:

PRACTICAL NO: 4
Discriminative Models

a. Logistic Regression: Perform binary classification using logistic regression. Calculate accuracy, precision, recall, and understand the ROC curve.

Code:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_curve, auc

# -----------------------------
# 1. Small Binary Dataset
# -----------------------------
# Feature → Hours studied
# Label   → Pass(1) / Fail(0)
X = np.array([[1],[2],[3],[4],[5],[6]])
y = np.array([0,0,0,1,1,1])

# -----------------------------
# 2. Train Logistic Regression
# -----------------------------
model = LogisticRegression()
model.fit(X, y)

# Predictions
y_pred = model.predict(X)
y_prob = model.predict_proba(X)[:,1]   # probability of class 1

# -----------------------------
# 3. Evaluation Metrics
# -----------------------------
print("Accuracy :", accuracy_score(y, y_pred))
print("Precision:", precision_score(y, y_pred))
print("Recall   :", recall_score(y, y_pred))

# -----------------------------
# 4. ROC Curve
# -----------------------------
fpr, tpr, _ = roc_curve(y, y_prob)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr)
plt.plot([0,1], [0,1], '--')
plt.title(f"ROC Curve (AUC = {roc_auc:.2f})")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.grid()
plt.show()

Output:

b. Implement and demonstrate the k-nearest Neighbor algorithm. Read the training data from a .CSV file and build the model to classify a test sample. Print both correct and wrong predictions.

Code:

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# ---------------------------------------
# 1. CREATE & SAVE SMALL CSV DATASET
# ---------------------------------------
data = {
    "Feature1": [1,2,3,4,5,6],
    "Feature2": [2,3,3,5,6,7],
    "Label": [0,0,0,1,1,1]
}
df = pd.DataFrame(data)
df.to_csv("knn_data.csv", index=False)

# ---------------------------------------
# 2. READ CSV FILE
# ---------------------------------------
data = pd.read_csv("knn_data.csv")
X = data[["Feature1","Feature2"]]
y = data["Label"]

# ---------------------------------------
# 3. TRAIN-TEST SPLIT
# ---------------------------------------
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# ---------------------------------------
# 4. TRAIN KNN MODEL
# ---------------------------------------
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# ---------------------------------------
# 5. PREDICT TEST SAMPLES
# ---------------------------------------
y_pred = model.predict(X_test)

print("Test Samples:\n", X_test)
print("\nPredictions:", y_pred.tolist())
print("Actual Labels:", y_test.tolist())

# ---------------------------------------
# 6. PRINT CORRECT & WRONG PREDICTIONS
# ---------------------------------------
correct = []
wrong = []
for actual, predicted in zip(y_test, y_pred):
    if actual == predicted:
        correct.append((actual, predicted))
    else:
        wrong.append((actual, predicted))

print("\nCorrect Predictions:", correct)
print("Wrong Predictions:", wrong)

Output:
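The value n_neighbors=3 is fixed above; k is the main hyperparameter of kNN and changing it can change the predictions. A minimal sketch that tries a few values of k, assuming the same X_train, X_test, y_train, y_test from the split above (k cannot exceed the number of training samples, which is 4 here):

# Sketch (not part of the original listing): compare a few values of k
from sklearn.metrics import accuracy_score

for k in [1, 2, 3]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print("k =", k, "-> test accuracy:", accuracy_score(y_test, knn.predict(X_test)))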
c. Build a decision tree classifier or regressor. Control hyperparameters like tree depth to avoid overfitting. Visualize the tree.

Code:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

# ---------------------------------------
# 1. Small Dataset
# ---------------------------------------
data = {
    "Age": [22,25,47,52,46,56],
    "Salary": [25000,45000,50000,60000,70000,80000],
    "Buy": ["No","No","Yes","Yes","Yes","Yes"]
}
df = pd.DataFrame(data)

X = df[["Age","Salary"]]
y = df["Buy"]

# ---------------------------------------
# 2. Decision Tree (Controlled Depth)
# ---------------------------------------
model = DecisionTreeClassifier(max_depth=3)
model.fit(X, y)

# ---------------------------------------
# 3. Test Predictions
# ---------------------------------------
test = pd.DataFrame({"Age":[30,50], "Salary":[40000,75000]})
pred = model.predict(test)

print("Test Data:\n", test)
print("Predictions:", pred)

# ---------------------------------------
# 4. Visualize the Tree
# ---------------------------------------
plt.figure(figsize=(10,5))
plot_tree(model, feature_names=["Age","Salary"], class_names=["No","Yes"], filled=True)
plt.show()

Output:

d. Implement a Support Vector Machine for any relevant dataset.

Code:

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# -----------------------------
# 1. Small Dataset
# -----------------------------
# Feature: Hours studied
# Label: Pass(1) / Fail(0)
X = np.array([[1],[2],[3],[4],[5],[6]])
y = np.array([0,0,0,1,1,1])

# -----------------------------
# 2. Train SVM Model
# -----------------------------
model = SVC(kernel="linear")
model.fit(X, y)

# -----------------------------
# 3. Test Data
# -----------------------------
test = np.array([[2],[4],[6]])
pred = model.predict(test)

# -----------------------------
# 4. Output
# -----------------------------
print("Test Samples:\n", test)
print("Predictions:", pred)

# Calculate accuracy on original dataset
y_pred_full = model.predict(X)
print("Accuracy:", accuracy_score(y, y_pred_full))

Output:

e. Train a random forest ensemble. Experiment with the number of trees and feature sampling. Compare performance to a single decision tree.

Code:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# -----------------------------
# 1. Small Dataset
# -----------------------------
data = {
    "Feature1": [1,2,3,4,5,6,7,8],
    "Feature2": [2,1,3,5,6,7,8,9],
    "Label": [0,0,0,1,1,1,1,1]
}
df = pd.DataFrame(data)

X = df[["Feature1","Feature2"]]
y = df["Label"]

# -----------------------------
# 2. Train-Test Split
# -----------------------------
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# -----------------------------
# 3. Single Decision Tree
# -----------------------------
dt = DecisionTreeClassifier(max_depth=3)
dt.fit(X_train, y_train)
dt_pred = dt.predict(X_test)
print("Decision Tree Accuracy:", accuracy_score(y_test, dt_pred))

# -----------------------------
# 4. Random Forest
# -----------------------------
# n_estimators sets the number of trees; max_features controls feature sampling at each split
rf = RandomForestClassifier(n_estimators=5, max_features="sqrt", random_state=42)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, rf_pred))

Output:

f. Implement a gradient boosting machine (e.g., XGBoost). Tune hyperparameters and explore feature importance.

Code:

import pandas as pd
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# -----------------------------