Train Test Splits, Cross Validation, and Linear Regression

Introduction

We will be working with a data set based on housing prices in Ames, Iowa. It was compiled for educational use as a modernized and expanded alternative to the well-known Boston Housing dataset. This version of the data set has had some missing values filled for convenience. There are an extensive number of features, so they are described below.

Predictor

• SalePrice: The property's sale price in dollars.

Features

• MoSold: Month sold
• YrSold: Year sold
• SaleType: Type of sale
• SaleCondition: Condition of sale
• MSSubClass: The building class
• MSZoning: The general zoning classification
• Neighborhood: Physical locations within Ames city limits
• Street: Type of road access
• Alley: Type of alley access
• LotArea: Lot size in square feet
• LotConfig: Lot configuration
• LotFrontage: Linear feet of street connected to property
• LotShape: General shape of property
• LandSlope: Slope of property
• LandContour: Flatness of the property
• YearBuilt: Original construction date
• YearRemodAdd: Remodel date
• OverallQual: Overall material and finish quality
• OverallCond: Overall condition rating
• Utilities: Type of utilities available
• Foundation: Type of foundation
• Functional: Home functionality rating
• BldgType: Type of dwelling
• HouseStyle: Style of dwelling
• 1stFlrSF: First floor square feet
• 2ndFlrSF: Second floor square feet
• LowQualFinSF: Low quality finished square feet (all floors)
• GrLivArea: Above grade (ground) living area square feet
• TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
• Condition1: Proximity to main road or railroad
• Condition2: Proximity to main road or railroad (if a second is present)
• RoofStyle: Type of roof
• RoofMatl: Roof material
• ExterQual: Exterior material quality
• ExterCond: Present condition of the material on the exterior
• Exterior1st: Exterior covering on house
• Exterior2nd: Exterior covering on house (if more than one material)
• MasVnrType: Masonry veneer type
• MasVnrArea: Masonry veneer area in square feet
• WoodDeckSF: Wood deck area in square feet
• OpenPorchSF: Open porch area in square feet
• EnclosedPorch: Enclosed porch area in square feet
• 3SsnPorch: Three season porch area in square feet
• ScreenPorch: Screen porch area in square feet
• PoolArea: Pool area in square feet
• PoolQC: Pool quality
• Fence: Fence quality
• PavedDrive: Paved driveway
• GarageType: Garage location
• GarageYrBlt: Year garage was built
• GarageFinish: Interior finish of the garage
• GarageCars: Size of garage in car capacity
• GarageArea: Size of garage in square feet
• GarageQual: Garage quality
• GarageCond: Garage condition
• Heating: Type of heating
• HeatingQC: Heating quality and condition
• CentralAir: Central air conditioning
• Electrical: Electrical system
• FullBath: Full bathrooms above grade
• HalfBath: Half baths above grade
• BedroomAbvGr: Number of bedrooms above basement level
• KitchenAbvGr: Number of kitchens
• KitchenQual: Kitchen quality
• Fireplaces: Number of fireplaces
• FireplaceQu: Fireplace quality
• MiscFeature: Miscellaneous feature not covered in other categories
• MiscVal: Value of miscellaneous feature
• BsmtQual: Height of the basement
• BsmtCond: General condition of the basement
• BsmtExposure: Walkout or garden level basement walls
• BsmtFinType1: Quality of basement finished area
• BsmtFinSF1: Type 1 finished square feet
• BsmtFinType2: Quality of second finished area (if present)
• BsmtFinSF2: Type 2 finished square feet
• BsmtUnfSF: Unfinished square feet of basement area
• BsmtFullBath: Basement full bathrooms
• BsmtHalfBath: Basement half bathrooms
• TotalBsmtSF: Total square feet of basement area

Question 1

• Import the data using Pandas and examine the shape. There are 79 feature columns plus the predictor, the sale price (SalePrice).
• There are three different data types: integers (int64), floats (float64), and strings (object, categoricals). Examine how many there are of each data type.

import pandas as pd

data = pd.read_csv('Ames_Housing_Sales.csv')
print(data.shape)

(1379, 80)

data.head()

[First five rows of the 80-column dataframe; string columns such as Alley contain NaN where the feature is absent.]

Question 2

As discussed in the lecture, a significant challenge, particularly when dealing with data that have many columns, is ensuring each column gets encoded correctly. This is particularly true with data columns that are ordered categoricals (ordinals) vs unordered categoricals. Unordered categoricals should be one-hot encoded; however, this can significantly increase the number of features and creates features that are highly correlated with each other. Determine how many total features would be present, relative to what currently exists, if all string (object) features are one-hot encoded. Recall that the total number of one-hot encoded columns is n - 1, where n is the number of categories.
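The n - 1 rule can be tallied directly from the dtypes. A minimal sketch on a toy frame (hypothetical values standing in for the Ames columns, since the CSV is assumed unavailable here):

```python
import pandas as pd

# Toy stand-in for the Ames data: two string categoricals, one numeric feature.
df = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave", "Pave"],
    "Alley": ["Grvl", "Grvl", "Pave", "Grvl"],
    "LotArea": [8450, 9600, 11250, 9550],
})

object_cols = df.columns[df.dtypes == object]
numeric_cols = [c for c in df.columns if c not in object_cols]

# Each categorical with n levels contributes n - 1 dummy columns
# (one level serves as the reference category and is dropped).
n_after = len(numeric_cols) + sum(df[c].nunique() - 1 for c in object_cols)
print(n_after)  # 1 numeric + (2 - 1) + (2 - 1) = 3
```

The same count falls out of pandas directly via `pd.get_dummies(df, columns=list(object_cols), drop_first=True)`, whose result has `n_after` columns.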
data.dtypes.value_counts()

object     43
float64    21
int64      16
Name: count, dtype: int64

Question 3

Let's create a new data set where all of the above categorical features will be one-hot encoded. We can fit this data and see how it affects the results.

• Use the dataframe .copy() method to create a completely separate copy of the dataframe for one-hot encoding.
• On this new dataframe, one-hot encode each of the appropriate columns and add it back to the dataframe. Be sure to drop the original column.
• For the data that are not one-hot encoded, drop the columns that are string categoricals.

For the first step, numerically encoding the string categoricals, either Scikit-learn's LabelEncoder or DictVectorizer can be used. However, the former is probably easier since it doesn't require specifying a numerical value for each category, and we are going to one-hot encode all of the numerical values anyway. (Can you think of a time when DictVectorizer might be preferred?)

mask = data.dtypes == object
categorical_cols = data.columns[mask]
mask

1stFlrSF        False
2ndFlrSF        False
3SsnPorch       False
Alley            True
BedroomAbvGr    False
                ...
WoodDeckSF      False
YearBuilt       False
YearRemodAdd    False
YrSold          False
SalePrice       False
Length: 80, dtype: bool

data_en = data.copy()

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
for col in categorical_cols:
    data_en[col] = le.fit_transform(data_en[col].astype(str))

data_en = pd.get_dummies(data_en, columns=categorical_cols)
data_en

[1379 rows × 295 columns: the 37 numeric features plus the dummy columns.]

data_en.columns

Index(['1stFlrSF', '2ndFlrSF', '3SsnPorch', 'BedroomAbvGr', 'BsmtFinSF1',
       'BsmtFinSF2', 'BsmtFullBath', 'BsmtHalfBath', 'BsmtUnfSF',
       'EnclosedPorch',
       ...
       'SaleType_3', 'SaleType_4', 'SaleType_5', 'SaleType_6', 'SaleType_7',
       'SaleType_8', 'Street_0', 'Street_1', 'Utilities_0', 'Utilities_1'],
      dtype='object', length=295)

Question 4

• Create train and test splits of both data sets. To ensure the data gets split the same way, use the same random_state in each of the two splits.
• For each data set, fit a basic linear regression model on the training data.
• Calculate the mean squared error on both the train and test sets for the respective models. Which model produces smaller error on the test data and why?
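Before fitting, it is worth convincing yourself that reusing the same random_state really does split two frames identically. A quick sanity check on toy data (hypothetical values, not the housing features):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Two "views" of the same 10 rows, analogous to data and data_en.
df_a = pd.DataFrame({"x": range(10)})
df_b = pd.DataFrame({"x10": [v * 10 for v in range(10)]})

a_train, a_test = train_test_split(df_a, test_size=0.2, random_state=42)
b_train, b_test = train_test_split(df_b, test_size=0.2, random_state=42)

# The same seed shuffles the same-length index the same way,
# so the two splits select identical rows.
print(a_train.index.equals(b_train.index))  # True
print(a_test.index.equals(b_test.index))    # True
```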
X = data.drop('SalePrice', axis=1)
y = data['SalePrice']
X = X.select_dtypes(exclude=['object'])

X_encoded = data_en.drop('SalePrice', axis=1)
y_encoded = data_en['SalePrice']

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train_enc, X_test_enc, y_train_enc, y_test_enc = train_test_split(
    X_encoded, y_encoded, test_size=0.2, random_state=42
)

lr = LinearRegression()
lr.fit(X_train, y_train)

lr_enc = LinearRegression()
lr_enc.fit(X_train_enc, y_train_enc)

y_train_pred = lr.predict(X_train)
y_test_pred = lr.predict(X_test)
y_train_pred_enc = lr_enc.predict(X_train_enc)
y_test_pred_enc = lr_enc.predict(X_test_enc)

mse_train = mean_squared_error(y_train, y_train_pred)
mse_test = mean_squared_error(y_test, y_test_pred)
mse_train_enc = mean_squared_error(y_train_enc, y_train_pred_enc)
mse_test_enc = mean_squared_error(y_test_enc, y_test_pred_enc)

print("Original Train MSE:", mse_train)
print("Original Test MSE:", mse_test)
print("Encoded Train MSE:", mse_train_enc)
print("Encoded Test MSE:", mse_test_enc)

Original Train MSE: 1122754038.2088377
Original Test MSE: 1480325423.7118464
Encoded Train MSE: 352257750.4835763
Encoded Test MSE: 8.222462936227387e+18

The non-encoded model produces the smaller test error. The one-hot encoded model fits the training data better (lower train MSE), but its 295 columns include many rare, highly correlated dummies, so the least-squares solution is numerically unstable and generalizes terribly. Some quick diagnostics confirm the data themselves are clean:

print(X_test_enc.isnull().sum().sum())

import numpy as np
print(np.isinf(X_test_enc).sum().sum())

print(X_train_enc.shape, X_test_enc.shape)

0
0
(1103, 294) (276, 294)

print(X_test_enc.describe())

[Summary statistics for the 36 numeric columns of X_test_enc: all counts are 276 and the ranges are plausible (e.g. YearBuilt spans 1892–2009), so nothing looks corrupted.]

print(y_train.head())
print(y_train_enc.head())
print(y_train.equals(y_train_enc))
print(y_train.min(), y_train.max())
print(y_test_enc.min(), y_test_enc.max())
print(y_test_pred_enc[:5])
print(X_train_enc.mean().mean())
print(X_test_enc.mean().mean())

1105    171000.0
309     119000.0
915     140000.0
682     222000.0
1236    147000.0
Name: SalePrice, dtype: float64
1105    171000.0
309     119000.0
915     140000.0
682     222000.0
1236    147000.0
Name: SalePrice, dtype: float64
True
35311.0 755000.0
67000.0 582933.0
[340995.39987183 128102.03601074  64614.67333984 219484.55224609
 188358.10662842]
85.0188819793848
83.50938817831752

Let's scale the data and try again for a better MSE.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_enc)
X_test_scaled = scaler.transform(X_test_enc)

lr = LinearRegression()
lr.fit(X_train_scaled, y_train)

y_pred = lr.predict(X_test_scaled)

print(X_train_scaled.mean())  # should be ~0
print(X_test_scaled.mean())   # close to 0

-7.83437116987757e-17
0.008071663388711875

y_test_pred_enc = lr_enc.predict(X_test_scaled)
mse_test_enc = mean_squared_error(y_test, y_test_pred_enc)
print(mse_test_enc)

1.1554975676869076e+26

/opt/intel/oneapi/intelpython/lib/python3.11/site-packages/sklearn/base.py:493: UserWarning: X does not have valid feature names, but LinearRegression was fitted with feature names
  warnings.warn(

Note the bug here: lr_enc was fit on the unscaled training data, so predicting from the scaled test matrix (as the feature-name warning hints) produces an even worse error. The freshly fit lr on the scaled data is the model that should be evaluated.

Question 5

For each of the data sets (one-hot encoded and not encoded):

• Scale all the non-one-hot-encoded values using one of the following: StandardScaler, MinMaxScaler, MaxAbsScaler.
• Compare the error calculated on the test sets.

Be sure to calculate the skew (to decide if a transformation should be done) and fit the scaler on ONLY the training data, but then apply it to both the train and test data identically.
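Question 5 insists the scaler be fit only on the training data; wrapping the scaler and regressor in a scikit-learn Pipeline enforces that automatically and avoids the kind of mismatch seen above. A minimal sketch on synthetic data (hypothetical values, not the housing features):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.5, 4.0]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# fit() runs the scaler's fit_transform on the training data ONLY;
# predict()/score() reuse those training statistics on the test data.
pipe = Pipeline([("scale", StandardScaler()), ("lr", LinearRegression())])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))
```

Because the pipeline carries the fitted scaler with the model, it is impossible to accidentally feed scaled data to a model trained on unscaled data, or vice versa.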
skewness = X_train.skew()
print(skewness.sort_values(ascending=False))

MiscVal          28.420789
PoolArea         13.921842
LotArea          11.576472
LowQualFinSF     10.577862
3SsnPorch        10.257336
KitchenAbvGr      5.268794
BsmtFinSF2        4.446876
ScreenPorch       4.404733
BsmtHalfBath      3.799697
EnclosedPorch     3.184311
LotFrontage       2.991530
MasVnrArea        2.497032
OpenPorchSF       2.280418
BsmtFinSF1        1.868217
TotalBsmtSF       1.866779
1stFlrSF          1.464172
GrLivArea         1.419965
MSSubClass        1.406291
WoodDeckSF        1.322117
OverallCond       0.940152
BsmtUnfSF         0.908809
GarageArea        0.830671
2ndFlrSF          0.783074
TotRmsAbvGrd      0.623061
Fireplaces        0.618361
HalfBath          0.591871
BsmtFullBath      0.459529
OverallQual       0.291434
GarageCars        0.216782
MoSold            0.205489
BedroomAbvGr      0.085342
YrSold            0.083387
FullBath          0.047250
YearRemodAdd     -0.559818
GarageYrBlt      -0.652540
YearBuilt        -0.666944
dtype: float64

import numpy as np

# Log-transform the heavily right-skewed columns (skew > 1).
for col in X_train.columns:
    if X_train[col].skew() > 1:
        X_train[col] = np.log1p(X_train[col])
        X_test[col] = np.log1p(X_test[col])

# Non-encoded data
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# For the one-hot encoded data, scale only the numeric (non-dummy) columns.
dummy_cols = [col for col in X_train_enc.columns if X_train_enc[col].nunique() <= 2]
num_cols = [col for col in X_train_enc.columns if col not in dummy_cols]

scaler = StandardScaler()
X_train_enc[num_cols] = scaler.fit_transform(X_train_enc[num_cols])
X_test_enc[num_cols] = scaler.transform(X_test_enc[num_cols])

from sklearn.linear_model import LinearRegression

lr = LinearRegression()

# Non-encoded
lr.fit(X_train_scaled, y_train)
y_pred_test = lr.predict(X_test_scaled)

# Encoded
lr_enc = LinearRegression()
lr_enc.fit(X_train_enc, y_train)
y_pred_test_enc = lr_enc.predict(X_test_enc)

from sklearn.metrics import mean_squared_error

mse_test = mean_squared_error(y_test, y_pred_test)
mse_test_enc = mean_squared_error(y_test, y_pred_test_enc)

print("Non-encoded Test MSE:", mse_test)
print("Encoded Test MSE:", mse_test_enc)

Non-encoded Test MSE: 1428740550.4861448
Encoded Test MSE: 1379698198.1571403

With the skew correction and properly fitted scaling, the encoded model now slightly outperforms the non-encoded one on the test set.

Question 6

Plot predictions vs actual for one of the models.

import matplotlib.pyplot as plt

plt.figure(figsize=(6, 6))
plt.scatter(y_test, y_pred_test_enc, alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color='red')
plt.xlabel("Actual Prices")
plt.ylabel("Predicted Prices")
plt.title("Actual vs Predicted Prices")
plt.show()

[Scatter plot of actual vs predicted prices with the red y = x reference line.]

A slightly cleaner version of the same plot:

plt.figure(figsize=(6, 6))
plt.scatter(y_test, y_pred_test_enc, alpha=0.6)
plt.plot([0, max(y_test)], [0, max(y_test)], 'r--')  # perfect-prediction line
plt.xlabel("Actual")
plt.ylabel("Predicted")
plt.title("Actual vs Predicted")
plt.grid(True)
plt.show()
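The notebook's title promises cross validation, which a single train/test split does not provide: the MSE figures above depend on which 20% happened to land in the test set. K-fold cross validation averages the error over several held-out folds. A minimal sketch on synthetic data (hypothetical values, not the Ames features):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for the housing features.
rng = np.random.default_rng(42)
X = rng.normal(size=(150, 4))
y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=150)

# 5-fold CV: each fold serves once as the held-out test set, so the
# reported error is averaged over five different train/test splits.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X, y, cv=kf,
                         scoring="neg_mean_squared_error")
print(-scores.mean())  # average MSE across the five folds
```

Applied to the housing data, the same pattern (ideally with the scaler wrapped in a Pipeline so it is refit inside each fold) would give a more stable comparison of the encoded and non-encoded models than the single split used above.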