APD2F2209SE CT127 - 3 - 2 - PFDA - 092022 - FHI TP060711

ASSIGNMENT
TECHNOLOGY PARK MALAYSIA
CT127-3-2-PFDA
PROGRAMMING FOR DATA ANALYSIS
TYPE INTAKE CODE
HAND OUT DATE: 10 OCTOBER 2022
HAND IN DATE: 28 NOVEMBER 2022
WEIGHTAGE: 50%

INSTRUCTIONS TO CANDIDATES:
1 Submit your assignment at the administrative counter.
2 Students are advised to underpin their answers with the use of references (cited using the American Psychological Association (APA) Referencing).
3 Late submission will be awarded zero (0) unless Extenuating Circumstances (EC) are upheld.
4 Cases of plagiarism will be penalized.
5 The assignment should be bound in an appropriate style (comb bound or stapled).
6 Where the assignment should be submitted in both hardcopy and softcopy, the softcopy of the written assignment and source code (where appropriate) should be on a CD in an envelope / CD cover and attached to the hardcopy.
7 You must obtain 50% overall to pass this module.

Table of Contents
1.0 Introduction & Assumption
1.1 Introduction
1.2 Assumption
2.0 Data Import, Cleaning, Pre-processing, Data exploration, Additional functions.
2.1 Data Import
2.1.1 Importing house rent dataset csv
2.1.2 Importing libraries
2.2 Cleaning
2.2.1 Detecting missing value of the dataset
2.2.2 Detecting & duplication value
2.2.3 Detecting unusual & inappropriate value
2.2.4 Dropping row of missing value
2.2.5 Replacing unusual value with missing value
2.3 Pre-processing
2.3.1 Rename columns
2.3.2 View the data structure of house rent dataset
2.3.3 Checking data type for each column
2.3.4 Modify data type for specific column
2.5 Additional functions
2.5.1 Quantify the linear relationship between bedroom number, lease fee, bathroom number and square feet (correlation)
2.5.2 Location of the dataset (geo_map)
2.5.3 Lollipop Chart
2.5.4 Violin Chart
2.5.5 Donut Chart
2.5.6 Hover label for boxplot chart using Plotly packages
2.5.7 Additional dplyr and r built in functions
2.5.8 Additional ggplot or plotly functions
2.4 Data Exploration
2.4.1 Checking dataset dimensions, rows and columns
2.4.2 Head and tail functions of the dataset
2.4.3 Summary of the dataset
3.0 Questions & Analysis
3.1 Question 1: What is the factor affecting the house rental price?
3.1.1 Analysis 1: Find the number bedroom and its average price
3.1.2 Analysis 2: Find size of the home (sqrt) its average rental price
3.1.3 Analysis 3: Find furnishing status its average rental price
3.1.4 Analysis 4: Find city and its average rental price
3.1.5 Analysis 5: Find the area type and its average rental price
3.1.6 Analysis 6: Find number of bathrooms its average rental price range
3.1.7 Analysis 7: Find the number of bathroom and its average rental price
3.1.8 Analysis 8: Find tenant type and its average rental price
3.1.9 Analysis 9: Find out top 6 location of highest rental price
3.1.10 Analysis 10: Find out the types of contact and its average rental price
3.1.11 Analysis 11: Find out the top 6 floor level its highest average price
Conclusion of Question 1
3.2 Question 2: Which types of houses preferred by families?
3.2.1 Analysis 1: Find the total number of families from the tenant preference column
3.2.2 Analysis 2: Find the total number of families and their preference on furnishing type
3.2.3 Analysis 3: Find the total number of families preferences on bedroom number with semi-furnished home
3.2.4 Analysis 4: Find the total number of families preferences on bathroom number with semi furnished and 3-bedroom home
3.2.5 Analysis 5: Find the total number of families preferences area type with semi furnished, 3-bedroom, and 3-bathroom home
3.2.6 Analysis 6: Find the total number of families preferences on size
3.2.7 Analysis 7: Find the top 10 location that preferred by the family
3.2.8 Analysis 8: Find the total number of families preferences on city
3.2.9 Analysis 9: Find the percentage of families preferences on floor number
3.2.10 Analysis 10: Find the total number of families preferences on floor number
3.2.11 Analysis 11: Find the families preferences on the range of the rent price
Conclusion of Question 2
3.3 Question 3: Which types of houses are preferred by the bachelors?
3.3.1 Analysis 1: Find the total number of bachelors from the tenant preference column
3.3.2 Analysis 2: Find the total number of bachelors and their preference on furnishing status
3.3.3 Analysis 3: Find the total number of bachelors and their preference on be bedroom number
3.3.4 Analysis 4: Find the total number of bachelors and their preference on bathroom number
3.3.5 Analysis 5: Find the total number of bachelors and their preference on area type
3.3.6 Analysis 6: Find the total number of bachelors and their preference on size
3.3.7 Analysis 7: Find the top 5 location that preferred by the bachelors
3.3.8 Analysis 8: Find the total number of bachelors and their preference on city
3.3.9 Analysis 9: Find the top 5 floor number that preferred by the bachelors
3.3.10 Analysis 10: Find the total number of bachelors and their preference rental price
Conclusion of Question 3
3.4 Question 4: What types of the house does the owner has
3.4.1 Analysis 1: Find the relationship between owner and bedroom number
3.4.2 Analysis 2: Find the relationship between owner and range of house rent
3.4.3 Analysis 3: Find the relationship between owner and range of house size
3.4.4 Analysis 4: Find the relationship between owner and top 10 floor level
3.4.5 Analysis 5: Find the relationship between owner and area type
3.4.6 Analysis 6: Find the relationship between owner and city
3.4.7 Analysis 7: Find the relationship between owner and furnishing type
3.4.8 Analysis 8: Find the relationship between owner and preferred tenant type
3.4.9 Analysis 9: Find the relationship between owner and bathroom number
3.4.10 Analysis 10: Find the relationship between owner and location
Conclusion of Question 4
3.5 Question 5: What types of the house does the agent has
3.5.1 Analysis 1: Find the relationship between agent and number of bedroom
3.5.2 Analysis 2: Find the relationship between agent and range of house rent
3.5.3 Analysis 3: Find the relationship between agent and range of house size
3.5.4 Analysis 4: Find the relationship between agent and floor level
3.5.5 Analysis 5: Find the relationship between agent and area type
3.5.6 Analysis 6: Find the relationship between agent and city
3.5.7 Analysis 7: Find the relationship between agent and furnishing type
3.5.8 Analysis 8: Find the relationship between agent and preferred tenant type
3.5.9 Analysis 9: Find the relationship between agent and bathroom number
3.5.10 Analysis 10: Find the relationship between agent and location
Conclusion of Question 5
3.6 When does the house listing being published?
3.6.1 Find total the house listing being published for all month
3.6.2 Find the listing published for April only
3.6.3 Find the listing published for May only
3.6.4 Find the listing published for June only
3.6.5 Find the listing published for July only
Conclusion of Question 6
4.0 Conclusion
5.0 Reference

1.0 Introduction & Assumption

1.1 Introduction

With the advancement of computing and technology, data analysis and prediction have taken on a greater role in the daily operations of businesses. A comprehensive data analysis can provide valuable information, insight, and early problem detection for an organisation. In other words, it helps senior management make better decisions and prevent problems, and it informs business owners of the state of their company. Data analysis is essential for firms that frequently deal with large amounts of client data, such as those in the sales, e-commerce, and real estate sectors. In short, strong data analysis not only helps senior management make good decisions but also strengthens a company's competitive position by identifying its consumers and competitors.

The goal of this assignment is to analyse and investigate the supplied house rent prediction dataset using data analytics techniques such as exploration, manipulation, transformation, and visualisation. Additionally, the student is required to research the problem domain and its datasets. Lastly, a thorough and detailed data analysis and data visualisation should be undertaken. The student must also present a comprehensive report documenting the analysed data and software outputs.
1.2 Assumption

2.0 Data Import, Cleaning, Pre-processing, Data exploration, Additional functions

2.1 Data Import

2.1.1 Importing house rent dataset csv

Figure 1 Shows the r script of data import

Importing data is one of the most crucial steps in the data analysis process. As the first step of the analysis, we used the read.csv() function to read the given dataset in CSV format. The first argument is the path to the dataset file, and the logical value header = TRUE indicates that the file contains headings (variable names) in its first row.

Figure 2 Shows the r console output of data import

The output of the import step described above is shown in Figure 2. The data are loaded and displayed in the RStudio console. The output also shows that read.csv() returns the data as a data.frame, so no manual conversion is needed.

2.1.2 Importing libraries

The R script in Figure 3 loads all libraries used in this project, including those needed for the additional functions and for functions that were not taught during lecture sessions.

Figure 3 Shows the r script of import libraries

Figure 4 Shows the r console output of import libraries

2.2 Cleaning

2.2.1 Detecting missing value of the dataset

2.2.1.1 Detecting missing value using is.na function

Missing values in certain rows or columns are among the most frequent problems in datasets. In this section, we use the is.na() function, which returns TRUE or FALSE for each value to show whether it is missing, on the whole dataset or on a particular column. When the result is wrapped in sum(), the total number of missing values is returned instead of a Boolean result.
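The import and missing-value check described above can be sketched on a small stand-in dataset. This is a minimal sketch, not the script from Figure 1: the data frame name `rent_demo` and all values are made up for illustration, and the CSV is read from inline text rather than from the real file path.

```r
# Toy CSV text standing in for the house rent dataset (values are made up).
csv_text <- "BHK,Rent,Size,Bathroom
2,10000,1100,2
3,NA,800,1
1,7500,NA,1"

# header = TRUE tells read.csv() that the first row holds the column names.
# text = ... is used here only so the example needs no file on disk.
rent_demo <- read.csv(text = csv_text, header = TRUE)

# is.na() returns a logical matrix with TRUE wherever a value is missing.
na_flags <- is.na(rent_demo)

# sum() counts the TRUE values, i.e. the total number of missing cells.
total_missing <- sum(na_flags)

# colSums() gives the number of missing values per column.
missing_per_column <- colSums(na_flags)

print(total_missing)
print(missing_per_column)
```

A result of 0 for every column, as reported for the real dataset, would mean that no cell is NA.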
In the given house rent dataset, all columns returned 0, as can be seen in the console output above. This indicates that there are no missing or NA values.

2.2.1.2 Detecting missing value using mean function

For columns with numeric and integer data types, there is an alternative to the is.na() function for checking missing or NA values. The mean() function can be used on numeric columns: if any value in the column is missing, mean() returns NA.

The BHK, Rent, Size, and Bathroom columns of the house rent dataset return decimal values rather than NA in the mean() output seen in the console above, confirming that those columns do not have any missing values.

2.2.1.3 Detecting missing value using the summary function

Although the summary() function was not designed specifically for detecting missing values, it is an efficient, time-saving way to check both data types and missing values for numeric or integer columns, because it reports the number of NA values alongside the descriptive statistics.

2.2.2 Detecting & duplication value

Figure 5 Shows the r script of detecting duplication value

In this section, the duplicated() function is used to check for duplicated data in the dataset. It is crucial to identify duplicate values in order to prevent redundancy and inaccuracy in the later analysis. To summarise the result, the sum() function then gives the total number of duplicated entries.

Figure 6 Shows the r console output of detecting duplication value

The console above displays the full output returned by the duplicated() function.
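The duplicate check can be sketched as follows. This is a minimal illustration under stated assumptions: `rent_demo` and its values are made up, and one row is repeated on purpose so that the result differs from the all-FALSE output reported for the real dataset.

```r
# Toy data with one repeated row (values are made up for illustration).
rent_demo <- data.frame(
  BHK  = c(2, 3, 2, 1),
  Rent = c(10000, 20000, 10000, 7500),
  Area = c("A", "B", "A", "C")
)

# duplicated() marks a row TRUE when an identical row appeared earlier.
dup_flags <- duplicated(rent_demo)

# sum() turns the logical vector into a count of duplicated rows.
dup_count <- sum(dup_flags)

print(dup_flags)
print(dup_count)
```

Only the third row is flagged, because it repeats the first one; the first occurrence itself is not counted as a duplicate.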
In Figure 6, every entry returned FALSE, indicating that there are no duplicates in the data.

Figure 7 Shows the r console output of detecting duplication

Because checking each value returned by duplicated() one by one is messy and impractical, sum() is used instead. We can see instantly that 0 is returned, indicating that there are no duplicate values in the dataset.

2.2.3 Detecting unusual & inappropriate value

Figure 8 Shows the process of detecting unusual and inappropriate data

Figure 9 Shows the process of detecting unusual data

Disclaimer: This section is implemented for reporting and research purposes only. This portion of the script is not to be executed in the actual assignment or presentation; students were advised by Ms. Minnu Helen Joseph not to make changes to the dataset, such as deletions.

The spreadsheet views above reveal unusual and inappropriate values in rows 443 and 3566 of the area locality column: this column should contain only a place name (character data), not an int value. There are two options for handling them: remove the affected rows, or replace the unusual values with missing values. The following sections discuss both solutions.

2.2.4 Dropping row of missing value

Figure 10 Shows the r script of dropping row of missing value

Disclaimer: This section is implemented for reporting and research purposes only. This portion of the script is not to be executed in the actual assignment or presentation; students were advised by Ms. Minnu Helen Joseph not to make changes to the dataset, such as deletions.

The script above drops rows 443 and 3566, which, as previously indicated, contain unusual values. The first line of the script removes rows 443 and 3566 and overwrites the data frame. The remaining two lines then print the affected row ranges to confirm that the rows have been dropped.

Figure 11 Shows the r console output of dropping row of missing data

The RStudio console output above makes it plain that the rows with unusual values, 443 and 3566, have been removed.

2.2.5 Replacing unusual value with missing value

Figure 12 Shows the r script of replacing unusual value with missing value

Disclaimer: This section is implemented for reporting and research purposes only. This portion of the script is not to be executed in the actual assignment or presentation; students were advised by Ms. Minnu Helen Joseph not to make changes to the dataset, such as deletions.

Another option for dealing with an inappropriate value is to replace the resulting NA value with the mean value, as implemented above. The na.rm = TRUE argument of mean() indicates that NA values should be stripped before the computation proceeds.

2.3 Pre-processing

2.3.1 Rename columns

Figure 13 Shows the r script of rename columns

Renaming data variables, headers, or columns is a crucial step in the pre-processing of data. Doing so makes the analysis in the subsequent phases simpler and more understandable, and it prevents name conflicts or confusion with other variables.
Using the names() function, we rename the columns by assigning a vector of new column headings, listed in the same order as the columns of the original dataset.

Figure 14 Shows the r console output of renaming column

The console above shows the outcome of the renaming operation: the column headings have been successfully changed to the newly assigned values.

2.3.2 View the data structure of house rent dataset

Figure 15 Shows the r script of view data structure of dataset

We can examine the data structure of the dataset using the class() function, which returns the class of an object.

Figure 16 Shows the r console output of view data structure of dataset

The console output shows that the current dataset is stored as a data.frame.

2.3.3 Checking data type for each column

Figure 17 Shows the r script of checking data type for each column

Datasets often include some columns with an incorrect data type, so it is crucial to examine each column and make sure it has the correct data type before performing any analysis. In this phase, the str() function is used to display an object's internal structure. The name str stands for structure.

Figure 18 Shows the r console output of checking data type for each column

As shown in the figure above, the str() function returned the dataset's data structure, the number of observations, and the number of variables, together with the type of each column. It is also clear that some columns have unsuitable data types, such as the chr data type for the Posted Date column and the int data type for the Lease Fee column.
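The renaming and inspection steps of sections 2.3.1 to 2.3.3 can be sketched together. This is a minimal sketch under stated assumptions: the original heading `Posted.On`, the new heading `Bedroom_Number`, and all values are hypothetical, while `Posted_Date` and `Lease_Fee` follow the column names mentioned in this report.

```r
# Toy data; the original column headings here are hypothetical.
rent_demo <- data.frame(
  Posted.On = c("2022-05-18", "2022-05-13"),
  BHK       = c(2L, 3L),
  Rent      = c(10000L, 20000L),
  stringsAsFactors = FALSE
)

# Assign a character vector of new headings, in the original column order.
names(rent_demo) <- c("Posted_Date", "Bedroom_Number", "Lease_Fee")

# class() reports the data structure of the whole object.
print(class(rent_demo))

# str() lists the number of observations and variables, plus each column type.
str(rent_demo)

# sapply(..., class) collects the column types into a named vector,
# which is convenient for checking them programmatically.
col_types <- sapply(rent_demo, class)
print(col_types)
```

In this toy data, the date column is stored as character and the fee column as integer, which mirrors the two type issues identified above.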
2.3.4 Modify data type for specific column

2.3.4.1 Modify data type for Posted_Date column from char to date format

Figure 19 Shows the r script of modify data type for specific column

As noted in the data type check above, datasets often arrive with incorrect data types in some columns. Correcting them is part of data pre-processing, so after finding the affected columns we must modify their data types manually. Here, the Posted Date column is converted from char to date format using the as.Date() function, and the result overwrites the existing column. The second argument of the function specifies the order of day, month, and year in the source text.

Figure 20 Shows the r console output of modify data type for specific columns

The class() output confirms that the Posted Date column has been successfully converted from the char data type to the date data type.

2.3.4.2 Modify data type for Lease_fee column from int to numeric format

Figure 21 Shows the r console output of modify data type for lease fee from int to numeric format

For a column representing fees and money, integer may not be the most suitable data type, because monetary amounts can include decimals. The column is therefore converted from the integer to the numeric data type.

Figure 22 Shows the r console output of figure 21

The console output above indicates that the conversion from integer to numeric has been successfully applied to the Lease Fee column.
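Both conversions can be sketched on toy data. This is a minimal sketch, not the script from Figures 19 and 21: the values are made up, and the day/month/year layout passed to `format` is an assumption, since the date layout of the real file is not stated here.

```r
# Toy data; the date layout "%d/%m/%Y" is assumed for illustration only.
rent_demo <- data.frame(
  Posted_Date = c("18/05/2022", "13/05/2022"),
  Lease_Fee   = c(10000L, 20000L),
  stringsAsFactors = FALSE
)

# format tells as.Date() how day, month, and year are ordered in the text.
# Assigning back to the same column overwrites the character version.
rent_demo$Posted_Date <- as.Date(rent_demo$Posted_Date, format = "%d/%m/%Y")

# as.numeric() converts the integer column to double-precision numeric.
rent_demo$Lease_Fee <- as.numeric(rent_demo$Lease_Fee)

# class() confirms the new types.
print(class(rent_demo$Posted_Date))
print(class(rent_demo$Lease_Fee))
```

If the `format` string does not match the text, as.Date() yields NA for those entries, so rerunning the missing-value check after the conversion is a cheap safeguard.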
2.5 Additional functions

2.5.1 Quantify the linear relationship between bedroom number, lease fee, bathroom number and square feet (correlation)

Figure 23 Shows the r script of correlation

The first additional feature, shown in the R source code above, computes the correlation between a home's square footage, rental price, and number of bedrooms and bathrooms, and visualizes the result as a chart.
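The numeric part of this step can be sketched with base R's cor(). This is a minimal sketch under stated assumptions: the column names and values are made up, Pearson correlation (the default of cor()) is assumed, and the chart itself is omitted because the plotting package used in Figure 23 is not named here.

```r
# Toy numeric data; names and values are made up for illustration.
rent_demo <- data.frame(
  Bedroom_Number  = c(1, 2, 2, 3, 4),
  Lease_Fee       = c(7000, 10000, 12000, 20000, 35000),
  Bathroom_Number = c(1, 1, 2, 2, 3),
  Square_Feet     = c(400, 800, 900, 1200, 2000)
)

# cor() returns the pairwise correlation matrix of the numeric columns.
# method = "pearson" is the default and measures linear association.
cor_matrix <- cor(rent_demo, method = "pearson")

# Rounding makes the matrix easier to read in the console.
print(round(cor_matrix, 2))
```

Each entry lies between -1 and 1, the diagonal is 1 by construction, and the matrix is symmetric, so only one triangle needs to be read.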