Steps in Analyzing Software Metrics Data:
Feature/Attribute Extraction Using Principal Component Analysis

Features and Independent Variables
• "Feature" and "independent variable" are different terms for the same thing. "Feature" is more common in machine learning, whereas "independent variable" is more common in statistics.

What is Principal Component Analysis and what is it used for?
• Principal Component Analysis, more commonly known as PCA, is a way to reduce the number of variables while retaining most of the important information.
• It transforms a number of variables that may be correlated into a smaller number of uncorrelated variables, known as principal components.
• The principal components are linear combinations of the original variables; each component points along one orthogonal direction, and its eigenvalue measures the variance captured in that direction.
• The main objective of PCA is to condense your model features into fewer components, which helps you visualize patterns in your data and helps your model run faster. Using PCA also reduces the chance of overfitting your model by collapsing highly correlated features.

Example data set

  Feature 1 (e.g. WMC):  4   8  13   7
  Feature 2 (e.g. LOC): 11   4   5  14

Step 1: Calculate the mean of each feature
(Figure: scatter plot of the given data points.)
• The means of X1 and X2 are calculated as shown below:

  X̄1 = (4 + 8 + 13 + 7)/4 = 8
  X̄2 = (11 + 4 + 5 + 14)/4 = 8.5

Step 2: Calculation of the covariance matrix
• Using the sample covariance (divisor n - 1 = 3), the covariances are calculated as follows:

  var(X1)     = [(-4)² + 0² + 5² + (-1)²]/3 = 42/3 = 14
  var(X2)     = [2.5² + (-4.5)² + (-3.5)² + 5.5²]/3 = 69/3 = 23
  cov(X1, X2) = [(-4)(2.5) + (0)(-4.5) + (5)(-3.5) + (-1)(5.5)]/3 = -33/3 = -11

Covariance Matrix
• The covariance matrix is

  S = |  14  -11 |
      | -11   23 |

Step 3: Eigenvalues of the covariance matrix
• The characteristic equation of the covariance matrix is

  det(S - λI) = (14 - λ)(23 - λ) - (-11)² = λ² - 37λ + 201 = 0

• Solving the characteristic equation, we get

  λ1 = (37 + √565)/2 ≈ 30.3849,   λ2 = (37 - √565)/2 ≈ 6.6151

Step 4: Computation of the eigenvectors
• To find the first principal component, we need only compute the eigenvector corresponding to the largest eigenvalue. In the present example, the largest eigenvalue is λ1, so we compute the eigenvector corresponding to λ1.
• The eigenvector corresponding to λ = λ1 is a vector U = (u1, u2).
• The eigenvector satisfies the following equation:

  (S - λ1 I) U = 0

• This is equivalent to the following two equations:

  (14 - λ1) u1 - 11 u2 = 0
  -11 u1 + (23 - λ1) u2 = 0

• Using the theory of systems of linear equations, we note that these equations are not independent and the solutions are given by

  u1 = 11t,   u2 = (14 - λ1) t

Taking t = 1, we get an eigenvector corresponding to λ1, that is,

  U1 = (11, 14 - λ1) ≈ (11, -16.3849)

• To find a unit eigenvector, we compute the length of U1, which is given by

  |U1| = √(11² + (14 - λ1)²) ≈ √389.46 ≈ 19.7349

• Therefore, a unit eigenvector corresponding to λ1 is

  e1 = (11/19.7349, -16.3849/19.7349) ≈ (0.5574, -0.8303)

• By carrying out similar computations (taking t = -1; the sign of an eigenvector is arbitrary, and this choice matches the transformed data below), the unit eigenvector e2 corresponding to the eigenvalue λ = λ2 can be shown to be

  e2 ≈ (-0.8303, -0.5574)

Step 5: Computation of the first principal components
Let Xk = (x1k, x2k) be the k-th sample in the above data set. The first principal component of this example is obtained by projecting the mean-centred sample onto e1:

  P1k = e1 · (Xk - X̄) = 0.5574 (x1k - 8) - 0.8303 (x2k - 8.5)

• For example, the first principal component corresponding to the first sample, P11, is calculated as follows:

  P11 = 0.5574 (4 - 8) - 0.8303 (11 - 8.5) ≈ -2.2296 - 2.0758 ≈ -4.3052

• Similarly, we can calculate the remaining elements P12, P13, P14 of the first principal component.
• Repeating steps 4 and 5 with the eigenvalue λ2, we get the second principal component.
• Thus the original data set was

  Feature 1 (e.g. WMC):  4   8  13   7
  Feature 2 (e.g. LOC): 11   4   5  14

The transformed data set after PCA is

  PC1: -4.30518692   3.73612869   5.69282771  -5.12376947
  PC2:  1.92752836   2.50825486  -2.20038921  -2.23539401

Covariance matrix of the transformed data set
• The covariance matrix of (PC1, PC2) is diagonal, with the eigenvalues on the diagonal; off-diagonal entries are zero, confirming that the principal components are uncorrelated:

  | 30.3849   0      |
  |  0        6.6151 |
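
Checking the steps numerically
The five steps above can be verified end to end in a few lines of code. The slides name no tooling, so the following is a minimal sketch assuming Python with NumPy: it computes the means, the sample covariance matrix, the eigenvalues and unit eigenvectors, and the projection onto the principal components. Note that an eigensolver is free to return an eigenvector with the opposite sign, so individual columns of the output may be sign-flipped relative to the hand-worked table.

  import numpy as np

  # The four samples from the example, one row per sample;
  # columns are Feature 1 (e.g. WMC) and Feature 2 (e.g. LOC).
  X = np.array([[4.0, 11.0],
                [8.0,  4.0],
                [13.0, 5.0],
                [7.0, 14.0]])

  # Step 1: mean of each feature.
  mean = X.mean(axis=0)                # -> [8.0, 8.5]

  # Step 2: sample covariance matrix (np.cov divides by n - 1 by default).
  S = np.cov(X, rowvar=False)          # -> [[14., -11.], [-11., 23.]]

  # Steps 3-4: eigenvalues and unit eigenvectors of the symmetric matrix S.
  # np.linalg.eigh returns eigenvalues in ascending order and unit-length
  # eigenvectors as columns; re-order so the largest eigenvalue comes first.
  eigvals, eigvecs = np.linalg.eigh(S)
  order = np.argsort(eigvals)[::-1]
  eigvals = eigvals[order]             # -> [30.3849..., 6.6151...]
  eigvecs = eigvecs[:, order]

  # Step 5: project the mean-centred data onto the eigenvectors.
  Z = (X - mean) @ eigvecs

  print("means:", mean)
  print("covariance matrix:\n", S)
  print("eigenvalues:", eigvals)
  print("unit eigenvectors (columns):\n", eigvecs)
  print("principal components:\n", Z)

  # The covariance matrix of the transformed data is diagonal, with the
  # eigenvalues on the diagonal: the components are uncorrelated.
  print("covariance of PCs:\n", np.cov(Z, rowvar=False))

Running this reproduces the hand computation: the covariance matrix [[14, -11], [-11, 23]], eigenvalues ≈ 30.3849 and 6.6151, the transformed samples (up to the sign of each column), and a covariance matrix of the transformed data that is diagonal up to floating-point error.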
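In practice, one would rarely hand-roll these steps. As an illustration (the slides do not mention any library), scikit-learn's PCA wraps the whole pipeline in two calls:

  import numpy as np
  from sklearn.decomposition import PCA

  X = np.array([[4, 11], [8, 4], [13, 5], [7, 14]], dtype=float)

  pca = PCA(n_components=2)
  Z = pca.fit_transform(X)         # samples expressed in the PC basis

  print(Z)                         # matches the hand computation up to sign
  print(pca.explained_variance_)   # ~ [30.385, 6.615], the eigenvalues
  print(pca.components_)           # unit eigenvectors, one per row

Here explained_variance_ uses the same n - 1 divisor as the hand computation; as before, the sign of each component is arbitrary and may differ from the worked table.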