Learning A to Z of R Programming

TABLE OF CONTENTS

Unit 1: Getting Started with R
    Getting Started
    R Objects and Data Types
    R Operators
    Decision Making in R
    LOOPS in R
    STRINGS in R
Unit 2: FUNCTIONS in R
    Built-in Function
    User-defined Function
Unit 3: VECTORS, LISTS, ARRAYS & MATRICES
    VECTORS
    LISTS
    MATRICES
    ARRAYS
    Factors
    Data Frames
Unit 4: Working with Files
    Working with Excel Files
Unit 5: Working with MSAccess Database
Unit 6: Working with Graphs
Unit 7: Overview of R Packages
Unit 8: Programming Examples

Unit 1: Getting Started with R

GETTING STARTED

R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS.

Why R? It's free, open source, powerful and highly extensible. "You have a lot of prepackaged stuff that's already available, so you're standing on the shoulders of giants," Google's chief economist told The New York Times back in 2009. There can be little doubt that interest in the R statistics language, especially for data analysis, is soaring.

Downloading R

The primary R system is available from the Comprehensive R Archive Network, also known as CRAN. CRAN also hosts many add-on packages that can be used to extend the functionality of R. The “base” R system that you download from CRAN is available for Linux, Windows and Mac, and as source code.

Website to download: https://cran.r-project.org/mirrors.html

The R Foundation for Statistical Computing

The R Foundation is a not-for-profit organization working in the public interest.
It was founded by the members of the R Development Core Team in order to:
• Provide support for the R project and other innovations in statistical computing. We believe that R has become a mature and valuable tool and we would like to ensure its continued development and the development of future innovations in software for statistical and computational research.
• Provide a reference point for individuals, institutions or commercial enterprises that want to support or interact with the R development community.
• Hold and administer the copyright of R software and documentation.

R functionality is divided into a number of packages:
• The “base” R system contains, among other things, the base package which is required to run R and contains the most fundamental functions.
• The other packages contained in the “base” system include utils, stats, datasets, graphics, grDevices, grid, methods, tools, parallel, compiler, splines, tcltk, stats4.
• There are also “Recommended” packages: boot, class, cluster, codetools, foreign, KernSmooth, lattice, mgcv, nlme, rpart, survival, MASS, spatial, nnet, Matrix.

When you download a fresh installation of R from CRAN, you get all of the above, which represents a substantial amount of functionality. However, there are many other packages available:
• There are over 4000 packages on CRAN that have been developed by users and programmers around the world.
• People often make packages available on their personal websites; there is no reliable way to keep track of how many packages are available in this fashion.
• There are a number of packages being developed on repositories like GitHub and BitBucket but there is no reliable listing of all these packages.

More details can be found at the R foundation website: https://www.r-project.org/

Let’s create our first R Program

Launch R. In Windows you can launch R software using the option shown below under Program Files.
Figure 1: Launch R Programming Window

After launching the R interpreter, you will get a prompt > where you can start typing your program. Let’s try our first program. In the Hello World code below, vString is a variable which stores the string value “Hello World”, and in the next line we print the value of the vString variable. Please note that R commands are case sensitive; print is the valid command to print a value on the screen.

Figure 2: Hello World

The # character starts a comment in the program; R ignores the rest of that line.

Figure 3: R Programming

R Basic Syntax

Download and Install R software

When R is run, it launches the R interpreter. You will get a prompt where you can start typing your programs. In the example, the first statement defines a string variable myString, to which we assign the string "Hello, World!", and the next statement uses print() to print the value stored in myString.

R Script File

Usually, you will write your programs in script files and then execute those scripts at the command prompt with the help of the R script runner, Rscript. So let's start by writing the code in a text file called test.R. Save the code in test.R and execute it at the Linux command prompt. Even if you are using Windows or another system, the syntax remains the same. On Windows, open a command prompt and browse to the directory where R.exe/Rscript.exe is installed.

Run -> Rscript filename.R (filename.R is the name of the file that contains the R program, including its path.)

We will use RStudio for the rest of our course examples. Download and install RStudio.

R OBJECTS AND DATA TYPES

Generally, while programming in any programming language, you need to use various variables to store information. Variables are nothing but reserved memory locations to store values.
This means that when you create a variable, you reserve some space in memory. In contrast to other programming languages like C and Java, in R variables are not declared with a data type. Instead, variables are assigned R-objects, and the data type of the R-object becomes the data type of the variable.

R has five basic or “atomic” classes of objects:
• character
• numeric (real numbers)
• integer
• complex
• logical (TRUE/FALSE)

There are many types of R-objects. The frequently used ones are:
• Vectors
• Lists
• Matrices
• Arrays
• Factors
• Data Frames

The simplest of these objects is the vector. Atomic vectors come in six data types, also termed the six classes of vectors: the five classes listed above plus raw. The other R-objects are built upon the atomic vectors.

Figure 4: Data Types in R

Creating Vectors

The c() function can be used to create vectors of objects by concatenating things together. When you want to create a vector with more than one element, use c(), which combines the elements into a vector. You can also use the vector() function to initialize vectors.

Figure 5: Vector example

Lists, Matrices, Arrays

A list is an R-object which can contain many different types of elements, such as vectors, functions and even other lists. A matrix is a two-dimensional rectangular data set. It can be created by passing a vector to the matrix() function. While matrices are confined to two dimensions, arrays can have any number of dimensions. The array() function takes a dim attribute which creates the required number of dimensions. In the example below we create an array with two elements, each of which is a 3x3 matrix.

Factors

Factors are used to represent categorical data and can be unordered or ordered. One can think of a factor as an integer vector where each integer has a label. Factors are important in statistical modeling and are treated specially by modelling functions like lm() and glm().
Using factors with labels is better than using integers because factors are self-describing. A variable that has the values “Male” and “Female” is clearer than a variable that has the values 1 and 2. Factor objects can be created with the factor() function.

Figure 6: List, Matrix and Array example

Figure 7: Factors example

Data Frames

Data frames are tabular data objects. Unlike in a matrix, each column of a data frame can contain a different mode of data: the first column can be numeric, the second character and the third logical. A data frame is a list of vectors of equal length. Data frames are created using the data.frame() function.

Figure 8: Data frames example

Mixing Objects

There are occasions when different classes of R objects get mixed together. Sometimes this happens by accident but it can also happen on purpose. In implicit coercion, R tries to find a way to represent all of the objects in the vector in a reasonable fashion. Sometimes this does exactly what you want and sometimes not. For example, combining a numeric object with a character object will create a character vector, because numbers can usually be easily represented as strings.

Figure 9: Mixing and Missing Objects examples

R OPERATORS

We have the following types of operators in R programming:
• Arithmetic Operators
• Relational Operators
• Logical Operators
• Assignment Operators
• Miscellaneous Operators

Arithmetic Operators

Figure 10: Assignment Operators

Relational Operators

Operator   Meaning
>          Checks if each element of the first vector is greater than the corresponding element of the second vector.
<          Checks if each element of the first vector is less than the corresponding element of the second vector.
==         Checks if each element of the first vector is equal to the corresponding element of the second vector.
<=         Checks if each element of the first vector is less than or equal to the corresponding element of the second vector.
>=         Checks if each element of the first vector is greater than or equal to the corresponding element of the second vector.
!=         Checks if each element of the first vector is unequal to the corresponding element of the second vector.

Logical Operators

Operator   Meaning
&          Element-wise logical AND. It combines each element of the first vector with the corresponding element of the second vector and gives TRUE if both elements are TRUE.
|          Element-wise logical OR. It combines each element of the first vector with the corresponding element of the second vector and gives TRUE if at least one of the elements is TRUE.
!          Logical NOT. It takes each element of the vector and gives the opposite logical value.

The logical operators && (logical AND) and || (logical OR) work on single logical values rather than element-wise and return a single TRUE or FALSE. Older versions of R silently used only the first element of longer vectors; since R 4.3.0, supplying a vector of length greater than one is an error. Readers are encouraged to practice all the operators and see the output.

Assignment Operators

A variable in R can store an atomic vector, a group of atomic vectors or a combination of many R objects. Variables can be assigned values using the leftward (<-), rightward (->) and equal-to (=) operators. The values of variables can be printed using the print() or cat() function. The cat() function combines multiple items into one continuous print output.

In R, a variable itself is not declared with any data type; rather, it gets the data type of the R-object assigned to it. R is therefore called a dynamically typed language, which means that the data type of the same variable can change again and again as it is reassigned in a program.

Figure 11: Variable assignment

Figure 12: Listing and deleting variables

Miscellaneous Operators

Operator   Meaning
:          Colon operator. It creates a series of numbers in sequence for a vector.
%in%       Identifies whether an element belongs to a vector.
%*%        Matrix multiplication. It multiplies two conformable matrices, for example a matrix with its transpose.

DECISION MAKING IN R

R provides the following types of decision making statements:

Statement           Description
if statement        An if statement consists of a Boolean expression followed by one or more statements.
if else statement   An if statement can be followed by an optional else statement, which executes when the Boolean expression is false.
switch statement    A switch statement allows a variable to be tested for equality against a list of values.

Figure 13: Example of If Statement

Figure 14: Example of If Else Statement

Multiple if else

An if statement can be followed by an optional else if...else statement, which is very useful for testing several conditions in a single if...else if statement.

When using if, else if and else statements, there are a few points to keep in mind:
• An if can have zero or one else, and it must come after any else if's.
• An if can have zero to many else if's, and they must come before the else.
• Once an else if succeeds, none of the remaining else if's or else's will be tested.

SWITCH statement

A switch statement allows a variable to be tested for equality against a list of values. Each value is called a case, and the variable being switched on is checked against each case.

The following rules apply to a switch statement:
• If the value of expression is not a character string, it is coerced to integer.
• You can supply any number of cases to switch(). Each case is passed as a further argument, written as name = value when matching on a character string.
• If the value of the integer is between 1 and nargs()-1 (the maximum number of arguments), then the corresponding case is evaluated and the result returned.
• If expression evaluates to a character string, then that string is matched (exactly) to the names of the elements.
• If there is more than one match, the first matching element is returned.
• There is no default keyword. Instead, in the case of no match, if there is an unnamed element of ... its value is returned. (If there is more than one such argument, an error is returned.)

LOOPS IN R

Loops are used to repeat a block of code. Being able to have your program repeatedly execute a block of code is one of the most basic but useful tasks in programming: a loop lets you write a very simple statement to produce a significantly greater result simply by repetition.

The R programming language provides the following kinds of loop to handle looping requirements:

Loop Type     Description
REPEAT loop   Executes a block of statements again and again until a break statement is reached.
WHILE loop    Repeats a statement or group of statements while a given condition is true. It tests the condition before executing the loop body.
FOR loop      Executes a block of statements once for each element of a vector or list.

Loop Control Statements

Control Type      Description
BREAK statement   Terminates the loop statement and transfers execution to the statement immediately following the loop.
NEXT statement    Skips the rest of the current iteration and continues with the next iteration of the loop.

REPEAT loop

The repeat loop executes the same code again and again until a stop condition is met and break is called.

WHILE loop

The while loop executes the same code again and again for as long as its condition remains true.

FOR loop

A for loop is a repetition control structure that allows you to efficiently write a loop that needs to execute a specific number of times.
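The three loop types and the two control statements described above can be sketched as follows. This is a minimal illustration only; the variable names (count, total, evens) are made up for the example.

```r
# repeat: runs until an explicit break is reached.
count <- 0
repeat {
  count <- count + 1
  if (count >= 3) {
    break              # leave the loop once the stop condition holds
  }
}

# while: tests the condition before each pass through the body.
total <- 0
i <- 1
while (i <= 5) {
  total <- total + i   # accumulates 1 + 2 + 3 + 4 + 5
  i <- i + 1
}

# for: runs the body once per element of a vector.
# next skips the rest of the current iteration.
evens <- c()
for (v in 1:6) {
  if (v %% 2 != 0) {
    next               # odd value: go straight to the next element
  }
  evens <- c(evens, v)
}

print(count)
print(total)
print(evens)
```

Note that repeat has no condition of its own, so forgetting the break produces an infinite loop.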
STRINGS IN R

Any value written within a pair of single quotes or double quotes in R is treated as a string. Internally R stores every string within double quotes, even when you create it with single quotes.

Rules Applied in String Construction
• The quotes at the beginning and end of a string should be both double quotes or both single quotes. They cannot be mixed.
• Double quotes can be inserted into a string starting and ending with single quotes.
• A single quote can be inserted into a string starting and ending with double quotes.
• Double quotes cannot be inserted into a string starting and ending with double quotes, unless each is escaped with a backslash.
• A single quote cannot be inserted into a string starting and ending with single quotes, unless it is escaped with a backslash.

Formatting numbers & strings - format() function

Numbers and strings can be formatted to a specific style using the format() function. Its basic form is format(x, digits, nsmall, scientific, width, justify).

Following is the description of the parameters used:
• x is the vector input.
• digits is the total number of digits displayed.
• nsmall is the minimum number of digits to the right of the decimal point.
• scientific is set to TRUE to display scientific notation.
• width indicates the minimum width to be displayed by padding blanks in the beginning.
• justify is the display of the string to the left, right or centre (the accepted values are "left", "right", "centre" and "none").

Other functions

Function                   Functionality
nchar(x)                   Counts the number of characters, including spaces, in a string.
toupper(x) / tolower(x)    Change the case of the characters of a string.
substring(x,first,last)    Extracts part of a string.

Unit 2: FUNCTIONS in R

A function is a set of statements organized together to perform a specific task. R has a large number of in-built functions and the user can create their own functions.
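As a first taste of both kinds of function, here is a minimal sketch. seq(), mean() and paste() are standard built-ins; the name power_sum and its arguments are hypothetical, chosen only for illustration. The default argument, calling by position or by name, and the return value are explained in the rest of this unit.

```r
# Built-in functions can be called directly, without defining them first.
s <- seq(2, 10, by = 2)          # the sequence 2 4 6 8 10
m <- mean(s)
label <- paste("mean is", m)

# A user-defined function (hypothetical name) with a default value for p.
power_sum <- function(x, p = 2) {
  sum(x^p)                       # the last evaluated expression is returned
}

by_position <- power_sum(c(1, 2, 3))             # p falls back to its default, 2
by_name     <- power_sum(p = 3, x = c(1, 2, 3))  # named arguments, in any order

print(label)
print(by_position)
print(by_name)
```

Naming the arguments in a call makes the order irrelevant, which helps readability when a function takes several parameters.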
The different parts of a function are:
• Function Name: This is the actual name of the function. It is stored in the R environment as an object with this name.
• Arguments: An argument is a placeholder. When a function is invoked, you pass a value to the argument. Arguments are optional; that is, a function may contain no arguments. Arguments can also have default values.
• Function Body: The function body contains a collection of statements that defines what the function does.
• Return Value: The return value of a function is the last expression in the function body to be evaluated.

BUILT-IN FUNCTION

R has many in-built functions which can be called directly in a program without defining them first. Simple examples of in-built functions are seq(), mean(), max(), sum(x) and paste(...).

USER-DEFINED FUNCTION

We can also create and use our own functions, referred to as user-defined functions. An R function is created by using the keyword function. The basic form of an R function definition is function_name <- function(arg_1, arg_2, ...) { function body }.

A function can be called with argument values supplied by position or by name, and arguments that have default values can be omitted from the call.

Lazy Evaluation of Function: Arguments to functions are evaluated lazily, which means they are evaluated only when needed by the function body.

Unit 3: VECTORS, LISTS, ARRAYS & MATRICES

VECTORS

Vectors are the most basic R data objects and there are six types of atomic vectors. They are logical, integer, double, complex, character and raw. Even when you write just one value in R, it becomes a vector of length 1 and belongs to one of the above vector types.

    # Atomic vector of type character.
    print("ABC")
    [1] "ABC"

    # Atomic vector of type double.
    print(12.5)
    [1] 12.5

    # Atomic vector of type integer.
    print(10L)
    [1] 10

    # Atomic vector of type logical.
    print(TRUE)
    [1] TRUE

    # Atomic vector of type complex.
    print(4+8i)
    [1] 4+8i

    # Atomic vector of type raw.
    print(charToRaw('hello'))
    [1] 68 65 6c 6c 6f

Multiple Elements Vector

Using colon operator with numeric data

    # Creating a sequence from 2 to 8.
    v <- 2:8
    print(v)
    [1] 2 3 4 5 6 7 8

    # Creating a sequence from 6.6 to 12.6.
    v <- 6.6:12.6
    print(v)
    [1] 6.6 7.6 8.6 9.6 10.6 11.6 12.6

    # If the final element specified does not belong to the sequence then it is discarded.
    v <- 3.8:11.4
    print(v)
    [1] 3.8 4.8 5.8 6.8 7.8 8.8 9.8 10.8

Using the seq() function

Syntax and example of using seq():

    # Create vector with elements from 5 to 9 incrementing by 0.4.
    print(seq(5, 9, by=0.4))
    [1] 5.0 5.4 5.8 6.2 6.6 7.0 7.4 7.8 8.2 8.6 9.0

Using the c() function

The non-character values are coerced to character type if one of the elements is a character. Syntax and example of using the c() function:

    # The logical and numeric values are converted to characters.
    x <- c('apple', 'red', 5, TRUE)
    print(x)
    [1] "apple" "red" "5" "TRUE"

Accessing Vector Elements

Elements of a vector are accessed using indexing. The [ ] brackets are used for indexing. Indexing starts with position 1. Giving a negative value in the index drops that element from the result. Logical values (TRUE and FALSE) can also be used for indexing. In a numeric index, 0 selects nothing, so an index made of 0s and 1s returns the first element once for each 1. Syntax and example:

    # Accessing vector elements using position.
    t <- c("Sun","Mon","Tue","Wed","Thurs","Fri","Sat")
    u <- t[c(2,3,6)]
    print(u)
    [1] "Mon" "Tue" "Fri"

    # Accessing vector elements using logical indexing.
    v <- t[c(TRUE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE)]
    print(v)
    [1] "Sun" "Fri"

    # Accessing vector elements using negative indexing.
    x <- t[c(-2,-5)]
    print(x)
    [1] "Sun" "Tue" "Wed" "Fri" "Sat"

    # Accessing vector elements using 0/1 indexing (each 0 is ignored; the 1 selects the first element).
    y <- t[c(0,0,0,0,0,0,1)]
    print(y)
    [1] "Sun"

Vector Manipulation

Vector Arithmetic: Two vectors of the same length can be added, subtracted, multiplied or divided, giving the result as a vector output. Syntax and example:

    # Create two vectors.
    v1 <- c(3,8,4,5,0,11)
    v2 <- c(4,11,0,8,1,2)

    # Vector addition.
    add.result <- v1+v2
    print(add.result)
    [1] 7 19 4 13 1 13

    # Vector subtraction.
    sub.result <- v1-v2
    print(sub.result)
    [1] -1 -3 4 -3 -1 9

    # Vector multiplication.
    multi.result <- v1*v2
    print(multi.result)
    [1] 12 88 0 40 0 22

    # Vector division.
    divi.result <- v1/v2
    print(divi.result)
    [1] 0.7500000 0.7272727 Inf 0.6250000 0.0000000 5.5000000

Vector Element Recycling

If we apply arithmetic operations to two vectors of unequal length, then the elements of the shorter vector are recycled to complete the operations. Syntax and example:

    v1 <- c(3,8,4,5,0,11)
    v2 <- c(4,11)
    # v2 is recycled to c(4,11,4,11,4,11)

    add.result <- v1+v2
    print(add.result)
    [1] 7 19 8 16 4 22

    sub.result <- v1-v2
    print(sub.result)
    [1] -1 -3 0 -6 -4 0

Vector Element Sorting

Elements in a vector can be sorted using the sort() function. Syntax and example:

    v <- c(3,8,4,5,0,11, -9, 304)

    # Sort the elements of the vector.
    sort.result <- sort(v)
    print(sort.result)
    [1] -9 0 3 4 5 8 11 304

    # Sort the elements in the reverse order.
    revsort.result <- sort(v, decreasing = TRUE)
    print(revsort.result)
    [1] 304 11 8 5 4 3 0 -9

    # Sorting character vectors.
    v <- c("Red","Blue","yellow","violet")
    sort.result <- sort(v)
    print(sort.result)
    [1] "Blue" "Red" "violet" "yellow"

    # Sorting character vectors in reverse order.
    revsort.result <- sort(v, decreasing = TRUE)
    print(revsort.result)
    [1] "yellow" "violet" "Red" "Blue"

LISTS

Lists are R objects which can contain elements of different types, such as numbers, strings, vectors and even other lists. A list can also contain a matrix or a function as an element. A list is created using the list() function. Syntax and example:

    # Create a list containing strings, numbers, a vector and a logical value.
    list_data <- list("Red", "Green", c(21,32,11), TRUE, 51.23, 119.1)
    print(list_data)
    [[1]]
    [1] "Red"

    [[2]]
    [1] "Green"

    [[3]]
    [1] 21 32 11

    [[4]]
    [1] TRUE

    [[5]]
    [1] 51.23

    [[6]]
    [1] 119.1

Naming List Elements

The list elements can be given names and they can be accessed using these names.

Manipulating List Elements

We can add, delete and update list elements. New elements are most simply added at the end of a list; any existing element can be updated, or removed by assigning NULL to it.

Merging Lists

You can merge many lists into one list by placing all the lists inside one c() call; wrapping them in list() instead produces a list whose elements are the original lists.

Converting Lists to Vector

A list can be converted to a vector so that the elements of the vector can be used for further manipulation. All the arithmetic operations on vectors can be applied after the list is converted into a vector. To do this conversion, we use the unlist() function. It takes the list as input and produces a vector.

MATRICES

Matrices are R objects in which the elements are arranged in a two-dimensional format. They contain elements of the same atomic type. Although a matrix can hold any atomic type, matrices of numeric elements are the ones most often used in mathematical calculations. A matrix is created using the matrix() function, whose basic form is matrix(data, nrow, ncol, byrow, dimnames).

Parameters used:
• data is the input vector which becomes the data elements of the matrix.
• nrow is the number of rows to be created.
• ncol is the number of columns to be created.
• byrow is a logical value. If TRUE then the input vector elements are arranged by row.
• dimnames is the names assigned to the rows and columns.

    # Elements are arranged sequentially by row.
    M <- matrix(c(3:14), nrow=4, byrow=TRUE)
    print(M)

    # Elements are arranged sequentially by column.
    N <- matrix(c(3:14), nrow=4, byrow=FALSE)
    print(N)

    # Define the column and row names.
    rownames <- c("row1", "row2", "row3", "row4")
    colnames <- c("col1", "col2", "col3")

    # Accessing Elements of a Matrix
    # Access the element at 3rd column and 1st row.
    print(N[1,3])

    # Access the element at 2nd column and 4th row.
    print(N[4,2])

    # Access only the 2nd row.
    print(N[2,])

    # Access only the 3rd column.
    print(N[,3])

Matrix Computations

Various mathematical operations are performed on matrices using the R operators. The result of the operation is also a matrix. The dimensions (number of rows and columns) must be the same for the matrices involved in the operation.

    # Create two 2x3 matrices.
    matrix1 <- matrix(c(3, 9, -1, 4, 2, 6), nrow=2)
    print(matrix1)
    matrix2 <- matrix(c(5, 2, 0, 9, 3, 4), nrow=2)
    print(matrix2)

    # Add the matrices.
    result <- matrix1 + matrix2
    cat("Result of addition","\n")
    print(result)

    # Subtract the matrices.
    result <- matrix1 - matrix2
    cat("Result of subtraction","\n")
    print(result)

Matrix Multiplication & Division

    # Create two 2x3 matrices.
    matrix1 <- matrix(c(3, 9, -1, 4, 2, 6), nrow=2)
    print(matrix1)
    matrix2 <- matrix(c(5, 2, 0, 9, 3, 4), nrow=2)
    print(matrix2)

    # Multiply the matrices element by element (use %*% for true matrix multiplication).
    result <- matrix1 * matrix2
    cat("Result of multiplication","\n")
    print(result)

    # Divide the matrices element by element.
    result <- matrix1 / matrix2
    cat("Result of division","\n")
    print(result)

ARRAYS

Arrays are R data objects which can store data in more than two dimensions. For example, if we create an array of dimension (2, 3, 4) then it creates 4 rectangular matrices, each with 2 rows and 3 columns. Arrays can store only one data type.

An array is created using the array() function. It takes vectors as input and uses the values in the dim parameter to create an array.

    # Create two vectors of different lengths.
    vector1 <- c(5,9,3)
    vector2 <- c(10,11,12,13,14,15)

    # Take these vectors as input to the array.
    result <- array(c(vector1,vector2),dim=c(3,3,2))
    print(result)

Naming Columns and Rows: We can give names to the rows, columns and matrices in the array by using the dimnames parameter.

    # Create two vectors of different lengths.
    vector1 <- c(5,9,3)
    vector2 <- c(10,11,12,13,14,15)
    column.names <- c("COL1","COL2","COL3")
    row.names <- c("ROW1","ROW2","ROW3")
    matrix.names <- c("Matrix1","Matrix2")

    # Take these vectors as input to the array.
    # dimnames follows the order of dim: rows first, then columns, then matrices.
    result <- array(c(vector1,vector2),dim=c(3,3,2),dimnames = list(row.names,column.names,matrix.names))
    print(result)

Accessing Array Elements

    # Create two vectors of different lengths.
    vector1 <- c(5,9,3)
    vector2 <- c(10,11,12,13,14,15)
    column.names <- c("COL1","COL2","COL3")
    row.names <- c("ROW1","ROW2","ROW3")
    matrix.names <- c("Matrix1","Matrix2")

    # Take these vectors as input to the array.
    result <- array(c(vector1,vector2),dim=c(3,3,2),dimnames = list(row.names,column.names,matrix.names))

    # Print the third row of the second matrix of the array.
    print(result[3,,2])

    # Print the element in the 1st row and 3rd column of the 1st matrix.
    print(result[1,3,1])

    # Print the 2nd matrix.
    print(result[,,2])

Manipulating Array Elements

As an array is made up of matrices in multiple dimensions, operations on the elements of an array are carried out by accessing elements of the matrices.

    # Create two vectors of different lengths.
    vector1 <- c(5,9,3)
    vector2 <- c(10,11,12,13,14,15)

    # Take these vectors as input to the array.
    array1 <- array(c(vector1,vector2),dim=c(3,3,2))

    # Create two more vectors of different lengths.
    vector3 <- c(9,1,0)
    vector4 <- c(6,0,11,3,14,1,2,6,9)
    array2 <- array(c(vector3,vector4),dim=c(3,3,2))

    # Create matrices from these arrays.
    matrix1 <- array1[,,2]
    matrix2 <- array2[,,2]

    # Add the matrices.
    result <- matrix1+matrix2
    print(result)

Calculations Across Array Elements: We can do calculations across the elements in an array using the apply() function.
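A minimal sketch of how the second argument of apply() selects the dimension to work over may help here. The array arr and the result names are made up for the example.

```r
# A 2 x 3 x 2 array filled with the values 1 to 12 (illustrative data).
arr <- array(1:12, dim = c(2, 3, 2))

# 1 = rows: one sum per row, taken over all columns and both matrices.
row_sums <- apply(arr, 1, sum)

# 2 = columns: one sum per column.
col_sums <- apply(arr, 2, sum)

# 3 = third dimension: one sum per matrix.
layer_sums <- apply(arr, 3, sum)

# c(1, 2) = each row/column cell, summed over the matrices; gives a 2 x 3 matrix.
cell_sums <- apply(arr, c(1, 2), sum)

print(row_sums)
print(col_sums)
print(layer_sums)
print(cell_sums)
```

Whatever the choice, the four results are different groupings of the same twelve values, so each of them adds up to the same grand total.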
The basic form is apply(x, margin, fun).

Parameters used:
• x is an array.
• margin gives the dimension(s) over which the function is applied: 1 for rows, 2 for columns, c(1,2) for both.
• fun is the function to be applied across the elements of the array.

We use the apply() function below to calculate the sum of the elements in the rows of an array across all the matrices.

    # Create two vectors of different lengths.
    vector1 <- c(5,9,3)
    vector2 <- c(10,11,12,13,14,15)

    # Take these vectors as input to the array.
    new.array <- array(c(vector1,vector2),dim=c(3,3,2))
    print(new.array)

    # Use apply to calculate the sum of the rows across all the matrices.
    result <- apply(new.array, c(1), sum)
    print(result)

Array indexing. Subsections of an array

Individual elements of an array may be referenced by giving the name of the array followed by the subscripts in square brackets, separated by commas.

More generally, subsections of an array may be specified by giving a sequence of index vectors in place of subscripts; however, if any index position is given an empty index vector, then the full range of that subscript is taken.

For example, if a is an array with dimension vector c(3,4,2), then a[2,,] is a 4 × 2 array with dimension vector c(4,2) and data vector containing the values

    c(a[2,1,1], a[2,2,1], a[2,3,1], a[2,4,1],
      a[2,1,2], a[2,2,2], a[2,3,2], a[2,4,2])

in that order. a[,,] stands for the entire array, which is the same as omitting the subscripts entirely and using a alone.

For any array, say Z, the dimension vector may be referenced explicitly as dim(Z) (on either side of an assignment).

Also, if an array name is given with just one subscript or index vector, then only the corresponding values of the data vector are used; in this case the dimension vector is ignored. This is not the case, however, if the single index is not a vector but itself an array.

FACTORS

Factors are the data objects which are used to categorize data and store it as levels. They can store both strings and integers.
They are useful in columns that have a limited number of unique values, such as "Male" and "Female" or TRUE and FALSE. They are useful in data analysis for statistical modeling.

A factor is a vector object used to specify a discrete classification (grouping) of the components of other vectors of the same length. R provides both ordered and unordered factors. While the “real” application of factors is with model formulae (see Section 11.1.1 [Contrasts], page 53), we here look at a specific example.

4.1 A specific example
Suppose, for example, we have a sample of 30 tax accountants from all the states and territories of Australia and their individual state of origin is specified by a character vector of state mnemonics as

    > state <- c("tas", "sa", "qld", "nsw", "nsw", "nt", "wa", "wa",
                 "qld", "vic", "nsw", "vic", "qld", "qld", "sa", "tas",
                 "sa", "nt", "wa", "vic", "qld", "nsw", "nsw", "wa",
                 "sa", "act", "nsw", "vic", "vic", "act")

Notice that in the case of a character vector, “sorted” means sorted in alphabetical order.

A factor is similarly created using the factor() function:

    > statef <- factor(state)

The print() function handles factors slightly differently from other objects:

    > statef
     [1] tas sa  qld nsw nsw nt  wa  wa  qld vic nsw vic qld qld sa
    [16] tas sa  nt  wa  vic qld nsw nsw wa  sa  act nsw vic vic act
    Levels: act nsw nt qld sa tas vic wa

To find out the levels of a factor the function levels() can be used.
    > levels(statef)
    [1] "act" "nsw" "nt" "qld" "sa" "tas" "vic" "wa"

4.2 The function tapply() and ragged arrays
To continue the previous example, suppose we have the incomes of the same tax accountants in another vector (in suitably large units of money)

    > incomes <- c(60, 49, 40, 61, 64, 60, 59, 54, 62, 69, 70, 42, 56,
                   61, 61, 61, 58, 51, 48, 65, 49, 49, 41, 48, 52, 46,
                   59, 46, 58, 43)

To calculate the sample mean income for each state we can now use the special function tapply():

    > incmeans <- tapply(incomes, statef, mean)

giving a means vector with the components labelled by the levels

       act    nsw     nt    qld     sa    tas    vic     wa
    44.500 57.333 55.500 53.600 55.000 60.500 56.000 52.250

The function tapply() is used to apply a function, here mean(), to each group of components of the first argument, here incomes, defined by the levels of the second component, here statef, as if they were separate vector structures. The result is a structure of the same length as the levels attribute of the factor containing the results. The reader should consult the help document for more details.

Suppose further we needed to calculate the standard errors of the state income means. To do this we need to write an R function to calculate the standard error for any given vector. Since there is a built-in function var() to calculate the sample variance, such a function is a very simple one-liner, specified by the assignment:

    > stdError <- function(x) sqrt(var(x)/length(x))

(Writing functions will be considered later in Chapter 10 [Writing your own functions], page 42. Note that R’s built-in function sd() is something different.) After this assignment, the standard errors are calculated by

    > incster <- tapply(incomes, statef, stdError)

and the values calculated are then

    > incster
       act    nsw  nt    qld     sa tas   vic     wa
       1.5 4.3102 4.5 4.1061 2.7386 0.5 5.244 2.6575

As an exercise you may care to find the usual 95% confidence limits for the state mean incomes.
To do this you could use tapply() once more with the length() function to find the sample sizes, and the qt() function to find the percentage points of the appropriate t-distributions. (You could also investigate R’s facilities for t-tests.)

The function tapply() can also be used to handle more complicated indexing of a vector by multiple categories. For example, we might wish to split the tax accountants by both state and sex. However in this simple instance (just one factor) what happens can be thought of as follows. The values in the vector are collected into groups corresponding to the distinct entries in the factor. The function is then applied to each of these groups individually. The value is a vector of function results, labelled by the levels attribute of the factor.

The combination of a vector and a labelling factor is an example of what is sometimes called a ragged array, since the subclass sizes are possibly irregular. When the subclass sizes are all the same the indexing may be done implicitly and much more efficiently, as we see in the next section.

4.3 Ordered factors
The levels of factors are stored in alphabetical order, or in the order they were specified to factor if they were specified explicitly.

Sometimes the levels will have a natural ordering that we want to record and want our statistical analysis to make use of. The ordered() function creates such ordered factors but is otherwise identical to factor. For most purposes the only difference between ordered and unordered factors is that the former are printed showing the ordering of the levels, but the contrasts generated for them in fitting linear models are different.

Factors are created using the factor() function by taking a vector as input. Factors are categorical variables that are super useful in summary statistics, plots, and regressions. They basically act like dummy variables that R codes for you.
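The difference between unordered and ordered factors described above can be seen in a short sketch. The vector sizes and its level order are hypothetical, chosen only for illustration; factor(), ordered(), levels() and as.integer() are base R functions.

```r
# Hypothetical values; not taken from the examples in this guide.
sizes <- c("medium", "small", "large", "small", "medium")

# An unordered factor: levels default to alphabetical order.
sizes_f <- factor(sizes)
print(levels(sizes_f))

# An ordered factor with an explicitly specified level order.
sizes_o <- ordered(sizes, levels = c("small", "medium", "large"))
print(sizes_o)

# Internally each value is stored as the integer position of its level.
print(as.integer(sizes_o))

# Ordered factors support comparisons between levels.
print(sizes_o < "large")
```

Printing sizes_o shows the levels joined by "<", which is how R displays the recorded ordering.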
So, let’s start off with some data:

and let’s check out what kinds of variables we have:

so we see that Race is a factor variable with three levels. I can see all the levels this way:

What this means is that R groups statistics by these levels. Internally, R stores the integer values 1, 2, and 3, and maps the character strings (in alphabetical order, unless I reorder) to these values, i.e. 1=Black, 2=Hispanic, and 3=White. Now if I were to do a summary of this variable, it shows me the counts for each category, as below. R won’t let me do a mean or any other statistic of a factor variable other than a count, so keep that in mind. But you can always change your factor to be numeric.

If I do a plot of age on race, I get a boxplot from the normal plot command since that is what makes sense for a categorical variable:

    plot(mydata$Age~mydata$Race, xlab="Race", ylab="Age", main="Boxplots of Age by Race")

    # Create a vector as input.
    data <- c("East","West","East","North","North","East","West","West","West","East","North")
    print(data)
    print(is.factor(data))

    # Apply the factor function.
    factor_data <- factor(data)
    print(factor_data)
    print(is.factor(factor_data))

Factors in Data Frame
On creating a data frame with a column of text data, R versions before 4.0.0 treat the text column as categorical data and create a factor from it by default. From R 4.0.0 onward, text columns stay as character vectors unless stringsAsFactors = TRUE is passed to data.frame(), as in the example below.

    # Create the vectors for data frame.
    height <- c(132,151,162,139,166,147,122)
    weight <- c(48,49,66,53,67,52,40)
    gender <- c("male","male","female","female","male","female","male")

    # Create the data frame, converting text columns to factors.
    input_data <- data.frame(height,weight,gender,stringsAsFactors = TRUE)
    print(input_data)

    # Test if the gender column is a factor.
    print(is.factor(input_data$gender))

    # Print the gender column to see the levels.
    print(input_data$gender)

Changing the Order of Levels:
The order of the levels in a factor can be changed by applying the factor function again with the new order of the levels.
    data <- c("East","West","East","North","North","East","West","West","West","East","North")

    # Create the factors.
    factor_data <- factor(data)
    print(factor_data)

    # Apply the factor function with the required order of the levels.
    new_order_data <- factor(factor_data,levels = c("East","West","North"))
    print(new_order_data)

Generating Factor Levels:
We can generate factor levels by using the gl() function. It takes two integers as input, which indicate the number of levels and how many times each level is repeated.

Syntax: gl(n, k, labels)

Following is the description of the parameters used:
 n is an integer giving the number of levels.
 k is an integer giving the number of replications.
 labels is a vector of labels for the resulting factor levels.

    v <- gl(3, 4, labels = c("Tampa", "Seattle","Boston"))
    print(v)

DATA FRAMES
A data frame is a table or a two-dimensional array-like structure in which each column contains values of one variable and each row contains one set of values from each column.

Following are the characteristics of a data frame:
 The column names should be non-empty.
 The row names should be unique.
 The data stored in a data frame can be of numeric, factor or character type.
 Each column should contain the same number of data items.

    # Create the data frame.
    emp.data <- data.frame(
      emp_id = c (1:5),
      emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
      salary = c(623.3,515.2,611.0,729.0,843.25),
      start_date = as.Date(c("2012-01-01","2013-09-23","2014-11-15","2014-05-11","2015-03-27")),
      stringsAsFactors=FALSE
    )

    # Print the data frame.
    print(emp.data)

Get the Structure of the Data Frame:
The structure of the data frame can be seen by using the str() function.

    # Create the data frame.
    emp.data <- data.frame(
      emp_id = c (1:5),
      emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
      salary = c(623.3,515.2,611.0,729.0,843.25),
      start_date = as.Date(c("2012-01-01","2013-09-23","2014-11-15","2014-05-11","2015-03-27")),
      stringsAsFactors=FALSE
    )

    # Get the structure of the data frame.
    str(emp.data)

Summary of Data in Data Frame
The statistical summary and nature of the data can be obtained by applying the summary() function.

    # Create the data frame.
    emp.data <- data.frame(
      emp_id = c (1:5),
      emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
      salary = c(623.3,515.2,611.0,729.0,843.25),
      start_date = as.Date(c("2012-01-01","2013-09-23","2014-11-15","2014-05-11","2015-03-27")),
      stringsAsFactors=FALSE
    )

    # Print the summary.
    print(summary(emp.data))

Extract Data from Data Frame
Extract specific columns from a data frame using the column names.

    # Create the data frame.
    emp.data <- data.frame(
      emp_id = c (1:5),
      emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
      salary = c(623.3,515.2,611.0,729.0,843.25),
      start_date = as.Date(c("2012-01-01","2013-09-23","2014-11-15","2014-05-11","2015-03-27")),
      stringsAsFactors=FALSE
    )

    # Extract specific columns.
    result <- data.frame(emp.data$emp_name,emp.data$salary)
    print(result)

    # Extract 3rd and 5th row with 2nd and 4th column.
    result <- emp.data[c(3,5),c(2,4)]
    print(result)

    # Extract first two rows.
    result <- emp.data[1:2,]
    print(result)

Expand Data Frame
A data frame can be expanded by adding columns and rows.

    # Add the "dept" column.
    emp.data$dept <- c("IT","Operations","IT","HR","Finance")
    v <- emp.data
    print(v)

Add Row
To add more rows permanently to an existing data frame, we need to bring in the new rows in the same structure as the existing data frame and use the rbind() function.

In the example below we create a data frame with new rows and merge it with the existing data frame to create the final data frame.
    # Create the first data frame.
    emp.data <- data.frame(
      emp_id = c (1:5),
      emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
      salary = c(623.3,515.2,611.0,729.0,843.25),
      start_date = as.Date(c("2012-01-01","2013-09-23","2014-11-15","2014-05-11","2015-03-27")),
      dept = c("IT","Operations","IT","HR","Finance"),
      stringsAsFactors=FALSE
    )

    # Create the second data frame.
    emp.newdata <- data.frame(
      emp_id = c (6:8),
      emp_name = c("Rasmi","Pranab","Tusar"),
      salary = c(578.0,722.5,632.8),
      start_date = as.Date(c("2013-05-21","2013-07-30","2014-06-17")),
      dept = c("IT","Operations","Finance"),
      stringsAsFactors=FALSE
    )

    # Bind the two data frames.
    emp.finaldata <- rbind(emp.data,emp.newdata)
    print(emp.finaldata)

Unit 4: Simple manipulations; numbers and vectors

Vectors and assignment
R operates on named data structures. The simplest such structure is the numeric vector, which is a single entity consisting of an ordered collection of numbers. To set up a vector named x, say, consisting of five numbers, namely 10.4, 5.6, 3.1, 6.4 and 21.7, use the R command

    > x <- c(10.4, 5.6, 3.1, 6.4, 21.7)

This is an assignment statement using the function c() which in this context can take an arbitrary number of vector arguments and whose value is a vector got by concatenating its arguments end to end.

A number occurring by itself in an expression is taken as a vector of length one.

Notice that the assignment operator (‘<-’) consists of the two characters ‘<’ (“less than”) and ‘-’ (“minus”) occurring strictly side-by-side, and it ‘points’ to the object receiving the value of the expression. In most contexts the ‘=’ operator can be used as an alternative.

Assignment can also be made using the function assign(). An equivalent way of making the same assignment as above is with:

    > assign("x", c(10.4, 5.6, 3.1, 6.4, 21.7))

The usual operator, <-, can be thought of as a syntactic short-cut to this.
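As a quick check of the assignment forms described so far, the following sketch builds the same vector three ways and compares the results. The names x_eq and x_assign are hypothetical, introduced only to keep the three copies apart.

```r
# Three equivalent ways of creating the same vector.
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)
x_eq = c(10.4, 5.6, 3.1, 6.4, 21.7)
assign("x_assign", c(10.4, 5.6, 3.1, 6.4, 21.7))

# identical() returns TRUE when two objects are exactly the same.
print(identical(x, x_eq))
print(identical(x, x_assign))

# c() concatenates its arguments end to end;
# a number by itself is a vector of length one.
print(length(c(x, 0)))
print(length(3.1))
```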
Assignments can also be made in the other direction, using the obvious change in the assignment operator. So the same assignment could be made using

    > c(10.4, 5.6, 3.1, 6.4, 21.7) -> x

If an expression is used as a complete command, the value is printed and lost. So now if we were to use the command

    > 1/x

the reciprocals of the five values would be printed at the terminal (and the value of x, of course, unchanged).

The further assignment

    > y <- c(x, 0, x)

would create a vector y with 11 entries consisting of two copies of x with a zero in the middle place.

VECTOR ARITHMETIC
Vectors can be used in arithmetic expressions, in which case the operations are performed element by element. Vectors occurring in the same expression need not all be of the same length. If they are not, the value of the expression is a vector with the same length as the longest vector which occurs in the expression. Shorter vectors in the expression are recycled as often as need be (perhaps fractionally) until they match the length of the longest vector. In particular a constant is simply repeated. So with the above assignments the command

    > v <- 2*x + y + 1

generates a new vector v of length 11 constructed by adding together, element by element, 2*x repeated 2.2 times, y repeated just once, and 1 repeated 11 times.

The elementary arithmetic operators are the usual +, -, *, / and ^ for raising to a power. In addition all of the common arithmetic functions are available. log, exp, sin, cos, tan, sqrt, and so on, all have their usual meaning. max and min select the largest and smallest elements of a vector respectively. range is a function whose value is a vector of length two, namely c(min(x), max(x)). length(x) is the number of elements in x, sum(x) gives the total of the elements in x, and prod(x) their product.
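The recycling rule above can be tried out directly. This is a minimal sketch reusing the x and y vectors from the text; suppressWarnings() is used only because R warns when the longer length (11) is not a multiple of the shorter one (5).

```r
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)
y <- c(x, 0, x)   # 11 entries

# 2*x has length 5 and is recycled to match y; the constant 1 is repeated 11 times.
v <- suppressWarnings(2*x + y + 1)
print(length(v))

# The 6th element pairs the recycled x[1] with the zero in the middle of y.
print(v[6])

# A length-2 vector recycled against a length-4 vector (no warning: 4 is a multiple of 2).
print(c(1, 2, 3, 4) + c(10, 20))

# Some of the summary functions mentioned above.
print(range(x))
print(length(x))
print(sum(x))
print(prod(x))
```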
Two statistical functions are mean(x) which calculates the sample mean, which is the same as sum(x)/length(x), and var(x) which gives

    sum((x-mean(x))^2)/(length(x)-1)

or sample variance. If the argument to var() is an n-by-p matrix the value is a p-by-p sample covariance matrix got by regarding the rows as independent p-variate sample vectors.

sort(x) returns a vector of the same size as x with the elements arranged in increasing order; however there are other more flexible sorting facilities available (see order() or sort.list() which produce a permutation to do the sorting).

Note that max and min select the largest and smallest values in their arguments, even if they are given several vectors. The parallel maximum and minimum functions pmax and pmin return a vector (of length equal to their longest argument) that contains in each element the largest (smallest) element in that position in any of the input vectors.

For most purposes the user will not be concerned if the “numbers” in a numeric vector are integers, reals or even complex. Internally calculations are done as double precision real numbers, or double precision complex numbers if the input data are complex.

To work with complex numbers, supply an explicit complex part. Thus

    sqrt(-17)

will give NaN and a warning, but

    sqrt(-17+0i)

will do the computations as complex numbers.

GENERATING REGULAR SEQUENCES
R has a number of facilities for generating commonly used sequences of numbers. For example 1:30 is the vector c(1, 2, ..., 29, 30). The colon operator has high priority within an expression, so, for example 2*1:15 is the vector c(2, 4, ..., 28, 30). Put n <- 10 and compare the sequences 1:n-1 and 1:(n-1).

The construction 30:1 may be used to generate a sequence backwards.

The function seq() is a more general facility for generating sequences. It has five arguments, only some of which may be specified in any one call.
The first two arguments, if given, specify the beginning and end of the sequence, and if these are the only two arguments given the result is the same as the colon operator. That is seq(2,10) is the same vector as 2:10.

Arguments to seq(), and to many other R functions, can also be given in named form, in which case the order in which they appear is irrelevant. The first two arguments may be named from=value and to=value; thus seq(1,30), seq(from=1, to=30) and seq(to=30, from=1) are all the same as 1:30. The next two arguments to seq() may be named by=value and length=value, which specify a step size and a length for the sequence respectively. If neither of these is given, the default by=1 is assumed.

For example

    > seq(-5, 5, by=.2) -> s3

generates in s3 the vector c(-5.0, -4.8, -4.6, ..., 4.6, 4.8, 5.0). Similarly

    > s4 <- seq(length=51, from=-5, by=.2)

generates the same vector in s4.

The fifth argument may be named along=vector, which is normally used as the only argument to create the sequence 1, 2, ..., length(vector), or the empty sequence if the vector is empty (as it can be).

A related function is rep() which can be used for replicating an object in various complicated ways. The simplest form is

    > s5 <- rep(x, times=5)

which will put five copies of x end-to-end in s5. Another useful version is

    > s6 <- rep(x, each=5)

which repeats each element of x five times before moving on to the next.

LOGICAL VECTORS
As well as numerical vectors, R allows manipulation of logical quantities. The elements of a logical vector can have the values TRUE, FALSE, and NA (for “not available”). The first two are often abbreviated as T and F, respectively. Note however that T and F are just variables which are set to TRUE and FALSE by default, but are not reserved words and hence can be overwritten by the user. Hence, you should always use TRUE and FALSE.

Logical vectors are generated by conditions.
For example

    > temp <- x > 13

sets temp as a vector of the same length as x with values FALSE corresponding to elements of x where the condition is not met and TRUE where it is.

The logical operators are <, <=, >, >=, == for exact equality and != for inequality. In addition if c1 and c2 are logical expressions, then c1 & c2 is their intersection (“and”), c1 | c2 is their union (“or”), and !c1 is the negation of c1.

Logical vectors may be used in ordinary arithmetic, in which case they are coerced into numeric vectors, FALSE becoming 0 and TRUE becoming 1. However there are situations where logical vectors and their coerced numeric counterparts are not equivalent, for example see the next subsection.

MISSING VALUES
In some cases the components of a vector may not be completely known. When an element or value is “not available” or a “missing value” in the statistical sense, a place within a vector may be reserved for it by assigning it the special value NA. In general, any operation on an NA becomes an NA. The motivation for this rule is simply that if the specification of an operation is incomplete, the result cannot be known and hence is not available.

The function is.na(x) gives a logical vector of the same size as x with value TRUE if and only if the corresponding element in x is NA.

    > z <- c(1:3,NA); ind <- is.na(z)

Notice that the logical expression x == NA is quite different from is.na(x) since NA is not really a value but a marker for a quantity that is not available. Thus x == NA is a vector of the same length as x all of whose values are NA as the logical expression itself is incomplete and hence undecidable.

Note that there is a second kind of “missing” values which are produced by numerical computation, the so-called Not a Number, NaN, values. Examples are

    > 0/0

or

    > Inf - Inf

which both give NaN since the result cannot be defined sensibly.

In summary, is.na(xx) is TRUE both for NA and NaN values.
To differentiate these, is.nan(xx) is only TRUE for NaNs.

Missing values are sometimes printed as <NA> when character vectors are printed without quotes.

2.6 Character vectors
Character quantities and character vectors are used frequently in R, for example as plot labels. Where needed they are denoted by a sequence of characters delimited by the double quote character, e.g., "x-values", "New iteration results".

Character strings are entered using either matching double (") or single (’) quotes, but are printed using double quotes (or sometimes without quotes). They use C-style escape sequences, using \ as the escape character, so \\ is entered and printed as \\, and inside double quotes " is entered as \". Other useful escape sequences are \n, newline, \t, tab and \b, backspace; see ?Quotes for a full list.

Character vectors may be concatenated into a vector by the c() function; examples of their