You should have already downloaded the script for this module and saved it to the scripts folder in your R labs folder. If you missed this step, do it now by right-clicking (or command-clicking for Macs) here.
You should also have already downloaded the cheat sheets in the prior module, but if not, here are the cheat sheets we will use in this module.
You may hear people use the term base R when working in RStudio. This simply refers to the built-in functions and operations that R can perform without loading any packages. We will cover packages more later, but for now your Base R cheat sheet will come in handy.
R Objects
We have already covered that R can compute mathematical equations, but what if we want to create an object and save it in R?
Let’s create our first object. Objects are defined using R’s assignment operator (Windows shortcut: Alt + -; Mac: Option + -), which looks like a left arrow (<-).
Type the following statement directly into the console:
my_obj <- 6+11
Then hit “Enter”. What does R return (e.g. look at the console)?
NOTHING!
BUT… check the environment. You should see that you now have a new object in your environment, called my_obj, and this object contains a single number (17). Simply checking the environment works to view this simple object, but for more complex objects you may want to print them in the console or open another window to view them.
To print an object, type the object name either in your source (script), if you want it to print every time you run your code, or in the console, if you just want to view it once.
To view an object you can type the function View() with your object name in the parentheses or click on the object in the environment window.
Print and view your object using any of the methods described above.
NOTE: RStudio has a useful auto-fill feature, which can save you lots of time and headaches. After you’ve typed the first couple letters (e.g., “my”), RStudio will suggest “my_obj” (or use the “tab” key to trigger RStudio to give you autofill options) and you can just scroll to the right object/function and hit “Enter”! The auto-type feature, unlike R, is not case sensitive!
In your blank script, let’s define a new object. For example:
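The example line itself is not shown here. A minimal sketch, assuming a placeholder text value (substitute your own name), might look like this:

```r
# define a character object; "Ada Lovelace" is only a placeholder value
my_name <- "Ada Lovelace"
```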
Now run that line of code. You should see a new object (‘my_name’) pop up in your environment.
Notice how for this object we put quotation marks ('' or "") around the text. That is because words and letters are characters and not numeric values, so R recognizes them differently.
Try creating the same object without the quotes. What happens? You get an error message, right? This is because without the quotes R assumes you are referring to an object you have already created with that name, but it doesn’t exist. We will cover data types more in a minute.
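As a rough illustration of the difference between quoted and unquoted text, the sketch below uses a placeholder value ("Ada") and wraps the unquoted version in tryCatch() only so that the script keeps running and the error message can be inspected. In the lab you can simply type the unquoted version and read the error in the console.

```r
# quoted text is stored as a character string ("Ada" is a placeholder)
my_name <- "Ada"

# unquoted text is treated as the name of an object; no object called Ada
# exists, so R signals an error, which tryCatch() captures as text
result <- tryCatch(
  Ada,
  error = function(e) conditionMessage(e)
)

# print the captured error message
result
```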
Of course, it’s always a good idea to save your scripts often. Save this script in your ‘scripts’ folder using ‘file’ > ‘save as’ and title it ‘my_script_day1_1’.
R has many different kinds of objects that you can define and store in memory.
Objects that enable storage of information (data objects) include: vectors, matrices, lists, and data frames.
Objects that transform data and perform operations (e.g., statistics/visualizations) on data objects are called functions.
Remember the file you downloaded at the beginning of today’s lab? That file is an R script (“day1_2.R”). Let’s click back to it now.
Functions are routines that take inputs (also called arguments, usually data objects) and produce something useful in return (transformed data objects, summaries, plots, tables, statistics, files).
In R, the name of the function is followed by parentheses, and any arguments (inputs) are entered inside the parentheses. The basic syntax looks like the code snippet below.
## function syntax
functionName([argument 1, argument2, ...])
If you don’t type the parentheses, the function is not run. Instead, the function definition is shown.
Here are some examples of how functions are used. Note the parentheses!
# Functions -------------------
# sum
sum(1, 2, 3, 10) # returns: 16
## [1] 16
# combine
c(1, 2, 3, 10) # combine four numbers into a single data object (a vector!)
## [1] 1 2 3 10
# floor
floor(67.8) # rounds a number down to the nearest whole number
## [1] 67
# round
round(67.8) # rounds a number
## [1] 68
round # oops, forgot the parentheses!
## function (x, digits = 0) .Primitive("round")
Remember, it’s easy to get some help if you forget how a function works. Try accessing a help file for round() using one of the methods we covered earlier.
?round
The basic data types in R include character, numeric (double), integer, and logical.
Data of each of these types can be represented as scalars, vectors, matrices, lists, and data frames, and various functions work differently or won’t work at all depending on the data type provided.
Scalars are the simplest data objects. A scalar is just a single value of any data type.
# Create R Objects ------------------------
# Scalars----------------------
scalar1 <- 'this is a scalar'
scalar2 <- 104
Scalars can store information of any type. In the example above, scalar1 is a character and scalar2 is numeric.
If you’re uncertain about the type of any R object, use the typeof() function.
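As a quick check of the types described above, the sketch below applies typeof() to the two scalars and to two extra illustrative values (104L and TRUE, which are not part of the original example):

```r
# the two scalars defined in the module
scalar1 <- 'this is a scalar'
scalar2 <- 104

typeof(scalar1)  # "character"
typeof(scalar2)  # "double" (R's default storage type for numeric values)

# extra illustrative values
typeof(104L)     # "integer" (the L suffix asks for an integer)
typeof(TRUE)     # "logical"
```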
Vectors are one-dimensional sequences of values (which can be numbers or characters) and are created following the syntax below. Vectors can combine multiple scalars in a single data object. In fact, a scalar in R is really just a vector of length 1.
# Vectors ----------------------
# create a vector of numbers 1 to 4
vector1 <- c(1, 2, 3, 4)
# print vector1
vector1
## [1] 1 2 3 4
In the code above we used a function called c(), which is a ‘combine’ function: it takes several smaller data objects and combines them together into a single object. We will be using this function a lot.
There are many different ways to create a vector. Let’s see some examples of ways we can create the same vector above more efficiently.
# create a vector of numbers 1 to 4 using ':'
vector2 <- 1:4
# print vector2
vector2
## [1] 1 2 3 4
# create a vector of numbers 1 to 4 using seq()
vector3 <- seq(1, 4, by = 1)
# print vector 3
vector3
## [1] 1 2 3 4
# caution: c(1, 4) combines only the two values 1 and 4; it does not create the sequence 1 to 4
vector4 <- c(1, 4)
# print vector 4
vector4
## [1] 1 4
Notice, each vector is composed of one or more scalar elements of the same type. If you try to create a vector combining numeric and character elements, R will not give an error. Instead it silently converts (coerces) every element to a single type, in this case character. You can try this in the console now.
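To see what R does with mixed input, here is a short check. The object name mixed is just an illustrative choice:

```r
# combine a number, a character string, and a logical value
mixed <- c(1, "a", TRUE)

# every element has been converted to character
mixed          # "1" "a" "TRUE"
typeof(mixed)  # "character"
```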
Now let’s do some stuff with vectors!
# make a vector of 1, 2, and 3
d1 <- 1:3
d1
## [1] 1 2 3
# add 3 to all elements of the vector d1
d2 <- d1 + 3
d2
## [1] 4 5 6
# elementwise addition
d3 <- d1 + d2
d3
## [1] 5 7 9
# check the number of elements in a vector
length(d1)
## [1] 3
# sum all the elements in a vector
sum(d3)
## [1] 21
# extract the second element in a vector
d2[2]
## [1] 5
Matrix data objects have two dimensions: rows and columns. All of the elements in a matrix must be of the same type.
Let’s make our first matrix. One simple way to make a matrix is just by joining two or more vectors using the function cbind() (bind vectors or matrices together by column) or rbind() (bind vectors or matrices together by row).
# Matrices ----------------------
# create a matrix by binding vectors, with vector d1 as column 1 and d2 as column 2
mymat <- cbind(d1, d2)
mymat
## d1 d2
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
# confirm that the new object "mymat" is a matrix using the 'class()' function
class(mymat)
## [1] "matrix" "array"
# create matrix another way (matrix constructor)
mymat <- matrix(
c(1,2,3,4,5,6),
nrow = 3,
ncol = 2)
mymat
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
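The examples above bind by column. For comparison, here is a minimal sketch of binding the same two vectors by row with rbind(). The object name mymat_rows is an illustrative choice, not part of the module script:

```r
# recreate the two vectors used earlier
d1 <- 1:3
d2 <- d1 + 3

# bind by row: d1 becomes row 1 and d2 becomes row 2
mymat_rows <- rbind(d1, d2)
mymat_rows

# the result has 2 rows and 3 columns
dim(mymat_rows)  # 2 3
```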
We can do stuff with matrices too!
# math with matrices
mymat + 2
## [,1] [,2]
## [1,] 3 6
## [2,] 4 7
## [3,] 5 8
sum(mymat)
## [1] 21
# extract matrix elements
# extract the element in the 3rd row, 2nd column
mymat[3, 2]
## [1] 6
mymat[, 1] # extract the entire first column
## [1] 1 2 3
# Syntax for using []
# X[a,b] access row a, column b element of matrix/data frame X
# X[,b] access all rows of column b of matrix/data frame X
# X[a,] access row a of matrix/data frame X
Notice, when using [] with vectors you only need one value to reference the element’s position. With matrices, however, you need two values to specify the row (first) and the column (second).
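A few more indexing patterns follow from the same rules. This sketch rebuilds the matrix and vector from above so it runs on its own. Passing a vector of positions inside [] selects several elements at once:

```r
# rebuild the objects used above
mymat <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3, ncol = 2)
d2 <- c(4, 5, 6)

# extract the entire second row
mymat[2, ]         # 2 5

# extract rows 1 and 3 of column 2
mymat[c(1, 3), 2]  # 4 6

# extract the first and third elements of a vector
d2[c(1, 3)]        # 4 6
```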
Lists are more general than matrices. List objects are just a bunch of arbitrary R data objects (called list elements) grouped together into a single object! The elements of a list don’t need to be the same length or the same type. The elements of a list can be vectors, matrices, other lists, or even functions.
Let’s make our first list:
# lists ----------------------
# create an empty list
mylist <- list()
# add elements to the empty list
mylist[[1]] <- c(1, 2, 3) # note the double brackets- this is one way to reference list elements.
mylist[[2]] <- c('piatra','craiului')
mylist[[3]] <- matrix(1:6,
nrow = 2)
# print mylist
mylist
## [[1]]
## [1] 1 2 3
##
## [[2]]
## [1] "piatra" "craiului"
##
## [[3]]
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
# do stuff with lists
# extract the second list element
mylist[[2]]
## [1] "piatra" "craiului"
# extract the first row, second column from the matrix that is embedded as the third element in this list !
mylist[[3]][1,2]
## [1] 3
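List elements can also be given names, which allows access with $ in addition to double brackets. The sketch below uses a hypothetical list called named_list, and also shows that single brackets return a smaller list while double brackets return the element itself:

```r
# a list with named elements (named_list is an illustrative name)
named_list <- list(numbers = c(1, 2, 3),
                   places = c('piatra', 'craiului'))

# access an element by name
named_list$places            # "piatra" "craiului"

# access by name with double brackets, then index into the vector
named_list[["numbers"]][2]   # 2

# single brackets keep the list wrapper; double brackets drop it
class(named_list[1])         # "list"
class(named_list[[1]])       # "numeric"
```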
Data frame objects are the basic data storage object in R. Data frames are a special type of list in which each list element is a vector of equal length. Each list element in a data frame is also known as a column.
Data frames superficially resemble matrices, since both have rows and columns. However, unlike matrices, the columns of a data frame can represent different data types (e.g., character, logical, numeric, factor), and can thereby represent different types of information!
You can think of a data frame like a spreadsheet. Each row of the data frame represents a different observation, and each column represents a different measurement taken on each observation unit.
Let’s make our first data frame.
# Data frames ----------------------
# create a data frame with two columns. Each column is a vector of length 3
df1 <- data.frame(col1 = c(1,2,3),
col2 = c("A","A","B"))
df1
## col1 col2
## 1 1 A
## 2 2 A
## 3 3 B
# extract the first element in the second column
df1[1, 2]
## [1] "A"
# extract the second column by name (alternatively, df1[["col2"]])
df1$col2
## [1] "A" "A" "B"
Now we have a data frame with three observation units and two measurements (variables).
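To confirm that the two columns hold different data types, one option is to ask for the class of each column. This sketch assumes R 4.0 or later, where text columns stay as character by default (in older versions they were converted to factors):

```r
# rebuild the data frame from above
df1 <- data.frame(col1 = c(1, 2, 3),
                  col2 = c("A", "A", "B"))

# class of each column: col1 is numeric, col2 is character (R >= 4.0)
sapply(df1, class)

# compact overview of the whole data frame
str(df1)
```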
A tibble is the same thing as a data frame, just with a few tweaks to make it work better in the tidyverse. We will primarily work with tibble data frames in this course. For our purposes right now, tibbles and data frames are the same thing.
In this section, we will make up some fake data objects. In a later module we’ll practice working with real data!
One task that comes up a lot is generating sequences of numbers:
# Making up data ----------------------------------
# Generating vector sequences
# sequential vector from 1 to 10
1:10
## [1] 1 2 3 4 5 6 7 8 9 10
# sequence of length 5 between 0 and 1
seq(0, 1, length.out = 5)
## [1] 0.00 0.25 0.50 0.75 1.00
Another task is to group regular recurring sequences together:
# Repeating vector sequences
# repeat 0 three times
rep(0,
times = 3)
## [1] 0 0 0
# repeat the vector 1:3 twice
rep(1:3,
times = 2)
## [1] 1 2 3 1 2 3
# repeat each element of 1:3 two times
rep(1:3,
each = 2)
## [1] 1 1 2 2 3 3
We can also fill up a vector with random numbers using one of R’s built in random number generators:
# Random numbers
# 10 samples from std. normal
rnorm(10)
## [1] -0.7408191 -0.7187452 0.1209377 1.4057562 0.8316211 0.5322898
## [7] -0.8242141 0.3286136 -0.5420464 0.9434093
# 10 samples from Normal(-2,4^2)
rnorm(10,
mean= -2,
sd = 4)
## [1] -1.8996426 -6.8537051 2.6647374 -0.5085015 6.7146124 1.4806531
## [7] -6.2045503 -2.2605679 -9.8202165 2.5246209
# 5 samples from Binom(3,0.5)
rbinom(5,
size = 3,
prob = .5)
## [1] 2 1 2 3 1
# 5 samples from Binom(3,0.1)
rbinom(5, 3, .1)
## [1] 0 0 1 0 0
# 10 standard uniform random numbers
runif(10)
## [1] 0.96533861 0.16773323 0.08546954 0.71453723 0.65396066 0.34180639
## [7] 0.16557319 0.66700850 0.31211709 0.28972317
# 10 uniform random numbers from [-1,1]
runif(10,
min = -1,
max = 1)
## [1] 0.53794858 0.71385974 -0.40624226 0.71339981 0.54327869 0.04556408
## [7] 0.86171430 -0.75404816 0.56019722 -0.74087121
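Because these functions draw random numbers, your output will differ from the values printed above and will change every time you run the code. If you need repeatable results, one common approach is to set the random seed first. The seed value 42 below is an arbitrary choice:

```r
# set the seed, then draw three standard normal values
set.seed(42)
draw1 <- rnorm(3)

# resetting the same seed reproduces exactly the same draws
set.seed(42)
draw2 <- rnorm(3)

identical(draw1, draw2)  # TRUE
```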
And finally, we can make up a fake data frame (or tibble) using some of the tricks we just learned!
# Make up an entire fake data frame!
my.data <- data.frame(
Obs.Id = 1:100,
Treatment = rep(c("A","B","C","D","E"),
each = 20),
Block = rep(1:20,
times = 5),
Germination = rpois(100,
lambda = rep(c(1,5,4,7,1),
each = 20)), # random poisson variable
AvgHeight = rnorm(100,
mean = rep(c(10,30,31,25,35),
each = 20))
)
my.data
## Obs.Id Treatment Block Germination AvgHeight
## 1 1 A 1 0 11.776084
## 2 2 A 2 0 8.789402
## 3 3 A 3 0 9.940561
## 4 4 A 4 1 11.133206
## 5 5 A 5 0 9.968294
## 6 6 A 6 2 10.524047
## 7 7 A 7 3 9.399188
## 8 8 A 8 1 8.069986
## 9 9 A 9 2 9.855275
## 10 10 A 10 0 10.286203
## 11 11 A 11 0 9.902374
## 12 12 A 12 0 7.744513
## 13 13 A 13 1 8.952270
## 14 14 A 14 0 10.030242
## 15 15 A 15 2 10.960314
## 16 16 A 16 2 9.013240
## 17 17 A 17 1 9.909689
## 18 18 A 18 0 10.086074
## 19 19 A 19 2 10.397558
## 20 20 A 20 0 10.097633
## 21 21 B 1 4 29.333932
## 22 22 B 2 3 29.700823
## 23 23 B 3 7 28.899031
## 24 24 B 4 4 30.322238
## 25 25 B 5 5 30.003315
## 26 26 B 6 4 31.162666
## 27 27 B 7 2 30.192316
## 28 28 B 8 3 29.552833
## 29 29 B 9 6 30.775975
## 30 30 B 10 4 29.292635
## 31 31 B 11 4 27.955480
## 32 32 B 12 4 31.567821
## 33 33 B 13 4 29.051798
## 34 34 B 14 6 30.049071
## 35 35 B 15 9 29.090790
## 36 36 B 16 5 29.945843
## 37 37 B 17 10 30.159281
## 38 38 B 18 5 29.939229
## 39 39 B 19 2 31.297747
## 40 40 B 20 3 30.734756
## 41 41 C 1 3 31.982634
## 42 42 C 2 2 29.212980
## 43 43 C 3 4 30.629105
## 44 44 C 4 7 30.453483
## 45 45 C 5 6 30.538681
## 46 46 C 6 3 31.439560
## 47 47 C 7 2 29.300472
## 48 48 C 8 6 31.100520
## 49 49 C 9 3 31.168748
## 50 50 C 10 2 30.677564
## 51 51 C 11 4 30.368452
## 52 52 C 12 1 30.850778
## 53 53 C 13 2 29.436806
## 54 54 C 14 5 32.887565
## 55 55 C 15 7 31.182943
## 56 56 C 16 3 29.858984
## 57 57 C 17 7 29.834310
## 58 58 C 18 6 29.529683
## 59 59 C 19 8 29.943757
## 60 60 C 20 4 31.686075
## 61 61 D 1 7 23.726529
## 62 62 D 2 3 24.299542
## 63 63 D 3 4 26.210504
## 64 64 D 4 7 24.347510
## 65 65 D 5 5 24.410449
## 66 66 D 6 6 24.793476
## 67 67 D 7 8 25.561033
## 68 68 D 8 7 23.660624
## 69 69 D 9 7 23.607021
## 70 70 D 10 8 23.333794
## 71 71 D 11 9 24.056266
## 72 72 D 12 5 25.332273
## 73 73 D 13 10 23.319500
## 74 74 D 14 6 26.240681
## 75 75 D 15 9 25.229666
## 76 76 D 16 7 25.897959
## 77 77 D 17 7 26.018481
## 78 78 D 18 7 24.255815
## 79 79 D 19 8 25.251850
## 80 80 D 20 9 24.611841
## 81 81 E 1 1 35.812012
## 82 82 E 2 1 34.292458
## 83 83 E 3 0 34.616365
## 84 84 E 4 0 35.266973
## 85 85 E 5 1 34.723115
## 86 86 E 6 2 35.605474
## 87 87 E 7 1 36.074186
## 88 88 E 8 1 36.393650
## 89 89 E 9 1 34.932942
## 90 90 E 10 2 36.239528
## 91 91 E 11 1 35.809744
## 92 92 E 12 1 36.121714
## 93 93 E 13 0 35.114737
## 94 94 E 14 0 32.711338
## 95 95 E 15 1 33.884105
## 96 96 E 16 1 35.364785
## 97 97 E 17 3 33.077700
## 98 98 E 18 1 33.926186
## 99 99 E 19 1 34.739980
## 100 100 E 20 0 35.716779
# Use the "summary()" function to summarize each column in the data frame.
summary(my.data)
## Obs.Id Treatment Block Germination
## Min. : 1.00 Length:100 Min. : 1.00 Min. : 0.00
## 1st Qu.: 25.75 Class :character 1st Qu.: 5.75 1st Qu.: 1.00
## Median : 50.50 Mode :character Median :10.50 Median : 3.00
## Mean : 50.50 Mean :10.50 Mean : 3.54
## 3rd Qu.: 75.25 3rd Qu.:15.25 3rd Qu.: 6.00
## Max. :100.00 Max. :20.00 Max. :10.00
## AvgHeight
## Min. : 7.745
## 1st Qu.:23.974
## Median :29.541
## Mean :26.025
## 3rd Qu.:31.333
## Max. :36.394
# extract rows 21 to 30 and store as a new data frame
mydf <- my.data[21:30, ]
mydf
## Obs.Id Treatment Block Germination AvgHeight
## 21 21 B 1 4 29.33393
## 22 22 B 2 3 29.70082
## 23 23 B 3 7 28.89903
## 24 24 B 4 4 30.32224
## 25 25 B 5 5 30.00331
## 26 26 B 6 4 31.16267
## 27 27 B 7 2 30.19232
## 28 28 B 8 3 29.55283
## 29 29 B 9 6 30.77598
## 30 30 B 10 4 29.29264
# access a column of the data frame by name
mydf$Treatment
## [1] "B" "B" "B" "B" "B" "B" "B" "B" "B" "B"
Here are some useful R functions for exploring data objects. We will cover these in more depth later but take a look at the outputs for each now.
# Useful data exploration/checking tools in R --------------------
# Obtain length (# elements) of vector d2
length(d2)
# Obtain dimensions of matrix or array
dim(mymat)
# summarize columns in a data frame.
summary(my.data)
# look at the "internals" of an object (useful for making sense of complex objects!)
str(my.data)
# get names of variables in a data frame (or names of elements in a named vector)
names(my.data)
# get number of rows/observations in a data frame
nrow(my.data)
# get number of columns/variables in a data frame
ncol(my.data)
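To give a sense of what these functions return, here is a small self-contained sketch that rebuilds a few of the simpler objects from earlier in the module (rather than the random my.data) so the results are predictable:

```r
# small example objects
d2 <- c(4, 5, 6)
mymat <- matrix(1:6, nrow = 3, ncol = 2)
df1 <- data.frame(col1 = c(1, 2, 3),
                  col2 = c("A", "A", "B"))

length(d2)   # 3
dim(mymat)   # 3 2 (rows, then columns)
names(df1)   # "col1" "col2"
nrow(df1)    # 3
ncol(df1)    # 2
```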
Below are some exercises for your first stats assignment based on the material we just covered. For all exercises the results will be available on this website after I’ve graded the assignments. You have until the start of lab the following week to complete the practice problems for all modules we’ve covered. These are meant to test your knowledge of the material we’ve covered and help you learn to work with real data. These are meant to mimic working in R in the real world, and you will have to modify code from the module and possibly learn to use a new function to complete these assignments. Remember you can always look up information for a particular function using ?function_name to open the help window.
Create a vector called ‘myvec’ using any of the methods you learned with numbers 1 to 10. Note there are multiple ways to do this.
Create a 3 row by 2 column matrix named ‘mymat’. Use the rbind() function to bind the following three rows/vectors together:
c(1,4)
c(2,5)
c(3,6)
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
Get the names of columns in the data frame you created earlier, ‘mydf’. Hint: see the R functions to explore data section. Then extract all rows for column 5 by name, and do the same thing using the element position, e.g. [].
## [1] "Obs.Id" "Treatment" "Block" "Germination" "AvgHeight"
## [1] 29.33393 29.70082 28.89903 30.32224 30.00331 31.16267 30.19232 29.55283
## [9] 30.77598 29.29264
## [1] 29.33393 29.70082 28.89903 30.32224 30.00331 31.16267 30.19232 29.55283
## [9] 30.77598 29.29264
Create a new matrix called ‘mymat2’ that includes all the data from columns 3 to 5 of data frame mydf. HINT: use the as.matrix() function to coerce a data frame into a matrix. Since we didn’t cover this function you may need to look it up in the help files.
Note your values for some columns may be slightly different since the code to create mydf uses random number generators.
## Block Germination AvgHeight
## 21 1 4 29.33393
## 22 2 3 29.70082
## 23 3 7 28.89903
## 24 4 4 30.32224
## 25 5 5 30.00331
## 26 6 4 31.16267
## 27 7 2 30.19232
## 28 8 3 29.55283
## 29 9 6 30.77598
## 30 10 4 29.29264
Create a list named ‘mylist’ that is composed of
- a vector: 1:3,
- a matrix: matrix(1:6, nrow = 3, ncol = 2),
- and a data frame: data.frame(x = c(1, 2, 3), y = c(TRUE, FALSE, TRUE), z = c("a", "a", "b")).
## [[1]]
## [1] 1 2 3
##
## [[2]]
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
##
## [[3]]
## x y z
## 1 1 TRUE a
## 2 2 FALSE a
## 3 3 TRUE b
Extract the second and third observation from the 1st column of the data frame in ‘mylist’ (the list created above).
## [1] 2 3
Please email me a copy of your R script for this assignment by start of lab on Friday January 26th with your first and last name followed by assignment 1 as the file name (e.g. ‘marissa_dyck_assignment1.R’)
You should always be following best coding practices (see Intro to R module 1), but especially for assignment submissions. Please make sure each problem has its own header so that I can easily navigate to your answers, and that your code is well organized with spaces as described in the best coding practices section and comments as needed.