11 R & Tidyverse
Source: CS50 R
12 Representing Data
12.1 What is R
R is built to work with data. It is used in data science, statistics, and research
Tip: install package styler
12.2 RStudio
To use R, we download R and RStudio. In the console we write
file.create('hello.R')
# TRUE
Notice that our file name must end with .R
Our file is created in our folder, aka the working directory.
To open our file, double-click hello.R in the file explorer
print("hello, world")
To save our file
ctrl s
To clear our console
ctrl l
To run our code (translate it so the computer can understand it)
- click on the Run button
12.3 Functions in R
R is full of functions: a function is a command followed by () that takes inputs, aka arguments.
For example, in print("hello, world"), "hello, world" is the argument.
Printing is called a side effect because we see something on the screen
12.3.1 VS Code vs RStudio
RStudio is built for R and makes writing R code easier
12.3.2 Python vs R
Python is general-purpose; R is more specialized for data and statistics.
But at the end of the day, we can do almost everything in both
12.3.3 to change working directory
setwd("")
Debugging is the process of fixing our code
12.4 Readline
to take input in R
readline("what's your name? ")
print("Hello, Ahmed")
Notice: the Run button runs one line; Source runs the whole file.
Our code is not making use of what readline gives back yet. To do so, we use its return value
Print vs return value:
in an exam: you hand a paper to your friend to write down the answer
print: screams out the result
return: writes down the answer in the paper silently
To make use of the readline, we use a variable
name <- readline('What is your name? ')
<- is called assignment. name will now appear in the Environment pane.
12.5 Paste
name <- readline('What is your name? ')
print("Hello name")
# Hello name
It literally prints "Hello name". We want to join the text (aka a string) and the variable.
The process of joining strings is called string concatenation. To do so,
we use paste
name <- readline('What is your name? ')
greeting <- paste("Hello ", name)
print(greeting)
# Hello  ahmed
We separate arguments with a comma. This program has a bug though: there are two spaces between Hello and ahmed.
Why? We need to read the documentation
paste(..., sep = ' ')
... means we can pass as many arguments as we want
sep is the separator placed between the arguments; by default it is a single space
name <- readline('What is your name? ')
greeting <- paste("Hello ", name, sep='')
print(greeting)
# Hello ahmed
It's tiring to set sep every time, so we have another function, paste0
name <- readline('What is your name? ')
greeting <- paste0("Hello ", name)
print(greeting)
# Hello ahmed
12.5.1 Paste vs Cat
cat has a side effect (it prints to the console); paste does not, it only returns the string
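To make the difference concrete, here is a small sketch with a made-up name standing in for the user's input (the variable names are only for illustration). paste and paste0 return a string we can store; cat prints and gives back NULL.

```r
# Illustrative value only; "Ahmed" stands in for what readline would return
name <- "Ahmed"

# paste joins its arguments with sep, a single space by default
with_default_sep <- paste("Hello", name)   # "Hello Ahmed"
double_space <- paste("Hello ", name)      # "Hello  Ahmed" (two spaces)

# paste0 joins with no separator at all
with_paste0 <- paste0("Hello ", name)      # "Hello Ahmed"

# cat prints to the console (side effect) and returns NULL
cat_result <- cat(with_paste0, "\n")
is.null(cat_result)                        # TRUE
```

So a value built with paste can be reused later, while the result of cat cannot.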
12.6 Function Composition
Notice we can make our code shorter
name <- readline('What is your name? ')
print(paste0("Hello ", name))
This is function composition: a function takes another function's result as its argument. R runs the inner function first, then the outer one.
We can go further and do
print(paste0("Hello ", readline('What is your name? ')))
This is shorter, but harder to read
Note: we can also write comments, aka notes for ourselves
# Asks user for name
name <- readline('What is your name? ')
# Says hello to user
print(paste0("Hello ", name))
By convention, comments go above the code they describe
12.7 Count.R
file.create("count.R")
To clear the console: ctrl l
mario <- readline("Enter votes for Mario: ")
peach <- readline("Enter votes for Peach: ")
bowser <- readline("Enter votes for Bowser: ")
total <- mario + peach + bowser
# Error: non-numeric argument to binary operator
Notice how we try to add using +
The error tells us the inputs are not numbers: in the environment you will find them stored as '100', not 100
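To see the difference between '100' and 100 for ourselves, we can ask R how each value is stored. This is just a sketch with made-up values; votes_text stands in for what readline returns.

```r
# readline always gives back text, even when the user types digits
votes_text <- "100"
votes_number <- 100

typeof(votes_text)        # "character"
typeof(votes_number)      # "double"
is.numeric(votes_text)    # FALSE
is.numeric(votes_number)  # TRUE
```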
12.8 Storage Mode
R deals with several storage modes, including
- character
- double
- integer
We can convert from one storage mode to another, aka coercion
mario <- readline("Enter votes for Mario: ")
peach <- readline("Enter votes for Peach: ")
bowser <- readline("Enter votes for Bowser: ")
mario <- as.integer(mario)
peach <- as.integer(peach)
bowser <- as.integer(bowser)
total <- mario + peach + bowser
We can clean it up
mario <- as.integer(readline("Enter votes for Mario: "))
peach <- as.integer(readline("Enter votes for Peach: "))
bowser <- as.integer(readline("Enter votes for Bowser: "))
total <- mario + peach + bowser
Since R deals a lot with data, there is another way to add
mario <- as.integer(readline("Enter votes for Mario: "))
peach <- as.integer(readline("Enter votes for Peach: "))
bowser <- as.integer(readline("Enter votes for Bowser: "))
total <- sum(mario,peach,bowser)
print(paste('Total Votes:', total))
12.9 Tables
We learnt how to enter data by hand, but there is a better way.
Data are usually stored in tables that consist of rows and columns.
We can do operations on rows and on columns.
But first, to access this data, it will be saved as CSV, aka comma-separated values.
To list what is in the environment and clear it
ls()
rm(list = ls())
ls()
12.10 tabulate.R
to read the data
votes <- read.table("votes.csv")
To view the data
View(votes)
It looks like the data is read wrongly. After reading the documentation
votes <- read.table("votes.csv", sep = ',')
We are getting closer but not there yet
votes <- read.table("votes.csv",
sep = ',',
header = TRUE
)
Since life is short, we have another function
votes <- read.csv('votes.csv')
votes is stored as a data frame. We can access rows and columns using [row, column]
Remember your linear algebra class
Note: if you omit row number, it means you want all the rows
votes[,2]
# all rows of second column
But a better way is to use names.
To access a column by name
votes$poll
12.10.1 access mario number of polls
votes[1, 2] # 37
12.11 Vectors
When we select a column from a data frame, we get a vector:
like an array, a list of values that all have the same storage mode.
We can access elements of a vector using []
votes$poll[1] # 37
Instead of adding the values one by one, we can
sum(votes$poll)
sum is vectorized, aka it can deal with whole vectors.
We want to know how many votes each candidate got
votes$poll[1] + votes$mail[1]
votes$poll[2] + votes$mail[2]
votes$poll[3] + votes$mail[3]
This works, but it's a bad idea.
Again, remembering your linear algebra course, we can do vector arithmetic
votes$poll + votes$mail
votes[,2] + votes[,3]
# returns element-wise sum
Notice: at the console, RStudio shows the result without an explicit print
Note
votes[1] # returns a data frame, not a vector
We can add a column
votes$total <- votes$poll + votes$mail
12.12 write.csv
To save our data
write.csv(votes, 'totals.csv')
This adds row names by default. To leave them out, add row.names = FALSE
to access column names
colnames(votes)
rownames(votes)
12.13 voters.R
we can access data from the internet using read.csv
url <- "https://github.com/fivethirtyeight/data/raw/master/non-voters/nonvoters_data.csv"
voters <- read.csv(url)
The data is big. We need to know how many rows and columns there are
nrow(voters)
ncol(voters)
To understand what the column names mean, check the codebook
We want to check a column
voters$voter_category
There are a lot of values. To get the unique ones
unique(voters$voter_category)
12.14 Special values
We are interested in Q22
voters$Q22
It is full of NA, which means not available.
We also have Inf and -Inf.
NaN means not a number, like Inf / Inf.
NULL means there is nothing.
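A quick way to meet these special values is to produce them at the console. A minimal sketch using only base R; na_mean is just an illustrative name.

```r
is.na(NA)          # TRUE
1 / 0              # Inf
-1 / 0             # -Inf
Inf / Inf          # NaN
is.nan(Inf / Inf)  # TRUE
is.null(NULL)      # TRUE
length(NULL)       # 0

# NA spreads: most calculations that touch it also give NA
na_mean <- mean(c(1, NA, 3))  # NA
```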
Another interesting column is Q21
unique(voters$Q21)
We get numbers like 1, 2, 3. It's hard to remember what each number represents, so we use factors.
They are like dummy variables: we can give them labels
factor(
voters$Q21,
labels = c('?','Yes','No','Unsure')
)
Now we get names instead of numbers.
We can also exclude a value
factor(
voters$Q21,
labels = c('Yes','No','Unsure'),
exclude = c(-1)
)
12.14.1 factor vs as.factor
Check the documentation; they are kind of the same
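Here is a small sketch of both functions on a made-up vector of answers (not the real survey data), where -1 plays the role of "no answer". The names answers, labelled, and plain are only for illustration.

```r
# Made-up survey answers; -1 stands for "no answer" in this sketch
answers <- c(1, 2, 3, 2, 1, -1)

# factor lets us attach labels and exclude a value (it becomes NA)
labelled <- factor(
  answers,
  labels = c("Yes", "No", "Unsure"),
  exclude = c(-1)
)
levels(labelled)        # "Yes" "No" "Unsure"
as.character(labelled)  # "Yes" "No" "Unsure" "No" "Yes" NA

# as.factor has no labels argument; the levels are the values themselves
plain <- as.factor(answers)
levels(plain)           # "-1" "1" "2" "3"
```

So as.factor is the quick conversion, and factor is the one to reach for when we want labels or exclusions.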
Next time, we will learn how to transform data
13 Transforming Data
13.1 Outliers
Outliers are values that lie far outside the range of the rest of the data. One option is to remove them
13.2 temps.R
R can read RData files, which are like CSV but specific to R
load("temps.RData")
To find the mean, aka the sum over the length
mean(temps)
To peek at the vector, just write its name
temps
(Figure: plot of the temps data.)
The second day seems to be an outlier. Let's select it, and select days 4 & 7 too.
temps[2]
temps[4]
temps[7]
13.3 Transforming Vector
Instead of selecting individual elements, we can select them all in one shot using a vector of indexes
temps[c(2,4,7)]
To remove them from temps, just add a minus sign
temps[-c(2,4,7)]
13.4 Logical expressions
We want to ask a yes-or-no question: is this value an outlier or not?
We do so using > < >= <= == !=. These return a logical, aka TRUE or FALSE
# is first element less than 0?
temps[1] < 0
Now to do this for the whole vector
temps < 0
This returns TRUE or FALSE for every element
13.4.1 to remove outlier once and for all
temps <- temps[-c(2,4,7)]
To know which indexes are TRUE
which(temps < 0)
13.5 Logical operators
and, or, not: & | !
Remember, outliers can be very small or very large values
temps < 0 | temps > 60
Note: when dealing with single values, use && and ||
any(temps < 0 | temps > 60)
Returns TRUE if any of the values is TRUE
all(temps < 0 | temps > 60)
Returns TRUE if all of the values are TRUE
lets select the outliers
temps[which(temps < 0 | temps > 60)]
temps[(temps < 0 | temps > 60)]
These two are the same
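We can check that claim on a small made-up vector (temps_demo is hypothetical, not the real temps data). Selecting by position with which and selecting with the logical vector directly give the same values here.

```r
# Made-up temperatures with two obvious outliers
temps_demo <- c(15, -40, 18, 22, 99, 17)

is_outlier <- temps_demo < 0 | temps_demo > 60
any(is_outlier)    # TRUE
all(is_outlier)    # FALSE
which(is_outlier)  # 2 5

by_position <- temps_demo[which(is_outlier)]
by_logical <- temps_demo[is_outlier]
identical(by_position, by_logical)  # TRUE
```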
now lets use not
filter <- (temps < 0 | temps > 60)
outliers <- temps[filter]
no_outliers <- temps[!filter]
To save our data as RData
save(no_outliers, file = 'no_outliers.RData')
13.6 Chicks.R
our data consists of three columns: chick, feed, weight
they can eat either casein or fava
goal: get weight relative to what they ate
chicks <- read.csv('chicks.csv')
View(chicks)
Our data has NA values
mean(chicks$weight) # NA
We can't use mean directly because of the NA values.
Let's skip them for this calculation using na.rm
mean(chicks$weight, na.rm = TRUE)
To get the weight for those who ate casein only, we can subset by selecting rows
casein_chicks <- chicks[c(1,2,3),]
mean(casein_chicks$weight)
We reached our goal, but we can be more concise by using a range with :
casein_chicks <- chicks[1:3,]
mean(casein_chicks$weight)
Even better, filter by feed name
filter <- chicks$feed == 'casein'
casein_chicks <- chicks[filter, ]
mean(casein_chicks$weight)
Note: we can remove the NA values from the beginning
chicks$weight == NA # does not work: comparing with NA gives NA
13.7 Logical functions
To see if we have NA values
is.na(chicks$weight)
Now to keep only the rows whose weight is not NA
chicks <- chicks[!is.na(chicks$weight), ]
There is another way: using subset instead of indexing by rows and columns
chicks <- subset(chicks, !is.na(weight))
This filters rows based on a condition.
We can get the row names
rownames(chicks)
Notice that the row names are no longer continuous because we removed the rows with NA.
To reset the row names
rownames(chicks) <- NULL
rownames(chicks)
To get the count of NA values
sum(is.na(chicks$weight))
Why does this work? Because TRUE is represented as 1 and FALSE as 0
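The steps of this section can be tried end to end on a tiny made-up data frame. chicks_demo and its numbers are invented for illustration; they are not the real chicks.csv.

```r
# Invented data in the same shape as chicks.csv (feed, weight)
chicks_demo <- data.frame(
  feed = c("casein", "fava", "casein", "fava"),
  weight = c(368, NA, 390, 160)
)

mean(chicks_demo$weight)                             # NA, because of the missing value
mean_all <- mean(chicks_demo$weight, na.rm = TRUE)   # 306
na_count <- sum(is.na(chicks_demo$weight))           # 1, since TRUE counts as 1

# Drop the rows with a missing weight and reset the row names
clean <- subset(chicks_demo, !is.na(weight))
rownames(clean) <- NULL

# Average weight for one feed, using a logical filter
casein_mean <- mean(clean$weight[clean$feed == "casein"])  # 379
```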
13.9 Conditionals
what if our user chooses 0 or 7, we need to handle it
chicks <- read.csv("chicks.csv")
chicks <- subset(chicks, !is.na(weight))
feed_options <- unique(chicks$feed)
formatted_options <- paste0(1:6,'.',feed_options)
cat(formatted_options, sep = '\n')
feed_choice <- as.integer(readline("Feed type: "))
if (feed_choice < 1 || feed_choice > 6) {
  cat("Invalid choice")
} else {
  selected_feed <- feed_options[feed_choice]
  print(subset(chicks, feed == selected_feed))
}
We also have else if
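As a sketch of else if, here is a hypothetical helper (check_choice is not part of the lecture code) that separates three cases. Checking for NA first matters: a condition like NA < 1 is itself NA, and if cannot decide on NA.

```r
# Hypothetical helper: classifies a menu choice between 1 and n_options
check_choice <- function(choice, n_options) {
  if (is.na(choice)) {
    "not a number"
  } else if (choice < 1 || choice > n_options) {
    "out of range"
  } else {
    "ok"
  }
}

check_choice(3, 6)   # "ok"
check_choice(7, 6)   # "out of range"
check_choice(NA, 6)  # "not a number"
```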
13.10 sales.R
what if we have many datasets, same column names, how to merge them?
Q1 <- read.csv("Q1.csv")
Q2 <- read.csv("Q2.csv")
Q3 <- read.csv("Q3.csv")
Q4 <- read.csv("Q4.csv")
sales <- rbind(Q1, Q2, Q3, Q4)
head(sales)
We don't know which rows came from which CSV, so we add another column.
Notice: we do this before binding
Q1 <- read.csv("Q1.csv")
Q1$quarter <- 'Q1'
Q2 <- read.csv("Q2.csv")
Q2$quarter <- 'Q2'
Q3 <- read.csv("Q3.csv")
Q3$quarter <- 'Q3'
Q4 <- read.csv("Q4.csv")
Q4$quarter <- 'Q4'
sales <- rbind(Q1, Q2, Q3, Q4)
head(sales)
sales$value <- ifelse(sales$sale_amount > 100,
  'high value', 'regular')
14 Applying Functions
An introduction to functional programming.
Remember our code to get votes.
When something is repeated, it's a sign that we should write our own function
get_votes <- function() {
votes <- as.integer(readline("Enter Votes: "))
return(votes)
}
Remember the difference between return and print. Let's use our function
mario <- get_votes()
peach <- get_votes()
bowser <- get_votes()
total <- sum(mario, peach, bowser)
cat("total votes:", total)
Yay, we wrote our first function. Let's make it fancier by prompting the user
get_votes <- function(prompt) {
votes <- as.integer(readline(prompt))
return(votes) #optional
}
mario <- get_votes("Mario votes: ")
peach <- get_votes("peach votes: ")
bowser <- get_votes("bowser votes: ")
total <- sum(mario, peach, bowser)
cat("total votes:", total)
Note: in R, functions return the last computed value by default, so there is no need to write return(votes) explicitly.
Now, let's add a default value
get_votes <- function(prompt = "Enter votes") {
votes <- as.integer(readline(prompt))
}
mario <- get_votes("Mario votes: ")
peach <- get_votes(prompt = "peach votes: ")
bowser <- get_votes()
total <- sum(mario, peach, bowser)
cat("total votes:", total)
If we don't specify the prompt, the function uses the default. We can pass the argument by name (prompt = ...) or by position
14.1 Scope
Notice: votes does not appear in the environment because it exists only in the function's scope. The same goes for prompt
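A tiny sketch of scope, using a hypothetical function double_it: the variable doubled lives only inside the function, so it is not visible afterwards. Only the returned value escapes.

```r
# Hypothetical function, for illustration only
double_it <- function(x) {
  doubled <- x * 2  # exists only while the function runs
  doubled
}

result <- double_it(21)
result             # 42
exists("doubled")  # FALSE: doubled never reached our environment
```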
Let's think defensively: what if we enter a name instead of a number?
It will break our total, because as.integer gives NA with a warning. We can deal with it using an if statement
get_votes <- function(prompt = "Enter votes") {
votes <- suppressWarnings(as.integer(readline(prompt)))
if (is.na(votes)) {
return(0)
} else {
return(votes)
}
}
# entering duck, duck, duck will return 0
mario <- get_votes("Mario votes: ")
peach <- get_votes(prompt = "peach votes: ")
bowser <- get_votes()
total <- sum(mario, peach, bowser)
cat("total votes:", total)
suppressWarnings means: don't print the warning, I will handle it myself.
Remember this is R: we can also use the function ifelse
get_votes <- function(prompt = "Enter votes") {
votes <- suppressWarnings(as.integer(readline(prompt)))
ifelse(is.na(votes), 0, votes)
}
mario <- get_votes("Mario votes: ")
peach <- get_votes(prompt = "peach votes: ")
bowser <- get_votes()
total <- sum(mario, peach, bowser)
cat("total votes:", total)
14.2 Loops
Repeating something multiple times is called looping.
Remember, the whole point is to avoid repeating code.
14.3 repeat
i <- 3
repeat {
cat("quack\n")
i <- i - 1
if (i == 0) {
break
}
}
Each pass through the loop is called an iteration. repeat can easily create infinite loops.
We have the keywords break and next.
We can also use next like this (redundant though)
i <- 3
repeat {
cat("quack\n")
i <- i - 1
if (i == 0) {
break
} else{
next
}
}
14.4 while
Finally, our while loop
i <- 3
while (i != 0) {
cat("quack\n")
i <- i - 1
}
The main difference is that while checks the condition before each iteration.
repeat goes on indefinitely, so we have to use break.
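One consequence of that difference, sketched with made-up counters: when there is nothing to do, while runs zero times, but repeat still runs its body once before it reaches the break.

```r
# Start with nothing left to count down
remaining <- 0

while_runs <- 0
i <- remaining
while (i > 0) {          # condition checked first, so the body is skipped
  while_runs <- while_runs + 1
  i <- i - 1
}

repeat_runs <- 0
i <- remaining
repeat {
  repeat_runs <- repeat_runs + 1  # body runs before any check
  i <- i - 1
  if (i <= 0) {
    break
  }
}

while_runs   # 0
repeat_runs  # 1
```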
14.5 for
iterate a certain number of times
for (i in c(1, 2, 3)) {
cat("quack\n")
}
Remember we can create a range using :
for (i in 1:3) {
cat("quack\n")
}
14.6 count.R
Let's make our votes program better
get_votes <- function(prompt = "Enter votes: ") {
repeat {
votes <- suppressWarnings(as.integer(readline(prompt)))
if (!is.na(votes)) {
return(votes)
}
}
}
mario <- get_votes("Mario: ")
peach <- get_votes("Peach: ")
bowser <- get_votes("Bowser: ")
total <- sum(mario, peach, bowser)
cat("Total votes:", total)
Notice: return exits the loop (and the function), like break
Notice 2: we can loop over characters
get_votes <- function(prompt = "Enter votes: ") {
repeat {
votes <- suppressWarnings(as.integer(readline(prompt)))
if (!is.na(votes)) {
return(votes)
}
}
}
total <- 0
for (i in c("Mario", "Peach", "Bowser")) {
votes <- get_votes(paste0(i, ":"))
total <- total + votes
}
print(total)
Notice how we use total to accumulate the sum over the loop
14.7 tabulate.R
We can sum over rows or columns using loops
votes <- read.csv("votes.csv")
votes
We can loop over rows like this
total_votes <- c()
for (candidate in rownames(votes)) {
total_votes[candidate] <- sum(votes[candidate, ])
}
total_votes
Notice how we initialize an empty vector and populate it
14.8 apply
Like map in Python: it loops and applies a function at the same time
MARGIN = 1: rows
MARGIN = 2: columns
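The two margins can be tried on a small made-up data frame. This sketch assumes the candidates are stored as row names, so that every column is numeric; votes_demo and its counts are invented.

```r
# Made-up counts; candidates are the row names so every column is numeric
votes_demo <- data.frame(
  poll = c(37, 42, 28),
  mail = c(10, 15, 12),
  row.names = c("Mario", "Peach", "Bowser")
)

# The explicit loop over rows
by_loop <- c()
for (candidate in rownames(votes_demo)) {
  by_loop[candidate] <- sum(votes_demo[candidate, ])
}

# The same totals with apply
by_row <- apply(votes_demo, MARGIN = 1, FUN = sum)     # one total per candidate
by_column <- apply(votes_demo, MARGIN = 2, FUN = sum)  # one total per column
```

by_row matches the loop result, with one named total per candidate; by_column gives one total for poll and one for mail.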
apply(votes, MARGIN = 1, FUN = sum)
apply(votes, MARGIN = 2, FUN = sum)
15 Tidying Data
15.1 Tidyverse and dplyr
Data are messy; there are already packages to clean them
library("tidyverse")
dplyr has some useful functions
- select
- filter
- arrange
- distinct
- group_by
- summarize
15.2 storms.R
storms is a dataset that comes with dplyr; it is stored as a tibble
storms
A tibble is an enhanced version of a data frame
head(storms)
This shows the first 6 rows.
Anyway, we want to remove some columns from the dataset
head(dplyr::select(
storms,
!c(lat, long, pressure)
))
select comes with useful helper functions
- contains
- ends_with
- starts_with
head(dplyr::select(
storms,
!c(lat, long, pressure, ends_with("diameter"))
))
filter keeps only the rows that match a condition
head(filter(
select(
storms,
!c(lat, long, pressure, ends_with("diameter"))
),
status == "hurricane"
))
Notice that the dplyr:: prefix is optional once the package is loaded
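Here is a sketch on a tiny made-up data frame (storms_demo is invented, shaped loosely like storms) showing that the calls give the same result with or without the prefix. One reason to keep the prefix anyway: base R's stats package also has a function named filter, so dplyr::filter removes any doubt about which one is meant.

```r
library(dplyr)

# Made-up rows loosely in the shape of the storms data
storms_demo <- data.frame(
  name = c("Amy", "Bob", "Cara"),
  status = c("hurricane", "tropical storm", "hurricane"),
  wind = c(90, 45, 110),
  lat = c(27.5, 30.1, 25.0)
)

with_prefix <- dplyr::filter(
  dplyr::select(storms_demo, !c(lat)),
  status == "hurricane"
)
without_prefix <- filter(
  select(storms_demo, !c(lat)),
  status == "hurricane"
)

identical(with_prefix, without_prefix)  # TRUE
```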
15.3 Pipe operator
Instead of nesting, we can use piping
- |>
- %>%
The first one is built into R (version 4.1 and later), so it needs no package
storms |>
select(!c(lat, long, pressure, ends_with("diameter"))) |>
filter(status == "hurricane") |>
head()
15.4 arrange
We can sort our data in ascending or descending order. If there are ties, we can break them with a second column
storms |>
select(!c(lat, long, pressure, ends_with("diameter"))) |>
filter(status == "hurricane") |>
arrange(desc(wind), name) |>
head()
15.5 distinct
If we want to remove duplicate rows
storms |>
select(!c(lat, long, pressure, ends_with("diameter"))) |>
filter(status == "hurricane") |>
arrange(desc(wind), name) |>
distinct(name, year, .keep_all = TRUE) |>
head()
15.6 group_by
Let's save our data first before exploring grouping
hurricanes <- storms |>
select(!c(lat, long, pressure, ends_with("diameter"))) |>
filter(status == "hurricane") |>
arrange(desc(wind), name) |>
distinct(name, year, .keep_all = TRUE)
hurricanes |>
select(c(year, name, wind)) |>
write.csv("hurricanes.csv", row.names = FALSE)
Let's group the data. Grouping is usually done near the top of the pipeline
data <- read.csv("hurricanes.csv")
data |>
group_by(year) |>
arrange(desc(wind)) |>
slice_head()
slice_head returns the first row of each group. We have
- slice_head
- slice_tail
- slice_max
- slice_min
slice_max picks the rows with the largest value, so there is no need to arrange first.
To use it, we need to specify order_by
data |>
group_by(year) |>
slice_max(order_by = wind)
15.7 summarize
To get the number of hurricanes per year, we use summarize
data |>
group_by(year) |>
summarize(n())
But the column header will be n().
We can rename the column
data |>
group_by(year) |>
summarize(hurricanes = n())
Notice: after verbs like slice_max the data stay grouped, which may be a bad idea if we want to reuse the data later.
So: ungroup when done
data |>
group_by(year) |>
slice_max(order_by = wind) |>
ungroup()
15.8 Tidy data
Tidy data follow only 3 rules
- each observation is a row; each row is an observation
- each variable is a column; each column is a variable
- each value is a cell; each cell is a single value
When data is not tidy, we normalize it.
The students dataset is untidy: its attribute column should be 2 columns.
After some reshaping, it is clean
15.9 tidyr pivot_wider
To turn the values of attribute into columns, the data becomes wider, hence pivot_wider
students <- pivot_wider(
students,
id_cols = student,
names_from = attribute,
values_from = value
)
Notice how we use student as the id column.
Now that the data is clean, we can easily work with it,
but notice that we need to fix the types first
students$GPA <- as.numeric(students$GPA)
students |>
group_by(major) |>
summarise(GPA = mean(GPA))
If a value is missing, it will be filled in as NA.
There is also pivot_longer, which goes the other way
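A minimal sketch of pivot_longer on a made-up wide table (the students, GPA values, and the credits column are all invented for illustration). It gathers the chosen columns back into an attribute/value pair.

```r
library(tidyr)

# Invented wide data: one row per student, one column per attribute
wide <- data.frame(
  student = c("Ali", "Sara"),
  GPA = c(3.5, 3.9),
  credits = c(30, 45)
)

long <- pivot_longer(
  wide,
  cols = c(GPA, credits),
  names_to = "attribute",   # old column names go here
  values_to = "value"       # old cell values go here
)

nrow(long)   # 4: two attributes for each of the two students
names(long)  # "student" "attribute" "value"
```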
15.10 stringr
We can clean strings, aka text
show <- read.csv("../shows.csv")
head(show)
show |>
group_by(show) |>
summarise(votes = n()) |>
ungroup() |>
arrange(desc(votes))
If we have extra white space
- str_trim
- str_squish
show$show <- show$show |>
str_trim() |>
str_squish() |>
str_to_title()
show$show[str_detect(show$show, "Avatar")] <- 'Avatar: the last airbender'
show |>
group_by(show) |>
summarise(votes = n()) |>
ungroup() |>
arrange(desc(votes))
str_trim removes white space at the beginning and end.
str_squish also collapses repeated white space inside the text.
To deal with capitalization
- str_to_lower
- str_to_upper
- str_to_title
If we have both "avatar" and "avatar the last airbender", we use
str_detect
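The whole cleaning pass can be tried on a few made-up titles (raw_titles is invented; it is not the real shows.csv). Each step fixes one kind of mess, and str_detect then lets us merge the variants of one show.

```r
library(stringr)

# Invented, messy answers to "what is your favorite show?"
raw_titles <- c("  avatar   the last airbender ", "AVATAR", "the office")

clean_titles <- raw_titles |>
  str_trim() |>      # drop white space at both ends
  str_squish() |>    # collapse repeated spaces inside
  str_to_title()     # consistent capitalization
clean_titles
# "Avatar The Last Airbender" "Avatar" "The Office"

# Treat every title that mentions Avatar as the same show
merged_titles <- clean_titles
merged_titles[str_detect(merged_titles, "Avatar")] <- "Avatar: The Last Airbender"
unique(merged_titles)
# "Avatar: The Last Airbender" "The Office"
```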
16 Visualizing Data
16.1 Grammar of graphics
ggplot2, based on the grammar of graphics, helps us visualize data.
Remember from our first week:
the data is the poll results,
then we have geometries
like columns, points, lines,
aka bar charts, scatter plots, line charts.
Columns work well with the poll data.
And for aesthetic mappings:
what goes on the x axis and the y axis?
Candidates on the x axis, votes on the y axis
16.2 votes.R
votes <- read.csv('votes.csv')
library('tidyverse')
ggplot()
This creates an empty canvas
ggplot(votes)
Still an empty canvas
ggplot(votes) +
geom_col()
Imagine this as adding a layer on the empty canvas. We still need to specify the x axis and y axis
ggplot(votes, aes(x = candidate, y = votes)) +
geom_col()
Finally, our code is complete. Notice the order of the bars differs from the data:
ggplot orders the columns alphabetically. This can be changed
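One common way to change the order, sketched here on made-up counts (votes_demo is invented): turn the candidate column into a factor whose levels are in the order we want. ggplot2 then follows the factor levels instead of the alphabet.

```r
library(ggplot2)

# Invented vote counts, in the order we want the bars to appear
votes_demo <- data.frame(
  candidate = c("Mario", "Peach", "Bowser"),
  votes = c(100, 150, 120)
)

# Fix the order by making candidate a factor with explicit levels
votes_demo$candidate <- factor(
  votes_demo$candidate,
  levels = c("Mario", "Peach", "Bowser")
)

p <- ggplot(votes_demo, aes(x = candidate, y = votes)) +
  geom_col()
```

Without the factor step, Bowser would come first because B sorts before M and P.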
16.3 Scales
What if we want to change the range of the y axis?
We have two types of scales: continuous and discrete.
Continuous scales have limits.
ggplot(votes, aes(x = candidate, y = votes)) +
geom_col() +
scale_y_continuous(limits = c(0,250))
16.4 Labs
We can add titles, change labels, etc.
ggplot(votes, aes(x = candidate, y = votes)) +
geom_col() +
scale_y_continuous(limits = c(0,250)) +
labs(
x = 'Candidate',
y = 'Votes',
title = 'Election Results'
)
16.5 Color filling and themes
We can add colors
ggplot(votes, aes(x = candidate, y = votes)) +
geom_col(aes(fill = candidate)) +
scale_y_continuous(limits = c(0,250)) +
labs(
x = 'Candidate',
y = 'Votes',
title = 'Election Results'
)
But some people have color blindness,
so we adjust our code to use the viridis palette
ggplot(votes, aes(x = candidate, y = votes)) +
geom_col(aes(fill = candidate)) +
scale_fill_viridis_d("Candidate") +
scale_y_continuous(limits = c(0,250)) +
labs(
x = 'Candidate',
y = 'Votes',
title = 'Election Results'
)
We can also use themes
ggplot(votes, aes(x = candidate, y = votes)) +
geom_col(aes(fill = candidate)) +
scale_fill_viridis_d("Candidate") +
scale_y_continuous(limits = c(0,250)) +
labs(
x = 'Candidate',
y = 'Votes',
title = 'Election Results'
) +
theme_classic()
Notice: the order of these layers does not change the plot, but it is best to keep it consistent.
To remove the legend
ggplot(votes, aes(x = candidate, y = votes)) +
geom_col(aes(fill = candidate), show.legend = FALSE) +
scale_fill_viridis_d("Candidate") +
scale_y_continuous(limits = c(0,250)) +
labs(
x = 'Candidate',
y = 'Votes',
title = 'Election Results'
) +
theme_classic()
Let's save this plot.
Notice: it's a good bar chart template
p <- ggplot(votes, aes(x = candidate, y = votes)) +
geom_col(aes(fill = candidate), show.legend = FALSE) +
scale_fill_viridis_d("Candidate") +
scale_y_continuous(limits = c(0,250)) +
labs(
x = 'Candidate',
y = 'Votes',
title = 'Election Results'
) +
theme_classic()
ggsave(
'votes.png',
plot = p,
width = 1200,
height = 900,
units = 'px'
)
16.6 Candy.R
Besides columns, we can also draw a scatter plot.
We have candies: the first one is more expensive than 92% of the others and has more sugar than 43% of them.
We can add points on the graph as a scatter plot.
Note: if two observations lie on the same point, what can we do?
- jitter, aka add small random noise
- use alpha (transparency)
Here is our data
Name <- c(
"Harshey's Milk Chocolate", "Resse's Peanut Butter Cup",
"Sour Patch Kids","Swedish Fish","Harshey's Special Dark"
)
price_percentile <- c(92, 65, 12, 76, 92)
sugar_percentile <- c(43, 72, 7, 60, 43)
candy <- data.frame(Name, price_percentile, sugar_percentile)
candy
ggplot(candy,
aes(x = price_percentile, y = sugar_percentile)
)
Still no points, only the axes
ggplot(candy, aes(
x = price_percentile,
y = sugar_percentile
)) +
geom_point()
Here is our picture, but we did not account for points that share the same values
ggplot(candy, aes(
x = price_percentile,
y = sugar_percentile
)) +
geom_jitter()
# notice the points here don't overlap
Problem solved.
Let's get fancy
ggplot(candy, aes(
x = price_percentile,
y = sugar_percentile
)) +
geom_jitter(
color = "darkorchid",
fill = "orchid",
shape = 21,
size = 3
) +
labs(
x = "Price",
y = "Sugar",
title = "price and sugar"
) +
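theme_classic()
Notice that we did not put color inside aes, because the color applies to all the points rather than being mapped from the data.
shape takes numbers (sadly), but 21 is good for scatter plots: it is a circle with a separate outline color and fill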
16.7 anita.R
Now, lets deal with data that changes with time, aka time series.
Since date is ordered, we plot the dots and connect them with a line
timestamps <- seq(as.POSIXct("2024-03-22 00:00:00"), by = "hour", length.out = 10)
wind_speed <- runif(10, min = 3, max = 10)
data <- data.frame(Timestamp = timestamps, Wind_Speed = wind_speed)
data
Here is our code
ggplot(data, aes(x = Timestamp, y = Wind_Speed)) +
geom_line(
linetype = 2,
linewidth = 0.5
) +
geom_point(
color = "deepskyblue4",
size = 2
) +
labs(
x = "Date",
y = "Wind Speed",
title = "time series "
) +
theme_classic()
Notice that we are using two geoms. The order of geoms matters:
here the points are drawn on top of the line.
When the wind speed goes above 6, it is considered a hurricane. We can mark that threshold with a horizontal line
ggplot(data, aes(x = Timestamp, y = Wind_Speed)) +
geom_line(
linetype = 1,
linewidth = 0.5
) +
geom_point(
color = "deepskyblue4",
size = 2
) +
geom_hline(
linetype = 3,
yintercept = 6
) +
labs(
x = "Date",
y = "Wind Speed",
title = "time series "
) +
theme_classic()
17 Testing Programs
17.1 Exceptions
Here is our function. We want to see how things can go wrong
average <- function(x) {
sum(x) / length(x)
}
If we pass characters instead of numbers, we get an error, aka an exception.
We need to handle it. Let's test whether the data is numeric or not
average <- function(x) {
if (!is.numeric(x)) {
return(NA)
}
sum(x) / length(x)
}
The user has no idea what is happening though, so we show a usage message
average <- function(x) {
if (!is.numeric(x)) {
message("'x', must be a numeric vector :>{")
return(NA)
}
sum(x) / length(x)
}
17.2 warning
A warning indicates a potential issue
average <- function(x) {
if (!is.numeric(x)) {
warning("'x', must be a numeric vector :>{")
return(NA)
}
sum(x) / length(x)
}
17.3 stop
stop raises an error and halts the code
average <- function(x) {
if (!is.numeric(x)) {
stop("'x', must be a numeric vector :>{")
}
sum(x) / length(x)
}
Now let's handle NA (not available) values too
average <- function(x) {
if (!is.numeric(x)) {
stop("'x', must be a numeric vector :>{")
}
if (any(is.na(x))) {
warning("'x' contains one or more NA values")
return(NA)
}
sum(x) / length(x)
}
average(c(1, 2, NA))
17.4 Unit tests
We can create a new file, test-average.R, to test our function
source('average.R')
test_average <- function() {
  if (average(c(1, 2, 3)) == 2) {
    cat("average passed test\n")
  } else {
    cat("average failed\n")
  }
  if (average(c(-1, 0, 1)) == 0) {
    cat("average passed test\n")
  } else {
    cat("average failed\n")
  }
  if (average(c(-1, -2, -3)) == -2) {
    cat("average passed test\n")
  } else {
    cat("average failed\n")
  }
}
test_average()
To import our function, we source the file in which it was defined.
17.5 testthat
What if we want to test many things? Using if for each one is really tedious.
Instead, we use testthat
library(testthat)
source('average.R')
test_that("'average' calculates mean", {
expect_equal(average(c(1, 2, 3)), 2)
expect_equal(average(c(-1, -2, -3)), -2)
expect_equal(average(c(-1, 0, 1)), 0)
})
To run the test, press ctrl enter, or use one of two other ways
source('test-average.R')
test_file('test-average.R')
Some people prefer to write the test before the function; others disagree.
We can divide our tests into chunks; each chunk tests something specific
test_that("'average' calculates mean", {
expect_equal(average(c(1, 2, 3)), 2)
expect_equal(average(c(-1, -2, -3)), -2)
expect_equal(average(c(-1, 0, 1)), 0)
})
test_that("'average' warns about NA in input", {
expect_warning(average(c(1, NA, 3)))
expect_warning(average(c(NA, NA, NA)))
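})
Running this code shows that we need to change the order of the checks in average: c(NA, NA, NA) is a logical vector, so the numeric check stops with an error before the NA check can warn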
average <- function(x) {
if (any(is.na(x))) {
warning("'x' contains one or more NA values")
return(NA)
}
if (!is.numeric(x)) {
stop("'x', must be a numeric vector :>{")
}
sum(x) / length(x)
}
Now let's test that our function returns NA
test_that("'average' calculates mean", {
expect_equal(average(c(1, 2, 3)), 2)
expect_equal(average(c(-1, -2, -3)), -2)
expect_equal(average(c(-1, 0, 1)), 0)
})
test_that("'average' warns about NA in input", {
expect_warning(average(c(1, NA, 3)))
expect_warning(average(c(NA, NA, NA)))
})
test_that("'average' returns NA with NA value", {
expect_equal(suppressWarnings(average(c(1, NA, 3))), NA)
})
We need suppressWarnings because our function raises a warning in this case.
We can also use expect_error
test_that("'average' calculates mean", {
expect_equal(average(c(1, 2, 3)), 2)
expect_equal(average(c(-1, -2, -3)), -2)
expect_equal(average(c(-1, 0, 1)), 0)
})
test_that("'average' warns about NA in input", {
expect_warning(average(c(1, NA, 3)))
expect_warning(average(c(NA, NA, NA)))
})
test_that("'average returns NA with NA value", {
expect_equal(suppressWarnings(average(c(1, NA, 3))), NA)
})
test_that("'average stops if x is non numeric", {
expect_error(average("quack"))
})
17.6 Floating point imprecision
Remember that
\[ \dfrac{0.1 + 0.5}{2} = 0.3 \]
but 0.3 cannot be stored exactly as a floating point number. To see the stored value
print(0.3, digits = 17)
17.7 tolerance
We allow a small error, aka tolerance
\[ 0.3 \pm 10^{-8} \]
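The same idea can be seen in base R with a classic example (the numbers 0.1 and 0.2 are chosen for illustration, they are not from average). An exact comparison fails, while a comparison within a tolerance succeeds.

```r
exact_match <- (0.1 + 0.2) == 0.3
exact_match   # FALSE: the two sides differ far past the decimal point

# Compare within a tolerance instead of exactly
close_enough <- isTRUE(all.equal(0.1 + 0.2, 0.3, tolerance = 1e-8))
close_enough  # TRUE

# The same check written by hand
by_hand <- abs((0.1 + 0.2) - 0.3) < 1e-8
by_hand       # TRUE
```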
test_that("average calculates mean", {
expect_equal(average(c(0.1, 0.5)), 0.3, tolerance = 1e-8)
})
17.8 test driven development
Some people prefer to write the test before writing the code.
test_greet.R
We write the test first to describe what we expect our code to do
# source("greet.R")
test_that("'greet says hello to a user", {
expect_equal(greet("Carter"), "hello, Carter")
expect_equal(greet("Mario"), "hello, Mario")
})
Now let's write greet
greet <- function(to) {
  return(paste("hello,", to))
}
Remember that this is an iterative process
17.9 Behavior-Driven Development
We describe the behavior in words
describe("greet()", {
it("can say hello to a user", {
name <- "Carter"
expect_equal(greet(name), "hello, Carter")
})
})
It's like using English to explain the behavior of the function
18 Packaging Programs
18.1 ducksay
Let's build our own package. Like cowsay in Python, let's make ducksay in R
dir.create("ducksay")
setwd("ducksay/")
18.2 package structure
- DESCRIPTION
- NAMESPACE
- man
- R
- tests
18.3 description
A DESCRIPTION file should have:
- Package
- Title
- Description
- Version
- Authors@R
- License
file.create("DESCRIPTION")
Package: ducksay
Title: Duck Say
Description: Say hello with a ducksay
Version: 1.0
Authors@R: person("Ahmed",
    "Darwish",
    email = "ahmedh457@gmail.com",
    role = c("aut", "cre", "cph")
    )
License: MIT + file LICENSE
aut: the author, who wrote the package
cre: the creator, who maintains the package
cph: the copyright holder
For the license, MIT is a good choice.
"+ file LICENSE" points to a LICENSE file where we fill in our own details
18.4 license
file.create("LICENSE")
YEAR: ...
COPYRIGHT HOLDER: ducksay authors
18.5 test
We will create the tests before writing the function,
so we can't run the tests yet
library(devtools)
use_testthat()
This adds testthat to Suggests in DESCRIPTION and creates the tests folder
use_test('ducksay')
This creates the file test-ducksay.R
describe("ducksay()", {
it("can print to console with 'cat'", {
expect_output(cat(ducksay()))
})
it("can say hello to the world", {
expect_match(ducksay(), "hello, world")
})
it("can say hello with a duck", {
duck <- paste(
"hello, world",
">(. )--",
"(____/",
sep = "\n"
)
expect_match(ducksay(), duck, fixed = TRUE )
})
})
expect_match searches for the pattern in the output.
We add fixed = TRUE so the pattern is matched literally and not treated as a regular expression
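Why that matters for our duck can be sketched with base R's grepl, which takes the same fixed argument (this uses grepl as a stand-in for illustration; it is not the test itself). As a regular expression, the parentheses and the dot in the duck's head are special characters, so the pattern no longer means the literal text.

```r
duck_head <- ">(. )--"

# As a regular expression: ( ) form a group and . means "any character",
# so the pattern does not match its own literal text
as_regex <- grepl(duck_head, duck_head)
as_regex    # FALSE

# As a fixed string: every character is taken literally
as_literal <- grepl(duck_head, duck_head, fixed = TRUE)
as_literal  # TRUE
```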
18.6 ducksay.R
To create the file R/ducksay.R
use_r('ducksay')
We still can't test it until we export it
ducksay <- function() {
paste(
"hello, world",
">(. )--",
"(____/",
sep = "\n"
)
}
We still can't use it until we define the namespace
18.7 NAMESPACE
file.create("NAMESPACE")
In NAMESPACE, list the functions to export
export(ducksay)
Now it's available to the end user.
In the console, to use our function
load_all()
cat(ducksay())
test()
18.8 man
To add our own documentation, we write our own manual page in a .Rd file
dir.create("man")
file.create("man/ducksay.Rd")
\name{ducksay}
\alias{ducksay}
\title{Duck Say}
\description{A duck that says hello.}
\usage{
ducksay()
}
\value{
string representation of a duck saying hello to the world.
}
\examples{
cat(ducksay())
}
alias: what the user types to get the manual page
Let's use our manual
?ducksay
18.9 build
To build the package, we use build
build()
This creates a .tar.gz file
18.10 update package
Let's add more behavior: first the updated tests, then the updated function
describe("ducksay()", {
it("can print to console with 'cat'", {
expect_output(cat(ducksay()))
})
it("can say hello to the world", {
expect_match(ducksay(), "hello, world")
})
it("can say hello with a duck", {
duck <- paste(
"hello, world",
">(. )--",
"(____/",
sep = "\n"
)
expect_match(ducksay(), duck, fixed = TRUE )
})
it("can say any given phrase", {
expect_match(ducksay("quack!"), "quack!")
})
})
ducksay <- function(phrase = "hello, world") {
paste(
phrase,
">(. )--",
"(____/",
sep = "\n"
)
}
Now let's use it
load_all()
cat(ducksay('quack'))
test()
build()
18.11 Using the package
Let's create a new program that uses our package
setwd('..')
file.create('greet.R')
library(ducksay) # not found yet: the package must be installed first
install.packages('ducksay_1.0.tar.gz')
library(ducksay)
name <- readline("What's your name? ")
greeting <- ducksay(paste("hello", name))
cat(greeting)
To share the code
- cran
- github