tidyr
tidyr (gather, spread) takes the place of reshape2 (melt, cast)
Wide data
Wide data has a column for each variable. wide data are more readable
# ozone wind temp
# 1 23.62 11.623 65.55
# 2 29.44 10.267 79.10
# 3 59.12 8.942 83.90
Long data
Long-format data has a column for possible variable types and a column for the values of those variables. long data are more convenient for data analysis
Long-format data isn’t necessarily only two columns. In other words, there are different levels of “longness”.
# variable value
# 1 ozone 23.615
# 2 ozone 29.444
# 3 ozone 59.115
# 4 wind 11.623
# 5 wind 10.267
# 6 wind 8.942
# 7 temp 65.548
# 8 temp 79.100
# 9 temp 83.903
There are four fundamental functions of data tidying:
- gather(): takes multiple columns, and gathers them into key-value pairs: it makes “wide” data longer.
- spread(): takes two columns (key & value) and spreads in to multiple columns, it makes “long” data wider
- separate() splits a single column into multiple columns
- unite() combines multiple columns into a single column
#install.packages("tidyr")
suppressPackageStartupMessages(library(tidyr))
gather()
: Reshaping wide format to long format
gather(data, key, value, ..., na.rm = FALSE, convert = FALSE)
= data %>% gather(key, value, ..., na.rm = FALSE, convert = FALSE)
data: data frame
key: column name representing new variable
value: column name representing variable values
...: names of columns to gather (or not gather)
na.rm: whether remove observations with missing values
convert: if TRUE will automatically convert values to logical, integer, numeric, complex or factor as appropriate
head(airquality)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
long_DF <- airquality %>% gather(characters, measurements, Ozone:Temp)
#long_DF <- airquality %>% gather(characters, measurements, 1:4) #same results
#long_DF <- airquality %>% gather(characters, measurements, -Month, -Day) #same results
#long_DF <- airquality %>% gather(characters, measurements, Ozone, Solar.R, Wind, Temp) #same results
head(long_DF)
## Month Day characters measurements
## 1 5 1 Ozone 41
## 2 5 2 Ozone 36
## 3 5 3 Ozone 12
## 4 5 4 Ozone 18
## 5 5 5 Ozone NA
## 6 5 6 Ozone 28
spread( )
: Reshaping long format to wide format
spread(data, key, value, fill = NA, convert = FALSE)
= data %>% spread(key, value, fill = NA, convert = FALSE)
data: data frame
key: column values to convert to multiple columns
value: single column values to convert to multiple columns' values
fill: If there isn't a value for every combination of the other variables and the key column, this value will be substituted
convert: if TRUE will automatically convert values to logical, integer, numeric, complex or factor as appropriate
wide_DF <- long_DF %>% spread(characters, measurements) #opposite to gather()
unite()
: Merging two variables into one
unite(data, col, ..., sep = " ", remove = TRUE)
= data %>% unite(col, ..., sep = " ", remove = TRUE)
data: data frame
col: column name of new "merged" column
...: names of columns to merge
sep: separator to use between merged values
remove: if TRUE, remove input column from output data frame
create some fake data:
set.seed(1)
date <- as.Date('2016-01-01') + 0:14
hour <- sample(1:24, 15)
min <- sample(1:60, 15)
second <- sample(1:60, 15)
event <- sample(letters, 15)
data <- data.frame(date, hour, min, second, event)
head(data)
## date hour min second event
## 1 2016-01-01 7 30 29 u
## 2 2016-01-02 9 43 36 a
## 3 2016-01-03 13 58 60 l
## 4 2016-01-04 20 22 11 q
## 5 2016-01-05 5 44 47 p
## 6 2016-01-06 18 52 37 k
unite_DF <- data %>%
unite(datehour, date, hour, sep = ' ') %>%
unite(datetime, datehour, min, second, sep = ':')
head(unite_DF)
## datetime event
## 1 2016-01-01 7:30:29 u
## 2 2016-01-02 9:43:36 a
## 3 2016-01-03 13:58:60 l
## 4 2016-01-04 20:22:11 q
## 5 2016-01-05 5:44:47 p
## 6 2016-01-06 18:52:37 k
separate()
: Splitting a single variable into two
separate(data, col, into, sep = " ", remove = TRUE, convert = FALSE)
= data %>% separate(col, into, sep = " ", remove = TRUE, convert = FALSE)
data: data frame
col: column name representing current variable
into: names of variables representing new variables
sep: how to separate current variable (char, num, or symbol).
If no spearator is identified, "_" will automatically be used
remove: if TRUE, remove input column from output data frame
convert: if TRUE will automatically convert values to logical, integer, numeric, complex or factor as appropriate
separate_DF <- unite_DF %>%
separate(datetime, c('date', 'time'), sep = ' ') %>%
separate(time, c('hour', 'min', 'second'), sep = ':')
head(separate_DF)
## date hour min second event
## 1 2016-01-01 7 30 29 u
## 2 2016-01-02 9 43 36 a
## 3 2016-01-03 13 58 60 l
## 4 2016-01-04 20 22 11 q
## 5 2016-01-05 5 44 47 p
## 6 2016-01-06 18 52 37 k