tidyr

tidyr (gather(), spread()) takes the place of reshape2 (melt(), dcast()).

Wide data

Wide data has a column for each variable, which makes it easier to read.

#   ozone   wind  temp
# 1 23.62 11.623 65.55
# 2 29.44 10.267 79.10
# 3 59.12  8.942 83.90

Long data

Long-format data has one column that names the variable and one column that holds its value. Long data is more convenient for data analysis.
Long-format data doesn’t have to be only two columns. There are different levels of “longness”, depending on how many columns are kept as identifiers instead of being stacked.

#    variable  value
# 1     ozone 23.615
# 2     ozone 29.444
# 3     ozone 59.115
# 4      wind 11.623
# 5      wind 10.267
# 6      wind  8.942
# 7      temp 65.548
# 8      temp 79.100
# 9     temp 83.903
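
As a minimal sketch of the two layouts, the wide table above can be rebuilt by hand and stacked with gather(), which is introduced below. The data frame name `wide` and the variables `long` and `semi_long` are made up for this illustration; the values are copied from the wide example.

```r
library(tidyr)

# Small wide table mirroring the example above
wide <- data.frame(
  ozone = c(23.62, 29.44, 59.12),
  wind  = c(11.623, 10.267, 8.942),
  temp  = c(65.55, 79.10, 83.90)
)

# Fully long: every column is stacked into variable/value pairs
long <- gather(wide, variable, value, ozone, wind, temp)
dim(long)        # 9 rows, 2 columns

# Partially long: keep temp as its own column and stack only ozone and wind
semi_long <- gather(wide, variable, value, ozone, wind)
dim(semi_long)   # 6 rows, 3 columns
```

The second call is one of the intermediate levels of “longness”: temp stays a regular column while the other two variables are stacked.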

tidyr provides four fundamental functions for data tidying:

  • gather(): takes multiple columns and gathers them into key-value pairs; it makes “wide” data longer.
  • spread(): takes two columns (key and value) and spreads them into multiple columns; it makes “long” data wider.
  • separate(): splits a single column into multiple columns.
  • unite(): combines multiple columns into a single column.

# install.packages("tidyr")
suppressPackageStartupMessages(library(tidyr))

gather(): Reshaping wide format to long format

gather(data, key, value, ..., na.rm = FALSE, convert = FALSE)   
= data %>% gather(key, value, ..., na.rm = FALSE, convert = FALSE)   
  
  data: data frame
  key: name of the new column that holds the former column names
  value: name of the new column that holds the values
  ...: columns to gather (or, with a leading -, columns to leave out)
  na.rm: if TRUE, drop rows whose value is NA
  convert: if TRUE, automatically run type.convert() on the key column (useful when the gathered column names are really numbers or logicals)

head(airquality)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6
long_DF <- airquality %>% gather(characters, measurements, Ozone:Temp) 
#long_DF <- airquality %>% gather(characters, measurements, 1:4) #same results
#long_DF <- airquality %>% gather(characters, measurements, -Month, -Day) #same results 
#long_DF <- airquality %>% gather(characters, measurements, Ozone, Solar.R, Wind, Temp) #same results
head(long_DF)
##   Month Day characters measurements
## 1     5   1      Ozone           41
## 2     5   2      Ozone           36
## 3     5   3      Ozone           12
## 4     5   4      Ozone           18
## 5     5   5      Ozone           NA
## 6     5   6      Ozone           28
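
The na.rm argument is easiest to see by counting rows. The sketch below uses the same airquality call as above; the names `long_all`, `long_seen` and `n_missing` are made up for this illustration.

```r
library(tidyr)

long_all  <- gather(airquality, characters, measurements, Ozone:Temp)
long_seen <- gather(airquality, characters, measurements, Ozone:Temp, na.rm = TRUE)

nrow(long_all)   # 153 days x 4 gathered columns = 612 rows

# The rows dropped by na.rm = TRUE are exactly the missing cells
# in the four gathered columns
n_missing <- sum(is.na(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]))
nrow(long_all) - nrow(long_seen) == n_missing   # TRUE
```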

spread(): Reshaping long format to wide format

spread(data, key, value, fill = NA, convert = FALSE)
= data %>% spread(key, value, fill = NA, convert = FALSE)

  data: data frame
  key: column whose values become the names of the new columns
  value: column whose values fill the new columns
  fill: value to substitute when a combination of the other variables and the key column has no row
  convert: if TRUE, automatically run type.convert() on each of the new columns

wide_DF <- long_DF %>% spread(characters, measurements) #opposite to gather()
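
A quick way to check the round trip, and to see what fill does, is sketched below. The toy data frame `obs` and the result names `wide_na` and `wide_zero` are hypothetical and only serve the illustration.

```r
library(tidyr)

long_DF <- gather(airquality, characters, measurements, Ozone:Temp)
wide_DF <- spread(long_DF, characters, measurements)

# Same shape and same columns as airquality; only the column order may differ
dim(wide_DF)     # 153 rows, 6 columns
names(wide_DF)

# Toy data: station "b" has no "wind" row
obs <- data.frame(
  station  = c("a", "a", "b"),
  variable = c("ozone", "wind", "ozone"),
  value    = c(41, 7.4, 36),
  stringsAsFactors = FALSE
)
wide_na   <- spread(obs, variable, value)            # missing cell becomes NA
wide_zero <- spread(obs, variable, value, fill = 0)  # missing cell becomes 0
```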

unite(): Merging multiple variables into one

unite(data, col, ..., sep = "_", remove = TRUE)
= data %>% unite(col, ..., sep = "_", remove = TRUE)

  data: data frame
  col: column name of new "merged" column
  ...: names of columns to merge
  sep: separator to use between merged values (the default is "_")
  remove: if TRUE, remove input columns from output data frame

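Create some fake data:

set.seed(1)  # sample() changed in R 3.6.0, so newer R versions draw different values than the output shown below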
date <- as.Date('2016-01-01') + 0:14
hour <- sample(1:24, 15)
min <- sample(1:60, 15)
second <- sample(1:60, 15)
event <- sample(letters, 15)
data <- data.frame(date, hour, min, second, event)
head(data)
##         date hour min second event
## 1 2016-01-01    7  30     29     u
## 2 2016-01-02    9  43     36     a
## 3 2016-01-03   13  58     60     l
## 4 2016-01-04   20  22     11     q
## 5 2016-01-05    5  44     47     p
## 6 2016-01-06   18  52     37     k
unite_DF <- data %>%
  unite(datehour, date, hour, sep = ' ') %>%
  unite(datetime, datehour, min, second, sep = ':')   
head(unite_DF)
##              datetime event
## 1  2016-01-01 7:30:29     u
## 2  2016-01-02 9:43:36     a
## 3 2016-01-03 13:58:60     l
## 4 2016-01-04 20:22:11     q
## 5  2016-01-05 5:44:47     p
## 6 2016-01-06 18:52:37     k
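
The sep and remove arguments can be tried on a smaller table. This is only a sketch on hypothetical toy data (`df`, `u_default` and `u_keep` are made-up names), not part of the example above.

```r
library(tidyr)

# Toy data for illustration
df <- data.frame(
  year  = c(2016, 2016),
  month = c(1, 2),
  event = c("u", "a"),
  stringsAsFactors = FALSE
)

# With no sep given, values are joined with "_"
u_default <- unite(df, period, year, month)
u_default$period   # "2016_1" "2016_2"

# remove = FALSE keeps the input columns alongside the new one
u_keep <- unite(df, period, year, month, sep = "-", remove = FALSE)
names(u_keep)
```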

separate(): Splitting a single variable into multiple variables

separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE)
= data %>% separate(col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE)

  data: data frame
  col: name of the column to split
  into: character vector of names for the new columns
  sep: how to split the column. A character value is treated as a regular expression; a numeric value is a position to split at.
       If no separator is given, the column is split at any run of non-alphanumeric characters.
  remove: if TRUE, remove input column from output data frame
  convert: if TRUE, automatically run type.convert() on the new columns

separate_DF <- unite_DF %>% 
  separate(datetime, c('date', 'time'), sep = ' ') %>% 
  separate(time, c('hour', 'min', 'second'), sep = ':')
head(separate_DF)
##         date hour min second event
## 1 2016-01-01    7  30     29     u
## 2 2016-01-02    9  43     36     a
## 3 2016-01-03   13  58     60     l
## 4 2016-01-04   20  22     11     q
## 5 2016-01-05    5  44     47     p
## 6 2016-01-06   18  52     37     k
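
One thing the printed output hides is that separate() returns character columns unless convert = TRUE. A minimal sketch on hypothetical toy data (`df`, `as_text`, `as_int` and `parts` are made-up names):

```r
library(tidyr)

# Toy data in the same shape as unite_DF
df <- data.frame(
  datetime = c("2016-01-01 7:30:29", "2016-01-02 9:43:36"),
  event    = c("u", "a"),
  stringsAsFactors = FALSE
)

as_text <- df %>%
  separate(datetime, c("date", "time"), sep = " ") %>%
  separate(time, c("hour", "min", "second"), sep = ":")
class(as_text$hour)   # "character"

as_int <- df %>%
  separate(datetime, c("date", "time"), sep = " ") %>%
  separate(time, c("hour", "min", "second"), sep = ":", convert = TRUE)
class(as_int$hour)    # "integer"

# With the default sep, any run of non-alphanumeric characters is a split point
parts <- separate(data.frame(x = "2016-01-01", stringsAsFactors = FALSE),
                  x, c("y", "m", "d"))
```

Converting at split time avoids a later as.integer() step when the pieces are used in arithmetic.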


CHENYUAN
