尘缘 – CHENYUAN

作为网络数据抓取中三种信息提取方式之一（另外两种分别是XPath表达式和CSS选择器），正则表达式最为底层，也是最难掌握的一种语法。当我们需要的数据以一定的规则隐藏在一段文字中时，就不可避免地要使用到正则表达式。

正则表达式 introduction

正则表达式就是使用一个字符串来描述、匹配一系列某个语法规则的字符串。通过特定的字母、数字以及特殊符号的灵活组合即可完成对任意字符串的匹配，从而达到提取相应文本信息的目的。在R语言中，有两种风格的正则表达式可以实现，一种就是在基本的正则表达式基础上进行扩展，这和相应的R字符串处理函数相关，另一种就是Perl正则表达式，这种风格的正则我们在R中一般不常用，本文主要还是针对R默认的基础的正则表达式风格进行讲解。

R默认的正则表达式风格包括基础文本处理函数(R base)和stringr包中的文本处理函数。在R中二者都支持正则表达式，也都具备基本的文本处理能力，但基础函数的一致性要弱很多，在函数命名和参数定义上很难让人印象深刻。stringr包是Hadley Wickham开发了一款专门进行文本处理的R包，它对基础的文本处理函数进行了扩展和整合，在一致性和易于理解性上都要优于基础函数。本文在介绍基本的正则表达式语法的基础上，通过R中这两种文本处理函数进行实例说明，也好让大家对R语言中正则表达式的基本用法有个大致了解，在后续的爬虫演练中更容易理解一些信息提取的细节知识。

正则表达式规则

正则表达式速查表

在线测试正则表达式的网页

元字符

正则表达式中，有12个字符被保留用作特殊用途。他们分别是：

元字符	用途
`[ ]`	括号内的任意字符将被匹配
`\`	1. 对元字符进行转义 2. 一些以\开头的特殊序列表达了一些字符串组
`^`	1. 匹配字符串的开始 2. 将^置于character class的首位表达的意思是取反义
`$`	1. 匹配字符串的结束 2. 将$置于character class内则消除了它的特殊含义
`.`	匹配除换行符以外的任意字符
`\|`	或者
`?`	前面的字符(组)最多被匹配一次
`*`	前面的字符(组)将被匹配零次或多次
`+`	前面的字符(组)将被匹配一次或多次
`( )`	表示一个字符组，括号内的字符串将作为一个整体被匹配

重复

代码	含义说明
`?`	重复零次或一次
`*`	重复零次或多次
`+`	重复一次或多次
`{n}`	重复n次
`{n,}`	重复n次或更多次
`{n,m}`	重复n次到m次

转义

如果我们想查找元字符本身，如”?”和”*“，我们需要提前告诉编译系统，取消这些字符的特殊含义。这个时候，就需要用到转义字符\，即使用\?和\*.当然，如果我们要找的是\,则使用\\进行匹配。

注：R中的转义字符是双斜杠：\\

R中预定义的字符组

代码	含义说明
`[:digit:]`	数字：0-9
`[:lower:]`	小写字母：a-z
`[:upper:]`	大写字母：A-Z
`[:alpha:]`	字母：a-z及A-Z
`[:alnum:]`	所有字母及数字
`[:punct:]`	标点符号，如`. , ;`等
`[:graph:]`	Graphical characters,即[:alnum:]和[:punct:]
`[:blank:]`	空字符，即：Space和Tab
`[:space:]`	Space，Tab，newline，及其他space characters
`[:print:]`	可打印的字符，即：[:alnum:]，[:punct:]和[:space:]

代表字符组的特殊符号

代码	含义说明
`\w`	字符串，等价于`[:alnum:]`
`\W`	非字符串，等价于`[^[:alnum:]]`
`\s`	空格字符，等价于`[:blank:]`
`\S`	非空格字符，等价于`[^[:blank:]]`
`\d`	数字，等价于`[:digit:]`
`\D`	非数字，等价于`[^[:digit:]]`
`\b`	Word edge（单词开头或结束的位置）
`\B`	No Word edge（非单词开头或结束的位置）
`\<`	Word beginning（单词开头的位置）
`\>`	Word end（单词结束的位置）

`stringr` package

stringr包一共为我们提供了30个字符串处理函数，其中大部分均可支持正则表达式的应用，包内所有函数均以str_开头，后面单词用来说明该函数的含义，相较于基础文本(R base)处理函数，stringr包函数更容易直观地理解。

可见，stringr包中的字符处理函数更丰富和完整（其实还有更多函数），并且更容易记忆。或许速度也会更快.

`str_extract()`&`str_extract_all`: 提取匹配的字符串

str_extract()和str_extract_all()区别在于: 前者只提取一次满足条件的匹配对象，而后者可以提取所有匹配对象

与str_match(), str_match_all()提取匹配字符串的功能类似

str_extract(string, pattern)
str_extract_all(string, pattern, simplify = FALSE)

string：需要处理的字符串
pattern：指定匹配的模式，一般指定正则表达式
simplify：默认返回值为列表，如果指定为TRUE，则返回矩阵格式，这样有助于将结果写入二维表中

library(stringr)
S1<- "1. A small sentence. - 2. Another tiny sentence."
str_extract(S1, "sentence") # 提取包含sentence特征的第一个字符串

## [1] "sentence"

unlist(str_extract_all(S1, "sentence")) # 提取包含sentence特征的全部字符串

## [1] "sentence" "sentence"

str_extract(S1, "^1") # 提取以1开始的字符串

## [1] "1"

unlist(str_extract_all(S1, ".$")) # 提取以句号结尾的字符

## [1] "."

unlist(str_extract_all(S1, "tiny|sentence")) # 提取包含tiny或者sentence特征的字符串

## [1] "sentence" "tiny"     "sentence"

str_extract(S1, "sm.ll")  # 点号进行模糊匹配

## [1] "small"

str_extract(S1, "sm[abc]ll")  # 中括号内表示可选字符串

## [1] "small"

str_extract_all(S1, "([[:alpha:]]).+?\\1") # 特定的字符

## [[1]]
## [1] "A small sentence. - 2. A" "nother tin"              
## [3] "ente"

unlist(str_extract_all(S1, "\\w+"))  # 提取全部单词字符

## [1] "1"        "A"        "small"    "sentence" "2"        "Another" 
## [7] "tiny"     "sentence"

`str_locate()`&`str_locate_all()`: 字符定位函数，返回匹配对象的首末位置

str_locate()和str_locate_all()的区别在于: 前者只匹配首次，而后者可以匹配所有可能的值

一般将定位函数与str_sub()函数搭配使用。

str_locate(string, pattern)
str_locate_all(string, pattern)

string：需要处理的字符串
pattern：需要匹配的对象，一般为正则表达式

S2 <- c('liushunxiang1989','zhangsan1234')
str_locate(S2,'n')

##      start end
## [1,]     7   7
## [2,]     4   4

str_locate_all(S2,'n')

## [[1]]
##      start end
## [1,]     7   7
## [2,]    11  11
## 
## [[2]]
##      start end
## [1,]     4   4
## [2,]     8   8

`str_replace()`&`str_replace_all()`: 字符串替换

str_replace与str_replace_all的区别在于: 前者只替换一次匹配的对象，而后者可以替换所有匹配的对象

str_replace(string, pattern, replacement)
str_replace_all(string, pattern, replacement)
str_replace_na(string, replacement = "NA")  # Turn NA into "NA"

string：需要处理的字符向量
pattern：指定匹配模式，
replacement：指定新的字符串用于替换匹配的模式

S3 <-'1989.07.17'
str_replace(S3, '\\.', '-')

## [1] "1989-07.17"

str_replace_all(S3, '\\.', '-')

## [1] "1989-07-17"

`str_split()`&`str_split_fixed`: 字符串分割

str_split与str_split_fixed的区别在于: 前者返回列表格式，后者返回矩阵格式

str_split(string, pattern, n = Inf)
str_split_fixed(string, pattern, n)

string：被分割的字符串向量
pattern：分割符，可以是正则表达式也可以是固定的字符
n：指定返回分割的个数(默认全部分割)，需要注意的是，其使用转移法分割字符串

S4 <- 'myxyznamexyzisxyzliuxyzshunxyzxiang!'
str_split(S4, 'xyz')

## [[1]]
## [1] "my"     "name"   "is"     "liu"    "shun"   "xiang!"

str_split(S4, 'xyz', n = 5) # 最后一组不会被分割

## [[1]]
## [1] "my"            "name"          "is"            "liu"          
## [5] "shunxyzxiang!"

str_split_fixed(S4, 'xyz', 6)

##      [,1] [,2]   [,3] [,4]  [,5]   [,6]    
## [1,] "my" "name" "is" "liu" "shun" "xiang!"

`str_detect()`: 检测字符串中是否存在某种匹配模式

str_detect(string, pattern)

string：检测的字符串对象
pattern：检测模式，可以是正则表达式

S5 <- c('LiuShunxiang','Zhangsan','Philips1990')
str_detect(S5,'^L')

## [1]  TRUE FALSE FALSE

str_detect(S5,'\\d')

## [1] FALSE FALSE  TRUE

str_detect(S5,'[a-zA-Z0-9]')

## [1] TRUE TRUE TRUE

`str_count()`: 计数匹配上的字符个数

str_count(string, pattern = "")

string：需要处理的字符串对象
pattern：指定匹配的模式，默认为""，计算每个字符串的长度

library(stringr)
S5 <- c('LiuShunxiang','Zhangsan','Philips1990')
str_count(S5)

## [1] 12  8 11

str_count(S5,'i')

## [1] 2 0 2

str_count(S5,'\\d')

## [1] 0 0 4

`word()`: 从句子中提取单词 (适用于英语环境)

word(string, start = 1L, end = start, sep = fixed(" "))

string：需要提取的字符串对象
start：整数向量，指定从第几个单词开始提取。默认从第一个单词开始
end：整数向量，指定取到第几个单词。默认到第一个单词
sep：指定单词之间的分隔符，默认为空格

S6 <- 'I like using R'
word(S6, 1, -1) #取出所有的单词

## [1] "I like using R"

word(S6, 1) #提取第一个单词

## [1] "I"

word(S6, -1) #提取最后一个单词

## [1] "R"

word(S6, 1, 1:4) #逐步提取所有单词

## [1] "I"              "I like"         "I like using"   "I like using R"

`str_sub()`: 根据正则表达式匹配字符串中的值

str_sub()与str_subset()有相同的作用。它们与word()函数的区别在于: 前者提取字符串的子串，后者提取的是单词

str_sub(string, pattern)

string，需要处理的字符向量
pattern，需要匹配的字符模式，默认模式可以是正则表达式

str_sub(S6, 1, 1)

## [1] "I"

str_sub(S6, 1, 4)

## [1] "I li"

str_sub(S6, -1)

## [1] "R"

str_sub(S6, -1, -1) <- "Python"
S6

## [1] "I like using Python"

`str_dup()`: 重复字符串

str_dup(string, times)

string：需要重复处理的字符串
times：指定重复的次数

fruit <- c("apple", "pear", "banana")
str_dup(fruit, 2)

## [1] "appleapple"   "pearpear"     "bananabanana"

str_dup(fruit, 1:3)

## [1] "apple"              "pearpear"           "bananabananabanana"

`str_c()`: 将多个字符串连接为单个字符串

str_c(..., sep = "", collapse = NULL)

...：一个或多个字符向量
sep：字符串之间的连接符，功能类似于paste()函数
collapse：如果是向量之间的连接，collapse的作用与sep一样，只不过此时sep无效

str_c("ba", str_dup("na", 0:5))

## [1] "ba"           "bana"         "banana"       "bananana"    
## [5] "banananana"   "bananananana"

str_c(c(1989,07,17), sep = '-')  # 向量使用sep

## [1] "1989" "7"    "17"

str_c(c(1989,07,17), collapse = '-')  # 向量使用collapse

## [1] "1989-7-17"

str_c('x',c(1:7),':') 

## [1] "x1:" "x2:" "x3:" "x4:" "x5:" "x6:" "x7:"

`str_pad()`: 字符填充

str_pad(string, width, side = c("left", "right", "both"), pad = " ")

string：需要被填充的字符串
width：指定被填充后的字符长度
side：指定填充的方向，默认向左填充
pad：指定填充的字符，默认用空格填充

S7 <- 'LiuShunxiang'
str_pad(S7, 10) #指定的长度少于string长度时，将只返回原string

## [1] "LiuShunxiang"

str_pad(S7, 20)

## [1] "        LiuShunxiang"

str_pad(S7, 20, side = 'both', pad = '*')

## [1] "****LiuShunxiang****"

`str_trim()`: 剔除字符串多余的首末空格

str_trim(string, side = c("both", "left", "right"))

string：需要处理的字符串
side：指定剔除空格的位置，both表示剔除首尾两端空格(默认)，left表示剔除字符串首部空格，right表示剔除字符串末尾空格

S8 <- '   Why is me? I have worded hardly!    '
str_trim(S8, side = 'left')

## [1] "Why is me? I have worded hardly!    "

str_trim(S8, side = 'right')

## [1] "   Why is me? I have worded hardly!"

str_trim(S8, side = 'both')

## [1] "Why is me? I have worded hardly!"

`str_order()`&`str_sort()`: 对字符向量排序

str_order()和str_sort()的区别在于: 前者返回排序后的索引（下标），后者返回排序后的实际值

str_order(x, decreasing = FALSE, na_last = TRUE, locale = "", ...)
str_sort(x, decreasing = FALSE, na_last = TRUE, locale = "", ...)

x：需要排序的字符向量
decreasing：排序方式，默认为升序
na_last：是否将缺失值置于末尾，默认为TRUE

str_order(letters, locale = "en")

##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26

str_sort(letters, locale = "en")

##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
## [18] "r" "s" "t" "u" "v" "w" "x" "y" "z"

`str_wrap()`: 将段落划分为华丽的格式，可设置每行的宽度等

str_wrap(string, width = 80, indent = 0, exdent = 0)

string：需要被划分的字符串
width：设定每行的宽度
indent：设定每个段落第一行的缩进格式，默认没有缩进
exdent：设定每个段落第一行之后所有行的缩进格式，默认没有缩进

S9 <- "R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS. To download R, please choose your preferred CRAN mirror. If you have questions about R like how to download and install the software, or what the license terms are, please read our answers to frequently asked questions before you send an email."

str_wrap(S9) #默认以80个字节作为行宽

## [1] "R is a free software environment for statistical computing and graphics. It\ncompiles and runs on a wide variety of UNIX platforms, Windows and MacOS. To\ndownload R, please choose your preferred CRAN mirror. If you have questions\nabout R like how to download and install the software, or what the license terms\nare, please read our answers to frequently asked questions before you send an\nemail."

cat(str_wrap(S9), sep = '\n') #以换行符连接每个固定长度的句子

## R is a free software environment for statistical computing and graphics. It
## compiles and runs on a wide variety of UNIX platforms, Windows and MacOS. To
## download R, please choose your preferred CRAN mirror. If you have questions
## about R like how to download and install the software, or what the license terms
## are, please read our answers to frequently asked questions before you send an
## email.

cat(str_wrap(S9, indent = 4)) #段落第一行空4个字符

##     R is a free software environment for statistical computing and graphics. It
## compiles and runs on a wide variety of UNIX platforms, Windows and MacOS. To
## download R, please choose your preferred CRAN mirror. If you have questions
## about R like how to download and install the software, or what the license terms
## are, please read our answers to frequently asked questions before you send an
## email.

编码转换相关函数

函数	功能说明
`iconv()`	转换编码格式
`Encoding()`	查看编码格式；或者指定编码格式
`tau::is.locale()`	tests if the components of a vector of character are in the encoding of the current locale
`tau::is.ascii()`
`tau::is.utf8()`	tests if the components of a vector of character are true UTF-8 strings

References

R中的正则表达式及字符处理函数总结

正则表达式速查表

Favorites

Categories

Tags

Lists

About

Home

尘缘

stringr

正则表达式 introduction

正则表达式规则

元字符

重复

转义

R中预定义的字符组

代表字符组的特殊符号

`stringr` package

`str_extract()`&`str_extract_all`: 提取匹配的字符串

`str_locate()`&`str_locate_all()`: 字符定位函数，返回匹配对象的首末位置

`str_replace()`&`str_replace_all()`: 字符串替换

`str_split()`&`str_split_fixed`: 字符串分割

`str_detect()`: 检测字符串中是否存在某种匹配模式

`str_count()`: 计数匹配上的字符个数

`word()`: 从句子中提取单词 (适用于英语环境)

`str_sub()`: 根据正则表达式匹配字符串中的值

`str_dup()`: 重复字符串

`str_c()`: 将多个字符串连接为单个字符串

`str_pad()`: 字符填充

`str_trim()`: 剔除字符串多余的首末空格

`str_order()`&`str_sort()`: 对字符向量排序

`str_wrap()`: 将段落划分为华丽的格式，可设置每行的宽度等

编码转换相关函数

References

Favorites

Categories

Tags

Lists

About

Home

尘缘

正则表达式 introduction

正则表达式规则

元字符

重复

转义

R中预定义的字符组

代表字符组的特殊符号

stringr package

str_extract()&str_extract_all: 提取匹配的字符串

str_locate()&str_locate_all(): 字符定位函数，返回匹配对象的首末位置

str_replace()&str_replace_all(): 字符串替换

str_split()&str_split_fixed: 字符串分割

str_detect(): 检测字符串中是否存在某种匹配模式

str_count(): 计数匹配上的字符个数

word(): 从句子中提取单词 (适用于英语环境)

str_sub(): 根据正则表达式匹配字符串中的值

str_dup(): 重复字符串

str_c(): 将多个字符串连接为单个字符串

str_pad(): 字符填充

str_trim(): 剔除字符串多余的首末空格

str_order()&str_sort(): 对字符向量排序

str_wrap(): 将段落划分为华丽的格式，可设置每行的宽度等

编码转换相关函数

References

`stringr` package

`str_extract()`&`str_extract_all`: 提取匹配的字符串

`str_locate()`&`str_locate_all()`: 字符定位函数，返回匹配对象的首末位置

`str_replace()`&`str_replace_all()`: 字符串替换

`str_split()`&`str_split_fixed`: 字符串分割

`str_detect()`: 检测字符串中是否存在某种匹配模式

`str_count()`: 计数匹配上的字符个数

`word()`: 从句子中提取单词 (适用于英语环境)

`str_sub()`: 根据正则表达式匹配字符串中的值

`str_dup()`: 重复字符串

`str_c()`: 将多个字符串连接为单个字符串

`str_pad()`: 字符填充

`str_trim()`: 剔除字符串多余的首末空格

`str_order()`&`str_sort()`: 对字符向量排序

`str_wrap()`: 将段落划分为华丽的格式，可设置每行的宽度等