Introduction

Hello! Here I list the things I found difficult in text analysis.

White and Blank spaces

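This took me a while to learn. Hint: white spaces and blank spaces are not the same.

trimws() will not work at all if the "blank" is a character outside its default whitespace set, [ \t\r\n] (space, tab, carriage return, newline). A non-breaking space is one example.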
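As a rough illustration of the point above, the sketch below shows trimws() leaving characters outside its default set in place, and then widening the set through its whitespace argument (available in R 3.6.0 and later). The sample strings are made up for the example.

```r
# Form feed (\f) and vertical tab (\v) are [:space:] characters,
# but they are not in trimws()'s default set "[ \t\r\n]".
x <- "\f hello \v"

default_trim <- trimws(x)                              # nothing is stripped
posix_trim   <- trimws(x, whitespace = "[[:space:]]")  # strips every [:space:] character

print(default_trim == x)   # TRUE
print(posix_trim)          # "hello"

# A non-breaking space (U+00A0) is another common culprit, e.g. in scraped text.
# ?trimws suggests "[\\h\\v]" to cover Unicode horizontal and vertical white space.
y <- "\u00a0hello\u00a0"
nbsp_trim <- trimws(y, whitespace = "[\\h\\v]")
print(nbsp_trim)
```

If trimws() seems to do nothing, it may be worth checking which character is actually there, for instance with utf8ToInt(substr(x, 1, 1)).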

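Here are some useful character classes to know for regular expressions. In R, a POSIX class such as [:digit:] has to sit inside a bracket expression, e.g. [[:digit:]], and the backslash shortcuts need a doubled backslash inside a string, e.g. "\\d".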

  • [:digit:] or \d: digits, 0 1 2 3 4 5 6 7 8 9, equivalent to [0-9].

  • \D: non-digits, equivalent to [^0-9].

  • [:lower:]: lower-case letters, equivalent to [a-z].

  • [:upper:]: upper-case letters, equivalent to [A-Z].

  • [:alpha:]: alphabetic characters, equivalent to [[:lower:][:upper:]] or [A-Za-z].

  • [:alnum:]: alphanumeric characters, equivalent to [[:alpha:][:digit:]] or [A-Za-z0-9].

  • \w: word characters, equivalent to [[:alnum:]_] or [A-Za-z0-9_].

  • \W: not word, equivalent to [^A-Za-z0-9_].

  • [:xdigit:]: hexadecimal digits (base 16), 0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f, equivalent to [0-9A-Fa-f].

  • [:blank:]: blank characters, i.e. space and tab.

  • [:space:]: space characters: tab, newline, vertical tab, form feed, carriage return, space.

  • \s: space characters, shorthand for [[:space:]].

  • \S: not space, equivalent to [^[:space:]].

  • [:punct:]: punctuation characters, ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.

  • [:graph:]: graphical (human readable) characters: equivalent to [[:alnum:][:punct:]].

  • [:print:]: printable characters, equivalent to [[:graph:] ] (the graphical characters plus the space character).

  • [:cntrl:]: control characters, like \n or \r, equivalent to [\x00-\x1F\x7F].
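To make the list above a bit more concrete, here is a small sketch that applies a few of the classes with gsub(). The sample string is invented for the example. Note the outer brackets around the POSIX classes and the doubled backslash in "\\D".

```r
x <- "Room 42,\tFloor 3!"

digits_only   <- gsub("\\D", "", x)            # drop every non-digit
no_digits     <- gsub("[[:digit:]]", "", x)    # drop every digit
no_punct      <- gsub("[[:punct:]]", "", x)    # drop punctuation
single_spaced <- gsub("[[:space:]]+", " ", x)  # collapse whitespace runs into one space

print(digits_only)     # "423"
print(no_digits)       # "Room ,\tFloor !"
print(no_punct)        # "Room 42\tFloor 3"
print(single_spaced)   # "Room 42, Floor 3!"
```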

Some stuff worth remembering.

If you want to find out whether any of several strings exist, separate the patterns with | (which means "or").

Text <- c('I went out to meet some friends')
words <- unlist(strsplit(Text, split = " "))
grep('to|meet', words, value = TRUE) # actual matches
## [1] "to"   "meet"
grepl('to|meet', words) # logical 
## [1] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE
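One thing to watch for: 'to|meet' matches substrings, not whole words. The sketch below, with a made-up sentence and a hypothetical targets vector, builds the pattern from a vector and compares loose matching with anchored matching.

```r
words   <- unlist(strsplit("I went out to meet some tomatoes", split = " "))
targets <- c("to", "meet")   # hypothetical list of words to look for

# "to|meet" matches substrings, so "tomatoes" is also a hit.
loose <- grepl(paste(targets, collapse = "|"), words)

# Anchoring with ^ and $ restricts the match to whole words.
strict_pattern <- paste0("^(", paste(targets, collapse = "|"), ")$")
strict <- grepl(strict_pattern, words)

# For exact matching without any regex, %in% gives the same answer.
exact <- words %in% targets

print(words[loose])    # "to" "meet" "tomatoes"
print(words[strict])   # "to" "meet"
print(identical(strict, exact))   # TRUE
```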

If you want to remove the rest of a string after a specific character, you can use the gsub() function:

  • . matches any single character, here whatever follows the specific character.
  • * matches the preceding item zero or more times.
  • \ suppresses the special meaning of metacharacters in a regular expression. It has to be a double backslash (i.e. \\) since \ itself needs to be escaped in an R string.

text <- c('dd1 =~aaa', 'dd2 =~`bbb')
trimws(gsub("\\=~.*", "", text)) 
## [1] "dd1" "dd2"
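Escaping matters most when the specific character is itself a metacharacter. As a small sketch with made-up file names, compare an unescaped dot with an escaped one, and cutting at the first dot with cutting at the last:

```r
files <- c("report.final.csv", "notes.txt")

# Unescaped, "." is a wildcard: the match starts at the first character,
# so the whole string is removed.
unescaped <- sub("..*", "", files)

# Escaped, "\\." is a literal dot: cut from the FIRST dot onward.
from_first_dot <- sub("\\..*", "", files)

# Cut only the LAST dot and what follows it (the file extension).
from_last_dot <- sub("\\.[^.]*$", "", files)

print(unescaped)        # "" ""
print(from_first_dot)   # "report" "notes"
print(from_last_dot)    # "report.final" "notes"
```

Since .* is greedy and runs to the end of the string, sub() is enough here; gsub() would give the same result.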

If you want to replace multiple patterns in a single string with the same replacement, join the patterns with |:

text <- c('I am having a lot of fun')
gsub('having|fun', 'cool', text)
## [1] "I am cool a lot of cool"
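When each pattern needs its own replacement, one possible approach (a sketch, not the only way) is to loop over a named vector and apply gsub() once per pattern. The replacements vector below is hypothetical.

```r
text <- "I am having a lot of fun"

# Names are the patterns, values are their replacements (made up for the example).
replacements <- c(having = "enjoying", fun = "cake")

result <- text
for (pattern in names(replacements)) {
  # fixed = TRUE treats the pattern as a literal string, not a regex
  result <- gsub(pattern, replacements[[pattern]], result, fixed = TRUE)
}

print(result)   # "I am enjoying a lot of cake"
```

The replacements are applied in order, so an earlier replacement can be changed again by a later pattern if they overlap.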

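If you want to keep only what sits inside the special characters (square brackets here), you can split the string into words and strip the brackets. It doesn't work cleanly if other characters sit directly before or after the opening and closing brackets, though: note the trailing commas in the output.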

item <- 'I like to [A1], [A2], [A3], [K1] fun'
words <- unlist(strsplit(item, split= ' '))
position <- grep('\\[', words)
features_special_characters <- words[position]
(features <- gsub('\\[|\\]', "", features_special_characters))
## [1] "A1," "A2," "A3," "K1"
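To get around the trailing commas, one alternative is to extract the bracketed pieces directly with gregexpr() and regmatches() instead of splitting on spaces. This is a sketch under the assumption that brackets are never nested. The sample string is a variation on the one above with extra punctuation around the brackets.

```r
item <- "I like to [A1],[A2]; ([A3]) and [K1] fun"

# "\\[[^]]*\\]" matches "[", then any run of characters that are not "]", then "]".
bracketed <- regmatches(item, gregexpr("\\[[^]]*\\]", item))[[1]]

# Strip the brackets themselves, as before.
features <- gsub("\\[|\\]", "", bracketed)

print(features)   # "A1" "A2" "A3" "K1"
```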

Citations