Author

Gary Hutson

Published

October 22, 2018

Modified

March 23, 2024

Analysing the pre-conference workshop sentiments

Loading in the required packages:

install_or_load_pack <- function(pack) {
  create.pkg <- pack[!(pack %in% installed.packages()[, "Package"])]
  if (length(create.pkg)) {
    install.packages(create.pkg, dependencies = TRUE)
  }
  sapply(pack, require, character.only = TRUE)
}
packages <- c(
  "ggplot2", "tidyverse", "data.table", "wordcloud", "tm", "wordcloud2",
  "scales", "tidytext", "devtools", "twitteR", "caret", "magrittr", "RColorBrewer", "tidytext", "ggdendro",
  "tidyr", "topicmodels", "SnowballC", "gtools"
)
install_or_load_pack(packages)

This function was previously covered in blog post: https://nhsrcommunity.com/blog/a-simple-function-to-install-and-load-packages-in-r/.

Here I specify that I want to load the main packages for dealing with sentiment and discourse analysis in R. Libraries such as {tm}, {wordcloud} and {wordcloud2} are loaded for working with this type of data.

Choosing the file to import

The file we have to import is a prepared csv file and instead of hard coding the path to load the file from I simply use:

path <- choose.files()

Screenshot of Select files window that pops up to allow files to be chosen

This is a special function which allows you to open a dialog UI from R:

From this dialog I select the csv file I want to be imported. Once I have selected the csv and hit open, the path variable will be filled with the location of the file to work with.

Creating the R Data Frame

To create the data frame I can now pass the variable path to the read_csv() command:

workshop_sentiment <- read_csv(path, col_names = T)

This will read the textual data from the workshops in to a data frame with 2 columns. The first relates to what the attendees enjoyed about the workshop and the second relates to improvements that can be made:

Screenshot of the highlights and improvements text from a pre-conference survey

Separate the master data frame

The master data frame now needs to be separated into two separate data frames, as text analysis requires one column with the number of rows for each sentence, as demonstrated. Here I use {magrittr} to divide this into two new data frames:

ws_highlights <- workshop_sentiment %>%
  .[, 1]

# Copy for improvements

ws_improvements <- workshop_sentiment %>%
  .[, 2]

The ws_highlights data frame uses the first column and the ws_improvements data frame uses the second.

Function to create textual corpus

As I want to replicate this for highlights and improvements – I have created a function that could be replicated with any text analysis to create what is known as a text corpus (see: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf) this creates a series of documents, in our case sentences.

corpus_tm <- function(x) {
  library(tm)

  corpus_tm <- Corpus(VectorSource(x))
}

This function allows you to pass any data frame to the function and creates a corpus for each data frame you pass to the function. The data frame would be passed to the x parameter. The VectorSource() function creates an element for each part of the corpus.

Create Corpus for Highlights and Improvements data frame

Now the function has been created, I can simply pass the two separate data frames I created before to create two corpuses:

corpus_positive <- corpus_tm(ws_highlights$Highlights)

corpus_improvements <- corpus_tm(ws_improvements$Improvements)

The code block above shows that I create a corpus for the positive (highlights) data frame and an improvements corpus. This will display as hereunder in your environment:

Screenshot of the corpus output from the Environment in RStudio

##Function to clean data in the corpus

The most common cleaning task of working with text data is to remove things like punctuation, common English words, and so on This is something I have to repeat multiple times when dealing with discourse analysis:

clean_corpus <- function(corpus_to_use) {
  library(magrittr)

  library(tm)

  corpus_to_use %>%
    tm_map(removePunctuation) %>%
    tm_map(stripWhitespace) %>%
    tm_map(content_transformer(function(x) iconv(x, to = "UTF-8", sub = "byte"))) %>%
    tm_map(removeNumbers) %>%
    tm_map(removeWords, stopwords("en")) %>%
    tm_map(content_transformer(tolower)) %>%
    tm_map(removeWords, c("etc", "ie", "eg", stopwords("english")))
}

The parameter here takes the corpus object previously created and uses the corpus passed to:

  • Remove punctuation
  • Strip out whitespace between each text item, as the VectorSource has stripped out each word from each sentence in the data frame
  • Change the underlying formatting of the text to UTF-8
  • Remove numbers
  • Remove common English word (stop words)
  • Change the case to lower case
  • Remove a custom vector of words to adjust for things like “e.g.”, “i.e.”, “etc”.

To clean the corpus objects I simply pass the original corpus objects back through this function to perform cleaning:

corpus_positive <- clean_corpus(corpus_positive)

corpus_improvements <- clean_corpus(corpus_improvements)

Inspection of one of the data frames confirms that this has successfully been cleaned:

Create TermDocumentMatrix to attain frequent terms

The term document matrix (explained well here: https://www.youtube.com/watch?v=dE10fBCDWQc) can be used with the corpus to identify frequent terms by classification on a matrix. However, more code is needed to do this:

 find_freq_terms_fun <- function(corpus_in){

  doc_term_mat <- TermDocumentMatrix(corpus_in)

  freq_terms <- findFreqTerms(doc_term_mat)[1:max(doc_term_mat$nrow)]

  terms_grouped <-

    doc_term_mat[freq_terms,] %>%

    as.matrix() %>%

    rowSums() %>%

    data.frame(Term=freq_terms, Frequency = .) %>%

    arrange(desc(Frequency)) %>%

    mutate(prop_term_to_total_terms=Frequency/nrow(.))

  return(data.frame(terms_grouped))

 }

This function needs explanation. The function uses as a single parameter the corpus that you need to pass in, then a variable is created to create the doc_term_mat which uses the tm TermDocumentMatrix.

Next, I use the findFreqTerms function to iterate from the first entry to the maximum number of rows in the matrix. These are the powerhouses of the function, as they highlight how many times a word has been used in a sentence across all the rows of text.

The terms_grouped variable then slices the term matrix with the frequent terms, this is converted to a matrix, sum of each row are calculated for example the number of times the word appears. Then, a data frame is created of the terms in the function with the headings term and Frequency.

Next, we use the power of {dplyr} to use arrange by the frequency descending and to add a mutated column to the data frame to calculate the proportion of that specific term over all terms. The return(data.frame(terms_group)) then forces R to return the results of the function.

I then pass my data frames (highlights and improvements) to the function I have just created to see if this method works:

positive_freq_terms <- data.frame(find_freq_terms_fun(corpus_positive))

improvement_freq_terms <- data.frame(find_freq_terms_fun(corpus_improvements))

These will be built as data frames and can be viewed in R Studio’s Data environment window:

Screenshot of the RStudio’s data environment window with terms and frequency

This has worked just as expected. You could now use ggplot2 to produce a bar chart / pareto chart of the terms.

Create a Word Cloud with the {wordcloud2} package

R has a {wordcloud} package that produces relatively nice looking word clouds, but {wordcloud2} surpasses this in terms of visualisation. To use this function is easy now I have the frequent terms data frame – using the highlights data frame this can be implemented by using the below syntax:

wordcloud2(positive_freq_terms[,1:2],

           shape="pentagon",

           color="random-dark")

To use the function I pass the data frame and use the term and frequency fields only to use the visualisation. There are a number of options and these can be accessed by using the help("wordcloud2") function. Here I use the shape and color parameters to set the display of the word cloud:

Screenshot of the word cloud with Code being the largest word, followed by plot, useful, pick and facets

This can be exported in the viewer window by using the Export function.

This word cloud relates to the pre workshop prior to the conference. I personally thought the NHS-R conference was amazing and I was honoured to have a spot to speak amongst so many other brilliant R users.

R is so versatile – every day is like a school day when you are learning it, but what a journey.

This blog has been edited for NHS-R styles and has been formatted to remove Latin Abbreviations

Back to top

Reuse

CC0

Citation

For attribution, please cite this work as:
Hutson, Gary. 2018. “Text Mining – Term Frequency Analysis and Word Cloud Creation in R.” October 22, 2018. https://nhs-r-community.github.io/nhs-r-community//blog/text-mining-term-frequency-analysis-and-word-cloud-creation.html.