NHS-R Community Quarto website - How to extrapolate data from data

There are many occasions when a column of data needs to be created from an already existing column for ease of data manipulation. For example, perhaps you have a body of text as a pathology report and you want to extract all the reports where the diagnosis is ‘dysplasia’.

You could just subset the data using grepl so that you only get the reports that mention this word…but what if the data needs to be cleaned prior to subsetting like excluding reports where the diagnosis is normal but the phrase ‘No evidence of dysplasia’ is present. Or perhaps there are other manipulations needed prior to subsetting.

This is where data accordionisation is useful. This simply means the creation of data from (usually) a column into another column in the same dataframe.

The neatest way to do this is with the mutate function from the {dplyr} package which is devoted to data cleaning. There are also other ways which I will demonstrate at the end.

The input data here will be an endoscopy data set:

Age <- sample(1:100, 130, replace = TRUE)
Dx <- sample(c("NDBE", "LGD", "HGD", "IMC"), 130, replace = TRUE)
TimeOfEndoscopy <- sample(1:60, 130, replace = TRUE)

library(dplyr)

EMRdf <- data.frame(Age, Dx, TimeOfEndoscopy, stringsAsFactors = F)

Perhaps you need to calculate the number of hours spent doing each endoscopy rather than the number of minutes

EMRdftbb <- EMRdf %>% mutate(TimeOfEndoscopy / 60)

# install.packages("knitr")
library(knitr)
library(kableExtra)

# Just show the top 20 results

kable(head(EMRdftbb, 20))

Age	Dx	TimeOfEndoscopy	TimeOfEndoscopy/60
45	NDBE	35	0.5833333
27	NDBE	48	0.8000000
59	HGD	35	0.5833333
17	LGD	37	0.6166667
15	LGD	53	0.8833333
91	IMC	16	0.2666667
18	NDBE	38	0.6333333
19	NDBE	29	0.4833333
6	HGD	7	0.1166667
54	HGD	22	0.3666667
96	NDBE	58	0.9666667
34	LGD	53	0.8833333
28	NDBE	25	0.4166667
59	IMC	60	1.0000000
80	NDBE	47	0.7833333
53	LGD	39	0.6500000
36	HGD	36	0.6000000
86	NDBE	2	0.0333333
85	HGD	49	0.8166667
30	IMC	25	0.4166667

That is useful but what if you want to classify the amount of time spent doing each endoscopy as follows: <0.4 hours is too little time and >0.4 hours is too long.

Using ifelse() with mutate for conditional accordionisation.

For this we would use ifelse(). However this can be combined with mutate() so that the result gets put in another column as follows

EMRdf2 <- EMRdf %>%
  mutate(TimeInHours = TimeOfEndoscopy / 60) %>%
  mutate(TimeClassification = ifelse(TimeInHours > 0.4, "Too Long", "Too Short"))

# Just show the top 20 results

kable(head(EMRdf2, 20))

Age	Dx	TimeOfEndoscopy	TimeInHours	TimeClassification
45	NDBE	35	0.5833333	Too Long
27	NDBE	48	0.8000000	Too Long
59	HGD	35	0.5833333	Too Long
17	LGD	37	0.6166667	Too Long
15	LGD	53	0.8833333	Too Long
91	IMC	16	0.2666667	Too Short
18	NDBE	38	0.6333333	Too Long
19	NDBE	29	0.4833333	Too Long
6	HGD	7	0.1166667	Too Short
54	HGD	22	0.3666667	Too Short
96	NDBE	58	0.9666667	Too Long
34	LGD	53	0.8833333	Too Long
28	NDBE	25	0.4166667	Too Long
59	IMC	60	1.0000000	Too Long
80	NDBE	47	0.7833333	Too Long
53	LGD	39	0.6500000	Too Long
36	HGD	36	0.6000000	Too Long
86	NDBE	2	0.0333333	Too Short
85	HGD	49	0.8166667	Too Long
30	IMC	25	0.4166667	Too Long

Note how we can chain the mutate() function together.

Using multiple ifelse()

What if we want to get more complex and put several classifiers in? We just use more ifelse’s:

EMRdf2 <- EMRdf %>%
  mutate(TimeInHours = TimeOfEndoscopy / 60) %>%
  mutate(TimeClassification = ifelse(TimeInHours > 0.8, "Too Long", ifelse(TimeInHours < 0.5, "Too Short", ifelse(TimeInHours >= 0.5 & TimeInHours <= 0.8, "Just Right", "N"))))

# Just show the top 20 results

kable(head(EMRdf2, 20))

Age	Dx	TimeOfEndoscopy	TimeInHours	TimeClassification
45	NDBE	35	0.5833333	Just Right
27	NDBE	48	0.8000000	Just Right
59	HGD	35	0.5833333	Just Right
17	LGD	37	0.6166667	Just Right
15	LGD	53	0.8833333	Too Long
91	IMC	16	0.2666667	Too Short
18	NDBE	38	0.6333333	Just Right
19	NDBE	29	0.4833333	Too Short
6	HGD	7	0.1166667	Too Short
54	HGD	22	0.3666667	Too Short
96	NDBE	58	0.9666667	Too Long
34	LGD	53	0.8833333	Too Long
28	NDBE	25	0.4166667	Too Short
59	IMC	60	1.0000000	Too Long
80	NDBE	47	0.7833333	Just Right
53	LGD	39	0.6500000	Just Right
36	HGD	36	0.6000000	Just Right
86	NDBE	2	0.0333333	Too Short
85	HGD	49	0.8166667	Too Long
30	IMC	25	0.4166667	Too Short

Using multiple ifelse() with grepl() or string_extract

Of course we need to extract information from text as well as numeric data. We can do this using grepl() or string_extract() from the library(stringr).

Let’s say we want to extract all the samples that had IMC. We don’t want to subset the data, just extract IMC into a column that says IMC and the rest say ’Non-IMC’

Using the dataset above:

library(stringr)

EMRdf$MyIMC_Column <- str_extract(EMRdf$Dx, "IMC")

# to fill the NA's we would do:EMRdf$MyIMC_Column<-ifelse(grepl("IMC",EMRdf$Dx),"IMC","NoIMC")

# Another way to do this (really should be for more complex examples when you want to extract the entire contents of the cell that has the match)

EMRdf$MyIMC_Column <- ifelse(grepl("IMC", EMRdf$Dx), str_extract(EMRdf$Dx, "IMC"), "NoIMC")

So data can be usefully created from data for further analysis.

Hopefully this way of extrapolating data and especially using conditional expressions to categorise data according to some rules is a helpful way to get more out of your data.

Please follow @gastroDS on twitter

This article originally appeared on https://sebastiz.github.io/gastrodatascience/ and has been edited to render in Quarto and had NHS-R styles applied.

Reuse

CC0

Citation

For attribution, please cite this work as:

Zeki, Sebastian. 2018. “How to Extrapolate Data from Data.” July 23, 2018. https://nhs-r-community.github.io/nhs-r-community//blog/how-to-extrapolate-data-from-data.html.