Module 8
Data Visualization (Base R)
We have already learned how to use R's built-in functions to compute basic numerical summaries of data such as the mean, median, IQR, and quantiles. In this module, we will learn how to summarize data visually.
Data visualization plays a crucial role in data analysis. Before building statistical models or performing any other statistical procedure, you will want to plot your data, because the plots suggest which statistical tools might be appropriate. In this module, you will learn how to visualize data in Base R. Later, we will learn how to do so with the ggplot2 package, which is more versatile and offers more advanced visualization tools.
We will be using the same data set as before (Lung Capacity data set), which is available on Courseworks.
data <- read.table(file = "C:/Users/alexp/OneDrive/Desktop/R Bootcamp/R_bootcamp/lung_capacity.txt", header = TRUE, sep = "", stringsAsFactors = TRUE)
Histograms
Histograms display the distribution of numerical data: they show how many values of a variable fall into each of a set of ranges. The function hist() is used to plot histograms. Let’s plot a histogram of the Age variable:
hist(data$Age)
![](Module_8_files/figure-html/unnamed-chunk-2-1.png)
You can change the labels of a histogram by passing the xlab, ylab, and main arguments to the function:
hist(data$Age,
xlab = "AGE",
ylab = "FREQUENCY",
main = "Histogram of AGE")
![](Module_8_files/figure-html/unnamed-chunk-3-1.png)
The x-axis covers the range of values of the variable, and a histogram divides it into intervals known as bins. Histograms are tricky because the picture you see depends on a subjective choice of where to put the bin boundaries. Wide bins produce one picture, narrow bins produce a different picture, and unequal bins produce confusion. You can control the bins by passing the breaks argument to the function. In the example below, you specify the boundaries of each bin (0 to 2, 2 to 4, and so on):
hist(data$Age,
xlab = "AGE",
ylab = "FREQUENCY",
main = "Histogram of AGE",
breaks = seq(0, 20, 2)
)
![](Module_8_files/figure-html/unnamed-chunk-4-1.png)
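To see how much the choice of bins matters, here is a minimal sketch that counts the same values with wide and with narrow bins. The vector age is simulated for illustration (it is not the Lung Capacity data), and plot = FALSE asks hist() to return the bin counts without drawing anything.

```r
# Simulated ages for illustration only (not the course data set)
set.seed(1)
age <- sample(3:19, size = 200, replace = TRUE)

# Same data, two different sets of bin boundaries
h_wide   <- hist(age, breaks = seq(0, 20, 5), plot = FALSE)  # 4 bins of width 5
h_narrow <- hist(age, breaks = seq(0, 20, 1), plot = FALSE)  # 20 bins of width 1

h_wide$counts            # number of observations in each wide bin
length(h_narrow$counts)  # number of narrow bins

# Both histograms account for every observation; only the grouping differs
sum(h_wide$counts) == sum(h_narrow$counts)
```

Trying a few different values of breaks before settling on one is a cheap way to make sure a feature of the histogram is not just an effect of the binning.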
You can change the fill color of the bars and the color of their borders:
hist(data$Age,
xlab = "AGE",
ylab = "FREQUENCY",
main = "Histogram of AGE",
breaks = seq(0, 20, 2),
col = "lightpink",
border = "black")
![](Module_8_files/figure-html/unnamed-chunk-5-1.png)
You can display a specific part of a histogram by passing the xlim argument. Only the part of the histogram that corresponds to the specified range is displayed:
hist(data$Age,
xlab = "AGE",
ylab = "FREQUENCY",
main = "Histogram of AGE",
breaks = seq(0, 20, 2),
col = "lightpink",
border = "black",
xlim = c(0, 12)
)
![](Module_8_files/figure-html/unnamed-chunk-6-1.png)
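Keep in mind that xlim only changes the visible range; the bins are still computed from all of the data. The sketch below illustrates this with simulated ages (an assumption, standing in for data$Age).

```r
# Simulated ages for illustration only (not the course data set)
set.seed(1)
age <- sample(3:19, size = 200, replace = TRUE)

# hist() returns the histogram object (invisibly) even when it draws the plot
h <- hist(age, breaks = seq(0, 20, 2), xlim = c(0, 12))

sum(h$counts) == length(age)  # TRUE: every observation was binned, not only those shown
sum(age <= 12)                # number of observations that fall in the visible range
```

If you want a histogram of only the younger patients, subset the data first, for example with hist(age[age <= 12]).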
Barplots
Barplots are similar to histograms but are used for categorical (qualitative) variables, which R calls factors. They display the levels of a categorical variable and their corresponding frequencies. We will create a barplot for the Status variable:
barplot(table(data$Status))
![](Module_8_files/figure-html/unnamed-chunk-7-1.png)
Like histograms, barplots can be customized. Let’s add labels to the barplot:
barplot(table(data$Status),
xlab = "Disease Stage",
ylab = "Frequencies",
main = "Disease Status",
names.arg = c("Healthy", "Stage 1", "Stage 2", "Stage 3"))
![](Module_8_files/figure-html/unnamed-chunk-8-1.png)
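The labels in names.arg are matched to the bars by position, so they must be listed in the same order as the levels of the factor. A quick sketch of how you might check that order first is given below; the status codes "0" to "3" are hypothetical, and the real levels of data$Status may differ.

```r
# Hypothetical status codes for illustration; the real levels of data$Status may differ
status <- factor(c("0", "1", "0", "2", "3", "1", "0", "2"))

levels(status)           # the order in which table() and barplot() arrange the bars
counts <- table(status)  # one count per level, in level order
counts

# Descriptive labels, written in the same order as levels(status)
bar_labels <- c("Healthy", "Stage 1", "Stage 2", "Stage 3")
barplot(counts,
        names.arg = bar_labels,
        ylab = "Frequencies",
        main = "Disease Status")
```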
Now, let’s change the color of the bars and their borders:
barplot(table(data$Status),
xlab = "Disease Stage",
ylab = "Frequencies",
main = "Disease Status",
names.arg = c("Healthy", "Stage 1", "Stage 2", "Stage 3"),
col = "orange",
border = "blue")
![](Module_8_files/figure-html/unnamed-chunk-9-1.png)
You can even make a horizontal barplot:
barplot(table(data$Status),
xlab = "Frequencies",
ylab = "Disease Stage",
main = "Disease Status",
names.arg = c("Healthy", "Stage 1", "Stage 2", "Stage 3"),
col = "orange",
border = "blue",
horiz = TRUE)
![](Module_8_files/figure-html/unnamed-chunk-10-1.png)
In R, you can also create barplots that combine two factors. For the two examples given below, guess what they are displaying:
barplot(table(data[, c("Sex", "Status")]),
legend.text = TRUE,
ylab = "Frequencies",
xlab = "Disease Stage",
main = "Disease Status vs Sex",
names.arg = c("Healthy", "Stage 1", "Stage 2", "Stage 3"),
col = c("orange", "red"))
![](Module_8_files/figure-html/unnamed-chunk-11-1.png)
barplot(table(data[, c("Sex", "Status")]),
beside = TRUE,
legend.text = TRUE,
ylab = "Frequencies",
xlab = "Disease Stage",
main = "Disease Status vs Sex",
names.arg = c("Healthy", "Stage 1", "Stage 2", "Stage 3"),
col = c("orange", "red"))
![](Module_8_files/figure-html/unnamed-chunk-12-1.png)
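One way to check your guess is to look at the two-way table that both calls receive. The sketch below uses simulated factors (not the course data) to show that each column of the table becomes one group of bars, drawn stacked by default and side by side when beside = TRUE.

```r
# Simulated factors for illustration only (not the course data set)
set.seed(1)
sex <- factor(sample(c("female", "male"), size = 100, replace = TRUE),
              levels = c("female", "male"))
status <- factor(sample(c("Healthy", "Stage 1", "Stage 2", "Stage 3"),
                        size = 100, replace = TRUE),
                 levels = c("Healthy", "Stage 1", "Stage 2", "Stage 3"))

tab <- table(sex, status)  # rows: sex, columns: status
tab

colSums(tab)  # total height of each stacked bar

barplot(tab, legend.text = TRUE)                 # rows stacked within each column
barplot(tab, beside = TRUE, legend.text = TRUE)  # rows drawn side by side
```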
Boxplots
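Boxplots are used to visualize the five-number summary: the minimum, Q1 (the first quartile, also known as the 25th percentile), the median, Q3 (the third quartile, also known as the 75th percentile), and the maximum. By default, R draws each whisker only as far as the most extreme observation within 1.5 times the IQR of the box and plots any observations beyond that as individual points. Below is a boxplot for the Age variable: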
boxplot(data$Age)
![](Module_8_files/figure-html/unnamed-chunk-13-1.png)
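The numbers behind a boxplot can be inspected directly. The sketch below uses a small made-up vector (not the course data) and two base R helpers: fivenum(), which returns the five-number summary, and boxplot.stats(), which returns the values that boxplot() actually draws.

```r
# A small made-up vector with one extreme value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

fivenum(x)   # 1.0 3.0 5.5 8.0 50.0: minimum, lower hinge, median, upper hinge, maximum

s <- boxplot.stats(x)
s$stats      # 1.0 3.0 5.5 8.0 9.0: the upper whisker stops at 9, not at 50
s$out        # 50: drawn as an individual point beyond the whisker
```

The hinges returned by fivenum() are close to, but not always identical to, the quartiles returned by quantile().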
Let’s add labels to it and change the colors:
boxplot(data$Age,
xlab = "AGE",
ylab = "Values",
main = "Boxplot of AGE",
col = "darkorange",
border = "dodgerblue")
![](Module_8_files/figure-html/unnamed-chunk-14-1.png)
You can also add a notch, which marks a rough confidence interval around the median:
boxplot(data$Age,
xlab = "AGE",
ylab = "Values",
main = "Boxplot of AGE",
col = "darkorange",
border = "dodgerblue",
notch = TRUE)
![](Module_8_files/figure-html/unnamed-chunk-15-1.png)
You can change the shape and size of the points in the plot (the points drawn for outliers, if there are any) by passing the pch and cex arguments, respectively. Type ?pch in the console to see which shapes are available.
boxplot(data$Age,
xlab = "AGE",
ylab = "Values",
main = "Boxplot of AGE",
col = "darkorange",
border = "dodgerblue",
notch = TRUE,
pch = 20,
cex = 2)
![](Module_8_files/figure-html/unnamed-chunk-16-1.png)
Often you will use boxplots to compare a numerical variable across the levels of a categorical variable (that is, the levels of a factor). Let’s compare boxplots of the Age variable for female and male patients:
boxplot(data$Age ~ data$Sex,
xlab = "SEX",
ylab = "AGE",
main = "Grouped Boxplots",
col = c("darkorange", "red"),
border = c("dodgerblue", "black"),
notch = TRUE)
![](Module_8_files/figure-html/unnamed-chunk-17-1.png)
Let’s make it more complicated by grouping on the combination of two factors, Sex and Smoke:
boxplot(data$Age ~ data$Sex:data$Smoke,
xlab = "COMBINATION OF SEX AND SMOKE VARIABLES",
ylab = "AGE",
main = "Grouped Boxplots",
names = c("F_Nonsmokers","M_Nonsmokers", "F_Smokers", "M_Smokers"),
col = c("darkorange", "red"),
border = c("dodgerblue", "black"),
notch = TRUE)
![](Module_8_files/figure-html/unnamed-chunk-18-1.png)
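Because the names argument relabels the boxes by position, it helps to confirm the order in which boxplot() arranges the groups before you rename them. The sketch below does this on simulated data (the data frame d and its level names are assumptions, not the course data) by asking boxplot() for its statistics without drawing.

```r
# Simulated data for illustration only (not the course data set)
set.seed(1)
d <- data.frame(
  age   = sample(3:19, size = 80, replace = TRUE),
  sex   = factor(rep(c("female", "male"), times = 40)),
  smoke = factor(rep(c("no", "yes"), each = 40))
)

# plot = FALSE returns the computed statistics without drawing anything
b <- boxplot(age ~ sex:smoke, data = d, plot = FALSE)

# Group labels in plotting order; the first factor (sex) varies fastest:
# "female.no" "male.no" "female.yes" "male.yes"
b$names
```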
Scatterplots
As mentioned earlier, before building a statistical model, it is a good idea to visualize the relationships among your variables. Suppose you want to build a regression model that describes the relationship between the Age and LungCap (lung capacity) variables. First, we will visualize the data using the plot() function:
plot(x = data$Age, y = data$LungCap)
![](Module_8_files/figure-html/unnamed-chunk-19-1.png)
As you can observe, the relationship between these two variables looks roughly linear, which suggests that a linear regression model might be appropriate. Now let’s customize the plot by labeling it and changing the color, shape, and size of the points:
plot(x = data$Age,
y = data$LungCap,
xlab = "AGE",
ylab = "Lung Capacity",
main = "Lung Capacity vs Age",
col = "dodgerblue",
pch = 20,
cex = 0.5)
![](Module_8_files/figure-html/unnamed-chunk-20-1.png)
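The visual impression of a linear trend can be followed up with a correlation and a fitted line. The sketch below is only an illustration on simulated data: the variables age and lungcap and the true slope of 0.5 are assumptions, not values from the Lung Capacity data set.

```r
# Simulated data for illustration only; the slope 0.5 is an arbitrary choice
set.seed(1)
age     <- runif(200, min = 3, max = 19)
lungcap <- 1 + 0.5 * age + rnorm(200, sd = 1)

cor(age, lungcap)         # positive and fairly large for a clear linear trend

fit <- lm(lungcap ~ age)  # simple linear regression
coef(fit)                 # estimated intercept and slope

plot(age, lungcap, pch = 20, col = "dodgerblue")
abline(fit, col = "red")  # overlay the fitted line on the scatterplot
```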
You can focus on a specific part of the plot by adding the xlim argument:
plot(x = data$Age,
y = data$LungCap,
xlab = "AGE",
ylab = "Lung Capacity",
main = "Lung Capacity vs Age",
col = "dodgerblue",
pch = 20,
cex = 1,
xlim = c(0, 10))
![](Module_8_files/figure-html/unnamed-chunk-21-1.png)
You can change the color (or the size) of the points based on a factor variable. For example, we can display the observations for female patients as red dots and the observations for male patients as blue dots. In addition, we can add a legend to clarify the meaning of the colors in the plot:
colors <- c("red", "blue")
plot(x = data$Age,
y = data$LungCap,
xlab = "AGE",
ylab = "Lung Capacity",
main = "Lung Capacity vs Age",
col = colors[data$Sex],
pch = 20,
cex = 1)
legend("topleft", legend = c("FEMALE", "MALE"), pch = 20, col = colors)
![](Module_8_files/figure-html/unnamed-chunk-22-1.png)
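The expression colors[data$Sex] works because a factor is stored as integer codes, one per level, and those codes are used as positions in the colors vector. A minimal sketch with a made-up factor (assuming the levels are "female" and "male", sorted alphabetically by default):

```r
# A made-up factor for illustration; levels are sorted alphabetically by default
sex <- factor(c("male", "female", "female", "male"))
colors <- c("red", "blue")

levels(sex)      # "female" "male"
as.integer(sex)  # 2 1 1 2: the position of each value among the levels
colors[sex]      # "blue" "red" "red" "blue": one color per observation
```

This is also why the legend lists "FEMALE" before "MALE": the legend labels must follow the same order as the factor levels.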
Finally, in R you can display multiple plots in a single image. To do so, use the par() function and pass it the mfrow argument, which specifies the layout of the final image as a number of rows and columns. For example, if you want to place two plots in one row (that is, 1 row and 2 columns), call par(mfrow = c(1, 2)) and then draw the plots that you want to include:
par(mfrow = c(1, 2))
boxplot(data$LungCap ~ data$Sex,
xlab = "SEX",
ylab = "Lung Capacity",
main = "Grouped Boxplots",
col = c("darkorange", "red"),
border = c("dodgerblue", "black"),
notch = TRUE)
plot(x = data$Age,
y = data$LungCap,
xlab = "AGE",
ylab = "Lung Capacity",
main = "Lung Capacity vs Age",
col = colors[data$Sex],
pch = 20,
cex = 1)
legend("topleft", legend = c("FEMALE", "MALE"), pch = 20, col = colors)
![](Module_8_files/figure-html/unnamed-chunk-23-1.png)
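The layout set by par(mfrow = ...) stays in effect for later plots on the same graphics device. One common pattern, sketched below with placeholder plots, is to save the previous settings that par() returns and restore them when you are done.

```r
# par() returns the previous values of the settings it changes
op <- par(mfrow = c(1, 2))

# Two placeholder plots, drawn side by side
set.seed(1)
hist(rnorm(100), main = "Left panel")
plot(1:10, main = "Right panel")

par(op)       # restore the previous layout
par("mfrow")  # back to one row and one column on a fresh device
```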