Lecture 08

Last lecture we learned how to utilize apply family of functions that implements looping over a matrix, data frame, or list in a compact form. But what if you want to apply a function over subsets of a vector? Or what if you want to split a vector into subsets, compute summary statistics for each subset and return the result in a group by form?

If that’s what you are trying to do, then you will find the tapply() and aggregate() functions very useful. They will come in handy when you start analyzing your own data. Now, let’s see how these functions work.

We will be using the same data set as before (Lung Capacity data set), which is available on Courseworks.


data <- read.table(file = "C:/Users/alexp/OneDrive/Desktop/lung_capacity.txt", header = T, sep = "", stringsAsFactors = TRUE)

Notice, we added one more argument stringsAsFactors = TRUE. It converts all character vectors in the data frame into factors.

tapply() Function

tapply() function is another member of the apply family of functions. tapply() is used to apply a function over subsets of a vector. It allows you to create group summaries based on factor levels of another variables. The arguments to tapply() function are as follows:


tapply(X, INDEX, FUN, ...)

X - is an input vector
INDEX - is a factor variable (or a list of factor variables)
FUN - is a function to be applied
... - other arguments

Now, suppose you want to calculate the mean value of Age for male and female patients separately. You can do it by passing Age and Sex variables along with the mean function to tappy():


tapply(data$Age, data$Sex, mean)
#>   FEMALE     MALE 
#> 12.44972 12.20708

You can pass a user-defined function as well. First, let’s create a function_1 that returns mean and median values of a vector:


function_1 <- function(x){
  
  return(c(Mean = mean(x), Median = median(x)))
  
}

Now, let’s pass it to tapply() function:


tapply(data$Age, data$Sex, function_1)
#> $FEMALE
#>     Mean   Median 
#> 12.44972 13.00000 
#> 
#> $MALE
#>     Mean   Median 
#> 12.20708 12.00000

We can have even more complex scenarios with two or more factor variables. In that case, factor variables should be put in a list:


tapply(data$Age, list(data$Sex, data$Smoke), mean)
#>              NO      YES
#> FEMALE 12.12739 14.75000
#> MALE   11.94910 14.81818


tapply(data$Age, list(data$Sex, data$Smoke, data$Status), mean)
#> , , HEALTHY
#> 
#>              NO      YES
#> FEMALE 11.89326 15.00000
#> MALE   12.13408 14.69231
#> 
#> , , STAGE_1
#> 
#>              NO      YES
#> FEMALE 12.23529 13.87500
#> MALE   11.69231 15.33333
#> 
#> , , STAGE_2
#> 
#>              NO YES
#> FEMALE 12.54545  14
#> MALE   11.72093  NA
#> 
#> , , STAGE_3
#> 
#>              NO      YES
#> FEMALE 13.16667 15.33333
#> MALE   11.95238 15.00000

aggregate() Function

Another way of splitting vectors into subsets and computing summary statistics for each subset is using aggregate() function. It is useful in performing all the aggregate operations like sum, count, mean, median, and so on. The arguments to aggregate() function are as follows:


aggregate(X, by, FUN, ...)

X - is an input object
by - is a list of grouping elements, by which the subsets are grouped by
FUN - is a function to be applied
... - other arguments

Here are some examples:


aggregate(data$LungCap, list(data$Smoke), median)
#>   Group.1    x
#> 1      NO 7.90
#> 2     YES 8.65


aggregate(data$LungCap, list(data$Smoke, data$Sex), median)
#>   Group.1 Group.2      x
#> 1      NO  FEMALE 7.6000
#> 2     YES  FEMALE 8.1625
#> 3      NO    MALE 8.2125
#> 4     YES    MALE 9.3500


aggregate(data$LungCap, list(data$Smoke, data$Sex, data$Status), median)
#>    Group.1 Group.2 Group.3       x
#> 1       NO  FEMALE HEALTHY  7.4375
#> 2      YES  FEMALE HEALTHY  8.2375
#> 3       NO    MALE HEALTHY  8.2000
#> 4      YES    MALE HEALTHY  9.0750
#> 5       NO  FEMALE STAGE_1  7.7500
#> 6      YES  FEMALE STAGE_1  7.8375
#> 7       NO    MALE STAGE_1  8.0750
#> 8      YES    MALE STAGE_1 10.1000
#> 9       NO  FEMALE STAGE_2  8.2750
#> 10     YES  FEMALE STAGE_2  7.9500
#> 11      NO    MALE STAGE_2  8.3500
#> 12      NO  FEMALE STAGE_3  7.7250
#> 13     YES  FEMALE STAGE_3  8.1250
#> 14      NO    MALE STAGE_3  7.8250
#> 15     YES    MALE STAGE_3  9.8750


aggregate(data$LungCap, list(data$Smoke, data$Sex, data$Status), function_1)
#>    Group.1 Group.2 Group.3    x.Mean  x.Median
#> 1       NO  FEMALE HEALTHY  7.212258  7.437500
#> 2      YES  FEMALE HEALTHY  8.341667  8.237500
#> 3       NO    MALE HEALTHY  8.385335  8.200000
#> 4      YES    MALE HEALTHY  9.144231  9.075000
#> 5       NO  FEMALE STAGE_1  7.208529  7.750000
#> 6      YES  FEMALE STAGE_1  7.575000  7.837500
#> 7       NO    MALE STAGE_1  8.153297  8.075000
#> 8      YES    MALE STAGE_1  9.875000 10.100000
#> 9       NO  FEMALE STAGE_2  7.681061  8.275000
#> 10     YES  FEMALE STAGE_2  7.716667  7.950000
#> 11      NO    MALE STAGE_2  7.811047  8.350000
#> 12      NO  FEMALE STAGE_3  7.913889  7.725000
#> 13     YES  FEMALE STAGE_3  8.275000  8.125000
#> 14      NO    MALE STAGE_3  7.802381  7.825000
#> 15     YES    MALE STAGE_3  9.875000  9.875000


aggregate(data[ ,c("Age", "LungCap")], list(data$Smoke), median)
#>   Group.1 Age LungCap
#> 1      NO  12    7.90
#> 2     YES  15    8.65

Data Visualization

We have already learned how to use R built-in functions to compute basic numerical summaries of data such as mean, median, IQR, quantiles, and so on. In this lecture, we are going to learn how to get summary of your data via visualization.

Data visualization plays a crucial role in data analysis. Prior to building statistical models or performing any other statistical procedure, you will want to plot your data as it will suggest what statistical tools might be more appropriate. In this lecture, you will learn how to visualize data in Base R. Later in the semester, we will learn how to use ggplot2 package to do so. It is more versatile and has advanced visualization tools.

Histograms

Histograms display a distribution of numerical data. They represent the frequencies of values of a variable bucketed into ranges. The function hist() is used to plot histograms. Let’s plot a histogram of the Age variable:


hist(data$Age)

You can change labels of a histogram by passing xlab, ylab, and main arguments to the function:


hist(data$Age,
     
     xlab = "AGE",
     
     ylab = "FREQUENCY",
     
     main = "Histogram of AGE")

The x-axis contains a range of values of the variable. Histograms divide it into intervals known as bins. Histograms are tricky because it depends on the subjective judgments of where exactly to put the bin margins that what graph you will be looking at. Wide bins produce one picture, narrow bins produce a different picture, and unequal bins produce confusion. You can choose a number of bins to display by passing a break argument to the function. In the example, below you specify the starting and ending points of each bin (for instance, 0 to 2, 2 to 4, and so on):


hist(data$Age,
     
     xlab = "AGE",
     
     ylab = "FREQUENCY",
     
     main = "Histogram of AGE",
     
     breaks = seq(0, 20, 2)
     
     )

You can change the color of bins and corresponding borders:


hist(data$Age,
     
     xlab = "AGE",
     
     ylab = "FREQUENCY",
     
     main = "Histogram of AGE",
     
     breaks = seq(0, 20, 2),
     
     col = "lightpink",
     
     border = "black")

You can display specific parts of a histogram by passing xlim() arguments. It will display the part of a histogram that corresponds to the specified range:


hist(data$Age,
     
     xlab = "AGE",
     
     ylab = "FREQUENCY",
     
     main = "Histogram of AGE",
     
     breaks = seq(0, 20, 2),
     
     col = "lightpink",
     
     border = "black",
     
     xlim = c(0, 12)
     
     )

Barplots

Barplots are similar to histograms but are used for categorical/qualitative variables (in R we call them factors). They display levels of a categorical variable and their corresponding frequencies. We will create a barplot for the Status variable:


barplot(table(data$Status))

Like histgorams, barplots can be customized. Let’s add labels to your barplot:


barplot(table(data$Status),
        
        xlab = "Disease Stage",
        
        ylab = "Frequences",
        
        main = "Disease Status",
        
        names.arg = c("Healthy", "Stage 1", "Stage 2", "Stage 3"))

Now, let’s change the color of bins and borders:


barplot(table(data$Status),
        
        xlab = "Disease Stage",
        
        ylab = "Frequences",
        
        main = "Disease Status",
        
        names.arg = c("Healthy", "Stage 1", "Stage 2", "Stage 3"),
        
        col = "orange",
        
        border = "blue")

You can even make a horizontal barplot:


barplot(table(data$Status),
        
        xlab = "Frequences",
        
        ylab = "Disease Stage",
        
        main = "Disease Status",
        
        names.arg = c("Healthy", "Stage 1", "Stage 2", "Stage 3"),
        
        col = "orange",
        
        border = "blue",
        
        horiz = TRUE)

In R, you can create clustered barplots. For the two examples given below, guess what they are displaying:


barplot(table(data[, c("Sex", "Status")]),
        
        legend.text = TRUE,
        
        ylab = "Frequences",
        
        xlab = "Disease Stage",
        
        main = "Disease Status vs Sex",
        
        names.arg = c("Healthy", "Stage 1", "Stage 2", "Stage 3"),
        
        col = c("orange", "red"))


barplot(table(data[, c("Sex", "Status")]),
        
        beside = TRUE,
        
        legend.text = TRUE,
        
        ylab = "Frequences",
        
        xlab = "Disease Stage",
        
        main = "Disease Status vs Sex",
        
        names.arg = c("Healthy", "Stage 1", "Stage 2", "Stage 3"),
        
        col = c("orange", "red"))

Boxplots

Boxplots are used to visualize a 5-Number summary (Minimum, Q1 (first quartile, also known as 25th percentile), median, Q3 (third quartile, also known as 75th percentile), and Maximum). Below is a boxplot for the Age variable:


boxplot(data$Age)

Let’s add labels to it and change the colors:


boxplot(data$Age,
        
        xlab = "AGE",
        
        ylab = "Values",
        
        main = "Boxplot of AGE",
        
        col = "darkorange",
        
        border = "dodgerblue")

You can even add a notch to it if you want to:


boxplot(data$Age,
        
        xlab = "AGE",
        
        ylab = "Values",
        
        main = "Boxplot of AGE",
        
        col = "darkorange",
        
        border = "dodgerblue",
        
        notch = T)

You can change the shape and size of points in the plot by passing pch and cex arguments, respectfully. Type ?pch in the console to see what shapes are available.


boxplot(data$Age,
        
        xlab = "AGE",
        
        ylab = "Values",
        
        main = "Boxplot of AGE",
        
        col = "darkorange",
        
        border = "dodgerblue",
        
        notch = T,
        
        pch = 20,
        
        cex  = 2)

Often you will be using boxplots to compare a numerical variable for different levels of other categorical variables (that is, levels of a factor). Let’s compare boxplots of the Age variable for female and male patients:


boxplot(data$Age ~ data$Sex,
        
        xlab = "SEX",
        
        ylab = "AGE",
        
        main = "Grouped Boxplots",
        
        col = c("darkorange", "red"),
        
        border = c("dodgerblue", "black"),
        
        notch = T)

Let’s make it more complicated:


boxplot(data$Age ~ data$Sex:data$Smoke,
        
        xlab = "COMBINATION OF SEX AND SMOKE VARIABLES",
        
        ylab = "AGE",
        
        main = "Grouped Boxplots",
        
        names = c("F_Nonsmokers","M_Nonsmokers", "F_Smokers", "M_Smokers"),
        
        col = c("darkorange", "red"),
        
        border = c("dodgerblue", "black"),
        
        notch = T)

Scatterplots

As mentioned earlier, before building a statistical model, it is recommended to visualize a relationship among variables. Suppose, you want to build a regression model that will describe the relationship between Age and Lung Capacity variables. First, we will visualize the data using plot() function:


plot(x = data$Age, y = data$LungCap)

As you can observe, there is a linear trend between these two variables, so this suggests that a linear regression model might be appropriate. Now let’s customize the plot by labeling it and changing the color, shape, and size of the points:


plot(x = data$Age,
     
     y = data$LungCap,
     
     xlab = "AGE",
     
     ylab = "Lung Capacity",
     
     main = "Lung Capacity vs Age",
     
     col = "dodgerblue",
     
     pch = 20,
     
     cex  = 0.5)

You can focus on specific parts of the plot by adding xlim() argument:


plot(x = data$Age,
     
     y = data$LungCap,
     
     xlab = "AGE",
     
     ylab = "Lung Capacity",
     
     main = "Lung Capacity vs Age",
     
     col = "dodgerblue",
     
     pch = 20,
     
     cex  = 1,
     
     xlim = c(0, 10))

You can change the color (or the size) of points based on other factor variables. For example, we can make the observations that belong to female patients be displayed as red dots and the observations that belong to male patients be displayed as blue ones. In addition, we can add legends to clarify the meaning of colors in the plot:


colors <- c("red", "blue")

plot(x = data$Age,
     
     y = data$LungCap,
     
     xlab = "AGE",
     
     ylab = "Lung Capacity",
     
     main = "Lung Capacity vs Age",
     
     col = colors[data$Sex],
     
     pch = 20,
     
     cex  = 1)


legend("topleft", legend = c("FEMALE", "MALE"), pch = 20, col = colors)

Finally, in R you are able to display multiple plots in a single image. To do so, you need to use par() function. You pass an mfrow() argument to this function that specifies dimensions of your final plot. For example, if you want to plot two images in one row (that is, 1 row and 2 columns), then you execute par(mfrow = c(1, 2)) function followed by the plots that you aim to include in it:


par(mfrow = c(1, 2))


boxplot(data$LungCap ~ data$Sex,
        
        xlab = "SEX",
        
        ylab = "AGE",
        
        main = "Grouped Boxplots",
        
        col = c("darkorange", "red"),
        
        border = c("dodgerblue", "black"),
        
        notch = T)


plot(x = data$Age,
     
     y = data$LungCap,
     
     xlab = "AGE",
     
     ylab = "Lung Capacity",
     
     main = "Lung Capacity vs Age",
     
     col = colors[data$Sex],
     
     pch = 20,
     
     cex  = 1)


legend("topleft", legend = c("FEMALE", "MALE"), pch = 20, col = colors)

Lecture 07

Lecture 09