Problem Set 2
You will be working with the datasets containing demographic information of midwest counties from 2000 US census. Dataset_1, Dataset_2, and Dataset_3 are available on Courseworks.
Dataset_1 contains the following variables:
- PID - Unique county identifier
- county - County name
- area - Area of county (units unknown). This variable is spread across 5 columns (IL - WI)
- poptotal - Total population
- state - State to which county belongs to
Dataset_2 contains the following variables:
- PID - Unique county identifier
- county - County name
- race - Total population for each racial group (White, Black, American Indians, Asians, Other races)
- inmetro - A factor with levels 0 and 1 to indicate whether county considered in a metro area
- category - Miscellaneous
Dataset_3 contains the following variables:
- PID - Unique county identifier
- county - County name
- popadults - Number of adults
- perchsd - Percent of population with high school diploma
- percollege - Percent of college educated population
- percprof - Percent of population with professional degree
- percbelowpoverty - Percent of population below poverty line
- percchildbelowpovert - Percent of children below poverty line
- percadultpoverty - Percent of adults below poverty line
- percelderlypoverty - Percent of elderly below poverty line
Note: The original description of the dataset was not documented and the current descriptions here (provided by tidyverse package) are based on speculation.
Problem 1
- In Dataset_1 the
areavariable is spread across multiple columns that represent values of thestatevariable. Transform/Tidy the dataset as follows: move the column names (IL-WI) to thestatevariable and the column values to theareavariable. Then, remove all the missing value from the dataset. Name the new datasetdata1.
- In Dataset_2 the
racevariable combines information about the total population for each racial group, given in following order: White, Black, American Indians, Asians, Other. Pull apart this column into multiple columns, by splitting wherever a separator character appers. Name these variablespopwhite,popblack,popamerindian,popasian, andpopother, respectively. Name the new datasetdata2.
-
Merge
data1anddata2: they should be joined by thestatevariable and the output result should contain all the observation fromdata1, regardless of whether they match or not. What are the dimensions of the output? How many missing values does it contain?Repeat the above procedure again by joining the datasets by the
stateandcountyvariables.
-
Merge
data1anddata2: they should be joined by thestatevariable and the output result should contain all the observation fromdata2, regardless of whether they match or not. What are the dimensions of the output? How many missing values does it contain?Repeat the above procedure again by joining the datasets by the
stateandcountyvariables.
-
Merge
data1anddata2: they should be joined by thestatevariable and the output result should contain all the observation from bothdata1anddata2, regardless of whether they match or not. What are the dimensions of the output? How many missing values does it contain?Repeat the above procedure again by joining the datasets by the
stateandcountyvariables.
- Join
data1anddata2by thestateandcountyvariables such that the unmatched rows in either input dataset are not included in the result. Name the output datasetdata.
- Remove
PIDandcategoryfromdataby selecting all variables except these two. Then, remove all duplicates from the dataset.
-
Add a new variable to the dataset (name it
popdensity), which represents a population density in each county (in order to compute a population density you need to divide the total population by the corresponding area).Then, for each racial group, add a new variable to the dataset that will represent the percentage of the total population that belongs to this group (name them
percwhite,percblack,percamerindan,percasian, andpercother, respectively).
- Join
dataandDataset_3by thestateandcountyvariables such that the unmatched rows in either input dataset are not included in the result.
- Add new variables to the dataset that correspond to the
percbelowpovertyandpercadultpovertyvariables and represent the actual number of people that belong to these groups. Name these variablespopcbelowpoverty, andpopcadultpoverty, respectively.
- Move the
inmetrovariable to the end of the dataset; change the position of thepoptotalvariable so that it comes after theareavariables; change the position of thepopdensityvariable so that it comes after thepoptotalvariables.
- What is the biggest/smallest state by the total population? What is the biggest/smallest state by the total area? What is the average population density in each state? In each state, compare the population densities of counties that are in metro area to counties that are not. Hint: you might want to use the grouping functions.
- In all states combined, what is the average percentage of each racial group? In each state separately, what is the average percentage of each racial group? In each state separately, compare the average percentage of each racial group in metro areas to non-metro areas. Discuss the results you obtained.
- For each, state calculate the average percentage of the total population who are below poverty. Now using the same measurement, compare the metro areas to non-metro areas in each state. What do you observe? Now filter out counties with less than 50000 population and repeat the previous step. What do you observe?