Problem Set 4

Problem 1

In this problem use functions from the apply() family of functions whenever you think you need loops.

Create a function (name it fun1) that takes three arguments n, miu, and sd (where n is the number of randomly generated numbers, miu is the mean value of the distribution that numbers come from, and sd is the standard deviation of that distribution) and does the following:
- First it generates n random numbers from a normal distribution with mean = miu and standard deviation = sd
- Then it populates an empty matrix as follows: if n is divisible by 2 but not by 3, it fills a matrix by row, having 2 rows and n/2 columns; if n is divisible by 3 but not by 2, it fills a matrix by column, having 3 columns and n/3 rows; if n is divisible by 2 and by 3, it fills a matrix by row, having 6 rows and n/6 columns; else it fills by row and has only one column
- Finally, it returns a vector of the mean values of each column in the matrix if none of the randomly generated numbers are negative; otherwise (if at least one of the generated numbers is negative), it returns a vector of the median values of each row in the matrix.
Test your function with (n = 50, miu = 0.5, sd = 2)
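As a minimal illustration of the building blocks involved (not a solution to fun1), the sketch below uses six fixed toy values instead of rnorm() output. The object names x, m_row, m_col, col_means, row_medians, and div_by_2_not_3 are made up for this example.

```r
# Toy input: six fixed values instead of rnorm() output (illustrative only).
x <- c(1, 2, 3, 4, 5, 6)

# Fill by row: 2 rows, 3 columns.
m_row <- matrix(x, nrow = 2, byrow = TRUE)

# Fill by column (R's default): 3 columns, 2 rows.
m_col <- matrix(x, ncol = 3, byrow = FALSE)

# MARGIN = 2 applies the function to each column, MARGIN = 1 to each row.
col_means <- apply(m_row, 2, mean)
row_medians <- apply(m_row, 1, median)

# Divisibility checks can be written with the modulo operator %%.
div_by_2_not_3 <- (length(x) %% 2 == 0) && (length(x) %% 3 != 0)
```

Here length(x) is 6, which is divisible by both 2 and 3, so div_by_2_not_3 is FALSE.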

Create a function (name it fun2) that takes one argument n, the number of trials in your simulation. Your function should simulate the process of rolling an unfair die with the following probability distribution: “1” - 0.1, “2” - 0.3, “3” - 0.2, “4” - 0.05, “5” - 0.3, “6” - 0.05. It should report the results of n trials by returning the frequency table of the outcomes (that is, how many times you got “1”, “2”, and so on).

Test your function with n = 100.
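One possible way to draw outcomes with unequal probabilities is sample() with its prob argument. The sketch below uses a hypothetical biased coin rather than the die from the problem; the names faces, probs, flips, and freq, the seed, and the probabilities are all assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed, used only to make the sketch reproducible

faces <- c("heads", "tails")  # hypothetical outcomes
probs <- c(0.7, 0.3)          # hypothetical probabilities; they sum to 1

# Draw 20 outcomes with replacement, weighting each outcome by probs.
flips <- sample(faces, size = 20, replace = TRUE, prob = probs)

# A factor with explicit levels keeps outcomes that never occurred
# in the table with a count of zero.
freq <- table(factor(flips, levels = faces))
```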

Create a function (name it fun3) that takes one argument n, the number of trials in your simulation. Your function should simulate the process of rolling two fair dice. It should report the results of n trials by returning the vectors of outcomes from both dice (name them Die1 and Die2, respectively) and a vector that compares the outcomes obtained from Die1 and Die2 (name it comparison). The comparison vector should be created with the mapply() function and should be a factor with three levels (“Die1 > Die2”, “Die1 < Die2”, “Die1 = Die2”). For instance, if the outcome of Die1 is 5 and the outcome of Die2 is 4, then the corresponding element should be “Die1 > Die2”.

Test your function with n = 30.
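For reference, mapply() applies a function to the first elements of its vector arguments, then the second elements, and so on. The sketch below shows this on two short fixed vectors; the names a, b, labels, and compare_one are hypothetical and are not the names required by the problem.

```r
a <- c(5, 2, 3)
b <- c(4, 6, 3)
labels <- c("a > b", "a < b", "a = b")

# Hypothetical helper that compares one pair of values.
compare_one <- function(x, y) {
  if (x > y) "a > b" else if (x < y) "a < b" else "a = b"
}

# mapply() calls compare_one(a[1], b[1]), compare_one(a[2], b[2]), ...
# Passing levels explicitly keeps all three levels even if one never occurs.
comparison <- factor(mapply(compare_one, a, b), levels = labels)
```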

Problem 2

Once again, you will be working with the ames_housing.csv dataset. It contains information about the homes sold in Ames, Iowa, between 2006 and 2010. The dataset includes 18 variables and covers a wide range of home characteristics. For the purpose of this assignment, you are going to work with the following variables:

Sales_Price - Sale Price
Neighborhood - Physical locations within Ames city limits
Roof_Style - Type of roof
Fence - Fence quality
TotRms_AbvGrd - Total rooms above grade (does not include bathrooms)
Gr_Liv_Area - Above grade (ground) living area in square feet

John is a new real estate agent in the area. A friend of his, who worked in this area before, told him that the average price for houses sold in the area is $210,000. John feels a bit skeptical about this claim and wants to test it. Perform an appropriate statistical procedure to test this claim. Use the $\alpha = 0.05$ significance level to make a conclusion. State the null and alternative hypotheses. What is the test statistic? What are the degrees of freedom? What is the p-value? What did you conclude?
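A one-sample t-test is one common procedure for a claim about a single mean. The sketch below shows where the requested quantities live in the object returned by t.test(); it runs on simulated values, not on the Ames data, and the vector prices, the seed, and the simulation parameters are assumptions made for illustration.

```r
set.seed(42)  # arbitrary seed
# Simulated prices (hypothetical; NOT the Sales_Price column).
prices <- rnorm(40, mean = 200000, sd = 30000)

# H0: mu = 210000 versus H1: mu != 210000.
res <- t.test(prices, mu = 210000, alternative = "two.sided", conf.level = 0.95)

res$statistic  # test statistic (t)
res$parameter  # degrees of freedom (n - 1)
res$p.value    # p-value

# Decision at the 0.05 level.
reject_h0 <- res$p.value < 0.05
```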

John has been in this industry for quite some time. Over the past 10 years, he has noticed that the type of roof a house has seems to be associated with its fence quality. In other words, he thinks that these two features (the Roof_Style and Fence variables in the dataset) are related. Test his hypothesis based on the sample data at the $\alpha = 0.01$ significance level. State the null and alternative hypotheses. What is the test statistic? What are the degrees of freedom? What is the p-value? What did you conclude?
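A chi-squared test of independence is one common way to check whether two categorical variables are related. The sketch below applies chisq.test() to a small table of made-up counts; the table tab and its labels are hypothetical and do not come from the Ames data.

```r
# Hypothetical counts: rows are levels of one factor, columns of another.
tab <- matrix(c(30, 20, 10,
                15, 25, 20),
              nrow = 2, byrow = TRUE,
              dimnames = list(group = c("A", "B"),
                              rating = c("low", "mid", "high")))

# H0: the two factors are independent; H1: they are associated.
res <- chisq.test(tab)

res$statistic  # chi-squared statistic
res$parameter  # degrees of freedom: (rows - 1) * (columns - 1)
res$p.value    # p-value
res$expected   # expected counts under H0, useful for checking assumptions
```

With raw data, a contingency table of two factor columns can be built with table() before it is passed to chisq.test().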

John is specifically interested in two neighborhoods: North_Ames and College_Creek. He did some research prior to his arrival, and he thinks that, on average, houses in the College_Creek neighborhood are $50,000 more expensive than houses sold in the North_Ames neighborhood. Test his hypothesis at the $\alpha = 0.02$ significance level. State the null and alternative hypotheses. What is the test statistic? What are the degrees of freedom? What is the p-value? What did you conclude?
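When the hypothesized difference between two means is not zero, the mu argument of t.test() sets that difference. The sketch below uses two simulated groups, not the Ames data, and assumes unequal variances (Welch's test); whether that assumption suits the real data is for you to check. The names group_a and group_b, the seed, and the simulation parameters are hypothetical.

```r
set.seed(7)  # arbitrary seed
# Simulated prices for two hypothetical groups.
group_a <- rnorm(35, mean = 250000, sd = 40000)
group_b <- rnorm(50, mean = 200000, sd = 30000)

# H0: mean(a) - mean(b) = 50000 versus H1: the difference is not 50000.
# var.equal = FALSE requests Welch's test (an assumption of this sketch).
res <- t.test(group_a, group_b, mu = 50000, var.equal = FALSE)

res$statistic  # test statistic (t)
res$parameter  # Welch degrees of freedom (usually not an integer)
res$p.value    # p-value
```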

John decides to split all homes in the dataset into the following three groups based on the total number of rooms above the ground: "2-4", "5-8", and "9 or more". He believes that 10% of homes belong to the "2-4" category, 85% belong to the "5-8" category, and the remaining 5% belong to the "9 or more" category. Test his hypothesis using the appropriate statistical procedure at the $\alpha = 0.01$ significance level. State the null and alternative hypotheses. What is the test statistic? What are the degrees of freedom? What is the p-value? What did you conclude?

Before you perform the test, you need to add a new variable to the dataset (name it Rooms) using the splitting rule proposed by John. In other words, you need to create a factor variable with 3 levels by converting the TotRms_AbvGrd variable. (Hint: you might want to check out the case_when() function from the dplyr package.)
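As a rough sketch of the two steps (binning a count into a 3-level factor, then testing the observed counts against hypothesized proportions), the example below uses base R's cut() in place of case_when() so that it runs without extra packages. The vector rooms_count is made up for illustration and is not the TotRms_AbvGrd column.

```r
# Hypothetical room counts (NOT the Ames data).
rooms_count <- rep(c(2, 4, 5, 8, 9, 12), times = c(10, 15, 80, 80, 10, 5))

# cut() uses right-closed intervals by default: (-Inf, 4], (4, 8], (8, Inf).
rooms <- cut(rooms_count,
             breaks = c(-Inf, 4, 8, Inf),
             labels = c("2-4", "5-8", "9 or more"))

observed <- table(rooms)

# Goodness-of-fit test: H0 says the category proportions are 0.10, 0.85, 0.05.
res <- chisq.test(observed, p = c(0.10, 0.85, 0.05))

res$statistic  # chi-squared statistic
res$parameter  # degrees of freedom: number of categories - 1
res$p.value    # p-value
```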

Problem 3

The Tri-City Office Equipment Corporation sells an imported copier on a franchise basis and performs preventive maintenance and repair service on this copier. The data in copier_maintenance.txt (available on Courseworks) were collected from 45 recent calls on users to perform routine preventive maintenance service; for each observation, they recorded the number of copiers serviced (Copiers, predictor) and the total number of minutes spent by the service person (Minutes, response).

Fit the least squares linear regression model.

Display summary of the model obtained in part 1.

Plot the data and overlay the linear regression function you obtained in part 1.

Obtain a point estimate of the mean service time when 7 copiers are serviced.

Obtain a 95% confidence interval for the slope.

Obtain a 95% confidence interval for the mean service time when 7 copiers are serviced.

Obtain a 95% prediction interval for the service time of a new call on which 7 copiers are serviced.
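The parts above map onto a small set of base R functions: lm(), summary(), confint(), and predict(). The sketch below shows them on simulated data, not on copier_maintenance.txt; the data frame dat, its columns x and y, the seed, and the simulation parameters are assumptions made for illustration.

```r
set.seed(3)  # arbitrary seed
# Simulated data (hypothetical): y depends linearly on x plus noise.
dat <- data.frame(x = rep(1:10, each = 3))
dat$y <- 5 + 12 * dat$x + rnorm(nrow(dat), sd = 8)

# Least squares fit and its summary.
fit <- lm(y ~ x, data = dat)
fit_summary <- summary(fit)

# Point estimate of the mean response at x = 7.
new_x <- data.frame(x = 7)
point_est <- predict(fit, newdata = new_x)

# 95% confidence interval for the slope.
slope_ci <- confint(fit, "x", level = 0.95)

# 95% confidence interval for the MEAN response at x = 7.
mean_ci <- predict(fit, newdata = new_x, interval = "confidence", level = 0.95)

# 95% prediction interval for a NEW observation at x = 7.
pred_pi <- predict(fit, newdata = new_x, interval = "prediction", level = 0.95)
```

The prediction interval is wider than the confidence interval at the same x because it also accounts for the variability of a single new observation. To draw the data with the fitted line, plot(y ~ x, data = dat) followed by abline(fit) is one option.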
