STA 138 Fall 2015
Homework 6 - Due Wednesday, November 18th
1. Use the flu.csv dataset, as we did in Homework 4 and
5.
(a) Find the best model using forward-backward subset
selection using AIC, and report the best-fitting
model.
(b) Find the best model using forward-backward subset
selection using BIC, and report the best-fitting
model.
(c) Report the AIC and BIC from the model in (b).
(d) Using the model from (b), estimate the probability
that a female aged 54 with an awareness score of 83
would get a flu shot (notice you may not have to use
all of the information given based on your model).
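As a rough sketch of the workflow in problem 1, the R code below runs forward-backward selection with step() under AIC and BIC, then predicts a probability for a new subject. The data are simulated, and the column names (age, aware, gender, shot) and the gender coding are assumptions for illustration, not the actual contents of flu.csv.

```r
# Simulated stand-in for flu.csv; all column names are hypothetical.
set.seed(138)
n <- 300
flu <- data.frame(
  age    = round(runif(n, 40, 80)),
  aware  = round(runif(n, 30, 90)),
  gender = rbinom(n, 1, 0.5)   # assumed coding: 1 = male, 0 = female
)
flu$shot <- rbinom(n, 1, plogis(-1.5 + 0.08 * flu$age - 0.06 * flu$aware))

# Smallest and largest models that the search may visit.
empty <- glm(shot ~ 1, family = binomial, data = flu)
full  <- glm(shot ~ age + aware + gender, family = binomial, data = flu)
scope <- list(lower = formula(empty), upper = formula(full))

# Forward-backward: start from the empty model and allow moves in both
# directions. k = 2 gives the AIC penalty, k = log(n) the BIC penalty.
fb_aic <- step(empty, scope = scope, direction = "both", k = 2, trace = 0)
fb_bic <- step(empty, scope = scope, direction = "both", k = log(n), trace = 0)

formula(fb_aic)
formula(fb_bic)
c(AIC = AIC(fb_bic), BIC = BIC(fb_bic))

# Estimated probability for a new subject; columns that the selected
# model does not use are simply ignored by predict().
new_subject <- data.frame(age = 54, aware = 83, gender = 0)
p_hat <- predict(fb_bic, newdata = new_subject, type = "response")
p_hat
```

With the real data, replace the simulated data frame with the result of read.csv("flu.csv") and adjust the variable names to match.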
2. Continue with problem 1, and use the model found in
problem 1, part (b).
(a) Use the Hosmer-Lemeshow goodness-of-fit test with
g = 8 to test how well our model is fitting. State
the null and alternative hypothesis, the value of the
test-statistic, the p-value, and your conclusion.
(b) Plot a histogram of the standardized residuals. Does
it appear that the assumption that they are standard
normal holds? Why or why not?
(c) Are any values of the standardized residuals larger
than 3? If so, identify what combination of X vari-
ables it was for.
(d) Which observation(s) most influenced the
change in the coefficients (had the largest DFbeta)?
List the observation and the corresponding values of
the predictors.
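The diagnostics in problem 2 can be sketched in base R as follows. This is a minimal illustration on simulated data with hypothetical column names. The helper hl_test() is written by hand for this example (it is not a library function) and assumes the fitted probabilities have no ties at the group boundaries.

```r
# Simulated data; column names are hypothetical stand-ins.
set.seed(138)
n <- 300
flu <- data.frame(age = round(runif(n, 40, 80)),
                  aware = round(runif(n, 30, 90)))
flu$shot <- rbinom(n, 1, plogis(-1.5 + 0.08 * flu$age - 0.06 * flu$aware))
fit <- glm(shot ~ age + aware, family = binomial, data = flu)

# Hand-written Hosmer-Lemeshow statistic: split observations into g groups
# by quantiles of the fitted probability, then compare observed and expected
# counts of successes. Reference distribution: chi-square with g - 2 df.
hl_test <- function(y, p, g = 8) {
  breaks <- quantile(p, probs = seq(0, 1, length.out = g + 1))
  grp    <- cut(p, breaks = breaks, include.lowest = TRUE)
  n_g    <- tapply(y, grp, length)
  obs    <- tapply(y, grp, sum)
  expd   <- tapply(p, grp, sum)
  stat   <- sum((obs - expd)^2 / (expd * (1 - expd / n_g)))
  c(statistic = stat, df = g - 2,
    p.value = pchisq(stat, df = g - 2, lower.tail = FALSE))
}
hl <- hl_test(flu$shot, fitted(fit), g = 8)
hl

# Standardized Pearson residuals: histogram and any values beyond 3.
r_std <- rstandard(fit, type = "pearson")
hist(r_std, main = "Standardized Pearson residuals", xlab = "Residual")
flu[abs(r_std) > 3, ]

# dfbeta() gives the change in each coefficient when one observation is
# dropped; find the row with the largest absolute change per coefficient.
db    <- dfbeta(fit)
worst <- apply(abs(db), 2, which.max)
worst
flu[unique(worst), ]
```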
3. Continue with problem 1, and use the model found in
problem 1, part (b).
(a) Find the value of AUC, the 95% confidence interval
for AUC, and plot the ROC.
(b) Does this value of AUC suggest that the model has
fit the data well? Explain your answer.
(c) Fit the full model (including all predictors) and re-
peat (a) for the full model.
(d) What does (c) suggest about AUC and adding predictors,
if anything?
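One way to compute the quantities in problem 3 with base R alone is sketched below. It uses the rank (Mann-Whitney) form of AUC and a percentile bootstrap interval, which is a stand-in technique chosen here for self-containment. A dedicated ROC package may use a different interval method, so the numbers need not match exactly. The data, variable names, and the helpers auc_rank() and boot_ci() are all hypothetical.

```r
# Simulated data with two real predictors and one pure-noise predictor.
set.seed(138)
n <- 300
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
dat$y <- rbinom(n, 1, plogis(0.2 + 1.0 * dat$x1 - 0.8 * dat$x2))

small <- glm(y ~ x1 + x2, family = binomial, data = dat)
full  <- glm(y ~ x1 + x2 + noise, family = binomial, data = dat)

# AUC as the Mann-Whitney U statistic divided by n1 * n0.
auc_rank <- function(y, p) {
  r  <- rank(p)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Percentile bootstrap interval for AUC, resampling (y, p) pairs.
# The model is not refit inside the loop.
boot_ci <- function(y, p, B = 1000, level = 0.95) {
  aucs <- replicate(B, {
    idx <- sample(length(y), replace = TRUE)
    auc_rank(y[idx], p[idx])
  })
  alpha <- 1 - level
  quantile(aucs, c(alpha / 2, 1 - alpha / 2), na.rm = TRUE)
}

p_small   <- fitted(small)
p_full    <- fitted(full)
auc_small <- auc_rank(dat$y, p_small)
auc_full  <- auc_rank(dat$y, p_full)
ci_small  <- boot_ci(dat$y, p_small)
ci_full   <- boot_ci(dat$y, p_full)

# ROC curve for the smaller model: sweep the threshold from high to low.
thr <- sort(unique(p_small), decreasing = TRUE)
tpr <- c(0, sapply(thr, function(t) mean(p_small[dat$y == 1] >= t)))
fpr <- c(0, sapply(thr, function(t) mean(p_small[dat$y == 0] >= t)))
plot(fpr, tpr, type = "l", main = "ROC curve",
     xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)   # reference line for a non-informative classifier
```

Comparing auc_small with auc_full, and the two intervals, on your own fitted models is one way to approach part (d).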
4. Online you will find an expanded dataset
largework.csv. It has the following columns:
Column 1. gender: 1 indicates the subject was male, 0
indicates female.
Column 2. age: the age of the subject.
Column 3. marriage: with levels 1 = married, 2 = wid-
owed, 3 = divorced, 5 = never married.
Column 4. min: minutes of Sedentary Activity per Week
Column 5. chol: total cholesterol
Column 6. sysbp: systolic Blood Pressure measurement
Column 7. height: height of the subject
Column 8. y: 1 indicates the subject was obese, 0 otherwise.
Again, assume our response variable is y (whether the subject is obese).
(a) Display the model formula for the "best" model using
forward subset selection and BIC.
(b) Display the model formula for the "best" model using
backward subset selection and BIC.
(c) Display the model formula for the "best" model using
backward-forward subset selection and BIC.
(d) Display the model formula for the "best" model using
forward-backward subset selection and BIC.
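The four searches in problem 4 differ only in the starting model and the allowed direction of moves. The sketch below shows one way to set them up with step(), on simulated data that mimics a few of the largework.csv columns. The simulated values and the reduced column set are assumptions for illustration.

```r
# Simulated stand-in for largework.csv with a subset of its columns.
set.seed(138)
n <- 400
largework <- data.frame(
  gender   = rbinom(n, 1, 0.5),
  age      = round(runif(n, 20, 80)),
  marriage = factor(sample(c(1, 2, 3, 5), n, replace = TRUE)),
  chol     = rnorm(n, 200, 30),
  sysbp    = rnorm(n, 125, 15)
)
largework$y <- rbinom(n, 1, plogis(-6 + 0.02 * largework$age +
                                     0.035 * largework$sysbp))

empty <- glm(y ~ 1, family = binomial, data = largework)
full  <- glm(y ~ gender + age + marriage + chol + sysbp,
             family = binomial, data = largework)
scope <- list(lower = formula(empty), upper = formula(full))
k_bic <- log(n)   # BIC penalty per parameter

# Forward: start empty, only add terms.
fwd     <- step(empty, scope = scope, direction = "forward", k = k_bic, trace = 0)
# Backward: start full, only drop terms.
bwd     <- step(full, direction = "backward", k = k_bic, trace = 0)
# Backward-forward: start full, allow both drops and additions.
bwd_fwd <- step(full, scope = scope, direction = "both", k = k_bic, trace = 0)
# Forward-backward: start empty, allow both additions and drops.
fwd_bwd <- step(empty, scope = scope, direction = "both", k = k_bic, trace = 0)

# Display the selected model formulas side by side.
lapply(list(forward = fwd, backward = bwd,
            backward_forward = bwd_fwd, forward_backward = fwd_bwd), formula)
```

Because marriage is stored as a factor, step() adds or drops all of its dummy variables together as a single term.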
5. Continue with problem 4.
(a) Display the model formula for the "best" model using
all subset selection and BIC.
(b) Display the model formula for the "best" model using
all subset selection and AIC.
(c) For the best model in (a), find the value of AUC, the
95% confidence interval for AUC, and plot the ROC.
(d) For the best model in (b), find the value of AUC, the
95% confidence interval for AUC, and plot the ROC.
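All-subset selection, as in problem 5, fits every combination of predictors and keeps the one with the smallest criterion value. The brute-force sketch below uses only base R and simulated data with four hypothetical predictors. Dedicated packages exist for this task, and the course may expect one of them, so treat this as an illustration of the idea only.

```r
# Simulated data; predictor names loosely follow largework.csv.
set.seed(138)
n <- 400
dat <- data.frame(
  gender = rbinom(n, 1, 0.5),
  age    = round(runif(n, 20, 80)),
  chol   = rnorm(n, 200, 30),
  sysbp  = rnorm(n, 125, 15)
)
dat$y <- rbinom(n, 1, plogis(-6 + 0.02 * dat$age + 0.035 * dat$sysbp))

preds <- c("gender", "age", "chol", "sysbp")

# Enumerate every subset of predictors, including the empty set.
subsets <- c(
  list(character(0)),
  unlist(lapply(seq_along(preds),
                function(k) combn(preds, k, simplify = FALSE)),
         recursive = FALSE)
)

# Fit one logistic regression per subset.
fits <- lapply(subsets, function(s) {
  rhs <- if (length(s) > 0) s else "1"
  glm(reformulate(rhs, response = "y"), family = binomial, data = dat)
})

bics <- sapply(fits, BIC)
aics <- sapply(fits, AIC)

best_bic <- fits[[which.min(bics)]]
best_aic <- fits[[which.min(aics)]]
formula(best_bic)
formula(best_aic)
```

With p candidate predictors this fits 2^p models, so it is only practical for small p. A factor such as marriage would enter or leave as one term.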
6. Continue with problem 4,
but remove the column for marriage. This can be
done by the following (assuming you called your data
largework):
they = as.factor(largework$y)
thex= as.matrix(largework[,-c(3,8)])
(a) Using the lasso penalty, find the best model according
to AUC. Write down the estimated logistic
regression model.
(b) Using the ridge penalty, find the best model according
to AUC. Write down the estimated logistic
regression model.
(c) What do you think explains the difference in the
models chosen here, compared to the models selected
in 5 (a) and (b)?
(d) If we had used either AIC or BIC, do you think the
models would have been larger or smaller than the
ones chosen in (a) and (b) of this problem?
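A possible setup for problem 6 is sketched below, assuming the glmnet package is installed and is the intended tool (the they/thex construction above suggests it, but the assignment does not name a package). The data are simulated with column names borrowed from largework.csv. cv.glmnet() with type.measure = "auc" chooses the penalty by cross-validated AUC.

```r
library(glmnet)   # assumption: glmnet is installed

# Simulated stand-ins for thex (numeric matrix) and they (factor response).
set.seed(138)
n <- 400
thex <- cbind(
  gender = rbinom(n, 1, 0.5),
  age    = runif(n, 20, 80),
  min    = rnorm(n, 2000, 500),
  chol   = rnorm(n, 200, 30),
  sysbp  = rnorm(n, 125, 15),
  height = rnorm(n, 170, 10)
)
they <- as.factor(rbinom(n, 1, plogis(-6 + 0.02 * thex[, "age"] +
                                        0.035 * thex[, "sysbp"])))

# alpha = 1 is the lasso penalty, alpha = 0 is the ridge penalty.
cv_lasso <- cv.glmnet(thex, they, family = "binomial",
                      type.measure = "auc", alpha = 1)
cv_ridge <- cv.glmnet(thex, they, family = "binomial",
                      type.measure = "auc", alpha = 0)

# Coefficients at the penalty value with the best cross-validated AUC.
b_lasso <- as.matrix(coef(cv_lasso, s = "lambda.min"))
b_ridge <- as.matrix(coef(cv_ridge, s = "lambda.min"))
b_lasso
b_ridge

# Best cross-validated AUC reached along each path.
c(lasso = max(cv_lasso$cvm), ridge = max(cv_ridge$cvm))
```

The lasso can set some coefficients exactly to zero, while ridge shrinks coefficients toward zero but keeps every predictor in the model. Results vary with the random fold assignment, so set a seed before calling cv.glmnet() if you need reproducible output.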