The Prostate Dataset
The prostate dataset comes from a study on 97 men with prostate cancer who were due to receive radical prostatectomy.
The data contain the following variables:
lcavol: log(cancer volume in cm3)
lweight: log(prostate weight in gm)
age: age in years
lbph: log(benign prostatic hyperplasia amount)
svi: seminal vesicle invasion
lcp: log(capsular penetration)
gleason: Gleason score
pgg45: percentage Gleason scores 4 or 5
lpsa: log(prostate specific antigen in ng/mL)
Question 1
Validate that the prostate data frame contains 97 observations.
Hint: First install the faraway package (if you haven’t already) as instructed in Lesson 1, Slide 49. The following R statement will load the prostate data frame: data("prostate", package = "faraway")
Use the nrow() function to see how many observations (rows) the data frame has. For example, the following statement prints the number of observations in the cars data frame: nrow(cars).
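A row count can also be checked with an assertion instead of being read off the console. The following is a minimal sketch that uses the built-in cars data frame (50 rows) rather than prostate; the names expectedRows and actualRows are hypothetical and chosen only for illustration.

```r
# cars ships with base R (datasets package) and has 50 rows.
expectedRows <- 50            # hypothetical name, for illustration only
actualRows <- nrow(cars)      # number of observations (rows)
print(actualRows)

# stopifnot() raises an error if the condition is FALSE,
# so the script stops when the row count is not what we expect.
stopifnot(actualRows == expectedRows)
```

The same pattern should work for any data frame once the expected number of observations is known.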
Question 2
Calculate descriptive statistics of each of the variables.
Hint: Use the summary() function. For example: summary(cars).
Question 3
Create a new data frame that includes the following variables: lcavol, lweight, age and lpsa.
Use this new data frame for all questions below.
Hint: In the following example, we select two variables (agegp and alcgp) from the esoph data frame and name the new data frame esophSubDf:
esophSubDf <- esoph[c("agegp", "alcgp")]
Question 4
Calculate descriptive statistics of each of the variables using the new data frame.
Question 5
Create a scatter plot matrix for all the variables using the new data frame.
Hint: Use the pairs() function (see Lesson 2, Slide 50).
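As an illustration of pairs(), the sketch below draws a scatter plot matrix for three columns of the built-in mtcars data frame. The data frame name mtcarsSubDf and the plot title are assumptions made for this example; they are not part of the assignment.

```r
# Select three numeric columns from the built-in mtcars data frame.
mtcarsSubDf <- mtcars[c("mpg", "wt", "hp")]

# Draw one scatter plot for every pair of variables.
pairs(mtcarsSubDf, main = "Scatter plot matrix: mpg, wt, hp")
```

Each off-diagonal panel plots one variable against another, so a matrix of three variables contains six panels.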
Question 6
Create a (Pearson) correlation matrix for all the variables.
Hint: Use the cor() function (see Lesson 2, Slide 48).
Question 7
Show the same matrix again, but round the correlations (use two decimal places).
Hint: Use the round() function. The following example calculates the correlation matrix for the cars data frame and rounds the numbers:
round(cor(cars),2)
Question 8
Create a regression model:
The predictor variable (X) should be lpsa.
The outcome variable (Y) should be lcavol.
Show the summary of the model.
Hint: Use the lm() and summary() functions (see Lesson 2, Slide 51).
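In an R model formula, the outcome goes on the left of the tilde and the predictor on the right (outcome ~ predictor). The sketch below shows this on the built-in cars data frame, with dist as the outcome and speed as the predictor; the name carsModel is hypothetical.

```r
# Simple linear regression: dist (outcome) explained by speed (predictor).
carsModel <- lm(dist ~ speed, data = cars)

# Print coefficients, standard errors, R-squared and related statistics.
summary(carsModel)
```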
Question 9
Visualize the two variables and the model you just created by doing the following:
Create a scatter plot. Put lcavol on the y-axis and lpsa on the x-axis. Include the regression line and label the axes.
Hint: See Lesson 2, Slide 52.
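One possible way to combine a scatter plot with a fitted line is plot() followed by abline(). The sketch below uses the built-in cars data frame; the axis labels, title, line color and the name carsModel are choices made for this example only.

```r
# Fit the model first so its line can be added to the plot.
carsModel <- lm(dist ~ speed, data = cars)

# Scatter plot: predictor on the x-axis, outcome on the y-axis.
plot(cars$speed, cars$dist,
     xlab = "Speed (mph)",
     ylab = "Stopping distance (ft)",
     main = "Stopping distance vs. speed")

# abline() accepts a fitted lm object and draws its regression line.
abline(carsModel, col = "red")
```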
Question 10
Update the regression model by adding a second predictor: age.
Show the summary of the updated regression model.
Hint: See Lesson 2, Slide 53.
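A second predictor can be added either by refitting with a longer formula or with update(). The sketch below shows the update() approach on the built-in mtcars data frame; the names onePredictorModel and twoPredictorModel are hypothetical, and the slide referenced above may use a different approach.

```r
# Start with one predictor: mpg (outcome) explained by wt.
onePredictorModel <- lm(mpg ~ wt, data = mtcars)

# ". ~ . + hp" keeps the existing outcome and predictors and adds hp.
twoPredictorModel <- update(onePredictorModel, . ~ . + hp)

# Equivalent direct fit: lm(mpg ~ wt + hp, data = mtcars)
summary(twoPredictorModel)
```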