Using the training data (the first 4169 observations), the tables below give the counts of how often the terms Adult and Age appear in the documents. The tables whose names end in 1 are for the ham data and the tables whose names end in 0 are for the spam data.
         smsAdult1
smsAge1    No  Yes
    No   3598    2
    Yes     5    0

         smsAdult0
smsAge0    No  Yes
    No    549    3
    Yes    12    0
As in the notes, we will always use observed frequencies to estimate probabilities.
So, if \(x_1\) represents “Age in the document” and \(x_2\) represents “Adult in the document” then we get an estimate of \(p(x_1,x_2 \,|\, Y=1)\) by dividing the first table by its sum.
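As a minimal sketch of "dividing the table by its sum" (written in Python rather than the R used in class; the variable names are illustrative and not from the notes), the ham counts can be turned into an estimated joint distribution like this:

```python
# Counts from the ham table (smsAge1 rows, smsAdult1 columns), keyed as (age, adult).
ham_counts = {
    ("no", "no"): 3598, ("no", "yes"): 2,
    ("yes", "no"): 5,   ("yes", "yes"): 0,
}

# Total number of ham documents in the training data.
n_ham = sum(ham_counts.values())

# Estimated p(x1, x2 | Y=1): every cell divided by the table total.
ham_joint = {cell: count / n_ham for cell, count in ham_counts.items()}

# Marginal estimate for "Age in the document": sum the joint over both Adult values.
p_age_yes_given_ham = ham_joint[("yes", "no")] + ham_joint[("yes", "yes")]

print(n_ham)
print(ham_joint)
print(p_age_yes_given_ham)
```

The same steps applied to the spam table give the estimate of \(p(x_1,x_2 \,|\, Y=0)\).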
(a)
Using the tables, check that the simple frequency estimate of \(p(age=yes \,|\, ham)\) is .00138, as in the notes.
(b)
Use the table and the Naive Bayes assumption to estimate \(p(ham \,|\, adult = no, age=yes)\).
(c)
Use the table to estimate \(p(ham \,|\, adult = no, age=yes)\) without assuming Age and Adult are independent given y=ham/spam.
(d)
What happens if we try to estimate \(p(ham \,|\, adult=yes,age=yes)\) without the Naive Bayes assumption?
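The two kinds of estimate asked for in parts (b) through (d) can be sketched as follows. This is a hedged Python illustration, not the code from the notes; the function names `marginal`, `naive_bayes_posterior`, and `joint_posterior` are made up for this example. The counts are copied from the two tables, keyed as (age, adult).

```python
# Counts keyed as (age, adult), copied from the ham (1) and spam (0) tables.
counts = {
    "ham":  {("no", "no"): 3598, ("no", "yes"): 2, ("yes", "no"): 5,  ("yes", "yes"): 0},
    "spam": {("no", "no"): 549,  ("no", "yes"): 3, ("yes", "no"): 12, ("yes", "yes"): 0},
}
n_total = sum(sum(table.values()) for table in counts.values())


def marginal(table, index, value):
    """Frequency estimate of p(feature = value | class) from one class's table."""
    hits = sum(c for cell, c in table.items() if cell[index] == value)
    return hits / sum(table.values())


def naive_bayes_posterior(age, adult):
    """p(ham | age, adult), treating Age and Adult as independent given the class."""
    score = {}
    for label, table in counts.items():
        prior = sum(table.values()) / n_total
        score[label] = prior * marginal(table, 0, age) * marginal(table, 1, adult)
    return score["ham"] / (score["ham"] + score["spam"])


def joint_posterior(age, adult):
    """p(ham | age, adult) directly from the joint counts; None if no document matches."""
    ham = counts["ham"][(age, adult)]
    spam = counts["spam"][(age, adult)]
    if ham + spam == 0:
        return None  # 0/0: the frequency estimate is undefined
    return ham / (ham + spam)


print(naive_bayes_posterior("yes", "no"))
print(joint_posterior("yes", "no"))
print(joint_posterior("yes", "yes"))
```

The direct estimate reduces to a ratio of cell counts because the class priors cancel against the table totals. When both cells are zero, as for adult=yes and age=yes, that ratio has nothing to work with, while the Naive Bayes version still returns a number because it only needs the one-variable margins.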
Get the susedcars.csv data set from the webpage. Plot x=mileage versus y=price. (price is the price of a used car.)
Does the relationship between mileage and price make sense?
Add the fit from a linear regression to the plot. Add the fit from kNN for various values of k to the plot.
For what value of k does the plot look nice?
Using your “nice” value of k, what is the predicted price of a car with 100,000 miles on it?
What is the prediction from a linear fit?
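To make the two fits concrete, here is a small pure-Python sketch of a kNN prediction (a plain average of the k nearest prices) next to a least-squares line. It is an illustration under stated assumptions, not the class code: in class the fits come from R, the helper names `knn_predict` and `linear_fit` are hypothetical, and the six data points are made up rather than taken from susedcars.csv.

```python
def knn_predict(x_train, y_train, x0, k):
    """Average the y values of the k training points whose x is closest to x0."""
    nearest = sorted(zip(x_train, y_train), key=lambda p: abs(p[0] - x0))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def linear_fit(x, y):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Made-up toy values for illustration only, NOT taken from susedcars.csv.
mileage = [10_000, 30_000, 60_000, 90_000, 110_000, 150_000]
price = [50_000, 42_000, 30_000, 22_000, 18_000, 9_000]

intercept, slope = linear_fit(mileage, price)
print(knn_predict(mileage, price, 100_000, k=2))  # kNN prediction at 100,000 miles
print(intercept + slope * 100_000)                # linear prediction at 100,000 miles
```

Small k makes the kNN curve follow individual points closely, while large k flattens it toward the overall mean; the linear fit has no such tuning parameter.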
We are going to use the used cars data again.
Previously, we used the “eyeball” method to choose k for a kNN fit for mileage predicting price.
Use 5-fold cross-validation to choose k. How does your fit compare with the eyeball method?
Plot the data and then add the fits using the k you chose by cross-validation and the k you chose by eyeball.
Use kNN with the k you chose using cross-validation to get a prediction for a used car with 100,000 miles on it. Use all the observations as training data to get your prediction (given your choice of k).
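The cross-validation procedure described above can be sketched in plain Python as follows. This is a hedged illustration rather than the class code: the function names `cv_rmse` and `choose_k`, the candidate values of k, and the simulated data are all assumptions made for the example, and out-of-sample RMSE is assumed as the loss.

```python
import math
import random


def knn_predict(x_train, y_train, x0, k):
    """Average the y values of the k training points whose x is closest to x0."""
    nearest = sorted(zip(x_train, y_train), key=lambda p: abs(p[0] - x0))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def cv_rmse(x, y, k, n_folds=5, seed=0):
    """Out-of-sample RMSE of kNN with k neighbors, estimated by n_folds-fold CV."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)                   # random assignment to folds
    folds = [idx[i::n_folds] for i in range(n_folds)]
    sq_err = 0.0
    for held_out in folds:
        held = set(held_out)
        x_tr = [x[i] for i in idx if i not in held]    # train on the other folds
        y_tr = [y[i] for i in idx if i not in held]
        for i in held_out:                             # predict the held-out fold
            sq_err += (y[i] - knn_predict(x_tr, y_tr, x[i], k)) ** 2
    return math.sqrt(sq_err / len(x))


def choose_k(x, y, candidates, n_folds=5, seed=0):
    """Return the candidate k with the smallest CV RMSE, plus all the scores."""
    scores = {k: cv_rmse(x, y, k, n_folds, seed) for k in candidates}
    return min(scores, key=scores.get), scores


# Simulated toy data for illustration only, NOT the used cars data.
rng = random.Random(1)
x = [rng.uniform(0, 200) for _ in range(60)]
y = [60 - 0.25 * xi + rng.gauss(0, 3) for xi in x]

best_k, scores = choose_k(x, y, candidates=[1, 3, 5, 10, 20])
print(best_k, scores)
```

Using the same seed for every candidate k means all values of k are compared on identical folds. Once k is chosen, the final prediction refits on all observations, as the problem asks.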
Use kNN to get a prediction for a 2008 car with 75,000 miles on it!
Remember:
Is your predictive accuracy better using (mileage,year) than it was with just mileage?
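With two predictors, the neighbors are found by distance in the (mileage, year) plane. The sketch below assumes, as is common practice with kNN, that each predictor is standardized first so that mileage (tens of thousands of miles) does not swamp year (a handful of years). The helper names `standardize` and `knn_predict_2d` are hypothetical, the code is Python rather than the R used in class, and the six cars are made-up values, not rows of the used cars data.

```python
import math


def standardize(columns):
    """Return z-scored copies of each column plus the (mean, sd) used for each."""
    scaled, params = [], []
    for col in columns:
        mean = sum(col) / len(col)
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (len(col) - 1))
        scaled.append([(v - mean) / sd for v in col])
        params.append((mean, sd))
    return scaled, params


def knn_predict_2d(mileage, year, price, new_mileage, new_year, k):
    """Average the prices of the k nearest cars in standardized (mileage, year) space."""
    (m_s, y_s), ((m_mu, m_sd), (y_mu, y_sd)) = standardize([mileage, year])
    # Put the new car on the same scale as the training data.
    q = ((new_mileage - m_mu) / m_sd, (new_year - y_mu) / y_sd)
    dist = [math.hypot(a - q[0], b - q[1]) for a, b in zip(m_s, y_s)]
    nearest = sorted(range(len(price)), key=dist.__getitem__)[:k]
    return sum(price[i] for i in nearest) / len(nearest)


# Made-up toy values for illustration only, NOT taken from the used cars data.
mileage = [20_000, 40_000, 70_000, 80_000, 120_000, 150_000]
year = [2012, 2011, 2008, 2009, 2005, 2003]
price = [45_000, 38_000, 27_000, 29_000, 15_000, 9_000]

print(knn_predict_2d(mileage, year, price, 75_000, 2008, k=2))
```

The new car is scaled with the training means and standard deviations, not its own, so that it lands in the same coordinate system as the cars it is compared with.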
In our class examples we used kernel="rectangular" when calling kknn.
Have a look at the help for kknn (?kknn).
The rectangular option simply averages the y values over the neighbors.
The other kernel options allow you to use a weighted average. The idea is that closer neighbors should get more weight.
Using the used cars data and the predictors (features!!) (mileage,year), see if the “optimal” kernel option gives different (better?) results than the rectangular option.
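The difference between a plain and a weighted average over neighbors can be sketched in a few lines of Python. This is only an illustration of the idea, not kknn's implementation: the function name `weighted_knn_predict` is hypothetical, the triangular weight is just one simple choice of decreasing weight, scaling distances by the distance to the (k+1)th neighbor is an assumption of this sketch, and the data are made up.

```python
def weighted_knn_predict(x_train, y_train, x0, k, kernel="rectangular"):
    """kNN prediction as a weighted average of the k nearest y values."""
    ranked = sorted(zip(x_train, y_train), key=lambda p: abs(p[0] - x0))
    nearest = ranked[:k]
    if kernel == "rectangular":
        # Every neighbor counts equally: a simple average.
        weights = [1.0] * len(nearest)
    elif kernel == "triangular":
        # Assumption of this sketch: scale distances by the distance to the
        # (k+1)th neighbor, so weights fall linearly from 1 toward 0.
        ref = ranked[k] if len(ranked) > k else ranked[-1]
        scale = abs(ref[0] - x0)
        weights = [1.0 - abs(xi - x0) / scale if scale > 0 else 1.0
                   for xi, _ in nearest]
    else:
        raise ValueError("kernel must be 'rectangular' or 'triangular'")
    if sum(weights) == 0:
        # All neighbors tie with the reference distance: fall back to equal weights.
        weights = [1.0] * len(nearest)
    return sum(w * yi for w, (_, yi) in zip(weights, nearest)) / sum(weights)


# Made-up toy values for illustration only.
x = [0, 1, 2, 3, 10]
y = [10, 12, 14, 16, 30]

print(weighted_knn_predict(x, y, 0.5, k=3, kernel="rectangular"))
print(weighted_knn_predict(x, y, 0.5, k=3, kernel="triangular"))
```

In this toy case the weighted prediction is pulled toward the y values of the two closest points, because the third neighbor is farther away and receives less weight.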