ML, Fall 2019, Homework 2

1. Naive Bayes

Using the training data (the first 4169 observations), the tables below give the counts for how often the terms Adult and Age appear in the documents. The tables labeled 1 are for the ham data and the tables labeled 0 are for the spam data.

       smsAdult1
smsAge1   No  Yes
    No  3598    2
    Yes    5    0

       smsAdult0
smsAge0  No Yes
    No  549   3
    Yes  12   0

As in the notes, we will always use observed frequencies to estimate probabilities.

So, if \(x_1\) represents “Age in the document” and \(x_2\) represents “Adult in the document”, then we get an estimate of \(p(x_1,x_2 \,|\, Y=1)\), where \(Y=1\) denotes ham, by dividing the first table by its sum.
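As a small illustration of "dividing the table by its sum", the sketch below normalizes the ham counts in plain Python. The variable names are made up for this example, and the course notes may organize the computation differently (for instance in R).

```python
# Counts from the ham table above: rows are Age (No, Yes), columns are Adult (No, Yes).
ham_counts = [[3598, 2],
              [5, 0]]

# Total number of ham documents in the training data.
total = sum(sum(row) for row in ham_counts)

# Joint estimate p(age, adult | ham): each cell divided by the table sum.
ham_joint = [[cell / total for cell in row] for row in ham_counts]

# Marginal p(age = yes | ham): add the "Age = Yes" row across both Adult columns.
p_age_yes_given_ham = sum(ham_joint[1])

print(total)
print(p_age_yes_given_ham)
```

The same two steps (normalize, then sum a row or column) give any marginal needed for the spam table as well.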

(a)

Using the tables, check that the simple frequency estimate of \(p(age=yes \,|\, ham)\) is 0.00138, as in the notes.

(b)

Use the tables and the Naive Bayes assumption to estimate \(p(ham \,|\, adult = no, age=yes)\).
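For reference, under the Naive Bayes assumption the two indicators are treated as independent given the class, so Bayes' rule takes the form below. This is the standard statement of the rule written with the \(x_1\), \(x_2\) defined earlier; the notes may arrange the terms differently.

\[
p(ham \,|\, x_1, x_2) = \frac{p(ham)\, p(x_1 \,|\, ham)\, p(x_2 \,|\, ham)}{p(ham)\, p(x_1 \,|\, ham)\, p(x_2 \,|\, ham) + p(spam)\, p(x_1 \,|\, spam)\, p(x_2 \,|\, spam)}
\]

Each factor on the right is an observed frequency: the class priors come from the two table sums, and the conditional marginals come from the row and column sums of each table.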

(c)

Use the tables to estimate \(p(ham \,|\, adult = no, age=yes)\) without assuming Age and Adult are independent given y=ham/spam.
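Without the independence assumption, the joint frequencies \(p(x_1,x_2 \,|\, y)\) are plugged into Bayes' rule directly. The sketch below shows one way to write that in plain Python; the function name and index convention are hypothetical, and it is demonstrated on the (age=no, adult=no) cell rather than on the cell asked about here.

```python
# Counts from the tables above: rows are Age (No, Yes), columns are Adult (No, Yes).
ham_counts = [[3598, 2], [5, 0]]
spam_counts = [[549, 3], [12, 0]]

def posterior_ham(age, adult):
    """Estimate p(ham | age, adult) from joint frequencies (index 0 = No, 1 = Yes).

    A cell that is zero in both tables gives 0/0 and raises ZeroDivisionError.
    """
    n_ham = sum(sum(row) for row in ham_counts)
    n_spam = sum(sum(row) for row in spam_counts)
    prior_ham = n_ham / (n_ham + n_spam)
    prior_spam = n_spam / (n_ham + n_spam)
    # Joint conditional frequencies p(age, adult | class), with no independence assumed.
    like_ham = ham_counts[age][adult] / n_ham
    like_spam = spam_counts[age][adult] / n_spam
    numerator = prior_ham * like_ham
    return numerator / (numerator + prior_spam * like_spam)

p_no_no = posterior_ham(age=0, adult=0)
print(p_no_no)
```

Because the priors and table sums cancel, the result equals the ham count in that cell divided by the total count (ham plus spam) in that cell.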

(d)

What happens if we try to estimate \(p(ham \,|\, adult=yes,age=yes)\) without the Naive Bayes assumption?

2. Fitting kNN to the Cars Data

Get the susedcars.csv data set from the webpage. Plot x=mileage versus y=price. (price is the price of a used car.)

Does the relationship between mileage and price make sense?

Add the fit from a linear regression to the plot. Add the fit from kNN for various values of k to the plot.

For what value of k does the plot look nice?

Using your “nice” value of k, what is the predicted price of a car with 100,000 miles on it?

What is the prediction from a linear fit?
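To make the two fits concrete, here is a minimal sketch of one-dimensional kNN regression (a plain average over the k nearest points) next to a least-squares line. It is written in plain Python rather than with the course's kknn package, the function names are made up for this example, and the (mileage, price) pairs are invented, not values from susedcars.csv.

```python
def knn_predict(x_train, y_train, x0, k):
    """Average y over the k training points whose x is closest to x0."""
    nearest = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x0))[:k]
    return sum(y_train[i] for i in nearest) / k

def linear_fit(x, y):
    """Least-squares intercept and slope for y = a + b * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Made-up (mileage, price) pairs for illustration only.
mileage = [10_000, 30_000, 60_000, 90_000, 110_000, 150_000]
price = [52_000, 44_000, 31_000, 22_000, 18_000, 9_000]

# Prediction at 100,000 miles from kNN with k = 2 and from the fitted line.
knn_price = knn_predict(mileage, price, 100_000, k=2)
intercept, slope = linear_fit(mileage, price)
linear_price = intercept + slope * 100_000

print(knn_price, linear_price)
```

Evaluating `knn_predict` on a grid of mileage values for several k gives the curves to overlay on the scatter plot: small k produces a jagged fit, large k a very flat one.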

3. Using Cross Validation

We are going to use the used cars data again.

Previously, we used the “eyeball” method to choose k for a kNN fit for mileage predicting price.

Use 5-fold cross-validation to choose k. How does your fit compare with the eyeball method?

Plot the data, then add the fits using the k you chose by cross-validation and the k you chose by eyeball.

Use kNN with the k you chose using cross-validation to get a prediction for a used car with 100,000 miles on it. Use all the observations as training data to get your prediction (given your choice of k).
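The cross-validation loop can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the course's code: the helper names `knn_predict` and `cv_rmse`, the candidate values of k, and the synthetic data are all invented for the example.

```python
import random

def knn_predict(x_train, y_train, x0, k):
    """Average y over the k training points whose x is closest to x0."""
    nearest = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x0))[:k]
    return sum(y_train[i] for i in nearest) / k

def cv_rmse(x, y, k, n_folds=5, seed=0):
    """Root mean squared out-of-fold prediction error for one value of k."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)          # same seed, so every k sees the same folds
    folds = [idx[f::n_folds] for f in range(n_folds)]
    sq_err = 0.0
    for held_out in folds:
        held = set(held_out)
        x_tr = [x[i] for i in idx if i not in held]
        y_tr = [y[i] for i in idx if i not in held]
        # Predict each held-out point using only the other four folds.
        for i in held_out:
            sq_err += (y[i] - knn_predict(x_tr, y_tr, x[i], k)) ** 2
    return (sq_err / len(x)) ** 0.5

# Synthetic data for illustration only (not the used cars data).
rng = random.Random(1)
x = [rng.uniform(0, 150) for _ in range(200)]
y = [60 - 0.3 * xi + rng.gauss(0, 4) for xi in x]

candidate_ks = [1, 5, 15, 40, 100]
scores = {k: cv_rmse(x, y, k) for k in candidate_ks}
best_k = min(scores, key=scores.get)
print(scores, best_k)
```

Once `best_k` is chosen, the final prediction uses all observations as training data, as the problem asks; the folds are only for comparing values of k.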

4. kNN, Cars Data with Mileage and Year

Use kNN to get a prediction for a 2008 car with 75,000 miles on it!

Remember:

Use cross-validation to choose k.
Scale your x’s!!
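Scaling matters because kNN measures distance across all features at once: mileage is in the tens of thousands while year differs by only a few units, so unscaled distances are driven almost entirely by mileage. The sketch below standardizes each column before computing Euclidean distances. It is plain Python with made-up helper names and invented cars, and standardizing to mean 0 and standard deviation 1 is one common choice, assumed here rather than taken from the notes.

```python
def standardize(columns):
    """Return z-scored columns plus the (mean, sd) used for each column."""
    scaled, params = [], []
    for col in columns:
        mean = sum(col) / len(col)
        sd = (sum((v - mean) ** 2 for v in col) / (len(col) - 1)) ** 0.5
        scaled.append([(v - mean) / sd for v in col])
        params.append((mean, sd))
    return scaled, params

def knn_predict_2d(rows, y, query, k):
    """Average y over the k rows closest to query in Euclidean distance."""
    dist = [sum((a - b) ** 2 for a, b in zip(row, query)) ** 0.5 for row in rows]
    nearest = sorted(range(len(rows)), key=lambda i: dist[i])[:k]
    return sum(y[i] for i in nearest) / k

# Made-up cars for illustration only.
mileage = [20_000, 40_000, 75_000, 80_000, 120_000, 140_000]
year = [2012, 2011, 2008, 2003, 2007, 2002]
price = [45_000, 38_000, 24_000, 12_000, 15_000, 6_000]

(scaled_mileage, scaled_year), params = standardize([mileage, year])
rows = list(zip(scaled_mileage, scaled_year))

# The query point must be scaled with the training means and sds.
query = [(75_000 - params[0][0]) / params[0][1],
         (2008 - params[1][0]) / params[1][1]]

scaled_pred = knn_predict_2d(rows, price, query, k=2)
raw_pred = knn_predict_2d(list(zip(mileage, year)), price, [75_000, 2008], k=2)
print(scaled_pred, raw_pred)
```

In this toy data the scaled and unscaled versions pick different second neighbors, because the unscaled distance effectively ignores year.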

Is your predictive accuracy better using (mileage,year) than it was with just mileage?

5. Choice of Kernel

In our class examples we used kernel="rectangular" when calling kknn.

Have a look at the help for kknn (?kknn).

The rectangular option simply averages the y values over the neighbors.

The other kernel options allow you to use a weighted average. The idea is that closer neighbors should get more weight.
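The difference between a plain and a weighted average can be sketched in a few lines. This is an illustration only, not kknn's implementation: the function name is hypothetical, only two kernels are shown, the data are invented, and scaling distances by the distance to the (k+1)-th neighbor is an assumption of this sketch.

```python
def weighted_knn_predict(x_train, y_train, x0, k, kernel="rectangular"):
    """Kernel-weighted average of y over the k nearest neighbors of x0 (needs k < n)."""
    order = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x0))
    neighbors = order[:k]
    # Assumed scaling: divide by the distance to the (k+1)-th neighbor,
    # so the k nearest neighbors get scaled distances in [0, 1].
    scale = abs(x_train[order[k]] - x0)
    weights = []
    for i in neighbors:
        u = abs(x_train[i] - x0) / scale if scale > 0 else 0.0
        if kernel == "rectangular":
            weights.append(1.0)        # every neighbor counts equally
        elif kernel == "triangular":
            weights.append(1.0 - u)    # weight falls linearly with distance
        else:
            raise ValueError("unsupported kernel in this sketch: " + kernel)
    total = sum(weights)
    if total == 0:                     # all neighbors tied with the (k+1)-th one
        weights, total = [1.0] * k, float(k)
    return sum(w * y_train[i] for w, i in zip(weights, neighbors)) / total

# Made-up data: mileage and price, both in thousands.
mileage_k = [10, 20, 30, 50, 90]
price_k = [50, 45, 41, 30, 15]

rect = weighted_knn_predict(mileage_k, price_k, 25, k=3, kernel="rectangular")
tri = weighted_knn_predict(mileage_k, price_k, 25, k=3, kernel="triangular")
print(rect, tri)
```

With the triangular weights, the farthest of the three neighbors contributes less to the prediction than it does under the rectangular kernel, which is the sense in which closer neighbors "get more weight".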

Using the used cars data and the predictors (features!!) (mileage, year), see if the “optimal” kernel option gives different (better?) results than the rectangular option.