---
title: "Simple_h2o"
author: "Rob McCulloch"
date: "April 15, 2019"
output:
  pdf_document:
    toc: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Start Up h2o

To use h2o, think of all the work as being done on a remote server that R communicates with.
So besides loading the library, we also have to initialize the server.  

**First we load the library as usual:**

```{r}
library(h2o)
```
\  

**Then we initialize the server**:


```{r}
h2oServer = h2o.init()
```

\  

Notice that on Rob's machine 8 cores are detected and used.  
h2o will use all the cores!!  
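
If you would rather leave some cores free, h2o.init takes *nthreads* and *max_mem_size* arguments. The chunk below is only a sketch and is not run here: the values 2 and "2G" are arbitrary choices for illustration, and as far as I know these settings only matter when h2o.init starts a new server, not when it connects to one that is already running.

```{r, eval=FALSE}
## not run: start h2o with at most 2 threads and 2 GB of memory
## (the 2 and the "2G" are arbitrary values picked for this sketch)
h2oServer = h2o.init(nthreads = 2, max_mem_size = "2G")
```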

# Set up the data

Now let's make some toy data with a binary y.  
I'll use the Boston housing data again, but let y be TRUE if the price (medv) is above the median and FALSE otherwise.  

\  

```{r}
library(MASS)  ## provides the Boston data frame
y = Boston$medv
    
## let's make y binary
y = as.factor(y>median(y))

## for x we will use dis and lstat, each rescaled to [0,1]
x  = cbind(Boston$dis,Boston$lstat)
p = ncol(x)
for(i in 1:p) {
   rgx = range(x[,i])  # range of column i, not of the whole matrix
   x[,i] = (x[,i]-rgx[1])/(rgx[2]-rgx[1])
}
colnames(x) = c("dis","lstat")

## data as a data frame
dfd = data.frame(y,x)

## train and test
set.seed(99)
n=nrow(dfd)
ii = sample(1:n,floor(.75*n))
dftr = dfd[ii,]; ytr = y[ii] #train
dfte = dfd[-ii,] ; yte = y[-ii] #test
```
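
As a check on the rescaling idea, here is a minimal sketch on made-up numbers, assuming the goal is to map each column onto [0,1] using that column's own range. The names *xtoy*, *xtoy01* and *minmax* are just for this illustration and are not used anywhere else.

```{r}
## toy matrix: two columns on very different scales (made-up numbers)
xtoy = cbind(c(1, 3, 5), c(10, 40, 20))
colnames(xtoy) = c("a", "b")

## rescale one vector to [0,1] using its own min and max
minmax = function(v) {
   rgv = range(v)
   (v - rgv[1]) / (rgv[2] - rgv[1])
}

## apply the rescaling column by column
xtoy01 = apply(xtoy, 2, minmax)
xtoy01

## every column should now have min 0 and max 1
apply(xtoy01, 2, range)
```

Neural nets tend to be easier to fit when the inputs are on a common scale, which is the reason for doing this before handing the data to h2o.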
\  

# Run a logit for comparison

Let's first run a simple logit and see what we get.  

\  

```{r}
glmf = glm(y~.,data=dftr,family=binomial)
yhltr = predict(glmf,type="response")
yhlte = predict(glmf,dfte,type="response")

##in-sample confusion matrix
table(dftr$y,yhltr>.5)
##out-of-sample confusion matrix
ocfl = table(dfte$y,yhlte>.5)
ocfl
cat("fraction wrong out of sample is",1-sum(diag(ocfl))/sum(ocfl),"\n")
```
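
The error rate is one minus the sum of the diagonal of the confusion matrix divided by the total count. Here is a small sketch of that arithmetic on made-up data. The helper *misclass_rate* and the toy vectors are hypothetical names introduced only for this example.

```{r}
## hypothetical helper: fraction misclassified from a square confusion table
misclass_rate = function(cf) 1 - sum(diag(cf)) / sum(cf)

## made-up outcomes and fitted probabilities
ytoy = factor(c(FALSE, FALSE, TRUE, TRUE, TRUE, FALSE))
ptoy = c(.1, .6, .8, .4, .9, .2)

## fixing the levels keeps the table 2x2 even if all predictions agree
cftoy = table(ytoy, factor(ptoy > .5, levels = c(FALSE, TRUE)))
cftoy
misclass_rate(cftoy)
```

Two of the six toy observations end up on the wrong side of .5, so the rate is 1/3.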

# Put the data in h2o form

We can look and see what is currently on the server:

\  

```{r}
h2o.ls()
```
\  

Right now there is nothing.

\  

Let's put our data on the server (or cluster) so we can fit our models using h2o.

\  


```{r,collapse=TRUE}
dftrain = as.h2o(dftr, destination_frame = "bost.train")
dftest = as.h2o(dfte, destination_frame = "bost.test")

# see that bost.test and bost.train are now on server
h2o.ls()
```
\  

Now bost.train and bost.test show up on the server.  

dftrain is the R object we use to access bost.train.  

dftest is the R object we use to access bost.test. 

\  

As usual we can print dftrain just by typing its name.  

\  

```{r}
dftrain #h2o prints out first 6 rows
```

\  

R has several class systems; the two we need here are S3 and S4.  

S3 is a very simple setup.  
dftrain and dftest are S3 objects.

\  

```{r}
cat("is dftrain an S4 class?:\n")
isS4(dftrain)
cat("it is an S3 class with class name:\n")
print(class(dftrain))
```
\  

We see that dftrain is not S4 but it is S3 with class name H2OFrame.

\  

A simple way to see what information is stored in an S3 object is to look at its *attributes*.  

\  
 
```{r}
temp = attributes(dftrain)
is.list(temp)
names(temp)

cat("the h2o id of dftrain is: ",attr(dftrain,"id"),"\n")
cat("the class name  of dftrain is: ",attr(dftrain,"class"),"\n")
cat("the number of rows of dftrain is: ",attr(dftrain,"nrow"),"\n")
```
\  

If I just print out temp or do str(dftrain) I will get a ton of information.
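
To make the S3 versus S4 distinction concrete, here is a toy sketch in plain R that does not touch h2o. The class names *toyFrame* and *toyModel* are made up, and the S3 object here is just a list, which is an assumption for illustration and not a claim about how H2OFrame is built internally.

```{r}
## a toy S3 object: an ordinary list with a class attribute attached
s3obj = structure(list(id = "toy.frame", nrow = 3), class = "toyFrame")
isS4(s3obj)               # FALSE
class(s3obj)
names(attributes(s3obj))  # the class is just one of the attributes
s3obj$id                  # list-based S3 objects use $

## a toy S4 class: the slots are declared formally up front
setClass("toyModel", representation(model_id = "character"))
s4obj = new("toyModel", model_id = "toy.model")
isS4(s4obj)               # TRUE
s4obj@model_id            # S4 slots are pulled off with @
```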
 
 
# Fit a Deep Neural Net

Ok, let's try a deep neural net.  

Remember, "deep" just means we have more than one hidden layer.  

I'll do *hidden=c(20,10)* which means two hidden layers where the first layer has 20 units
and the second layer has 10 units.
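
To get a feel for how big this model is, we can count its parameters. This is a back-of-the-envelope sketch, assuming a fully connected net with 2 inputs (dis and lstat) and 2 output units, one per class. The 2 output units are my assumption about how h2o sets up a binary classification net.

```{r}
## layer sizes: 2 inputs, hidden layers of 20 and 10 units,
## and (assumed) 2 output units, one per class
sizes = c(2, 20, 10, 2)
nl = length(sizes)

## each layer has (units in) x (units out) weights plus one bias per unit out
nweights = sizes[-nl] * sizes[-1]
nbiases = sizes[-1]
nparam = sum(nweights + nbiases)

cat("parameters per layer: ", nweights + nbiases, "\n")
cat("total number of parameters: ", nparam, "\n")
```

Under these assumptions there are 292 parameters, compared to 3 for a logit with two predictors.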

\  


```{r dodnn}
######################################################################
#  2 hidden layers: 20 units, then 10 units

nnf = h2o.deeplearning(x=2:3, y=1,
                         training_frame = dftrain,
                         hidden = c(20,10),
                         activation = "Tanh",
                         epochs = 200,
                         model_id = "boston.nn_20-10"
                         )

#nnf is an S4 object:
cat("is model object nnf S4?:\n",isS4(nnf),"\n")

#It is S4, to pull off a slot use @
cat("h2o model_id of nnf is",nnf@model_id,"\n")
#check this using h2o.ls
print(h2o.ls())

## to see the whole thing which is a lot:
#print(str(nnf))

```
\  

We can get predictions with h2o.predict.  Let's first get the out-of-sample predictions using dftest.

\  

```{r}
phato = h2o.predict(nnf,dftest)
dim(phato)
names(phato)
```

\  

The first column is the predicted class, the second column is $p(y=\text{FALSE} \mid x)$ and the third column is $p(y=\text{TRUE} \mid x)$.  

Let's just pull off the third column and convert it to an R data structure.

```{r}
yhnte = as.matrix(phato[,3])[,1]
```
\  

The as.matrix converts it to an R matrix, and the [,1] converts the one-column matrix to a numeric vector.
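
The [,1] step is plain R and can be seen without h2o. In this small sketch *mtoy* is a made-up one-column matrix standing in for the result of as.matrix.

```{r}
## a made-up one-column matrix
mtoy = matrix(c(.2, .7, .9), ncol = 1)
dim(mtoy)          # 3 rows, 1 column

## picking the single column drops the dim attribute
vtoy = mtoy[,1]
is.matrix(vtoy)
is.numeric(vtoy)
length(vtoy)
```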

Ok now we can compare neural nets to logit!!   

The yhnte is comparable to the yhlte we got from the logit fit.

\  

```{r}
plot(yhlte,yhnte)
abline(0,1,col="red",lwd=2,lty=3)
```
\  

Let's look at the confusion matrices.

\  
```{r}
ocfn = table(dfte$y,yhnte>.5)
ocfn
cat("neural net, fraction wrong out of sample is",1-sum(diag(ocfn))/sum(ocfn),"\n")
cat("logit, fraction wrong out of sample is",1-sum(diag(ocfl))/sum(ocfl),"\n")
```

\  

Let's look at the lift curves.  

\  

```{r}
source("http://www.rob-mcculloch.org/2019_ml/webpage/notes/lift-loss.R")
yhatL = list(yhlte,yhnte)
lift.many.plot(yhatL,dfte$y)
legend("topleft",legend=c("logit","deep neural net"),lwd=3,col=1:2,bty="n",cex=.8)
```
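
The lift-loss.R code is sourced from the web, so here is a rough sketch of what a cumulative lift curve computes. I am assuming the usual definition: sort the observations by fitted probability, largest first, and track the fraction of successes captured as we move down the list. The function *lift_sketch* is a hypothetical stand-in and may differ in details from lift.many.plot.

```{r}
## hypothetical helper: cumulative lift curve for one vector of probabilities
## phat = fitted p(y = TRUE | x), ytrue = logical vector of outcomes
lift_sketch = function(phat, ytrue) {
   oo = order(phat, decreasing = TRUE)       # most likely successes first
   fracobs = seq_along(phat) / length(phat)  # fraction of observations targeted
   fracsuc = cumsum(ytrue[oo]) / sum(ytrue)  # fraction of successes captured
   cbind(fracobs, fracsuc)
}

## made-up data
ptoy2 = c(.9, .2, .7, .4, .1, .8)
ytoy2 = c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE)
lc = lift_sketch(ptoy2, ytoy2)
plot(lc, type = "b", xlab = "fraction of observations", ylab = "fraction of successes")
abline(0, 1, lty = 3)   # what a random ordering would give on average
```

With the objects in these notes the call would look something like `lift_sketch(yhnte, dfte$y == "TRUE")`, since dfte$y is a factor and the sketch wants a logical vector.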

Not too different.  

This is just a toy example, but a deep neural net does ok!!  

*Amazing* given the complexity of the model.

\  

Note that h2o computes a large number of performance statistics for you.  
Called without new data, as below, these are computed on the training data.  

\  

```{r}
print(h2o.confusionMatrix(nnf,thresholds=.5))
print(h2o.performance(nnf))
```

\  

If I compute the in-sample confusion matrix directly from the in-sample fitted probabilities
(as I did for the logit), I get similar results.

\  


```{r}
yhntr = as.matrix(h2o.predict(nnf, dftrain)[,3])[,1] #in sample neural net phat
icfn = table(dftr$y,yhntr>=.5)
icfn  #in sample
```

\  

If you like the neural net fit you can save it.  

\  

```{r}
h2o.saveModel(nnf, path=getwd(),force=TRUE)
```

\  

And then you can read it back in:

\  

```{r}
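fp = file.path(getwd(), "boston.nn_20-10")
if(file.exists(fp)) {
  nnfl = h2o.loadModel(fp)
  yhntr2 = as.matrix(h2o.predict(nnfl, dftrain)[,3])[,1] #in sample phat from the reloaded model
  ## the reloaded model should reproduce the original fitted probabilities
  plot(yhntr2,yhntr)
  abline(0,1)
} else {
  cat("no saved model found at",fp,"\n")
}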
```
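
An alternative to rebuilding the file path by hand is to keep the value returned by h2o.saveModel, which is the full path of the file it wrote. The chunk below is a sketch and is not run here. The names *mpath* and *nnfl2* are made up for this example.

```{r, eval=FALSE}
## not run: keep the path returned by h2o.saveModel and load from it
mpath = h2o.saveModel(nnf, path = getwd(), force = TRUE)
nnfl2 = h2o.loadModel(mpath)   # nnfl2 is just a name for this sketch
```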

