понедельник, 15 августа 2016 г.

DATA ANALYSIS WITH R; lesson6.Rmd

---
title: "Lesson 6"
runtime: shiny
output: html_document
---

```{r setup 1, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## ========================================================

```{r setup 2}
setwd('/Users/olgabelitskaya/version-control/reflections-ud651')
```

## Useful links

```{r Links}
# http://www.theanalysisfactor.com/interpreting-regression-coefficients/
# http://docs.ggplot2.org/current/coord_trans.html
# https://www.r-bloggers.com/interpreting-regression-coefficient-in-r/?utm_source=feedburner&utm_medium=email&utm_campaign=Feed%3A+RBloggers+%28R+bloggers%29
# http://personality-project.org/r/html/corr.test.html
# https://rpubs.com/hadley/ggplot2-layers
# http://rmarkdown.rstudio.com/articles_integration.html
# https://http://blog.yhat.com/posts/r-lm-summary.html
# http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/
# http://data.princeton.edu/R/linearModels.html
# https://www.datacamp.com/community/tutorials/importing-data-r-part-two#gs.K2RGNF0
# https://cran.r-project.org/doc/manuals/r-devel/R-data.html
```

```{r Libraries 1}
library(devtools)
library(markdown)
library(ggplot2)
library(xtable)
library(ggthemes)
library(xlsx)
library(RColorBrewer)
library(bitops)
library(RCurl)
```

```{r Libraries 2}
library(grid)
library(gridExtra)
library(scales)
library(reshape)
library(plyr)
library(dplyr)
library(tidyr)
library(GGally)
library(scales)
library(memisc)
```

### Scatterplot Review

```{r Scatterplot Review 1}
data(diamonds)
summary(diamonds)

centrobank <- tbl_df(read.xlsx("c_bank.xlsx", sheetName="Data", header=TRUE))
head(centrobank)
```

```{r Scatterplot Review 2}
ggplot(data = diamonds, aes(x = carat, y = price, color = price), xlim = c(0, quantile(diamonds$carat,0.99)), ylim = c(0, quantile(diamonds$price, 0.99))) + geom_point(shape=12, alpha=0.7, size = 2) + scale_colour_gradientn(colours=rainbow(15)) + theme_bw()

ggsave('0601.jpg', width = 16, height = 8)
```

```{r Scatterplot Review 3}
ggplot(data = centrobank, aes(x = date)) + geom_point(aes(y=centrobank$gold), color = 'darkred', shape=10, alpha=0.7, size = 3) + geom_smooth(aes(y=centrobank$gold), color = 'red') + geom_point(aes(y=centrobank$platinum), color = 'darkblue', shape=12, alpha=0.7, size = 3) + geom_smooth(aes(y=centrobank$platinum), color = 'blue') + theme_bw() + ylab('platinum and gold') + ggtitle('Centrobank 2016')

ggsave('0602.jpg', width = 16, height = 8)
```
***

### ggpairs Function

#### install.packages('memisc')

```{r ggpairs Function}
# sample 10,000 diamonds from the data set
set.seed(20022012)
diamond_samp <- diamonds[sample(1:length(diamonds$price), 10000), ]
ggpairs(diamond_samp, lower = list(continuous = wrap("points", shape = I('.'))), upper = list(combo = wrap("box", outlier.shape = I('.'))))

ggsave('0603.jpg', width = 24, height = 16)
```

### The Demand of Diamonds

```{r The Demand of Diamonds}
plot1 <- qplot(x = price, data = diamonds, binwidth = 200, fill = ..count..) + ggtitle('Price') + scale_fill_continuous(low="darkblue", high="red")

plot2 <- qplot(x=price, data = diamonds, binwidth = 0.02, fill = ..count..) + scale_x_log10() + ggtitle('Price (log10)') + scale_fill_continuous(low="darkblue", high="red")

g1 <- grid.arrange(plot1, plot2, ncol = 2)
ggsave("0604.jpg", g1, width = 12, height = 8)
```

***

### Connecting Demand and Price Distributions
Notes:

***

### Scatterplot Transformation

```{r Scatterplot Transformation}
qplot (carat, price, data = diamonds, color = price) + scale_y_continuous(trans = log10_trans()) + ggtitle('Price (log10) by Carat') + scale_colour_gradientn(colours=rainbow(4))

ggsave("0605.jpg",width = 12, height = 8)
```

### Create a new function to transform the carat variable

```{r cuberoot transformation}
cuberoot_trans = function() trans_new('cuberoot', transform = function(x) x^(1/3), inverse = function(x) x^3)
```

#### Use the cuberoot_trans function
```{r Use cuberoot_trans}
ggplot(aes(carat, price), data = diamonds) +
  geom_point(alpha = 0.5, size = 0.75, position = 'jitter', shape = 14, color = 'darkgreen') +
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) +
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat')
```

***

### Overplotting Revisited

```{r Sort and Head Tables}
head(sort(table(diamonds$price), desreasing = T))
```

```{r Overplotting Revisited}
ggplot(aes(carat, price, color=price), data = diamonds) +
  geom_point(alpha = 0.5, size = 0.5, position = 'jitter', shape = 7) +
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) +
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  scale_colour_continuous(low="firebrick1", high="black") +
  ggtitle('Price (log10) by Cube-Root of Carat')
```

***

### Price vs. Carat and Clarity

Alter the code below.
```{r Price vs. Carat and Clarity 1}
display.brewer.all()
```

```{r Price vs. Carat and Clarity 2}
ggplot(aes(x = carat, y = price, color=clarity), data = diamonds) +
  geom_point(alpha = 0.5, size = 1, position = 'jitter') +
  scale_color_brewer(type = 'div',
    guide = guide_legend(title = 'Clarity', reverse = T,
    override.aes = list(alpha = 1, size = 2))) +
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
    breaks = c(0.2, 0.5, 1, 2, 3)) +
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
    breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat and Clarity')
```

***

### Price vs. Carat and Cut

Alter the code below.
```{r Price vs. Carat and Cut}
ggplot(aes(x = carat, y = price, color = cut), data = diamonds) +
  geom_point(alpha = 0.5, size = 1, position = 'jitter') +
  scale_color_brewer(type = 'div', guide = guide_legend(title = 'Cut', reverse = T, override.aes = list(alpha = 1, size = 2))) +
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3), breaks = c(0.2, 0.5, 1, 2, 3)) +
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000), breaks = c(350, 1000, 5000, 10000, 15000)) +
  scale_color_brewer(palette= 'Spectral', type = 'div', guide = guide_legend(title = 'Cut', reverse = T, override.aes = list(alpha = 1, size = 2))) +
  ggtitle('Price (log10) by Cube-Root of Carat and Cut') + theme_bw()
```

***

### Price vs. Carat and Color

Alter the code below.
```{r Price vs. Carat and Color}
ggplot(aes(x = carat, y = price, color = color), data = diamonds) +
  geom_point(alpha = 0.5, size = 0.5, position = 'jitter') +
  scale_color_brewer(palette = 'Set1', type = 'div', guide = guide_legend(title = 'Color', reverse = F, override.aes = list(alpha = 1, size = 2))) +
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3), breaks = c(0.2, 0.5, 1, 2, 3)) +
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000), breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat and Color') + theme_bw()
```

***

### Building the Linear Model

```{r Building the Linear Model}
m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamonds)
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
mtable(m1, m2, m3, m4, m5)
```

Notice how adding cut to our model does not help explain much of the variance
in the price of diamonds. This fits with out exploration earlier.

### A Bigger, Better Data Set
Notes:

```{r A Bigger, Better Data Set}
load('BigDiamonds.Rda')
```

The code used to obtain the data is available here:
https://github.com/solomonm/diamonds-data

## Building a Model Using the Big Diamonds Data Set

```{r Building a Model Using the Big Diamonds Data Set 1}
diamondsbig$logprice = log(diamondsbig$price)
m1 <- lm(logprice ~ I(carat^(1/3)), data = diamondsbig[diamondsbig$price < 10000 & diamondsbig$cert == "GIA",])
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
mtable(m1, m2, m3, m4, m5)
```

```{r Building a Model Using the Big Diamonds Data Set 2}
diamondsBigSample <- diamond_samp
diamondsBigSample$logprice = log(diamondsBigSample$price)
m1 <- lm(logprice ~ I(carat^(1/3)), data = diamondsBigSample)
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)

models <- mtable(m1, m2, m3, m4, m5)
models
```
***

## Predictions

Example Diamond from BlueNile:
Round 1.00 Very Good I VS1 $5,601

```{r Predicition 1}
#Be sure you’ve loaded the library memisc and have m5 saved as an object in your workspace.
thisDiamond = data.frame(carat = 1.00, cut = "Very Good",
                         color = "I", clarity="VS1")
modelEstimate = predict(m5, newdata = thisDiamond,
                        interval="prediction", level = .95)

exp(modelEstimate)
```

```{r Prediction 2}
dat = data.frame(m4$model, m4$residuals)

with(dat, sd(m4.residuals))

with(subset(dat, carat > .9 & carat < 1.1), sd(m4.residuals))

dat$resid <- as.numeric(dat$m4.residuals)
ggplot(aes(y = resid, x = round(carat, 2)), data = dat) + geom_line(stat = "summary", fun.y = sd,color = 'darkgreen') + theme_bw()
```

Evaluate how well the model predicts the BlueNile diamond's price. Think about the fitted point estimate as well as the 95% CI.

Комментариев нет:

Отправить комментарий