Saturday, August 13, 2016

DATA ANALYSIS WITH R; problemset4.Rmd

---
title: "Problem Set 4"
runtime: shiny
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, results = 'markup')
```

```{r setup2}
setwd('/Users/olgabelitskaya/version-control/reflections-ud651')
```

## Libraries
```{r Libraries 1}
library(ggplot2)
library(lubridate)
```

```{r Libraries 2}
library(ggthemes)
```

```{r Libraries 3}
library(grid)
library(gridExtra)
```

```{r Libraries 4}
library(scales)
library(reshape2)
```

```{r Libraries 5}
library(plyr)
library(dplyr)
library(tidyr)
```

```{r Libraries 6}
library(xlsx)
```


## Useful links

```{r Links}
# http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf
# http://www.cookbook-r.com/Graphs/Shapes_and_line_types/
# http://www.ats.ucla.edu/stat/r/faq/smooths.htm
# http://docs.ggplot2.org/current/scale_brewer.html
# http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/
# https://www.r-bloggers.com/from-continuous-to-categorical/
```


## 4.1
#### In this problem set, you'll continue to explore the diamonds data set.
#### Your first task is to create a scatterplot of price vs x, using the ggplot syntax.

```{r Data set}
data(diamonds)

summary(diamonds)
```

```{r Price x}
ggplot(diamonds, aes(x = x, y = price)) + geom_point(position = position_jitter(h=0), shape=15, alpha=1/10, color = 'darkgreen') + coord_cartesian(xlim=c(3, 11)) + scale_y_continuous(breaks=seq(1000, 19000, 1000), labels=dollar) + theme_bw()
ggsave("energy01.jpg")
```


## 4.2
#### What are your observations about the scatterplot of price vs x?
#### The points start at about x = 3, and price rises roughly exponentially until about x = 9. The maximum price is about $19k, and prices are spread much more widely near that level.

## 4.3
#### What is the correlation between price and x (and y, and z)?

```{r Cor.test}
with(diamonds, cor.test(price, x))

with(diamonds, cor.test(price, y))

with(diamonds, cor.test(price, z))
```

Pearson's product-moment correlation

data:  price and x
t = 440.16, df = 53938, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8825835 0.8862594
sample estimates:
      cor
0.8844352

Pearson's product-moment correlation

data:  price and y
t = 401.14, df = 53938, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8632867 0.8675241
sample estimates:
      cor
0.8654209

Pearson's product-moment correlation

data:  price and z
t = 393.6, df = 53938, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8590541 0.8634131
sample estimates:
      cor
0.8612494
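
The three tests above differ only in the second variable, so the estimates can also be collected in one call. A minimal sketch, assuming a hypothetical helper `cor_with()` and a made-up toy data frame in place of `diamonds`:

```r
# Hypothetical helper: Pearson correlation of one target column with several others.
cor_with <- function(df, target, vars) {
  sapply(vars, function(v) cor(df[[target]], df[[v]]))
}

# Toy data frame standing in for diamonds (values are made up).
toy <- data.frame(
  price = c(300, 500, 900, 2000, 5000),
  x = c(4.0, 4.5, 5.1, 6.2, 7.4),
  y = c(4.1, 4.4, 5.2, 6.1, 7.5),
  z = c(2.4, 2.8, 3.1, 3.9, 4.6)
)

cors <- cor_with(toy, "price", c("x", "y", "z"))
print(round(cors, 3))

# With ggplot2 loaded, the same call would be:
# cor_with(diamonds, "price", c("x", "y", "z"))
```

This returns only the point estimates; `cor.test()` is still needed for the confidence intervals and p-values.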

## 4.4
#### Create a simple scatter plot of price vs depth.

```{r Price depth 1}
ggplot(diamonds, aes(x = depth, y = price)) + geom_point(position = position_jitter(h=0), shape=6, alpha=1/10, color = 'darkred') + theme_grey()
ggsave("energy02.jpg")
```

## 4.5
#### Change the code to make the transparency of the points to be 1/100 of what they are now and mark the x-axis every 2 units. See the instructor notes for two hints.

```{r Price depth 2}
ggplot(diamonds, aes(x = depth, y = price)) + geom_point(position = position_jitter(h=0), shape=2, alpha=1/100, color = 'darkblue') + scale_x_continuous(breaks = seq(min(diamonds$depth), max(diamonds$depth), 2), labels = seq(min(diamonds$depth), max(diamonds$depth), 2)) + theme_bw()

ggsave("energy03.jpg")
```

## 4.6
#### Based on the scatterplot of depth vs. price, most diamonds are between what values of depth?
#### Most diamonds have a depth between 58 and 64.
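
A range read off a plot can be checked numerically by computing the share of values that fall inside it. A minimal sketch, assuming a hypothetical helper `share_between()` and made-up depth values:

```r
# Hypothetical helper: fraction of values inside the closed interval [lo, hi].
share_between <- function(x, lo, hi) {
  mean(x >= lo & x <= hi, na.rm = TRUE)
}

# Made-up depth values for illustration only.
depth_toy <- c(55.2, 58.0, 60.1, 61.5, 61.8, 62.3, 62.9, 63.4, 64.0, 67.1)
share <- share_between(depth_toy, 58, 64)
print(share)  # 0.8 for this toy vector

# With ggplot2 loaded: share_between(diamonds$depth, 58, 64)
```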

## 4.7
#### What's the correlation of depth vs. price?
#### Based on the correlation coefficient, would you use depth to predict the price of a diamond? Why?

```{r Price depth 3}
with(diamonds, cor.test(depth, price))
```

Pearson's product-moment correlation

data:  depth and price
t = -2.473, df = 53938, p-value = 0.0134
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.019084756 -0.002208537
sample estimates:
       cor
-0.0106474

No, the correlation is too weak (about -0.01). Depth alone is not enough to predict price.
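
To put the size of that coefficient in perspective, it can be squared: for a simple linear fit, the squared Pearson correlation is the share of variance explained. A short worked calculation using the estimate reported by `cor.test()`:

```r
# Correlation estimate copied from the cor.test output for depth and price.
r <- -0.0106474

# Squared correlation: share of price variance that a straight-line fit
# on depth alone would explain.
r_squared <- r^2
print(signif(r_squared, 3))
print(sprintf("%.3f%% of variance", 100 * r_squared))
```

The result is on the order of one hundredth of a percent, which is why depth by itself is of little use as a predictor even though the test is statistically significant.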

## 4.8
#### Create a scatterplot of price vs carat and omit the top 1% of price and carat values.

```{r Price carat}
ggplot(diamonds, aes(x = carat, y = price)) + geom_point(position = position_jitter(h=0), shape=9, alpha=0.1, color = 'firebrick1') + scale_x_continuous(breaks=seq(0, 2.5, 0.1), limits=c(0, quantile(diamonds$carat, 0.99))) + scale_y_continuous(breaks=seq(0, 18000, 1000), limits=c(0 , quantile(diamonds$price, 0.99)), labels=dollar) + theme_bw()
ggsave("energy04.jpg")
```
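
Setting `limits` on the scales drops out-of-range points when the plot is drawn, and ggplot2 warns about the removed rows. An alternative is to filter the data before plotting. A minimal sketch, assuming a hypothetical helper `drop_top()` and simulated data; unlike the limits above, it uses a strict comparison, so values exactly at the quantile are also dropped:

```r
# Hypothetical helper: keep rows strictly below the p-th quantile
# of every named column.
drop_top <- function(df, cols, p = 0.99) {
  keep <- rep(TRUE, nrow(df))
  for (col in cols) {
    keep <- keep & df[[col]] < quantile(df[[col]], p)
  }
  df[keep, , drop = FALSE]
}

set.seed(1)
# Simulated values, not the diamonds data.
toy <- data.frame(carat = runif(1000, 0.2, 3),
                  price = runif(1000, 300, 19000))
trimmed <- drop_top(toy, c("carat", "price"))
print(nrow(toy) - nrow(trimmed))  # number of rows removed

# With ggplot2 loaded: drop_top(diamonds, c("carat", "price"))
```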


## 4.9
#### Create a scatterplot of price vs. volume (x * y * z). This is a very rough approximation for a diamond's volume.
#### Create a new variable for volume in the diamonds data frame. This will be useful in a later exercise.
#### Don't make any adjustments to the plot just yet.

```{r Price volume 1}
diamonds_v <- diamonds %>%
  mutate(volume=x*y*z)
```

```{r Price volume 2}
ggplot(diamonds_v, aes(x = volume, y = price)) + geom_point() + theme_economist_white()
ggsave("energy05.jpg")
```

## 4.10
#### What are your observations from the price vs volume scatterplot?

#### Price appears to rise exponentially with volume, so transforming the x-scale is one possible way to improve this visualization. There are many diamonds with volumes near 0, and three outliers are visible.
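
One way to judge whether a transformation helps is to compare the Pearson correlation before and after applying it. A minimal sketch on simulated data (not the diamonds set), assuming an exponential relationship and a log transform of the response:

```r
set.seed(42)
# Simulated data: the response grows exponentially with x.
x <- runif(500, 0, 5)
y <- exp(1 + 0.8 * x + rnorm(500, sd = 0.3))

cor_raw <- cor(x, y)       # correlation on the original scale
cor_log <- cor(x, log(y))  # correlation after log-transforming the response
print(c(raw = cor_raw, log = cor_log))
```

If the relationship really is exponential, the transformed correlation should be clearly higher; if it is not, the transformation will not help and another one may be worth trying.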

## 4.11
#### What's the correlation of price and volume? Exclude diamonds that have a volume of 0 or that are greater than or equal to 800.

```{r Detach plyr}
detach("package:plyr", unload=TRUE)
```

```{r Price volume 3}
with(subset(diamonds_v, !(volume == 0 | volume >= 800) ), cor.test(price, volume))
```

Pearson's product-moment correlation

data:  price and volume
t = 559.19, df = 53915, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9222944 0.9247772
sample estimates:
      cor
0.9235455

## 4.12
#### Subset the data to exclude diamonds with a volume greater than or equal to 800. Also, exclude diamonds with a volume of 0. Adjust the transparency of the points and add a linear model to the plot. (See the Instructor Notes or look up the documentation of geom_smooth() for more details about smoothers.)

#### We encourage you to think about this next question and to post your thoughts in the discussion section.

#### Do you think this would be a useful model to estimate the price of diamonds? Why or why not?

```{r Price volume 4}
sub_diamonds_v <- diamonds_v %>%
  filter(volume != 0, volume < 800)
```

```{r Price volume 5}
ggplot(sub_diamonds_v, aes( x = volume, y = price)) + geom_point(position = position_jitter(h=0), shape=2, alpha=0.1, color = 'darkgreen') + geom_smooth(method = "lm", se = TRUE) + theme_bw()
ggsave("energy06.jpg")
```


```{r Price volume 6}
ggplot(sub_diamonds_v, aes( x = volume, y = price)) + geom_point(position = position_jitter(h=0), shape=1, alpha=0.1, color = 'darkviolet') + geom_smooth(method = "gam", se = TRUE) + scale_y_continuous(breaks=seq(0, 18000, 1000), limits=c(0 ,quantile(diamonds$price, 0.99))) + theme_bw()
ggsave("energy07.jpg")
```
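
One way to approach the question of whether the linear model is useful is to look at R² and at how the residual spread changes with volume. A minimal sketch on simulated data (the numbers are made up and do not reproduce the diamonds results); with the real data, `price ~ volume` would be fit on `sub_diamonds_v`:

```r
set.seed(7)
# Simulated stand-in for (volume, price); noise grows with volume.
volume <- runif(300, 30, 400)
price <- 40 * volume + rnorm(300, sd = 600 + 5 * volume)

fit <- lm(price ~ volume)
r2 <- summary(fit)$r.squared
print(round(r2, 3))

# Compare residual spread in the lower and upper half of volume:
# a large difference suggests the error variance is not constant.
res <- resid(fit)
upper <- volume > median(volume)
print(c(sd_low = sd(res[!upper]), sd_high = sd(res[upper])))
```

A high R² alone does not make the model reliable for large stones if the residual spread widens with volume, which is the pattern the scatterplots of price suggest.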


## 4.13
#### Use functions from the dplyr package to create a new data frame containing info on diamonds by clarity.

#### Name the data frame diamondsByClarity. The data frame should contain the following variables in this order.
####      (1) mean_price
####      (2) median_price
####      (3) min_price
####      (4) max_price
####      (5) n
#### where n is the number of diamonds in each level of clarity.

```{r Price clarity}
diamondsByClarity <- diamonds %>%
  group_by(clarity) %>%
  summarise(mean_price = mean(price),
            median_price = median(price),
            min_price = min(price),
            max_price = max(price),
            n = n() ) %>%
  arrange(clarity)
```

## 4.14
#### We’ve created summary data frames with the mean price by clarity and color. You can run the code in R to verify what data is in the variables diamonds_mp_by_clarity and diamonds_mp_by_color.
#### Your task is to write additional code to create two bar plots on one output image using the grid.arrange() function from the package gridExtra.

```{r Price group 1}
diamonds_by_clarity <- group_by(diamonds, clarity)
diamonds_mp_by_clarity <- summarise(diamonds_by_clarity, mean_price = mean(price))

diamonds_by_color <- group_by(diamonds, color)
diamonds_mp_by_color <- summarise(diamonds_by_color, mean_price = mean(price))
```

```{r Price group 2}
p1 <- ggplot(diamonds_mp_by_clarity, aes(x=clarity, y=mean_price, fill= clarity)) + geom_bar(stat = "identity", color = "darkblue") + scale_fill_hue(l=50, c=200) + guides(fill = guide_legend(ncol=1, title.hjust=0.2)) + theme_foundation()
ggsave("energy14.jpg", p1)

p2 <- ggplot(diamonds_mp_by_color, aes(x=color, y=mean_price, fill=color)) + geom_bar(stat = "identity", color = "darkviolet") + scale_fill_brewer(palette="Spectral") + guides(fill = guide_legend(ncol=1, title.hjust=0.2)) + theme_foundation()
ggsave("energy08.jpg", p2)

# Draw both plots once and keep the arranged grob for saving.
g1 <- grid.arrange(p1, p2, ncol=1)

ggsave(filename = "energy15.jpg", plot = g1)

```

## 4.15
#### What do you notice in each of the bar charts for mean price by clarity and mean price by color?

#### In general, mean price decreases as clarity changes from I1 to IF and increases as color changes from D to J, so the better grades have the lower mean prices.

## 4.16
#### The Gapminder website contains over 500 data sets with information about the world's population. Your task is to continue the investigation you did at the end of Problem Set 3 or you can start fresh and choose a different data set from Gapminder.

#### If you are feeling adventurous or want to try some data munging see if you can find a data set or scrape one from the web.

#### In your investigation, examine pairs of variable and create 2-5 plots that make use of the techniques from Lesson 4.

#### Once you've completed your investigation, create a post in the discussions that includes:
####       1. the variable(s) you investigated, your observations, and any summary statistics
####       2. snippets of code that created the plots
####       3. links to the images of your plots

#### Copy and paste all of the code that you used for your investigation, and submit it when you are ready.

```{r Gapminder data 1}
energy <- tbl_df(read.xlsx("energy_use_per_person.xlsx", sheetName="Data", header=TRUE))
```

```{r Gapminder data 2}
row.with.na <- apply(energy, 1, function(x){any(is.na(x))})
sum(row.with.na)
```

```{r Gapminder data 3}
filtered <- energy[!row.with.na,]
names(filtered)[1] <- "country"
```

```{r Gapminder data 4}
energy_db <- melt(filtered, id=c("country"), value.name="energy", variable.name="year")
energy_db <- tbl_df(energy_db)
```

```{r Gapminder data 5}
energy_db <- energy_db %>%
  mutate(year = as.character(year), year = substr(year, 2, 5), year = as.numeric(year))
head(energy_db)
```
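
The `substr(year, 2, 5)` step keeps characters 2 through 5 of each former column name, presumably because the imported year columns carry a one-character prefix such as `X` (an assumption, since the spreadsheet itself is not shown). A minimal sketch on made-up names:

```r
# Column names as they might look after import (assumed "X" prefix).
year_names <- c("X1960", "X1985", "X2011")

# Drop the first character and convert the remaining four digits to numbers.
years <- as.numeric(substr(year_names, 2, 5))
print(years)  # 1960, 1985, 2011
```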

```{r Gapminder data 6}
energy_db$year_d <- cut(energy_db$year,  seq(1959,2012,1), right=FALSE, labels=c(1959:2011))
```
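
With `right = FALSE` and one-year breaks, each interval is `[year, year + 1)`, so every whole year is labelled with itself. A minimal sketch on a few made-up years, which also compares the result with a plain `factor()` call:

```r
years <- c(1960, 1975, 2011)

# Same breaks and labels as in the chunk above.
binned <- cut(years, seq(1959, 2012, 1), right = FALSE, labels = 1959:2011)
print(as.character(binned))

# For whole years, a factor with the same levels gives the same labels.
as_factor <- factor(years, levels = 1959:2011)
print(identical(as.character(binned), as.character(as_factor)))
```

The `cut()` form becomes more useful once wider bins are wanted, for example five-year intervals.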

```{r Gapminder data 7}
energy1 <- energy_db[which(energy_db$country == 'Japan'),]
energy2 <- energy_db[which(energy_db$country == 'United States'),]
energy3 <- energy_db[which(energy_db$country == 'Canada'),]
country_set <- c("Germany", "United Kingdom", "France", "Austria", "Belgium", "Italy", "Denmark", "Netherlands", "Norway", "Finland", "Greece", "Poland", "Portugal")
energy4 <- energy_db[which(energy_db$country %in%  country_set),]
```

```{r Gapminder data 8}
ggplot(energy4, aes(x = year, y = energy, color=country)) + geom_point(position = position_jitter(h=0), shape=9, alpha=0.8) + geom_line() + labs(title="Energy use per person", x="Year", y="Energy")

ggsave("energy09.jpg")
```

```{r Gapminder data 9}
grid_arrange_shared_legend <- function(...) {
    plots <- list(...)
    g <- ggplotGrob(plots[[1]] + theme(legend.position="bottom"))$grobs
    legend <- g[[which(sapply(g, function(x) x$name) == "guide-box")]]
    lheight <- sum(legend$height)
    grid.arrange(
        do.call(arrangeGrob, lapply(plots, function(x)
            x + theme(legend.position="none"))),
        legend,
        ncol = 1,
        heights = unit.c(unit(1, "npc") - lheight, lheight))
}
```

```{r Gapminder data 10}
p3 <- ggplot(energy1, aes(x=year_d, y=energy, fill= year_d)) + geom_bar(stat = "identity", color = "darkblue") + scale_fill_hue(l=50, c=200) + labs(title="Energy per person in Japan", x="Year", y="Energy") + scale_x_discrete(breaks=seq(1960,2011,5), labels=seq(1960,2011,5))
ggsave("energy10.jpg")
```

```{r Gapminder data 11}
p4 <- ggplot(energy2, aes(x=year_d, y=energy, fill= year_d)) + geom_bar(stat = "identity", color = "darkblue") + scale_fill_hue(l=50, c=200) + labs(title="Energy per person in United States", x="Year", y="Energy") + scale_x_discrete(breaks=seq(1960,2011,5), labels=seq(1960,2011,5))
ggsave("energy11.jpg")
```

```{r Gapminder data 12}
p5 <- ggplot(energy3, aes(x=year_d, y=energy, fill= year_d)) + geom_bar(stat = "identity", color = "darkblue") + scale_fill_hue(l=50, c=200) + labs(title="Energy per person in Canada", x="Year", y="Energy") + scale_x_discrete(breaks=seq(1960,2011,5), labels=seq(1960,2011,5))
ggsave("energy12.jpg")
```

```{r Gapminder data 13}
grid_arrange_shared_legend(p3, p4, p5)

g2 <- grid_arrange_shared_legend(p3, p4, p5)
ggsave(filename = "energy13.jpg", plot = g2, width = 8, height = 12)
```
