Online courses: DATA ANALYSIS WITH R; problemset5.Rmd

---
title: "Problem Set 5"
runtime: shiny
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

Problem Set 5
========================================================

### Working directory and libraries

```{r setup 2}
setwd('/Users/olgabelitskaya/version-control/reflections-ud651')
```

```{r Libraries 1}
library(ggplot2)
library(lubridate)
```

```{r Libraries 2}
library(gridExtra)
library(plyr)
```

```{r Libraries 3}
library(scales)
library(reshape2)
```

```{r Libraries 4}
library(dplyr)
library(tidyr)
```

```{r Libraries 5}
library(xlsx)
library(ggthemes)
```

## Useful links

```{r Links}
# http://docs.ggplot2.org/current/
# http://docs.ggplot2.org/current/coord_trans.html
# http://sape.inf.usi.ch/quick-reference/ggplot2/themes
# http://personality-project.org/r/html/corr.test.html
# https://rpubs.com/hadley/ggplot2-layers
# http://rmarkdown.rstudio.com/articles_integration.html
# https://cran.r-project.org/web/packages/ggthemes/vignettes/ggthemes.html
# http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/
```

## 5.1

#### Create a histogram of diamond prices. Facet the histogram by diamond color and use cut to color the histogram bars.
#### The plot should look something like this: http://i.imgur.com/b5xyrOu.jpg.
#### Note: In the link, a color palette of type 'qual' was used to color the histogram using scale_fill_brewer(type = 'qual')

```{r 5.1.1}
p1 <- ggplot(diamonds, aes(x = price, fill = cut)) + geom_histogram() + facet_wrap(~ color) + scale_fill_brewer(type = 'qual', palette = 'Spectral') + xlab("Price") + ylab("Count") + theme_gray()

```

```{r 5.1.2}
p2 <- ggplot(diamonds, aes(x = price, fill = cut)) + geom_histogram() + facet_wrap(~ color) + scale_x_log10(expression(paste(Log[10], " of Price"))) + ylab("Count") + scale_fill_brewer(type = 'qual', palette = 'Spectral') + theme_gray()

g1 <- grid.arrange(p1, p2, ncol=1)
ggsave("05s01.jpg", g1, width = 8, height = 12)
```

## 5.2

#### Create a scatterplot of diamond price vs. table and color the points by the cut of the diamond.
#### The plot should look something like this: http://i.imgur.com/rQF9jQr.jpg.
#### Note: In the link, a color palette of type 'qual' was used to color the scatterplot using scale_color_brewer(type = 'qual')

```{r 5.2}
ggplot(diamonds, aes(x = table, y = price, color = cut)) + geom_jitter(size = 3, alpha=0.8, shape = 17) + scale_x_continuous(breaks = seq(42, 80, 1), limits = c(42, 80)) + scale_color_brewer(type = 'qual', palette = 'Set1') + theme_bw()

ggsave("05s02.jpg", width = 12, height = 8)
```

## 5.3

#### What is the typical table range for the majority of diamonds of ideal cut?
## (54; 57)

#### What is the typical table range for the majority of diamonds of premium cut?
## (58; 60)

#### Use the graph that you created from the previous exercise to see the answer. You do not need to run summaries.

## 5.4

#### Create a scatterplot of diamond price vs. volume (x * y * z) and color the points by the clarity of diamonds. Use scale on the y-axis to take the log10 of price. You should also omit the top 1% of diamond volumes from the plot.
#### Note: Volume is a very rough approximation of a diamond's actual volume.
#### The plot should look something like this: http://i.imgur.com/excUpea.jpg.
#### Note: In the link, a color palette of type 'div' was used to color the scatterplot using scale_color_brewer(type = 'div').

```{r 5.4.1}
diamonds <- diamonds %>%
mutate(volume = x * y *z)

p3 <-ggplot(subset(diamonds, volume <= quantile(volume, 0.99) & volume > 0 ), aes(x = volume, y = price, color = clarity)) + geom_jitter(size = 3, alpha=0.7, shape = 18) + scale_color_brewer(type = 'div', palette = 'Spectral') + theme_solarized()

p4 <-ggplot(subset(diamonds, volume <= quantile(volume, 0.99) & volume > 0 ), aes(x = volume, y = price, color = clarity)) + scale_y_log10() + geom_jitter(size = 3, alpha=0.7, shape = 18) + scale_color_brewer(type = 'div', palette = 'Spectral') + ylab("log10 of price")+ theme_solarized()

g2 <- grid.arrange(p3, p4, ncol=2)
ggsave("05s03.jpg", g2, width = 16, height = 8)
```
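
The trimming condition used in the `subset()` calls above can be checked in isolation. The sketch below is only an illustration: the vector `vol` is hypothetical toy data, not the diamonds data set, and it is chosen so that one value is zero and one value lies above the 99th percentile.

```{r Sketch 5.4}
# Hypothetical volumes: one zero (e.g. a missing dimension) and the values 1..100
vol <- c(0, 1:100)

# Same condition as in the subset() calls above:
# keep positive volumes that do not exceed the 99th percentile
cutoff <- quantile(vol, 0.99)
keep <- vol > 0 & vol <= cutoff

sum(!keep)        # number of observations dropped
range(vol[keep])  # smallest and largest volume that survive the filter
```

The `volume > 0` part matters for the real data as well, since a diamond recorded with a zero dimension gets a volume of zero and would otherwise stay in the plot.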

```{r 5.4.2}
diamonds <- diamonds %>%
mutate(volume = x * y *z)

p5 <-ggplot(subset(diamonds, volume <= quantile(volume, 0.99) & volume > 0 ), aes(x = volume, y = price, color = color)) + geom_jitter(size = 3, alpha=0.7, shape = 18) + scale_color_brewer(type = 'div', palette = 'Set1') + theme_solarized()

p6 <-ggplot(subset(diamonds, volume <= quantile(volume, 0.99) & volume > 0 ), aes(x = volume, y = price, color = color)) + scale_y_log10() + geom_jitter(size = 3, alpha=0.7, shape = 18) + scale_color_brewer(type = 'div', palette = 'Set1') + ylab("log10 of price")+ theme_solarized()

g3 <- grid.arrange(p5, p6, ncol=1)
ggsave("05s04.jpg", g3, width = 8, height = 16)
```

## 5.5

#### Many interesting variables are derived from two or more others. For example, we might wonder how much of a person's network on a service like Facebook the user actively initiated. Two users with the same degree (or number of friends) might be very different if one initiated most of those connections on the service, while the other initiated very few. So it could be useful to consider this proportion of existing friendships that the user initiated. This might be a good predictor of how active a user is compared with their peers, or other traits, such as personality (i.e., is this person an extrovert?).

#### Your task is to create a new variable called 'prop_initiated' in the Pseudo-Facebook data set. The variable should contain the proportion of friendships that the user initiated.

```{r 5.5}
pf <- read.delim('pseudo_facebook.tsv')
pf$prop_initiated <- ifelse(pf$friend_count > 0, pf$friendships_initiated/pf$friend_count, 0)

# variant 2
# pf <- pf %>%
# mutate(prop_initiated = ifelse(friend_count > 0, friendships_initiated/friend_count, 0))
```
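
The `ifelse()` guard above changes how users without friends are treated. A minimal sketch on hypothetical toy data (the data frame `toy` is an assumption for illustration, not part of the Pseudo-Facebook file) shows the difference between the guarded and the unguarded ratio:

```{r Sketch 5.5}
# Hypothetical toy data: three users, one of them with no friends
toy <- data.frame(friend_count = c(10, 0, 4),
                  friendships_initiated = c(5, 0, 1))

# Plain division gives NaN for the user with no friends (0 / 0)
raw_prop <- toy$friendships_initiated / toy$friend_count

# Guarded version, same pattern as in the chunk above
guarded_prop <- ifelse(toy$friend_count > 0,
                       toy$friendships_initiated / toy$friend_count, 0)

# NaN values are skipped with na.rm = TRUE, while zeros stay in the average
mean(raw_prop, na.rm = TRUE)
mean(guarded_prop)
```

Replacing `NaN` with 0 keeps every user in later summaries, so a mean computed on the guarded variable is lower than one computed on the raw ratio with `na.rm = TRUE`.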

## 5.6

#### Create a line graph of the median proportion of friendships initiated ('prop_initiated') vs. tenure and color the line segment by year_joined.bucket.
#### Recall, we created year_joined.bucket in Lesson 5 by first creating year_joined from the variable tenure. Then, we used the cut function on year_joined to create four bins or cohorts of users.
#### (2004, 2009]
#### (2009, 2011]
#### (2011, 2012]
#### (2012, 2014]
#### The plot should look something like this: http://i.imgur.com/vNjPtDh.jpg OR this: http://i.imgur.com/IBN1ufQ.jpg

```{r 5.6.1}
pf_yj <- pf %>%
mutate(year_joined = floor(2014 - tenure/365), year_joined_bucket = cut(year_joined, breaks=c(2004, 2009, 2011, 2012, 2014)))
```
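
How `cut()` assigns values to the four cohorts can be seen on a small vector. The sketch below uses a hypothetical vector `years` (an assumption for illustration) with the same breaks as the chunk above:

```{r Sketch 5.6}
# Hypothetical join years, including the lower edge of the first bin
years <- c(2004, 2005, 2009, 2010, 2012, 2013, 2014)
bucket <- cut(years, breaks = c(2004, 2009, 2011, 2012, 2014))

levels(bucket)                  # labels of the four cohorts
table(bucket, useNA = "ifany")  # counts per cohort, plus any NA
```

By default the intervals are open on the left and closed on the right, so a value equal to the first break (2004 here) falls into no bin and becomes `NA`. The generated labels contain no space after the comma, which matters when a cohort is later selected by its label string.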

```{r 5.6.2}
ggplot(subset(pf_yj, tenure > 0), aes(x=tenure, y=prop_initiated)) + geom_line(aes(color=year_joined_bucket), stat='summary', fun.y=median) + scale_color_brewer(type = 'seq', palette = 'Spectral') + theme_economist()

ggsave("05s05.jpg", width = 12, height = 8)
```

## 5.7

#### Smooth the last plot you created of prop_initiated vs tenure colored by year_joined.bucket. You can bin together ranges of tenure or add a smoother to the plot.

```{r 5.7}
ggplot(subset(pf_yj, tenure > 0), aes(x=tenure, y=prop_initiated)) + geom_line(aes(color=year_joined_bucket), stat='summary', fun.y=median) + geom_smooth(color=rainbow(80)) + scale_color_brewer(type = 'seq', palette = 'Pastel1') + theme_bw()

ggsave("05s06.jpg", width = 12, height = 8)
```
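
The task also mentions binning ranges of tenure as an alternative to a smoother. A minimal sketch of that idea follows, assuming 30-day bins and using a hypothetical toy data frame `toy` in place of `pf_yj`:

```{r Sketch 5.7}
# Hypothetical toy data standing in for pf_yj
toy <- data.frame(tenure = c(1, 10, 20, 40, 50, 70, 95),
                  prop_initiated = c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3))

# Round each tenure to the nearest multiple of 30 days
toy$tenure_bin <- 30 * round(toy$tenure / 30)

# Median proportion per bin; fewer, wider bins give a less noisy line
binned <- aggregate(prop_initiated ~ tenure_bin, data = toy, FUN = median)
binned
```

In a plot, the same effect could be obtained by mapping `x` to the binned tenure instead of the raw tenure; the bin width of 30 is an arbitrary choice here and trades detail for smoothness.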

## 5.8

#### On average, which group initiated the greatest proportion of its Facebook friendships? The plot with the smoother that you created in the last exercise can help you answer this question.

## (2012, 2014]

## 5.9

#### For the group with the largest proportion of friendships initiated, what is the group's average (mean) proportion of friendships initiated?

## 0.64

```{r 5.9}
pf_yj %>%
filter(year_joined_bucket == "(2012,2014]") %>%
summarise(avg = mean(prop_initiated, na.rm=TRUE))
```

## 5.10

#### Create a scatter plot of the price/carat ratio of diamonds. The variable x should be assigned to cut. The points should be colored by diamond color, and the plot should be faceted by clarity.
#### The plot should look something like this: http://i.imgur.com/YzbWkHT.jpg.

#### Note: In the link, a color palette of type 'div' was used to color the scatterplot using scale_color_brewer(type = 'div').

```{r 5.10}
ggplot(diamonds, aes(x = cut, y = price/carat, color = color)) + geom_jitter(size = 2, alpha=0.7, shape = 18) + facet_wrap(~clarity) + scale_color_brewer(type = 'div', palette = "Set1") + theme_gdocs()

ggsave("05s07.jpg", width = 16, height = 8)
```

## 5.11

#### The Gapminder website contains over 500 data sets with information about the world's population. Your task is to continue the investigation you did at the end of Problem Set 4 or you can start fresh and choose a different data set from Gapminder.
#### If you’re feeling adventurous or want to try some data munging, see if you can find a data set or scrape one from the web.
#### In your investigation, examine 3 or more variables and create 2-5 plots that make use of the techniques from Lesson 5.
#### You can find a link to the Gapminder website in the Instructor Notes.
#### Once you've completed your investigation, create a post in the discussions that includes:
#### 1. the variable(s) you investigated, your observations, and any summary statistics
#### 2. snippets of code that created the plots
#### 3. links to the images of your plots

```{r Data 5.11.1}
fact <- tbl_df(read.csv2("factbook.csv", header=TRUE))
names(fact)
```

```{r Data 5.11.2}
row.with.na <- apply(fact, 1, function(x){any(is.na(x))})
sum(row.with.na)
fact <- fact[!row.with.na,]
```

```{r Data 5.11.3}
names(fact)[1] <- "country"
names(fact)[2] <- "area"
names(fact)[3] <- "birth_rate"
names(fact)[4] <- "current_account_balance"
names(fact)[5] <- "death_rate"
names(fact)[6] <- "debt_external"
names(fact)[7] <- "electricity_consumption"
names(fact)[8] <- "electricity_production"
names(fact)[9] <- "exports"
names(fact)[10] <- "gdp"
```

```{r Data 5.11.4}
names(fact)[11] <- "gdp_per_cap"
names(fact)[12] <- "gdp_real"
names(fact)[13] <- "aids_adults"
names(fact)[14] <- "aids_deaths"
names(fact)[15] <- "aids_liv"
names(fact)[16] <- "highways"
names(fact)[17] <- "imports"
names(fact)[18] <- "industrial_production_growth_rate"
names(fact)[19] <- "infant_mortality_rate"
names(fact)[20] <- "inflation_rate"
```

```{r Data 5.11.5}
names(fact)[21] <- "internet_hosts"
names(fact)[22] <- "internet_users"
names(fact)[23] <- "investment_gross"
names(fact)[24] <- "labor_force"
names(fact)[25] <- "life_expectancy"
names(fact)[26] <- "military_expenditures"
names(fact)[27] <- "military_expenditures_percent"
names(fact)[28] <- "natural_gas_consumption"
names(fact)[29] <- "natural_gas_exports"
names(fact)[30] <- "natural_gas_imports"
```

```{r Data 5.11.6}
names(fact)[31] <- "natural_gas_production"
names(fact)[32] <- "natural_gas_reserves"
names(fact)[33] <- "oil_consumption"
names(fact)[34] <- "oil_exports"
names(fact)[35] <- "oil_imports"
names(fact)[36] <- "oil_production"
names(fact)[37] <- "oil_reserves"
names(fact)[38] <- "population"
names(fact)[39] <- "public_debt"
names(fact)[40] <- "railways"
```

```{r Data 5.11.7}
names(fact)[41] <- "reserves_foreign_exchange"
names(fact)[42] <- "phone_main_lines"
names(fact)[43] <- "mobile_phones"
names(fact)[44] <- "total_fertility_rate"
names(fact)[45] <- "unemployment_rate"

names(fact)
```

```{r Data 5.11.8}
country_set = c("Czech Republic", "United Kingdom", "Spain", "Austria", "Italy", 'Denmark', 'Hungary', 'Ireland', "Greece", "Poland")
fact1 <- fact[which(fact$country %in% country_set),]

p7 <-ggplot(subset(fact1, internet_hosts <= quantile(internet_hosts, 0.99) & internet_hosts > 0 ), aes(x = internet_hosts, y = internet_users, color = country)) + geom_jitter(size = 5, alpha=0.7, shape = 10) + scale_color_brewer(type = 'div', palette = 'Set1') + theme_solarized()

p8 <-ggplot(subset(fact1, internet_hosts <= quantile(internet_hosts, 0.99) & internet_hosts > 0 ), aes(x = internet_hosts, y = internet_users, color = country)) + scale_y_log10() + geom_jitter(size = 5, alpha=0.7, shape = 10) + scale_color_brewer(type = 'div', palette = 'Set1') + ylab("log10 of internet_users")+ theme_solarized()

g4 <- grid.arrange(p7, p8, ncol=1)
ggsave("05s08.jpg", g4, width = 8, height = 16)

```

```{r Data 5.11.9}

p9 <- ggplot(fact, aes(country)) + geom_point(aes(y=oil_exports), color = 'red', size = 5, shape = 2) + geom_text(data=fact, mapping=aes(x=country, y=oil_exports), label='e', size=4, color ='red') + geom_point(aes(y=oil_imports), color="green", size = 5, shape = 6) + ylab('oil exports and imports') + geom_text(data=fact, mapping=aes(x=country, y=oil_imports), label='i', size=4, color ='green') + theme_bw() + theme(axis.text.x=element_text(size=10, angle=20),axis.title=element_text(size=12), legend.position = "bottom")

p10 <- ggplot(fact, aes(country)) + geom_point(aes(y=natural_gas_exports), color = 'red', size = 5, shape = 2) + geom_text(data=fact, mapping=aes(x=country, y=natural_gas_exports), label='e', size=4, color ='red') + geom_point(aes(y=natural_gas_imports), color="green", size = 5, shape = 6) + geom_text(data=fact, mapping=aes(x=country, y=natural_gas_imports), label='i', size=4, color ='green') + ylab('natural gas exports and imports') + theme_bw() + theme(axis.text.x=element_text(size=10, angle=20),axis.title=element_text(size=12))

g5 <- grid.arrange(p9, p10, ncol=1)

ggsave("05s09.jpg", g5, width = 16, height = 8)
```

Sunday, 14 August 2016