суббота, 13 августа 2016 г.

DATA ANALYSIS WITH R; problemset3.Rmd

---
title: "Problem Set 3"
runtime: shiny
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```{r setup2}
setwd('/Users/olgabelitskaya/version-control/reflections-ud651')
```

## Useful links

```{r Links}
# http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf
# http://www.cookbook-r.com/Graphs/Shapes_and_line_types/
```

## Libraries
```{r Libraries1}
library(ggplot2)
library(lubridate)
```

```{r Libraries2}
library(ggthemes)
theme_set(theme_minimal(7))
```

```{r Libraries3}
library(gridExtra)
library(plyr)
```

```{r Libraries}
library(scales)
library(reshape2)
```

```{r Libraries4}
library(dplyr)
library(tidyr)
```

```{r Libraries5}
library(xlsx)
```

## Dataset
## 3.1
#### a) Load the 'diamonds' data set in R Studio.
#### How many observations are in the data set? 53940
#### b) How many variables are in the data set? 10
#### c) How many ordered factors are in the set? 3
#### d) What letter represents the best color for a diamonds? D

```{r Data set}
data(diamonds)

summary(diamonds)
str(diamonds)
nrow(diamonds)
ncol(diamonds)
levels(diamonds$color)
```

## 3.2
#### Create a histogram of the price of all the diamonds in the diamond data set.

```{r Histograms1}
qplot(x = price, data = diamonds, binwidth = 400, color = I("blue"), fill = I("red"))
```

## 3.3
#### Describe the shape and center of the price distribution. Include summary statistics like the mean and median.

```{r Histograms2}
summary(diamonds$price)
```

#### Wide-spread, skewed right and, of course, non-symmetric shape; unimodal (with one peak), without many outliers.
#### Median = 2401, mean = 3933; median < mean for this type of distribution.

## 3.4
#### a) How many diamonds cost less than $500? 1729
#### b) How many diamonds cost less than $250? 0
#### c) How many diamonds cost more than $15,000? 1656

```{r Prices}
sum(diamonds$price < 500)

sum(diamonds$price < 250)

sum(diamonds$price >= 15000)
```


## 3.5

#### Explore the largest peak in the price histogram you created earlier.

#### Try limiting the x-axis, altering the bin width, and setting different breaks on the x-axis.

```{r Histograms3}
ggplot(aes(x = price), data = diamonds) + geom_histogram(binwidth = 500, color = 'purple', fill = '#099DD9') + scale_x_continuous(limits = c(500, 10000), breaks = seq(500, 10000, 500))

ggsave("pricehist1.jpg")
```


```{r Histograms4}
ggplot(aes(x = price), data = diamonds) +
  geom_histogram(binwidth = 50, color = 'orange',
                 aes(fill=..count..), alpha = .8) +
  scale_x_continuous(limits = c(500, 2000), breaks = seq(500, 2000, 100)) +
  scale_fill_gradient("Count", low = "blue", high = "red")

ggsave("pricehist2.jpg")
```


```{r Histograms5}
ggplot(data=diamonds, aes(x = price)) +
  geom_histogram(aes(y =..density..), col="red", fill="green", alpha = .7) +
  scale_x_continuous(limits = c(500, 2000), breaks = seq(500, 2000, 100)) +
  geom_density(col=2) +
  labs(title="Histogram for Price") + labs(x="Price", y="Count")

ggsave("pricehist3.jpg")
```


```{r Histograms6}
ggplot(data = diamonds, aes(x=price, fill=cut(..x.., seq(500, 2000, 100)))) +
  scale_x_continuous(limits = c(500, 2000), breaks = seq(500, 2000, 100)) +
  geom_histogram(binwidth=50, color="steelblue")

ggsave("pricehist4.jpg")
```

## 3.6
#### Break out the histogram of diamond prices by cut.
#### You should have five histograms in separate panels on your resulting plot.

```{r Histograms7}
ggplot(aes(x = price), data = diamonds) +
  geom_histogram(binwidth = 500, color = 'red', fill = '#099DD9') +
  scale_x_continuous(limits = c(500, 15000), breaks = seq(500, 15000, 500)) +
  facet_grid(cut ~ .)

ggsave("pricehist5.jpg")
```

## 3.7

#### a) Which cut has the highest priced diamond? Premium
#### b) Which cut has the lowest priced diamond? Premium & Ideal
#### c) Which cut has the lowest median price? Ideal

```{r Cut1}
diamonds %>%
  group_by(cut) %>%
  summarise(max_price = max(price),
            min_price = min(price),
            median_price = median(price))
```

## 3.8

#### In the two last exercises, we looked at the distribution for diamonds by cut.

#### Run the code below in R Studio to generate the histogram as a reminder.

#### In the last exercise, we looked at the summary statistics for diamond price by cut. If we look at the output table, the the median and quartiles are reasonably close to each other.

#### diamonds$cut: Fair

####    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
####     337    2050    3282    4359    5206   18570
#### ------------------------------------------------------------------------
#### diamonds$cut: Good

####    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
####     327    1145    3050    3929    5028   18790
#### ------------------------------------------------------------------------
#### diamonds$cut: Very Good

####    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
####     336     912    2648    3982    5373   18820
#### ------------------------------------------------------------------------
#### diamonds$cut: Premium

####    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
####     326    1046    3185    4584    6296   18820
#### ------------------------------------------------------------------------
#### diamonds$cut: Ideal

####    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
####     326     878    1810    3458    4678   18810

#### This means the distributions should be somewhat similar, but the histograms we created don't show that.

#### The 'Fair' and 'Good' diamonds appear to have different distributions compared to the better cut diamonds. They seem somewhat uniform on the left with long tails on the right.

#### Let's look in to this more.

#### Look up the documentation for facet-wrap in R Studio. Then, scroll back up and add a parameter to facet_wrap so that the y-axis in the histograms is not fixed. You want the y-axis to be different for each histogram.

```{r Cut2}
qplot(x = price, data = diamonds) + facet_wrap(~cut, scales = "free") + geom_histogram(color = 'orange', aes(fill=..count..), alpha = .8) + scale_fill_gradient("Count", low = "green", high = "red")
```

## 3.9
#### Create a histogram of price per carat and facet it by cut. You can make adjustments to the code from the previous exercise to get started.

#### Adjust the bin width and transform the scale of the x-axis using log10.

```{r Cut3}
ggplot(diamonds, aes(x = price/carat)) +
  geom_histogram(color = "purple", aes(fill=..count..), binwidth = .01) +
  scale_x_log10(expression(paste(Log[10], " of Price")),
                breaks = trans_breaks("log10", function(x) 10^x),
                labels = trans_format("log10", math_format(10^.x))) +
  facet_grid(cut~., scale = "free") + ylab("Count") +
  scale_fill_gradient("Count", low = "green", high = "blue")
```


## 3.10

##### Investigate the price of diamonds using box plots, numerical summaries, and one of the following categorical variables: cut, clarity, or color.

#### There won’t be a solution video for this exercise so go to the discussion thread for either BOXPLOTS BY CLARITY, BOXPLOT BY COLOR, or BOXPLOTS BY CUT to share you thoughts and to see what other people found.

#### You can save images by using the ggsave() command. ggsave() will save the last plot created. For example...
####                  qplot(x = price, data = diamonds)
####                  ggsave('priceHistogram.png')

#### ggsave currently recognises the extensions eps/ps, tex (pictex), pdf, jpeg, tiff, png, bmp, svg and wmf (windows only).

```{r Cut4}
diamonds %>%
  group_by(cut) %>%
  summarise(count = n(), avg_price = mean(price))
```

```{r Cut5}
ggplot(data=diamonds, aes(x = clarity, y = price, color = cut)) +
  geom_boxplot() + facet_grid(color~.)

ggsave("pricehist6.jpg")
```

## 3.11

#### a) What is the price range for the middle 50% of the diamonds with color D? 911; 4213.5
#### b) What is the price range for the middle 50% of diamonds with color J? 1860.5; 7695
#### c) What is the IQR for diamonds with the best color? 3302.5
#### d) What is the IQR for the diamonds with the worst color? 5834.5

```{r Cut6}
by(diamonds$price, diamonds$color, summary)
```

```{r Cut7}
diamonds %>%
  group_by(color) %>%
  filter(color == "D") %>%
  summarise(Quartile.25 = quantile(price, 0.25),
            Quartile.75 = quantile(price, 0.75),
            IQR = Quartile.75 - Quartile.25)
```

```{r Cut8}
diamonds %>%
  group_by(color) %>%
  filter(color == "J") %>%
  summarise(Quartile.25 = quantile(price, 0.25),
            Quartile.75 = quantile(price, 0.75),
            IQR = Quartile.75 - Quartile.25)
```

## 3.12
#### Investigate the price per carat of diamonds across the different colors of diamonds using boxplots.

```{r Carat1}
ggplot(data = diamonds, aes(x = color, y = price/carat, fill = color)) + geom_boxplot() + scale_y_continuous(limits = c(2000, 6000), labels=dollar) + xlab("Color") + ylab("Price per Carat") + stat_summary(fun.y=mean, geom="point", shape=5, size=4) + coord_flip()

ggsave("pricehist8.jpg")
```

## 3.13
#### Investigate the weight of the diamonds (carat) using a frequency polygon. Use different bin widths to see how the frequency polygon changes. What carat size has a count greater than 2000? Check all that apply.
#### 0.3; 1.01

```{r Carat2}
summary(diamonds$carat)
```

```{r Carat3}
ggplot(diamonds, aes(x=carat)) + geom_freqpoly(binwidth=0.1, color = 'blueviolet', alpha = 0.7) +
  scale_x_continuous(breaks = seq(0, 5, 0.1), expand = c(0,0)) +
  geom_vline(xintercept = c(0.1, 0.8, 1.6, 2.0, 3.0, 5.0), color = "firebrick1", linetype="dotted") +
  geom_vline(xintercept = c(0.3, 1.01), color = "green", linetype = "longdash") +
  geom_hline(yintercept = 2000, color = "dodgerblue", linetype="solid", alpha = 0.5) +
  xlab("Carat") + ylab("Count")

ggsave("pricehist7.jpg")
```


## 3.14
#### The Gapminder website contains over 500 data sets with information about the world's population. Your task is to download a data set of your choice and create 2-5 plots that make use of the techniques from Lesson 3.

#### You might use a simple histogram, a boxplot split over a categorical variable, or a frequency polygon. The choice is yours!

#### You can find a link to the Gapminder website in the Instructor Notes.

#### Once you've completed your investigation, create a post in the discussions that includes:
####       1. any questions you answered, your observations, and summary statistics
####       2. snippets of code that created the plots
####       3. links to the images of your plots

```{r Urban1}
urban <- tbl_df(read.xlsx("indicator.xlsx", sheetName="Data", header=TRUE))
```

```{r Urban2}
row.with.na <- apply(urban, 1, function(x){any(is.na(x))})
sum(row.with.na)
```

```{r Urban3}
filtered <- urban[!row.with.na,]
names(filtered)[1] <- "country"
```

```{r Urban4}
population <- melt(filtered, id=c("country"), value.name="population", variable.name="year")
population <- tbl_df(population)
```

```{r Urban5}
population <- population %>%
  mutate(year = as.character(year), year = substr(year, 2, 5), year = as.numeric(year))
```

```{r Urban6}
color_pallete_function <- colorRampPalette(colors = c("red", "green", "purple"), space = "Lab")
```

```{r Urban7}
country_set1 = c("China", "India", "United States", "Brazil", 'Indonesia')
country_set2 = c("Germany", "United Kingdom", "France", "Iran", "Turkey", "Italy", 'Pakistan', 'Philippines', 'Bangladesh')
```

```{r Urban8}
num_colors <- length(country_set2)
```

```{r Urban9}
color_colors <- color_pallete_function(num_colors)
reorder_size <- function(x) {factor(x, levels = names(sort(table(x))))}
```

```{r Urban10}
population3 <- population[which(population$country == 'Russia'),]
population2011 <- population[which(population$year == 2011 & population$country %in%  country_set2),]
population2 <- population[which(population$country %in%  country_set1),]
```


```{r Urban11}
ggplot(data=population2, aes(x = year, y = population, color = country)) + geom_line() + scale_x_continuous(limits = c(1961, 2011), breaks = seq(1961, 2011, 5)) + labs(title="Urban population's dynamics", x="Year", y="Urban population")

ggsave("urban01.jpg")
```

```{r Urban12}
ggplot(data=population2011, aes(x=country, y=reorder_size(population), fill=population)) + geom_bar(stat="identity", color="darkblue") + scale_fill_gradient("Urban population", low = "purple", high = "firebrick1") + labs(title="Urban population in 2011", x="Country", y="Urban population")

ggsave("urban02.jpg")
```

```{r, echo=FALSE}
fit.lm <-lm(population3$population ~ population3$year)
plot(population3$year, population3$population)
abline(fit.lm)
```

```{r Urban13}
ggplot(data=population3, aes(x=year, y=population, color = cut(..y.., seq(64000000, 122000000, 10000000)))) + geom_point() + geom_line(color = 'red') + scale_x_continuous(limits = c(1961, 2011), breaks = seq(1961, 2011, 5)) + geom_hline(yintercept = 100000000, color = "dodgerblue", linetype="solid", alpha = 0.8) + geom_hline(yintercept = 110000000, color = "firebrick2", linetype="solid", alpha = 0.8)+ labs(title="Urban population in Russia", x="Year", y="Urban population")

ggsave("urban03.jpg")
```

## 3.15
#### How many birthdays are in each month? Which day of the year has the most number of birthdays? Do you have at least 365 friends that have birthdays on everyday of the year?

# **********************************************************************

#### You will need to do some data munging and additional research to complete this task. This task won't be easy, and you may encounter some unexpected challenges along the way. We hope you learn a lot from it though.

#### You can expect to spend 30 min or more on this task depending on if use the provided data or obtain your personal data. We also encourage you to use the lubridate package for working with dates. Read over the documentation in RStudio and search for examples online if you need help.

#### You'll need to export your Facebooks friends' birthdays to a csv file. You may need to create a calendar of your Facebook friends birthdays in a program like Outlook or Gmail and then export the calendar as a csv file.

#### Once you load the data into R Studio, you can use the strptime() function to extract the birth months and birth days. We recommend looking up the documentation for the function and finding examples online.

```{r BD1}
bd_examples <- tbl_df(read.csv("birthdaysExample.csv"))

bdates <- mdy(bd_examples$dates)
```

```{r BD2}
bd_examples <- bd_examples %>%
  mutate(birthday = bdates, weekday = weekdays(bdates, abbr=TRUE),
         year = year(bdates), month = month(bdates, label=TRUE, abbr=FALSE),
         day = day(bdates))
```

```{r BD3}
days_by_weekday <- ddply(bd_examples, .(weekday), summarize, dates=length(dates))
days_by_weekday
```

```{r BD4}
days_by_weekday$weekday <- factor(days_by_weekday$weekday, levels= c("пн", "вт", "ср", "чт", "пт",  "сб", "вс"))
days_by_weekday[order(days_by_weekday$weekday), ]
```

```{r BD5}
ggplot(data=days_by_weekday, aes(x = weekday, y = dates, fill = weekday)) +
  geom_bar(stat="identity", color="darkblue")

ggsave("birthdays01.jpg")
```

```{r BD6}
days_by_day <- ddply(bd_examples, .(day), summarize, dates=length(dates))
days_by_day
```

```{r BD7}
ggplot(data=days_by_day, aes(x = day, y = dates, fill = day)) +
  geom_bar(stat="identity", color="darkblue")

ggsave("birthdays02.jpg")
```

```{r BD8}
days_by_month <- ddply(bd_examples, .(month), summarize, dates=length(dates))
days_by_month
```

```{r BD9}
ggplot(data=days_by_month, aes(x = month, y = dates, color = month)) +
  geom_point(stat="identity", shape=8, size=5)

ggsave("birthdays03.jpg")
```

Комментариев нет:

Отправить комментарий