Saturday, August 13, 2016

DATA ANALYSIS WITH R; lesson3.Rmd

---
title: "Lesson 3"
runtime: shiny
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

Lesson 3
========================================================

***

### What to Do First?
Set the working directory.

```{r setup 2}
setwd('/Users/olgabelitskaya/version-control/reflections-ud651')
```
***

### Pseudo-Facebook User Data
Read the TSV file into the `pf` data frame.

```{r Pseudo-Facebook User Data}
pf <- read.csv('pseudo_facebook.tsv', sep='\t')
names(pf)
```
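
A minimal sketch of the same reading step on a tiny made-up file, in case the real `pseudo_facebook.tsv` is not at hand. The column names and values here are invented for illustration. It also shows that `read.delim()` uses a tab separator by default, so it should give the same result as `read.csv(..., sep = '\t')`.

```{r Sketch reading a TSV}
# Write a tiny tab-separated file (made-up values) to a temporary location.
tmp <- tempfile(fileext = '.tsv')
writeLines(c('userid\tage\tgender', '1\t14\tmale', '2\t21\tfemale'), tmp)

# Both calls parse the same file; read.delim() defaults to sep = '\t'.
a <- read.csv(tmp, sep = '\t')
b <- read.delim(tmp)

identical(a, b)  # compare the two data frames
dim(a)           # rows and columns that were read
unlink(tmp)      # remove the temporary file
```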

***

### Histogram of Users' Birthdays
Load the plotting library.

```{r Histogram of Users\' Birthdays}
library(ggplot2)
qplot(x=dob_day, data=pf, colour = I("blue"))
```

***

#### Useful information for qplot.

```{r ?qplot}
?qplot
```

***

### Other libraries.

```{r Library 1}
library(knitr)
```


```{r Library 2}
library(ggthemes)
theme_set(theme_minimal(7))
```

```{r Library 3}
library(gridExtra)
```

***

### Estimating Your Audience Size
My latest Facebook post was about my studies, so I doubt it reached any audience.
***

### Some useful links for plotting

#### http://statistics.ats.ucla.edu/stat/r/modules/factor_variables.htm

#### http://hci.stanford.edu/publications/2013/invisibleaudience/invisibleaudience.pdf

#### http://docs.ggplot2.org/current/

#### https://en.wikipedia.org/wiki/Web_colors

***

### Faceting

```{r Faceting 1}
ggplot(aes(x = dob_day), data = pf) + geom_histogram(binwidth = 1) +
  scale_x_continuous(breaks = 1:31)
```

#### By months

```{r Faceting 2}
ggplot(aes(x = dob_day), data = pf) + geom_histogram(binwidth = 1) +
  scale_x_continuous(breaks = 1:31) + facet_wrap(~dob_month)
```

***

### Information about faceting methods

```{r Faceting 3}
?facet_wrap

?facet_grid
```
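
A small sketch contrasting the two faceting functions on made-up data (the `toy` data frame is hypothetical, not part of the course data set). `facet_wrap()` lays the panels of one variable out in a wrapped ribbon, while `facet_grid()` arranges panels in rows and columns defined by two variables.

```{r Sketch facet_wrap versus facet_grid}
library(ggplot2)

# Made-up birthdays; column names mirror those in pf.
toy <- data.frame(
  dob_day = c(1, 1, 2, 15, 15, 31),
  dob_month = c(1, 1, 2, 2, 3, 3),
  gender = c('female', 'male', 'female', 'male', 'female', 'male')
)

base <- ggplot(aes(x = dob_day), data = toy) + geom_histogram(binwidth = 1)

# One panel per month, wrapped into two columns.
wrapped <- base + facet_wrap(~dob_month, ncol = 2)

# Rows by gender, columns by month.
gridded <- base + facet_grid(gender ~ dob_month)

wrapped
gridded
```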

***

### Moira's Outlier
#### Which case do you think applies to Moira’s outlier?
Response: bad data about an extreme case.
***

### Friend Count

```{r Friend Count 1}
summary(pf$friend_count)
```

#### Plotting this

```{r Friend Count 2}
qplot(x = friend_count, data = pf)
```

***

### Limiting the Axes and exploring with Bin Width

```{r Limiting the Axes, exploring with Bin Width}
qplot(x = friend_count, data = pf, binwidth = 10) +
  scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50))
```
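
To see what the bin width does to the counts, here is a rough sketch in base R with made-up friend counts. `hist(..., plot = FALSE)` only returns the counts; ggplot2 may place its bin edges differently, so the numbers are an illustration of the idea rather than a reproduction of the plot above.

```{r Sketch bin width}
friend_count <- c(3, 8, 12, 27, 31, 44, 49)  # made-up values

# Narrow bins of width 10 and wide bins of width 25 over the same range.
narrow <- hist(friend_count, breaks = seq(0, 50, by = 10), plot = FALSE)$counts
wide <- hist(friend_count, breaks = seq(0, 50, by = 25), plot = FALSE)$counts

narrow  # more bins, fewer observations in each
wide    # fewer bins, more observations in each
```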


***

### Statistics 'by' Gender
```{r Statistics \'by\' Gender}
table(pf$gender)

by(pf$friend_count, pf$gender, summary)
```
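
A sketch of how `by()` splits one variable by another, using a hypothetical `toy` data frame with made-up values. Rows with a missing gender do not form a group, which matches `table()` leaving `NA` out by default.

```{r Sketch group summaries}
# Made-up data; the NA in gender mirrors the missing values in pf.
toy <- data.frame(
  friend_count = c(0, 10, 20, 30, 40, 50),
  gender = factor(c('female', 'female', 'female', 'male', 'male', NA))
)

# by() applies the function to friend_count within each gender level.
by(toy$friend_count, toy$gender, summary)

# tapply() gives the same kind of group-wise result as a named vector.
group_medians <- tapply(toy$friend_count, toy$gender, median)
group_medians
```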

***

### Plotting Friend Count by gender
```{r Plotting Friend Count by gender 1}
qplot(x = friend_count, data = pf) + facet_grid(gender ~ .)
```

***
```{r Plotting Friend Count by gender 2}
qplot(x = friend_count, data = pf, binwidth = 10) +
  scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50)) +
  facet_wrap(~gender)
```
 
### Omitting NA Values

```{r Omitting NA Values}
ggplot(aes(x = friend_count), data = subset(pf, !is.na(gender))) +
  geom_histogram(binwidth = 30, color = 'red', fill = '#099DD9') +
  scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50)) +
  facet_wrap(~gender)
```
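
A minimal sketch of the filtering step on made-up data (`toy` is hypothetical). `subset(..., !is.na(gender))` removes only the rows where gender is missing, whereas `na.omit()` would remove rows with a missing value in any column.

```{r Sketch dropping NA rows}
toy <- data.frame(
  friend_count = c(5, 12, 7, 30),
  gender = factor(c('male', NA, 'female', NA))
)

sum(is.na(toy$gender))               # how many rows lack a gender

kept <- subset(toy, !is.na(gender))  # keep rows with a known gender
nrow(kept)
```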

***

### Tenure
Exploring with colors.

```{r Tenure}
ggplot(aes(x = tenure), data = pf) +
  geom_histogram(binwidth = 30, color = 'green', fill = '#099DD9')

```

***

#### How would you create a histogram of tenure by year?

```{r Tenure Histogram by Year}
ggplot(aes(x = tenure/365), data = pf) +
  geom_histogram(binwidth = .1, color = 'purple', fill = '#00FFFF')
```

***

### Labeling Plots

```{r Labeling Plots}
ggplot(aes(x = tenure / 365), data = pf) +
  geom_histogram(binwidth = .1, color = 'brown', fill = '#F79420') +
  scale_x_continuous(breaks = seq(1, 7, 1), limits = c(0, 7)) +
  xlab('Number of years using Facebook') +
  ylab('Number of users in sample')
```

***

### User Ages

```{r User Ages}
ggplot(aes(x = age), data = pf) +
  geom_histogram(binwidth = 1, color = 'red', fill = '#5760AB') +
  scale_x_continuous(breaks = seq(0, 113, 5))
```

***

### Transforming Data
```{r Transforming Data 1}
summary(pf$friend_count)

summary(log10(pf$friend_count + 1))

summary(sqrt(pf$friend_count))
```
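
The `+ 1` inside `log10()` matters because some users have zero friends. A short sketch with made-up counts shows the difference:

```{r Sketch log transform with zeros}
counts <- c(0, 9, 99, 999)    # made-up friend counts, including a zero

log10(counts)                 # the zero becomes -Inf
shifted <- log10(counts + 1)  # adding 1 keeps every value finite
shifted
sqrt(counts)                  # the square root handles zero without a shift
```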

***
### Transforming Data (plotting)
```{r Transforming Data 2}
ggplot(aes(x = friend_count), data = pf) +
  geom_histogram(binwidth = 30, color = 'green', fill = '#099DD9')
```

***

### Transforming Data2 (plotting)
```{r Transforming Data 3}
ggplot(aes(x = log10(friend_count + 1)), data = pf) +
  geom_histogram(binwidth = 0.1, color = 'purple', fill = '#099DD9')
```

***

### Transforming Data3 (plotting)
```{r Transforming Data 4}
ggplot(aes(x = sqrt(friend_count)), data = pf) +
  geom_histogram(binwidth = 1, color = 'red', fill = '#099DD9')
```

***

```{r Transforming Data 5}
?scale_x_log10
```

***

```{r Transforming Data 6}
p1 <- qplot(x = friend_count, data = pf)
p2 <- qplot(x = log10(friend_count + 1), data = pf)
p3 <- qplot(x = sqrt(friend_count),data = pf)
grid.arrange(p1, p2, p3, ncol=1)
```

***

### Add a Scaling Layer
```{r Add a Scaling Layer}
p4 <- ggplot(aes(x = friend_count), data = pf) +
  geom_histogram() +
  scale_x_log10()

grid.arrange(p2, p4, ncol=2)
```

***

### Frequency Polygons

```{r Frequency Polygons}
ggplot(aes(x = friend_count, y = ..count../sum(..count..)), data = subset(pf, !is.na(gender))) +
  geom_freqpoly(aes(color = gender), binwidth=10) +
  scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50)) +
  xlab('Friend Count') +
  ylab('Proportion of users with that friend count')
```
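
In the chunk above, `sum(..count..)` appears to run over all genders together, so the y values are shares of all users rather than shares within each gender. A sketch with a made-up table shows the two kinds of proportion; the `toy` data and bin labels are hypothetical.

```{r Sketch overall versus within-group proportions}
toy <- data.frame(
  gender = c('female', 'female', 'female', 'male'),
  bin = c('0-9', '0-9', '10-19', '0-9')
)

counts <- table(toy$gender, toy$bin)

overall <- prop.table(counts)             # each cell divided by the grand total
within <- prop.table(counts, margin = 1)  # each row divided by its own total

overall
within
```

If shares within each gender are what you want, `y = ..density..` is one option to try in the plot.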

***

### Likes on the Web

```{r Likes on the Web 1}
# A single histogram layer; qplot() plus geom_histogram() would draw two.
ggplot(aes(x = www_likes), data = pf) +
  geom_histogram(color = 'red', fill = '#099DD9')
```

```{r Likes on the Web 2}
qplot(x = www_likes, data = subset(pf, !is.na(gender)),
      geom = 'freqpoly', color = gender)
```

```{r Likes on the Web 3}
ggplot(aes(x = www_likes), data = subset(pf, !is.na(gender))) +
  geom_freqpoly(aes(color = gender)) + scale_x_log10()
```

```{r Likes on the Web 4}
summary(pf$www_likes)
by(pf$www_likes, pf$gender, sum)
```

***

### Box Plots
```{r Friend Count by Gender}
qplot(x = friend_count, data = subset(pf, !is.na(gender)),
      binwidth=25, color = gender) +
  scale_x_continuous(limits = c(0, 1000), breaks = seq(0, 1000, 50)) +
  facet_wrap(~gender)
```

#### http://flowingdata.com/2008/02/15/how-to-read-and-use-a-box-and-whisker-plot/

```{r Box Plots}
qplot(x = gender, y = friend_count, data = subset(pf, !is.na(gender)), geom = 'boxplot', color = gender)
```

***

#### Adjust the code to focus on users who have friend counts between 0 and 1000.

```{r 1}
# Scale limits drop rows outside 0-1000 before the box statistics are computed.
qplot(x = gender, y = friend_count, data = subset(pf, !is.na(gender)),
      geom = 'boxplot', color = gender) +
  scale_y_continuous(limits = c(0, 1000))
```

***

```{r 2}
# ylim in qplot() also sets scale limits, so out-of-range rows are dropped too.
qplot(x = gender, y = friend_count, data = subset(pf, !is.na(gender)),
      geom = 'boxplot', color = gender, ylim = c(0, 1000))
```

***

```{r 3}
# coord_cartesian() only zooms the view; the quartiles still use every row.
qplot(x = gender, y = friend_count, data = subset(pf, !is.na(gender)),
      geom = 'boxplot', color = gender) +
  coord_cartesian(ylim = c(0, 1000))
```
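
The three chunks above look similar but differ in one respect: setting limits on the scale discards observations outside the range before the box is computed, while `coord_cartesian()` only zooms. A sketch in base R with made-up values shows how discarding an extreme value shifts the median:

```{r Sketch limits versus zoom}
friend_count <- c(10, 50, 100, 400, 5000)  # made-up, with one extreme value

# All observations, as used when zooming with coord_cartesian().
median_all <- median(friend_count)

# Values above 1000 removed, as happens with scale limits or ylim.
trimmed <- friend_count[friend_count <= 1000]
median_trimmed <- median(trimmed)

median_all
median_trimmed
```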

***
### Box Plots, Quartiles, and Friendships

```{r Box Plots, Quartiles, and Friendships}
by(pf$friend_count, pf$gender, sum)
by(pf$friend_count, pf$gender, summary)
```

***

#### Write about some ways that you can verify your answer.

```{r Friend Requests by Gender}
by(pf$friendships_initiated, pf$gender, summary)
```

***

### Getting Logical

```{r Getting Logical 1}
summary(pf$mobile_likes)
summary(pf$mobile_likes > 0)

# Create the new column in pf (a bare mobile_check_in would be a separate variable).
pf$mobile_check_in <- NA
pf$mobile_check_in <- ifelse(pf$mobile_likes > 0, 1, 0)
pf$mobile_check_in <- factor(pf$mobile_check_in)
summary(pf$mobile_check_in)
```

```{r Getting Logical 2}
sum(pf$mobile_check_in == 1) / length(pf$mobile_check_in)
```
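
The same share can be computed directly from the logical vector, because `TRUE` counts as 1 and `FALSE` as 0 when averaged. A sketch with made-up values:

```{r Sketch share of a logical}
mobile_likes <- c(0, 3, 0, 12, 1)  # made-up values

# The factor-based approach from the chunk above.
check_in <- factor(ifelse(mobile_likes > 0, 1, 0))
share_a <- sum(check_in == 1) / length(check_in)

# The shortcut: the mean of a logical vector is the share of TRUE values.
share_b <- mean(mobile_likes > 0)

share_a
share_b
```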

***
### Analyzing One Variable
Reflection:
R is an excellent tool for analyzing and visualizing data.
So far I have only basic skills: reading delimited files, analyzing one variable, plotting, faceting, and transforming data.
Let's move on to new material.
