Let’s talk about Tidy data 😄

class: title-slide, top, left
background-image: url(img/rc_tidydata_2020.jpg)
background-size: contain

---

# Doing research involves working with data

---

### Overview

* Whether you're using digital or more traditional methods, **most researchers work with data in some digital format**.  
* As researchers cleaning data and creating analysis datasets can be the most **tedious and time-consuming** aspect of conducting analysis.  
* Data comes in many forms but in this context we’re talking about Tabular data.

---

### Preparing data for analysis

Data scientists can spend up to **40-45%** of their time in projects just cleaning and preparing data for analysis.

![](img/survey.jpg)

.footnote[
Source: *Kaggle ML & DS Survey, 2018.* Machine Learning and Data Science Survey [n=23,859]. https://www.kaggle.com/kaggle/kaggle-survey-2018/
]

---

### Tidy data

* This tutorial is about changing the organisational structure of your data by transforming **untidy data into tidy data**.

![](img/maxresdefault.jpg)

---

### Advantages

* Save **time** cleaning/preparing raw data to prepare a dataset usable for analysis.  
* Saving your process aids **reproducibility** so you can easily repeat your analysis.  
* Storing data in a **consistent structure** means its easier to learn the tools to work with it because they have an underlying uniformity.

---

### Aim of this tutorial

* Understand principles of **good practise in data organisation**.
* Understanding the concept of **Tidy data** and the functions from the **tidyr** package.
* Understand how learning some **basic programming skills** can make your life easier.

---
class: center, middle

# Cleaning data

---

.midi[
.pull-left[
![](img/footscray_streets.png)
]
.pull-right[
Issues: 
* __Spelling errors:__ Street, Road, Avenue, Parade..  
* __Abbreviations:__ Str, St  
* __Ambiguities:__  cnr Albert and Buckley Streets; Albert Street or Road  
]
]

---

### Why do we mean by untidy?

**Dirty data** - fix errors, remove duplicates, syntax errors, standardisation, missing values.

![](img/bad_data.png)

**Structure** - summary tables created to favour presentation or data entry over analysis.

---
class: center, middle

# Tidy data

---

.midi[
.left-column[
 
![](img/tidyr.jpg) 
]
.right-column[
The __Tidyverse__ (https://www.tidyverse.org) is a collection of R packages that work together to clean, process, model, and visualise data.

__Tidyr__ (https://tidyr.tidyverse.org/) is a package that makes it easy to 'tidy' your data.

]
]

---

### What do we mean by Tidy data

* Data that is **usable for analysis**.  
* Data that is **easy to model, visualise and aggregate**.

---

### Three interrelated rules for a tidy dataset

![](img/tidy-data.png)

* Each `variable` is in its own column
* Each `observation` in its own row
* Each `value` in its own cell

---
 
.small[
.pull-left[
![](img/color-cubes.jpg)
]

### A simple example

shape   | colour      | frequency
--------|-------------|----------------
cube    | red         | two
cube    | blue        | one
cube    | green       | one
cube    | yellow      | one
cube    | orange      | one
cube    | purple      | one
]
]

Each observation is data about a coloured block. Three variables are recorded for each block: `shape`, `colour` and `frequency`.

---
class: center, middle

# Common causes of Untidy data

---

### Column headers are values

A common problem is a dataset where some of the column names are not names of variables, but values of a variable.

```r
table4a
```

```
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
```

To tidy a dataset like this, we need to pivot the offending columns into a new pair of variables.

---

### Pivoting (Longer)

```r
table4a %>% 
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
```

```
## # A tibble: 6 x 3
## country year cases
## <chr> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Afghanistan 2000 2666
## 3 Brazil 1999 37737
## 4 Brazil 2000 80488
## 5 China 1999 212258
## 6 China 2000 213766
```

.instructions[
* The set of columns whose **names are values, not variables**. In this example, those are the columns 1999 and 2000.
* The name of the variable to move the column `names to`. Here it is `year`.
* The name of the variable to move the column `values to`. Here it’s `cases`.
]

---

### Pivoting `table4` into a longer, tidy form.

![](img/tidy-9.png)

???

Take `table4a`: the column names 1999 and 2000 represent values of the `year` variable.

The values in the 1999 and 2000 columns represent values of the `cases` variable.

Each row represents `two observations` not one.

---

### An observations is scattered across multiple rows

`pivot_wider()` is the opposite of `pivot_longer()`. In the case of table2: an observation is a country in a year, but each observation is spread across two rows.

```r
head(table2, 10)
```

```
## # A tibble: 10 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
```

---

### Pivoting (Wider)

```r
table2 %>%
    pivot_wider(names_from = type, values_from = count)
```

```
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
```

.instructions[
* The column to take variable names from is `type`.
* The column to take values from is `count`.
]

---

### Pivoting `table2` into a “wider”, tidy form.

![](img/tidy-8.png)

???

pivot_longer() makes wide tables narrower and longer; pivot_wider() makes long tables shorter and wider.

---

### Separate

![](img/tidy-17.png)

`separate()` pulls apart one column into multiple columns, by splitting wherever a separator character appears.

---

### Unite

![](img/tidy-18.png)

`unite()` is the inverse of `separate()`: it combines multiple columns into a single column.

---

# Let's recap

---

### tidy data

* Ensuring your data is correct, consistent, and usable for analyses can involve **cleaning the data** to identify any errors or missing values.
* As well as cleaning, creating analysis datasets often requires **restructuring the data**.
* Tidy data is data with a consistent form: **every variable goes in a column, and every column is a variable**.

---

# Where To Next?

---

### Where To Next?

* Download RStudio (IT”S FREE) -> https://rstudio.com/products/rstudio/download/ 
* Go and explore -> ggplot2 (https://www.r-graph-gallery.com/ggplot2-package.html)
* Start an online tutorial or course -> Data Science with R, http://robust-tools.djnavarro.net

---

### Useful resources

* R for Data Science -> https://r4ds.had.co.nz 
* RStudio Education -> https://education.rstudio.com 
* RStudio Essentials -> https://rstudio.com/collections/rstudio-essentials/ 
* Introduction to R -> https://www.datacamp.com/courses/free-introduction-to-r

---

### References

* Grolemund, G., & Wickham, H. (2018). R for Data Science. Retrieved from http://r4ds.had.co.nz 
* Tidyr Reference. Retrieved from https://tidyr.tidyverse.org/reference/index.html 
* Pivoting. Retrieved from https://tidyr.tidyverse.org/articles/pivot.html 
* Tidyr: Crucial Step Reshaping Data with R for Easier Analyses. Retrieved from http://www.sthda.com/english/wiki/tidyr-crucial-step-reshaping-data-with-r-for-easier-analyses

---