+ - 0:00:00
Notes for current slide
Notes for next slide
1 / 36

Doing research involves working with data

2 / 36

Overview

  • Whether you're using digital or more traditional methods, most researchers work with data in some digital format.
  • As researchers cleaning data and creating analysis datasets can be the most tedious and time-consuming aspect of conducting analysis.
  • Data comes in many forms but in this context we’re talking about Tabular data.
3 / 36

Preparing data for analysis

Data scientists can spend up to 40-45% of their time in projects just cleaning and preparing data for analysis.

Source: Kaggle ML & DS Survey, 2018. Machine Learning and Data Science Survey [n=23,859]. https://www.kaggle.com/kaggle/kaggle-survey-2018/

4 / 36

Tidy data

  • This tutorial is about changing the organisational structure of your data by transforming untidy data into tidy data.

5 / 36

Advantages

  • Save time cleaning/preparing raw data to prepare a dataset usable for analysis.
  • Saving your process aids reproducibility so you can easily repeat your analysis.
  • Storing data in a consistent structure means its easier to learn the tools to work with it because they have an underlying uniformity.
6 / 36

Aim of this tutorial

  • Understand principles of good practise in data organisation.
  • Understanding the concept of Tidy data and the functions from the tidyr package.
  • Understand how learning some basic programming skills can make your life easier.
7 / 36

Cleaning data

8 / 36



Issues:

  • Spelling errors: Street, Road, Avenue, Parade..
  • Abbreviations: Str, St
  • Ambiguities: cnr Albert and Buckley Streets; Albert Street or Road
9 / 36

Why do we mean by untidy?

Dirty data - fix errors, remove duplicates, syntax errors, standardisation, missing values.



Structure - summary tables created to favour presentation or data entry over analysis.

10 / 36

Tidy data

11 / 36




The Tidyverse (https://www.tidyverse.org) is a collection of R packages that work together to clean, process, model, and visualise data.


Tidyr (https://tidyr.tidyverse.org/) is a package that makes it easy to 'tidy' your data.

12 / 36

What do we mean by Tidy data

  • Data that is usable for analysis.
  • Data that is easy to model, visualise and aggregate.
13 / 36

Three interrelated rules for a tidy dataset

14 / 36

Three interrelated rules for a tidy dataset

  • Each variable is in its own column
  • Each observation in its own row
  • Each value in its own cell
15 / 36


--

A simple example

shape colour frequency
cube red two
cube blue one
cube green one
cube yellow one
cube orange one
cube purple one



Each observation is data about a coloured block. Three variables are recorded for each block: shape, colour and frequency.

16 / 36

Common causes of Untidy data

17 / 36

Column headers are values

A common problem is a dataset where some of the column names are not names of variables, but values of a variable.

18 / 36

Column headers are values

A common problem is a dataset where some of the column names are not names of variables, but values of a variable.

table4a
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
19 / 36

Column headers are values

A common problem is a dataset where some of the column names are not names of variables, but values of a variable.

table4a
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766

To tidy a dataset like this, we need to pivot the offending columns into a new pair of variables.

20 / 36

Pivoting (Longer)

table4a %>%
pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
## # A tibble: 6 x 3
## country year cases
## <chr> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Afghanistan 2000 2666
## 3 Brazil 1999 37737
## 4 Brazil 2000 80488
## 5 China 1999 212258
## 6 China 2000 213766
21 / 36

Pivoting (Longer)

table4a %>%
pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
## # A tibble: 6 x 3
## country year cases
## <chr> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Afghanistan 2000 2666
## 3 Brazil 1999 37737
## 4 Brazil 2000 80488
## 5 China 1999 212258
## 6 China 2000 213766
  • The set of columns whose names are values, not variables. In this example, those are the columns 1999 and 2000.
  • The name of the variable to move the column names to. Here it is year.
  • The name of the variable to move the column values to. Here it’s cases.
22 / 36

Pivoting table4 into a longer, tidy form.

23 / 36

Take table4a: the column names 1999 and 2000 represent values of the year variable.

The values in the 1999 and 2000 columns represent values of the cases variable.

Each row represents two observations not one.

An observations is scattered across multiple rows

pivot_wider() is the opposite of pivot_longer(). In the case of table2: an observation is a country in a year, but each observation is spread across two rows.

head(table2, 10)
## # A tibble: 10 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
24 / 36

Pivoting (Wider)

table2 %>%
pivot_wider(names_from = type, values_from = count)
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
25 / 36

Pivoting (Wider)

table2 %>%
pivot_wider(names_from = type, values_from = count)
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
  • The column to take variable names from is type.
  • The column to take values from is count.
26 / 36

Pivoting table2 into a “wider”, tidy form.

27 / 36

pivot_longer() makes wide tables narrower and longer; pivot_wider() makes long tables shorter and wider.

Separate

separate() pulls apart one column into multiple columns, by splitting wherever a separator character appears.

28 / 36

Unite

unite() is the inverse of separate(): it combines multiple columns into a single column.

29 / 36

Let's recap

30 / 36

tidy data

  • Ensuring your data is correct, consistent, and usable for analyses can involve cleaning the data to identify any errors or missing values.
  • As well as cleaning, creating analysis datasets often requires restructuring the data.
  • Tidy data is data with a consistent form: every variable goes in a column, and every column is a variable.
31 / 36

Where To Next?

32 / 36

Where To Next?

33 / 36

Useful resources

34 / 36

References

35 / 36
36 / 36

Doing research involves working with data

2 / 36
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow