class: title-slide, top, left
background-image: url(img/rc_tidydata_2020.jpg)
background-size: contain

---

class: center, middle

# Doing research involves working with data

---

### Overview

* Whether you're using digital or more traditional methods, **most researchers work with data in some digital format**.

* For researchers, cleaning data and creating analysis datasets can be the most **tedious and time-consuming** part of conducting an analysis.

* Data comes in many forms, but in this tutorial we're talking about tabular data.

---

### Preparing data for analysis

Data scientists can spend up to **40-45%** of their time in projects just cleaning and preparing data for analysis.

.footnote[
Source: *Kaggle ML & DS Survey, 2018.* Machine Learning and Data Science Survey [n=23,859]. https://www.kaggle.com/kaggle/kaggle-survey-2018/
]

---

### Tidy data

* This tutorial is about changing the organisational structure of your data by transforming **untidy data into tidy data**.

---

### Advantages

* Save **time** when cleaning and preparing raw data to produce a dataset usable for analysis.

* Saving your process aids **reproducibility**, so you can easily repeat your analysis.

* Storing data in a **consistent structure** means it's easier to learn the tools that work with it, because they have an underlying uniformity.

---

### Aim of this tutorial

* Understand the principles of **good practice in data organisation**.

* Understand the concept of **tidy data** and the functions from the **tidyr** package.

* Understand how learning some **basic programming skills** can make your life easier.
---

class: center, middle

# Cleaning data

---

<br>
<br>

.midi[
.pull-left[
]

.pull-right[
Issues:

* __Spelling errors:__ Street, Road, Avenue, Parade...
* __Abbreviations:__ Str, St
* __Ambiguities:__ cnr Albert and Buckley Streets; Albert Street or Road
]
]

---

### What do we mean by untidy?

**Dirty data** - fixing errors, removing duplicates, correcting syntax errors, standardising values, handling missing values.

<br>

<br>

**Structure** - summary tables created to favour presentation or data entry over analysis.

---

class: center, middle

# Tidy data

---

<br>
<br>

.midi[
.left-column[
<br>
]

.right-column[
The __Tidyverse__ (https://www.tidyverse.org) is a collection of R packages that work together to clean, process, model, and visualise data.

<br>

__tidyr__ (https://tidyr.tidyverse.org/) is a package that makes it easy to 'tidy' your data.
]
]

---

### What do we mean by tidy data?

* Data that is **usable for analysis**.

* Data that is **easy to model, visualise and aggregate**.

---

### Three interrelated rules for a tidy dataset

--

* Each `variable` is in its own column
* Each `observation` is in its own row
* Each `value` is in its own cell

.footnote[
Source: *Tidy data.* https://r4ds.had.co.nz/tidy-data.html#tidy-data-1
]

---

<br>

.small[
.pull-left[
]

--

.pull-right[
### A simple example

shape | colour | frequency
------|--------|----------
cube  | red    | two
cube  | blue   | one
cube  | green  | one
cube  | yellow | one
cube  | orange | one
cube  | purple | one
]
]

<br>
<br>

Each observation is data about a coloured block.
Three variables are recorded for each block: `shape`, `colour` and `frequency`.

---

class: center, middle

# Common causes of untidy data

---

### Column headers are values

A common problem is a dataset where some of the column names are not names of variables, but values of a variable.

--

```r
table4a
```

```
## # A tibble: 3 x 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766
```

--

To tidy a dataset like this, we need to pivot the offending columns into a new pair of variables.

.footnote[
Source: *Tidy data.* https://r4ds.had.co.nz/tidy-data.html
]

---

### Pivoting (Longer)

```r
table4a %>%
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
```

```
## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Afghanistan 2000    2666
## 3 Brazil      1999   37737
## 4 Brazil      2000   80488
## 5 China       1999  212258
## 6 China       2000  213766
```

--

.instructions[
`pivot_longer()` is given three things:

* The set of columns whose **names are values, not variables**. In this example, those are the columns `1999` and `2000`.

* The name of the variable to move the column `names to`. Here it is `year`.

* The name of the variable to move the column `values to`. Here it is `cases`.
]

---

### Pivoting `table4a` into a longer, tidy form.

.footnote[
Source: *Tidy data.* https://r4ds.had.co.nz/tidy-data.html
]

???

Take `table4a`: the column names `1999` and `2000` represent values of the `year` variable. The values in the `1999` and `2000` columns represent values of the `cases` variable. Each row represents two observations, not one.

---

### An observation is scattered across multiple rows

`pivot_wider()` is the opposite of `pivot_longer()`.

In the case of `table2`, an observation is a country in a year, but each observation is spread across two rows.
```r
head(table2, 10)
```

```
## # A tibble: 10 x 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
```

.footnote[
Source: *Tidy data.* https://r4ds.had.co.nz/tidy-data.html
]

---

### Pivoting (Wider)

```r
table2 %>%
  pivot_wider(names_from = type, values_from = count)
```

```
## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583
```

--

.instructions[
* The column to take variable names from is `type`.

* The column to take values from is `count`.
]

---

### Pivoting `table2` into a “wider”, tidy form.

.footnote[
Source: *Tidy data.* https://r4ds.had.co.nz/tidy-data.html
]

???

`pivot_longer()` makes wide tables narrower and longer; `pivot_wider()` makes long tables shorter and wider.

---

### Separate

`separate()` pulls apart one column into multiple columns by splitting wherever a separator character appears.

.footnote[
Source: *Tidy data.* https://r4ds.had.co.nz/tidy-data.html
]

---

### Unite

`unite()` is the inverse of `separate()`: it combines multiple columns into a single column.
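A minimal sketch of both functions, assuming the tidyr package is installed and using its bundled example tables `table3` and `table5`, which are not shown elsewhere in this tutorial. The names `separated`, `united` and `full_year` are made up for illustration. Recent tidyr releases mark `separate()` as superseded, but it still works:

```r
library(tidyr)

# table3 ships with tidyr: its `rate` column stores "cases/population" as text.
# Split it at the "/" and let convert = TRUE turn the pieces into integers.
separated <- table3 %>%
  separate(rate, into = c("cases", "population"), sep = "/", convert = TRUE)

# table5 also ships with tidyr: the year is split into `century` and `year`.
# Glue the two columns back together with no separator between them.
# `full_year` is an illustrative name for the combined column.
united <- table5 %>%
  unite(full_year, century, year, sep = "")

separated
united
```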
.footnote[
Source: *Tidy data.* https://r4ds.had.co.nz/tidy-data.html
]

---

class: center, middle

# Let's recap

---

### Tidy data

* Ensuring your data is correct, consistent, and usable for analysis can involve **cleaning the data** to identify any errors or missing values.

* As well as cleaning, creating analysis datasets often requires **restructuring the data**.

* Tidy data is data with a consistent form: **every variable goes in a column, and every column is a variable**.

---

class: center, middle

# Where To Next?

---

### Where To Next?

* Download RStudio (it's free) -> https://rstudio.com/products/rstudio/download/

* Go and explore -> ggplot2 (https://www.r-graph-gallery.com/ggplot2-package.html)

* Start an online tutorial or course -> Data Science with R, http://robust-tools.djnavarro.net

---

### Useful resources

* R for Data Science -> https://r4ds.had.co.nz

* RStudio Education -> https://education.rstudio.com

* RStudio Essentials -> https://rstudio.com/collections/rstudio-essentials/

* Introduction to R -> https://www.datacamp.com/courses/free-introduction-to-r

---

### References

* Grolemund, G., & Wickham, H. (2018). R for Data Science. Retrieved from http://r4ds.had.co.nz

* Tidyr Reference. Retrieved from https://tidyr.tidyverse.org/reference/index.html

* Pivoting. Retrieved from https://tidyr.tidyverse.org/articles/pivot.html

* Tidyr: Crucial Step Reshaping Data with R for Easier Analyses. Retrieved from http://www.sthda.com/english/wiki/tidyr-crucial-step-reshaping-data-with-r-for-easier-analyses

---