Data scientists can spend up to 40-45% of their time in projects just cleaning and preparing data for analysis.
Source: Kaggle ML & DS Survey, 2018. Machine Learning and Data Science Survey [n=23,859]. https://www.kaggle.com/kaggle/kaggle-survey-2018/
Issues:
Dirty data - fix errors, remove duplicates, syntax errors, standardisation, missing values.
Structure - summary tables created to favour presentation or data entry over analysis.
The Tidyverse (https://www.tidyverse.org) is a collection of R packages that work together to clean, process, model, and visualise data.
Tidyr (https://tidyr.tidyverse.org/) is a package that makes it easy to 'tidy' your data.
variable
is in its own columnobservation
in its own rowvalue
in its own cellSource: Tidy data. https://r4ds.had.co.nz/tidy-data.html#tidy-data-1
--
shape | colour | frequency |
---|---|---|
cube | red | two |
cube | blue | one |
cube | green | one |
cube | yellow | one |
cube | orange | one |
cube | purple | one |
Each observation is data about a coloured block. Three variables are recorded for each block: shape
, colour
and frequency
.
A common problem is a dataset where some of the column names are not names of variables, but values of a variable.
A common problem is a dataset where some of the column names are not names of variables, but values of a variable.
table4a
## # A tibble: 3 x 3## country `1999` `2000`## * <chr> <int> <int>## 1 Afghanistan 745 2666## 2 Brazil 37737 80488## 3 China 212258 213766
A common problem is a dataset where some of the column names are not names of variables, but values of a variable.
table4a
## # A tibble: 3 x 3## country `1999` `2000`## * <chr> <int> <int>## 1 Afghanistan 745 2666## 2 Brazil 37737 80488## 3 China 212258 213766
To tidy a dataset like this, we need to pivot the offending columns into a new pair of variables.
Source: Tidy data. https://r4ds.had.co.nz/tidy-data.html
table4a %>% pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
## # A tibble: 6 x 3## country year cases## <chr> <chr> <int>## 1 Afghanistan 1999 745## 2 Afghanistan 2000 2666## 3 Brazil 1999 37737## 4 Brazil 2000 80488## 5 China 1999 212258## 6 China 2000 213766
table4a %>% pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
## # A tibble: 6 x 3## country year cases## <chr> <chr> <int>## 1 Afghanistan 1999 745## 2 Afghanistan 2000 2666## 3 Brazil 1999 37737## 4 Brazil 2000 80488## 5 China 1999 212258## 6 China 2000 213766
names to
. Here it is year
.values to
. Here it’s cases
.table4
into a longer, tidy form.Source: Tidy data. https://r4ds.had.co.nz/tidy-data.html
Take table4a
: the column names 1999 and 2000 represent values of the year
variable.
The values in the 1999 and 2000 columns represent values of the cases
variable.
Each row represents two observations
not one.
pivot_wider()
is the opposite of pivot_longer()
. In the case of table2: an observation is a country in a year, but each observation is spread across two rows.
head(table2, 10)
## # A tibble: 10 x 4## country year type count## <chr> <int> <chr> <int>## 1 Afghanistan 1999 cases 745## 2 Afghanistan 1999 population 19987071## 3 Afghanistan 2000 cases 2666## 4 Afghanistan 2000 population 20595360## 5 Brazil 1999 cases 37737## 6 Brazil 1999 population 172006362## 7 Brazil 2000 cases 80488## 8 Brazil 2000 population 174504898## 9 China 1999 cases 212258## 10 China 1999 population 1272915272
Source: Tidy data. https://r4ds.had.co.nz/tidy-data.html
table2 %>% pivot_wider(names_from = type, values_from = count)
## # A tibble: 6 x 4## country year cases population## <chr> <int> <int> <int>## 1 Afghanistan 1999 745 19987071## 2 Afghanistan 2000 2666 20595360## 3 Brazil 1999 37737 172006362## 4 Brazil 2000 80488 174504898## 5 China 1999 212258 1272915272## 6 China 2000 213766 1280428583
table2 %>% pivot_wider(names_from = type, values_from = count)
## # A tibble: 6 x 4## country year cases population## <chr> <int> <int> <int>## 1 Afghanistan 1999 745 19987071## 2 Afghanistan 2000 2666 20595360## 3 Brazil 1999 37737 172006362## 4 Brazil 2000 80488 174504898## 5 China 1999 212258 1272915272## 6 China 2000 213766 1280428583
type
.count
.table2
into a “wider”, tidy form.Source: Tidy data. https://r4ds.had.co.nz/tidy-data.html
pivot_longer() makes wide tables narrower and longer; pivot_wider() makes long tables shorter and wider.
separate()
pulls apart one column into multiple columns, by splitting wherever a separator character appears.
Source: Tidy data. https://r4ds.had.co.nz/tidy-data.html
unite()
is the inverse of separate()
: it combines multiple columns into a single column.
Source: Tidy data. https://r4ds.had.co.nz/tidy-data.html
Keyboard shortcuts
↑, ←, Pg Up, k | Go to previous slide |
↓, →, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
Esc | Back to slideshow |