Web scraping

Aim

Use data scraping to collect posts from educational bloggers and examine the sentiments associated with the transition from classroom-based schooling to remote schooling as a result of the coronavirus (COVID-19) pandemic.

Scope

Before any data scraping can commence, it’s necessary to identify blogs of interest. This exploration deals only with the output of a single blogger; future work might incorporate more blog entries and/or new blogs to build a larger dataset. For ethical reasons, it is also necessary to check the data collection policies of the respective blogging platform (refer to https://scipunimelb.github.io/edu-disruption/slides/rc-apis.html#31).

Requirements for data collection

Confirm keywords (search terms) of interest
Identify blog(s) of interest
Review Terms of Service for use of data
Identify suitable method for data collection

Sample edu-bloggers

Use of data (Privacy & Ethics)

I began by looking at the following blogging platforms that are popular with teachers: edublogs.org, blogger.com and edutopia.org. Two platforms grant permission for data from their websites to be used and one does not.

Website	Allow data to be used	Mechanism	Details
edublogs.org	Yes	Scraping	https://edublogs.org/robots.txt
blogger.com	Yes	API	https://developers.google.com/blogger
edutopia.org	No	-	https://www.edutopia.org/terms-of-use

Method

This section steps through the process of identifying suitable blogs for scraping, through to scraping the blog using the R package rvest.

Methods for data collection

Common approaches to text mining include:

Web scraping refers to the process of automatically extracting textual data from web pages and other digital files (Ignatow & Mihalcea, 2018).
APIs provide data in a digital format that computers can understand and use, often referred to as machine-readable data (Sherratt, 2019).

Identify blogs of interest

You can find education blogs by using the following search terms in Google:

site:www.website.com and Search Term
for example site:edublogs.org covid pedagogy

The search results will include both blogs created by teachers and blogs created by education providers (e.g. Google for Education).
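The search pattern above can also be assembled programmatically, which helps when trying several keyword combinations. The following is a minimal sketch in base R; `build_site_query` is a hypothetical helper introduced only for illustration, and the resulting URL is meant to be opened in a browser rather than requested by a script.

```r
# Hypothetical helper: build a Google "site:" query from a domain and keywords.
build_site_query <- function(site, terms) {
  paste0("site:", site, " ", paste(terms, collapse = " "))
}

query <- build_site_query("edublogs.org", c("covid", "pedagogy"))
print(query)

# URL-encode the query so it can be pasted into a browser address bar.
search_url <- paste0("https://www.google.com/search?q=",
                     URLencode(query, reserved = TRUE))
print(search_url)
```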

Scrape text from blog

The website used in this instance was https://visualisingideas.edublogs.org/2020/03/.

The first step is to gather the HTML page from the blog.

# load rvest, which provides read_html() and html_nodes()
library(rvest)

# set up the URL for scraping
march20_archive <- "https://visualisingideas.edublogs.org/2020/03/"

# read the HTML code from the website
march_posts <- read_html(march20_archive)

A web page contains a lot of content, and we may not need all of it. Web scraping uses selectors to target only the pieces of content we are most interested in.

march_titles <- html_nodes(march_posts, '.entry-title')   # '.entry-title' is the blog post title
march_paras <- html_nodes(march_posts, 'p')   # 'p' selects the paragraphs

How many titles do we have? (NB: this URL is an archive of the blog posts that were written in March 2020.)

## [1] "Visualising a Discussion Prompt for Students on Studying Habits at Home"                 
## [2] "Double Book Post: “The House of Spirits” by Allende & “Where the Crawdads Sing” by Owens"
## [3] "Compare and Infer!"                                                                      
## [4] "Online Teaching for Students Who Never Read Instructions"                                
## [5] "“Women’s Day”, Being A Teacher & “The Mermaid Chair” by Sue Monk Kidd"
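The call that produced the listing above is not shown. One plausible way to get from the selected nodes to a character vector of titles is `html_text()` from rvest. The sketch below assumes that approach and uses a small inline HTML sample so it runs without network access; only the `.entry-title` class comes from the blog, and the rest of the markup is a hypothetical stand-in.

```r
library(rvest)

# Hypothetical stand-in for the archive page, so the sketch runs offline.
# Only the '.entry-title' class comes from the blog; the markup is assumed.
sample_page <- read_html('
  <html><body>
    <article><h2 class="entry-title"> Compare and Infer! </h2>
      <p>First paragraph.</p></article>
    <article><h2 class="entry-title">Online Teaching for Students Who Never Read Instructions</h2>
      <p>Second paragraph.</p></article>
  </body></html>')

title_nodes <- html_nodes(sample_page, ".entry-title")
para_nodes  <- html_nodes(sample_page, "p")

# html_text() returns the text inside each node as a character vector.
titles <- html_text(title_nodes, trim = TRUE)
paras  <- html_text(para_nodes, trim = TRUE)

length(titles)  # number of posts found on the page
print(titles)
```

Applied to `march_titles` from the real archive page, the same two steps (`length()` and `html_text()`) would give the count and the list of titles.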

Text as data

Examine the results of the new dataset and carry out some preliminary sentiment analysis.

What are the most commonly used words?

Most text analysis algorithms work by detecting patterns, such as words that occur with unusual frequency in a particular dataset. The simplest word frequency analysis counts the most common words in the text. Before counting, we can remove the stop words that are not informative (i.e. the, and, to, of, a, …).
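The counting step can be sketched in base R as follows. This is an illustration rather than the code used for this analysis: the sample sentences and the short stop-word list are hypothetical, and a real analysis would use the scraped paragraphs and a fuller stop-word list.

```r
# Hypothetical sample text standing in for the scraped paragraphs.
paras <- c(
  "Teaching online is a challenge for the students and the teacher.",
  "The students read the instructions and the teacher checks the work."
)

# A tiny illustrative stop-word list; a real analysis would use a fuller one.
stop_words <- c("the", "and", "to", "of", "a", "is", "for")

# Lower-case, split on anything that is not a letter or apostrophe, drop empties.
words <- unlist(strsplit(tolower(paras), "[^a-z']+"))
words <- words[nzchar(words)]

# Remove stop words, then count and sort by frequency.
content_words <- words[!words %in% stop_words]
word_counts <- sort(table(content_words), decreasing = TRUE)

print(head(word_counts, 5))
```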

What are the prevalent sentiments (positive, negative)?

Sentiment analysis assigns each word to one or more sentiments. The lexicon used here divides words into positive and negative sentiments.
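A lexicon lookup of this kind can be sketched in base R. The lexicon and tokens below are hypothetical toy data chosen for illustration. They are not the lexicon used in this analysis, and each word here maps to exactly one sentiment for simplicity.

```r
# Hypothetical toy lexicon; published lexicons are far larger.
lexicon <- data.frame(
  word = c("good", "great", "enjoy", "difficult", "stress", "worry"),
  sentiment = c("positive", "positive", "positive",
                "negative", "negative", "negative"),
  stringsAsFactors = FALSE
)

# Hypothetical tokens standing in for the cleaned words from the blog posts.
words <- c("students", "enjoy", "online", "lessons", "but", "worry",
           "about", "stress", "great", "difficult", "worry")

# Look each word up in the lexicon; words not in the lexicon give NA and are dropped.
matched <- lexicon$sentiment[match(words, lexicon$word)]
sentiment_counts <- table(matched[!is.na(matched)])

print(sentiment_counts)

# Net sentiment: positive matches minus negative matches.
net <- sentiment_counts[["positive"]] - sentiment_counts[["negative"]]
print(net)
```

Words that are absent from the lexicon contribute nothing to either count, so the result depends heavily on how well the lexicon covers the vocabulary of the blog.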

Next steps

Some options for further development.

Larger dataset

This proof of concept uses only a small dataset consisting of blog entries written over a one-month period by a single blogger. There is potential to extend the analysis by collecting a larger dataset. For example, a small-scale analysis could compare pre-covid and post-covid blog posts.
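One way to start on a pre/post comparison is to list the monthly archive pages to collect. The sketch below assumes that every monthly archive follows the same `/YYYY/MM/` pattern as the March 2020 URL used earlier. The date range and the March 2020 cut-off between "pre-covid" and "post-covid" are also assumptions made for illustration.

```r
# Assumption: every monthly archive follows the same /YYYY/MM/ pattern as
# the March 2020 URL used in this post.
base_url <- "https://visualisingideas.edublogs.org"

# Hypothetical date range around the move to remote schooling.
months <- seq(as.Date("2019-10-01"), as.Date("2020-08-01"), by = "month")

archives <- data.frame(
  month  = format(months, "%Y-%m"),
  url    = sprintf("%s/%s/", base_url, format(months, "%Y/%m")),
  # Assumed cut-off: posts from March 2020 onwards count as post-covid.
  period = ifelse(months < as.Date("2020-03-01"), "pre-covid", "post-covid"),
  stringsAsFactors = FALSE
)

print(archives)
```

Each URL could then be read in the same way as the March archive, ideally with a pause between requests (for example `Sys.sleep()`) to avoid loading the blogging platform unnecessarily.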

Different Lexicon

The lexicon could be replaced or expanded depending on areas of interest (e.g. teachers’ feelings (sentiments) about transitioning to remote learning and/or changes in their pedagogical approaches as they pivot to online teaching).

Other sources of data

Tweets could provide another source of data to examine text published by an educator through social media.

Tweet threads are a string of tweets that are linked together by the author and will relate to a topic in some way. When composing a thread, usually people indicate this by starting it with [Thread] or 1/n or some other way to highlight it’s a thread. (source: https://www.t4rstats.com/how-threads-work.html)

It is possible to combine and display the complete thread by including @threadreaderapp unroll as a comment.

See Display Twitter thread