Use data scraping to collect posts from educational bloggers and examine the sentiments associated with the transition from classroom-based schooling to remote schooling as a result of the coronavirus (COVID-19) pandemic.
Before any data scraping can commence, it’s necessary to identify blogs of interest. This particular exploration only deals with the output of a single blogger; future work might incorporate more blog entries and/or new blogs so as to build a larger dataset. For ethical reasons, it is necessary to refer to the data collection policies of the respective blogging platform (refer to https://scipunimelb.github.io/edu-disruption/slides/rc-apis.html#31).
I began by looking at the following blogging platforms that are popular with teachers: edublogs.org, blogger.com and edutopia.org. Two platforms grant permission for data from their websites to be used and one does not.
Website | Allow data to be used | Mechanism | Details |
---|---|---|---|
edublogs.org | Yes | Scraping | https://edublogs.org/robots.txt |
blogger.com | Yes | API | https://developers.google.com/blogger |
edutopia.org | No | - | https://www.edutopia.org/terms-of-use |
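As a rough illustration of what checking a robots.txt file involves, the sketch below tests a path against Disallow rules using base R. The robots.txt content here is hypothetical (it is not copied from edublogs.org), and the check is deliberately simplified: it ignores User-agent groups and Allow rules, so it should not be read as a full implementation of the protocol.

```r
# Hypothetical robots.txt content; a real file would be read from the site
robots_txt <- c(
  "User-agent: *",
  "Disallow: /wp-admin/",
  "Allow: /wp-admin/admin-ajax.php"
)

# Collect the path prefixes listed on Disallow lines
disallow_lines <- grep("^Disallow:", robots_txt, value = TRUE)
disallowed <- trimws(sub("^Disallow:", "", disallow_lines))
disallowed <- disallowed[disallowed != ""]

# Simplified check: a path counts as blocked if it starts with a disallowed prefix
is_blocked <- function(path) any(startsWith(path, disallowed))

is_blocked("/2020/03/")
is_blocked("/wp-admin/")
```

A robots.txt file only covers crawling rules; the platform's terms of use still need to be read separately.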
This section steps through the process of identifying suitable blogs for scraping, through to scraping the blog using the R package rvest.
Common approaches to text mining include:
You can find education blogs by using the following search terms in Google: site:www.website.com and Search Term
The search results will include both blogs created by teachers and blogs created by education providers (EG: Google for Education).
The website used in this instance was https://visualisingideas.edublogs.org/2020/03/.
The first step is to gather the HTML page from the blog.
# load the rvest package, which provides read_html() and html_nodes()
library(rvest)
# set up the url for scraping
march20_archive <- "https://visualisingideas.edublogs.org/2020/03/"
# read the HTML code from the website
march_posts <- read_html(march20_archive)
A web page contains a lot of content and we may not need everything. Web scraping targets only the pieces of content we are most interested in, here by selecting nodes with CSS selectors.
march_titles <- html_nodes(march_posts, '.entry-title') # '.entry-title' is the blog title
march_paras <- html_nodes(march_posts, 'p') # 'p' are the paragraphs
How many titles do we have? (NB: this URL is an archive of the blog posts that were written in March 2020.)
## [1] "Visualising a Discussion Prompt for Students on Studying Habits at Home"
## [2] "Double Book Post: “The House of Spirits” by Allende & “Where the Crawdads Sing” by Owens"
## [3] "Compare and Infer!"
## [4] "Online Teaching for Students Who Never Read Instructions"
## [5] "“Women’s Day”, Being A Teacher & “The Mermaid Chair” by Sue Monk Kidd"
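The step from selected nodes to a list of titles is not shown above. A minimal sketch of how it might look, using a small hypothetical HTML snippet in place of the downloaded archive page so it runs offline (the snippet and the names page, titles and paras are assumptions for illustration):

```r
library(rvest)

# Hypothetical HTML standing in for a downloaded archive page
page <- minimal_html('
  <article><h2 class="entry-title">First post</h2><p>Hello class.</p></article>
  <article><h2 class="entry-title">Second post</h2><p>Welcome back.</p></article>
')

# Select nodes by CSS selector, then pull out their text
titles <- html_text(html_nodes(page, ".entry-title"))
paras <- html_text(html_nodes(page, "p"))

# Number of posts found on the page, followed by the titles themselves
length(titles)
titles
```

With the real page, the same two calls would be applied to march_titles and march_paras.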
Examine the results of the new dataset and carry out some preliminary sentiment analysis.
Most text analysis algorithms involve detecting patterns, such as identifying words whose frequency is relatively unique to a particular dataset. The simplest word frequency analysis assesses the most common words in a text. We can first remove all the stop words that are not informative (ie: the, and, to, of, a, …).
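A minimal base R sketch of a word frequency count with stop-word removal follows. The sample sentences and the short stop-word list are assumptions made for illustration; a real analysis would use the scraped paragraphs and a fuller stop-word lexicon.

```r
# Hypothetical sample text standing in for scraped paragraphs
paras <- c(
  "Online teaching is a challenge for the students and the teacher.",
  "The students enjoy the online discussion."
)

# A tiny illustrative stop-word list (an assumption, not a standard lexicon)
stop_words_small <- c("the", "and", "to", "of", "a", "is", "for")

# Lower-case, split on anything that is not a letter, and drop empty tokens
words <- unlist(strsplit(tolower(paras), "[^a-z]+"))
words <- words[words != ""]

# Remove stop words, then count and sort by frequency
words <- words[!words %in% stop_words_small]
word_counts <- sort(table(words), decreasing = TRUE)
head(word_counts, 3)
```

Splitting on non-letters is a crude tokeniser: it breaks contractions and drops numbers, which may or may not matter depending on the text.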
Sentiment analysis assigns each word to one or more sentiments. The lexicon used here divides words into positive and negative sentiments.
Some options for further development.
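The lookup of words against a positive/negative lexicon can be sketched in base R as below. The tokens and the four-word lexicon are made up for illustration and are not the lexicon used in this analysis.

```r
# Hypothetical tokens, e.g. the output of a word frequency step
tokens <- c("enjoy", "challenge", "online", "enjoy", "struggle", "helpful")

# Toy lexicon (an assumption for illustration only)
lexicon <- data.frame(
  word = c("enjoy", "helpful", "challenge", "struggle"),
  sentiment = c("positive", "positive", "negative", "negative"),
  stringsAsFactors = FALSE
)

# match() looks each token up in the lexicon; unmatched tokens get NA
token_sentiment <- lexicon$sentiment[match(tokens, lexicon$word)]

# Count tokens per sentiment, ignoring words absent from the lexicon
sentiment_counts <- table(token_sentiment, useNA = "no")
sentiment_counts
```

Words missing from the lexicon (here "online") simply drop out of the counts, so the choice of lexicon strongly shapes the result.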
This proof of concept only uses a small dataset consisting of blog entries over a one month period by a single blogger. There is potential to extend this analysis through collecting a larger dataset. A small scale analysis could be done comparing pre-covid and post-covid blog posts.
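One way a larger dataset might be collected is by generating a URL for each monthly archive. The sketch below assumes that other months follow the same /YYYY/MM/ pattern as the March 2020 archive used above, and the January to June range is a hypothetical choice; neither has been checked against the blog.

```r
# Base address of the blog used in this exploration
base_url <- "https://visualisingideas.edublogs.org"

# Hypothetical date range spanning the move to remote schooling
months <- seq(as.Date("2020-01-01"), as.Date("2020-06-01"), by = "month")

# Build one archive URL per month, assuming the /YYYY/MM/ pattern holds
archive_urls <- sprintf("%s/%s/", base_url, format(months, "%Y/%m"))
archive_urls
```

Each URL could then be passed to read_html() in turn, ideally with a pause between requests.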
The lexicon could be replaced or expanded depending on areas of interest (EG: teachers’ feelings (sentiments) about transitioning to remote learning and/or changes in their pedagogical approaches as they pivot to online teaching).
Tweets could provide another source of data to examine text published by an educator through social media.
Tweet threads are a string of tweets that are linked together by the author and will relate to a topic in some way. When composing a thread, people usually indicate this by starting it with [Thread] or 1/n or some other way to highlight it’s a thread. (source: https://www.t4rstats.com/how-threads-work.html)
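If tweets were collected as text, the two markers mentioned above could be detected with a regular expression. This is a sketch on made-up tweets; it only recognises a leading [Thread] or a leading counter such as 1/n or 1/5, and would miss any other convention an author uses.

```r
# Hypothetical tweets for illustration
tweets <- c(
  "[Thread] Moving my class online this week",
  "1/n Some thoughts on remote schooling",
  "1/5 Tips for running a video lesson",
  "Great lesson today!"
)

# Match a leading "[Thread]" or a leading counter such as "1/n" or "1/5"
thread_pattern <- "^\\s*(\\[thread\\]|\\d+/(\\d+|n)\\b)"
is_thread_start <- grepl(thread_pattern, tweets, ignore.case = TRUE, perl = TRUE)
is_thread_start
```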
It is possible to combine and display the complete thread by including @threadreaderapp unroll as a comment.