QS scraper
Adds QS scraper and associated data (to address Issue #1 (closed))
Additions/use
Adds QS scraper function. This can be run with:
ur_scrape_qs(2015)
Accepts years 2013-2019. Note that some rankings span two years; in these cases the year refers to the latter of the two (so input 2014 to get the rankings for 2013/2014).
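The year convention can be sketched as a small helper. This is only an illustration: qs_year_from_label() is a hypothetical name and is not part of this merge request. It assumes the rankings label contains one or two four-digit years.

```r
# Hypothetical helper (not part of this MR): convert a rankings label such as
# "2013/2014" or "2015" into the single year that ur_scrape_qs() expects.
qs_year_from_label <- function(label) {
  # Extract every four-digit run, e.g. "2013/2014" -> c(2013, 2014)
  years <- as.integer(regmatches(label, gregexpr("[0-9]{4}", label))[[1]])
  if (length(years) == 0) {
    stop("No four-digit year found in: ", label)
  }
  # For rankings that span two years, use the latter of the two
  year <- max(years)
  # Supported range as stated in this description
  if (year < 2013 || year > 2019) {
    stop("Year ", year, " is outside the supported range 2013-2019")
  }
  year
}

qs_year_from_label("2013/2014")  # 2014
qs_year_from_label("2015")       # 2015
```

The result could then be passed straight to ur_scrape_qs().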
Alternatively, it can be called with a JSON URL, e.g.:
ur_scrape_qs(qs_json_url = "https://www.topuniversities.com/sites/default/files/qs-rankings-data/357051_indicators.txt")
Also adds ur_data_qs data set. This can be accessed with:
ur_data_qs
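Because the set of indicators differs between years, it can be useful to check which indicators are populated for each year. The sketch below uses a toy stand-in rather than ur_data_qs itself: the column names (year, university, overall, arts_humanities) are assumptions for illustration and may not match the real data set.

```r
# Toy stand-ins for two years of scraped data. Column names and values are
# illustrative assumptions, not the real ur_data_qs contents.
qs_2013 <- data.frame(year = 2013, university = c("A", "B"),
                      overall = c(90.1, 85.3), arts_humanities = c(88.0, 79.5))
qs_2019 <- data.frame(year = 2019, university = c("A", "B"),
                      overall = c(92.4, 86.0))

# Add any column a year lacks as NA so the frames can be stacked with rbind()
all_cols <- union(names(qs_2013), names(qs_2019))
pad_cols <- function(df, cols) {
  for (col in setdiff(cols, names(df))) df[[col]] <- NA
  df[, cols]
}
toy_qs <- rbind(pad_cols(qs_2013, all_cols), pad_cols(qs_2019, all_cols))

# Count non-missing values per indicator for each year
indicator_cols <- setdiff(all_cols, c("year", "university"))
availability <- aggregate(toy_qs[indicator_cols],
                          by = list(year = toy_qs$year),
                          FUN = function(x) sum(!is.na(x)))
availability
```

A count of zero for an indicator in a given year means that indicator was not available for that year.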
Notes
- Currently the ur_scrape_qs.R file includes hardcoded JSON URLs for each available year. These could potentially be moved into a Google Docs file, as per the Times scraper approach.
- Currently the scraper gets data for primary indicators only (i.e., those that appear on main rankings pages such as https://www.topuniversities.com/university-rankings/world-university-rankings/2019). It does not get Subject Rankings or Graduate Employability rankings (such as those found on specific university pages such as https://www.topuniversities.com/universities/massachusetts-institute-technology-mit).
- Different years include different selections of primary indicators, e.g., 2013 includes an 'Arts & Humanities' score, which is not included in the 2019 data. In the ur_data_qs data, indicator data that is unavailable for a given year will show as NA.
- Data scraped from the JSON URL includes columns with the suffixes _rank and _rank_d. I'm not completely sure what these refer to. It looks like _rank_d may be the rank number that is displayed on the website (i.e., in some cases there are duplicates within the same list, presumably where items are given the same rank), and _rank is the order in which the item actually appears in the list. However, this may not be correct. Let me know if you have any ideas; otherwise I can continue to explore the data further. I have included both for now but can potentially remove one of them if not required.
- Includes only very minimal cleaning of data beyond adjusting column names. This means there is some inconsistent coding of missing data etc. in ur_data_qs.
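To make the guessed difference between _rank and _rank_d concrete, here is a minimal sketch on toy data. It assumes the interpretation described above (which may be wrong): _rank is the position in the list and _rank_d is the displayed rank, with ties sharing a number. The column names score_rank and score_rank_d are made up for the example.

```r
# Toy scores in which two institutions tie on 95.0. Not real QS data.
toy <- data.frame(university = c("A", "B", "C", "D"),
                  score = c(98.2, 95.0, 95.0, 91.7))
toy <- toy[order(-toy$score), ]

# Assumed meaning of _rank: position in the list, always unique
toy$score_rank <- seq_len(nrow(toy))
# Assumed meaning of _rank_d: displayed rank, tied scores share the lowest position
toy$score_rank_d <- rank(-toy$score, ties.method = "min")

toy

# Under this interpretation, _rank never repeats while _rank_d may
anyDuplicated(toy$score_rank) == 0
anyDuplicated(toy$score_rank_d) > 0
```

Running the same two duplicate checks on the scraped columns would be one quick way to test whether this reading holds.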
TODO
Currently includes skeleton documentation for ur_data_qs only. Full documentation still to be added.