
Qs scraper

Stewart Craig requested to merge qs-scraper into master

Adds QS scraper and associated data (to address Issue #1 (closed))

Additions/use

Adds QS scraper function. This can be run with:

ur_scrape_qs(2015)

Accepts years 2013 to 2019. Note that for some years the rankings span two years; in these cases the year refers to the latter of the two (so input 2014 to get rankings for 2013/2014).

Alternatively, it can be called with a JSON URL, e.g.:

ur_scrape_qs(qs_json_url = "https://www.topuniversities.com/sites/default/files/qs-rankings-data/357051_indicators.txt")

Also adds ur_data_qs data set. This can be accessed with:

ur_data_qs
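
Because the set of indicators differs between years, a quick availability check per year may be useful. The following is only a sketch on toy data: the data frame toy_qs and its column names (year, overall_score, arts_humanities_score) are hypothetical stand-ins and are not taken from ur_data_qs itself.

```r
# Toy stand-in for ur_data_qs; the column names here are hypothetical.
toy_qs <- data.frame(
  year = c(2013, 2013, 2019, 2019),
  overall_score = c(100, 99.2, 100, 98.7),
  arts_humanities_score = c(95.1, 90.3, NA, NA)
)

# Every column other than year is treated as an indicator.
indicator_cols <- setdiff(names(toy_qs), "year")

# For each indicator, check per year whether any non-missing value exists.
# The result is a logical matrix with years as rows and indicators as columns.
available <- sapply(indicator_cols, function(col) {
  tapply(toy_qs[[col]], toy_qs$year, function(x) any(!is.na(x)))
})

print(available)
```

The same approach should carry over to the real data once the actual year and indicator column names are substituted in.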

Notes

  • Currently ur_scrape_qs.R file includes hardcoded JSON URLs for each available year. Can potentially shift these into Google Docs file as per Times scraper approach.
  • Currently scraper gets data for primary indicators only (i.e., those that appear on main rankings pages such as https://www.topuniversities.com/university-rankings/world-university-rankings/2019). Does not get Subject Rankings or Graduate Employability rankings (such as those found on specific university pages such as https://www.topuniversities.com/universities/massachusetts-institute-technology-mit).
  • Different years include different selections of primary indicators, e.g., 2013 includes 'Arts & Humanities' score. This is not included in 2019 data. In ur_data_qs data, indicator data that is unavailable for a given year will show as NA.
  • Data scraped from JSON URL includes columns with suffix _rank and _rank_d. I'm not completely sure what these refer to. It looks like _rank_d may be the rank that is displayed on the website (i.e., in some cases there are duplicates within the same list, presumably where items are given the same rank), and _rank is the order in which the item actually appears in the list. However, this may not be correct. Let me know if you have any ideas, otherwise I can continue to explore the data further. I have included both for now but can potentially remove one of them if not required.
  • Includes only very minimal cleaning of data beyond adjusting column names. This means there is some inconsistent coding of missing data etc in ur_data_qs.
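
One way to test the guess about _rank and _rank_d above is sketched below. It uses toy data only: the data frame toy and the column names overall_rank and overall_rank_d are hypothetical, and the real column names and types may differ.

```r
# Toy data standing in for one year of scraped results; the column names
# overall_rank and overall_rank_d are hypothetical.
toy <- data.frame(
  title          = c("Univ A", "Univ B", "Univ C", "Univ D"),
  overall_rank   = c(1, 2, 3, 4),
  overall_rank_d = c("1", "2", "2", "4"),
  stringsAsFactors = FALSE
)

# If _rank is the list order, it should contain no duplicates.
rank_is_unique <- !any(duplicated(toy$overall_rank))

# If _rank_d is the displayed rank, ties should show up as repeated values.
rank_d_has_ties <- any(duplicated(toy$overall_rank_d))

# Rows where the two columns disagree are the ones worth inspecting by hand.
disagree <- toy[as.character(toy$overall_rank) != toy$overall_rank_d, ]

print(rank_is_unique)
print(rank_d_has_ties)
print(disagree)
```

If the real data shows unique _rank values and repeated _rank_d values only where the website displays a tie, that would support the interpretation above, though it would not confirm it.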

TODO

Currently includes skeleton documentation for ur_data_qs only. Full documentation still to be added.
