unirank merge requests

https://gitlab.erc.monash.edu.au/oscar.lane/unirank/-/merge_requests 2021-09-10T13:34:22+10:00

https://gitlab.erc.monash.edu.au/oscar.lane/unirank/-/merge_requests/10 rewrite THE to not require google sheet for JSON URLs, include 2022 ranking 2021-09-10T13:34:22+10:00 Phillip Oakley

rewrite THE to not require google sheet for JSON URLs, include 2022 ranking

Times Higher Ed rankings have been released for 2022. The THE scraper no longer works because it relies on an external document listing JSON URLs. This update finds the URLs as part of the scraping process, which removes the need for the external list. Also includes scraped 2022 data.

https://gitlab.erc.monash.edu.au/oscar.lane/unirank/-/merge_requests/9 Rewrite ARWU global and subject scrapers to use 2021 website structure 2021-09-08T11:00:09+10:00 Phillip Oakley

Rewrite ARWU global and subject scrapers to use 2021 website structure

ARWU changed its website structure, so all of the ARWU scrapers stopped working. This rewrites them and updates the data.

https://gitlab.erc.monash.edu.au/oscar.lane/unirank/-/merge_requests/8 Fix times scraping to use fromJSON 2021-03-18T21:59:56+11:00 Stewart Craig

Fix times scraping to use fromJSON

`ur_scrape_times` previously used `httr::GET` to pull in data from `times_json_url`. This started throwing errors because the data was no longer being parsed cleanly from the response into a data frame, potentially due to a change in the data format from Times Higher Education (THE). Updated to use `jsonlite::fromJSON` to pull in the `times_json_url` data instead.

Stewart Craig Stewart Craig

https://gitlab.erc.monash.edu.au/oscar.lane/unirank/-/merge_requests/7 Update QS data format 2021-03-04T10:33:51+11:00 Stewart Craig

Update QS data format

QS rankings made minor changes to the data format, which resulted in columns not being named correctly in the output. Updated the approach to column naming so that the correct column names now show.

Stewart Craig Stewart Craig

https://gitlab.erc.monash.edu.au/oscar.lane/unirank/-/merge_requests/6 Fix arwu subject 2020-07-14T15:52:03+10:00 Stewart Craig

Fix arwu subject

Fixes the ARWU subject scrape bug (#12), updates data sets, and includes other minor cleaning.

Details of updates:

* `ur_scrape_arwu_subject` fixed to handle scraping of 2020 data (fixes #12). The format of the 2020 subject ranks on the ARWU website changed to include an additional National/Regional Rank column, which caused an error when attaching column names. Added function `get_subject_colnames`, which checks which year the data is requested for and returns the appropriate column names. As part of this, it also returns the new Q1 metric column name that has replaced the PUB metric.
* Updated data sets for Times and QS to include 2020 data.
* Fixed a bug in the Times example documentation caused by outdated JSON URLs for rankings. Replaced them with the correct JSON URLs.
* Minor updates to documentation as relevant.
* Replaced `as_data_frame` with `as_tibble` (`as_data_frame` is now deprecated).

Stewart Craig Stewart Craig

https://gitlab.erc.monash.edu.au/oscar.lane/unirank/-/merge_requests/5 Fix ARWU 2019 scraping 2019-08-16T11:06:08+10:00 Stewart Craig

Fix ARWU 2019 scraping

Fixes #11. Errors were occurring when scraping overall ARWU 2019 data. The ARWU website uses 'null' instead of blanks for N&S scores, which caused columns to be read in as characters. Updated to replace 'null' with NA and reparse the columns.

Stewart Craig Stewart Craig

https://gitlab.erc.monash.edu.au/oscar.lane/unirank/-/merge_requests/4 Fix QS scrape column selection 2019-02-08T12:01:19+11:00 Stewart Craig

Fix QS scrape column selection

Fix for #9.

Errors had been occurring in QS scraping for some subjects where indicators were showing as '-' on the ranking website (example [here](https://www.topuniversities.com/university-rankings/university-subject-rankings/2018/art-design)). The scraper selects the relevant columns from the data frame based on indicator names. In these cases the names exist in the indicator list but not in the data frame, so an error was thrown because the columns could not be found. Updated the approach to selecting columns so that no error occurs if a column does not exist. Also added a test for an example subject/year combination where this had been occurring.

Bug Oscar Lane Oscar Lane

https://gitlab.erc.monash.edu.au/oscar.lane/unirank/-/merge_requests/3 QS scrape subjects 2019-02-07T10:26:43+11:00 Stewart Craig

QS scrape subjects

Updates QS scraping to get JSON URLs from a Google Sheet. This reflects the approach already used for Times scraping.

Additional updates include:

* Add `ranking` argument to QS scrape to specify the type of rank data to scrape. Allows scraping of subject rankings.
* Add `ur_show_available_qs_data()` function to return a table of the JSON URLs available in the Google Sheet.
* Update `ur_data_qs.rda` to include years 2016 - 2019 only, to reflect the years available in the JSON URLs Google Sheet.

Oscar Lane Oscar Lane

https://gitlab.erc.monash.edu.au/oscar.lane/unirank/-/merge_requests/2 Qs scraper 2018-11-21T14:34:45+11:00 Stewart Craig

Qs scraper

Adds QS scraper and associated data (to address Issue #1)

## Additions/use

Adds QS scraper function. This can be run with:

```
ur_scrape_qs(2015)
```

Accepts years 2013 - 2019. Note that for some years the rankings ran over two years. In these cases the year refers to the latter of the two years (so you would input 2014 to get rankings for 2013/2014).

Alternatively, it can be called using a JSON URL, e.g.:

```
ur_scrape_qs(qs_json_url = "https://www.topuniversities.com/sites/default/files/qs-rankings-data/357051_indicators.txt")
```

Also adds the `ur_data_qs` data set. This can be accessed with:

```
ur_data_qs
```

## Notes

* Currently the ur_scrape_qs.R file includes hardcoded JSON URLs for each available year. These can potentially be shifted into a Google Docs file as per the Times scraper approach.
* Currently the scraper gets data for primary indicators only (i.e., those that appear on main rankings pages such as https://www.topuniversities.com/university-rankings/world-university-rankings/2019). It does not get Subject Rankings or Graduate Employability rankings (such as those found on specific university pages such as https://www.topuniversities.com/universities/massachusetts-institute-technology-mit).
* Different years include different selections of primary indicators, e.g., 2013 includes an 'Arts & Humanities' score. This is not included in the 2019 data. In the `ur_data_qs` data, indicator data that is unavailable for a given year will show as NA.
* Data scraped from the JSON URL includes columns with the suffixes `_rank` and `_rank_d`. I'm not completely sure what these refer to. It looks like `_rank_d` may be the rank number that is displayed on the website (i.e., in some cases there are duplicates within the same list, presumably where items are given the same rank), and `_rank` is the order in which the item actually appears in the list. However, this may not be correct. Let me know if you have any ideas, otherwise I can continue to explore the data further. I have included both for now but can potentially remove one of them if not required.
* Includes only very minimal cleaning of data beyond adjusting column names. This means there is some inconsistent coding of missing data etc. in `ur_data_qs`.

## TODO

Currently includes skeleton documentation for ur_data_qs only. Full documentation still to be added.

Feature Oscar Lane Oscar Lane