unirank merge requests
https://gitlab.erc.monash.edu.au/oscar.lane/unirank/-/merge_requests
Updated: 2021-09-10T13:34:22+10:00

https://gitlab.erc.monash.edu.au/oscar.lane/unirank/-/merge_requests/10
rewrite THE to not require google sheet for JSON URLs, include 2022 ranking
Updated: 2021-09-10T13:34:22+10:00
Author: Phillip Oakley

Times Higher Ed rankings have been released for 2022. THE scraper doesn't work as it relies on an external document listing JSON URLs. This update finds the URLs as part of the scraping process, and removes the need for the external list. Also includes scraped 2022 data.

https://gitlab.erc.monash.edu.au/oscar.lane/unirank/-/merge_requests/9
Rewrite ARWU global and subject scrapers to use 2021 website structure
Updated: 2021-09-08T11:00:09+10:00
Author: Phillip Oakley

ARWU changed its website structure, so all scrapers stopped working. This is a rewrite of those, and an update of the data.

https://gitlab.erc.monash.edu.au/oscar.lane/unirank/-/merge_requests/8
Fix times scraping to use fromJSON
Updated: 2021-03-18T21:59:56+11:00
Author: Stewart Craig

`ur_scrape_times` previously used `httr::GET` to pull in data from `times_json_url`.
This started throwing errors as data was no longer being neatly pulled out of the response into a data frame, potentially due to an update of the data format from Times Higher Education (THE).
Updated to use `jsonlite::fromJSON` to pull in `times_json_url` data instead.

Assignee: Stewart Craig

https://gitlab.erc.monash.edu.au/oscar.lane/unirank/-/merge_requests/7
Update QS data format
Updated: 2021-03-04T10:33:51+11:00
Author: Stewart Craig

QS rankings made minor changes to data format, which results in columns not being named correctly in output.
Updated approach to column naming so that correct columns now show.

Assignee: Stewart Craig

https://gitlab.erc.monash.edu.au/oscar.lane/unirank/-/merge_requests/6
Fix arwu subject
Updated: 2020-07-14T15:52:03+10:00
Author: Stewart Craig

Includes updates to fix ARWU subject scrape bug (#12) and updates to data sets and other minor cleaning.
Details of updates:
* `ur_scrape_arwu_subject` fixed to handle scraping of 2020 (fixed #12). A change in the format of the 2020 subject ranks on the ARWU website (there is now an additional National/Regional Rank column) caused an error when attaching column names. Added function `get_subject_colnames`, which checks which year data is requested for and returns the appropriate column names. As part of this, it also returns the new Q1 metric column name, which has replaced the PUB metric
* Updated data sets for Times and QS to include 2020 data
* Fixes bug in Times example documentation caused by outdated JSON URLs for rankings. Replaced with the correct JSON URLs
* Minor updates to documentation as relevant
* Replace `as_data_frame` with `as_tibble` (`as_data_frame` is now deprecated)

Assignee: Stewart Craig

https://gitlab.erc.monash.edu.au/oscar.lane/unirank/-/merge_requests/5
Fix ARWU 2019 scraping
Updated: 2019-08-16T11:06:08+10:00
Author: Stewart Craig

Fixes #11. Errors were occurring when scraping
overall ARWU 2019 data. This was due to
ARWU website using 'null' instead of blanks for
N&S scores causing columns to be read in as
characters. Updated to replace 'null' with
NA and reparse columns.

Assignee: Stewart Craig

https://gitlab.erc.monash.edu.au/oscar.lane/unirank/-/merge_requests/4
Fix QS scrape column selection
Updated: 2019-02-08T12:01:19+11:00
Author: Stewart Craig

Fix for #9
Errors had been occurring in QS scraping for some subjects where indicators were showing as '-' on ranking website (example [here](https://www.topuniversities.com/university-rankings/university-subject-rankings/2018/art-design)).
This led to an error in QS scraping when attempting to select relevant columns from the data frame based on indicator names. The names exist in the indicator list but not in the data frame, so an error was thrown because the columns could not be found.
Updated approach to selecting columns to prevent error occurring if column does not exist.
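
The description above does not say how the selection was changed, so the following is only a minimal sketch of one tolerant approach in base R: keep just those requested columns that are actually present. The data frame, the indicator names and the helper `select_existing` are hypothetical and are not taken from the unirank code.

```r
# Hypothetical scraped data: "employer_reputation" is absent, as can happen
# when the ranking website shows '-' for an indicator.
scraped <- data.frame(
  institution = c("Uni A", "Uni B"),
  overall = c(91.2, 88.5),
  academic_reputation = c(95.0, 90.1),
  stringsAsFactors = FALSE
)

# Hypothetical indicator list, which still names the absent column.
indicators <- c("overall", "academic_reputation", "employer_reputation")

# Strict selection such as scraped[, indicators] stops with
# "undefined columns selected" because one name is not in the data frame.

# Tolerant selection: keep only the wanted columns that exist,
# in the order they were requested.
select_existing <- function(df, wanted) {
  df[, intersect(wanted, names(df)), drop = FALSE]
}

selected <- select_existing(scraped, c("institution", indicators))
```

With this data, `selected` keeps `institution`, `overall` and `academic_reputation` and silently skips the absent indicator. If the package uses dplyr for selection, `dplyr::any_of()` offers similar behaviour, though whether the fix uses it is not stated here.
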
Relevant test also added for example subject/year combination where this had been occurring.

Assignee: Oscar Lane

https://gitlab.erc.monash.edu.au/oscar.lane/unirank/-/merge_requests/3
QS scrape subjects
Updated: 2019-02-07T10:26:43+11:00
Author: Stewart Craig

Updates QS scraping to get JSON URLs from Google Sheet. This reflects approach already used for Times scraping.
Additional updates include:
* Add `ranking` argument to QS scrape to specify specific type of rank data to scrape. Allows scraping of subject rankings
* Add `ur_show_available_qs_data()` function to return table of JSON URLs available in Google Sheet
* Update `ur_data_qs.rda` to include years 2016 - 2019 only, to reflect years available in JSON URLs Google Sheet

Assignee: Oscar Lane

https://gitlab.erc.monash.edu.au/oscar.lane/unirank/-/merge_requests/2
Qs scraper
Updated: 2018-11-21T14:34:45+11:00
Author: Stewart Craig

Adds QS scraper and associated data (to address Issue #1)
## Additions/use
Adds QS scraper function. This can be run with:
```
ur_scrape_qs(2015)
```
Accepts years 2013 - 2019. Note that for some years the rankings ran over two years; in these cases the year refers to the latter of the two years (so input 2014 to get rankings for 2013/2014).
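
As a small illustration of the year convention just described, the sketch below maps an edition label to the value that would be passed as the year. The helper `qs_year_from_label` is hypothetical (it is not part of unirank) and assumes labels are written with full four-digit years separated by `/`.

```r
# Hypothetical helper: return the later year of an edition label such as
# "2013/2014", or the year itself for a single-year label such as "2019".
qs_year_from_label <- function(label) {
  years <- as.integer(strsplit(label, "/", fixed = TRUE)[[1]])
  max(years)
}

two_year_edition <- qs_year_from_label("2013/2014")
single_year_edition <- qs_year_from_label("2019")
```
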
Alternatively, it can be called with a JSON URL, e.g.:
```
ur_scrape_qs(qs_json_url = "https://www.topuniversities.com/sites/default/files/qs-rankings-data/357051_indicators.txt")
```
Also adds ur_data_qs data set. This can be accessed with:
```
ur_data_qs
```
## Notes
* Currently ur_scrape_qs.R file includes hardcoded JSON URLs for each available year. Can potentially shift these into Google Docs file as per Times scraper approach.
* Currently scraper gets data for primary indicators only (i.e., those that appear on main rankings pages such as https://www.topuniversities.com/university-rankings/world-university-rankings/2019). Does not get Subject Rankings or Graduate Employability rankings (such as those found on specific university pages such as https://www.topuniversities.com/universities/massachusetts-institute-technology-mit).
* Different years include different selections of primary indicators, e.g., 2013 includes 'Arts & Humanities' score. This is not included in 2019 data. In `ur_data_qs` data, indicator data that is unavailable for a given year will show as NA.
* Data scraped from JSON URL includes columns with suffix `_rank` and `_rank_d`. I'm not completely sure what these refer to. It looks like `_rank_d` may be the rank number that is displayed on the website (i.e., in some cases there are duplicates within the same list, presumably where items are given the same rank), and `_rank` is the order in which the item actually appears in the list. However, this may not be correct. Let me know if you have any ideas, otherwise I can continue to explore the data further. I have included both for now but can potentially remove one of them if not required.
* Includes only very minimal cleaning of data beyond adjusting column names. This means there is some inconsistent coding of missing data etc in `ur_data_qs`.
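
To illustrate the note above about indicators that differ between years, here is a minimal base R sketch of combining per-year data so that an indicator missing from one year shows as NA. It is not the code used to build `ur_data_qs`; the data frames, column names and the helper `bind_fill` are hypothetical, and it assumes the absent indicators are numeric scores.

```r
# Hypothetical per-year frames; the column names are illustrative only.
qs_2013 <- data.frame(
  year = 2013L,
  institution = c("Uni A", "Uni B"),
  overall = c(99.1, 97.4),
  arts_humanities = c(95.0, 92.3),
  stringsAsFactors = FALSE
)
qs_2019 <- data.frame(
  year = 2019L,
  institution = "Uni A",
  overall = 98.7,
  stringsAsFactors = FALSE
)

# Row-bind data frames whose columns differ, padding absent columns with NA.
bind_fill <- function(...) {
  frames <- list(...)
  all_cols <- unique(unlist(lapply(frames, names)))
  padded <- lapply(frames, function(df) {
    for (col in setdiff(all_cols, names(df))) {
      df[[col]] <- NA_real_  # assumption: absent indicators are numeric scores
    }
    df[, all_cols, drop = FALSE]
  })
  do.call(rbind, padded)
}

combined <- bind_fill(qs_2013, qs_2019)
```

Here `combined` has three rows, and the 2019 row has NA in `arts_humanities`. Padding explicitly keeps the sketch free of extra dependencies; `dplyr::bind_rows()` gives comparable NA filling if dplyr is already in use.
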
## TODO
Currently includes skeleton documentation for `ur_data_qs` only. Full documentation still to be added.

Assignee: Oscar Lane