Add scraping function for ARWU Subject Rankings
Thanks, Oscar, for this clever (and much-needed) R package.
At the Monash Business School, we've previously scraped data from the ARWU website with the rvest package and, in other cases, with Python. I'd be interested to see whether your package could incorporate a scraper for the subject field rankings. The fields we're interested in are:
- Economics
- Business Administration
- Finance
- Management
The 2017 and 2018 results can be toggled through a dropdown box on the webpage. See the Economics page for an illustration: http://www.shanghairanking.com/Shanghairanking-Subject-Rankings/economics.html
This doesn't have to be limited to business-related subject rankings, as such a scraper could benefit other faculties too. Below is some R code that I've used to scrape the results previously. Admittedly, it may be incomplete (although it does scrape successfully), so your assistance would be appreciated!
# Load packages
library(tidyverse)
library(rvest)
library(countrycode) # to assign continents/regions
library(reshape2)
# ARWU subject ranking URLs, with a short name and a subject label for each
url <- data.frame(`URL` = c("http://www.shanghairanking.com/Shanghairanking-Subject-Rankings/economics.html",
                            "http://www.shanghairanking.com/Shanghairanking-Subject-Rankings/management.html",
                            "http://www.shanghairanking.com/Shanghairanking-Subject-Rankings/finance.html"),
                  `URL Name` = c("arwu_2017_eco",
                                 "arwu_2017_mgmt",
                                 "arwu_2017_fin"),
                  `Subject Name` = c("Economics",
                                     "Management",
                                     "Finance"),
                  check.names = FALSE,      # keep the column names with spaces as written
                  stringsAsFactors = FALSE) # keep URLs and names as character, not factor
# Loop through the URLs, scrape each page and create one data frame per subject
for (i in seq_len(nrow(url))) {
  assign(paste0(url[i, 2], "_df"),
         url[i, 1] %>%
           read_html() %>%
           html_nodes(xpath = "//*[@id='UniversityRanking']") %>%
           html_table(fill = TRUE) %>%
           data.frame() %>%
           mutate(`Ranking Name` = "ARWU Subject Rankings",
                  `Ranking Year` = 2017,
                  `Subject Ranking` = url[i, 3]) %>%
           cbind(., # use the flag image file name to identify the country name
                 url[i, 1] %>%
                   read_html() %>%
                   html_nodes(xpath = "//*[@id='UniversityRanking']//td[3]//img") %>%
                   html_attr("src") %>%
                   data.frame() %>%
                   setNames(.,
                            c("Country / Region")) %>%
                   # strip the directory path and the literal ".png" extension
                   mutate(`Country / Region` = sub('.*\\/', '', sub("\\.png$", "", `Country / Region`)))) %>%
           setNames(., # rename the columns
                    c("World Ranking",
                      "Institution Name",
                      "Country / Region No Data",
                      "Total Score",
                      "Score on PUB",
                      "Score on CNCI",
                      "Score on IC",
                      "Score on TOP",
                      "Score on AWARD",
                      "Ranking Name",
                      "Ranking Year",
                      "Subject Ranking",
                      "Country / Region")) %>%
           select(-`Country / Region No Data`) %>%
           select(`Ranking Year`,
                  everything()) %>%
           mutate(`Abbreviated Country Name` = countrycode(`Country / Region`, 'country.name', 'iso3c'),
                  `Region` = countrycode(`Country / Region`, 'country.name', 'continent')) %>% # retrieve continent
           mutate(`Region` = ifelse(`Country / Region` == "Australia", # report Australia as its own region
                                    "Australia", `Region`))
  )
}
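
In case it helps with folding this into the package, here is a rough sketch of how the loop above might look as a function that parses each page once and returns a data frame instead of calling `assign()`. The names `country_from_flag()` and `parse_arwu_subject()` are hypothetical (they are not part of any existing package), and the sketch assumes the page still exposes the table as `#UniversityRanking` with a flag image in the third column, as in my code above. It also leaves out `fill = TRUE`, which older rvest versions may still need for ragged tables.

```r
library(rvest)

# Hypothetical helper: derive a country name from a flag image path by
# dropping the directory part and the ".png" extension.
country_from_flag <- function(src) {
  sub("\\.png$", "", basename(src))
}

# Hypothetical helper: parse one already-downloaded ranking page.
# `page` is the result of read_html(); `subject` and `year` are labels
# supplied by the caller and are simply attached as columns.
parse_arwu_subject <- function(page, subject, year) {
  table_xpath <- "//*[@id='UniversityRanking']"

  # The ranking table itself (country cells are images, so they carry no text)
  ranking <- as.data.frame(html_table(html_node(page, xpath = table_xpath)))

  # Country names recovered from the flag image file names
  flag_src <- html_attr(
    html_nodes(page, xpath = paste0(table_xpath, "//td[3]//img")),
    "src"
  )

  # Assumption: exactly one flag per table row; stop early if that breaks
  stopifnot(length(flag_src) == nrow(ranking))

  ranking$country <- country_from_flag(flag_src)
  ranking$subject <- subject
  ranking$year <- year
  ranking
}

# Possible usage with the `url` data frame from above (not run here):
# pages   <- lapply(url[[1]], read_html)
# results <- Map(parse_arwu_subject, pages, url[[3]], 2017)
```

Returning a data frame per page would let the caller decide whether to keep the subjects separate or bind them together, and reading each page once halves the number of requests to the site.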