Parsing university rankings
International university rankings are among the most popular yardsticks for universities. Some are 🇷🇺 Russian, others are foreign. This article collects 👨‍🍳 "recipes" for parsing the following rankings:
- Moscow ranking of universities ("Three University Missions")
- Interfax university ranking
- RAEX Top 100
- Shanghai ranking of universities, including subject rankings
- Best Global Universities ranking, including subject rankings
- Times Higher Education university ranking
- QS university ranking, including subject rankings
Rankings are published (updated) at different times of the year. At the time of writing (July 2023), many rankings already included 2023 (pay attention to the years variable wherever the code uses it).
Moscow ranking of universities
import time
import requests
import pandas as pd
from bs4 import BeautifulSoup

years = list(range(2020, 2022 + 1))
msk_data = []
for y in years:
    # Note the variation in the URL: the latest year differs from all the others
    msk_url = f"https://mosiur.org/{'ranking' if y == 2022 else 'ranking' + str(y)}/"
    msk_resp = requests.get(msk_url)
    if msk_resp.status_code == 200:
        msk_soup = BeautifulSoup(msk_resp.text, 'html.parser')
        for tr in msk_soup.select('#top_table tbody tr'):
            tds = tr.find_all('td')
            msk_data.append({
                'year': y,
                'rank': tds[0].text,
                'university': tds[1].text,
                'country': tds[2].text
            })
    else:
        print(f"{msk_url} returned status code {msk_resp.status_code}")
    # Be gentle with the data source (the pause length is an assumed value)
    time.sleep(1)

msk_df = pd.DataFrame(msk_data)
msk_df.to_csv(f"data/msk_{min(years)}-{max(years)}.csv", index=False)
msk_df.to_excel(f"data/msk_{min(years)}-{max(years)}.xlsx", index=False)
msk_df.head()
(index) | year | rank | university | country |
0 | 2020 | 1 | Harvard University | США |
1 | 2020 | 2 | Massachusetts Institute of Technology | США |
2 | 2020 | 3 | University of Cambridge | Великобритания |
3 | 2020 | 4 | University of Oxford | Великобритания |
4 | 2020 | 5 | University of Pennsylvania | США |
The "Moscow ranking" data have been obtained. Let's look at how countries are represented by year (top 10).
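# The pivot_table arguments were lost in the source; they are reconstructed from the output below
msk_pivot = pd.pivot_table(msk_df, index='country', columns='year', values='university', aggfunc='count')
msk_pivot['sum'] = msk_pivot.sum(axis=1)
msk_pivot.sort_values(by='sum', ascending=False).head(10)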
country / year | 2020 | 2021 | 2022 | sum |
США | 220.0 | 239.0 | 253.0 | 712.0 |
Китай | 122.0 | 144.0 | 173.0 | 439.0 |
Россия | 101.0 | 112.0 | 146.0 | 359.0 |
Великобритания | 98.0 | 106.0 | 108.0 | 312.0 |
Япония | 93.0 | 102.0 | 101.0 | 296.0 |
Германия | 69.0 | 73.0 | 74.0 | 216.0 |
Италия | 50.0 | 54.0 | 54.0 | 158.0 |
Индия | 45.0 | 51.0 | 58.0 | 154.0 |
Испания | 42.0 | 55.0 | 56.0 | 153.0 |
Франция | 40.0 | 45.0 | 54.0 | 139.0 |
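The URL quirk flagged in the scraping loop above (the latest edition is served from /ranking/, earlier ones from /ranking<year>/) can be isolated in a small helper, which is easier to adjust when a new edition appears. This is only a sketch: the name msk_ranking_url and its latest_year parameter are illustrative, and the URL pattern is simply the one used in the loop at the time of writing.

```python
def msk_ranking_url(year: int, latest_year: int = 2022) -> str:
    """Build the page URL for one edition of the Moscow ranking.

    Assumption (taken from the scraping loop): the latest edition lives at
    /ranking/, every earlier edition at /ranking<year>/.
    """
    suffix = "ranking" if year == latest_year else f"ranking{year}"
    return f"https://mosiur.org/{suffix}/"


# One URL per edition, in the same order as the years list
urls = [msk_ranking_url(y) for y in range(2020, 2022 + 1)]
for u in urls:
    print(u)
```

When the next edition is published, only latest_year (and the years range) should need to change, provided the site keeps the same scheme.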
Interfax ranking
import time
import requests
import pandas as pd

def get_interfax_list_by_year(year, page=1):
    url = f"https://academia.interfax.ru/data/rating/?rating=1&year={year}&page={page}"
    result = []
    resp = requests.get(url)
    # Be gentle with the data source (the pause length is an assumed value)
    time.sleep(1)
    if resp.status_code == 200:
        json_data = resp.json()
        result += json_data['universities']
        page_count = json_data['page_count']
        # Recurse until the last page has been collected
        if page < page_count:
            result += get_interfax_list_by_year(year, page + 1)
    else:
        print(f"{url} returned status code {resp.status_code}")
    return result

years = list(range(2020, 2023 + 1))
interfax_df = pd.DataFrame()
for y in years:
    interfax_df_by_year = pd.DataFrame(get_interfax_list_by_year(y))
    interfax_df_by_year['year'] = y
    interfax_df = pd.concat([interfax_df, interfax_df_by_year], ignore_index=True)

interfax_df.to_csv(f"data/interfax_{min(years)}-{max(years)}.csv", index=False)
interfax_df.to_excel(f"data/interfax_{min(years)}-{max(years)}.xlsx", index=False)
interfax_df.head()
(index) | change | description | is_close | point | rank | name | id | url | year |
0 | 0 | None | False | 1000 | 1 | Московский государственный университет имени М... | 1 | https://www.msu.ru | 2020 |
1 | 0 | None | False | 963 | 2 | Национальный исследовательский ядерный универс... | 2 | https://mephi.ru | 2020 |
2 | 0 | None | False | 961 | 3 | Московский физико-технический институт (национ... | 4 | https://mipt.ru | 2020 |
3 | 0 | None | False | 857 | 4 | Национальный исследовательский университет «Вы... | 6 | https://www.hse.ru | 2020 |
4 | 1 | None | False | 848 | 5 | Новосибирский национальный исследовательский г... | 3 | http://www.nsu.ru/?lang=ru | 2020 |
The Interfax data already include a value describing each university's movement within the ranking. Let's look at the top 10 universities that have climbed the most over the last few years.
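# The pivot_table arguments were lost in the source; they are reconstructed from the output below
interfax_pivot = pd.pivot_table(interfax_df, index='name', columns='year', values='change', aggfunc='sum')
interfax_pivot['sum'] = interfax_pivot.sum(axis=1)
interfax_pivot.sort_values(by='sum', ascending=False).head(10)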
name / year | 2020 | 2021 | 2022 | 2023 | sum |
Университет Синергия | -1.0 | 78.0 | 43.0 | 44.0 | 164.0 |
Московский институт психоанализа | -8.0 | 3.0 | 128.0 | 25.0 | 148.0 |
Чеченский государственный университет имени А.А.Кадырова | -22.0 | 83.0 | 58.0 | 14.0 | 133.0 |
Московский государственный психолого-педагогический университет | 15.0 | 78.0 | 18.0 | 12.0 | 123.0 |
Уральский государственный аграрный университет | -21.0 | 40.0 | 104.0 | -4.0 | 119.0 |
Красноярский государственный аграрный университет | -1.0 | -12.0 | 122.0 | 3.0 | 112.0 |
Уральский государственный горный университет | 25.0 | 17.0 | 21.0 | 48.0 | 111.0 |
Кировский государственный медицинский университет | 25.0 | 12.0 | 31.0 | 40.0 | 108.0 |
Кубанский государственный технологический университет | 24.0 | 18.0 | 61.0 | -9.0 | 94.0 |
Государственный университет просвещения (МГОУ) | 5.0 | 28.0 | 51.0 | 1.0 | 85.0 |
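The recursive paging in get_interfax_list_by_year can also be written as a plain loop, which avoids one extra stack frame per page. The sketch below takes the page fetcher as an argument so it can be tried without touching the network; fetch_page and fake_fetch are hypothetical names, and the only thing assumed about the response is the pair of keys the original function reads ('universities' and 'page_count').

```python
from typing import Callable


def collect_all_pages(fetch_page: Callable[[int], dict]) -> list:
    """Collect the 'universities' items from every page, one page at a time.

    fetch_page(page) is a hypothetical callable returning the decoded JSON of
    one page, with the 'universities' and 'page_count' keys used above.
    """
    items = []
    page = 1
    while True:
        data = fetch_page(page)
        items += data["universities"]
        # Stop once the last page reported by the API has been read
        if page >= data["page_count"]:
            break
        page += 1
    return items


# Stand-in for the real endpoint: three pages with two made-up records each
def fake_fetch(page: int) -> dict:
    return {
        "universities": [{"id": page * 10 + k} for k in range(2)],
        "page_count": 3,
    }


rows = collect_all_pages(fake_fetch)
print(len(rows))
```

In real use, fetch_page would wrap requests.get for a given year and include the pause between requests.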
RAEX Top 100
import re
import time
import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

years = list(range(2020, 2023 + 1))
raex_list = []
for y in years:
    raex_url = f"https://raex-rr.com/pro/education/russian_universities/top-100_universities/{y}/"
    raex_resp = requests.get(raex_url)
    if raex_resp.status_code == 200:
        raex_soup = BeautifulSoup(raex_resp.text, 'html.parser')
        raex_trs = raex_soup.select('#rrp_table_wrapper > table > tbody.list > tr')
        for tr in raex_trs:
            # Direct children only: <th> cells hold rank and title, <td> cells hold the metrics
            th = tr.find_all('th', recursive=False)
            td = tr.find_all('td', recursive=False)
            raex_list.append({
                'year': y,
                'rank': int(th[0].text),
                'title': re.sub(r'\s+', ' ', th[1].text).strip(),
                'previous': float(td[0].text) if td[0].text != '-' else np.nan,
                'points': float(td[1].text),
                'quality': int(td[2].text),
                'graduates': int(td[3].text),
                'science': int(td[4].text)
            })
    else:
        print(f"{raex_url} returned status code {raex_resp.status_code}")
    # Be gentle with the data source (the pause length is an assumed value)
    time.sleep(1)

raex_df = pd.DataFrame(raex_list)
raex_df.to_csv(f"data/raex_{min(years)}-{max(years)}.csv", index=False)
raex_df.to_excel(f"data/raex_{min(years)}-{max(years)}.xlsx", index=False)
raex_df.head()
(index) | year | rank | title | previous | points | quality | graduates | science |
0 | 2020 | 1 | Московский государственный университет имени М... | 1.0 | 4.8419 | 1 | 1 | 1 |
1 | 2020 | 2 | Московский физико-технический институт (национ... | 2.0 | 4.7734 | 2 | 7 | 2 |
2 | 2020 | 3 | Национальный исследовательский ядерный универс... | 3.0 | 4.5535 | 5 | 5 | 4 |
3 | 2020 | 4 | Санкт-Петербургский государственный университет | 4.0 | 4.5394 | 3 | 12 | 7 |
4 | 2020 | 5 | Национальный исследовательский университет "Вы... | 5.0 | 4.4933 | 6 | 2 | 11 |
Unlike the previous rankings, this one assigns every university a specific position. Let's look at the top 10 leaders by average position over the downloaded years.
raex_df[['title', 'rank']].groupby(by='title').mean().sort_values(by='rank').head(10)
title | rank |
Московский государственный университет имени М.В. Ломоносова | 1.000000 |
Московский физико-технический институт (национальный исследовательский университет) | 2.000000 |
Национальный исследовательский ядерный университет «МИФИ» | 3.500000 |
Санкт-Петербургский государственный университет | 3.500000 |
Национальный исследовательский университет "Высшая школа экономики" | 5.250000 |
Московский государственный технический университет имени Н.Э. Баумана (национальный исследовательский университет) | 6.000000 |
МГИМО МИД России | 6.750000 |
Санкт-Петербургский политехнический университет Петра Великого | 8.250000 |
Национальный исследовательский Томский политехнический университет | 8.666667 |
Томский политехнический университет | 9.000000 |
Note that the RAEX rating agency is "rich" in rankings (it publishes a full list of them). The mechanics for parsing the other rankings will be roughly the same.
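Because every recipe repeats the same request, status check and pause, that pattern could be factored into one helper and reused for further rankings. The sketch below is one possible shape, not the author's code: polite_fetch, its pause and retries defaults, and the retry behaviour itself are assumptions. The fetcher is passed in so that requests.get, or a session's get method, can be plugged in.

```python
import time
from typing import Callable, Optional


def polite_fetch(fetch: Callable[[str], object],
                 url: str,
                 pause: float = 1.0,
                 retries: int = 2,
                 sleep: Callable[[float], None] = time.sleep) -> Optional[object]:
    """Call fetch(url), pausing after every attempt and retrying on non-200.

    fetch is any callable returning an object with a status_code attribute
    (requests.get fits); pause and retries are illustrative defaults.
    """
    for attempt in range(retries + 1):
        resp = fetch(url)
        sleep(pause)  # be gentle with the data source
        if resp.status_code == 200:
            return resp
        print(f"{url} returned status code {resp.status_code} (attempt {attempt + 1})")
    return None


# Tiny stand-ins so the helper can be tried without network access
class FakeResponse:
    def __init__(self, status_code: int):
        self.status_code = status_code


calls = []


def flaky_fetch(url: str) -> FakeResponse:
    # Fails on the first call, succeeds on the second
    calls.append(url)
    return FakeResponse(200 if len(calls) >= 2 else 503)


result = polite_fetch(flaky_fetch, "https://example.org/ranking", pause=0.0)
```

With requests, the call would look like polite_fetch(requests.get, raex_url); a None result means every attempt failed.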
Shanghai ranking of universities
import os
import re
import json
import time
import js2py
import requests
import pandas as pd
Overall ranking
url_ranking = 'https://www.shanghairanking.com/api/pub/v1/inst'
resp_ranking = requests.get(url_ranking)
ranking = pd.DataFrame()
if resp_ranking.status_code == 200:
    json_data = resp_ranking.json()
    ranking = pd.json_normalize(json_data['data'])
else:
    print(f"{url_ranking} returned status code {resp_ranking.status_code}")

ranking.to_csv('data/shanghai.csv', index=False)
ranking.to_excel('data/shanghai.xlsx', index=False)
ranking.head()
(index) | nameEn | univLogo | univUp | region | rankingInfo | ranking |
0 | Harvard University | logo/032bd1b77.png | harvard-university | United States | ARWU 2022 | 1 |
1 | Stanford University | logo/13de8913b.png | stanford-university | United States | ARWU 2022 | 2 |
2 | Massachusetts Institute of Technology (MIT) | logo/79165fd8b.png | massachusetts-institute-of-technology-mit | United States | ARWU 2022 | 3 |
3 | University of Cambridge | logo/8d9861b69.png | university-of-cambridge | United Kingdom | ARWU 2022 | 4 |
4 | University of California, Berkeley | logo/0ff179fb8.png | university-of-california-berkeley | United States | ARWU 2022 | 5 |
There is not much data yet, so as an example we can look at the top 10 🇷🇺 Russian universities in the ranking.
ranking[ranking['region'] == 'Russia'][['nameEn', 'ranking']].head(10)
(index) | nameEn | ranking |
114 | Moscow State University | 101-150 |
347 | Saint Petersburg State University | 301-400 |
529 | Moscow Institute of Physics and Technology | 501-600 |
626 | HSE University | 601-700 |
654 | Sechenov University | 601-700 |
742 | Novosibirsk State University | 701-800 |
753 | Skolkovo Institute of Science and Technology | 701-800 |
761 | Tomsk State University | 701-800 |
797 | Ural Federal University | 701-800 |
847 | National Research Nuclear University MEPhI (Mo... | 801-900 |
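The ranking column is text: exact positions at the top and ranges such as 101-150 further down, so it cannot be sorted or compared numerically as it stands. A minimal sketch of one way to split such values into numeric bounds follows; parse_rank_range is an illustrative name, and the only formats assumed are the two visible in the tables above.

```python
from typing import Optional, Tuple


def parse_rank_range(value: str) -> Tuple[Optional[int], Optional[int]]:
    """Turn '5' into (5, 5) and '101-150' into (101, 150).

    Anything that does not look like one or two integers gives (None, None).
    """
    parts = value.strip().split("-")
    try:
        bounds = [int(p) for p in parts]
    except ValueError:
        return (None, None)
    if len(bounds) == 1:
        return (bounds[0], bounds[0])
    if len(bounds) == 2:
        return (bounds[0], bounds[1])
    return (None, None)


for raw in ["1", "101-150", "801-900", ""]:
    print(repr(raw), parse_rank_range(raw))
```

Applied to the DataFrame, the lower bound could serve as a sort key, for example via ranking['ranking'].map(lambda v: parse_rank_range(v)[0]).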
Subject rankings and historical data
To get the subject rankings as well as historical data, we need to visit every university page on the site. There are many pages (more than 3,000), so be patient, wait, and periodically check the contents of the data/downloads/shanghai folder.
data_folder = 'data/downloads/shanghai'
os.makedirs(data_folder, exist_ok=True)
for i, r in ranking.iterrows():
    file_path = f"{data_folder}/{i+1}.json"
    # Skip universities that were already downloaded on a previous run
    if not os.path.isfile(file_path):
        university_path = r['univUp']
        university_url = f'https://www.shanghairanking.com/institution/{university_path}'
        html_resp = requests.get(university_url)
        if html_resp.status_code == 200:
            html = html_resp.text
            # The page references a payload.js file whose path contains a numeric id
            reg = rf"/_nuxt/static/\d+/institution/{re.escape(university_path)}/payload\.js"
            payload = re.search(reg, html)[0]
            js_url = f"https://www.shanghairanking.com{payload}"
            js_resp = requests.get(js_url)
            if js_resp.status_code == 200:
                js_data = js_resp.text
                # Strip the JSONP wrapper and evaluate the remaining JavaScript expression
                js_data = js2py.eval_js(js_data.replace(f'__NUXT_JSONP__("/institution/{university_path}", ', '')[:-2])
                university_data = js_data.data[0].univData.to_dict()
                with open(file_path, 'w') as f:
                    f.write(json.dumps(university_data, ensure_ascii=False, indent=4))
            else:
                print(f"{js_url} returned status code {js_resp.status_code}")
        else:
            print(f"{university_url} returned status code {html_resp.status_code}")
        # Be gentle with the data source (the pause length is an assumed value)
        time.sleep(1)
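The key step in the loop above is locating the payload.js path inside the university page with a regular expression. The sketch below isolates that step so it can be checked offline. The function name find_payload_path and the sample HTML fragment are made up for illustration; only the path pattern itself comes from the code above.

```python
import re
from typing import Optional


def find_payload_path(html: str, university_path: str) -> Optional[str]:
    """Return the payload.js path embedded in a university page, or None."""
    # re.escape guards against regex metacharacters in the university slug
    pattern = rf"/_nuxt/static/\d+/institution/{re.escape(university_path)}/payload\.js"
    match = re.search(pattern, html)
    return match[0] if match else None


# Made-up fragment imitating what the loop expects to find in the page
sample_html = (
    '<link rel="preload" as="script" '
    'href="/_nuxt/static/1690000000/institution/moscow-state-university/payload.js">'
)
path = find_payload_path(sample_html, "moscow-state-university")
print(path)
```

Returning None instead of indexing the match directly lets the caller log and skip a page whose markup has changed, rather than stopping the whole download with a TypeError.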
Each downloaded university file holds far more than just the position in the ranking. As an example, let's look at the data for one university from the list.
with open(f"{data_folder}/493.json", 'r') as f:
    example = json.load(f)
example
{
    "address": "Moskovskij Gosudarstvennyj Universitet im. M.V. Lomonosova, Leninskie Gory, Moscow 119992, Russian Federation",
    "detail": {
        "arwu": {
            "datasetId": 1,
            "intro": "The Academic Ranking of World Universities (ARWU) was first published in June 2003 by the Center for World-Class Universities (CWCU), Graduate School of Education (formerly the Institute of Higher Education) of Shanghai Jiao Tong University, China, and updated on an annual basis. Since 2009 the Academic Ranking of World Universities (ARWU) has been published and copyrighted by ShanghaiRanking Consultancy. ShanghaiRanking Consultancy is a fully independent organization on higher education intelligence and not legally subordinated to any universities or government agencies. ARWU uses six objective indicators to rank world universities, including the number of alumni and staff winning Nobel Prizes and Fields Medals, number of highly cited researchers selected by Clarivate, number of articles published in journals of Nature and Science, number of articles indexed in Science Citation Index Expanded™ and Social Sciences Citation Index™ in the Web of Science™, and per capita performance of a university. More than 2500 universities are actually ranked by ARWU every year and the best 1000 are published.",
            "latestVerNo": 2022,
            "nameId": "ARWU",
            "rkHistory": [
                {"ranking": "93", "yr": 2020},
                {"ranking": "97", "yr": 2021},
                {"ranking": "101-150", "yr": 2022}
            ],
            "rkLatest": {"name": "Academic Ranking of World Universities", "ranking": "101-150"}
        },
        "bcur": null,
        "gras": {
            "latestVerNo": 2022,
            "nameId": "GRAS",
            "subjAdva": [
                {"categoryCode": "RS01", "categoryNameShort": "SCI", "code": "RS0102", "name": "Physics", "ranking": "76-100"},
                {"categoryCode": "RS02", "categoryNameShort": "ENG", "code": "RS0207", "name": "Instruments Science & Technology", "ranking": "201-300"},
                {"categoryCode": "RS01", "categoryNameShort": "SCI", "code": "RS0101", "name": "Mathematics", "ranking": "301-400"},
                {"categoryCode": "RS01", "categoryNameShort": "SCI", "code": "RS0103", "name": "Chemistry", "ranking": "301-400"},
                {"categoryCode": "RS01", "categoryNameShort": "SCI", "code": "RS0106", "name": "Ecology", "ranking": "301-400"},
                {"categoryCode": "RS02", "categoryNameShort": "ENG", "code": "RS0214", "name": "Nanoscience & Nanotechnology", "ranking": "301-400"},
                {"categoryCode": "RS02", "categoryNameShort": "ENG", "code": "RS0220", "name": "Biotechnology", "ranking": "401-500"},
                {"categoryCode": "RS04", "categoryNameShort": "MED", "code": "RS0401", "name": "Clinical Medicine", "ranking": "401-500"}
            ],
            "subjCategory": [
                {
                    "code": "RS01",
                    "name": "Natural Sciences",
                    "nameShort": "SCI",
                    "subj": [
                        {"code": "RS0101", "name": "Mathematics", "ranking": "301-400"},
                        {"code": "RS0102", "name": "Physics", "ranking": "76-100"},
                        {"code": "RS0103", "name": "Chemistry", "ranking": "301-400"},
                        {"code": "RS0106", "name": "Ecology", "ranking": "301-400"}
                    ]
                },
                {
                    "code": "RS02",
                    "name": "Engineering\n",
                    "nameShort": "ENG",
                    "subj": [
                        {"code": "RS0207", "name": "Instruments Science & Technology", "ranking": "201-300"},
                        {"code": "RS0214", "name": "Nanoscience & Nanotechnology", "ranking": "301-400"},
                        {"code": "RS0220", "name": "Biotechnology", "ranking": "401-500"}
                    ]
                },
                {
                    "code": "RS04",
                    "name": "Medical Sciences",
                    "nameShort": "MED",
                    "subj": [
                        {"code": "RS0401", "name": "Clinical Medicine", "ranking": "401-500"}
                    ]
                }
            ]
        }
    },
    "foundYear": 1755,
    "introEn": "",
    "nameEn": "Moscow State University",
    "programs": [],
    "ranking": "101-150",
    "rankingInfo": "ARWU 2022",
    "region": "Russia",
    "regionDetail": "Eastern Europe",
    "studentsStatis": [
        [
            {"nameEn": "Total Enrollment", "ratio": "", "value": "29235"},
            {"nameEn": "International Students", "ratio": "29.0%", "value": ""}
        ],
        [
            {"nameEn": "Undergraduate Enrollment", "ratio": "", "value": ""},
            {"nameEn": "International Students", "ratio": "", "value": ""}
        ],
        [
            {"nameEn": "Graduate Enrollment", "ratio": "", "value": ""},
            {"nameEn": "International Students", "ratio": "", "value": ""}
        ]
    ],
    "univLogo": "logo/98123a275.png",
    "univUp": "moscow-state-university",
    "website": "http://www.msu.ru"
}
To keep things simple, we'll take only the historical data for the main ranking and the subject rankings.
file_numbers = sorted(int(x.replace('.json', '')) for x in os.listdir(data_folder))
rating_retro = pd.DataFrame()
rating_subject = pd.DataFrame()
for fn in file_numbers:
    file_path = f"{data_folder}/{fn}.json"
    with open(file_path, 'r') as f:
        json_data = json.load(f)
    # History of positions in the main (ARWU) ranking
    arwu = json_data['detail']['arwu']
    if arwu and 'rkHistory' in arwu:
        chunk_retro = pd.json_normalize(arwu['rkHistory'])
        chunk_retro['nameEn'], chunk_retro['region'] = json_data['nameEn'], json_data['region']
        rating_retro = pd.concat([rating_retro, chunk_retro], ignore_index=True)
    # Subject rankings; 'gras' may be null, so check it before indexing
    gras = json_data['detail']['gras']
    if gras and gras.get('subjAdva'):
        chunk_subject = pd.json_normalize(gras['subjAdva'])
        chunk_subject['nameEn'], chunk_subject['region'] = json_data['nameEn'], json_data['region']
        rating_subject = pd.concat([rating_subject, chunk_subject], ignore_index=True)

retro_years = list(rating_retro['yr'].unique())
rating_retro.to_csv(f"data/shanghai_retro_{min(retro_years)}-{max(retro_years)}.csv", index=False)
rating_retro.to_excel(f"data/shanghai_retro_{min(retro_years)}-{max(retro_years)}.xlsx", index=False)
rating_subject.to_csv('data/shanghai_subject.csv', index=False)
rating_subject.to_excel('data/shanghai_subject.xlsx', index=False)
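To make the flattening step easier to follow, here is the same idea on a single record without pandas: each history entry and each subject entry becomes one row that also carries the university name and region. This is a sketch under the assumption that records have the shape of the example file shown earlier; flatten_university is an illustrative name and the record below is trimmed by hand.

```python
def flatten_university(record: dict):
    """Return (history_rows, subject_rows) for one university record."""
    detail = record.get("detail") or {}
    base = {"nameEn": record.get("nameEn"), "region": record.get("region")}

    # 'arwu' and 'gras' may be null in the files, hence the 'or {}' fallbacks
    arwu = detail.get("arwu") or {}
    history = [{**h, **base} for h in arwu.get("rkHistory") or []]

    gras = detail.get("gras") or {}
    subjects = [{**s, **base} for s in gras.get("subjAdva") or []]
    return history, subjects


# Hand-trimmed record in the shape of the example file
record = {
    "nameEn": "Moscow State University",
    "region": "Russia",
    "detail": {
        "arwu": {"rkHistory": [{"ranking": "93", "yr": 2020},
                               {"ranking": "97", "yr": 2021}]},
        "bcur": None,
        "gras": {"subjAdva": [{"code": "RS0102", "name": "Physics",
                               "ranking": "76-100"}]},
    },
}
history, subjects = flatten_university(record)
print(history)
print(subjects)
```

Lists of such rows can be passed straight to pd.DataFrame, which avoids calling pd.concat once per file.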
As a result for the historical data, let's look at the number of Russian universities by year.
rating_retro[rating_retro['region'] == 'Russia'][['yr', 'nameEn']].\
groupby(by='yr').count().rename(columns={'nameEn': 'count'})
yr | count |
2020 | 13 |
2021 | 13 |
2022 | 13 |
In the subject rankings, let's pick out the fields in which 🇷🇺 Russian universities are most represented.
rating_subject[rating_subject['region'] == 'Russia'][['name', 'nameEn']].\
groupby(by='name').count().rename(columns={'nameEn': 'count'}).\
sort_values(by='count', ascending=False).head(10)
name | count |
Physics | 6 |
Metallurgical Engineering | 6 |
Biological Sciences | 4 |
Nanoscience & Nanotechnology | 4 |
Mathematics | 4 |
Materials Science & Engineering | 3 |
Agricultural Sciences | 2 |
Economics | 2 |
Pharmacy & Pharmaceutical Sciences | 2 |
Mechanical Engineering | 2 |
Best Global Universities
import os
import json
import time
import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
A peculiarity of this ranking is that the site keeps no historical data. We will parse whatever is currently on the ranking's site (at the time of writing, the 2022-2023 ranking).
To begin with, let's look a little more like a browser.
session = requests.Session()
session.headers.update({
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'ru',
    'cache-control': 'no-cache',
    'pragma': 'no-cache',
    'sec-ch-ua': '".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Linux"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/ Safari/537.36'
})
Overall ranking
data_list = []
url_tpl = 'https://www.usnews.com/education/best-global-universities/search?format=json&page='
url_first = url_tpl + '1'
resp_first = session.get(url_first)
if resp_first.status_code == 200:
    data_first = resp_first.json()
    # The first page also tells us how many pages there are in total
    total_pages = data_first['total_pages']
    data_list += data_first['items']
    for page_number in range(2, total_pages + 1):
        url = url_tpl + str(page_number)
        resp = session.get(url)
        if resp.status_code == 200:
            json_data = resp.json()
            data_list += json_data['items']
        else:
            print(f"{url} returned status code {resp.status_code}")
        # Be gentle with the data source (the pause length is an assumed value)
        time.sleep(1)
else:
    print(f"{url_first} returned status code {resp_first.status_code}")

ranking = pd.DataFrame(data_list)
ranking.head()
(index) | url | id | name | city | country_name | three_digit_country_code | ranks | stats | image_url | blurb |
0 | https://www.usnews.com/education/best-global-u... | 166027 | Harvard University | Cambridge (U.S.) | United States | USA | [{'value': '1', 'is_tied': False, 'is_ranked':... | [{'value': '100.0', 'label': 'Global Score'}, ... | {'src': 'https://www.usnews.com/object/image/0... | Founded in 1636, Harvard University is the old... |
1 | https://www.usnews.com/education/best-global-u... | 166683 | Massachusetts Institute of Technology (MIT) | Cambridge (U.S.) | United States | USA | [{'value': '2', 'is_tied': False, 'is_ranked':... | [{'value': '97.7', 'label': 'Global Score'}, {... | {'src': 'https://www.usnews.com/object/image/0... | Massachusetts Institute of Technology, founded... |
2 | https://www.usnews.com/education/best-global-u... | 243744 | Stanford University | Stanford | United States | USA | [{'value': '3', 'is_tied': False, 'is_ranked':... | [{'value': '95.2', 'label': 'Global Score'}, {... | {'src': 'https://www.usnews.com/object/image/0... | Stanford University was founded in 1885 and is... |
3 | https://www.usnews.com/education/best-global-u... | 110635 | University of California Berkeley | Berkeley | United States | USA | [{'value': '4', 'is_tied': False, 'is_ranked':... | [{'value': '88.7', 'label': 'Global Score'}, {... | {'src': 'https://www.usnews.com/object/image/0... | The University of California—Berkeley is situa... |
4 | https://www.usnews.com/education/best-global-u... | 503637 | University of Oxford | Oxford | United Kingdom | GBR | [{'value': '5', 'is_tied': False, 'is_ranked':... | [{'value': '86.8', 'label': 'Global Score'}, {... | {'src': 'https://www.usnews.com/object/image/0... | The exact date of the University of Oxford’s f... |
Let's pull the ranking data out of the ranks column and save the results; then, as an example, we'll build a derived ranking of countries by the median position of their universities (top 10).
for i, r in ranking.iterrows():
    rank_value = r['ranks'][0]['value']
    # 'Unranked' becomes NaN; thousands separators are removed before conversion
    ranking.loc[i, 'ranks.value'] = np.nan if rank_value == 'Unranked' else int(rank_value.replace(',', ''))
    ranking.loc[i, 'ranks.is_tied'] = r['ranks'][0]['is_tied']
    ranking.loc[i, 'ranks.is_ranked'] = r['ranks'][0]['is_ranked']
    ranking.loc[i, 'ranks.label'] = r['ranks'][0]['label']

# Save the results
ranking.to_csv('data/best_global_universities.csv', index=False)
ranking.to_excel('data/best_global_universities.xlsx', index=False)

# Top 10 countries by median rank
ranking[['country_name', 'id', 'ranks.value']].groupby(by='country_name').agg({
    'id': 'count',
    'ranks.value': 'median'
}).rename(columns={
    'id': 'count',
    'ranks.value': 'rank_median'
}).sort_values(by='rank_median').head(10)
country_name | count | rank_median |
Netherlands | 15 | 97.5 |
Hong Kong | 7 | 100.0 |
Switzerland | 12 | 150.0 |
Singapore | 4 | 231.0 |
Denmark | 7 | 261.0 |
Australia | 39 | 273.0 |
Belgium | 11 | 292.0 |
Sweden | 19 | 346.5 |
Finland | 11 | 447.0 |
Iceland | 1 | 452.0 |
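The derived country ranking boils down to two steps: turn each rank string into a number (or nothing, for 'Unranked'), then take the median per country. A pandas-free sketch of that logic is below. The function names and the sample items, including the country names, are made up; only the 'ranks', 'value' and 'country_name' keys are taken from the data shown above.

```python
from collections import defaultdict
from statistics import median
from typing import Optional


def parse_rank(value: str) -> Optional[int]:
    """'1,234' -> 1234, 'Unranked' -> None."""
    return None if value == "Unranked" else int(value.replace(",", ""))


def median_rank_by_country(items: list) -> list:
    """Return (country, median rank) pairs, best median first."""
    ranks = defaultdict(list)
    for item in items:
        rank = parse_rank(item["ranks"][0]["value"])
        if rank is not None:  # unranked universities do not affect the median
            ranks[item["country_name"]].append(rank)
    return sorted(((c, median(v)) for c, v in ranks.items()),
                  key=lambda pair: pair[1])


# Made-up items in the shape returned by the search endpoint
items = [
    {"country_name": "Aland", "ranks": [{"value": "10"}]},
    {"country_name": "Aland", "ranks": [{"value": "30"}]},
    {"country_name": "Bland", "ranks": [{"value": "1,200"}]},
    {"country_name": "Bland", "ranks": [{"value": "Unranked"}]},
    {"country_name": "Cland", "ranks": [{"value": "5"}]},
]
result = median_rank_by_country(items)
print(result)
```

As the table above suggests, a median over very few universities (Iceland has one) says little on its own, so the count column is worth reading alongside it.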
By subject there is far more data, so we'll split the process into stages. Let's start with the list of subjects.
subjects = []
subject_first_url = 'https://www.usnews.com/education/best-global-universities/rankings'
subject_first_resp = session.get(subject_first_url)
if subject_first_resp.status_code == 200:
    subject_html = subject_first_resp.text
    subject_soup = BeautifulSoup(subject_html, 'html.parser')
    subject_select = subject_soup.find('select', {'name': 'subject'})
    for o in subject_select.find_all('option'):
        # Skip the placeholder option with an empty value
        if o['value']:
            subjects.append({
                'title': o.text,
                'value': o['value']
            })
else:
    print(f"{subject_first_url} returned status code {subject_first_resp.status_code}")

pd.DataFrame(subjects).head()
(index) | title | value |
0 | Agricultural Sciences | agricultural-sciences |
1 | Artificial Intelligence | artificial-intelligence |
2 | Arts and Humanities | arts-and-humanities |
3 | Biology and Biochemistry | biology-biochemistry |
4 | Biotechnology and Applied Microbiology | biotechnology-applied-microbiology |
Now let's download the data by subject and save it to disk. This will take a while, so be patient. You can follow progress in the folder where the data are saved: data/downloads/bgu_by_subjects
If something hangs, stop and restart the process, including, just in case, the block above that creates the session object. On each run the code downloads only what it did not manage to download last time.
subject_data_folder = 'data/downloads/bgu_by_subjects'
subject_url_prefix = 'https://www.usnews.com/education/best-global-universities'
for i, s in enumerate(subjects):
    subject_value = s['value']
    subject_dir = f"{subject_data_folder}/{subject_value.replace('-', '_')}"
    # Create the data folders if they do not exist
    if not os.path.exists(subject_dir):
        os.makedirs(subject_dir)
    url_subject_tpl = f"{subject_url_prefix}/{subject_value}?format=json&page="
    first_url_subject = url_subject_tpl + '1'
    first_resp_subject = session.get(first_url_subject)
    if first_resp_subject.status_code == 200:
        first_json_subject = first_resp_subject.json()
        total_pages_subject = first_json_subject['total_pages']
        with open(f"{subject_dir}/1.json", 'w') as f:
            f.write(json.dumps(first_json_subject['items'], ensure_ascii=False, indent=4))
        for page_number in range(2, total_pages_subject + 1):
            subject_page_file_data = f"{subject_dir}/{page_number}.json"
            # Download only the pages that are not on disk yet
            if not os.path.isfile(subject_page_file_data):
                url_subject = url_subject_tpl + str(page_number)
                resp_subject = session.get(url_subject)
                if resp_subject.status_code == 200:
                    with open(subject_page_file_data, 'w') as f:
                        f.write(json.dumps(resp_subject.json()['items'], ensure_ascii=False, indent=4))
                else:
                    print(f"{url_subject} returned status code {resp_subject.status_code}")
                # Be gentle with the data source (the pause length is an assumed value)
                time.sleep(1)
    else:
        print(f"{first_url_subject} returned status code {first_resp_subject.status_code}")
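The restart behaviour described above rests on one check: a page is requested only if its file is not on disk yet. The sketch below shows that check in isolation, using a temporary folder instead of the real download directory. missing_pages is an illustrative name, and the <page>.json naming follows the code above.

```python
import os
import tempfile


def missing_pages(folder: str, total_pages: int) -> list:
    """Page numbers in 1..total_pages that have no <page>.json file in folder yet."""
    return [
        p for p in range(1, total_pages + 1)
        if not os.path.isfile(os.path.join(folder, f"{p}.json"))
    ]


# Simulate an interrupted run: pages 1, 2 and 4 of 5 were saved
with tempfile.TemporaryDirectory() as folder:
    for done in (1, 2, 4):
        with open(os.path.join(folder, f"{done}.json"), "w") as f:
            f.write("[]")
    todo = missing_pages(folder, total_pages=5)

print(todo)
```

One caveat of this scheme is that a file left half-written by an interrupted run still counts as downloaded, so writing to a temporary name and renaming afterwards would make restarts safer.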
Let's gather everything from the downloaded files into one table.
df_by_subject = pd.DataFrame()
for s in subjects:
    subject_value = s['value']
    subject_dir = f"{subject_data_folder}/{subject_value.replace('-', '_')}"
    bgu_files_numbers_in_subject = sorted(int(x.replace('.json', '')) for x in os.listdir(subject_dir))
    for n in bgu_files_numbers_in_subject:
        file_path = f"{subject_dir}/{n}.json"
        with open(file_path, 'r') as f:
            json_data = json.load(f)
        chunk = pd.json_normalize(json_data)
        for i, r in chunk.iterrows():
            # Each university carries several ranks; every one gets its own set of columns
            for rank_idx, rank in enumerate(r['ranks']):
                rank_value = rank['value']
                chunk.loc[i, f'rank-{rank_idx + 1}.value'] = np.nan if rank_value == 'Unranked' else int(rank_value.replace(',', ''))
                chunk.loc[i, f'rank-{rank_idx + 1}.is_tied'] = rank['is_tied']
                chunk.loc[i, f'rank-{rank_idx + 1}.is_ranked'] = rank['is_ranked']
                chunk.loc[i, f'rank-{rank_idx + 1}.label'] = rank['label']
        df_by_subject = pd.concat([df_by_subject, chunk], ignore_index=True)

df_by_subject.to_csv('data/best_global_universities_subject.csv', index=False)
df_by_subject.to_excel('data/best_global_universities_subject.xlsx', index=False)
df_by_subject.head()
(index) | url | id | name | city | country_name | three_digit_country_code | ranks | stats | blurb | image_url.src | image_url.medium | image_url.large | rank-1.value | rank-1.is_tied | rank-1.is_ranked | rank-1.label | rank-2.value | rank-2.is_tied | rank-2.is_ranked | rank-2.label |
0 | https://www.usnews.com/education/best-global-u... | 501112 | Wageningen University & Research | Wageningen | Netherlands | NLD | [{'value': '1', 'is_tied': False, 'is_ranked':... | [{'value': '100.0', 'label': 'Subject Score'},... | NaN | NaN | NaN | 1.0 | False | True | Best Universities for Agricultural Sciences | 89.0 | True | True | Best Global Universities | |
1 | https://www.usnews.com/education/best-global-u... | 500688 | China Agricultural University | Beijing | China | CHN | [{'value': '2', 'is_tied': False, 'is_ranked':... | [{'value': '96.3', 'label': 'Subject Score'}, ... | NaN | NaN | NaN | 2.0 | False | True | Best Universities for Agricultural Sciences | 332.0 | False | True | Best Global Universities | |
2 | https://www.usnews.com/education/best-global-u... | 504800 | Jiangnan University | Wuxi | China | CHN | [{'value': '3', 'is_tied': False, 'is_ranked':... | [{'value': '93.9', 'label': 'Subject Score'}, ... | NaN | NaN | NaN | 3.0 | False | True | Best Universities for Agricultural Sciences | 598.0 | True | True | Best Global Universities | |
3 | https://www.usnews.com/education/best-global-u... | 505115 | South China University of Technology | Guangzhou | China | CHN | [{'value': '4', 'is_tied': False, 'is_ranked':... | [{'value': '91.7', 'label': 'Subject Score'}, ... | NaN | NaN | NaN | 4.0 | False | True | Best Universities for Agricultural Sciences | 219.0 | True | True | Best Global Universities | |
4 | https://www.usnews.com/education/best-global-u... | 166629 | University of Massachusetts Amherst | Amherst | United States | USA | [{'value': '5', 'is_tied': False, 'is_ranked':... | [{'value': '88.3', 'label': 'Subject Score'}, ... | NaN | NaN | NaN | 5.0 | False | True | Best Universities for Agricultural Sciences | 160.0 | True | True | Best Global Universities |
Let's look at the top 10 subject rankings by the number of universities representing 🇷🇺 Russia.
df_by_subject[df_by_subject['three_digit_country_code'] == 'RUS'][['rank-1.label', 'id']].\
groupby(by='rank-1.label').count().rename(columns={'id': 'count'}).\
sort_values(by='count', ascending=False).head(10)
rank-1.label | count |
Best Universities for Physics | 20 |
Best Universities for Chemistry | 17 |
Best Universities for Materials Science | 14 |
Best Universities for Engineering | 9 |
Best Universities for Physical Chemistry | 9 |
Best Universities for Optics | 8 |
Best Universities for Mathematics | 7 |
Best Universities for Condensed Matter Physics | 6 |
Best Universities for Geosciences | 5 |
Best Universities for Computer Science | 4 |
Times Higher Education
Overall ranking
import re
import time
import requests
import pandas as pd
from bs4 import BeautifulSoup
To parse this ranking we also create a session object.
session = requests.Session()
session.headers.update({
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'ru',
    'cache-control': 'no-cache',
    'pragma': 'no-cache',
    'sec-ch-ua': '".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Linux"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/ Safari/537.36'
})
years = range(2020, 2024 + 1)
Next we collect the data year by year.
url_tpl = 'https://www.timeshighereducation.com/world-university-rankings/{year}/world-ranking'
ranking = pd.DataFrame(columns=['year'])
for year in years:
    html_url = url_tpl.format(year=year)
    html_resp = session.get(html_url)
    if html_resp.status_code == 200:
        html = html_resp.text
        soup = BeautifulSoup(html, 'html.parser')
        scripts = soup.find_all('script')
        search_words = 'init_drupal_core_settings'
        drupal_settings_str = None
        # Find the inline script that carries the Drupal settings
        for script in scripts:
            script_text = script.text
            if script_text.find(search_words) > -1:
                drupal_settings_str = script_text
        if drupal_settings_str:
            # The settings mention the JSON file that holds the ranking table
            found = re.search(rf"world_university_rankings_{year}_0__\w+\.json", drupal_settings_str)
            if found:
                data_url = f'https://www.timeshighereducation.com/sites/default/files/the_data_rankings/{found[0]}'
                data_resp = session.get(data_url)
                if data_resp.status_code == 200:
                    data = data_resp.json()
                    chunk = pd.json_normalize(data['data'])
                    chunk['year'] = year
                    ranking = pd.concat([ranking, chunk], ignore_index=True)
                else:
                    print(f"{data_url} returned status code {data_resp.status_code}")
    else:
        print(f"{html_url} returned status code {html_resp.status_code}")
    # Be gentle with the data source (the pause length is an assumed value)
    time.sleep(1)

ranking.to_csv(f"data/the_{min(years)}-{max(years)}.csv", index=False)
ranking.to_excel(f"data/the_{min(years)}-{max(years)}.xlsx", index=False)
ranking.head()
(index) | year | rank_order | rank | name | scores_overall | scores_overall_rank | scores_teaching | scores_teaching_rank | scores_research | scores_research_rank | ... | stats_pc_intl_students | stats_female_male_ratio | aliases | subjects_offered | closed | unaccredited | disabled | apply_link | cta_button.link | cta_button.text |
0 | 2020 | 10 | 1 | University of Oxford | 95.4 | 10 | 90.5 | 6 | 99.6 | 1 | ... | 41% | 46 : 54 | University of Oxford | Accounting & Finance,General Engineering,Commu... | False | False | False | https://www.timeshighereducation.com/student/r... | https://www.timeshighereducation.com/student/r... | Admissions Support |
1 | 2020 | 20 | 2 | California Institute of Technology | 94.5 | 20 | 92.1 | 2 | 97.2 | 4 | ... | 30% | 34 : 66 | California Institute of Technology caltech | Languages, Literature & Linguistics,Economics ... | False | False | False | https://www.timeshighereducation.com/student/r... | https://www.timeshighereducation.com/student/r... | Admissions Support |
2 | 2020 | 30 | 3 | University of Cambridge | 94.4 | 30 | 91.4 | 4 | 98.7 | 2 | ... | 37% | 47 : 53 | University of Cambridge | Business & Management,General Engineering,Art,... | False | False | False | https://www.timeshighereducation.com/student/r... | https://www.timeshighereducation.com/student/r... | Admissions Support |
3 | 2020 | 40 | 4 | Stanford University | 94.3 | 40 | 92.8 | 1 | 96.4 | 5 | ... | 23% | 43 : 57 | Stanford University | Physics & Astronomy,Computer Science,Politics ... | False | False | False | NaN | https://www.timeshighereducation.com/student/r... | Admissions Support |
4 | 2020 | 50 | 5 | Massachusetts Institute of Technology | 93.6 | 50 | 90.5 | 5 | 92.4 | 10 | ... | 34% | 39 : 61 | Massachusetts Institute of Technology | Mathematics & Statistics,Languages, Literature... | False | False | False | https://www.timeshighereducation.com/student/r... | https://www.timeshighereducation.com/student/r... | Admissions Support |
5 rows × 33 columns
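The key step in the loop above is pulling the data file name out of the inline settings script. Below is a minimal sketch of just that step, run on a made-up script string. The helper name `find_data_file`, the sample text and the hash-like suffix are all invented for illustration; the real page content may differ.

```python
import re


def find_data_file(script_text, file_prefix, year):
    """Return the JSON file name embedded in the script text, or None."""
    # re.escape guards against prefixes that contain regex metacharacters.
    pattern = rf"{re.escape(file_prefix)}_{year}_0__\w+\.json"
    match = re.search(pattern, script_text)
    return match[0] if match else None


# Hypothetical fragment of the settings script; the suffix is invented.
sample_script = (
    'init_drupal_core_settings({"the_data_rankings": {"url": '
    '"https:\\/\\/example.org\\/files\\/world_university_rankings_2024_0__abc123def.json"}});'
)

print(find_data_file(sample_script, "world_university_rankings", 2024))
print(find_data_file(sample_script, "law_rankings", 2024))
```

Note that `\w` already matches digits, so `[\w\d]+` and `\w+` are equivalent here.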
From the data obtained, let's count the number of 🇷🇺 Russian universities in the ranking for each year.
ranking[ranking['location'] == 'Russian Federation'][['year', 'location']].\
groupby(by='year').count().rename(columns={'location': 'count'})
year | count |
2020 | 39 |
2021 | 48 |
2022 | 100 |
2023 | 103 |
2024 | 108 |
The code for the subject rankings is almost identical to the previous one; the only addition is a loop over each subject of the ranking.
url_subject_tpl = 'https://www.timeshighereducation.com/world-university-rankings/{year}/subject-ranking/{subject}'
ranking_subject = pd.DataFrame(columns=['year', 'subject'])
subject_map = [
    {'title': 'Arts & humanities', 'url_suffix': 'arts-and-humanities', 'file_prefix': 'arts_humanities_rankings'},
    {'title': 'Business & economics', 'url_suffix': 'business-and-economics', 'file_prefix': 'business_economics_rankings'},
    {'title': 'Clinical & health', 'url_suffix': 'clinical-pre-clinical-health', 'file_prefix': 'clinical_pre_clinical_health_ran'},
    {'title': 'Computer science', 'url_suffix': 'computer-science', 'file_prefix': 'computer_science_rankings'},
    {'title': 'Education', 'url_suffix': 'education', 'file_prefix': 'education_rankings'},
    {'title': 'Engineering', 'url_suffix': 'engineering-and-it', 'file_prefix': 'engineering_technology_rankings'},
    {'title': 'Law', 'url_suffix': 'law', 'file_prefix': 'law_rankings'},
    {'title': 'Life sciences', 'url_suffix': 'life-sciences', 'file_prefix': 'life_sciences_rankings'},
    {'title': 'Physical sciences', 'url_suffix': 'physical-sciences', 'file_prefix': 'physical_sciences_rankings'},
    {'title': 'Psychology', 'url_suffix': 'psychology', 'file_prefix': 'psychology_rankings'},
    {'title': 'Social sciences', 'url_suffix': 'social-sciences', 'file_prefix': 'social_sciences_rankings'}
]
for year in years:
    for subject in subject_map:
        html_url = url_subject_tpl.format(year=year, subject=subject['url_suffix'])
        html_resp = session.get(html_url)
        if html_resp.status_code == 200:
            html = html_resp.text
            soup = BeautifulSoup(html, 'html.parser')
            scripts = soup.find_all('script')
            search_words = 'init_drupal_core_settings'
            drupal_settings_str = None
            for script in scripts:
                script_text = script.text
                if script_text.find(search_words) > -1:
                    drupal_settings_str = script_text
            if drupal_settings_str:
                found = re.search(rf"{subject['file_prefix']}_{year}_0__[\w\d]+\.json", drupal_settings_str)
                if found:
                    data_url = f'https://www.timeshighereducation.com/sites/default/files/the_data_rankings/{found[0]}'
                    data_resp = session.get(data_url)
                    if data_resp.status_code == 200:
                        data = data_resp.json()
                        chunk = pd.json_normalize(data['data'])
                        chunk[['year', 'subject']] = year, subject['title']
                        ranking_subject = pd.concat([ranking_subject, chunk], ignore_index=True)
                    else:
                        print(f"{data_url} return status code {data_resp.status_code}")
        else:
            print(f"{html_url} return status code {html_resp.status_code}")
        # Be gentle with the data source
        time.sleep(1)  # the pause length is an assumption; requires `import time`
ranking_subject.to_csv(f"data/the_{min(years)}-{max(years)}_subject.csv", index=False)
ranking_subject.to_excel(f"data/the_{min(years)}-{max(years)}_subject.xlsx", index=False)
year | subject | rank_order | rank | name | scores_overall | scores_overall_rank | scores_citations | scores_citations_rank | scores_industry_income | ... | stats_pc_intl_students | stats_female_male_ratio | aliases | subjects_offered | closed | unaccredited | disabled | cta_button.link | cta_button.text | apply_link | |
0 | 2020 | Arts & humanities | 10 | 1 | Stanford University | 89.9 | 10 | 80.9 | 6 | 72.4 | ... | 23% | 43 : 57 | Stanford University | History, Philosophy & Theology,Art, Performing... | False | False | False | https://www.timeshighereducation.com/student/r... | Admissions Support | NaN |
1 | 2020 | Arts & humanities | 20 | 2 | University of Cambridge | 86.7 | 20 | 62.8 | 144 | 58.8 | ... | 37% | 47 : 53 | University of Cambridge | History, Philosophy & Theology,Architecture,Ar... | False | False | False | https://www.timeshighereducation.com/student/r... | Admissions Support | https://www.timeshighereducation.com/student/r... |
2 | 2020 | Arts & humanities | 30 | 3 | University of Oxford | 86.4 | 30 | 68.1 | 85 | 35.2 | ... | 41% | 46 : 54 | University of Oxford | Art, Performing Arts & Design,History, Philoso... | False | False | False | https://www.timeshighereducation.com/student/r... | Admissions Support | https://www.timeshighereducation.com/student/r... |
3 | 2020 | Arts & humanities | 40 | 4 | Massachusetts Institute of Technology | 85.8 | 40 | 76.3 | 18 | 56.5 | ... | 34% | 39 : 61 | Massachusetts Institute of Technology | Architecture,Archaeology,Languages, Literature... | False | False | False | https://www.timeshighereducation.com/student/r... | Admissions Support | https://www.timeshighereducation.com/student/r... |
4 | 2020 | Arts & humanities | 50 | 5 | Harvard University | 84.3 | 50 | 68.3 | 84 | 36.0 | ... | 24% | 49 : 51 | Harvard University | Architecture,History, Philosophy & Theology,La... | False | False | False | https://www.timeshighereducation.com/student/r... | Admissions Support | https://www.timeshighereducation.com/student/r... |
As an example, let's look at how the presence of universities from selected countries in the subject rankings has changed over time.
'Russian Federation', 'China',
'India', 'Germany', 'Turkey',
'United Kingdom', 'United States',
'France', 'Brazil', 'Iran',
'Indonesia', 'Pakistan', 'Japan',
'Thailand', 'Italy'
).sort_values(by=[2024], ascending=False)
location | 2020 | 2021 | 2022 | 2023 | 2024 |
United States | 1385 | 1470 | 1499 | 1495 | 1447 |
United Kingdom | 695 | 737 | 757 | 785 | 793 |
China | 396 | 468 | 513 | 528 | 499 |
Japan | 322 | 331 | 351 | 347 | 360 |
Italy | 271 | 288 | 319 | 336 | 352 |
Brazil | 243 | 287 | 307 | 341 | 332 |
Germany | 284 | 286 | 310 | 320 | 320 |
Turkey | 122 | 145 | 181 | 216 | 271 |
India | 153 | 172 | 195 | 221 | 270 |
France | 219 | 235 | 231 | 228 | 243 |
Russian Federation | 130 | 166 | 196 | 230 | 241 |
Iran | 107 | 128 | 146 | 167 | 185 |
Pakistan | 36 | 50 | 65 | 89 | 120 |
Indonesia | 23 | 40 | 62 | 79 | 106 |
Thailand | 58 | 65 | 74 | 85 | 97 |
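A table of this shape can be produced with a country filter followed by `pivot_table`. The sketch below runs on a toy frame rather than on `ranking_subject`; the rows are invented, and the choice of `name` as the counted column is an assumption, not necessarily the call used for the table above.

```python
import pandas as pd

# Toy stand-in for ranking_subject; rows and values are invented.
toy = pd.DataFrame({
    "year": [2023, 2023, 2023, 2024, 2024, 2024, 2024],
    "location": ["China", "China", "Japan", "China", "Japan", "Japan", "Italy"],
    "name": ["A", "B", "C", "A", "C", "D", "E"],
})
countries = ["China", "Japan"]

pivot = (
    toy[toy["location"].isin(countries)]          # keep only the selected countries
    .pivot_table(index="location", columns="year",
                 values="name", aggfunc="count", fill_value=0)
    .sort_values(by=[2024], ascending=False)      # order by the latest year
)
print(pivot)
```

Each cell counts rows, so a university that appears in several subject rankings is counted once per subject.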
QS university ranking
Access to this ranking's website is blocked by decision No. 2а-62/2023 of the Puchezh District Court (Ivanovo Oblast) dated 19.01.2023. Given this, and the politicized stance of the ranking's management towards 🇷🇺 Russian universities, the code below is published without having been tested.
For some reason the court's website shows «данные изъяты» ("data redacted") in place of the domain name, so you may need this link to the Roskomnadzor registry. In the "Resource to search" field, enter www.topuniversities.com
import json
import requests
import pandas as pd
from bs4 import BeautifulSoup
year = 2022
page_url = f'https://www.topuniversities.com/university-rankings/world-university-rankings/{year}'
page_resp = requests.get(page_url)
if page_resp.status_code == 200:
    page_html = page_resp.text
    page_soup = BeautifulSoup(page_html, 'html.parser')
    # The node id of the ranking is stored in the Drupal settings JSON
    drupal_settings = page_soup.find('script', attrs={'data-drupal-selector': 'drupal-settings-json'}).text
    drupal_settings_json = json.loads(drupal_settings)
    nid = drupal_settings_json['statistics']['data']['nid']
    rank_resp = requests.get(f"https://www.topuniversities.com/rankings/endpoint?nid={nid}&page=0&items_per_page=2000")
    if rank_resp.status_code == 200:
        rank_json = rank_resp.json()
        df = pd.json_normalize(rank_json['score_nodes'])
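The endpoint URL above is assembled by hand. A small sketch of the same thing with `urllib.parse.urlencode`, which keeps the query parameters in one place; the helper name `build_endpoint_url` and the sample `nid` are invented, and, like the code above, this has not been checked against the live site.

```python
from urllib.parse import urlencode

QS_ENDPOINT = "https://www.topuniversities.com/rankings/endpoint"


def build_endpoint_url(nid, page=0, items_per_page=2000):
    """Build the ranking endpoint URL for a given node id."""
    query = urlencode({"nid": nid, "page": page, "items_per_page": items_per_page})
    return f"{QS_ENDPOINT}?{query}"


# The nid value here is made up; the real one comes from the Drupal settings JSON.
print(build_endpoint_url("1234567"))
```

If the endpoint turns out to cap `items_per_page`, the `page` argument gives a natural place to loop.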
import json
import time
import requests
import pandas as pd
from bs4 import BeautifulSoup
# Collect the list of subjects
year = 2022
first_url = f"https://www.topuniversities.com/university-rankings/university-subject-rankings/{year}/natural-sciences"
first_resp = requests.get(first_url)
subjects = []
if first_resp.status_code == 200:
    first_html = first_resp.text
    first_soup = BeautifulSoup(first_html, 'html.parser')
    drupal_settings = first_soup.find('script', attrs={'data-drupal-selector': 'drupal-settings-json'}).text
    drupal_settings_json = json.loads(drupal_settings)
    first_nid = drupal_settings_json['qs_rankings_rest_api']['nid']
    subject_url = f"https://www.topuniversities.com/rankings/filter/endpoint?nid={first_nid}"
    subject_resp = requests.get(subject_url)
    if subject_resp.status_code == 200:
        subject_json = subject_resp.json()
        for c, subjects_list in subject_json['subjects'].items():
            for s in subjects_list:
                subjects.append({'category': c, 'name': s['name'], 'url': s['url']})
    else:
        print(f"{subject_url} return status code {subject_resp.status_code}")
else:
    print(f"{first_url} return status code {first_resp.status_code}")
# Fetch the data for each subject
ranking = pd.DataFrame(columns=['subject_category', 'subject'])
for i, s in enumerate(subjects):
    url = f"https://www.topuniversities.com/{s['url']}"
    resp = requests.get(url)
    if resp.status_code == 200:
        html = resp.text
        soup = BeautifulSoup(html, 'html.parser')
        drupal_settings = soup.find('script', attrs={'data-drupal-selector': 'drupal-settings-json'}).text
        drupal_settings_json = json.loads(drupal_settings)
        nid = drupal_settings_json['qs_rankings_rest_api']['nid']
        url_subject = f"https://www.topuniversities.com/rankings/endpoint?nid={nid}&items_per_page=10000&tab=?&page=0"
        resp_subject = requests.get(url_subject)
        if resp_subject.status_code == 200:
            json_subject = resp_subject.json()
            chunk_subject = pd.json_normalize(json_subject['score_nodes'])
            chunk_subject['subject_category'] = s['category']
            chunk_subject['subject'] = s['name']
            ranking = pd.concat([ranking, chunk_subject], ignore_index=True)
        else:
            print(f"{url_subject} return status code {resp_subject.status_code}")
    else:
        print(f"{url} return status code {resp.status_code}")
    print(f"{i}/{len(subjects)}: {s['name']} is done!", end='\r')
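Two details of the code above can be tried offline: flattening the category-to-subjects mapping and joining each subject path onto the site root. The sketch below uses an invented sample mapping (the real response shape is an assumption taken from the untested code) and `urljoin`, which gives the same result whether or not the path starts with a slash.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.topuniversities.com/"


def flatten_subjects(subjects_by_category):
    """Turn {category: [subject, ...]} into a flat list of dicts."""
    return [
        {"category": category, "name": item["name"], "url": item["url"]}
        for category, items in subjects_by_category.items()
        for item in items
    ]


# Hypothetical sample of the 'subjects' part of the filter endpoint response.
sample = {
    "Natural Sciences": [
        {"name": "Chemistry", "url": "/university-rankings/university-subject-rankings/2022/chemistry"},
        {"name": "Mathematics", "url": "university-rankings/university-subject-rankings/2022/mathematics"},
    ],
    "Arts & Humanities": [
        {"name": "History", "url": "/university-rankings/university-subject-rankings/2022/history"},
    ],
}

flat = flatten_subjects(sample)
urls = [urljoin(BASE_URL, s["url"]) for s in flat]
for s, u in zip(flat, urls):
    print(s["category"], s["name"], u)
```

With plain f-string concatenation, a path that begins with `/` would produce a double slash after the host name; `urljoin` avoids that.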
Sources and data
The source Jupyter notebooks for this page are in the folder linked on GitHub.
The resulting datasets, as they were at the time of writing, can be downloaded from the links below: