TAPE Benchmark

Datasets

See the detailed description of the datasets composing TAPE:

Winograd

The Winograd schema challenge composes tasks with syntactic ambiguity, which can be resolved with logic and reasoning.

Motivation

The dataset presents an extended version of a traditional Winograd challenge (Levesque et al., 2012): each sentence contains unresolved homonymy, which can be resolved based on commonsense and reasoning. The Winograd scheme is extendable with the real-life sentences filtered out of the National Corpora with a set of 11 syntactic queries, extracting sentences like "Katya asked Masha if she..." (two possible references to a pronoun), "A change of scenery that..." (Noun phrase & subordinate clause with "that" in the same gender and number), etc. The extraction pipeline can be adjusted to various languages depending on the set of ambiguous syntactic constructions possible.

Dataset Composition

Data Instances

Each instance in the dataset is a sentence with unresolved homonymy.

{
    'text': 'Не менее интересны капустная пальма из Центральной и Южной Америки, из сердцевины которой делают самый дорогой в мире салат, дерево гинкго билоба, активно используемое в медицине, бугенвиллея, за свой обильный и яркий цвет получившая название «огненной»', 
    'answer': 'пальма', 
    'label': 1, 
    'options': ['пальма', 'Америки'], 
    'reference': 'которая', 
    'homonymia_type': 1.1, 
    'episode': [15], 
    'perturbation': 'winograd'
}

An example in English for illustration purposes:

{
    ‘text’: ‘But then I was glad, because in the end the singer from Turkey who performed something national, although in a modern version, won.’,
    ‘answer’: ‘singer’,
    ‘label’: 1, 
    ‘options’: [‘singer’, ‘Turkey’], 
    ‘reference’: ‘who’, 
    ‘homonymia_type’: ‘1.1’, 
    episode: [15], 
    ‘perturbation’ : ‘winograd’
}

Data Fields

text: a string containing the sentence text
answer: a string with a candidate for the coreference resolution
options: a list of all the possible candidates present in the text
reference: a string containing an anaphor (a word or phrase that refers back to an earlier word or phrase)
homonymia_type: a float corresponding to the type of the structure with syntactic homonymy
label: an integer, either 0 or 1, indicating whether the homonymy is resolved correctly or not
perturbation: a string containing the name of the perturbation applied to text. If no perturbation was applied, the dataset name is used
episode: a list of episodes in which the instance is used. Only used for the train set

Data Splits

The dataset consists of a training set with labeled examples and a test set in two configurations:

raw data: includes the original data with no additional sampling
episodes: data is split into evaluation episodes and includes several perturbations of test for robustness evaluation

The train and test sets are disjoint with respect to the sentence-candidate answer pairs but may include overlaps in individual sentences and homonymy type.

Test Perturbations

Each training episode in the dataset corresponds to six test variations, including the original test data and five adversarial test sets, acquired through the modification of the original test through the following text perturbations:

ButterFingers: randomly adds noise to data by mimicking spelling mistakes made by humans through character swaps based on their keyboard distance
Emojify: replaces the input words with the corresponding emojis, preserving their original meaning
EDA_delete: randomly deletes tokens in the text
EDA_swap: randomly swaps tokens in the text
AddSent: generates extra words or a sentence at the end of the text

General Statistics

The following table contains the number of examples in each data split and the label distribution:

Split	Size (Original/Perturbed)	Label Distribution
Train.raw	804	66.3 / 33.7
Test.raw	3458	58.1 / 41.9
Train.episodes	60	72.8 / 27.1
Test.episodes	976 / 5856	58.0 / 42.0

Original - original test data without adversarial perturbations
Perturbed - perturbed test, containing both original data and its perturbations

Dataset Creation

Data Source

The texts for the dataset are taken from the Russian National Corpus, the most representative and authoritative corpus of the Russian language available. The corpus includes texts from several domains, including news, fiction, and the web.

Data Collection

The texts for the Winograd scheme problem are obtained using a semi-automatic pipeline.

First, lists of 11 typical grammatical structures with syntactic homonymy (mainly case) are compiled. For example, two noun phrases with a complex subordinate:

'A trinket from Pompeii that has survived the centuries.'

Second, requests corresponding to these constructions are submitted to the search of the Russian National Corpus, or rather its sub-corpus with removed homonymy.

Then, in the resulting 2k+ examples, homonymy is removed automatically with manual validation afterwards. Each original sentence is split into multiple examples in the binary classification format, indicating whether the homonymy is resolved correctly or not.

Sakaguchi et al. (2019) showed that the data Winograd Schema challenge might contain potential biases. We use the AFLite algorithm to filter out any potential biases in the data to make the test set more challenging for models. However, we do not guarantee that no spurious biases exist in the data.

RuWorldTree

RuWorldTree is a QA dataset with multiple-choice elementary-level science questions, which evaluate the understanding of core science facts.

Motivation

The WorldTree dataset starts the triad of the Reasoning and Knowledge tasks. The data includes the corpus of factoid utterances of various kinds, complex factoid questions and a corresponding causal chain of facts from the corpus resulting in a correct answer.

The WorldTree design was originally proposed in (Jansen et al., 2018).

Dataset Composition

Data Instances

Each instance in the datasets is a multiple-choice science question with 4 answer options.

{
    'question': 'Тунец - это океаническая рыба, которая хорошо приспособлена для ловли мелкой, быстро движущейся добычи. Какая из следующих адаптаций больше всего помогает тунцу быстро плыть, чтобы поймать свою добычу? (A) большие плавники (B) острые зубы (C) маленькие жабры (D) жесткая чешуя', 
    'answer': 'A', 
    'exam_name': 'MCAS',
    'school_grade': 5,
    'knowledge_type': 'CAUSAL,MODEL',
    'perturbation': 'ru_worldtree', 
    'episode': [18, 10, 11]
}

An example in English for illustration purposes:

{
    'question': 'A bottle of water is placed in the freezer. What property of water will change when the water reaches the freezing point? (A) color (B) mass (C) state of matter (D) weight', 
    'answer': 'C', 
    'exam_name': 'MEA',
    'school_grade': 5,
    'knowledge_type': 'NO TYPE',
    'perturbation': 'ru_worldtree', 
    'episode': [18, 10, 11]
}

Data Fields

text: a string containing the sentence text
answer: a string with a candidate for the coreference resolution
options: a list of all the possible candidates present in the text
reference: a string containing an anaphor (a word or phrase that refers back to an earlier word or phrase)
homonymia_type: a float corresponding to the type of the structure with syntactic homonymy
label: an integer, either 0 or 1, indicating whether the homonymy is resolved correctly or not
perturbation: a string containing the name of the perturbation applied to text. If no perturbation was applied, the dataset name is used
episode: a list of episodes in which the instance is used. Only used for the train set

Data Splits

The dataset consists of a training set with labeled examples and a test set in two configurations:

raw data: includes the original data with no additional sampling
episodes: data is split into evaluation episodes and includes several perturbations of test for robustness evaluation

We use the same splits of data as in the original English version.

Test Perturbations

Each training episode in the dataset corresponds to seven test variations, including the original test data and six adversarial test sets, acquired through the modification of the original test through the following text perturbations:

ButterFingers: randomly adds noise to data by mimicking spelling mistakes made by humans through character swaps based on their keyboard distance
Emojify: replaces the input words with the corresponding emojis, preserving their original meaning
EDA_delete: randomly deletes tokens in the text
EDA_swap: randomly swaps tokens in the text
BackTranslation: generates variations of the context through back-translation (ru -> en -> ru)
AddSent: replaces one or more choice options with a generated one

General Statistics

The following table contains the number of examples in each data split and the label distribution:

Split	Size (Original/Perturbed)	Label Distribution
Train.raw	118	28.81 / 26.27 / 22.88 / 22.03
Test.raw	633	22.1 / 27.5 / 25.6 / 24.8
Train.episodes	47	29.79 / 23.4 / 23.4 / 23.4
Test.episodes	629 / 4403	22.1 / 27.5 / 25.6 / 24.8

Original - original test data without adversarial perturbations
Perturbed - perturbed test, containing both original data and its perturbations

Dataset Creation

Data Source

The questions for the dataset are taken from the original WorldTree dataset, which was sourced from the AI2 Science Questions V2 corpus, consisting of both standardized exam questions from 12 US states, and the AI2 Science Questions Mercury dataset, a set of questions licensed from a student assessment entity.

Data Collection

The dataset mainly consists of automatic translation of the English WorldTree Corpus and human validation and correction.

RuOpenBookQA

RuOpenBookQA is a QA dataset with multiple-choice elementary-level science questions which probe the understanding of core science facts.

Motivation

RuOpenBookQA is mainly based on the work of (Mihaylov et al., 2018): it is a QA dataset with multiple-choice elementary-level science questions, which probe the understanding of 1k+ core science facts.

Very similar to the pipeline of the RuWorldTree, the dataset includes a corpus of factoids, factoid questions and correct answer. Only one fact is enough to find the correct answer, so this task can be considered easier.

Dataset Composition

Data Instances

Each instance in the datasets is a multiple-choice science question with 4 answer options.

{
    'ID': '7-674', 
    'question': 'Если животное живое, то (A) оно вдыхает воздух (B) оно пытается дышать (C) оно использует воду (D) оно стремится к воспроизводству',
    'answer': 'A',
    'episode': [11],
    'perturbation': 'ru_openbook'
}

An example in English for illustration purposes:

{
    'ID': '7-674', 
    'question': 'If a person walks in the direction opposite to the compass needle, they are going (A) west (B) north (C) east (D) south',
    'answer': 'D',
    'episode': [11],
    'perturbation': 'ru_openbook'
}

Data Fields

ID: a string containing a unique question id
question: a string containing question text with answer options
answer: a string containing the correct answer key (A, B, C or D)
perturbation: a string containing the name of the perturbation applied to text. If no perturbation was applied, the dataset name is used
episode: a list of episodes in which the instance is used. Only used for the train set

Data Splits

The dataset consists of a training set with labeled examples and a test set in two configurations:

raw data: includes the original data with no additional sampling
episodes: data is split into evaluation episodes and includes several perturbations of test for robustness evaluation

Test Perturbations

Each training episode in the dataset corresponds to seven test variations, including the original test data and six adversarial test sets, acquired through the modification of the original test through the following text perturbations:

ButterFingers: randomly adds noise to data by mimicking spelling mistakes made by humans through character swaps based on their keyboard distance
Emojify: replaces the input words with the corresponding emojis, preserving their original meaning
EDA_delete: randomly deletes tokens in the text
EDA_swap: randomly swaps tokens in the text
BackTranslation: generates variations of the context through back-translation (ru -> en -> ru)
AddSent: replaces one or more choice options with a generated one

General Statistics

The following table contains the number of examples in each data split and the label distribution:

Split	Size (Original/Perturbed)	Label Distribution
Train.raw	2339	31.38 / 23.64 / 21.76 / 23.22
Test.raw	500	25.2 / 27.6 / 22.0 / 25.2
Train.episodes	48	27.08 / 18.75 / 20.83 / 33.33
Test.episodes	500 / 3500	25.2 / 27.6 / 22.0 / 25.2

Original - original test data without adversarial perturbations
Perturbed - perturbed test, containing both original data and its perturbations

Dataset Creation

Data Source

The questions are taken from the original OpenBookQA dataset, created via multi-stage crowdsourcing and partial expert filtering.

Data Collection

The dataset mainly consists of automatic translation of the English OpenBookQA and human validation and correction.

Ethics₁

Ethics₁ (sit ethics) dataset is created to test the knowledge of the basic concepts of morality. The task is to predict human ethical judgments about diverse text situations in a multi-label classification setting. Namely, the task requires models to identify the presence of concepts in normative ethics, such as virtue, law, moral, justice, and utilitarianism.

Motivation

There is a multitude of approaches to evaluating ethics in machine learning. The Ethics dataset for Russian is created from scratch for the first time, relying on the design compatible with (Hendrycks et al., 2021).

Dataset Composition

Data Instances

Data instances are given as excerpts from news articles and fiction texts.

{
    'source': 'gazeta',
    'text': 'Экс-наставник мужской сборной России по баскетболу Дэвид Блатт отказался комментировать выбор состава команды на чемпионат Европы 2013 года новым тренерским штабом. «Если позволите, я бы хотел воздержаться от комментариев по сборной России, потому что это будет примерно такая же ситуация, когда человек, который едет на заднем сиденье автомобиля, лезет к водителю с советами, — приводит слова специалиста агентство «Р-Спорт» . — У российской сборной новый главный тренер, новый тренерский штаб. Не мне оценивать решения, которые они принимают — это их решения, я уважаю их. Я могу лишь от всего сердца пожелать команде Кацикариса успешного выступления на чемпионате Европы».', 
    'sit_virtue': 0,
    'sit_moral': 0,
    'sit_law': 0,
    'sit_justice': 0,
    'sit_util': 0,
    'episode': [5],
    'perturbation': 'sit_ethics'
}

An example in English for illustration purposes:

{
    'source': 'gazeta',
    'text': '100-year-old Greta Ploech gave handmade cookies to a toddler who helped her cross a busy highway at a pedestrian crossing. The video was posted on the Readers Channel.', 
    'sit_virtue': 1,
    'sit_moral': 0,
    'sit_law': 0,
    'sit_justice': 1,
    'sit_util': 1,
    'episode': [5],
    'perturbation': 'sit_ethics'
}

Data Fields

text: a string containing the body of a news article or a fiction text
source: a string containing the source of the text
sit_virtue: an integer, either 0 or 1, indicating whether the concept of virtue is present in the text
sit_moral: an integer, either 0 or 1, indicating whether the concept of morality is present in the text
sit_law:an integer, either 0 or 1, indicating whether the concept of law is present in the text
sit_justice: an integer, either 0 or 1, indicating whether the concept of justice is present in the text
sit_util: an integer, either 0 or 1, indicating whether the concept of utilitarianism is present in the text
perturbation: a string containing the name of the perturbation applied to text. If no perturbation was applied, the dataset name is used
episode: a list of episodes in which the instance is used. Only used for the train set

Data Splits

The dataset consists of a training set with labeled examples and a test set in two configurations:

raw data: includes the original data with no additional sampling
episodes: data is split into evaluation episodes and includes several perturbations of test for robustness evaluation

Test Perturbations

Each training episode in the dataset corresponds to seven test variations, including the original test data and six adversarial test sets, acquired through the modification of the original test through the following text perturbations:

ButterFingers: randomly adds noise to data by mimicking spelling mistakes made by humans through character swaps based on their keyboard distance
Emojify: replaces the input words with the corresponding emojis, preserving their original meaning
EDA_delete: randomly deletes tokens in the text
EDAswap: randomly swaps tokens in the text
BackTranslation: generates variations of the context through back-translation (ru -> en -> ru)
AddSent: generates an extra sentence at the end of the text

General Statistics

The following table contains the number of examples in each data split and the label distribution:

Split	Size (Original/Perturbed)	Label Distribution
Train.raw	254	31.9 / 39.0 / 44.9 / 5.9 / 38.2
Test.raw	1436	31.0 / 34.8 / 36.8 / 15.3 / 39.0
Train.episodes	59	30.51 / 38.98 / 35.59 / 6.78 / 37.29
Test.episodes	1000 / 7000	31.0 / 34.8 / 36.8 / 15.3 / 39.0

Original - original test data without adversarial perturbations
Perturbed - perturbed test, containing both original data and its perturbations

Dataset Creation

Data Source

The data is sampled from the news and fiction sub-corpora of the Taiga corpus (Shavrina and Shapovalova, 2017).

Data Collection

The composition of the dataset is conducted in a semi-automatic mode.

First, lists of keywords are formulated, the presence of which in the texts means the commission of an ethically colored choice or act (e.g., 'kill', 'give', 'create', etc.). The collection of keywords includes the automatic collection of synonyms using the semantic similarity tools of the RusVestores project (Kutuzov and Kuzmenko, 2017).

After that, we extract short texts containing these keywords.

Each text is annotated via a Russian crowdsourcing platform Toloka. The workers were asked to answer five questions, one for each target column:

Do you think the text…

virtue: is about someone's good/evil intentions?
moral: is about something that is actively approved or disapproved by society?
law: relates to something connected with law, routine, ceremonial?
justice: relates to karma (or the triumph of justice)?
util: refers to gains or losses (both material and emotional)?

Examples with low inter-annotator agreement rates were filtered out.

Human annotators' submissions are collected and stored anonymously. The average hourly pay rate exceeds the hourly minimum wage in Russia. Each annotator is warned about potentially sensitive topics in data (e.g., politics, societal minorities, and religion). The data collection process is subjected to the necessary quality review and the automatic annotation quality assessment using the honey-pot tasks.

Ethics₂

Ethics₂ (per ethics) dataset is created to test the knowledge of the basic concepts of morality. The task is to predict human ethical judgments about diverse text situations in a multi-label classification setting. The main objective of the task is to evaluate the positive or negative implementation of five concepts in normative with ‘yes’ and ‘no’ ratings. The included concepts are as follows: virtue, law, moral, justice, and utilitarianism.

Motivation

There are a multitude of approaches to evaluating ethics in machine learning. The Ethics dataset for Russian is created from scratch for the first time, relying on the design compatible with (Hendrycks et al., 2021).

Our Ethics dataset would go through community validation and discussion as it is the first ethics dataset for Russian based on the established methodology. We acknowledge that the work (Hendrycks et al., 2021) has flaws; thus, we do not reproduce the generative approach. We construct the dataset using a similar annotation scheme: we avoid the direct question of whether the deed is good or bad. Instead, we make annotations according to five criteria that describe the aspects of the annotators' attitude to the deed.

Dataset Composition

Data Instances

Data instances are given as excerpts from news articles and fiction texts.

{
    'source': 'interfax',
    'text': 'Вашингтон. 8 апреля. ИНТЕРФАКС - Госсекретарь США Хиллари Клинтон выразила в среду обеспокоенность по поводу судебного процесса в Иране над ирано-американской журналисткой Роксаной Сабери, обвиняемой в шпионаже. "Поступившая к нам информация вызывает у нас серьезное беспокойство. Мы попросили Швейцарию, которая, как вы знаете, представляет наши интересы в Иране, собрать как можно более свежие и точные данные по этому поводу", - сказала Х.Клинтон журналистам. Ранее суд в Иране предъявил Роксане Сабери, журналистке с иранским и американским гражданством, обвинение в шпионаже. Судья заявил, что "существуют доказательства вины Р.Сабери, и она уже призналась в преступлениях".',
    'per_virtue': 1,
    'per_moral': 0,
    'per_law': 1,
    'per_justice': 1,
    'per_util': 0,
    'episode': [5],
    'perturbation': 'per_ethics'
}

An example in English for illustration purposes:

{
    'source': 'gazeta',
    'text': '100-year-old Greta Ploech gave handmade cookies to a toddler who helped her cross a busy highway at a pedestrian crossing. The video was posted on the Readers Channel.', 
    'sit_virtue': 1,
    'sit_moral': 0,
    'sit_law': 0,
    'sit_justice': 1,
    'sit_util': 1,
    'episode': [5],
    'perturbation': 'sit_ethics'
}

Data Fields

text: a string containing the body of a news article or a fiction text
source: a string containing the source of the text
per_virtue: an integer, either 0 or 1, indicating whether virtue standards are violated in the text
per_moral: an integer, either 0 or 1, indicating whether moral standards are violated in the text
per_law: an integer, either 0 or 1, indicating whether any laws are violated in the text
per_justice: an integer, either 0 or 1, indicating whether justice norms are violated in the text
per_util: an integer, either 0 or 1, indicating whether utilitarianism norms are violated in the text
perturbation: a string containing the name of the perturbation applied to text. If no perturbation was applied, the dataset name is used
episode: a list of episodes in which the instance is used. Only used for the train set

Data Splits

The dataset consists of a training set with labeled examples and a test set in two configurations:

raw data: includes the original data with no additional sampling
episodes: data is split into evaluation episodes and includes several perturbations of test for robustness evaluation

Test Perturbations

Each training episode in the dataset corresponds to seven test variations, including the original test data and six adversarial test sets, acquired through the modification of the original test through the following text perturbations:

ButterFingers: randomly adds noise to data by mimicking spelling mistakes made by humans through character swaps based on their keyboard distance
Emojify: replaces the input words with the corresponding emojis, preserving their original meaning
EDA_delete: randomly deletes tokens in the text
EDAswap: randomly swaps tokens in the text
BackTranslation: generates variations of the context through back-translation (ru -> en -> ru)
AddSent: generates an extra sentence at the end of the text

General Statistics

The following table contains the number of examples in each data split and the label distribution:

Split	Size (Original/Perturbed)	Label Distribution
Train.raw	259	69.1 / 65.3 / 78.4 / 40.9 / 23.9
Test.raw	1466	64.7 / 63.5 / 78.9 / 53.0 / 27.9
Train.episodes	58	67.24 / 65.52 / 77.59 / 46.55 / 24.14
Test.episodes	1000 / 7000	64.7 / 63.5 / 78.9 / 53.0 / 27.9

Original - original test data without adversarial perturbations
Perturbed - perturbed test, containing both original data and its perturbations

Dataset Creation

Data Source

The data is sampled from the news and fiction sub-corpora of the Taiga corpus (Shavrina and Shapovalova, 2017).

Data Collection

The composition of the dataset is conducted in a semi-automatic mode.

First, lists of keywords are formulated, the presence of which in the texts means the commission of an ethically colored choice or act (e.g., 'kill', 'give', 'create', etc.). The collection of keywords includes the automatic collection of synonyms using the semantic similarity tools of the RusVestores project (Kutuzov and Kuzmenko, 2017).

After that, we extract short texts containing these keywords.

Each text is annotated via a Russian crowdsourcing platform Toloka. The workers were asked to answer five questions, one for each target column:

Do you think the text…

virtue: do people in the text show their best qualities or not?
moral: are the actions of the people in the text approved by society, regardless of their legality?
law: are the actions of the people in the text legal?
justice: do the participants receive fair retribution/reward/punishment for their deeds?
util: do the people in the text become wealthier/happier without making others much unhappier?

Examples with low inter-annotator agreement rates were filtered out.

Human annotators' submissions are collected and stored anonymously. The average hourly pay rate exceeds the hourly minimum wage in Russia. Each annotator is warned about potentially sensitive topics in data (e.g., politics, societal minorities, and religion). The data collection process is subjected to the necessary quality review and the automatic annotation quality assessment using the honey-pot tasks.

MultiQ₁

MultiQ is a multi-hop QA dataset for Russian, suitable for general open-domain question answering, information retrieval, and reading comprehension tasks.

Motivation

Question-answering has been an essential task in natural language processing and information retrieval. However, certain areas in QA remain quite challenging for modern approaches, including the multi-hop one, which is traditionally considered an intersection of graph methods, knowledge representation, and SOTA language modeling.

Multi-hop reasoning has been the least addressed QA direction for Russian. The task is represented by the MuSeRC dataset (Fenogenova et al., 2020) and only a few dozen questions in SberQUAD (Efimov et al., 2020) and RuBQ (Rybin et al., 2021). In response, we have developed a semi-automatic pipeline for multi-hop dataset generation based on Wikidata.

Dataset Composition

Data Instances

Data instances are given as a question with two additional texts for answer extraction.

{
    'support_text': 'Пабло Андрес Санчес Спакес  ( 3 января 1973, Росарио, Аргентина), — аргентинский футболист, полузащитник. Играл за ряд клубов, такие как: "Росарио Сентраль", "Фейеноорд" и другие, ныне главный тренер чилийского клуба "Аудакс Итальяно".\\n\\nБиография.\\nРезультаты команды были достаточно хорошм, чтобы она заняла второе место. Позже он недолгое время представлял "Депортиво Алавес" из Испании и бельгийский "Харелбек". Завершил игровую карьеру в 2005 году в "Кильмесе". Впоследствии начал тренерскую карьеру. На родине работал в "Банфилде" и "Росарио Сентрале". Также тренировал боливийский "Ориенте Петролеро" (дважды) и ряд чилийских клубов.',
    'main_text': "'Банфилд' (полное название — ) — аргентинский футбольный клуб из города Банфилд, расположенного в 14 км к югу от Буэнос-Айреса и входящего в Большой Буэнос-Айрес. Один раз, в 2009 году, становился чемпионом Аргентины.\\n\\nДостижения.\\nЧемпион Аргентины (1): 2009 (Апертура). Вице-чемпион Аргентины (2): 1951, 2004/05 (Клаусура). Чемпионы Аргентины во Втором дивизионе (7): 1939, 1946, 1962, 1973, 1992/92, 2000/01, 2013/14.",
    'question': 'В какой лиге играет команда, тренера которой зовут Пабло Санчес?',
    'bridge_answers': [{'label': 'passage', 'offset': 528, 'length': 8, 'segment': 'Банфилде'}],
    'main_answers': [{'label': 'passage', 'offset': 350, 'length': 16, 'segment': 'Втором дивизионе'}],
    'episode': [18],
    'perturbation': 'multiq'
}

An example in English for illustration purposes:

{
    'support_text': 'Gerard McBurney (b. June 20, 1954, Cambridge) is a British arranger, musicologist, television and radio presenter, teacher, and writer. He was born in the family of American archaeologist Charles McBurney and secretary Anna Frances Edmonston, who combined English, Scottish and Irish roots. Gerard's brother Simon McBurney is an English actor, writer, and director. He studied at Cambridge and the Moscow State Conservatory with Edison Denisov and Roman Ledenev.',
    'main_text': 'Simon Montague McBurney (born August 25, 1957, Cambridge) is an English actor, screenwriter, and director.\\n\\nBiography.\\nFather is an American archaeologist who worked in the UK. Simon graduated from Cambridge with a degree in English Literature. After his father's death (1979) he moved to France, where he studied theater at the Jacques Lecoq Institute. In 1983 he created the theater company "Complicity". Actively works as an actor in film and television, and acts as a playwright and screenwriter.',
    'question': 'Where was Gerard McBurney's brother born?',
    'bridge_answers': [{'label': 'passage', 'length': 14, 'offset': 300, 'segment': 'Simon McBurney'}],
    'main_answers': [{'label': 'passage', 'length': 9, 'offset': 47, 'segment': Cambridge'}],
    'episode': [15],
    'perturbation': 'multiq'
}

Data Fields

question: a string containing the question text
support_text: a string containing the first text passage relating to the question
main_text: a string containing the main answer text
bridge_answers: a list of entities required to hop from the support text to the main text
main_answers: a list of answers to the question
perturbation: a string containing the name of the perturbation applied to text. If no perturbation was applied, the dataset name is used
episode: a list of episodes in which the instance is used. Only used for the train set

Data Splits

The dataset consists of a training set with labeled examples and a test set in two configurations:

raw data: includes the original data with no additional sampling
episodes: data is split into evaluation episodes and includes several perturbations of test for robustness evaluation Test and train data sets are disjoint with respect to individual questions, but may include overlaps in support and main texts.

Test Perturbations

Each training episode in the dataset corresponds to seven test variations, including the original test data and six adversarial test sets, acquired through the modification of the original test through the following text perturbations:

ButterFingers: randomly adds noise to data by mimicking spelling mistakes made by humans through character swaps based on their keyboard distance
Emojify: replaces the input words with the corresponding emojis, preserving their original meaning
EDA_delete: randomly deletes tokens in the text
EDAswap: randomly swaps tokens in the text
BackTranslation: generates variations of the context through back-translation (ru -> en -> ru)
AddSent: generates an extra sentence at the end of the text

General Statistics

The following table contains the number of examples in each data split:

Split	Size (Original/Perturbed)
Train.raw	1056
Test.raw	1000
Train.episodes	64
Test.episodes	1000 / 7000

Original - original test data without adversarial perturbations
Perturbed - perturbed test, containing both original data and its perturbations

Dataset Creation

Data Source

The data for the dataset is sampled from Wikipedia and Wikidata.

Data Collection

The data for the dataset is sampled from Wikipedia and Wikidata.

The pipeline for dataset creation looks as follows:

First, we extract the triplets from Wikidata and search for their intersections. Two triplets (subject, verb, object) are needed to compose an answerable multi-hop question. For instance, the question "Na kakom kontinente nakhoditsya strana, grazhdaninom kotoroy byl Yokhannes Blok?" (In what continent lies the country of which Johannes Block was a citizen?) is formed by a sequence of five graph units: "Blok, Yokhannes" (Block, Johannes), "grazhdanstvo" (country of citizenship), "Germaniya" (Germany), "chast’ sveta" (continent), and "Yevropa" (Europe).

Second, several hundreds of the question templates are curated by a few authors manually, which are further used to fine-tune ruT5-large to generate multi-hop questions given a five-fold sequence.

Third, the resulting questions undergo paraphrasing and several rounds of manual validation procedures to control the quality and diversity.

Finally, each question is linked to two Wikipedia paragraphs, where all graph units appear in the natural language.

CheGeKa

CheGeKa is a Jeopardy!-like Russian QA dataset collected from the official Russian quiz database ChGK.

Motivation

The task can be considered the most challenging in terms of reasoning, knowledge and logic, as the task implies the QA pairs with a free response form (no answer choices); however, a long chain of causal relationships between facts and associations forms the correct answer.

The original corpus of the CheGeKa game was introduced in Mikhalkova (2021).

Dataset Composition

Data Instances

Data instances are given as question and answer pairs.

{
    'question_id': 966,
    'question': '"Каждую ночь я открываю конверт" именно его.',
    'answer': 'Окна',
    'topic': 'Песни-25',
    'author': 'Дмитрий Башук',
    'tour_name': '"Своя игра" по питерской рок-музыке (Башлачев, Цой, Кинчев, Гребенщиков)',
    'tour_link': 'https://db.chgk.info/tour/spbrock',
    'episode': [13, 18],
    'perturbation': 'chegeka'
}

An example in English for illustration purposes:

{
    'question_id': 3665,
    'question': 'THIS MAN replaced John Lennon when the Beatles got together for the last time.',
    'answer': 'Julian Lennon',
    'topic': 'The Liverpool Four',
    'author': 'Bayram Kuliyev',
    'tour_name': 'Jeopardy!. Ashgabat-1996',
    'tour_link': 'https://db.chgk.info/tour/ash96sv',
    'episode': [16],
    'perturbation': 'chegeka'
}

Data Fields

question_id: an integer corresponding to the question id in the database
question: a string containing the question text
answer: a string containing the correct answer to the question
topic: a string containing the question category
author: a string with the full name of the author
tour_name: a string with the title of a tournament
tour link: a string containing the link to a tournament (None for the test set)
perturbation: a string containing the name of the perturbation applied to text. If no perturbation was applied, the dataset name is used
episode: a list of episodes in which the instance is used. Only used for the train set

Data Splits

The dataset consists of a training set with labeled examples and a test set in two configurations:

raw data: includes the original data with no additional sampling
episodes: data is split into evaluation episodes and includes several perturbations of test for robustness evaluation

Test Perturbations

Each training episode in the dataset corresponds to seven test variations, including the original test data and six adversarial test sets, acquired through the modification of the original test through the following text perturbations:

ButterFingers: randomly adds noise to data by mimicking spelling mistakes made by humans through character swaps based on their keyboard distance
Emojify: replaces the input words with the corresponding emojis, preserving their original meaning
EDA_delete: randomly deletes tokens in the text
EDAswap: randomly swaps tokens in the text
BackTranslation: generates variations of the context through back-translation (ru -> en -> ru)
AddSent: generates extra words or a sentence at the end of the question

General Statistics

The following table contains the number of examples in each data split:

Split	Size (Original/Perturbed)
Train.raw	29376
Test.raw	520
Train.episodes	49
Test.episodes	520 / 3640

Original - original test data without adversarial perturbations
Perturbed - perturbed test, containing both original data and its perturbations

Dataset Creation

Data Source

The train data for the task was collected from the official ChGK database. Since that the database is open and its questions are easily accessed via search machines, a pack of unpublished questions written by authors of ChGK was prepared to serve as a closed test set.

Data Collection

For information on the data collection procedure, please, refer to Mikhalkova (2021).