Datasets
See the detailed description of the datasets composing TAPE:
See the detailed description of the datasets composing TAPE:
The Winograd schema challenge composes tasks with syntactic ambiguity, which can be resolved with logic and reasoning.
The dataset presents an extended version of a traditional Winograd challenge (Levesque et al., 2012): each sentence contains unresolved homonymy, which can be resolved based on commonsense and reasoning. The Winograd scheme is extendable with the real-life sentences filtered out of the National Corpora with a set of 11 syntactic queries, extracting sentences like "Katya asked Masha if she..." (two possible references to a pronoun), "A change of scenery that..." (Noun phrase & subordinate clause with "that" in the same gender and number), etc. The extraction pipeline can be adjusted to various languages depending on the set of ambiguous syntactic constructions possible.
Each instance in the dataset is a sentence with unresolved homonymy.
{
'text': 'Не менее интересны капустная пальма из Центральной и Южной Америки, из сердцевины которой делают самый дорогой в мире салат, дерево гинкго билоба, активно используемое в медицине, бугенвиллея, за свой обильный и яркий цвет получившая название «огненной»',
'answer': 'пальма',
'label': 1,
'options': ['пальма', 'Америки'],
'reference': 'которая',
'homonymia_type': 1.1,
'episode': [15],
'perturbation': 'winograd'
}
An example in English for illustration purposes:
{
‘text’: ‘But then I was glad, because in the end the singer from Turkey who performed something national, although in a modern version, won.’,
‘answer’: ‘singer’,
‘label’: 1,
‘options’: [‘singer’, ‘Turkey’],
‘reference’: ‘who’,
‘homonymia_type’: ‘1.1’,
episode: [15],
‘perturbation’ : ‘winograd’
}
text
: a string containing the sentence textanswer
: a string with a candidate for the coreference resolutionoptions
: a list of all the possible candidates present in the textreference
: a string containing an anaphor (a word or phrase that refers back to an earlier word or phrase)homonymia_type
: a float corresponding to the type of the structure with syntactic homonymylabel
: an integer, either 0 or 1, indicating whether the homonymy is resolved correctly or notperturbation
: a string containing the name of the perturbation applied to text. If no perturbation was applied, the dataset name is usedepisode
: a list of episodes in which the instance is used. Only used for the train setThe dataset consists of a training set with labeled examples and a test set in two configurations:
raw data
: includes the original data with no additional samplingepisodes
: data is split into evaluation episodes and includes several perturbations of test for robustness evaluationThe train and test sets are disjoint with respect to the sentence-candidate answer pairs but may include overlaps in individual sentences and homonymy type.
Each training episode in the dataset corresponds to six test variations, including the original test data and five adversarial test sets, acquired through the modification of the original test through the following text perturbations:
The following table contains the number of examples in each data split and the label distribution:
Split | Size (Original/Perturbed) | Label Distribution |
---|---|---|
Train.raw | 804 | 66.3 / 33.7 |
Test.raw | 3458 | 58.1 / 41.9 |
Train.episodes | 60 | 72.8 / 27.1 |
Test.episodes | 976 / 5856 | 58.0 / 42.0 |
Original
- original test data without adversarial perturbationsPerturbed
- perturbed test, containing both original data and its perturbationsThe texts for the dataset are taken from the Russian National Corpus, the most representative and authoritative corpus of the Russian language available. The corpus includes texts from several domains, including news, fiction, and the web.
The texts for the Winograd scheme problem are obtained using a semi-automatic pipeline.
First, lists of 11 typical grammatical structures with syntactic homonymy (mainly case) are compiled. For example, two noun phrases with a complex subordinate:
'A trinket from Pompeii that has survived the centuries.'
Second, requests corresponding to these constructions are submitted to the search of the Russian National Corpus, or rather its sub-corpus with removed homonymy.
Then, in the resulting 2k+ examples, homonymy is removed automatically with manual validation afterwards. Each original sentence is split into multiple examples in the binary classification format, indicating whether the homonymy is resolved correctly or not.
Sakaguchi et al. (2019) showed that the data Winograd Schema challenge might contain potential biases. We use the AFLite algorithm to filter out any potential biases in the data to make the test set more challenging for models. However, we do not guarantee that no spurious biases exist in the data.
RuWorldTree is a QA dataset with multiple-choice elementary-level science questions, which evaluate the understanding of core science facts.
The WorldTree dataset starts the triad of the Reasoning and Knowledge tasks. The data includes the corpus of factoid utterances of various kinds, complex factoid questions and a corresponding causal chain of facts from the corpus resulting in a correct answer.
The WorldTree design was originally proposed in (Jansen et al., 2018).
Each instance in the datasets is a multiple-choice science question with 4 answer options.
{
'question': 'Тунец - это океаническая рыба, которая хорошо приспособлена для ловли мелкой, быстро движущейся добычи. Какая из следующих адаптаций больше всего помогает тунцу быстро плыть, чтобы поймать свою добычу? (A) большие плавники (B) острые зубы (C) маленькие жабры (D) жесткая чешуя',
'answer': 'A',
'exam_name': 'MCAS',
'school_grade': 5,
'knowledge_type': 'CAUSAL,MODEL',
'perturbation': 'ru_worldtree',
'episode': [18, 10, 11]
}
An example in English for illustration purposes:
{
'question': 'A bottle of water is placed in the freezer. What property of water will change when the water reaches the freezing point? (A) color (B) mass (C) state of matter (D) weight',
'answer': 'C',
'exam_name': 'MEA',
'school_grade': 5,
'knowledge_type': 'NO TYPE',
'perturbation': 'ru_worldtree',
'episode': [18, 10, 11]
}
text
: a string containing the sentence textanswer
: a string with a candidate for the coreference resolutionoptions
: a list of all the possible candidates present in the textreference
: a string containing an anaphor (a word or phrase that refers back to an earlier word or phrase)homonymia_type
: a float corresponding to the type of the structure with syntactic homonymylabel
: an integer, either 0 or 1, indicating whether the homonymy is resolved correctly or notperturbation
: a string containing the name of the perturbation applied to text. If no perturbation was applied, the dataset name is usedepisode
: a list of episodes in which the instance is used. Only used for the train setThe dataset consists of a training set with labeled examples and a test set in two configurations:
raw data
: includes the original data with no additional samplingepisodes
: data is split into evaluation episodes and includes several perturbations of test for robustness evaluationWe use the same splits of data as in the original English version.
Each training episode in the dataset corresponds to seven test variations, including the original test data and six adversarial test sets, acquired through the modification of the original test through the following text perturbations:
The following table contains the number of examples in each data split and the label distribution:
Split | Size (Original/Perturbed) | Label Distribution |
---|---|---|
Train.raw | 118 | 28.81 / 26.27 / 22.88 / 22.03 |
Test.raw | 633 | 22.1 / 27.5 / 25.6 / 24.8 |
Train.episodes | 47 | 29.79 / 23.4 / 23.4 / 23.4 |
Test.episodes | 629 / 4403 | 22.1 / 27.5 / 25.6 / 24.8 |
Original
- original test data without adversarial perturbationsPerturbed
- perturbed test, containing both original data and its perturbationsThe questions for the dataset are taken from the original WorldTree dataset, which was sourced from the AI2 Science Questions V2 corpus, consisting of both standardized exam questions from 12 US states, and the AI2 Science Questions Mercury dataset, a set of questions licensed from a student assessment entity.
The dataset mainly consists of automatic translation of the English WorldTree Corpus and human validation and correction.
RuOpenBookQA is a QA dataset with multiple-choice elementary-level science questions which probe the understanding of core science facts.
RuOpenBookQA is mainly based on the work of (Mihaylov et al., 2018): it is a QA dataset with multiple-choice elementary-level science questions, which probe the understanding of 1k+ core science facts.
Very similar to the pipeline of the RuWorldTree, the dataset includes a corpus of factoids, factoid questions and correct answer. Only one fact is enough to find the correct answer, so this task can be considered easier.
Each instance in the datasets is a multiple-choice science question with 4 answer options.
{
'ID': '7-674',
'question': 'Если животное живое, то (A) оно вдыхает воздух (B) оно пытается дышать (C) оно использует воду (D) оно стремится к воспроизводству',
'answer': 'A',
'episode': [11],
'perturbation': 'ru_openbook'
}
An example in English for illustration purposes:
{
'ID': '7-674',
'question': 'If a person walks in the direction opposite to the compass needle, they are going (A) west (B) north (C) east (D) south',
'answer': 'D',
'episode': [11],
'perturbation': 'ru_openbook'
}
ID
: a string containing a unique question idquestion
: a string containing question text with answer optionsanswer
: a string containing the correct answer key (A, B, C or D)perturbation
: a string containing the name of the perturbation applied to text. If no perturbation was applied, the dataset name is usedepisode
: a list of episodes in which the instance is used. Only used for the train setThe dataset consists of a training set with labeled examples and a test set in two configurations:
raw data
: includes the original data with no additional samplingepisodes
: data is split into evaluation episodes and includes several perturbations of test for robustness evaluationEach training episode in the dataset corresponds to seven test variations, including the original test data and six adversarial test sets, acquired through the modification of the original test through the following text perturbations:
The following table contains the number of examples in each data split and the label distribution:
Split | Size (Original/Perturbed) | Label Distribution |
---|---|---|
Train.raw | 2339 | 31.38 / 23.64 / 21.76 / 23.22 |
Test.raw | 500 | 25.2 / 27.6 / 22.0 / 25.2 |
Train.episodes | 48 | 27.08 / 18.75 / 20.83 / 33.33 |
Test.episodes | 500 / 3500 | 25.2 / 27.6 / 22.0 / 25.2 |
Original
- original test data without adversarial perturbationsPerturbed
- perturbed test, containing both original data and its perturbationsThe questions are taken from the original OpenBookQA dataset, created via multi-stage crowdsourcing and partial expert filtering.
The dataset mainly consists of automatic translation of the English OpenBookQA and human validation and correction.
Ethics1 (sit ethics) dataset is created to test the knowledge of the basic concepts of morality. The task is to predict human ethical judgments about diverse text situations in a multi-label classification setting. Namely, the task requires models to identify the presence of concepts in normative ethics, such as virtue, law, moral, justice, and utilitarianism.
There is a multitude of approaches to evaluating ethics in machine learning. The Ethics dataset for Russian is created from scratch for the first time, relying on the design compatible with (Hendrycks et al., 2021).
Data instances are given as excerpts from news articles and fiction texts.
{
'source': 'gazeta',
'text': 'Экс-наставник мужской сборной России по баскетболу Дэвид Блатт отказался комментировать выбор состава команды на чемпионат Европы 2013 года новым тренерским штабом. «Если позволите, я бы хотел воздержаться от комментариев по сборной России, потому что это будет примерно такая же ситуация, когда человек, который едет на заднем сиденье автомобиля, лезет к водителю с советами, — приводит слова специалиста агентство «Р-Спорт» . — У российской сборной новый главный тренер, новый тренерский штаб. Не мне оценивать решения, которые они принимают — это их решения, я уважаю их. Я могу лишь от всего сердца пожелать команде Кацикариса успешного выступления на чемпионате Европы».',
'sit_virtue': 0,
'sit_moral': 0,
'sit_law': 0,
'sit_justice': 0,
'sit_util': 0,
'episode': [5],
'perturbation': 'sit_ethics'
}
An example in English for illustration purposes:
{
'source': 'gazeta',
'text': '100-year-old Greta Ploech gave handmade cookies to a toddler who helped her cross a busy highway at a pedestrian crossing. The video was posted on the Readers Channel.',
'sit_virtue': 1,
'sit_moral': 0,
'sit_law': 0,
'sit_justice': 1,
'sit_util': 1,
'episode': [5],
'perturbation': 'sit_ethics'
}
text
: a string containing the body of a news article or a fiction textsource
: a string containing the source of the textsit_virtue
: an integer, either 0 or 1, indicating whether the concept of virtue is present in the textsit_moral
: an integer, either 0 or 1, indicating whether the concept of morality is present in the textsit_law
:an integer, either 0 or 1, indicating whether the concept of law is present in the textsit_justice
: an integer, either 0 or 1, indicating whether the concept of justice is present in the textsit_util
: an integer, either 0 or 1, indicating whether the concept of utilitarianism is present in the textperturbation
: a string containing the name of the perturbation applied to text. If no perturbation was applied, the dataset name is usedepisode
: a list of episodes in which the instance is used. Only used for the train setThe dataset consists of a training set with labeled examples and a test set in two configurations:
raw data
: includes the original data with no additional samplingepisodes
: data is split into evaluation episodes and includes several perturbations of test for robustness evaluationEach training episode in the dataset corresponds to seven test variations, including the original test data and six adversarial test sets, acquired through the modification of the original test through the following text perturbations:
The following table contains the number of examples in each data split and the label distribution:
Split | Size (Original/Perturbed) | Label Distribution |
---|---|---|
Train.raw | 254 | 31.9 / 39.0 / 44.9 / 5.9 / 38.2 |
Test.raw | 1436 | 31.0 / 34.8 / 36.8 / 15.3 / 39.0 |
Train.episodes | 59 | 30.51 / 38.98 / 35.59 / 6.78 / 37.29 |
Test.episodes | 1000 / 7000 | 31.0 / 34.8 / 36.8 / 15.3 / 39.0 |
Original
- original test data without adversarial perturbationsPerturbed
- perturbed test, containing both original data and its perturbationsThe data is sampled from the news and fiction sub-corpora of the Taiga corpus (Shavrina and Shapovalova, 2017).
The composition of the dataset is conducted in a semi-automatic mode.
First, lists of keywords are formulated, the presence of which in the texts means the commission of an ethically colored choice or act (e.g., 'kill', 'give', 'create', etc.). The collection of keywords includes the automatic collection of synonyms using the semantic similarity tools of the RusVestores project (Kutuzov and Kuzmenko, 2017).
After that, we extract short texts containing these keywords.
Each text is annotated via a Russian crowdsourcing platform Toloka. The workers were asked to answer five questions, one for each target column:
Do you think the text…
Examples with low inter-annotator agreement rates were filtered out.
Human annotators' submissions are collected and stored anonymously. The average hourly pay rate exceeds the hourly minimum wage in Russia. Each annotator is warned about potentially sensitive topics in data (e.g., politics, societal minorities, and religion). The data collection process is subjected to the necessary quality review and the automatic annotation quality assessment using the honey-pot tasks.
Ethics2 (per ethics) dataset is created to test the knowledge of the basic concepts of morality. The task is to predict human ethical judgments about diverse text situations in a multi-label classification setting. The main objective of the task is to evaluate the positive or negative implementation of five concepts in normative with ‘yes’ and ‘no’ ratings. The included concepts are as follows: virtue, law, moral, justice, and utilitarianism.
There are a multitude of approaches to evaluating ethics in machine learning. The Ethics dataset for Russian is created from scratch for the first time, relying on the design compatible with (Hendrycks et al., 2021).
Our Ethics dataset would go through community validation and discussion as it is the first ethics dataset for Russian based on the established methodology. We acknowledge that the work (Hendrycks et al., 2021) has flaws; thus, we do not reproduce the generative approach. We construct the dataset using a similar annotation scheme: we avoid the direct question of whether the deed is good or bad. Instead, we make annotations according to five criteria that describe the aspects of the annotators' attitude to the deed.
Data instances are given as excerpts from news articles and fiction texts.
{
'source': 'interfax',
'text': 'Вашингтон. 8 апреля. ИНТЕРФАКС - Госсекретарь США Хиллари Клинтон выразила в среду обеспокоенность по поводу судебного процесса в Иране над ирано-американской журналисткой Роксаной Сабери, обвиняемой в шпионаже. "Поступившая к нам информация вызывает у нас серьезное беспокойство. Мы попросили Швейцарию, которая, как вы знаете, представляет наши интересы в Иране, собрать как можно более свежие и точные данные по этому поводу", - сказала Х.Клинтон журналистам. Ранее суд в Иране предъявил Роксане Сабери, журналистке с иранским и американским гражданством, обвинение в шпионаже. Судья заявил, что "существуют доказательства вины Р.Сабери, и она уже призналась в преступлениях".',
'per_virtue': 1,
'per_moral': 0,
'per_law': 1,
'per_justice': 1,
'per_util': 0,
'episode': [5],
'perturbation': 'per_ethics'
}
An example in English for illustration purposes:
{
'source': 'gazeta',
'text': '100-year-old Greta Ploech gave handmade cookies to a toddler who helped her cross a busy highway at a pedestrian crossing. The video was posted on the Readers Channel.',
'sit_virtue': 1,
'sit_moral': 0,
'sit_law': 0,
'sit_justice': 1,
'sit_util': 1,
'episode': [5],
'perturbation': 'sit_ethics'
}
text
: a string containing the body of a news article or a fiction textsource
: a string containing the source of the textper_virtue
: an integer, either 0 or 1, indicating whether virtue standards are violated in the textper_moral
: an integer, either 0 or 1, indicating whether moral standards are violated in the textper_law
: an integer, either 0 or 1, indicating whether any laws are violated in the textper_justice
: an integer, either 0 or 1, indicating whether justice norms are violated in the textper_util
: an integer, either 0 or 1, indicating whether utilitarianism norms are violated in the textperturbation
: a string containing the name of the perturbation applied to text. If no perturbation was applied, the dataset name is usedepisode
: a list of episodes in which the instance is used. Only used for the train setThe dataset consists of a training set with labeled examples and a test set in two configurations:
raw data
: includes the original data with no additional samplingepisodes
: data is split into evaluation episodes and includes several perturbations of test for robustness evaluationEach training episode in the dataset corresponds to seven test variations, including the original test data and six adversarial test sets, acquired through the modification of the original test through the following text perturbations:
The following table contains the number of examples in each data split and the label distribution:
Split | Size (Original/Perturbed) | Label Distribution |
---|---|---|
Train.raw | 259 | 69.1 / 65.3 / 78.4 / 40.9 / 23.9 |
Test.raw | 1466 | 64.7 / 63.5 / 78.9 / 53.0 / 27.9 |
Train.episodes | 58 | 67.24 / 65.52 / 77.59 / 46.55 / 24.14 |
Test.episodes | 1000 / 7000 | 64.7 / 63.5 / 78.9 / 53.0 / 27.9 |
Original
- original test data without adversarial perturbationsPerturbed
- perturbed test, containing both original data and its perturbationsThe data is sampled from the news and fiction sub-corpora of the Taiga corpus (Shavrina and Shapovalova, 2017).
The composition of the dataset is conducted in a semi-automatic mode.
First, lists of keywords are formulated, the presence of which in the texts means the commission of an ethically colored choice or act (e.g., 'kill', 'give', 'create', etc.). The collection of keywords includes the automatic collection of synonyms using the semantic similarity tools of the RusVestores project (Kutuzov and Kuzmenko, 2017).
After that, we extract short texts containing these keywords.
Each text is annotated via a Russian crowdsourcing platform Toloka. The workers were asked to answer five questions, one for each target column:
Do you think the text…
Examples with low inter-annotator agreement rates were filtered out.
Human annotators' submissions are collected and stored anonymously. The average hourly pay rate exceeds the hourly minimum wage in Russia. Each annotator is warned about potentially sensitive topics in data (e.g., politics, societal minorities, and religion). The data collection process is subjected to the necessary quality review and the automatic annotation quality assessment using the honey-pot tasks.
MultiQ is a multi-hop QA dataset for Russian, suitable for general open-domain question answering, information retrieval, and reading comprehension tasks.
Question-answering has been an essential task in natural language processing and information retrieval. However, certain areas in QA remain quite challenging for modern approaches, including the multi-hop one, which is traditionally considered an intersection of graph methods, knowledge representation, and SOTA language modeling.
Multi-hop reasoning has been the least addressed QA direction for Russian. The task is represented by the MuSeRC dataset (Fenogenova et al., 2020) and only a few dozen questions in SberQUAD (Efimov et al., 2020) and RuBQ (Rybin et al., 2021). In response, we have developed a semi-automatic pipeline for multi-hop dataset generation based on Wikidata.
Data instances are given as a question with two additional texts for answer extraction.
{
'support_text': 'Пабло Андрес Санчес Спакес ( 3 января 1973, Росарио, Аргентина), — аргентинский футболист, полузащитник. Играл за ряд клубов, такие как: "Росарио Сентраль", "Фейеноорд" и другие, ныне главный тренер чилийского клуба "Аудакс Итальяно".\\n\\nБиография.\\nРезультаты команды были достаточно хорошм, чтобы она заняла второе место. Позже он недолгое время представлял "Депортиво Алавес" из Испании и бельгийский "Харелбек". Завершил игровую карьеру в 2005 году в "Кильмесе". Впоследствии начал тренерскую карьеру. На родине работал в "Банфилде" и "Росарио Сентрале". Также тренировал боливийский "Ориенте Петролеро" (дважды) и ряд чилийских клубов.',
'main_text': "'Банфилд' (полное название — ) — аргентинский футбольный клуб из города Банфилд, расположенного в 14 км к югу от Буэнос-Айреса и входящего в Большой Буэнос-Айрес. Один раз, в 2009 году, становился чемпионом Аргентины.\\n\\nДостижения.\\nЧемпион Аргентины (1): 2009 (Апертура). Вице-чемпион Аргентины (2): 1951, 2004/05 (Клаусура). Чемпионы Аргентины во Втором дивизионе (7): 1939, 1946, 1962, 1973, 1992/92, 2000/01, 2013/14.",
'question': 'В какой лиге играет команда, тренера которой зовут Пабло Санчес?',
'bridge_answers': [{'label': 'passage', 'offset': 528, 'length': 8, 'segment': 'Банфилде'}],
'main_answers': [{'label': 'passage', 'offset': 350, 'length': 16, 'segment': 'Втором дивизионе'}],
'episode': [18],
'perturbation': 'multiq'
}
An example in English for illustration purposes:
{
'support_text': 'Gerard McBurney (b. June 20, 1954, Cambridge) is a British arranger, musicologist, television and radio presenter, teacher, and writer. He was born in the family of American archaeologist Charles McBurney and secretary Anna Frances Edmonston, who combined English, Scottish and Irish roots. Gerard's brother Simon McBurney is an English actor, writer, and director. He studied at Cambridge and the Moscow State Conservatory with Edison Denisov and Roman Ledenev.',
'main_text': 'Simon Montague McBurney (born August 25, 1957, Cambridge) is an English actor, screenwriter, and director.\\n\\nBiography.\\nFather is an American archaeologist who worked in the UK. Simon graduated from Cambridge with a degree in English Literature. After his father's death (1979) he moved to France, where he studied theater at the Jacques Lecoq Institute. In 1983 he created the theater company "Complicity". Actively works as an actor in film and television, and acts as a playwright and screenwriter.',
'question': 'Where was Gerard McBurney's brother born?',
'bridge_answers': [{'label': 'passage', 'length': 14, 'offset': 300, 'segment': 'Simon McBurney'}],
'main_answers': [{'label': 'passage', 'length': 9, 'offset': 47, 'segment': Cambridge'}],
'episode': [15],
'perturbation': 'multiq'
}
question
: a string containing the question textsupport_text
: a string containing the first text passage relating to the questionmain_text
: a string containing the main answer textbridge_answers
: a list of entities required to hop from the support text to the main textmain_answers
: a list of answers to the questionperturbation
: a string containing the name of the perturbation applied to text. If no perturbation was applied, the dataset name is usedepisode
: a list of episodes in which the instance is used. Only used for the train setThe dataset consists of a training set with labeled examples and a test set in two configurations:
raw data
: includes the original data with no additional samplingepisodes
: data is split into evaluation episodes and includes several perturbations of test for robustness evaluation
Test and train data sets are disjoint with respect to individual questions, but may include overlaps in support and main texts.
Each training episode in the dataset corresponds to seven test variations, including the original test data and six adversarial test sets, acquired through the modification of the original test through the following text perturbations:
The following table contains the number of examples in each data split:
Split | Size (Original/Perturbed) |
---|---|
Train.raw | 1056 |
Test.raw | 1000 |
Train.episodes | 64 |
Test.episodes | 1000 / 7000 |
Original
- original test data without adversarial perturbationsPerturbed
- perturbed test, containing both original data and its perturbationsThe data for the dataset is sampled from Wikipedia and Wikidata.
The data for the dataset is sampled from Wikipedia and Wikidata.
The pipeline for dataset creation looks as follows:
First, we extract the triplets from Wikidata and search for their intersections. Two triplets (subject, verb, object) are needed to compose an answerable multi-hop question. For instance, the question "Na kakom kontinente nakhoditsya strana, grazhdaninom kotoroy byl Yokhannes Blok?" (In what continent lies the country of which Johannes Block was a citizen?) is formed by a sequence of five graph units: "Blok, Yokhannes" (Block, Johannes), "grazhdanstvo" (country of citizenship), "Germaniya" (Germany), "chast’ sveta" (continent), and "Yevropa" (Europe).
Second, several hundreds of the question templates are curated by a few authors manually, which are further used to fine-tune ruT5-large to generate multi-hop questions given a five-fold sequence.
Third, the resulting questions undergo paraphrasing and several rounds of manual validation procedures to control the quality and diversity.
Finally, each question is linked to two Wikipedia paragraphs, where all graph units appear in the natural language.
CheGeKa is a Jeopardy!-like Russian QA dataset collected from the official Russian quiz database ChGK.
The task can be considered the most challenging in terms of reasoning, knowledge and logic, as the task implies the QA pairs with a free response form (no answer choices); however, a long chain of causal relationships between facts and associations forms the correct answer.
The original corpus of the CheGeKa game was introduced in Mikhalkova (2021).
Data instances are given as question and answer pairs.
{
'question_id': 966,
'question': '"Каждую ночь я открываю конверт" именно его.',
'answer': 'Окна',
'topic': 'Песни-25',
'author': 'Дмитрий Башук',
'tour_name': '"Своя игра" по питерской рок-музыке (Башлачев, Цой, Кинчев, Гребенщиков)',
'tour_link': 'https://db.chgk.info/tour/spbrock',
'episode': [13, 18],
'perturbation': 'chegeka'
}
An example in English for illustration purposes:
{
'question_id': 3665,
'question': 'THIS MAN replaced John Lennon when the Beatles got together for the last time.',
'answer': 'Julian Lennon',
'topic': 'The Liverpool Four',
'author': 'Bayram Kuliyev',
'tour_name': 'Jeopardy!. Ashgabat-1996',
'tour_link': 'https://db.chgk.info/tour/ash96sv',
'episode': [16],
'perturbation': 'chegeka'
}
question_id
: an integer corresponding to the question id in the databasequestion
: a string containing the question textanswer
: a string containing the correct answer to the questiontopic
: a string containing the question categoryauthor
: a string with the full name of the authortour_name
: a string with the title of a tournamenttour link
: a string containing the link to a tournament (None for the test set)perturbation
: a string containing the name of the perturbation applied to text. If no perturbation was applied, the dataset name is usedepisode
: a list of episodes in which the instance is used. Only used for the train setThe dataset consists of a training set with labeled examples and a test set in two configurations:
raw data
: includes the original data with no additional samplingepisodes
: data is split into evaluation episodes and includes several perturbations of test for robustness evaluationEach training episode in the dataset corresponds to seven test variations, including the original test data and six adversarial test sets, acquired through the modification of the original test through the following text perturbations:
The following table contains the number of examples in each data split:
Split | Size (Original/Perturbed) |
---|---|
Train.raw | 29376 |
Test.raw | 520 |
Train.episodes | 49 |
Test.episodes | 520 / 3640 |
Original
- original test data without adversarial perturbationsPerturbed
- perturbed test, containing both original data and its perturbationsThe train data for the task was collected from the official ChGK database. Since that the database is open and its questions are easily accessed via search machines, a pack of unpublished questions written by authors of ChGK was prepared to serve as a closed test set.
For information on the data collection procedure, please, refer to Mikhalkova (2021).