TAPE Benchmark

About TAPE

Assessing Few-shot Russian Language Understanding.

TAPE (Text Attack and Perturbation Evaluation) is a novel benchmark for few-shot Russian language understanding evaluation that includes six complex NLU tasks, covering multi-hop reasoning, ethical concepts, logic and commonsense knowledge. TAPE's design focuses on systematic zero-shot and few-shot NLU evaluation across different axes:

subpopulations for nuanced interpretation
linguistic-oriented adversarial attacks and perturbations for analysing robustness.

TAPE's general data collection principles are based on combining the "intellectual abilities" needed to solve GLUE-like tasks, ranging from world knowledge to logic and commonsense reasoning. Following the GLUE format, we have built six new datasets from the ground up, each requiring a model to combine at least two of the following skills:

Logical reasoning (Winograd scheme)
Reasoning with world knowledge (RuOpenBookQA, RuWorldTree, MultiQ, and CheGeKa)
Multi-hop reasoning (MultiQ)
Ethical judgments and reasoning (Ethics)

TAPE Toolkit

(a) D_test is passed to the adversarial framework to create the adversarial D_test that includes the original and adversarial examples.
(b) We randomly sample five sets of demonstration examples from D_train for each k ∈ {1, 4, 8}. In the zero-shot scenario, we skip this stage.
(c) After that, we merge the demonstrations, when applicable, with the examples from the adversarial D_test to construct evaluation episodes.
(d) Each episode is used to obtain predictions from the model.
(e) The performance is summarized in a diagnostic evaluation report.
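Steps (b) and (c) can be sketched in a few lines of Python. This is a minimal illustration, not the toolkit's actual code: the function name build_episodes, the episode dictionary layout, and the fixed seed are all assumptions made for the example, and the adversarial D_test is represented as a plain list.

```python
import random


def build_episodes(train, adversarial_test, shots=(0, 1, 4, 8), n_sets=5, seed=0):
    """Pair every test example with every sampled demonstration set.

    Hypothetical helper: the name, arguments, and output layout are
    illustrative assumptions, not part of the TAPE codebase.
    """
    rng = random.Random(seed)
    episodes = []
    for k in shots:
        if k == 0:
            # Zero-shot: the sampling stage is skipped, so there is a
            # single, empty demonstration set.
            demo_sets = [[]]
        else:
            # Few-shot: draw n_sets independent sets of k demonstrations.
            demo_sets = [rng.sample(train, k) for _ in range(n_sets)]
        for set_id, demos in enumerate(demo_sets):
            for example in adversarial_test:
                episodes.append(
                    {
                        "k": k,
                        "set_id": set_id,
                        "demonstrations": demos,
                        "example": example,
                    }
                )
    return episodes


# Toy data standing in for D_train and the adversarial D_test.
train = [f"train-{i}" for i in range(10)]
adversarial_test = ["original-0", "perturbed-0"]
episodes = build_episodes(train, adversarial_test)
print(len(episodes))  # 1*2 zero-shot + 3*5*2 few-shot = 32
```

Each resulting episode would then be formatted into a prompt and passed to the model, as in step (d).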

The perturbations, included in the framework, can be divided into two categories:

Word-Level Perturbations: spelling (mimicking spelling mistakes) and modality (replacement of the input with emojis)
Sentence-Level Perturbations: random (token deletion and swaps), distraction (generation of additional text) and paraphrases (generating context variations)
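To make the "random" category concrete, the sketch below shows one simple way token deletion and token swaps could be implemented. It is not the implementation used in the framework: the function names, the whitespace tokenization, the adjacent-swap strategy, and the probability parameter are all assumptions chosen for illustration.

```python
import random


def delete_tokens(tokens, p, rng):
    """Drop each token independently with probability p (hypothetical helper)."""
    kept = [t for t in tokens if rng.random() >= p]
    # Keep at least one token so the perturbed input is never empty.
    return kept if kept else list(tokens[:1])


def swap_adjacent(tokens, n_swaps, rng):
    """Swap n_swaps randomly chosen pairs of neighbouring tokens."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i = rng.randrange(len(out) - 1)
        out[i], out[i + 1] = out[i + 1], out[i]
    return out


rng = random.Random(0)
# Whitespace splitting is a simplifying assumption; a real pipeline
# would use a proper tokenizer for Russian.
tokens = "Мама мыла раму вчера вечером".split()
print(" ".join(delete_tokens(tokens, p=0.2, rng=rng)))
print(" ".join(swap_adjacent(tokens, n_swaps=1, rng=rng)))
```

Passing an explicit random.Random instance keeps the perturbations reproducible, which matters when the original and perturbed examples are compared in an evaluation report.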

Resources

TAPE: Evaluation framework

All of the code for model evaluation (including evaluation reports and subpopulation analysis), as well as the code to reproduce the baselines mentioned in the paper, is available on our GitHub page.

TAPE: Datasets

TAPE's datasets are available through the HuggingFace dataset library. A detailed description of each task can be found here.

RuTransform: Framework for adversarial attacks and perturbations

Additionally, we release RuTransform, a Python framework for adversarial attacks and text data augmentation for Russian. The framework is a stand-alone tool for adversarial data creation and model evaluation for Russian that can also be used for adversarial model training and data augmentation. More information on the framework can be found in the RuTransform GitHub repository.

Cite us:

@article{taktasheva2022tape,
	title={TAPE: Assessing Few-shot Russian Language Understanding},
	author={Taktasheva, Ekaterina and Shavrina, Tatiana and Fenogenova, Alena and Shevelev, Denis and Katricheva, Nadezhda and Tikhonova, Maria and Akhmetgareeva, Albina and Zinkevich, Oleg and Bashmakova, Anastasiia and Iordanskaia, Svetlana and others},
	journal={arXiv preprint arXiv:2210.12813},
	year={2022}
}

Leaderboard

Report your results: If you have obtained new results with TAPE, please see the submission instructions here. For any inquiries, send an email to tapebenchmark@gmail.com.

The goal of this leaderboard is to collect research works under the evaluation framework and to measure the true progress of the field. We therefore encourage you to attach a link to reproducible source code. Thank you!

Metrics: F1-score/accuracy (EM for CheGeKa and MultiQ). Abbreviations: RWT - RuWorldTree; ROBQA - RuOpenBookQA.