Deutsch im Studium: Lernerkorpus | German at university: learner corpus

Learner corpora are systematic collections of spoken or written texts produced by learners of a language (Granger et al. 2015). They are not only useful for researching L2 acquisition, but also increasingly common in language-testing research and language instruction. DISKO (Deutsch im Studium: Lernerkorpus) is a learner corpus for L2 German; other examples are MERLIN, Falko (Lüdeling et al. 2008), and BeMaTaC (Sauer 2013).

DISKO has several subcorpora, all of which contain texts produced in university-entrance language tests. Most international students at German universities are required to take an entrance examination of this kind to verify their “sprachliche Studierfähigkeit”, i.e. their academic language proficiency. More detailed information on the subcorpora can be found in the corpus manual (Muntschick et al. 2020, in German).

Core subcorpora of DISKO

At the core of DISKO are two longitudinal subcorpora with texts produced in the context of the SpraStu project: DISKO_L2 and DISKO_L1.

The DISKO_L2 texts (n = 510) were written by international students enrolled in BA programs at the universities of Leipzig and Würzburg, all of whom had learned German as a foreign language. Up to three texts are available for each L2 participant (data points: beginning of the first semester, one year later, and two years later). All texts were written in response to one writing task of the TestDaF, a standardized university-entrance exam that assesses levels of German language proficiency corresponding to the CEFR levels and serves as a certificate of academic language proficiency (HRK/KMK 2004, 2015).
The DISKO_L1 texts (n = 85) were produced by BA students at the same universities who were L1 speakers of German; the same task was used.

The authors of these texts were participants in the SpraStu project. As a result, extensive metadata are available, e.g. regarding their linguistic and socio-economic backgrounds (see SpraStu design section; further information is available in Wisniewski et al., forthcoming). Furthermore, while the DISKO_L1 and DISKO_L2 corpus texts stem from a TestDaF writing task that was rated by professional TestDaF raters, the students in the SpraStu project also took other language tests (the onSET as well as listening, reading, and vocabulary tests), the results of which are part of the corpus metadata.

Other subcorpora

In addition to this core data, DISKO also contains two further subcorpora. While these also include written texts produced in the context of university-entrance language examinations, they are more loosely tied to the SpraStu project.

DISKO_DSH is a very small subcorpus that consists of 24 texts written by SpraStu participants prior to their participation in the project for the purpose of their own university admission. These texts stem from the university-entrance language test DSH (Deutsche Sprachprüfung für den Hochschulzugang) that some participants had taken. Thus, while DISKO_L1 and DISKO_L2 were compiled for research purposes, DISKO_DSH is an ex-post collection of authentic high-stakes, written productions by some of the project participants.

DISKO_WebTestDaF is the fourth subcorpus. Unlike the other subcorpora, its texts were not produced by SpraStu participants, meaning that fewer metadata are available. This pseudolongitudinal subcorpus contains 479 texts elicited in a field test of the recently introduced digital TestDaF (TestDaF-Institut n.d.). The learner texts were written in response to one integrated and one independent writing task (2 tasks per learner). A proficiency classification is available via an onSET test (TestDaF-Institut 2018) that the participants took during the field test, the results of which are part of the metadata. However, no ratings of the corpus texts themselves are available, meaning that the classifications are less finely grained than in the core corpora, where each corpus text was rated.

The common feature of all DISKO texts is the fact that they were produced in language-assessment contexts targeting the construct of the “sprachliche Studierfähigkeit,” i.e., the academic language proficiency required to study at a German university.

Corpus size

DISKO contains n=1098 texts with 397,082 tokens written by n=695 participants. The texts are distributed across four subcorpora, i.e., the two core corpora DISKO_L2 and DISKO_L1, as well as DISKO_DSH and DISKO_WebTestDaF.
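
As a quick plausibility check, the text counts given for the four subcorpora in this description (510, 85, 24, and 479) can be summed and compared with the overall figure. The following is only a minimal sketch; the variable names are illustrative and not part of the corpus or its metadata.

```python
# Text counts per subcorpus as reported in this corpus description.
texts_per_subcorpus = {
    "DISKO_L2": 510,
    "DISKO_L1": 85,
    "DISKO_DSH": 24,
    "DISKO_WebTestDaF": 479,
}
total_texts = sum(texts_per_subcorpus.values())
total_tokens = 397_082  # overall token count reported above

# Mean text length across the whole corpus (not per subcorpus).
mean_tokens_per_text = total_tokens / total_texts

print(total_texts)
print(round(mean_tokens_per_text, 1))
```

Note that the corpus-wide mean hides considerable variation between subcorpora, so the per-subcorpus averages in Table 1 are the more informative figures.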

In Table 1, you can find information on tasks, elicitation contexts, and metadata for each subcorpus. The number of authors and texts, overall token numbers, and average number of tokens per text are also specified (including standard deviations, σ). In addition, the table includes a rough summary of the annotation tiers (for a detailed account, see downloads).


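Tasks
  • DISKO_L2: Living together or alone?
  • DISKO_L1: Living together or alone?
  • DISKO_DSH: Various, unknown tasks
  • DISKO_WebTestDaF: 1) Job mobility; 2) Bee death

Elicitation context
  • DISKO_L2: SpraStu (2017-19)
  • DISKO_L1: SpraStu (2017-18)
  • DISKO_WebTestDaF: Field test TestDaF (2018)

Tokens
  • DISKO_WebTestDaF: 1) 57,462; 2) 33,703

Tokens per text (Ø)
  • DISKO_L2: 470 (σ=136)
  • DISKO_L1: 639 (σ=125)
  • DISKO_DSH: 487 (σ=102)
  • DISKO_WebTestDaF: 1) 236 (σ=53); 2) 143 (σ=41)

Annotation tiers (without lemma & POS tiers)
  • DISKO_L2: tok, norm, TH1, borrowed, task, handwriting, macro
  • DISKO_L1: tok, norm, borrowed, task, handwriting, macro
  • DISKO_DSH: tok, handwriting, macro
  • DISKO_WebTestDaF: tok, norm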

Table 1: Overview of DISKO subcorpora
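
The DISKO_WebTestDaF figures in Table 1 can be cross-checked against each other, since the number of texts per task is roughly the token total divided by the mean text length. The sketch below assumes that 57,462 and 33,703 are the token totals for tasks 1 and 2. Because the reported means are rounded, the results are estimates only.

```python
# Token totals and rounded mean text lengths for the two
# DISKO_WebTestDaF tasks, taken from Table 1.
tasks = {
    "Job mobility": {"tokens": 57_462, "mean_tokens": 236},
    "Bee death": {"tokens": 33_703, "mean_tokens": 143},
}

# texts ≈ tokens / mean tokens per text (estimate, means are rounded)
estimated_texts = {
    name: round(t["tokens"] / t["mean_tokens"]) for name, t in tasks.items()
}

print(estimated_texts)
print(sum(estimated_texts.values()))
```

Under these assumptions the two estimates add up to 479, which agrees with the number of texts reported for this subcorpus.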


Metadata are available with regard to:

  • elicitation context (research project SpraStu)
  • corpus (administrative issues, corpus design, information regarding the annotations)
  • texts (elicitation context, ratings, transcription)
  • author (e.g., language-learning biography, language-test results)

In the download area you can find abridged versions of the metadata with the most important information for each subcorpus. In addition, a list of all metadata variables used in DISKO is available on this website. The full set of metadata will be available for download from the Research Data Repository of the Leibniz Institute for the German Language in Mannheim. Metadata will also be usable as filter variables in the search and visualization architecture ANNIS.

Building a digital learner corpus from the handwritten texts required a complex workflow, which is described in more detail in the corpus manual (Muntschick et al. 2020, chapter 2, in German).

Transcription guidelines were developed and extensively piloted. EXMARaLDA (Dulko) (Nolda 2019) was used as the transcription and annotation editor; it builds on the EXMARaLDA Partitur Editor (Schmidt 2002; Schmidt and Wörner 2014). Selected features of the learners’ handwriting were annotated early on in the transcription process, and all texts were anonymized in the transcripts and scans.

Automatic lemma, part-of-speech, and sentence-span annotations were added and corrected manually (n=74 texts) and/or semiautomatically for certain phenomena (n=1074 texts).

In addition, manual annotations are available for some of the transcribed texts. These take into account a limited set of linguistic characteristics and were only applied to parts of the corpora. Manual annotations include a normalized transcription tier (where unknown tokens were corrected ex post, n=1074 texts) and, for n = 92 texts, a minimal target hypothesis which reflects morphosyntactic and orthographic corrections. Further manual annotations link sections of the learner texts to the writing task (n=119 texts) and/or mark structures that were borrowed from the task rubrics (n=203 texts).

Inter- and intra-rater reliability were checked via a number of quantitative and qualitative procedures specified in the corpus manual (in German).

An overview of all annotation tiers available for each subcorpus can be found in the corpus manual and in the downloads (both in German).


DISKO has been published in the ANNIS environment of the Humboldt-Universität zu Berlin. Users of DISKO must have an academic account to get access.

ANNIS (Krause and Zeldes 2016) offers a broad range of corpus-linguistic query and analysis functions for texts and metadata.

To start using ANNIS and view some sample searches, please see chapter 3.2 of the corpus manual.

DISKO in the Research Data Repository at the Leibniz Institute for the German Language

Another option for working with DISKO will be to download the corpus from the Research Data Repository at the Leibniz Institute for the German Language (IDS) in Mannheim. DISKO will be available at this PID: https://hdl.handle.net/10932/00-0534-6404-3CE0-0001-3. In the Data Repository, DISKO will be available in exb, txt, pdf, and ANNIS formats.
Access will be possible via the Authentication and Authorization Infrastructure (AAI). Users who have no affiliation with a research institution listed in AAI can register at https://idm.clarin.eu/user/home to qualify for corpus access. As the IDS is part of the CLARIN infrastructure, the corpus will also be accessible in the Virtual Language Observatory via some core metadata.

DISKO access upon request

It is also possible to contact Katrin Wisniewski directly by e-mail for access to the corpus.

DISKO corpus manual

Task used in DISKO_L2 and DISKO_L1: Living together or alone?

Task 1 used in DISKO_WebTestDaF: Job mobility

Task 2 used in DISKO_WebTestDaF: Bee death

Selected metadata for DISKO_L2

Selected metadata for DISKO_L1

Selected metadata for DISKO_DSH

Selected metadata for DISKO_WebTestDaF

Metadata variables DISKO (complete)

Overview of transcription and annotation tiers used in DISKO

All DISKO downloads listed above as compressed zip file

To cite the DISKO corpus as a whole, please use the following text:

Wisniewski, K., Muntschick, E., Portmann, A. (forthcoming). Schreiben in der Studiersprache Deutsch: Das Lernerkorpus DISKO. In Wisniewski, K., Lenhard, W., Möhring, J., Spiegel, L. (eds.). Sprache und Studienerfolg bei Bildungsausländer/-innen. Münster: Waxmann.

To cite specific data from the corpus, please use the following link: https://hdl.handle.net/10932/00-0534-6404-3CE0-0001-3

If your citation rules require you to name the corpus editors, you can list these as Katrin Wisniewski, Elisabeth Muntschick, and Annette Portmann.
To cite text extracts, please use the document ID (e.g., DISKO_012_L2_T1_TDN3).
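
When working with many extracts, it can help to split document IDs into their underscore-separated parts. The following is a minimal sketch based only on the example ID above. The field names are hypothetical, the reading of the second part as a participant number is an assumption, and the meaning of the remaining parts (here `T1` and `TDN3`) is left open; consult the corpus manual for the actual ID scheme.

```python
from typing import NamedTuple


class DocumentId(NamedTuple):
    corpus: str       # e.g. "DISKO"
    participant: str  # e.g. "012" (assumed to be a participant number)
    subcorpus: str    # e.g. "L2"
    remainder: tuple  # e.g. ("T1", "TDN3"); not interpreted here


def parse_document_id(doc_id: str) -> DocumentId:
    """Split an ID such as 'DISKO_012_L2_T1_TDN3' on underscores."""
    parts = doc_id.split("_")
    if len(parts) < 3 or parts[0] != "DISKO":
        raise ValueError(f"unexpected document ID: {doc_id!r}")
    return DocumentId(parts[0], parts[1], parts[2], tuple(parts[3:]))


example = parse_document_id("DISKO_012_L2_T1_TDN3")
print(example.subcorpus, example.remainder)
```

IDs from other subcorpora may follow a different pattern, so the length check is deliberately lenient.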

Granger, Sylviane; Gilquin, Gaëtanelle; Meunier, Fanny (eds.) (2015): The Cambridge Handbook of Learner Corpus Research. Cambridge: Cambridge University Press (Cambridge Handbooks in Language and Linguistics).

HRK/KMK (eds.) (2004): Rahmenordnung über Deutsche Sprachprüfungen für das Studium an deutschen Hochschulen (RO-DT). Beschluss der HRK vom 08.06.2004 und der KMK vom 25.06.2004 i.d.F. der HRK vom 10.11.2015 und der KMK vom 12.11.2015. Accessible online at https://www.kmk.org/fileadmin/Dateien/veroeffentlichungen_beschluesse/2004/2004_06_25_RO_DT.pdf, last accessed 03/30/2022.

HRK/KMK (eds.) (2015): Rahmenordnung über Deutsche Sprachprüfungen für das Studium an deutschen Hochschulen (RO-DT). Accessible online at https://www.hrk.de/themen/internationales/internationale-studierende/hochschulzugang-fuer-internationale-studierende/sprachnachweis-deutsch/, last accessed 03/30/2022.

Krause, Thomas; Zeldes, Amir (2016): ANNIS3: A new architecture for generic corpus query and visualization. In: Digital Scholarship in the Humanities 31 (1), pp. 118–139. DOI: 10.1093/llc/fqu057.

Lüdeling, Anke; Doolittle, Seanna; Hirschmann, Hagen; Schmidt, Karin; Walter, Maik (2008): Das Lernerkorpus Falko. In: Deutsch als Fremdsprache 45 (2), pp. 67–73.

Muntschick, Elisabeth; Portmann, Annette; Schwendemann, Matthias; Wisniewski, Katrin (2020): DISKO (Deutsch im Studium: Lernerkorpus): Handbuch. Accessible online at https://home.uni-leipzig.de/sprastu/Muntschick_et_al_2020_DISKO-Handbuch.pdf, last accessed 03/30/2022.

Nolda, Andreas (2019): EXMARaLDA (Dulko). Version 14.1. Accessible online at https://hg.sr.ht/~nolda/exmaralda-dulko, last accessed 03/30/2022.

Sauer, Simon (ed.) (2013): BeMaTaC. Ein tief annotiertes multimodales Map-Task-Korpus gesprochener Lerner- und Muttersprache. Accessible online at https://www.linguistik.hu-berlin.de/en/institut-en/professuren-en/korpuslinguistik/research/bematac/bematac, last accessed 03/30/2022.

Schmidt, Thomas (2002): EXMARaLDA – ein System zur Diskurstranskription auf dem Computer: Sonderforschungsbereich 538 (Mehrsprachigkeit), Universität Hamburg (Arbeiten zur Mehrsprachigkeit Folge B, 34).

Schmidt, Thomas; Wörner, Kai (2014): EXMARaLDA. In: Jacques Durand, Ulrike Gut, and Gjert Kristoffersen (eds.): The Oxford Handbook of Corpus Phonology. Oxford: Oxford University Press, pp. 402–419.

Spiegel, L.; Parker, M.; Feldmüller, T.; Lenort, L.; Wisniewski, K. (2020): MIKO (Mitschreiben in Vorlesungen: Ein multimodales Lehr-Lernkorpus): Handbuch. Accessible online at https://home.uni-leipzig.de/sprastu/Spiegel_et_al_2020_MIKO-Handbuch.pdf, last accessed 03/30/2022.

TestDaF-Institut (n.d.): Der Aufbau des digitalen TestDaF. Accessible online at https://www.testdaf.de/de/teilnehmende/der-digitale-testdaf-ueberblick/, last accessed 03/30/2022.

TestDaF-Institut (ed.) (2018): onSET-Handbuch: Planung und Durchführung von Online-Spracheinstufungstests – onSET-Deutsch, onSET-English. Bochum: TestDaF-Institut.

Wisniewski, Katrin; Lenhard, Wolfgang; Möhring, Jupp; Spiegel, Leonore (eds.) (forthcoming): Sprache und Studienerfolg bei Bildungsausländer/-innen. Münster: Waxmann.
