Access to MIKO

The MIKO corpus is now available via the Archive for Spoken German (AGD) and the Database for Spoken German (DGD). The publication of MIKO via the IDS Repository is unfortunately delayed.

It is nevertheless possible to get access to the corpus. Please contact Prof. Dr. Katrin Wisniewski directly by e-mail.

Context

Multimodal corpora are collections of written or spoken language that contain empirical data stemming from more than one “sensory modality” (Allwood, 2008, p. 208). As a result of the rapid technological improvements in corpus linguistics in the past few years, multimodal corpora have become much easier to access and have drawn increased attention in the corpus linguistic research community.

MIKO (Mitschreiben in Vorlesungen: Ein multimodales Lehr-Lernkorpus) is a multimodal lecture corpus targeting academic German. It contains both spoken (lectures; audio and video included) and written language (L1 and L2 students’ notes). Other, comparable multimodal corpora for academic German are euroWiss and GeWiss.

MIKO was compiled in the broader context of the research project “Sprache und Studienerfolg bei Bildungsausländer/-innen” (SpraStu, “Language and academic success of international students”), which was funded by the German Federal Ministry of Education and Research. One of the aims of the project was to better understand note-taking in lectures.
The extremely widespread practice of note-taking is also highly demanding. This is particularly true for international students whose L1 is not the language of instruction. These students often experience note-taking as very difficult at the beginning of their university studies (Bärenfänger et al., 2015; Marks, 2015). Note-taking involves the simultaneous use of receptive and productive skills (Arras & Fohr, 2020; Wisniewski, 2019; Steets, 2003). From a research perspective, lecture notes are hard to grasp, both empirically and linguistically (Wisniewski, 2019), which is partly due to their intrinsic intertextuality, i.e. the fact that they are interwoven with and dependent on visual and auditory (lecture) input. If and how notes are taken is strongly influenced by both learner-internal (e.g., motivational, knowledge- or proficiency-related) and external factors (e.g., complexity of the language used in the lecture). Therefore, successful note-taking requires strategic, linguistic, and content knowledge and is dependent on somewhat volatile individual motives and intentions linked to specific individual situations (Wisniewski, 2019).
This is why the use of a multimodal corpus for the analysis of lecture notes seemed to be a promising approach (Wisniewski, Spiegel, Parker et al., forthcoming).

Data and metadata

Corpus data

MIKO contains corpus-linguistic data collected during sessions (n = 8, 10 hours, 82,075 tokens) from mandatory first-semester lectures in medicine (Functional Anatomy I, Physics for medical students), German as a foreign language (Lexicology of Contemporary German), and economic sciences (Civil Law). All sessions (three per lecture) were recorded (video & audio) in the winter semester of 2017/2018 at Leipzig University. Lecture transcripts are available both with and without tokenization and automatic annotations.

Further sessions were recorded, but not transcribed (n = 4; n = 2 pertaining to the study of German as a foreign language, n = 2 from economic sciences, 5:38 hours). These files are also available in MIKO.

MIKO also contains n = 146 notes taken in these (n = 12) lecture sessions by students with L1 or L2 German. However, the notes were not transcribed, and no machine-readable format is available. They are made available as anonymized pdf scans.

Metadata

Metadata are available for lectures and notes, respectively.

Lecture metadata refer to:

the research project (SpraStu)
the corpus (administrative issues, corpus design)
the lecture itself (sessions, use of media, lecture topics, related audio and video files, audience)
transcripts (tokens, types, annotations, coders, reviewers)
speakers (lecturer, assistant)

Lecture notes metadata stem from different sources: lecture metadata, student note-taking questionnaires for each lecture session, a general project questionnaire, and post hoc analyses and ratings. Furthermore, results of a number of language proficiency tests carried out in the wider context of the SpraStu project form part of the lecture notes metadata in MIKO. Thus, information on aspects of the students’ language proficiency at the very beginning of their studies serves as background metadata for their note-taking in selected first-semester lectures.

Lecture note metadata refer to:

the lecture session (general information, perceived linguistic and overall difficulty, perceived important topics)
the notes (key features, ratings, usefulness, note-taking intentions)
authors (field of study, language learning biography, language test results)

A reduced list of MIKO metadata and a full list of the metadata variables are available for download (see Downloads below). In addition, the complete list of metadata will be made available along with the entire data set in the Research Data Repository of the Leibniz Institute for the German Language (IDS) in Mannheim.

Compilation of the corpus

Lectures

As in the euroWiss project (Heller et al., 2013), three consecutive sessions were recorded from every lecture sequence (each of which extends over a whole semester). In all sessions, two cameras were used to capture both the blackboard/projection (e.g., of a presentation) and the lecturer. Audio files were recorded via a radio microphone worn by the lecturer during the session. Later, the video was edited so that the footage of the blackboard/projection and the close-up of the lecturer could be viewed side by side. Thus, the videos show all nonverbal activities by the lecturers, including pointing gestures directing the students’ attention to parts of the presentation and/or blackboard.

In post-production editing, the videos were modified in order to protect the privacy rights of students and other persons present in the lectures. Soft focus was used for masking the video, and the respective audio sequences were either deleted or replaced with noise.

Transcription and annotation

MIKO lecture sessions were transcribed using a slightly modified version of the HIAT transcription guidelines (Rehbein et al., 2004). Transcription and annotation guidelines are described in detail in the MIKO manual (Spiegel et al., 2020).
To build MIKO, we used the EXMARaLDA editor (Schmidt & Wörner, 2014), which is particularly well suited for working with spoken corpora. EXMARaLDA allows transcripts to be aligned with audio and video files. The tier structure used for MIKO is available for download (MIKO_Uebersicht_Spurstruktur.pdf, see Downloads).

The quality control of transcriptions and annotations was organized in several stages and carried out by the same two transcribers throughout the duration of the SpraStu project (2018–2020). Details can be found in the corpus manual (Spiegel et al., 2020).

For MIKO, automatic tokenization, lemmatization, and part-of-speech tagging were run. We used the TreeTagger (Schmid, 1994), whose German model uses the Stuttgart-Tübingen tagset (STTS; Schiller et al., 1999). These automatic annotations were not checked manually. MIKO provides a fully tokenized version of the transcripts containing all automatic annotations; this version, however, can slow down EXMARaLDA. Therefore, we also provide the transcripts without automatic annotations.
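As a rough illustration of what such automatic annotations look like, the sketch below parses TreeTagger-style output, in which each line holds a token, its part-of-speech tag, and its lemma separated by tabs. The sample lines are hand-written for illustration and are not taken from MIKO; the names `Token` and `parse_treetagger` are hypothetical helpers, and the sketch is not the pipeline used by the project.

```python
from collections import Counter
from typing import NamedTuple


class Token(NamedTuple):
    form: str   # surface form as it appears in the text
    pos: str    # part-of-speech tag (STTS for the German model)
    lemma: str  # base form assigned by the tagger


def parse_treetagger(output: str) -> list[Token]:
    """Parse tab-separated lines of the form: token, tag, lemma."""
    tokens = []
    for line in output.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank lines and anything that is not a token line
        tokens.append(Token(*parts))
    return tokens


# Hand-written sample (assumption: illustrative only, not MIKO data).
SAMPLE = (
    "Die\tART\tdie\n"
    "Vorlesung\tNN\tVorlesung\n"
    "beginnt\tVVFIN\tbeginnen\n"
    ".\t$.\t.\n"
)

tokens = parse_treetagger(SAMPLE)
tag_counts = Counter(token.pos for token in tokens)
```

Because the MIKO annotations were not checked manually, counts derived in this way should be read as approximate.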

Lecture notes taken by students

The lecture notes of SpraStu participants were photographed directly after each lecture session by the project team. The notes were then anonymized to delete all references to the authors.

MIKO contains n = 146 lecture notes (pdf). These notes are available via the Archive for Spoken German (AGD). Access to the data can be obtained by sending an email to agd[at]ids-mannheim.de specifying your research aim.

While MIKO contains 12 lecture sessions, notes are available for 8 of these sessions. No notes are available for the other four sessions because few participants were present, because participants did not take notes, or because they did not agree to make their notes available. Lecture sessions were only transcribed if lecture notes could be collected, resulting in n = 8 transcribed lecture sessions for which n = 123 notes are provided (n = 78 from students with L2 German, n = 45 from students with L1 German).
For these lecture notes, we extracted key features (such as their length and the language(s) used) and rated their appropriateness based on three rating criteria (quantitative completeness, correctness, and overall quality). Both the key features and the ratings form part of the lecture note metadata. Detailed results can be found in Wisniewski, Spiegel, Lenort & Feldmüller (forthcoming).

Access to MIKO

Database for Spoken German (DGD) at the IDS

Registered users can access MIKO via the Database for Spoken German (Datenbank für Gesprochenes Deutsch, DGD) at the Leibniz Institute for the German Language (IDS) (https://dgd.ids-mannheim.de). The DGD is a browser-based corpus-management system that offers a broad range of search options for transcripts and metadata, with a special focus on spoken language.

Here, MIKO provides lecture transcripts (n = 8), audio (n = 12) and video recordings (n = 9), the corpus manual, and the metadata for students’ lecture notes.

Research Data Repository at the IDS

MIKO will also be stored in the Research Data Repository (Langzeitarchiv, LZA) of the Leibniz Institute for the German Language (IDS, PID: https://hdl.handle.net/10932/00-0534-6426-9660-0101-7).

Access will be possible for free via the Authentication and Authorization Infrastructure (AAI). If users have no affiliation with a research institution listed in AAI, they can register at https://idm.clarin.eu/user/home to qualify for corpus access. As the IDS is part of the CLARIN infrastructure, the corpus is also accessible in the Virtual Language Observatory via some core metadata.

The corpus data that can be downloaded from this repository includes lecture transcripts in their tokenized (n = 8, .exb) and untokenized (n = 8, .exb) versions, audio files (n = 12, .wav), video files (n = 9, .mp4), the MIKO manual (pdf/a), and all metadata (.csv).
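For readers who want to process the .exb files outside the EXMARaLDA editor, the following is a minimal sketch of reading events from an EXMARaLDA basic transcription with the Python standard library. The embedded sample document is hand-written and not MIKO data. The tier category "v" is an assumption made for illustration; the tiers actually used in MIKO are listed in MIKO_Uebersicht_Spurstruktur.pdf. The function name `read_events` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written miniature .exb document (assumption: illustrative only).
SAMPLE_EXB = """<basic-transcription>
  <head/>
  <basic-body>
    <common-timeline>
      <tli id="T0" time="0.0"/>
      <tli id="T1" time="1.5"/>
      <tli id="T2" time="3.2"/>
    </common-timeline>
    <tier id="TIE0" speaker="SPK0" category="v" type="t">
      <event start="T0" end="T1">Guten Morgen </event>
      <event start="T1" end="T2">zusammen. </event>
    </tier>
  </basic-body>
</basic-transcription>
"""


def read_events(exb_xml: str, category: str = "v"):
    """Return (start_seconds, end_seconds, text) for each event in matching tiers."""
    root = ET.fromstring(exb_xml)
    # Map timeline item ids to their offsets in seconds (some items may lack a time).
    times = {
        tli.get("id"): float(tli.get("time"))
        for tli in root.iter("tli")
        if tli.get("time") is not None
    }
    events = []
    for tier in root.iter("tier"):
        if tier.get("category") != category:
            continue
        for event in tier.iter("event"):
            start = times.get(event.get("start"))
            end = times.get(event.get("end"))
            events.append((start, end, (event.text or "").strip()))
    return events


events = read_events(SAMPLE_EXB)
```

For a downloaded file, the same function could be applied to the file's contents read as text; timeline items without an absolute time yield `None` for the corresponding offset.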

Archive for Spoken German (AGD)

For copyright reasons, the lecture notes (n = 146, pdf/a) can only be obtained from the Archive for Spoken German (AGD). Please write an email to agd[at]ids-mannheim.de and specify your research project to get access to the notes.

Downloads

MIKO corpus manual

Spiegel_et_al_2020_MIKO-Handbuch.pdf

Overview of lecture sessions

MIKO_Uebersicht_Sprechereignisse.pdf

Overview of transcription and annotation tiers in MIKO

MIKO_Uebersicht_Spurstruktur.pdf

Selected metadata for lecture sessions

MIKO_Metadaten_Vorlesungen_Auszug.csv

Overview of metadata for lecture sessions

MIKO_Metadaten_Vorlesungen_Variablen.pdf

Selected metadata for lecture notes

MIKO_Metadaten_Mitschriften_Auszug.csv

Overview of metadata for lecture notes

MIKO_Metadaten_Mitschriften_Variablen.pdf

All MIKO downloads listed above as a compressed zip file

MIKO_Downloads.zip

Cite MIKO

To cite MIKO as a whole, please use the following text:

Wisniewski, K., Spiegel, L., Parker, M., Feldmüller, T. & Lenort, L. (forthcoming). Mitschreiben in Vorlesungen in der Studieneingangsphase: Das multimodale Lehr-Lernkorpus MIKO. In K. Wisniewski, W. Lenhard, J. Möhring & L. Spiegel (eds.), Sprache und Studienerfolg bei Bildungsausländer/-innen. Waxmann.

To cite specific data from the corpus, please use the following link: https://hdl.handle.net/10932/00-0534-6426-9660-0101-7.

If your citation rules require you to name the corpus editors, you can list these as Wisniewski, K., Spiegel, L., Parker, M., Feldmüller, T., Lenort, L.

To cite a transcript, we recommend using the event code, e.g., MIKO_E_00012.

If you wish to refer to a specific transcript sequence, please indicate the point in time where the sequence starts (rounding to the nearest second), e.g., MIKO_E_00004, 71:15.
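The recommended reference format can be produced mechanically from an event code and a start offset. The sketch below assumes that the time stamp is written as minutes and seconds, with minutes running past 59, as the example 71:15 suggests; the function name `cite_sequence` is hypothetical, and the offset is assumed to be non-negative.

```python
def cite_sequence(event_code: str, start_seconds: float) -> str:
    """Format a transcript reference as '<event code>, <minutes>:<seconds>'."""
    # Round half up to the nearest whole second (valid for non-negative offsets).
    total = int(start_seconds + 0.5)
    minutes, seconds = divmod(total, 60)
    return f"{event_code}, {minutes}:{seconds:02d}"


reference = cite_sequence("MIKO_E_00004", 4274.6)
```

Here an offset of 4274.6 seconds rounds to 4275 seconds, that is, 71 minutes and 15 seconds.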

If you access MIKO via the DGD, please refer to the database’s own citing recommendations.

References

Allwood, J. (2008). Multimodal corpora. In A. Lüdeling & M. Kytö (eds.), Handbücher zur Sprach- und Kommunikationswissenschaft [HSK]: Vol. 29.1. Corpus linguistics: An international handbook (pp. 207–225). De Gruyter.

Arras, U. & Fohr, T. (2020). Mitschreiben: Funktionen und didaktische Überlegungen zu Formen der Wissensverarbeitung an der Hochschule. In A. Gryszko, C. Lammers, K. Pelikan & T. Roelcke (eds.), DaFFür Berlin – Perspektiven für Deutsch als Fremd- und Zweitsprache in Schule, Beruf und Wissenschaft (pp. 131–149). Göttingen University Press.

Bärenfänger, O., Lange, D. & Möhring, J. (2015). Sprache und Bildungserfolg: Sprachliche Anforderungen in der Studieneingangsphase. Research papers in assessment: Vol. 1. Institut für Testforschung und Testentwicklung e.V. http://nbn-resolving.de/urn:nbn:de:bsz:15-qucosa-188820

Heller, D., Hornung, A., Redder, A. & Thielmann, W. (2013). The euroWiss-Project: Linguistic Profiling of European Academic Education (Germany/Italy). European Journal of Applied Linguistics, 1(2), 317–320. https://doi.org/10.1515/eujal-2013-0018

Marks, D. (2015). Prüfen sprachlicher Kompetenzen internationaler Studienanfänger an deutschen Hochschulen – Was leistet der TestDaF? Zeitschrift für Interkulturellen Fremdsprachenunterricht, 20(1), 21–39.

Rehbein, J., Schmidt, T., Meyer, B., Watzke, F. & Herkenrath, A. (2004). Handbuch für das computergestützte Transkribieren nach HIAT: Version 1.0. Arbeiten zur Mehrsprachigkeit Folge B: Vol. 56. Sonderforschungsbereich 538 (Mehrsprachigkeit), Universität Hamburg.

Schiller, A., Teufel, S., Stöckert, C. & Thielen, C. (1999). Guidelines für das Tagging deutscher Textcorpora mit STTS (Kleines und großes Tagset). Universität Stuttgart, Institut für Maschinelle Sprachverarbeitung; Universität Tübingen, Seminar für Sprachwissenschaft. http://www.sfs.uni-tuebingen.de/resources/stts-1999.pdf

Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. Proceedings of the International Conference on New Methods in Language Processing. https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/tree-tagger1.pdf

Schmidt, T. & Wörner, K. (2014). EXMARaLDA. In J. Durand, U. Gut & G. Kristoffersen (eds.), The Oxford Handbook of Corpus Phonology (pp. 402–419). Oxford University Press.

Spiegel, L., Parker, M., Feldmüller, T., Lenort, L. & Wisniewski, K. (2020). MIKO (Mitschreiben in Vorlesungen: Ein multimodales Lehr-Lernkorpus): Handbuch. https://home.uni-leipzig.de/sprastu/Spiegel_et_al_2020_MIKO-Handbuch.pdf

Steets, A. (2003). Die Mitschrift als universitäre Textart – schwieriger als gedacht, wichtiger als vermutet. In K. Ehlich & A. Steets (eds.), Wissenschaftlich schreiben – lehren und lernen (pp. 51–64). De Gruyter.

Wisniewski, K. (2019). Mitschreiben in Vorlesungen. Ein interdisziplinärer Forschungsüberblick mit Fokus Deutsch als L2. In C. Fandrych & R. Schmidlin (eds.), Bulletin suisse de linguistique appliquée: Vol. 109. Wissenschaftssprache(n) kontrastiv (pp. 153–170).

Wisniewski, K., Spiegel, L., Lenort, L. & Feldmüller, T. (forthcoming). Herausforderung Wissenschaftssprache I: Mitschreiben. In K. Wisniewski, W. Lenhard, J. Möhring & L. Spiegel (eds.), Sprache und Studienerfolg bei Bildungsausländer/-innen. Waxmann.