Learner Corpora and Corpus Studies

Thursday, 27 February 2014

Openly available Learner Corpora 1 - a survey

Awareness of the uses of English L1 corpora and (more recently and more slowly) L2 learner corpora among teachers of English as a Foreign Language and writers and developers of materials for teaching English seems to be growing rapidly. Publicly available and easily-accessible corpora of native-speaker English are well documented and their use, and their uses, are increasing. Teachers and writers wanting to check their own intuitions, provide concordances of real-life examples of English usage for students and generate activities for use in the language classroom have vast repositories of data at their fingertips, here and here and here, for example.

But where are the learner corpora? Where can the curious EFL teacher/writer/researcher go to find data that highlights and clarifies the needs of learners?

There are hundreds of learner corpora already developed or in development at universities and publishing houses across the globe. ELT publishers such as CUP and Longman have been using learner corpora to inform their learner dictionaries and ELT materials since the early nineties. Academic institutions, such as The Centre for English Corpus Linguistics in Louvain, Belgium, have compiled huge learner corpora for their research in Second Language Acquisition.

Openly and freely available ready-to-search corpora of learner English, however, are not so easy to find.

I've been looking into this and plan to offer reviews of any learner corpora I find that fit the description above and strike me as worth a visit for the newcomer to learner corpora.

First up ...

ICNALE: The International Corpus Network of Asian Learners of English

http://language.sakura.ne.jp/icnale/icnale_online.html

The ICNALE corpus, compiled by Dr. Shin'ichiro Ishikawa of Kobe University, Japan, contains 1.3 M words of controlled essays written by 2,600 college students in 10 Asian countries and regions, as well as 200 English Native Speakers. The inclusion of the same essays written under the same conditions by Native Speakers from a range of English-speaking countries makes it ideal for comparisons between L1 and L2 output. The ICNALE is one of the largest learner corpora publicly available and a reliable database for a contrastive interlanguage analysis of Asian learners as well as studies of World Englishes in Asia.

It comprises two controlled essays written by learners of English from the following regions:

The data collection and annotation, including essay titles, conditions of writing, and how the L2 proficiency of each student was assigned, is clearly described on the home page.

From there, a simple login takes you to the online interface and you're ready to start exploring.

I searched for 'if' occurring within two positions to the left of 'will', for all regions, at all CEFR levels and in both essay topics, to look for uses of 'will' in the if-clauses of first conditionals. (My interest was in checking a vague intuition I had that this was primarily a European learner error.) This yielded a simple, sortable concordance with a column displaying the L1 and CEFR equivalent for each concordance line. I missed the ability to expand the concordance lines to view whole sentences, but was placated by the attractive pop-up sentences that appear when you click on the search node.
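For anyone who would like to try the same kind of search on their own learner texts, here is a minimal sketch in Python. It is not how the ICNALE interface works internally (I don't know how that is implemented); the function names, the simple tokenizer and the sample sentence are all my own assumptions, made up for illustration.

```python
import re


def tokenize(text):
    # Assumption: a crude tokenizer that lower-cases the text and keeps
    # runs of letters and apostrophes, so "won't" stays one token.
    return re.findall(r"[a-z']+", text.lower())


def if_before_will(text, window=2):
    """Return short context strings for every 'will' that has 'if'
    within `window` tokens to its left."""
    tokens = tokenize(text)
    hits = []
    for i, tok in enumerate(tokens):
        if tok != "will":
            continue
        left = tokens[max(0, i - window):i]
        if "if" in left:
            # Keep up to five tokens of left context and three of right.
            start = max(0, i - 5)
            hits.append(" ".join(tokens[start:i + 4]))
    return hits


# Invented sample text, not taken from any corpus.
sample = "If it will rain tomorrow, we stay home. I will go if you ask."
for line in if_before_will(sample):
    print(line)
```

Note that this sketch ignores sentence boundaries, so an 'if' at the end of one sentence could be matched with a 'will' at the start of the next. A real search would also want the L1 and CEFR level attached to each hit, as the ICNALE concordance does.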

The interface also offers collocation searches with Raw Frequency, t-score, Log-Likelihood and Mutual Information views (and handy Help (?) buttons), a Wordlist function, and Keywords. The keywords list produced gives both 'overused' and 'underused' words compared with the reference corpus selected (I selected the two native-speaker corpora). Overused words included smoke, smoking, cigarette, part, time and job. This is hardly surprising given the two essay titles:

(A) "It is important for college students to have a part time job."
(B) "Smoking should be completely banned at all the restaurants in the country."
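If you are wondering what lies behind the collocation scores and the 'overused'/'underused' keyword labels, the sketch below shows one common textbook formulation of Mutual Information, t-score and log-likelihood keyness. I am assuming these standard formulas for illustration only; the ICNALE interface may calculate its figures differently, and all the function names and frequencies here are invented.

```python
import math


def expected_cooccurrence(f_node, f_collocate, corpus_size, span=1):
    # Co-occurrences expected by chance if the two words were independent.
    return f_node * f_collocate * span / corpus_size


def mutual_information(observed, expected):
    # MI: log2 of observed over expected co-occurrence frequency.
    return math.log2(observed / expected)


def t_score(observed, expected):
    # t-score: difference from chance, scaled by sqrt of the observed count.
    return (observed - expected) / math.sqrt(observed)


def keyness_ll(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood keyness of one word in a study corpus compared
    with a reference corpus, plus a rough direction label."""
    total = freq_study + freq_ref
    e_study = size_study * total / (size_study + size_ref)
    e_ref = size_ref * total / (size_study + size_ref)
    ll = 0.0
    for obs, exp in ((freq_study, e_study), (freq_ref, e_ref)):
        if obs > 0:  # a zero count contributes nothing to the sum
            ll += obs * math.log(obs / exp)
    ll *= 2
    # Label by relative frequency; equal rates fall under "underused" here.
    overused = freq_study / size_study > freq_ref / size_ref
    return ll, "overused" if overused else "underused"


# Invented counts: a word occurring 30 times in 1,000 learner words
# and 10 times in 1,000 reference words.
score, label = keyness_ll(30, 1000, 10, 1000)
print(round(score, 2), label)
```

The higher the log-likelihood figure, the less likely it is that the difference in frequency between the two corpora is down to chance. The label only tells you the direction of the difference.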

It's a great resource and I'd like to see it grow and the number of essays expanded to increase the range of language used.

I think the interface should be simple enough for anyone to make a good start at delving into Asian learners' English production. There isn't a user manual, so beginners might be a little bewildered. But this paper (PDF) http://language.sakura.ne.jp/s/ilaa/ishikawa_20130323.pdf acts as a useful overview of tools and functions etc. and is pretty accessible.

I recommend this corpus to anyone interested in learner corpora, and in particular to any ELT teachers and materials writers/developers interested in teaching or writing for Asian students. And I'd be very pleased to hear what you think.

Many thanks to @Glenn_Hadikin and Costas Gabrielatos @congabonga for bringing this corpus to my attention.

Friday, 4 October 2013

Learner Corpus Research conference - Bergen, 27th-29th September 2013

Review and reflections

I was invited by the Cambridge University Press Language Research department to attend the second Learner Corpus Research conference in Bergen last week. I won't dwell here on the organization, accommodation, food, leisure facilities, dancing, live music, and much-needed opportunities for quiet reflection (see above), which were all beyond my wildest imaginings. The whole package was faultless and I'm very grateful to CUP for sending me, and to the organizers for all their work.

The conference was attended by about 120 participants - lecturers, researchers, PhD students, teachers and teacher trainers - from all over the world. There were 4 plenaries, 49 presentations and around 20 posters and demos. A complete book of abstracts is available for download here. The country with the most accepted papers was Spain - some great work going on there with learner corpora! - but speakers came from a very wide range of countries and disciplines.

There were 4 plenaries. Sylviane Granger, in 'Contrastive Interlanguage Analysis: a reappraisal', gave us a very timely reminder of the importance of constantly re-evaluating our research hypotheses, methods and aims. Bård Uri Jensen, in '"A chi-square test showed that..." or did it really? Some remarks on the use of statistical tests in corpus-based research', warned us against allowing the powerful and apparently simple statistical tools now at our fingertips to do our thinking for us. John Osborne, in 'Comparisons are odorous: native-speaker data in learner corpus research', urged us to be circumspect about the role of 'native-speaker' corpora as reference corpora for the study of learner interlanguage. And Scott Jarvis, in 'Signals and clues in detecting cross-linguistic influence: What detectives and detectors can tell us', highlighted the value of human judges working *together* with machine classifiers to uncover and analyze corpus research clues.

For me, the buzzwords/phrases of the conference seemed to be (in no particular order):

EGAP/ESAP teaching
Spoken corpora - compiling, error-annotating
Longitudinal corpora
Statistical tools for the analysis of corpora
The status of native speaker data in the age of World Englishes and English as a Lingua Franca
Reference corpora and comparability of corpora
Freely-available and user-friendly corpus tools for analysis and annotation
The Common European Framework of Reference
Availability of corpus data
Using learner corpora in the EFL classroom
Lexical bundles/formulaic sequences/fixed expressions/multi-grams
Collocations

There were software demos of a more user-friendly interface for error-annotated and error-corrected corpora in Sketch Engine; of CorpusTool, a freely-available set of corpus compilation, annotation and analysis tools developed at the Autonomous University of Madrid, which a number of researchers at the conference were using; and of the new, freely-available 30-million-word EF-CamDat learner corpus.

A landmark event was the launch of the Learner Corpus Association to provide a forum for exchanging ideas on learner corpus research from an interdisciplinary perspective:

- Second language acquisition
- Foreign language teaching (including CALL)
- Language testing
- NLP applications (automated scoring, L1 identification, error detection and correction, etc.)
- other language-related fields

Members (membership costs between 20 and 60 Euros, depending on your status) get access to a range of resources, including shared corpora, publications and corpus tools and a regularly updated searchable learner corpus bibliography with more than 900 entries. They will also be able to take part in forums focused on a range of topics (learner corpus design, annotation, methodology, applications, etc.) and benefit from discounts negotiated by the LCA. There is currently a special offer of free 6-month trial access to Sketch Engine.

I was at the first Learner Corpus Research conference in Louvain in 2011. My impression is that the availability and accessibility of both learner corpora and user-friendly corpus tools have really gathered pace in the last two years and will continue to do so, and that this is slowly bringing learner corpora, and corpora in general, out of the research centres and into the classroom. There is still a great wealth of fascinating research coming out of those centres. I am pleased to see that it is reaching a wider audience and gaining wider application in SLA and in ELT teaching and materials development. I'm excited about what the next two years will bring!