Openly available Learner Corpora 1 - a survey
Awareness of the uses of English L1 corpora and (more recently and more slowly) L2 learner corpora among teachers of English as a Foreign Language and writers and developers of materials for teaching English seems to be growing rapidly. Publicly available and easily accessible corpora of native-speaker English are well documented and their use, and their uses, are increasing. Teachers and writers wanting to check their own intuitions, provide concordances of real-life examples of English usage for students and generate activities for use in the language classroom have vast repositories of data at their fingertips.
But where are the learner corpora? Where can the curious EFL teacher/writer/researcher go to find data that highlights and clarifies the needs of learners?
There are hundreds of learner corpora already developed or in development at universities and publishing houses across the globe. ELT publishers such as CUP and Longman have been using learner corpora to inform their learner dictionaries and ELT materials since the early nineties. Academic institutions, such as The Centre for English Corpus Linguistics in Louvain, Belgium, have compiled huge learner corpora for their research in Second Language Acquisition.
Openly and freely available ready-to-search corpora of learner English, however, are not so easy to find.
I've been looking into this and plan to offer reviews of any learner corpora I find that fit the description above and strike me as worth a visit for the newcomer to learner corpora.
First up ...
ICNALE: The International Corpus Network of Asian Learners of English
http://language.sakura.ne.jp/icnale/icnale_online.html
The ICNALE corpus, compiled by Dr. Shin'ichiro Ishikawa of Kobe University, Japan, contains 1.3 M words of controlled essays written by 2,600 college students in 10 Asian countries and regions, as well as 200 English Native Speakers. The inclusion of the same essays written under the same conditions by Native Speakers from a range of English-speaking countries makes it ideal for comparisons between L1 and L2 output. The ICNALE is one of the largest learner corpora publicly available and a reliable database for a contrastive interlanguage analysis of Asian learners as well as studies of World Englishes in Asia.
It comprises two controlled essays written by learners of English from the following regions:
The data collection and annotation, including essay titles, conditions of writing, and how the L2 proficiency of each student was assigned, are clearly described on the home page.
From there, a simple login takes you to the online interface and you're ready to start exploring.
I searched for 'if' occurring within two positions to the left of 'will' for all regions at all CEFR levels and in both essay topics, to look for uses of 'will' in the 'if' clauses of first conditionals. (My interest was in checking a vague intuition I had that this was primarily a European learner error.) This yielded a simple, sortable concordance with a column displaying the L1 and CEFR equivalent for each concordance line. I missed the ability to expand the concordance lines to view whole sentences, but was placated by the attractive pop-up sentences that appear when you click on the search node.
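For anyone who would like to try the same kind of search on their own texts, here is a minimal Python sketch of the idea: find every 'will' that has 'if' within two tokens to its left. It is not how the ICNALE interface works internally (I have no idea how that is implemented), and the function name, the crude tokenizer and the sample sentences are all my own inventions for illustration, not data from the corpus.

```python
import re

def if_before_will(sentences, window=2):
    """Find each 'will' that has 'if' within `window` tokens to its left.

    Returns a list of (sentence_index, left_context, node, right_context)
    tuples, i.e. a very small keyword-in-context concordance.
    """
    hits = []
    for index, sentence in enumerate(sentences):
        # Crude tokenizer (an assumption): runs of letters and apostrophes.
        tokens = re.findall(r"[A-Za-z']+", sentence)
        lowered = [token.lower() for token in tokens]
        for pos, token in enumerate(lowered):
            left_window = lowered[max(0, pos - window):pos]
            if token == "will" and "if" in left_window:
                left = " ".join(tokens[:pos])
                right = " ".join(tokens[pos + 1:])
                hits.append((index, left, tokens[pos], right))
    return hits

# Invented example sentences, NOT taken from ICNALE.
sample = [
    "If I will have a part time job, I can earn money.",
    "I will stop smoking if the law changes.",
    "If it will rain tomorrow, we stay home.",
]

for index, left, node, right in if_before_will(sample):
    print(f"{index}: {left} [{node}] {right}")
```

The second sample sentence is skipped because its 'if' comes after 'will' rather than before it. A real search would also want a proper tokenizer and some way of filtering out legitimate uses, but this is enough to show what "within two positions to the left" means in practice.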
The interface also offers collocation searches with Raw Frequency, t-score, Log-Likelihood and Mutual Information views (and handy Help (?) buttons), a Wordlist function, and Keywords. The keywords list produced gives both 'overused' and 'underused' words compared with the reference corpus selected (I selected the two native-speaker corpora). Overused words included smoke, smoking, cigarette, part, time and job. This is hardly surprising given the two essay titles:
(A) "It is important for college students to have a part time job."
(B) "Smoking should be completely banned at all the restaurants in the country."
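If the statistics behind the collocation and keyword views are unfamiliar, the following Python sketch shows the textbook versions of Mutual Information, t-score and log-likelihood keyness. These are the standard corpus-linguistics formulas as I understand them; the ICNALE documentation may define its measures differently (for example, in how it handles the collocation window), and I am only assuming that the Keywords function uses something like log-likelihood. All function names and figures are hypothetical.

```python
import math

def expected_cooccurrence(f_node, f_collocate, corpus_size, span=1):
    """Co-occurrences expected by chance if the two words were independent."""
    return f_node * f_collocate * span / corpus_size

def mutual_information(observed, expected):
    """MI = log2(O / E); high values flag strong but often rare pairings."""
    return math.log2(observed / expected)

def t_score(observed, expected):
    """t = (O - E) / sqrt(O); favours frequent, well-attested pairings."""
    return (observed - expected) / math.sqrt(observed)

def keyness_log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood keyness of one word in a study vs. a reference corpus.

    Returns (score, label), where label says whether the word is relatively
    more frequent ('overused') or less frequent ('underused') in the study
    corpus than in the reference corpus.
    """
    total = freq_study + freq_ref
    expected_study = size_study * total / (size_study + size_ref)
    expected_ref = size_ref * total / (size_study + size_ref)
    score = 0.0
    for observed, expected in ((freq_study, expected_study),
                               (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    score *= 2
    rate_study = freq_study / size_study
    rate_ref = freq_ref / size_ref
    if rate_study > rate_ref:
        label = "overused"
    elif rate_study < rate_ref:
        label = "underused"
    else:
        label = "equal"
    return score, label

# Hypothetical counts, purely for illustration.
expected = expected_cooccurrence(f_node=100, f_collocate=50, corpus_size=10000)
print(mutual_information(8, expected), t_score(8, expected))
print(keyness_log_likelihood(30, 1000, 10, 1000))
```

With two tightly controlled essay topics, words from the prompts themselves will naturally score very high on a measure like this, which is exactly what the keyword list shows.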
It's a great resource and I'd like to see it grow and the number of essays expanded to increase the range of language used.
I think the interface should be simple enough for anyone to make a good start at delving into Asian learners' English production. There isn't a user manual, so beginners might be a little bewildered. But this paper (PDF) http://language.sakura.ne.jp/s/ilaa/ishikawa_20130323.pdf acts as a useful overview of tools and functions etc. and is pretty accessible.
I recommend this corpus to anyone interested in learner corpora, and in particular to any ELT teachers and materials writers/developers interested in teaching or writing for Asian students. And I'd be very pleased to hear what you think.
Many thanks to @Glenn_Hadikin and Costas Gabrielatos @congabonga for bringing this corpus to my attention.