Lexicom 2010: University of Ljubljana, Slovenia

A personal view – Anita Srebnik

The tenth annual Lexicom workshop took place in Ljubljana from 7 to 11 July 2010. The topic of this year’s workshop was lexicography and lexical computing. The hosts of Lexicom 2010 were Trojina, the Institute for Applied Slovene Studies (which was in charge of organization), and the Faculty of Arts of the University of Ljubljana.

The workshop had 27 participants from nine countries: France, Spain, India, Burundi, Norway, Estonia, Serbia, Croatia and Slovenia.

The five-day programme was intensive and full of new developments in lexicography and natural language processing. As every year, it was prepared and delivered by the lecturers of the Lexicography MasterClass Company, Adam Kilgarriff, Michael Rundell and Sue Atkins (who did not lecture herself this year), as well as Simon Krek of the Jožef Stefan Institute and Amebis (http://www.amebis.si/), a company that develops language technologies.

Following tradition, the workshop was mostly practical; however, this doesn’t mean that there was no flirting with theory, both linguistic and lexicographic. The lecturers didn’t focus only on writing monolingual English dictionaries, but also on general lexicographic principles and practice useful for any language, which was, after all, the aim of the workshop. Interaction and cooperation between the lecturers and participants contributed to a very pleasant and creative atmosphere.

Since the basis for any effective lexicographic work is a computer-supported reference corpus, the workshop started with an introduction to corpus lexicography. Why do corpora need to be larger and larger? The most frequent words in a language (approximately 7,500) account for almost 90% of running text, which means that the great majority of words are rare, and that is precisely why as much data as possible is necessary for their analysis. Corpora for English, for example, will probably grow to 20 billion lexical units by 2011.
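The point about frequent and rare words can be illustrated with a small sketch. It is not taken from the workshop materials: the function names and the toy sentence are made up for illustration, and a real corpus would of course be many orders of magnitude larger. The sketch counts how much of a text is covered by its most frequent words and how many of its distinct words occur only once.

```python
from collections import Counter


def coverage_of_top_n(tokens, n):
    """Share of running tokens covered by the n most frequent word types."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(n))
    return covered / len(tokens)


def hapax_share(tokens):
    """Share of distinct word types that occur exactly once."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    counts = Counter(tokens)
    once = sum(1 for count in counts.values() if count == 1)
    return once / len(counts)


# Hypothetical toy "corpus": 17 running words, 11 distinct words.
text = "the cat sat on the mat and the dog sat on the rug while the owl slept"
tokens = text.split()

print(f"top 3 words cover {coverage_of_top_n(tokens, 3):.0%} of the text")
print(f"{hapax_share(tokens):.0%} of distinct words occur only once")
```

Even in this tiny example a handful of words accounts for about half of the text, while most distinct words appear just once, so describing those rare words requires far more data.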

We became familiar with the parameters for building well-balanced corpora and touched upon various topics, e.g. how theoretical knowledge can be applied to a lexicographer’s work; the advantages that natural language processing brings to corpus linguistics; word sense disambiguation; the World Wide Web as a lexical database or linguistic corpus; Dictionary Writing Systems; ontological approaches such as FrameNet (http://framenet.icsi.berkeley.edu/) and WordNet (http://wordnet.princeton.edu/); and bilingual lexicography. When addressing the problem of defining the boundaries between various word meanings, Kilgarriff touched upon the automatic pooling of meanings with the help of the so-called CoCoDo Database (COllocations, COnstructions and DOmains), which could serve as a basis for a dictionary.

Michael Rundell and Adam Kilgarriff introduced us to an extensive and interesting project called Dante, a fine-grained lexical database which describes the core vocabulary of English in, so to speak, every detail (http://www.webdante.net/).

The novelty I personally found most interesting was the Corpus Architect/Sketch Engine (SkE) software (http://www.sketchengine.co.uk/), developed by Adam Kilgarriff, which offers a range of functions for language analysis. Among them are word sketches, a function that makes a lexicographer’s work much simpler. Within this software one can create one’s own corpus using a new tool called WebBootCat, which the participants were trained to use during practical exercises.

Theory was put into practice as we went along. We were divided into two groups: while the first group dealt with the tasks involved in dictionary writing, the other immersed itself in natural language processing, learning among other things how to compile a corpus with WebBootCat and prepare it for the Sketch Engine.

It was also very interesting to hear about the participants’ projects in their own countries, such as the making of a comprehensive Serbian encyclopaedic dictionary and the Norwegian dictionary project, which started in 1930 and will be completed in 2014, comprising Bokmål as well as Nynorsk.

The four participants from the University of Hyderabad in India helped us all greatly in discovering the advantages of working with the Sketch Engine.

All in all, the content part of Lexicom was extremely successful, thanks to the warm and clear way in which Michael Rundell and Adam Kilgarriff shared their incredibly broad and deep knowledge and valuable experience. Simon Krek of the Jožef Stefan Institute and Mojca Šorli of the Trojina Institute were also important contributors, the latter making sure that everything went smoothly.

The official programme lasted less than a week, yet the impact of the workshop is immense: Lexicom 2010 isn’t over yet! We participants found that we have quite a lot in common in our research and practical work, so we continue to exchange experience, instructions and advice on how to improve our lexicographic tools via email. A special surprise for everyone interested in the Sketch Engine software came from Adam Kilgarriff: soon after the workshop he updated its appearance, improved it and added new functions. Thank you, Adam, I cannot imagine my translating and lexicography work without this program anymore!

Besides the official and educational success, we also found the social side of the workshop very pleasant. For example, there was an organized sightseeing walk around Ljubljana with a licensed guide, a trip to Bled (where the more courageous participants went for a swim in the refreshingly cool lake), a night-time walk through Ljubljana, and culinary pampering of the senses with fine Slovene cuisine.

Anita Srebnik
University of Ljubljana
anita.srebnik@guest.arnes.si