Announcement and Call for Expressions of Interest

SHARED TASK IN SURFACE REALISATION FROM A COMMON-GROUND INPUT REPRESENTATION

We seek input and participation in a proposed shared task on surface realisation using resources developed over the Penn Treebank, including a track for automatic evaluation metrics. If the task descriptions below interest you, please contact us (see end of email for contact details).

Background

In Natural Language Analysis (NLA), reuse of core utilities and tools has become common, and researchers frequently use off-the-shelf parsers, POS-taggers, named entity recognisers, coreference resolvers, and many other tools. Natural Language Generation (NLG) has not so far developed generic tools, and methods for comparing them, to the same extent as NLA.

The NLG subfield that has perhaps come closest to developing generic tools is surface realisation. Wide-coverage surface realisers such as PENMAN/NIGEL, FUF/SURGE and REALPRO were intended to be more or less off-the-shelf plug-and-play modules. But they tended to require a significant amount of work to adapt and integrate, e.g. requiring highly specific inputs with up to several hundred features that needed to be set.

With the advent of statistical techniques in NLG, surface realisers appeared for which it was far simpler to supply inputs, as information not provided in the inputs could be added on the basis of likelihood. The current generation of surface realisers tends to be statistical and to use reversible, treebank-based, automatically extracted grammars for both parsing and generation.

A significant subset of statistical realisation work has produced results for regenerating the Penn Treebank (PTB), where the annotated resources of the PTB are mapped to some form of meaning representation which then serves as input to the surface realiser, whose task it is to reproduce the original treebank sentence.
Despite the fact that these research projects involve the same corpus, the reported results cannot be directly compared, because each realiser uses different input representation formalisms (to match the grammar formalisms used in the realiser: HPSG, CCG, LFG, LTAG, etc.) and inputs specify the word-string outputs to different degrees (some inputs are more `surfacey', others more semantic). Evaluation results typically report BLEU scores, and publications refer to each other and (tentatively) compare BLEU scores, but no conclusions can be drawn from these comparisons, because of the differences in inputs. Additionally, meta-evaluations of MT metrics on realiser outputs have suggested that these metrics correlate less well with human judgments than in the case of MT outputs, perhaps because realiser outputs are generally of higher quality and exhibit more subtle variation.

We are therefore left in a situation where a vibrant new generation of surface realisation research exists, but we do not have the facility to compare these approaches with each other, or to the previous generation of symbolic realisers.

Shared Task Outline

We are currently developing (for an early outline see Belz, White, van Genabith, Hogan & Stent, 2010) a shared task in surface realisation (SR) based on common inputs and annotated corpora of paired inputs and outputs derived from various resources from NLA that build on the Penn Treebank. Inputs are provided in a common-ground representation formalism which participants map to the types of input required by their system. These inputs are automatically derived from the Penn Treebank and the various layers of annotation (syntactic, semantic, discourse) that have been developed for the texts in it. The shared task is defined precisely and outputs from participating systems (realisations) are evaluated by automatic comparison against the human-authored text in the corpora as well as by human assessors.
We have assembled a working group of SR researchers (see below) to tackle the task of designing the common-ground input representation, with the aim of ensuring a fair and balanced approach.

In the short term, an SR Shared Task as outlined here will make existing and new approaches directly comparable by evaluation on the benchmark data associated with the task. Additionally, a metrics track will allow researchers working on automatic evaluation metrics to submit the results of their metrics on the realisation data.

In the long term, the common-ground input representation is likely to lead to a standardised representation that can act as a link between surface realisers and preceding modules, and may one day make it possible to use alternative surface realisers as drop-in replacements for each other, enabling developers to determine the best realiser for their purpose. Moreover, the acquired human judgments of realiser outputs will form a challenging data set for advancing research on automatic evaluation metrics.

Working Group Developing the Common-Ground Input Representation

Anja Belz, NLTG, University of Brighton, UK (coordination)
Bernd Bohnet, IMS, University of Stuttgart, Germany
Charles Callaway, University of Haifa, Israel
Josef van Genabith, CNGL, Dublin City University, Ireland
Deirdre Hogan, CNGL, Dublin City University, Ireland
Stephan Oepen, University of Oslo, Norway
Amanda Stent, AT&T Labs Research Inc., US
Leo Wanner, Information and Communication Technologies, UPF, Barcelona, Spain
Mike White, Department of Linguistics, The Ohio State University, US

Prospective Timeline

February 2011: Training/development data and task documentation available
July 2011: Submission of test data outputs
August 2011: Submission of automatic metrics results on realiser data
September 2011: Results session at GenChal'11 at ENLG 2011

Expressions of Interest

Please let us know if you would like to participate in the Surface Realisation Shared Task or the Automatic Metrics Track. We welcome any feedback or suggestions you may have.

Note that we would particularly welcome work on surface realisation from PTB-style inputs for languages other than English. Related ideas and/or results can be submitted to the Open Track at Generation Challenges 2011 (to be announced shortly).

Shared Task Organisation:

Anja Belz, NLTG, University of Brighton, UK
Josef van Genabith, CNGL, Dublin City University, Ireland
Deirdre Hogan, CNGL, Dublin City University, Ireland
Amanda Stent, AT&T Labs Research Inc., US
Mike White, Department of Linguistics, The Ohio State University, US

Contact: Anja Belz (A.S.Belz@brighton.ac.uk) and Mike White (mwhite@ling.osu.edu)