The Unit for Linguistic Data (ULD) is concerned with the creation, improvement and maintenance of linguistic data (also known as language resources) through a variety of methods. The term linguistic data refers to a range of data types that are of use to researchers in linguistics and natural language processing (NLP). Principally, linguistic data can be split into four major categories. The first category is lexical data, an organised collection of words and their meanings, syntax, and relations. Secondly, resources known as corpora consist of collections of texts, or audio or audiovisual material, made for a particular purpose. Language descriptions, the third category, document typological properties of language to enable comparative studies. Metadata, finally, is used to describe language resources and their availability.

The group's primary research method is the use of linked data technologies, specifically Linguistic Linked Open Data (LLOD), for processing linguistic data. This has led to the development of several tools and resources that rely on linked data as a core part of their mechanism. One such tool is Naisc, developed by the group for linking resources of different kinds; it has been applied to the task of linking lexicographical resources in the context of the Horizon 2020 ELEXIS project. Another tool, Teanga (see Ziad, McCrae & Buitelaar 2018), enables the construction of pipelines of NLP tools that can be composed and integrated through the use of linked data and standards for linguistic data, such as the OntoLex-Lemon standard developed in this project. Finally, ULD maintains and develops several catalogues for the discovery of linguistic data resources, including the Linghub website as well as the Linked Open Data Cloud and its Linguistic Linked Open Data Subcloud. In the context of the now-completed Horizon 2020 Prêt-à-LLOD project, ULD further explored how the quality and availability of resources can be improved.

One of the major applications of linguistic data is the extension of existing NLP technologies to new languages and domains. A major part of the group's work therefore concerns under-resourced languages, with much ongoing work on the development of technologies for minority languages. Most of this work takes place as part of the unit's Cardamom project, funded by the Irish Research Council under the Consolidator Laureate Award scheme. In this context there is an active collaboration with the Irish Department and the Moore Institute on the development of NLP techniques for historical languages, in particular Old Irish. Furthermore, the unit is working on expanding WordNet to many under-resourced languages by means of machine translation.

Areas of work:
Linked data, Under-resourced languages, Digital humanities, Language resources, Lexicography, Metadata, Linguistic linked open data, Linked-data-based services