1. Computational Lexicon based on Distributional Semantics
Distributional semantic models (DSMs) -- also known as "word space" or "distributional similarity" models -- are based on the assumption that the meaning of a word can (at least to a certain extent) be inferred from its usage, i.e. its distribution in text. These models therefore build semantic representations dynamically -- in the form of high-dimensional vector spaces -- through a statistical analysis of the contexts in which words occur. DSMs are a promising technique for addressing the lexical acquisition bottleneck through unsupervised learning, and their distributed representation provides a cognitively plausible, robust and flexible architecture for the organisation and processing of semantic information.
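As a minimal illustration of the distributional approach (a sketch only, not part of the deliverables; the toy corpus and window size are assumptions for illustration), each word is represented by a vector of co-occurrence counts over its context window, and similarity is measured between those vectors:

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Build one sparse co-occurrence count vector (a Counter) per word."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = lambda w: math.sqrt(sum(c * c for c in w.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0

# Toy example: "cat" and "dog" occur in identical contexts here,
# so their distributional vectors come out (near-)identical.
corpus = [["the", "cat", "chased", "the", "mouse"],
          ["the", "dog", "chased", "the", "mouse"]]
vecs = cooccurrence_vectors(corpus)
print(cosine(vecs["cat"], vecs["dog"]))
```

A real implementation would work over a large Maltese corpus (e.g. from MLRS) and would typically weight counts (PMI, tf-idf) rather than use raw frequencies.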
Aim
To construct a distributional profile for Maltese words using tools such as Sketch Engine, Maltese Language Resource Server (MLRS), Natural Language Toolkit (NLTK) etc.
Deliverables
A. Tool and API which takes a corpus and builds a semantic lexicon (dictionary) based on distributional semantics
B. Integration of the tool into the METASHARE framework currently being developed within the context of the METANET4U project.
References/Resources
Lin, Dekang (1998). Automatic retrieval and clustering of similar words. In Proceedings of the 17th International Conference on Computational Linguistics (COLING-ACL 1998), pages 768-774, Montreal, Canada.
-------------------------------------------------------------------------
2. Developing Machine Translation for Maltese using the GF Grammatical Framework
GF (Grammatical Framework, Ranta 2004) is a special purpose functional language for defining multilingual grammars. A key feature of GF is that the syntactic features of a language are described at two levels: abstract and concrete. The abstract level is language-independent, whilst the concrete syntax links the abstract structures to actual sentences of a given language. It is thus suitable for the development of high-quality, domain-specific translation systems.
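The abstract/concrete split can be mimicked in a few lines of Python (a toy sketch, not actual GF code -- real GF grammars are written in GF's own source language, and the Maltese forms below are illustrative assumptions):

```python
from dataclasses import dataclass

# Abstract syntax: a language-independent tree.
# Pred(subj, verb) stands for a simple predication, e.g. "the cat sleeps".
@dataclass
class Pred:
    subj: str   # abstract lexical function, e.g. "Cat"
    verb: str   # abstract lexical function, e.g. "Sleep"

# Concrete syntaxes: one linearisation table per language.
english = {"Cat": "the cat", "Sleep": "sleeps"}
maltese = {"Cat": "il-qattus", "Sleep": "jorqod"}  # assumed forms

def linearise(tree, lexicon):
    """Map an abstract tree to a sentence of a given language."""
    return f"{lexicon[tree.subj]} {lexicon[tree.verb]}"

t = Pred("Cat", "Sleep")
print(linearise(t, english))  # the cat sleeps
print(linearise(t, maltese))  # il-qattus jorqod
```

Translation then amounts to parsing with one concrete syntax into the shared abstract tree and linearising with another; GF's resource grammars additionally handle agreement, inflection and word order, which this sketch ignores.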
The design and implementation of GF is based on Haskell, an elegant functional programming language. GF has a separate low-level run-time format, a multi-pass compiler, and an interactive development environment. The GF Resource Grammar Library plays the role of the standard library. From this perspective, it is a software library, similar to the Standard Template Library of C++ or the Java API. This makes it highly amenable to extension.
Although GF grammars exist for over 20 languages, Maltese is somewhat under-represented. The aim of this project is to extend the coverage of Maltese whilst developing a simple, domain-specific translation system, possibly in the areas of healthcare or tourism.
The project, which is highly interdisciplinary, will appeal to those with an interest in the Maltese language and functional programming. Experience with Haskell is desirable but could be acquired on the job.
Deliverables
A. Grammar and lexicon for Maltese
B. Translation component
C. Web based applications
D. Integration with the METASHARE framework being developed within the EU project METANET4U
Bibliography
[1] GF Tutorial http://www.grammaticalframework.org/doc/tutorial/gf-tutorial.html
[2] A. Ranta. Grammatical Framework: A Type-Theoretical Grammar Formalism. Journal of Functional Programming, 14(2), pp. 145-189, 2004. http://www.cse.chalmers.se/~aarne/articles/gf-jfp.pdf
[3] GF Book http://www.grammaticalframework.org/gf-book/
------------------------------------------------------------------------
3. Computing the Similarity of Natural Language Texts using Information Gain
The project proposed here concerns the development of algorithms which judge the similarity of two natural language texts. By texts here we mean anything from a book to a query. Such judgements are fundamental to a whole range of NL applications including question answering, automatic summarisation, classification and translation.
The basic idea, which incidentally has been used with some success in bioinformatics, is that similar texts share similar subparts, and we can therefore measure the similarity of texts in terms of an "alignment" between the subparts and the similarity of the subparts. The two main parameters to this process are (a) the definition of subpart and (b) the metric by which we judge the strength of a proposed alignment (since in general there will be many possible alignments for a given pair of texts). In this FYP we will adopt several definitions for (a). For (b) we will use the notion of information gain: a probabilistic measure, based on the notion of entropy, which estimates the extent to which a proposed method for choosing amongst possible alignments scores better than a purely random choice.
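The entropy-based measure can be made concrete as follows (a sketch; the labels and the candidate split used here are illustrative assumptions, not the project's final scoring scheme):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of labels."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(labels, partition):
    """Reduction in entropy when `labels` are split into `partition`
    (a list of sublists). High gain means the split is far better
    than a random one."""
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part) for part in partition)
    return entropy(labels) - remainder

# A candidate alignment that cleanly separates matching subparts
# from gaps achieves the maximal gain of 1 bit here.
labels = ["match", "match", "gap", "gap"]
split = [["match", "match"], ["gap", "gap"]]
print(information_gain(labels, split))
```

In the project, candidate alignments would be scored by how much information their induced partition of subparts gains over the unsplit distribution.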
Deliverables
- Implementation of core algorithms
- Parametrisable system for setting up different experiments
- System of evaluation
- Use Cases
Bibliography
[1] Tatsunori Mori, Miwa Kikuchi, Kazufumi Yoshida, Term Weighting Method based on Information Gain Ratio for Summarizing Documents retrieved by IR systems, Journal of Natural Language Processing, 9(4):3--32, 2002.
------------------------------------------------------------------------
4. Controlled Natural Page Generation
Many websites these days are, at least in part, derived from knowledge bases of one kind or another. The aim of this project is to develop a system that, on the basis of such acquired knowledge, can automatically generate a set of "natural" web pages that are stylistically appropriate for the expression of a body of pre-determined content. The system will need to make decisions at different levels of document organisation, including the number of pages to use and the relationship between them, navigation between individual pages, and the layout and organisation of individual pages (where to place individual items, headings, lists, graphics etc.).
It is expected that this system will involve the ability to reason dynamically about the assembly of the evolving pages, for which some kind of planning system would be defined. The project clearly has many overlaps with Natural Language Generation - and in fact will include a minor NLG component. However, the focus is on generating the paralinguistic aspects of web pages.
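The kind of decision the planner must take can be sketched very simply (a toy greedy planner under assumed inputs -- content items as strings and a fixed per-page capacity; the real system would plan over a knowledge base and richer layout constraints):

```python
def plan_pages(items, capacity=3):
    """Greedy planner: group content items into pages of at most
    `capacity` items, and link each page to the next for navigation."""
    pages = [items[i:i + capacity] for i in range(0, len(items), capacity)]
    plan = []
    for i, page in enumerate(pages):
        plan.append({
            "page": i + 1,
            "items": page,
            "next": i + 2 if i + 1 < len(pages) else None,  # None = last page
        })
    return plan

plan = plan_pages(["intro", "history", "products", "contact"], capacity=3)
for page in plan:
    print(page)
```

A realistic planner would replace the fixed capacity with constraints derived from content structure (e.g. rhetorical relations, visual balance) and search over alternative groupings rather than committing greedily.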
Deliverables:
Acquisition component: gets knowledge base from the web or by hand via an interface
RDF-like language definition for description of content.
Planning system for web-page construction.
A significantly complex worked example.
References
[1] Mann, William C. and Thompson, Sandra A. Rhetorical Structure Theory: Toward a functional theory of text organization. Text, 8(3):243–281, 1988.
[2] Richard Power, Donia Scott, and Nadjet Bouayad-Agha. Document structure. COMPUTATIONAL LINGUISTICS, 29:211–260, 2003
[3] Simon Lok, Steven Feiner, and Gary Ngai. 2004. Evaluation of visual balance for automated layout. In Proceedings of the 9th international conference on Intelligent user interfaces (IUI '04). ACM, New York, NY, USA, 101-108. DOI=10.1145/964442.964462 http://doi.acm.org/10.1145/964442.964462
----------------------------------------------------------------
5. Understanding Geospatial Descriptions
This project is structured around a large text corpus of real estate descriptions obtained within the scope of the Diadem project, a European Science Foundation project being carried out at the University of Oxford. The aim is to use NLP to extract and enhance knowledge about the geospatial aspects of such descriptions. This will tackle two aspects: on the one hand, the geographical location of the property with respect to other identifiable geospatial entities such as towns, shops etc.; on the other, the topological organisation of the property itself: the number and type of rooms, the layout of rooms on floors etc.
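A first, shallow pass over such descriptions might look like this (a pattern-matching sketch; the sample description and the patterns are illustrative assumptions -- the project itself would use proper NLP tooling rather than regular expressions alone):

```python
import re

def extract_rooms(description):
    """Pull simple topological facts ("3 bedrooms", "2 bathrooms")
    out of a free-text property description."""
    rooms = {}
    for count, kind in re.findall(r"(\d+)\s+(bedroom|bathroom|reception room)s?",
                                  description, flags=re.IGNORECASE):
        rooms[kind.lower()] = int(count)
    return rooms

def extract_landmarks(description):
    """Pull nearby geospatial entities introduced by 'close to'/'near'."""
    return re.findall(r"(?:close to|near)\s+the\s+([a-z ]+?)(?=[,.])",
                      description, flags=re.IGNORECASE)

desc = ("Charming house with 3 bedrooms and 2 bathrooms, "
        "close to the town centre, near the local shops.")
print(extract_rooms(desc))      # {'bedroom': 3, 'bathroom': 2}
print(extract_landmarks(desc))  # ['town centre', 'local shops']
```

The extracted facts would feed the two data models in the deliverables: room counts and layout into the topological model, and landmark mentions (after geocoding) into the geographical one.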
Deliverables
Dataset preprocessor: filtering to identify the most relevant part of the description
Data Models: (i) geographical and (ii) topological
Presentation: Query processor for interrogating the extracted knowledge
Evaluation: determines how much knowledge was actually extracted.
Bibliography
[1] N. Blaylock, B. Swain, and J. Allen. TESLA: A tool for annotating geospatial language corpora. In Proceedings of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL HLT) 2009 Conference, Boulder, Colorado, May 31–June 5 2009.