Last Updated: May 2010 (this website is still under construction)
Previous home page (2003-2006)
Previous Teaching Experience:
* Introduction to Computational Linguistics
* Introduction to Corpus Linguistics
Facebook, Sweden
Research Statement
It is easy nowadays to collect large digital document collections in many different languages, but when these collections are not classified by any textual categories, their usefulness is seriously diminished, thus causing a waste of resources and loss of information.
Documents can be classified into topical and non-topical text categories, which I call descriptors. Examples of topical descriptors are topic, content, subject matter or domain. Examples of non-topical descriptors are genre, register, style, sentiment/opinion, readability and vulgarisation, or layout structure (e.g. tables or lists). My current research focuses on the automatic classification of web documents by non-topical descriptors. Combined with topical descriptors, non-topical descriptors can help profile documents in a more realistic, accurate and productive way. For this reason, they would benefit all fields where language variation is important, and especially research areas where language technology can be enhanced or refined by a more fine-grained document typology, e.g. corpus linguistics, Natural Language Processing (NLP), automatic summarization, machine translation and information retrieval/extraction.
Unfortunately, annotating documents by non-topical descriptors is not always an easy task. Like any manual annotation, it is time-consuming, controversial and prone to error, because human annotators easily get tired or confused by this tedious task. Automating this activity would help avoid some of the predictable pitfalls associated with it. However, there are no large, agreed-upon evaluation resources to test the efficiency and performance of automatic classification for many non-topical descriptors.
My research goal is to create evaluation resources for genre and other non-topical descriptors. Another research goal is to apply and evaluate supervised, semi-supervised and unsupervised classification methods, as well as other statistical approaches, to provide large unannotated corpora with non-topical descriptors. The ultimate goal is to propose methods to improve the overall classification performance and shed light on the relations among different descriptors. To date, the interaction and correlation among non-topical descriptors, and between topical and non-topical descriptors, remain underexplored.
Practical Activities
* Teaching Italian as a Foreign Language in Stockholm (Italiano con diletto)
Academic Activities 2010
Mehler A., Sharoff S. and Santini M. (eds) (2010). Genres on the web: Computational Models and Empirical Studies. Springer Series: Text, Speech and Language Technology (Series Editors: Nancy Ide, Jean Véronis).
Identificazione automatica dei generi testuali sul web: Stato dell'arte [Automatic identification of text genres on the web: State of the art]. Round table PAISÀ – CiC, Università di Bologna, 9 April 2010.
Editorial and Organizational Activities
2009
2008
2007
- Co-organizer and co-chair with Serge Sharoff of the Colloquium "Towards a Reference Corpus of Web Genres" (Friday, 27 July 2007) held in conjunction with Corpus Linguistics 2007, Birmingham, UK (http://corpus.leeds.ac.uk/serge/webgenres/colloquium/).
- Co-organizer and co-chair with Georg Rehm: Workshop "Towards Genre-Enabled Search Engines: The Impact of NLP" (Sunday, 30 Sept. 2007) held in conjunction with RANLP, Borovets, Bulgaria (http://www.sics.se/use/genre-ws/).
Publications
Forthcoming
- Santini M., Sharoff S. and Mehler A. "Riding the Rough Waves of the Web", Introduction. In Mehler A., Sharoff S. and Santini M. (eds.), Genres on the web: Computational Models and Empirical Studies, Springer.
- Santini M. "Cross-testing a Genre Classification Model for the Web". In Mehler A., Sharoff S. and Santini M. (eds.), Genres on the web: Computational Models and Empirical Studies, Springer.
- Santini M., Rehm G., Sharoff S. and Mehler A. Editorial of the Special Issue: "Automatic Genre Identification: Issues and Prospects" (http://ldv-forum.org/2009_Heft1/Editorial.pdf). Journal for Language Technology and Computational Linguistics (JLCL) 2009, volume 25, number 1.
- Santini M. Classifying web genres automatically. Chapter in the book: Genre theory and new literacies. Applications to autonomous language learning, Springer.
2008
- Santini M. (2008). Cross-testing a Genre Classification Model. The second Swedish Language Technology Conference (SLTC-008). November 20 - 21, 2008, Stockholm. Poster Paper. Proceedings.
- Santini M. and Rosso M. (2008). “Testing a Genre-Enabled Application: A Preliminary Assessment”, Proceedings of Future Directions in Information Access (FDIA-2008), BCS, London.
- Rehm G., Santini M., Mehler A., Braslavski P., Gleim R., Stubbe A., Symonenko S., Tavosanis M. and Vidulin V. (2008). “Towards a Reference Corpus of Web Genres for the Evaluation of Genre Identification Systems”, LREC 2008. Marrakech.
- Santini M. (2008). State of the Art in Automatic Genre Classification: Where do we go from here? Talk. University of Glasgow, Glasgow, UK <http://www.dcs.gla.ac.uk/research/groups/oneevent.cfm?eventid=2559>.
- Santini M. (2008). “WebGenre and NLP: Identification of genres on the web through the processing of natural language”. Position Paper. Processing Text-technological Resources Conference, Bielefeld University, Germany. <http://coli.lili.uni-bielefeld.de/Texttechnologie/Forschergruppe/PTTR/abstracts/Abstract-Santini.pdf>.
- Santini M. (2008). “Zero, Single, or Multi? Genres of Web Pages through the Users' Perspective”. Information Processing & Management. Volume 44, Issue 2, March 2008, pp. 702–737.
2007
Book Reviews
- Review: Bateman J. Multimodal Documents and Genre. LINGUIST List 21.1606, Fri April 02 2010
- Review: Heyd T. (2008). Email Hoaxes - Form, Function, Genre Ecology. LINGUIST List 21.75, Thu Jan 07 2010
- Review: Hundt, Nesselhauf and Biewer (eds, 2006) Corpus Linguistics and the Web. Corpora. Volume 4, Page 209-211
- Review: Discourse on the move by D. Biber, U. Connor and T. Upton, Computational Linguistics, March 2009, Vol. 35, No. 1, Pages 105-107.
- Review: Bruce I. (2008). Academic Writing and Genre. A Systematic Analysis. LINGUIST List 19.3079, Fri Oct 10 2008
Resources
WEBGENREWIKI: http://purl.org/net/webgenres
Other Interests and Hobbies
Guiding in Stockholm (Baltic Cruise Guide)
Teaching Italian as a Foreign Language at Folkuniversitetet, Stockholm (www.folkuniversitetet.se/stockholm) and at the Istituto Italiano di Cultura, Stockholm (http://www.iicbelgrado.esteri.it/IIC_Stoccolma/Menu/Imparare_Italiano/I_corsi_di_lingua/Docenti_e_testi/)
Conversazioni letterarie (Folkuniversitetet, Stockholm): http://www.folkuniversitetet.se/templates/Arr.aspx?id=171650&LeftMenuPageId=111144
Email Contacts
MarinaSantini.MS-->--gmail.com
MarinaRomeStockholm-->--gmail.com
marina.santini-->--folkuniversitetet.se
marina.santini-->--student.kyh.se
my_manual_genre_labelling_1000SPIRIT_webpages_NOVEMBER2008_matching_with_the_initial_corpus.xls (174k) Marina Santini, Mar 29, 2009 7:37 PM