Marina Santini: Genre, Web Pages, Automatic Classification - Research 2007-2010

 
Last Updated: May 2010 (this website is still under construction)

Contact Me 

Visit me

Previous home page (2003-2006)

My CV

Previous Teaching Experience:

* Introduction to Computational Linguistics

* Introduction to Corpus Linguistics

 

Facebook, Sweden

 

 

Research Statement

It is easy nowadays to collect large digital document collections in many different languages, but when these collections are not classified by any textual categories, their usefulness is seriously diminished, thus causing a waste of resources and loss of information.

Documents can be classified into topical and non-topical text categories, which I call descriptors. Examples of topical descriptors are topic, content, subject matter or domain. Examples of non-topical descriptors are genre, register, style, sentiment/opinion, readability and vulgarisation, or layout structure (e.g. tables or lists). My current research focuses on the automatic classification of web documents by non-topical descriptors. Combined with topical descriptors, non-topical descriptors can help profile documents in a more realistic, accurate and productive way. For this reason, they would benefit all fields where language variation is important, and especially research areas where language technology can be enhanced or refined by a more fine-grained document typology, e.g. corpus linguistics, Natural Language Processing (NLP), automatic summarization, machine translation and information retrieval/extraction.

Unfortunately, annotating documents by non-topical descriptors is not always an easy task. Like any manual annotation, it is time-consuming, controversial and prone to error, because human annotators are easily tired or confused by such a tedious task. Automating this activity would avoid some of the predictable pitfalls associated with it. However, there are no large, agreed-upon evaluation resources to test the efficiency and performance of automatic classification for many non-topical descriptors.

My research goal is to create evaluation resources for genre and other non-topical descriptors. Another research goal is to apply and evaluate supervised, semi-supervised and unsupervised classification methods, as well as other statistical approaches, to provide large unannotated corpora with non-topical descriptors. The ultimate goal is to propose methods to improve the overall classification performance and shed light on the relations among different descriptors. To date, the interaction and correlation among non-topical descriptors, and between topical and non-topical descriptors, are still underexplored.

 Practical Activities

** Agile Web Development course at KYH 

** GeoTimes project at About Time AB

** Teaching Italian as a Foreign Language in Stockholm (Italiano con diletto)

 Academic Activities 2010

The WebRider Project -- Work in progress.

Mehler A., Sharoff S. and Santini M. (eds.) (2010). Genres on the web: Computational Models and Empirical Studies. Springer Series: Text, Speech and Language Technology (Series Editors: Nancy Ide, Jean Véronis).

Identificazione automatica dei generi testuali sul web: Stato dell’arte [Automatic identification of text genres on the web: State of the art]. Round table PAISÀ – CiC, University of Bologna, 9 April 2010.

Editorial and Organizational Activities

2009

2008

2007

  • Co-organizer and co-chair with Serge Sharoff of the Colloquium "Towards a Reference Corpus of Web Genres" (Friday, 27 July 2007) held in conjunction with Corpus Linguistics 2007, Birmingham, UK (http://corpus.leeds.ac.uk/serge/webgenres/colloquium/).
  • Co-organizer and co-chair with Georg Rehm: Workshop "Towards Genre-Enabled Search Engines: The Impact of NLP" (Sunday, 30 Sept. 2007) held in conjunction with RANLP, Borovets, Bulgaria (http://www.sics.se/use/genre-ws/).
     

Talks

2009

2008
 
Publications
 
Forthcoming
  • Santini M., Sharoff S. and Mehler A. "Riding the Rough Waves of the Web", Introduction. In Mehler A., Sharoff S. and Santini M. (eds.), Genres on the web: Computational Models and Empirical Studies, Springer.
  • Santini M. "Cross-testing a Genre Classification Model for the Web". In Mehler A., Sharoff S. and Santini M. (eds.), Genres on the web: Computational Models and Empirical Studies, Springer.
  • Santini M. and Sharoff S. "Web Genre Benchmark Under Construction". Journal for Language Technology and Computational Linguistics (JLCL) 2009, volume 25, number 1, Special Issue "Automatic Genre Identification: Issues and Prospects" (http://ldv-forum.org/2009_Heft1/07Marina_Santini_and_Serge_Sharoff.pdf).
  • Santini M., Rehm G., Sharoff S. and Mehler A. Editorial of the Special Issue "Automatic Genre Identification: Issues and Prospects". Journal for Language Technology and Computational Linguistics (JLCL) 2009, volume 25, number 1 (http://ldv-forum.org/2009_Heft1/Editorial.pdf).
  • Santini M. Classifying web genres automatically. Chapter in the book: Genre theory and new literacies. Applications to autonomous language learning, Springer.

2008

  • Santini M. (2008). Cross-testing a Genre Classification Model. The second Swedish Language Technology Conference (SLTC-008). November 20 - 21, 2008, Stockholm. Poster Paper. Proceedings.
  • Santini M. and Rosso M. (2008). “Testing a Genre-Enabled Application: A Preliminary Assessment”, Proceedings of Future Directions in Information Access (FDIA-2008), BCS, London.
  • Rehm G., Santini M., Mehler A., Braslavski P., Gleim R., Stubbe A., Symonenko S., Tavosanis M. and Vidulin V. (2008). “Towards a Reference Corpus of Web Genres for the Evaluation of Genre Identification Systems”, LREC 2008. Marrakech.
  • Santini M. (2008). State of the Art in Automatic Genre Classification: Where do we go from here? Talk. University of Glasgow, Glasgow, UK <http://www.dcs.gla.ac.uk/research/groups/oneevent.cfm?eventid=2559>.
  • Santini M. (2008). “WebGenre and NLP: Identification of genres on the web through the processing of natural language”. Position Paper. Processing Text-technological Resources Conference, Bielefeld University, Germany. <http://coli.lili.uni-bielefeld.de/Texttechnologie/Forschergruppe/PTTR/abstracts/Abstract-Santini.pdf>.
  • Santini M. (2008). “Zero, Single, or Multi? Genres of Web Pages through the Users' Perspective”. Information Processing & Management. Volume 44, Issue 2, March 2008,  pp. 702–737.
2007

 Book Reviews

 Resources

 WEBGENREWIKI:  http://purl.org/net/webgenres

 Other Interests and Hobbies

Guiding in Stockholm (Baltic Cruise Guide)

Teaching Italian as a Foreign Language at Folkuniversitetet, Stockholm (www.folkuniversitetet.se/stockholm) and at the Istituto Italiano di Cultura, Stockholm (http://www.iicbelgrado.esteri.it/IIC_Stoccolma/Menu/Imparare_Italiano/I_corsi_di_lingua/Docenti_e_testi/)

Conversazioni letterarie (Folkuniversitetet, Stockholm): http://www.folkuniversitetet.se/templates/Arr.aspx?id=171650&LeftMenuPageId=111144

 email Contacts

 

MarinaSantini.MS-->--gmail.com

MarinaRomeStockholm-->--gmail.com

marina.santini-->--folkuniversitetet.se

marina.santini-->--student.kyh.se

my_manual_genre_labelling_1000SPIRIT_webpages_NOVEMBER2008_matching_with_the_initial_corpus.xls
(174k)
Marina Santini,
Mar 29, 2009, 7:37 PM