CLUK 2004

Marina Santini

PhD student at

Natural Language Technology Group

University of Brighton
Watts Building, Moulsecoomb Campus

Brighton BN2 4GJ, UK

Tusculum: city of ancient Latium. The ruins of this city are near modern Frascati, 15 mi (24 km) SE of Rome, Italy. According to legend, Tusculum was founded by Telegonus, son of Ulysses, and it early became an important city. It was a favorite summer residence of Roman nobles; Pliny the Younger, Cicero, and the emperors Nero and Titus were among those who built villas there. It continued to be important until 1191, when it was razed by the Romans. Ruins include those of villas, an amphitheater, and a theater.

Contact: santinim {at} inwind {dot} it

Castelli Romani (Rome, Italy)
***This web page will not be updated after August 2007***




Ongoing Activities and Future Directions (August 2007)

Ongoing Activities

Co-organizing and co-chairing with Serge Sharoff the Colloquium "Towards a Reference Corpus of Web Genres" (Friday, 27 July 2007) held in conjunction with Corpus Linguistics 2007, Birmingham, UK.

Co-organizing and co-chairing with Georg Rehm the Workshop "Towards Genre-Enabled Search Engines: The Impact of NLP" (Sunday, 30 Sept. 2007) held in conjunction with RANLP, Borovets, Bulgaria.

Future Directions

Genre and Sentiment

Michel Généreux and I started exploring whether the use of genre-revealing features is profitable for automatic sentiment analysis. Our first investigations are described in:

Généreux M. and Santini M. (2007). Exploring the use of Linguistic Features in Sentiment Analysis. Corpus Linguistics 2007 - 27-30 July 2007, Birmingham.

Généreux M. and Santini M. (2007). Défi: Classification de Textes Français Subjectifs. 3ème DÉfi Fouille de Textes - 3rd July 2007, Grenoble.

Genre and IR

In the future, I would like to gain more insight into the real-world feasibility of setting up a genre-enabled search engine.

Thesis, Publications and Talks (PhD Project at University of Brighton, UK)

All the material presented in my PhD thesis* has been published in conference proceedings, talks and journal articles. More specifically, some issues addressed in Chapter 3 were published in the journal article Santini (2006h) and were presented in the talk Santini (2005d). Content from other chapters has been published in conference or workshop proceedings, namely Chapter 4 in Santini (2004b); Chapter 5 in Santini (2005b) and more extensively in the technical report Santini (2005e); Chapter 7 in Santini (2005c, 2005f); and Chapter 10 in Santini (2006d, 2006e). A preliminary sketch of the model implemented in Chapter 11 was outlined in Santini (2004a). From Chapter 11 I extracted some other papers, namely Santini (2005a, 2006f) and, in co-operation with my supervisors Richard Power and Roger Evans, Santini et al. (2006). Material in Chapter 11 was also used for presentations and short papers, namely Santini (2006a), Santini (2006b), and Santini (2007a). I will expand this chapter in the book chapter Santini (2008). Although Chapter 2 appears as a quick synthesis of the literature on automatic genre and text type identification, it is based on extensive background studies described in the technical report Santini (2004c) and presented in the invited talk Santini (2006c). Santini (2007c) contains material from Chapter 1 and Chapter 9. Chapter 8 will be published in the journal article Santini (2007b). A previous interpretation of the data reported in Chapter 8 is in Santini (2006g). This interpretation may be resumed in future work. What I had in mind when I started this PhD in September 2002 can be found in Santini (2003).

There is some overlap in these publications. This is because they share motivation and aims, but also because some of them were prepared for different audiences, namely corpus linguists, computational linguists, genre analysts or information retrieval practitioners.

I would like to thank the anonymous reviewers of these publications for their comments and suggestions, from which my thesis and research have benefited greatly.

Last but not least, I would like to thank my supervisors, Richard Power and Roger Evans, for their invaluable help, and my examiners, Michael Oakes and Lyn Pemberton, for the enjoyable discussion during the viva.

* PhD Thesis: Santini M. (2007). Automatic Identification of Genre in Web Pages. Thesis submitted for the degree of Doctor of Philosophy, University of Brighton, Brighton (UK). PDF (345 pages, version submitted to University of Brighton) - PDF (277 pages, compressed line spacing)





[Santini, 2008] Santini M. (2008). Genre of Web Pages: Implementing a Zero-to-Multi-Genre Classification Scheme. In Mehler A., Sharoff S., Rehm G. and Santini M. (eds.), Genres on the Web: Corpus Studies and Computational Models. In Preparation.


[Santini, 2007a] Santini M. (2007). Automatic Genre Identification: Towards a Flexible Classification Scheme. BCS IRSG Symposium: Future Directions in Information Access 2007 (FDIA 2007), Tuesday, 28th and Wednesday, 29th of August, Glasgow, Scotland. Held in conjunction with the European Summer School on IR (ESSIR 2007).

[Santini, 2007b] Santini M. (in press). Zero, single, or multi? Genre of web pages through the users' perspective. Information Processing and Management.

[Santini, 2007c] Santini M. (2007). Characterizing Genres of Web Pages: Genre Hybridism and Individualization, 40th Annual Hawaii International Conference on System Sciences (HICSS'07).


[Santini, 2006a] Santini M. (2006). Towards a Zero-to-Multi-Genre Classification Scheme, Journée ATALA "Typologies de textes pour le traitement automatique", 9 décembre 2006, Paris.

[Santini, 2006b] Santini M. (2006). Deriving web genres from text types: a corpus-based approach, slides & abstracts, American Association of Applied Corpus Linguistics (AAACL), October 20-22, 2006, Flagstaff, AZ, USA.

[Santini, 2006c] Santini M. (2006). From Biberian text types to genres of web pages: An overview of studies on automatic genre identification, slides & references, GENRE TEXTUEL/DOMAINE/ACTIVITÉ, Toulouse, 5 et 6 octobre 2006, Journées d'étude organisées par l'opération «Sémantique et Corpus» / TEXTUAL GENRE/FIELDS/ACTIVITY, October 5th and 6th, 2006, Toulouse, France. Workshop organised by the «Sémantique et Corpus» group.

[Santini, 2006d] Santini M. (2006). Common Criteria for Genre Classification: Annotation and Granularity, Workshop on Text-based Information Retrieval (TIR-06), in conjunction with ECAI 2006, Riva del Garda, Italy, Aug 29th, 2006.

[Santini, 2006e] Santini M. (2006). Some issues in Automatic Genre Classification of Web Pages, JADT 2006 - 8èmes Journées internationales d'analyse statistique des données textuelles du 19 au 21 avril 2006 à l'université de Besançon (France).

[Santini, 2006f] Santini M. (2006). Identifying Genres of Web Pages, TALN 2006 - Traitement Automatique des Langues Naturelles: du 10 au 12 avril 2006 à Leuven (Belgique)/Natural Language Processing: April 10-12, 2006 in Leuven (Belgium).

[Santini, 2006g] Santini M. (2006). Interpreting Genre Evolution on the Web, EACL 2006 Workshop: NEW TEXT - Wikis and blogs and other dynamic text sources. Preface to the Proceedings. ERCIM news.

[Santini, 2006h] Santini M. (2006). Web pages, text types, and linguistic features: Some issues. ICAME Journal, Vol. 30, pp. 67-86.

[Santini et al., 2006] Santini M., Power R., Evans R. (2006). Implementing a Characterization of Genre for Automatic Genre Identification of Web Pages, Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pp. 699–706, Sydney, July 2006.


[Santini, 2005a] Santini M. (2005). Automatic Text Analysis: Gradations of Text Types in Web Pages. Proceedings of the Tenth ESSLLI Student Session, 8-19 August, 2005, Edinburgh, UK, pp. 276-285.

[Santini, 2005b] Santini M. (2005). Building on Syntactic Annotation: Labelling Subordinate Clauses. Proceedings of the Workshop on Exploring Syntactically Annotated Corpora, 14th July 2005, Workshop held in conjunction with the Corpus Linguistics 2005 Conference, University of Birmingham, 14-17 July 2005, UK, pp. 35-46.

[Santini, 2005c] Santini M. (2005). Clustering Web Pages to Identify Emerging Textual Patterns. RÉCITAL 2005, 06-10 June 2005 - Dourdan, France. Poster.

[Santini, 2005d] Santini M. (2005). Annotated corpora vs. raw web page collections. Text types, web pages, and Linguistic features: Some issues. AAACL/ICAME, 12-15 May 2005, Ann Arbor, MI. Slides.

[Santini, 2005e] Santini M. (2005). Linguistic Facets for Genre and Text Type Identification: A Description of Linguistically-Motivated Features. Technical Report ITRI-05-02, 2005, ITRI, University of Brighton (UK). ps & pdf.

[Santini, 2005f] Santini M. (2005). Genres In Formation? An Exploratory Study of Web Pages using Cluster Analysis. Proceedings of the 8th Annual Colloquium for the UK Special Interest Group for Computational Linguistics (CLUK 8), University of Manchester (UK), 11 January, 2005. Slides. (See also CLUK home page).


[Santini, 2004a] Santini M. (2004). Identification of Genres on the Web: a Multi-Faceted Approach. Proceedings of the ECIR 2004 (26th European Conference on IR Research), Volume 2, Poster Abstracts, Edited by Michael P. Oakes, University of Sunderland (UK), 5-7 April, 2004.

[Santini, 2004b] Santini M. (2004). A Shallow Approach To Syntactic Feature Extraction For Genre Classification. Proceedings of the 7th Annual Colloquium for the UK Special Interest Group for Computational Linguistics (CLUK 7), University of Birmingham (UK), 6-7 January, 2004 (see also CLUK home page).

[Santini, 2004c] Santini M. (2004). State-of-the-art on Automatic Genre Identification. Technical Report ITRI-04-03, 2004, ITRI, University of Brighton (UK).


[Santini, 2003] Santini M. (2003). Identifying Genres on the Web, Technical Report ITRI-03-06, 2003, ITRI, University of Brighton (UK).


Web Genres for Download

The 7-web-genre collection includes 200 blogs, 200 e-shops, 200 FAQs, 200 online newspaper front pages, 200 listings, 200 personal home pages and 200 search pages. It was built following the criteria of 'annotation by objective sources' and 'consistent genre granularity'. For details, see Santini (2006d) and Chapter 10 of the thesis.

200 blogs, 200 e-shops, 200 FAQs, 200 front pages, 200 lists (various types), 200 personal home pages, 200 search pages, 1,000 random unclassified web pages from the SPIRIT collection*, KI-04 corpus (a.k.a. Meyer-zu-Eissen web page collection)** (original version and working corpus of 1,205 web pages), 25 web pages used in the web user study, and a small BBC-on-line corpus (20 DIYs, 20 editorials, 20 short biographies, 20 hot-topics).


*Joho H. and Sanderson M. (2004), The SPIRIT collection: an overview of a large web collection, SIGIR Forum, December 2004 (Vol. 38, No. 2).

**Meyer zu Eissen S. and Stein B. (2004), Genre Classification of Web Pages: User Study and Feasibility Analysis, in Biundo S., Frühwirth T., Palm G. (eds.), Advances in Artificial Intelligence, Springer, Berlin, pp. 256-269.
The KI-04 corpus was collected using bookmarks from about five people. Some genres were extended to get a better balance. The corpus was sorted by three people, one of whom wrote a bachelor's thesis (in German) on the corpus-building process. One of the authors of the paper checked many of the pages, and most of the sorting complied with his understanding of the genre categories. The download date was January 26th, 2004.

Manual Evaluation

These datasets contain my manual evaluation. I used them in Santini (2006d) and Chapter 10 in the thesis.

* SPIRIT predictions with the 7-web-genre palette

* SPIRIT predictions with the KI-04 palette (8 web genres)

Genre Features

This is a description of the three feature sets that I used for automatic genre identification (DRAFT). A more comprehensive description can be found in the thesis, namely in Chapters 3-6 and Appendices B-C.

Publications (Master Project)

From my MSc dissertation at UMIST, I published the following articles:

Santini M. (2003). Fattori per i testi. Italiano e oltre, 2/2003, La Nuova Italia, pp. 78-82.

Santini M. (2001). Text typology and statistics. Explorations in Italian press subgenres. Italian Journal of Linguistics/Rivista di linguistica, Volume 13, Issue 2, pp. 339-374.

Book Reviews

Marina Santini, Review of "Text Types and the History of English", LINGUIST List 15.3136, Mon Nov 08 2004.

Marina Santini, Review of "Web Advertising", LINGUIST List 16.1652, Mon May 23 2005.



Maintained by Marina Santini - Last updated: August 2007.

NLTG (Natural Language Technology Group)
©University of Brighton (UK)