
Topical facets: Semantic patterns between documents and the vocabulary

Posted on: 2009-04-19
Degree: Dr
Type: Thesis
University: Universiteit Antwerpen (Belgium)
Candidate: Van Horenbeeck, Eric
Full Text: PDF
GTID: 2448390002491536
Subject: Language
Abstract/Summary:
This Topical Facets thesis is about unsupervised information discovery. A striking share of any text consists of reused chunks: units more complex than words but less complex than the text itself or its main components. These chunks form a third layer, made of recycled phrases, between petrified documents and their constituent tokens. Texts that share many of these fragments also have a meaning in common, hence the name topical facets. The word topic refers to the pivotal theme of a language production; it concerns what a text is about. The term facet specifies that only a fraction of the whole topic is available, so many topical facets are needed to compose a full topic. When we collect texts based on the topical facets they have in common, we posit at the same time that the documents are gathered around a shared meaning.

A relatively recent theory underpins this conjecture: it describes natural language as a special class of network, the small-world network. Language is seen as a special kind of social network in which word clusters have short internal distances and are connected to other clusters by long-distance links (hubs). Available words are bound to each other via function words, which act as roundabouts redirecting incoming links to other content words. The language network is dynamic: it grows by preferential linking, and decays when words become obsolete or change meaning. Preferential linking manifests itself in the community-collection dimension of a document and in the link frequency of a term. A text network is conceptually interesting and provides a computationally efficient setting for accessing documents in different ways. The search for meaningful topics is one of them.

To check these assumptions, we use a corpus of 30,000 manually annotated documents. In this controlled environment we test the application on its ability to allocate unseen documents to the right topic without external help. The Topical Facet Application stands the test when judged against other systems.
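
The idea of gathering texts by the phrases they reuse can be illustrated with a small sketch. It is not the thesis's actual method: it assumes, purely for illustration, that a "chunk" is a word trigram and that similarity is the Jaccard overlap of two documents' trigram sets. The function names (word_ngrams, shared_chunks, chunk_overlap) are hypothetical.

```python
import re


def word_ngrams(text, n=3):
    """Return the set of lowercase word n-grams found in text.

    Assumption: a reused "chunk" is approximated by a word n-gram.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shared_chunks(text_a, text_b, n=3):
    """Return the n-grams that occur in both texts."""
    return word_ngrams(text_a, n) & word_ngrams(text_b, n)


def chunk_overlap(text_a, text_b, n=3):
    """Jaccard overlap of the two texts' n-gram sets (0.0 if both are empty)."""
    a, b = word_ngrams(text_a, n), word_ngrams(text_b, n)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    doc_1 = "the small world network of language"
    doc_2 = "language as a small world network"
    print(shared_chunks(doc_1, doc_2))
    print(chunk_overlap(doc_1, doc_2))
```

Under these assumptions, documents with a high overlap score would be candidates for the same topic; a realistic system would need variable-length chunks and a far larger corpus than two sentences.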
Keywords/Search Tags: Topical, Documents, Text