Research And Implementation Of Web Name Disambiguation

Posted on: 2011-07-28
Degree: Master
Type: Thesis
Country: China
Candidate: Y S Wang
Full Text: PDF
GTID: 2178360305476546
Subject: Computer application technology
Abstract/Summary:
Name ambiguity is a special case of identity uncertainty in which one person may be referred to by several name variations in different situations, or may share the same name with other people. With the development of the Internet, the performance of many Internet applications, especially search engines, is affected by name ambiguity in web pages. Web name disambiguation addresses how to disambiguate a person name that appears in web pages and resolve each mention to a specific person. It is currently an active research topic in natural language processing.

First, this thesis analyzes the research status of name disambiguation and then proposes a basic framework and process for name disambiguation that combines its key technologies.

Second, this thesis proposes a CSS-based web page text extraction algorithm, derived from an analysis of web page layout and of the characteristics of name disambiguation. The algorithm parses the web page and processes its layout to extract the CSS information, and on that basis extracts the content and some useful tags from the web page for name disambiguation.

Finally, feature selection plays a crucial role in name disambiguation. This thesis focuses on shallow semantics. It proposes a web name disambiguation approach based on LDA (Latent Dirichlet Allocation) and the context snippets around a name, motivated by two observations: web pages about the same real person are topically related, and the context around a name is less noisy than the full page text. The approach preprocesses the text extracted from web pages with a topic model to obtain the topic relations of the text, and then disambiguates the names in the text according to those topic relations. For clustering the names, the approach also introduces an improved K-means algorithm based on the maximum principle. The experimental results show that the approach can improve the performance of name disambiguation.
Keywords/Search Tags: Name Disambiguation, Web Page Content Extraction, Clustering, Feature Selection, LDA Model
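
The abstract does not give the details of the context-snippet extraction or of the improved K-means algorithm, so the following Python sketch is only an illustration under stated assumptions. It assumes that a context snippet is a fixed window of words around each mention of the name, and that the "maximum principle" refers to max-min distance seeding (each new initial center is the point farthest from the centers already chosen). The function names, the window size, and the sample vectors are hypothetical; the vectors stand in for per-snippet topic distributions that a topic model such as LDA would produce.

```python
import math
import re


def context_snippets(text, name, window=5):
    """Collect up to `window` words on each side of every mention of `name`."""
    tokens = re.findall(r"\w+", text.lower())
    name_tokens = name.lower().split()
    n = len(name_tokens)
    snippets = []
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == name_tokens:
            left = tokens[max(0, i - window):i]
            right = tokens[i + n:i + n + window]
            snippets.append(left + right)
    return snippets


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def max_min_seeds(points, k):
    """Assumed reading of the 'maximum principle': start from the first point,
    then repeatedly add the point whose distance to its nearest seed is largest."""
    seeds = [points[0]]
    while len(seeds) < k:
        farthest = max(points, key=lambda p: min(euclidean(p, s) for s in seeds))
        seeds.append(farthest)
    return seeds


def kmeans(points, k, iterations=20):
    """Plain K-means with max-min seeding; returns one cluster label per point."""
    centers = [list(s) for s in max_min_seeds(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda c: euclidean(p, centers[c]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


if __name__ == "__main__":
    page = "Dr. John Smith teaches physics at MIT. John Smith plays guitar."
    print(context_snippets(page, "John Smith", window=2))

    # Hypothetical 2-topic distributions for four snippets mentioning one name.
    topic_vectors = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]
    print(kmeans(topic_vectors, k=2))
```

Seeding the centers far apart avoids the case where random initialization places two initial centers inside the same group of pages, which matters when the number of distinct people sharing a name is small.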