
Research On Domain-Specific Web Information Collection And Topic Detection And Its Application

Posted on: 2011-08-14
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y H Wu
GTID: 1118330338989385
Subject: Computer application technology
Abstract/Summary:
Ever since the Internet became an important part of daily life, Internet users have faced a huge number of news articles covering complex topics. Users are no longer satisfied with search results that contain only information describable by keywords. They need more professional and personalized information services that go beyond traditional keyword search and can automatically find and recommend topics and events from the Internet. This dissertation presents research on domain-specific web information collection and web topic detection, combining Natural Language Processing and Data Mining methods, in order to provide more professional and personalized knowledge services to Internet users.

The collection of domain-specific web information is a prerequisite for an Internet knowledge service system. Although the crawling module of information collection has been studied extensively, many problems remain in domain-specific adaptive refresh strategies for large-scale web page collections. We propose an adaptive, incremental refresh model for domain-specific web pages, which improves the efficiency of domain-specific web page collection.

Topic detection and hot news recommendation are important parts of a domain-specific knowledge service system. Traditional topic detection methods mainly use text clustering algorithms based on document co-occurrence features, which are not sufficient to provide more professional services for domain-specific topic detection.

The feature space of web pages grows as new pages are collected, which consumes more system resources and leads to low precision. We study the feature selection and reduction problem using the tolerance rough set model. The complexity of an on-line topic detection system can be reduced by clustering on topical words. However, web page documents always contain too many words, which increases system complexity. The proposed feature selection method uses the topical words extracted from the topic area of semi-structured web pages, together with the nouns. The tolerance rough set model is then used to extend this topical word set. Experiments show that the feature space and the running complexity of the system are greatly reduced. We also improve the traditional on-line new event detection task using incremental TF-IDF and timeline analysis.

To analyse the topic structure of web page documents, we introduce topic models. First, a comparative study of topic models on the topic detection task is presented. A topic model analyses the semantic structure of web page documents by projecting word features into a semantic space, which is more effective than the traditional method based on document co-occurrence features. The topic distance matrix can be derived by topic decomposition using topic models. Experiments show that more topic features can be extracted by projecting word features into the semantic space, which improves the precision of web topic detection and alleviates the loss of system performance.

The number of topics is always unknown and changes as web pages are collected. To automatically recommend topic-based web news to Internet users, an adaptive topic detection method is needed. We propose a new topic detection and news recommendation method that combines the LDA topic model with the affinity propagation algorithm. Experiments show that our method automatically finds topics that match the real topic structure of the web pages. Based on this adaptive topic detection method, our system can effectively recommend topic-based web news to Internet users.
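
The tolerance rough set step that extends a topical word set can be illustrated with a minimal sketch. The abstract does not give the dissertation's exact formulation, so this assumes the commonly used definition in which the tolerance class of a term is the term itself plus every term that co-occurs with it in at least `theta` documents, and the extended set is the upper approximation of the original word set. The function names, the threshold value, and the toy corpus are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def tolerance_classes(docs, theta):
    """Map each term to its tolerance class: the term itself plus every
    term that co-occurs with it in at least `theta` documents.
    (Assumed definition; not taken from the dissertation.)"""
    cooc = defaultdict(int)
    vocab = set()
    for doc in docs:
        terms = set(doc)
        vocab |= terms
        # Count each unordered pair once per document.
        for a, b in combinations(sorted(terms), 2):
            cooc[(a, b)] += 1
    classes = {t: {t} for t in vocab}
    for (a, b), n in cooc.items():
        if n >= theta:
            classes[a].add(b)
            classes[b].add(a)
    return classes


def upper_approximation(terms, classes):
    """Return every term whose tolerance class overlaps `terms`."""
    terms = set(terms)
    return {t for t, cls in classes.items() if cls & terms}


# Hypothetical toy corpus: each document is its list of topical words.
docs = [
    ["earthquake", "rescue", "sichuan"],
    ["earthquake", "rescue", "aid"],
    ["earthquake", "sichuan", "aid"],
    ["olympics", "beijing", "medal"],
    ["olympics", "medal", "record"],
]
classes = tolerance_classes(docs, theta=2)
extended = upper_approximation(["rescue"], classes)
print(sorted(extended))  # ['earthquake', 'rescue']
```

Raising `theta` makes the tolerance classes smaller and the extension more conservative; lowering it pulls in more loosely related terms, which enlarges the feature space again.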
Keywords/Search Tags: NLP, Topic Detection, Topic Model, Topic Clustering, Web Crawler