Research and Design of a Topic Crawler for Personal Information

Posted on: 2013-02-24  Degree: Master  Type: Thesis
Country: China  Candidate: C Jiang  Full Text: PDF
GTID: 2248330371985483  Subject: Software engineering
Abstract/Summary:
With the advent of the Internet age, people increasingly rely on Internet technology to access information, and the Internet has become a convenient, fast, and effective source of information for the public. At the same time, as the Internet expands, the amount of information on it grows exponentially. People are therefore increasingly concerned with how to obtain useful information from the Internet quickly and easily, so that it better serves their work and life. The rapid development of vertical search technology compensates for the shortcomings of general-purpose search, whose results are "big but not complete, complete but not precise". An effective technique for acquiring Web information in a specific domain is the basis for each industry's use of Internet information resources.

At present, personal information is the basis on which forums, portals, community question answering (CQA) systems, and social networking services (SNS) push information, so crawling personal information well is the foundation for further related applications. New applications for Chinese people search continue to appear, such as Yahoo People Search, Renlifang by Microsoft, and 6dtop by Douban. Because personal information covers a wide range of topics, methods based on keywords and link analysis not only struggle to describe personal information topics comprehensively but are also inefficient at analyzing topic information. According to some tests, classifier-based prediction is a better method for identifying topic information.

The main purpose of this work is to research and design a topic crawler for personal information that identifies Web resources containing personal information among vast amounts of information, retrieves that information, and passes it to a downstream processing system for extracting person information, entity names, and personal relationships.
In this paper, web crawler technology is studied in depth and combined with a topic model algorithm and a text classification algorithm to build a topic crawler system that obtains its dataset automatically.

This paper focuses on the design and implementation of the KNN classifier and on the LDA topic model algorithm: how the LDA model is learned and how the dataset is generated automatically, in preparation for the design and implementation of the LDA topic model. The data generated by the LDA topic model serves as the dataset that the KNN classifier uses for classification, which effectively improves the KNN classifier's ability to identify personal information. Drawing on knowledge of Web crawlers, the crawling process was studied in preparation for the design of the topic crawler. Finally, the KNN classifier, the topic model, and the web crawler program were integrated to form the topic crawler system, which was then further debugged, based on how it crawled personal information web pages, to improve precision and recall.

The topic crawler for personal information combines web crawler technology, the LDA probabilistic generative model algorithm, and the KNN text classification algorithm into a complete program. The work focuses on identifying and acquiring personal information, and describes in detail the system's modules, the development process, and the related experiments. In the experiments, with the home page of Jilin University as the entry URL, the system reached 94.25% accuracy and about 92.13% recall when crawling pages on the topic of personal information. Overall the results are good, but there is still room for improvement.
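
The classification and evaluation steps described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the three-dimensional topic vectors stand in for hypothetical LDA topic proportions, and the cosine similarity measure, the value k=3, the label names, and all function names are assumptions made for illustration only.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def knn_predict(train, query, k=3):
    """Return the majority label among the k training vectors most similar to query.

    train is a list of (vector, label) pairs.
    """
    ranked = sorted(train, key=lambda item: cosine(item[0], query), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def precision_recall(predicted, actual, positive="personal"):
    """Precision and recall for the positive class over paired label lists."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == positive and a == positive)
    fp = sum(1 for p, a in pairs if p == positive and a != positive)
    fn = sum(1 for p, a in pairs if p != positive and a == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical training set: made-up topic proportions with made-up labels.
train = [
    ([0.8, 0.1, 0.1], "personal"),
    ([0.7, 0.2, 0.1], "personal"),
    ([0.6, 0.3, 0.1], "personal"),
    ([0.1, 0.8, 0.1], "other"),
    ([0.1, 0.1, 0.8], "other"),
    ([0.2, 0.6, 0.2], "other"),
]

if __name__ == "__main__":
    print(knn_predict(train, [0.75, 0.15, 0.10]))
    print(knn_predict(train, [0.10, 0.70, 0.20]))
```

In a crawler of the kind described, each fetched page would be mapped to a topic vector and kept only if the classifier labels it as personal information. Precision and recall would then be computed against manually labeled pages.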
Keywords/Search Tags: Topic Crawler, Personal Information, KNN, LDA Topic Model