Research On Topic Web Crawler For Web Text Mining

Posted on: 2018-03-22
Degree: Master
Type: Thesis
Country: China
Candidate: C Chen
Full Text: PDF
GTID: 2348330512483275
Subject: Communication and Information System
Abstract/Summary:
With the development of Web 3.0, the number of Web pages on the Internet has grown explosively, and the information those pages contain has grown geometrically with it. A page's information is usually carried by its text, so Web text data holds rich and valuable knowledge and patterns for users. However, Web text data is semi-structured, real-time and discrete, which makes it difficult for users to extract the knowledge they need from such a complex data set. How to obtain the information and knowledge users care about from massive Web data, and how to present it in a form they can understand, has therefore become a popular research topic. This thesis starts from the acquisition and analysis of Web text information. It studies how to obtain the Web text that users need accurately and efficiently, and how to mine valuable knowledge from that text. The specific work of this thesis is as follows:

Topic web crawler: First, the thesis analyzes the principle and structure of the topic web crawler and introduces the classification of topic web crawlers. Second, it chooses the functional topic web crawler as its focus. Finally, it compares the languages available for implementing web crawlers and selects Node.js, a relatively new platform, to implement a topic web crawler for topic network communities.

Web text representation model: First, the thesis gives a comprehensive analysis of text representation models. Then, since the Web text data in question consists mostly of short texts, it combines keyword extraction and word vector representation techniques from natural language processing and proposes a Web text representation model based on keyword vectors.

Web text clustering algorithm: First, the thesis introduces the definition of Web text mining. Second, it describes clustering in Web text mining in detail. Based on an analysis of the classes of Web text clustering algorithms, it chooses BIRCH as the base Web text clustering algorithm. It then analyzes the shortcomings of BIRCH and proposes a new Web text clustering algorithm.

Based on the research above, the thesis combines Web text mining with the results of the topic web crawler work to design and implement an information acquisition and analysis system for topic network communities.
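The abstract names BIRCH as the base clustering algorithm but does not describe the modified algorithm the thesis proposes. As background only, the sketch below illustrates the clustering feature (N, LS, SS) that standard BIRCH uses to summarize a group of vectors, and the additivity property that lets two summaries be merged without revisiting the original points. The class and method names are hypothetical and are not taken from the thesis; the input vectors stand in for whatever text vectors a representation model would produce.

```python
from dataclasses import dataclass, field
from math import sqrt
from typing import List


@dataclass
class ClusteringFeature:
    """Standard BIRCH summary of a set of vectors: (N, LS, SS).

    Hypothetical illustration; not the thesis's modified algorithm.
    """
    dim: int
    n: int = 0                                      # N: number of points
    ls: List[float] = field(default_factory=list)   # LS: per-dimension linear sum
    ss: float = 0.0                                 # SS: sum of squared norms

    def __post_init__(self) -> None:
        # Start from a zero vector when no linear sum was supplied.
        if not self.ls:
            self.ls = [0.0] * self.dim

    def add(self, point: List[float]) -> None:
        """Absorb one vector into the summary."""
        self.n += 1
        self.ls = [a + b for a, b in zip(self.ls, point)]
        self.ss += sum(x * x for x in point)

    def merge(self, other: "ClusteringFeature") -> "ClusteringFeature":
        """Additivity: the CF of a union is the component-wise sum."""
        return ClusteringFeature(
            self.dim,
            self.n + other.n,
            [a + b for a, b in zip(self.ls, other.ls)],
            self.ss + other.ss,
        )

    def centroid(self) -> List[float]:
        return [x / self.n for x in self.ls]

    def radius(self) -> float:
        """Root-mean-square distance of the points from the centroid."""
        centroid_sq = sum(x * x for x in self.centroid())
        return sqrt(max(self.ss / self.n - centroid_sq, 0.0))


# Usage: summarize two small groups of 2-D vectors, then merge them.
cf_a = ClusteringFeature(dim=2)
cf_a.add([0.0, 0.0])
cf_a.add([2.0, 0.0])

cf_b = ClusteringFeature(dim=2)
cf_b.add([1.0, 3.0])

cf_all = cf_a.merge(cf_b)
```

Because only N, LS and SS are stored, the centroid and radius of a cluster can be computed, and clusters merged, in time that depends on the vector dimension rather than on the number of texts summarized.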
Keywords/Search Tags:Web Mining, Topic Web Crawler, Node.js, Web Text Representation Model, Balanced Iterative Reducing and Clustering Using Hierarchies Algorithm