Research And Application Of Web Text Mining Based On Crawler

Posted on: 2017-11-19    Degree: Master    Type: Thesis
Country: China    Candidate: Q X Chen
GTID: 2428330590968145    Subject: Control Engineering
Abstract/Summary:
With the rapid development of information and Web technology, human society is producing Web data at an unprecedented rate. These data are characterized by large volume, velocity, variability, complexity and value. However, most of them exist on the Internet as text or in other unstructured, heterogeneous forms, which makes them difficult to obtain and analyze. Because these data contain an enormous amount of information and knowledge, research on how to mine Web data with the aid of computer science and technology is of great significance for fully exploiting their value. Web text mining based on crawlers refers to crawling text data from specific websites by writing web crawlers, and then using pattern recognition, statistical learning and other techniques to extract implicit, deep and valuable information. The main contributions can be summarized in the following four parts:

(1) We presented a method for obtaining Web data through crawlers. Because Web text data are embedded in HTML pages, they are difficult to collect manually, so we wrote crawlers to download them automatically. We introduced the basic principles of web crawlers and described in detail how to parse HTML page content.

(2) We proposed a complete solution for Web text mining that combines an open source crawler framework with traditional text mining techniques. We presented the general process of Web text mining, which includes Web text crawling, text cleaning, text preprocessing, analysis and result visualization. We also introduced text classification and clustering, text sentiment analysis and other commonly used text mining algorithms.

(3) Short text is challenging to classify because it is highly sparse and carries little semantic information. We proposed an improved short text classification method based on the Latent Dirichlet Allocation topic model and the K-Nearest Neighbor algorithm. The method combines the latent topics with the information of their discriminative terms. Extensive comparative experiments show the effectiveness of the proposed method.

(4) We applied the proposed method to analyze the hot commodities of the Haitao market in e-commerce. First, we programmed a Scrapy crawler to collect posts and comments from the related e-commerce websites. We then analyzed the hot commodities using statistics and related text mining algorithms, to help sellers improve, adjust and develop appropriate marketing strategies.
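
The HTML parsing step mentioned in contribution (1) can be illustrated with a minimal sketch. It is not the thesis's implementation: it uses only Python's standard `html.parser` module rather than Scrapy, and the names `TextExtractor`, `extract_text` and `SAMPLE_PAGE` are hypothetical, introduced here for illustration. It shows one simple way to pull visible text out of a page while skipping script and style content.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content.

    Hypothetical helper for illustration; not taken from the thesis.
    """

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped element
        self.chunks = []       # non-empty text fragments, in page order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the visible text of an HTML string as one space-joined line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Made-up page standing in for a downloaded post with two comments.
SAMPLE_PAGE = """
<html><head><title>Demo</title><style>p {color: red}</style></head>
<body><h1>Post title</h1><p>First comment.</p>
<script>var x = 1;</script><p>Second comment.</p></body></html>
"""

if __name__ == "__main__":
    print(extract_text(SAMPLE_PAGE))
```

In a real crawler the extracted text would then pass to the cleaning and preprocessing stages described in contribution (2); a framework such as Scrapy would typically replace the hand-written parser with selector-based extraction.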
Keywords/Search Tags:Web Crawler, Text Mining, Natural Language Processing, Statistical Learning