Research And Application Of Web Text Mining Based On Crawler

Posted on:2017-11-19

Degree:Master

Type:Thesis

Country:China

Candidate:Q X Chen

Full Text:PDF

GTID:2428330590968145

Subject:Control Engineering

Abstract/Summary:

With the rapid development of information and Web technology, human society is producing Web data at an unprecedented rate. These data are characterized by large volume, velocity, variability, complexity, and value. However, most of them exist on the Internet as text or in other unstructured, heterogeneous forms, which makes them difficult to obtain and analyze. Because these data contain an enormous amount of information and knowledge, studying how to mine Web data with the aid of computer science and technology is of great significance for fully exploiting their value. Web text mining based on crawlers refers to crawling text data from specific websites with purpose-written web crawlers, then using pattern recognition, statistical learning, and other techniques to uncover the implicit, deep, and valuable information in that text. The main contributions can be summarized in the following four parts:

(1) We present a method for obtaining Web data through crawlers. Because Web text data are embedded in HTML pages, they are difficult to collect manually, so we write crawlers to download them automatically. We introduce the basic principles of web crawlers and describe in detail how to parse HTML page content.

(2) We propose a complete solution for Web text mining that combines an open-source crawler framework with traditional text mining techniques. We present the general process of Web text mining, which includes Web text crawling, text cleaning, text preprocessing, analysis, and result visualization. We also introduce commonly used text mining algorithms, including text classification, text clustering, and text sentiment analysis.

(3) Short text is challenging to classify because of its high sparseness and limited semantic information. We propose an improved short text classification method based on the Latent Dirichlet Allocation (LDA) topic model and the K-Nearest Neighbor (KNN) algorithm. The method combines the latent topics with information about their discriminative terms. Extensive comparative experiments show the effectiveness of the proposed method.

(4) We apply the proposed method to analyze hot commodities in the Haitao e-commerce market. First, we write a Scrapy crawler to collect posts and comments from the relevant e-commerce websites. We then analyze the hot commodities using statistics and related text mining algorithms to help sellers improve, adjust, and develop appropriate marketing strategies.
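As an illustration of the HTML parsing step mentioned in contribution (1), the following is a minimal sketch of extracting visible text from a page using only the Python standard library. It is not the thesis's implementation, which is built on Scrapy. The class name `TextExtractor`, the helper `extract_text`, and the sample page are all hypothetical and exist only for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text from an HTML page, skipping script/style blocks.

    Hypothetical helper for illustration; not taken from the thesis.
    """

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped tag
        self.chunks = []      # text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-empty text that is outside script/style blocks.
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the visible text of an HTML string as one space-joined line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Made-up sample page standing in for a downloaded post.
sample_page = """
<html><head><title>Post</title><style>p {color: red}</style></head>
<body><h1>Hot item</h1><p>Great <b>price</b> today.</p>
<script>var x = 1;</script></body></html>
"""

if __name__ == "__main__":
    print(extract_text(sample_page))
```

In a real crawler the input would be the body of a downloaded page rather than a literal string, and the extracted text would then pass to the cleaning and preprocessing stages listed in contribution (2).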

Keywords/Search Tags:

Web Crawler, Text Mining, Natural Language Processing, Statistical Learning

Related items

1	Combining Natural Language Processing and Statistical Text Mining: A Study of Specialized Versus Common Languages
2	Intelligent Device Text Classification Method Based On Natural Language Processing
3	Research On Text Classification Based On Natural Language Processing And Machine Learning
4	Research And Implementation Of Duplicate Checking System Under Internet Environment
5	Research And Application Of Text Classification Based On Natural Language Processing
6	The Application Of Natural Language Processing In Mining The Characteristics Of Concept Convey
7	Text Sentiment Analysis Based On Statistical Knowledge
8	Text Classification Based On Natural Language Processing, Analysis And Research
9	Text Similarity Analysis Technology Based On Deep Learning And Its Application In Auxiliary Decision-making Of HIA
10	Research On Machine Learning For Natural Language Processing And Transmission