
The R-based Internet Text Materials Analysis

Posted on: 2018-08-04
Degree: Master
Type: Thesis
Country: China
Candidate: Y Deng
Full Text: PDF
GTID: 2428330569485103
Subject: Applied Statistics
Abstract/Summary:
The advent of the Big Data Age and the rapid development of the Internet have changed the way we share, publish and collect information. According to the 2010 "Digital Universe" report released by the Internet Data Center in the United States, the global digital universe will expand to 40 ZB by 2020, by which time the global data volume will be 44 times that of 2015, a growth rate that exceeds Moore's Law. As a result, the former situation in which observation data was scarce and difficult to obtain has been reversed. The change in how information is stored and distributed poses both technical and theoretical challenges for locating and screening data rapidly, extracting and merging it accurately, and analyzing and interpreting it properly. The Internet consists largely of unclassified text data, and the sheer volume of information makes manual processing infeasible. The automatic collection of web text has therefore become a key technology in data mining and information retrieval in recent years, attracting wide attention and rapid research and development. Text classification, in turn, is a core technology of information retrieval and natural language processing. This paper studies a text classification system based on machine learning, covering text preprocessing, text representation, feature reduction and classification.

On the one hand, the R language makes automatic data collection convenient. Using several major techniques, including web text capture, regular expressions and basic string operations, text data was collected from 4,098 web pages in five categories. On the other hand, machine learning algorithms and statistical models are used to carry out the text classification research. This paper proposes a text categorization algorithm based on the LDA model, applies it to the corpus of 4,098 web pages, and compares it with traditional text classification algorithms such as SVM. A comparative analysis is given on the basis of the experimental results.

The conclusions are as follows. Firstly, R, as an open-source computing language, is clearly convenient for collecting and acquiring web text data, and can search for and obtain specific web text resources quickly and efficiently. Secondly, with its large collection of packages, R is also well suited to analyzing and interpreting web text properly and efficiently. Finally, the proposed text classification algorithm based on the LDA model improves the experimental performance.
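
As an illustration of the collection step summarized above (regular expressions combined with basic string operations), a minimal base-R sketch follows. It is not taken from the thesis: the HTML snippet, the function name extract_paragraphs and the choice to extract <p> elements are assumptions made for this example, and a real run would first download each page (for instance with readLines on a URL connection).

```r
# Hypothetical HTML lines standing in for one downloaded web page.
html <- c(
  "<html><head><title>Sports news</title></head>",
  "<body><p>The home team won  the match.</p>",
  "<p>Fans celebrated &amp; sang.</p></body></html>"
)

# Illustrative helper (name is an assumption, not from the thesis):
# returns the cleaned text of every <p> element found in the page.
extract_paragraphs <- function(lines) {
  # Join the lines so a paragraph can span line breaks.
  page <- paste(lines, collapse = " ")
  # Regular expression: match each <p>...</p> element, non-greedily.
  hits <- regmatches(page, gregexpr("<p[^>]*>.*?</p>", page, perl = TRUE))[[1]]
  # Basic string operations: strip tags, decode one common entity,
  # collapse runs of whitespace, and trim the ends.
  text <- gsub("<[^>]+>", "", hits)
  text <- gsub("&amp;", "&", text, fixed = TRUE)
  text <- gsub("\\s+", " ", text)
  trimws(text)
}

paragraphs <- extract_paragraphs(html)
print(paragraphs)
```

The sketch handles only simple, well-formed markup. For real pages a dedicated HTML parser is usually more robust than regular expressions alone.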
Keywords/Search Tags: Web text, Text classification, Topic model, Feature representation