The R-based Internet Text Materials Analysis

Posted on: 2018-08-04

Degree: Master

Type: Thesis

Country: China

Candidate: Y Deng

GTID: 2428330569485103

Subject: Applied Statistics

Abstract/Summary:

The advent of the Big Data era and the rapid development of the Internet have changed the way we share, publish, and collect information. According to the "2010 Digital Universe" report released by the Internet Data Center in the United States, the global digital universe will expand to 40 ZB by 2020. Global data volume will then be 44 times that of 2015, a growth rate that exceeds Moore's Law. As a result, the old situation in which observational data were scarce and difficult to obtain has been reversed. At the same time, changes in how information is stored and distributed pose both technical and theoretical challenges for quickly locating and screening data, accurately extracting and merging it, and properly analyzing and interpreting it. The Internet consists largely of unclassified text data, and at this scale manual processing is no longer feasible. Automatic collection of web text has become a key technology in data mining and information retrieval in recent years, attracting wide attention and rapid research and development. Text classification, in turn, is a core technology of information retrieval and natural language processing.

This thesis builds on machine learning and a text classification system comprising text representation, text preprocessing, feature reduction, and classification. On the one hand, the R language makes automatic data collection convenient. Using web text capture, regular expressions, and basic string operations, text data were collected from 4,098 web pages in five categories. On the other hand, machine learning algorithms and statistical models support the work of text classification. The thesis proposes a text categorization algorithm based on the LDA model, applies it to the 4,098-page corpus, and compares it with traditional text classification algorithms such as SVM. A comparative analysis is given based on the experimental results.

First, as an open-source computing language, R is clearly convenient for collecting and acquiring web text data: it can search for and obtain specific web text resources quickly and efficiently. Second, with its large and varied collection of packages, R is also well suited to analyzing and interpreting web text properly and efficiently. Finally, the proposed LDA-based text classification algorithm improves experimental performance.
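
The collection step mentioned in the abstract (web text capture, regular expressions, and basic string operations) might look roughly like the following sketch, which uses base R only. It is an illustration under stated assumptions, not the thesis's own code: the HTML snippet and the function names extract_paragraphs and tokenize are hypothetical, and a real page would be downloaded rather than written inline.

```r
# Hypothetical HTML standing in for a downloaded page (assumption: a real
# run would read the page text from a URL instead of an inline string).
html <- paste0(
  "<html><head><title>Sample news page</title></head><body>",
  "<p>Stocks rose on Monday.</p>",
  "<p>Analysts expect <b>further</b> gains.</p>",
  "</body></html>"
)

# Extract the text of every <p>...</p> element with a regular expression.
# The lazy pattern works here because the page is a single line of text.
extract_paragraphs <- function(page) {
  matches <- regmatches(page, gregexpr("<p>.*?</p>", page, perl = TRUE))[[1]]
  text <- gsub("<[^>]+>", "", matches)  # strip all tags, including nested ones
  trimws(text)
}

# Split text into lowercase word tokens with basic string operations.
tokenize <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z]+"))
  words[nzchar(words)]  # drop empty strings left by the split
}

paragraphs <- extract_paragraphs(html)
term_counts <- table(tokenize(paragraphs))

print(paragraphs)
print(term_counts)
```

The resulting term counts are a simple bag-of-words representation, the kind of input that a topic model such as LDA or a classifier such as SVM could then be fitted on; the abstract does not say which packages or features were used for that step.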

Keywords/Search Tags:

Web text, Text classification, Topic model, Feature representation

Related items

1	Classification Of News Short Text Based On Deep Learning
2	Text Representation And Algorithms For Chinese Text Classification
3	Research On Improvement Of Chi-square Feature Selection And Word Vector Text Representation For News Classification
4	Algorithm Research On Text Classification And Named Entity Recognition Based On Deep Text Feature Representation
5	Research On Text Representation Model And Deep Learning Algorithm In Text Classification
6	Research Of Text Classification Based On Word2vec And Self-attention
7	A Research On Feature Extraction Applied For Text Classification
8	Research And Implementation Of Text Representation In Continuous Space
9	Research On Multi-label Text Classification Based On Improved Seq2seq Model
10	Research On Text Representation And Text Classification Method Based On Adversarial Training