
Research On Text Feature Extraction Method Based On Spark

Posted on: 2019-01-19
Degree: Master
Type: Thesis
Country: China
Candidate: G H Xu
Full Text: PDF
GTID: 2438330548472597
Subject: Computer Science and Technology
Abstract/Summary:
Feature extraction, as an important step in text processing, is a research hotspot in text mining, information retrieval, natural language processing (NLP), text sentiment analysis, and network public-opinion monitoring. Its basic task is to select effective features from the high-dimensional feature space of a text, filter out redundant information, and form a smaller feature subset, thereby reducing the dimensionality of the feature space. However, with the explosive growth of text data and the continual refinement of machine learning techniques, the requirements placed on text processing keep rising. The feature subsets selected by traditional feature extraction algorithms still suffer from redundancy, noise, and high dimensionality, which degrade the efficiency and accuracy of text classification and cannot meet the processing demands of massive text collections. At the same time, big data processing technology has gradually matured and can effectively meet the storage and processing needs of big data. Using a parallel distributed framework to solve text processing problems has therefore become an effective approach.

Focusing on the logical relations and syntactic analysis of the Chinese language, this thesis studies the basic theory of text feature extraction, optimizes feature subsets to improve the accuracy of text classification, and combines this work with the Spark computing framework. The main contributions are as follows:

Firstly, we introduce the main text feature extraction algorithms, the key technologies of text classification, and the basic principles of the Spark parallel computing framework. The relevant concepts and basic algorithms of text feature extraction are presented first; we then compare the Filter and Wrapper families of feature extraction methods. Drawing on improvements and innovations in commonly used feature extraction algorithms at home and abroad, a detailed summary of the commonly used methods is given. We then introduce the ecosystem of the Spark parallel computing framework, its runtime architecture, and its core abstraction, the RDD.

Secondly, the feature subsets chosen by traditional feature extraction algorithms are often highly redundant and high-dimensional. Addressing these shortcomings, and building on the logical relations of the Chinese language and an improvement to information gain, an efficient text feature extraction algorithm is proposed. The method integrates text semantic-relation rules into word-frequency weighting and combines the result with information gain to obtain an optimal feature subset. To accommodate large-scale text data, the Spark computing framework is further employed to test text classification accuracy and verify the effectiveness of the algorithm.

Building on feature extraction algorithms and the theory of parallel distributed computing frameworks, this thesis focuses on the optimization and application of feature subsets through a feature extraction algorithm combined with semantic rules. Experimental results show that the method effectively reduces the dimensionality of feature subsets and improves the accuracy of text classification, offering reference value for big data text processing.
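The information gain criterion mentioned above scores a term by how much knowing its presence or absence reduces uncertainty about a document's class, IG(t) = H(C) − H(C|t). The thesis does not include code, but the standard computation can be sketched in plain Python as follows (the documents, labels, and function names are illustrative, not the author's):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(C) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(term, docs, labels):
    """IG(t) = H(C) - H(C | t), where the condition splits the corpus
    into documents containing the term and documents that do not.
    docs: list of token sets; labels: one class label per document."""
    present = [lab for doc, lab in zip(docs, labels) if term in doc]
    absent = [lab for doc, lab in zip(docs, labels) if term not in doc]
    n = len(labels)
    conditional = sum(len(part) / n * entropy(part)
                      for part in (present, absent) if part)
    return entropy(labels) - conditional

# Toy corpus: "spark" perfectly separates the two classes, so IG = 1 bit.
docs = [{"spark", "rdd"}, {"spark", "ml"}, {"cat", "dog"}, {"dog"}]
labels = ["tech", "tech", "pet", "pet"]
print(information_gain("spark", docs, labels))  # → 1.0
```

Ranking all terms by this score and keeping the top k is the classical information gain feature selection that the proposed algorithm improves upon.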
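The proposed method integrates semantic-relation rules into the word-frequency weighting and runs the computation on Spark. The thesis text gives neither the rules nor the code, but the flatMap-then-reduceByKey pattern such a pipeline implies can be sketched on a single machine in plain Python (the `SEMANTIC_BOOST` table is a made-up stand-in for the semantic rules):

```python
from collections import defaultdict

# Hypothetical semantic-rule weights: terms matched by a semantic-relation
# rule (e.g. filling a salient syntactic role) are boosted; all others
# count with weight 1.0. The real rules are defined in the thesis body.
SEMANTIC_BOOST = {"spark": 2.0}

def weighted_term_counts(docs):
    """Single-machine sketch of a Spark flatMap -> reduceByKey pipeline:
    the map phase emits one (term, weight) pair per occurrence, and the
    reduce phase sums the weights per term."""
    pairs = ((term, SEMANTIC_BOOST.get(term, 1.0))   # map phase
             for doc in docs for term in doc)
    totals = defaultdict(float)
    for term, weight in pairs:                       # reduce phase
        totals[term] += weight
    return dict(totals)

docs = [["spark", "rdd"], ["spark", "ml"]]
print(weighted_term_counts(docs))  # → {'spark': 4.0, 'rdd': 1.0, 'ml': 1.0}
```

In actual Spark the same shape would be an RDD `flatMap` emitting the weighted pairs followed by `reduceByKey` with addition, which lets the weighting scale to the massive corpora the abstract targets.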
Keywords/Search Tags: feature extraction, semantic analysis, information gain, text classification, Spark framework