WEB Extraction And Analysis Based On SVM And LDA

Posted on: 2019-12-24 | Degree: Master | Type: Thesis
Country: China | Candidate: Y Dai | Full Text: PDF
GTID: 2428330572963625 | Subject: Computer technology
Abstract/Summary:
With the advent of big data, we have entered an era of massive information. The Internet has become an indispensable way to acquire information, and extracting content from the vast amount of web news has become a research hotspot. Automatic content extraction from web pages is also the basis of other applications, such as public opinion collection and analysis, news collection and personalized recommendation, and news hot-spot mining. The difficulty of automatic content extraction lies mainly in the complexity and dynamic nature of web page structure. Moreover, short texts on related topics, such as recommended-page links, are not necessarily part of the main text. The automatic content extraction algorithm proposed in this paper is therefore based on both the similarity of web structure and text semantics. Automatic content extraction from web pages is an important foundation of web data mining and natural language processing.

This paper proposes the concept of tag meaning and a representation method for DOM feature vectors, and on this basis proposes a web text extraction algorithm based on SVM and LDA. The method ensures that the extracted content is visually continuous and semantically coherent. The algorithm first obtains the feature representation of each DOM node and its neighbors, then compresses this representation by SVD and visualizes it with a DeepAutoEncoder. Experimental data show that the proposed method has high accuracy and good generalization ability. We also compare the results of various classifiers on the same data and feature representation, and the results show that SVM achieves higher accuracy. The paper also introduces a web text extraction algorithm based on SVM and the Gravity Radius model of DOM, and an automatic web page segmentation algorithm based on label semantics.

The main research content of this paper includes the following aspects:
(1) Various web text extraction algorithms based on rules and on machine learning are analyzed.
(2) Commonly used machine learning classification models and deep neural network models are introduced and analyzed.
(3) The basic principles of the label semantics model are introduced.
(4) The web text extraction algorithm based on SVM and LDA is introduced in detail, together with comparisons of the relevant experiments and possible improvements to the model, followed by the corresponding concrete implementation.
(5) The web text extraction algorithm based on SVM and the Gravity Radius model of DOM is introduced, with its principle and implementation.
(6) The automatic web page segmentation algorithm based on label semantics is introduced, with its principle and implementation...
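
As a rough illustration of the first step described in the abstract (building a feature representation of a DOM node from the node and its neighbors), the following is a minimal sketch using only the Python standard library. The specific features (text length, link density, tag count), the choice of parent and siblings as "neighbors", and all names (`Node`, `DomBuilder`, `node_features`) are assumptions made for illustration; the abstract does not specify the actual feature set.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not become the "current" node.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link",
             "area", "base", "col", "embed", "source", "track", "wbr"}


class Node:
    """Hypothetical minimal DOM node: tag name, parent, children, own text length."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text_len = 0  # characters of text directly inside this node


class DomBuilder(HTMLParser):
    """Builds a simple tree of Node objects from an HTML string."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text_len += len(data.strip())


def parse(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def subtree_stats(node):
    """Return (text length, link text length, tag count) for the subtree at node."""
    text, link, tags = node.text_len, 0, 1
    for child in node.children:
        t, l, n = subtree_stats(child)
        text, link, tags = text + t, link + l, tags + n
    if node.tag == "a":
        link = text  # everything inside an anchor counts as link text
    return text, link, tags


def own_features(node):
    text, link, tags = subtree_stats(node)
    link_density = link / text if text else 0.0
    return [float(text), link_density, float(tags)]


def node_features(node):
    """Concatenate the node's features with its parent's and the mean of its siblings'."""
    zero = [0.0, 0.0, 0.0]
    if node.parent is None:
        return own_features(node) + zero + zero
    siblings = [own_features(s) for s in node.parent.children if s is not node]
    if siblings:
        sibling_mean = [sum(col) / len(siblings) for col in zip(*siblings)]
    else:
        sibling_mean = zero
    return own_features(node) + own_features(node.parent) + sibling_mean
```

Under these assumptions, a paragraph node next to a link-heavy list receives a vector that records both its own low link density and its sibling's high one, which is the kind of structural context a classifier could exploit. The later steps named in the abstract (SVD compression, SVM classification, LDA topic modeling) are not shown here.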
Keywords/Search Tags: WEB content, topic model, Support Vector Machine, web page segmentation, label semantics, machine learning, deep learning