WEB Extraction And Analysis Based On SVM And LDA

Posted on:2019-12-24

Degree:Master

Type:Thesis

Country:China

Candidate:Y Dai

Full Text:PDF

GTID:2428330572963625

Subject:Computer technology

Abstract/Summary:

With the advent of big data, we have entered an era of massive information. The Internet has become an essential way to acquire information, and extracting content from the vast amount of web news has become a research hotspot. Automatic content extraction from web pages is also the basis of other applications, such as public opinion collection and analysis, news collection and personalized recommendation, and news hot spot mining, and it is an important foundation of web data mining and natural language processing. The difficulty of automatic web content extraction lies mainly in the complexity and dynamic nature of web page structure. Moreover, short texts on related topics, such as page recommendation links, are not necessarily part of the main content. The automatic content extraction algorithm proposed in this paper is therefore based on the similarity of web structure and text semantics.

This paper proposes the concept of tag meaning and a representation method for DOM feature vectors and, building on them, a web text extraction algorithm based on SVM and LDA. The method ensures that the extracted content is visually continuous and semantically coherent. The algorithm first obtains the feature representation of a DOM node and its neighbors, then compresses this representation by SVD and visualizes it with a DeepAutoEncoder. Experimental data show that the proposed method has high accuracy and good generalization ability. We also compare the results of various classifiers on the same data and feature representation, and the results show that SVM has higher accuracy. In addition, the paper introduces a web text extraction algorithm based on SVM and the Gravity Radius model of the DOM, as well as an automatic web page segmentation algorithm based on label semantics.

The main research content of this paper includes the following aspects:
(1) Analyzes various web text extraction algorithms based on rules and on machine learning.
(2) Introduces and analyzes commonly used machine learning classification models and deep neural network models.
(3) Introduces the basic principles of the label semantics model.
(4) Introduces in detail the web text extraction algorithm based on SVM and LDA, presents a comparison of the relevant experiments, discusses possible improvements to the model, and finally gives the corresponding concrete implementation.
(5) Introduces the web text extraction algorithm based on SVM and the Gravity Radius model of the DOM, and gives the corresponding principle and implementation.
(6) Introduces the automatic web page segmentation algorithm based on label semantics, and gives the corresponding principle and implementation...
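The abstract's first step, representing each DOM node by its own features together with those of its neighbors, can be illustrated with a minimal Python sketch. The abstract does not specify the thesis's actual features, so everything here is an assumption made for illustration: the three per-node features (text length, link density, depth), the use of document-order neighbors, and the names `BlockCollector` and `dom_feature_vectors`. In a full pipeline, vectors of this kind would then be compressed (for example by SVD) and passed to a classifier such as an SVM.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; skipped so the open-element stack stays balanced.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link", "area", "base",
             "col", "embed", "source", "track", "wbr"}


class BlockCollector(HTMLParser):
    """Hypothetical helper: records one entry per element in document order."""

    def __init__(self):
        super().__init__()
        self.nodes = []   # one dict per element
        self.stack = []   # indices into self.nodes for currently open elements

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.nodes.append({"tag": tag, "depth": len(self.stack),
                           "text_len": 0, "link_len": 0})
        self.stack.append(len(self.nodes) - 1)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closing tags.
        for pos in range(len(self.stack) - 1, -1, -1):
            if self.nodes[self.stack[pos]]["tag"] == tag:
                del self.stack[pos:]
                break

    def handle_data(self, data):
        n = len(data.strip())
        if not n or not self.stack:
            return
        node = self.nodes[self.stack[-1]]   # text belongs to the innermost element
        node["text_len"] += n
        if any(self.nodes[i]["tag"] == "a" for i in self.stack):
            node["link_len"] += n           # text that sits inside a hyperlink


def node_features(node):
    """Assumed per-node features: text length, link density, depth."""
    text = node["text_len"]
    link_density = node["link_len"] / text if text else 0.0
    return [float(text), link_density, float(node["depth"])]


def dom_feature_vectors(html, window=1):
    """Concatenate each node's features with those of `window` neighbors on each side."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    own = [node_features(n) for n in parser.nodes]
    pad = [0.0, 0.0, 0.0]                   # zero padding at the document edges
    vectors = []
    for i in range(len(own)):
        row = []
        for j in range(i - window, i + window + 1):
            row.extend(own[j] if 0 <= j < len(own) else pad)
        vectors.append(row)
    return [n["tag"] for n in parser.nodes], vectors
```

Calling `dom_feature_vectors(page_html)` returns the tag names and one fixed-length vector per element. A node inside a recommendation-link block would show a high link density in its own slot and in its neighbors' slots, which is the kind of structural signal a classifier could use to separate main text from navigation.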

Keywords/Search Tags:

WEB content, topic model, Support Vector Machine, web page segmentation, label semantics, machine learning, deep learning

Related items

1	The Research Of Public Opinion Analysis Technologies Based On Machine Learning Theory
2	Research On Some Problems Of Support Vector Machine Learning Algorithm
3	Research On Semi-Supervised Support Vector Machine Learning Methods
4	Research On Text Categorization Method Oriented To Content Security
5	The Study Of Classification Methods And Its Applications In Web Mining Based On Statistical Learning
6	Research And Implementation Of Incremental Learning Based On Support Vector Machine
7	Study On Application Of Machine Learning Based On Support Vector Machine
8	Study On Some Issues Of Kernel Machine Learning Method
9	Research On Key Technologies For Multi-instance Multi-label Web Page Categorization
10	Research On Multi-label Data Classification Technology