Research on Web Page Classification Algorithms Based on Text Semantic Graphs

Posted on: 2020-02-01
Degree: Master
Type: Thesis
Country: China
Candidate: W W Zhou
GTID: 2428330590951088
Subject: Computer Science and Technology
Abstract/Summary:
The rapid development of the Internet has brought explosive growth in the number of web pages. As carriers of information, web pages on many different themes are produced every moment, and the volume of data is huge and increasingly rich. How to obtain the required information from these massive, dynamic information resources has become an important research topic. Web page classification is a foundation of big data processing for web pages: it trains a classification model on the processed text of web pages and then uses the model to assign unknown web pages to categories.

Statistical web page classification methods, however, ignore word semantics and text structure, and the dimension of their feature vectors is too high, which increases the computational burden and reduces classification efficiency. To address these problems, this thesis proposes a web page classification algorithm based on graph theory. It describes the semantic information of web pages by building text semantic graphs and calculates feature weights on that basis to improve the efficiency of web page classification. The feature extraction algorithm is also improved to further reduce the dimensionality of the feature space and increase information density. The following work has been done in this thesis:

(1) The LP-TIF feature extraction algorithm is proposed. The term frequency (TF) in the TF-IDF algorithm does not adequately express information internal to the text, so this thesis proposes an improved measure of the importance of a word to a text that fuses part of speech, word position, and other features with term frequency. A bag-of-words model is then introduced, and synonyms are used to merge and standardize the feature space, further reducing its dimensionality.

(2) A method of building web text semantic graphs is proposed. The method takes both word similarity and text relevance into account. The bag-of-words features serve as the nodes of the semantic graph. First, word similarity is used to construct similarity edges. Then a new algorithm based on co-occurring words is proposed to measure word correlation and to build relevance edges, which completes the construction of the semantic graph.

(3) The WordRank weight calculation method is proposed. On the basis of the graph structure, the PageRank node ranking algorithm is introduced to calculate the weights of feature words. Because the text semantic graph is a weighted directed graph, the algorithm is adjusted according to node weights and semantic edge weights, yielding the WordRank weight calculation method.

Finally, the validity of the feature extraction algorithm and of the web page classification algorithm based on text semantic graphs is verified. The experimental results show that, compared with traditional TF-IDF, the LP-TIF feature extraction algorithm effectively reduces the dimensionality of the feature space and improves the time efficiency of the algorithm. The web page classification algorithm based on text semantic graphs also improves classification accuracy and the stability of the algorithm.
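The abstract does not give the WordRank formula, so the following is only a minimal sketch of the general idea it describes: a PageRank-style ranking over a weighted directed word graph, adjusted by node weights and edge weights. It uses the standard weighted, personalized PageRank update as a stand-in, not the thesis's actual method. The function name `weighted_pagerank`, its parameters, and the toy graph are all hypothetical.

```python
from collections import defaultdict


def weighted_pagerank(edges, node_prior=None, damping=0.85, iterations=50):
    """Rank nodes of a weighted directed graph (illustrative stand-in only).

    edges:      dict mapping (source, target) -> positive edge weight
    node_prior: optional dict mapping node -> positive node weight; it is
                normalized and used as the teleport distribution
    """
    nodes = sorted({node for edge in edges for node in edge})
    if node_prior is None:
        node_prior = {node: 1.0 for node in nodes}
    prior_total = sum(node_prior[node] for node in nodes)
    prior = {node: node_prior[node] / prior_total for node in nodes}

    # Total outgoing edge weight per node, used to normalize transitions.
    out_weight = defaultdict(float)
    for (source, _target), weight in edges.items():
        out_weight[source] += weight

    rank = dict(prior)
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread by the prior.
        dangling = sum(rank[node] for node in nodes if out_weight[node] == 0.0)
        new_rank = {
            node: (1.0 - damping) * prior[node] + damping * dangling * prior[node]
            for node in nodes
        }
        # Each node passes rank along its edges in proportion to edge weight.
        for (source, target), weight in edges.items():
            new_rank[target] += damping * rank[source] * weight / out_weight[source]
        rank = new_rank
    return rank


# Hypothetical toy graph: nodes are feature words, weights are edge strengths.
edges = {
    ("web", "page"): 2.0,
    ("page", "web"): 2.0,
    ("page", "classification"): 1.0,
    ("classification", "page"): 1.0,
}
ranks = weighted_pagerank(edges)
for word, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{word}: {score:.3f}")
```

In this toy graph the word with the strongest connections ends up with the highest score, which is the property a graph-based feature weight is meant to capture. Passing a `node_prior` built from per-word importance scores would bias the ranking toward words that already carry high node weight.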
Keywords/Search Tags: web page classification, text semantic graph, feature extraction, weight computation