
Research on Key Technologies of Character Analysis in Heterogeneous Web

Posted on: 2019-08-23
Degree: Master
Type: Thesis
Country: China
Candidate: Q Zhou
Full Text: PDF
GTID: 2428330599477706
Subject: Computer Science and Technology
Abstract/Summary:
As an information exchange platform, the Internet contains a large amount of information about people. How to extract character-related information effectively from massive numbers of heterogeneous webpages is an important research topic in natural language processing. To address three problems in the heterogeneous Web (disorganized person information, different people sharing the same name, and the difficulty of extracting character attributes), this thesis proposes three methods for character analysis.

First, it proposes a character-related text extraction algorithm based on vision-based webpage segmentation. The algorithm segments each webpage into visual blocks, combines the topic, text, and structural features of those blocks, and classifies the blocks with GBDT so that the text in character-related blocks can be extracted. Experiments show that this algorithm reaches an F1 value of 86%.

Second, it proposes a DPMM-based person name disambiguation algorithm. The algorithm takes the word-frequency statistics vector of each webpage as input, which avoids the effect of data sparsity on clustering, and it determines the number of categories automatically from the word-frequency features of the webpage text. The thesis also proposes increasing the weight of named entities to improve accuracy. In experiments on the page sets returned by the Baidu search engine, the average F1 value reached 84%.

Third, it proposes a character attribute extraction algorithm based on the attention mechanism and Bi-LSTM, which recasts character attribute extraction as an entity relation extraction problem. The algorithm uses Bi-LSTM to learn sentence-level semantic features. To distinguish the different relations that hold between different types of entity pairs, it extracts position and type features of the entity pairs and combines them with the attention mechanism to perform attribute extraction. Experiments show that both the recall and the F1 value of this algorithm reached 97%.

Finally, building on this work, an Internet-oriented character analysis prototype system was designed and implemented. For a queried name, the system collects the webpages returned by different search engines and performs person name disambiguation and character attribute extraction. System tests show that the system extracts character information from heterogeneous webpages with high accuracy and stability.
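The name disambiguation step can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes that DPMM denotes a Dirichlet process mixture model, and the function names, the `entity_weight` multiplier, the `alpha` concentration value, and the toy page are all hypothetical. The first function builds a word-frequency vector in which named entities count extra. The second computes the Chinese restaurant process prior, the part of a Dirichlet process mixture that always leaves some probability for opening a new cluster, which is why the number of categories does not have to be fixed in advance.

```python
from collections import Counter


def weighted_term_counts(tokens, named_entities, entity_weight=2.0):
    """Word-frequency vector in which named-entity tokens count extra.

    entity_weight is a hypothetical multiplier, not a value from the thesis.
    """
    counts = Counter(tokens)
    return {
        term: count * (entity_weight if term in named_entities else 1.0)
        for term, count in counts.items()
    }


def crp_prior(cluster_sizes, alpha=1.0):
    """Chinese restaurant process prior over cluster assignments.

    Returns one probability per existing cluster, followed by the
    probability of opening a new cluster. alpha is the concentration
    parameter; the default here is illustrative only.
    """
    total = sum(cluster_sizes) + alpha
    return [size / total for size in cluster_sizes] + [alpha / total]


# Made-up tokens from one search-result page for the queried name.
tokens = ["Li Wei", "professor", "Harbin", "professor", "research"]
entities = {"Li Wei", "Harbin"}

vector = weighted_term_counts(tokens, entities)

# Two existing clusters holding 3 pages and 1 page; the last entry is
# the prior probability that the next page starts a third cluster.
probs = crp_prior([3, 1], alpha=1.0)
```

In a full sampler these prior probabilities would be multiplied by the likelihood of the page's weighted vector under each cluster before a page is assigned; that step is omitted here.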
Keywords/Search Tags:character analysis, person name disambiguation, character attribute extraction, DPMM