Research On Extraction And Fusion Of Structured Character Attributes In Web

Posted on: 2022-02-26  Degree: Master  Type: Thesis
Country: China  Candidate: H W Ye  Full Text: PDF
GTID: 2518306524475864  Subject: Communication and Information System
Abstract/Summary:
In recent years, character knowledge bases and user portraits have been widely used in intelligent search, intelligent question answering, recommendation systems and other scenarios. Character attribute information is the core content in constructing both. With the popularization of the Internet and the growth of its scale, the rapidly increasing amount of information makes it more and more difficult to select and obtain character attribute data from the Internet, and how to obtain character attributes from the web efficiently and accurately has become a hot research topic in information mining. Structured character data is uniform in form and reliable in content, which makes it the best data source. Because web pages are heterogeneous and come from multiple sources, accurately and efficiently extracting structured character attributes from web pages, and then analyzing and integrating them, has become a necessary research topic. Based on an in-depth analysis of existing related research, this thesis proposes an unsupervised framework for extracting structured attributes from dynamic web pages and an unsupervised entity resolution algorithm based on a bipartite graph and random walk. Together they realize the two key steps of the character attribute construction task: attribute extraction and attribute fusion. The specific research content is as follows:

(1) Because it is difficult to extract structured information from heterogeneous web pages with a single general method, this thesis designs an unsupervised framework for extracting structured attributes from dynamic web pages. The framework is divided into three modules: web page processing, attribute name learning and attribute extraction. A structured region discovery algorithm is designed to locate the structured information in web pages. During learning, the confidence of each attribute is continually updated according to the attribute's semantic similarity, occurrence frequency and named entity recognition information. Several DOM tree models are then used to generate attribute pairs with credible attribute names. Experiments show that the framework is effective on both the SWDE dataset and the persona dataset. The framework is not restricted to particular attributes or vertical domains and requires no labeled data, so it extends well to new settings.

(2) To fuse character attributes more accurately and efficiently, this thesis proposes an unsupervised entity resolution algorithm based on a bipartite graph and random walk. The algorithm uses graph theory to perform entity resolution without labeled data, and is divided into two parts: bipartite graph iteration and random walk. First, a directed bipartite graph is constructed from the common terms and differing terms of the records, and the similarity of records is calculated iteratively. Then a record graph is constructed from these similarities, and the arrival probability between records is computed with a random walk, which addresses the problem of defining a matching threshold. The two parts are also run in alternation, and the impact factor of differing terms and the number of walk steps are adjusted in each iteration, so that the algorithm gradually finds more matching pairs. Experiments show that the algorithm outperforms the latest comparable methods on standard ER datasets and the persona dataset.
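The two parts of the entity resolution algorithm described in (2) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract gives no formulas, so the weighted-Jaccard update, the term-weight rule, the fixed step count and the names `iterate_similarity` and `arrival_probabilities` are all assumptions chosen for illustration, and the toy records are invented.

```python
from itertools import combinations


def iterate_similarity(records, rounds=5):
    """Assumed stand-in for the bipartite-graph iteration.

    records: dict mapping a record id to its set of terms.
    Record pairs and terms reinforce each other: a pair is similar if it
    shares highly weighted terms, and a term is weighted by how similar
    the pairs sharing it are.
    """
    terms = set().union(*records.values())
    weight = {t: 1.0 for t in terms}
    pairs = list(combinations(sorted(records), 2))
    sim = {}
    for _ in range(rounds):
        # Pair step: weighted Jaccard over common terms vs. all terms.
        for a, b in pairs:
            common = records[a] & records[b]
            total = sum(weight[t] for t in records[a] | records[b])
            sim[(a, b)] = sum(weight[t] for t in common) / total if total else 0.0
        # Term step: average similarity of the pairs that share the term.
        for t in terms:
            sharing = [sim[(a, b)] for a, b in pairs
                       if t in records[a] and t in records[b]]
            if sharing:
                weight[t] = sum(sharing) / len(sharing)
    return sim


def arrival_probabilities(sim, nodes, steps=3):
    """Assumed stand-in for the random-walk part.

    Builds a record graph weighted by similarity and returns the
    probability of reaching each record from each other record in
    exactly `steps` steps.
    """
    nodes = sorted(nodes)
    idx = {n: i for i, n in enumerate(nodes)}
    size = len(nodes)
    w = [[0.0] * size for _ in range(size)]
    for (a, b), s in sim.items():
        w[idx[a]][idx[b]] = w[idx[b]][idx[a]] = s
    # Row-normalise into a transition matrix; isolated records stay put.
    p = []
    for i, row in enumerate(w):
        tot = sum(row)
        p.append([x / tot for x in row] if tot
                 else [1.0 if j == i else 0.0 for j in range(size)])
    # Repeated matrix multiplication gives the multi-step probabilities.
    result = [row[:] for row in p]
    for _ in range(steps - 1):
        result = [[sum(result[i][k] * p[k][j] for k in range(size))
                   for j in range(size)] for i in range(size)]
    return {a: {b: result[idx[a]][idx[b]] for b in nodes} for a in nodes}


# Toy records (invented for illustration only).
records = {
    "r1": {"john", "smith", "1980", "london"},
    "r2": {"john", "smith", "1980", "uk"},
    "r3": {"mary", "jones", "1975", "paris"},
}
sim = iterate_similarity(records)
walk = arrival_probabilities(sim, records, steps=3)
```

In this sketch, records that the walk reaches with high probability would be treated as candidate matches, which avoids choosing a raw similarity cut-off by hand. The alternating adjustment of impact factors and step counts described in the abstract is not modelled here.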
Keywords/Search Tags: structured attribute extraction, entity resolution, graph theory, unsupervised method