
Mining, Inferring and Utilizing Latent Entity Associations in Textual Web Content

Posted on: 2020-07-11  Degree: Master  Type: Thesis
Country: China  Candidate: L Li  Full Text: PDF
GTID: 2428330575989335  Subject: Computer technology
Abstract/Summary:
Textual Web content (TWC) includes e-mails, Web news, and similar documents. Entity associations (EA) in TWC documents serve as the basis for tasks such as data acquisition, relationship strength estimation, and social network analysis. A latent entity association (LEA) is one in which two entities are associated only indirectly, through multiple intermediate entities that appear in different TWC documents. Discovering and utilizing LEA can improve the results of EA-based approaches, but doing so depends on solving problems from two perspectives.

From the academic perspective, the modeling of entity associations in TWC data and the ranking of entity associations by strength should be supported. LEA are uncertain knowledge and should be appropriately expressed and inferred. Not all LEA are valuable for subsequent analysis, so LEA should be ranked by their strength. From the practical perspective, the acquisition of TWC data and an interactive analysis system should be implemented. LEA obtained from the latest TWC documents on the internet are timelier. The interactive analysis system not only allows users to supply their own TWC data and choose the entities to be analyzed, but also provides a visual interface that presents the results of each step of the analytical process.

This dissertation mainly addresses the problems from the academic perspective, in three parts. (1) We propose the concept of the entity association Bayesian network (EABN), adopting the Bayesian network as the framework for modeling the uncertainty of LEA. An EABN uses entities as its variables. The directed acyclic graph obtained by structure selection expresses the dependences among entities, and the conditional probability tables obtained by parameter estimation quantify those dependences. (2) We propose SBIC to learn the structure of an EABN efficiently. During structure selection, a self-organizing map divides the TWC dataset into several subsets based on the sparsity of entities. We then repeatedly choose a subset to evaluate a directed edge in a candidate structure, so that the EABN structure can be obtained efficiently. (3) We use probabilistic inference over the EABN to rank LEA. In the list produced by this inference, most entity associations are LEA. The instance counts of the two entities in an entity association form a ratio, and the standard deviation of this ratio across randomly divided subsets increases along the EABN ranking. Probabilistic inference over the EABN can also discover entities that do not appear in new TWC documents but are associated with entities that do.

The problems from the practical perspective are also considered and solved. (1) A web crawler is implemented to obtain the latest TWC documents from the internet. The crawler obtains lists of links to historical web pages through custom search-engine queries, renders web pages dynamically in PhantomJS, and stores all data in a MongoDB database. (2) The online system implemented for interactive analysis supports user-defined data and entities, data visualization, and data persistence. It also supports cross-platform and cross-terminal use and instant updates.
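The idea that two entities can be associated through an intermediate entity, even when they never share a document, can be illustrated with a very small sketch. The code below is not the dissertation's EABN or SBIC implementation: it assumes a fixed three-node chain a -> c -> b instead of a learned structure, treats each entity as a binary variable (present or absent in a document), and estimates conditional probability tables with Laplace smoothing. The function names, the `alpha` parameter, and the toy documents are all hypothetical.

```python
def cpt(docs, parent, child, alpha=1.0):
    """Estimate P(child present | parent = v) for v in {0, 1}.

    docs is a list of sets of entity names; alpha is a Laplace
    smoothing constant (an assumption, not taken from the source).
    """
    table = {}
    for v in (0, 1):
        rows = [d for d in docs if (parent in d) == bool(v)]
        hits = sum(1 for d in rows if child in d)
        table[v] = (hits + alpha) / (len(rows) + 2 * alpha)
    return table


def chain_inference(docs, a, c, b):
    """P(b present | a present) in the assumed chain a -> c -> b.

    The intermediate entity c is marginalized out, which relies on
    b being independent of a once c is known.
    """
    p_c = cpt(docs, a, c)[1]   # P(c = 1 | a = 1)
    p_b = cpt(docs, c, b)      # P(b = 1 | c = v)
    return p_b[1] * p_c + p_b[0] * (1.0 - p_c)


def rank_targets(docs, a, c, targets):
    """Rank candidate entities by their inferred association with a."""
    scored = [(t, chain_inference(docs, a, c, t)) for t in targets]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy corpus: "alice" and "bob" never co-occur, but both occur with "carol".
docs = [
    {"alice", "carol"},
    {"alice", "carol"},
    {"carol", "bob"},
    {"carol", "bob"},
    {"dave"},
    {"dave", "erin"},
]

ranking = rank_targets(docs, "alice", "carol", ["bob", "dave"])
```

In this toy corpus, `ranking` places "bob" ahead of "dave", because "bob" shares documents with the intermediate entity "carol" while "dave" does not. A real EABN would learn which edges exist from the data and could involve several intermediate entities.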
Keywords/Search Tags:Entity association, Web crawler, Bayesian network, Self-organizing map, Probabilistic inference, Interactive system