Research On Key Technologies Of Character Analysis In Heterogeneous Web

Posted on:2019-08-23

Degree:Master

Type:Thesis

Country:China

Candidate:Q Zhou

GTID:2428330599477706

Subject:Computer Science and Technology

Abstract/Summary:

As an information exchange platform, the Internet contains a large amount of character (person) information. How to effectively extract character-related information from massive numbers of heterogeneous webpages is an important research topic in the field of Natural Language Processing. To address the problems of disorganized person information, shared person names, and the difficulty of extracting character attributes in the heterogeneous Web, this paper proposes three methods for character analysis of heterogeneous Web information. Firstly, this paper proposes a character-related text extraction algorithm based on vision-based webpage segmentation. The algorithm uses vision-based webpage segmentation to divide webpages into visual blocks, combines the topic, text, and structural features of the visual blocks, and adopts the GBDT (gradient boosting decision tree) algorithm to classify the blocks so as to extract the text in the character-related visual blocks. The experimental results show that the F1 value of this text extraction algorithm reaches 86%. Secondly, this paper proposes a person name disambiguation algorithm based on the DPMM (Dirichlet process mixture model). The algorithm takes the word-frequency statistics vector of the webpage as input to avoid the effect of data sparsity on the clustering algorithm, and it can automatically determine the number of categories according to the word-frequency statistics features of the webpage text. In addition, this paper proposes a strategy of increasing the weight of named entities to improve the accuracy of the algorithm. The experiment was performed on the page set returned by the Baidu search engine, and the average F1 value reached 84%. Thirdly, this paper proposes a character attribute extraction algorithm based on an attention mechanism and Bi-LSTM, in which character attribute extraction is transformed into an entity relationship extraction problem. The algorithm uses Bi-LSTM to learn the semantic features of sentences. To distinguish the different relationships between different types of entity pairs, the location and type features of the entity pairs are extracted, and character attribute extraction is realized by combining them with the attention mechanism. The experimental results show that the recall rate and F1 value of this algorithm both reach 97%. Finally, on the basis of the research above, we designed and implemented an Internet-oriented character analysis prototype system. The system can collect the webpage data returned by different search engines for a queried name and perform person name disambiguation and character attribute extraction. The system test results show that the system has high accuracy and stability in extracting character information from heterogeneous webpages.
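
As a rough illustration of the named-entity weighting strategy mentioned for the DPMM-based disambiguation step, the sketch below builds a word-frequency vector for one page and scales the counts of named-entity terms. The function names, the example tokens, and the weight of 2.0 are assumptions made for illustration; the abstract does not state how the thesis sets the weight or identifies entities, and the clustering step itself is not shown.

```python
from collections import Counter


def weighted_term_counts(tokens, named_entities, entity_weight=2.0):
    """Count term frequencies and scale the counts of named-entity terms.

    entity_weight is a hypothetical parameter, not a value from the thesis.
    """
    counts = Counter(tokens)
    return {
        term: count * (entity_weight if term in named_entities else 1.0)
        for term, count in counts.items()
    }


def to_vector(weighted_counts, vocabulary):
    """Project the weighted counts onto a fixed vocabulary order."""
    return [weighted_counts.get(term, 0.0) for term in vocabulary]


# Made-up tokens standing in for one segmented webpage.
page_tokens = ["zhou", "professor", "beijing", "zhou", "research"]
# Made-up set of terms that an entity recognizer is assumed to have tagged.
entities = {"zhou", "beijing"}

vocabulary = sorted(set(page_tokens))
vector = to_vector(weighted_term_counts(page_tokens, entities), vocabulary)
print(vocabulary)
print(vector)
```

Vectors built this way for every page returned for a query name could then be passed to a clustering model; raising the entity weight makes pages that share people, places, or organizations look more similar to one another.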

Keywords/Search Tags:

character analysis, person name disambiguation, character attribute extraction, DPMM

Related items

1	The Research On Personal Name Disambiguation And Character Relationship Extraction Merging Sentential Semantic Feature
2	Research On Social Network Character Attribute Extraction Method Based On Statistical Learning
3	The Subtle Characteristics Of The Radio Signal Analysis And Extraction
4	Research On Crucial Technologies Of Web Person Name Entity Disambiguation
5	A Study On Person Name Discrimination Algorithm Based On Two-Stage Clustering
6	Research On Cluster-based Person Name Disambiguation
7	Video Caption Recognition Research
8	Character Recognition Research And Application
9	Design And Implement Of A System Of Automatic Extraction Of Chinese Character Strokes
10	The Study On The Mechanism Of Humanoid Recognition Characteristic For Image Character