Research On Web Personal Information Integration

Posted on: 2013-01-14

Degree: Doctor

Type: Dissertation

Country: China

Candidate: L H Cao

Full Text: PDF

GTID: 1118330374980640

Subject: Computer software and theory

Abstract/Summary:

With the maturity of Internet applications and the in-depth development of Internet technology, the number of Web sites and Web pages has grown explosively. The Internet has become an important information infrastructure for politics, the economy, culture, and daily life in modern society. Information resources on the Internet are vast and varied, covering many areas of people's life and work, which makes the Internet a large potential data source for a variety of Web applications. Much of this information relates to individuals; we call it personal information. According to Web query statistics, approximately 11%-17% of Web queries include people's names, and about 4% consist only of people's names, which shows that obtaining personal information from the Internet has become one of the most common user behaviors. Search engines give users a way to find personal information: they return result pages containing links to Web pages that match the user's query, sorted according to ranking rules. However, search engines have many deficiencies as a means of finding personal information, which prompts researchers to explore personal information on the Internet from the perspective of information organization.

The Internet is a dynamic and heterogeneous environment in which the content, and even the existence, of sources of personal information changes constantly. A single Web page cannot fully describe a person, so users need personal information integrated across a variety of information sources. Reorganizing personal information on the Internet raises the following problems.

(1) Among Web pages relating to personal information, the same name on different pages may refer to different individuals. Before personal information can be integrated, pages that contain the same name but refer to different individuals must first be distinguished, in order to find the pages the user is querying for and to prepare for further information extraction and analysis.

(2) There is no unified standard for presenting personal information, and data from different sources differ in both form and content. This inconsistency makes the heterogeneous information inconvenient to use. To use the various data sources together effectively, a comprehensive data model of the person entity can be built from the different forms and content of the data in each source, and then used to guide person entity recognition, extraction, and integration from new data sources.

(3) Unstructured and semi-structured text can better portray a person's attributes, subjective as well as objective, such as living conditions and status. However, because natural language is inherently difficult to understand, such information must first be sorted out before it can be extracted effectively.

In essence, Web personal information integration provides a mechanism for reorganizing and understanding information on the Web. It is an important form of exploring and using Web information resources, and it can improve the efficiency of information sharing and use. This dissertation uses Web information integration technology to study the organization of heterogeneous, autonomous, distributed personal information on the Internet. It models the directly observable attributes of persons from different data sources and extracts information that is not directly observable. The main contributions are as follows.

(1) This dissertation uses a person-feature vector space model and hierarchical agglomerative clustering to solve the problem of person name disambiguation on Web pages. Person features are selected from multiple perspectives, such as named entities and hyperlinks. This goes beyond methods that use only named entities as features, and it also differs from methods that measure similarity from textual features. The method computes weights according to the number of feature vectors, which is more reasonable than traditional TF-IDF weighting. Experiments show that this way of selecting features and computing weights is more effective.

(2) This dissertation uses a support vector machine (SVM) to dynamically construct the global model of the person entity. Dynamic construction goes beyond the methods of constructing the global model used in the literature: it can merge new data models into the global model in a timely manner and adapt to dynamic data sources, which ensures the completeness of the global model. For the SVM, this dissertation proposes a way of constructing the training set directly from experimental data, which can serve as a reference for constructing SVM training sets.

(3) This dissertation uses a conditional random field (CRF) model to extract person entity activities. Because of the complexity of natural language processing, entity activity is a type of information rarely considered in traditional information extraction. The formal definition of person entity activities covers activities in which the person entity is the subject as well as activities in which it is the object, and thus captures the course of a person's life and work more comprehensively. When tagging with the CRF, in addition to the common part-of-speech features, the position of a word in the sentence and named entity features are also tried. Experiments show that adding both kinds of features improves the accuracy of entity activity extraction.

(4) Because person entities are highly complex, there are few studies on them, and even fewer in a Chinese-language setting; most experiments are carried out on English personal information Web pages. The few studies of Chinese person entities focus on news Web pages about persons, where the wording is standard, the format is simple, and the range of persons is limited, which reduces the complexity of the study. The experiments for the three problems in this dissertation use Chinese personal information Web pages, which have nonstandard wording, inconsistent formats, and a wide range of persons. Only in such an environment can the effectiveness of the problem-solving methods really be examined.
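The name disambiguation approach described in contribution (1) can be illustrated with a minimal Python sketch. It is not the dissertation's implementation. The per-type weights in `TYPE_WEIGHTS`, the average-link merge rule, the similarity threshold, and the sample pages are all assumptions made for illustration, and the sketch does not reproduce the dissertation's own weighting scheme. It only shows the general idea: build one feature vector per page from named entities and hyperlinks, then merge pages bottom-up until no two clusters are similar enough.

```python
from collections import Counter
from math import sqrt

# Hypothetical per-feature-type weights (illustrative, not from the dissertation).
TYPE_WEIGHTS = {"entity": 1.0, "link": 2.0}


def page_vector(entities, links):
    """Build a weighted feature vector for one Web page that mentions the name."""
    vec = Counter()
    for entity in entities:
        vec[("entity", entity)] += TYPE_WEIGHTS["entity"]
    for link in links:
        vec[("link", link)] += TYPE_WEIGHTS["link"]
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def average_link(c1, c2, vectors):
    """Average pairwise similarity between the pages of two clusters."""
    sims = [cosine(vectors[i], vectors[j]) for i in c1 for j in c2]
    return sum(sims) / len(sims)


def cluster_pages(vectors, threshold=0.3):
    """Hierarchical agglomerative clustering: start with one cluster per page
    and repeatedly merge the most similar pair until the best similarity
    falls below the (assumed) threshold."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best_sim, best_pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                sim = average_link(clusters[x], clusters[y], vectors)
                if sim > best_sim:
                    best_sim, best_pair = sim, (x, y)
        if best_sim < threshold:
            break
        x, y = best_pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Invented sample data: four pages that all mention the same person name.
pages = [
    (["Example University", "database"], ["univ.example/cs"]),
    (["Example University", "data integration"], ["univ.example/cs"]),
    (["football", "city club"], ["sports.example/team"]),
    (["football", "league"], ["sports.example/team"]),
]
vectors = [page_vector(entities, links) for entities, links in pages]
print(cluster_pages(vectors))  # [[0, 1], [2, 3]]
```

Each resulting cluster groups the pages assumed to describe one individual. A threshold-based stopping rule is used here because the number of distinct individuals sharing a name is usually not known in advance.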

Keywords/Search Tags:

Web Personal Information, Web Information Integration, Name Disambiguation, Person Entity Schema, Person Entity Activity

Related items

1	Research On Crucial Technologies Of Web Person Name Entity Disambiguation
2	Research On Mutual Enhancement Of Entity Resolution And Schema Matching In Web Information Integration
3	Research On Web Entity Activity And Entity Relationship Extraction
4	Research On Knowledge Mining In Person Tracking
5	Research On Person Entity Linking For Different Scenarios
6	Research On Relation Extraction Of Person Entity In News Webpage
7	Research On Cluster-based Person Name Disambiguation
8	Research On Named Entity Recognition And Disambiguation Based On Network Semantic Resource
9	Research On Joint Extraction Of Entity Relations By Fusing Entity Local Information
10	Research On Entity Recognition Of Person Names In Uyghur Text Corpus