Web Data Integration Systems (WIS) aim to integrate data from various Web sources efficiently and to provide high-quality data for applications such as market intelligence, business intelligence, and public opinion analysis. However, the world keeps changing, and the data that reflects it is likewise changing and inter-related. In many application scenarios, knowing how things evolve and how they relate to one another is a prerequisite for analysis and decision-making. WIS collect data mainly from large-volume, high-quality Deep Web sites and integrate it into structured data under a global schema. Given the velocity and volume of Web data, the data provided by WIS therefore has the following limitations. (1) Web data is varied: descriptions of the same thing differ across sites, and sometimes even within one site at different times, which leads to wrong alignments during collection and sometimes to the loss of valuable data. This affects data quality from the very beginning. (2) The world is dynamic; the same thing changes its status over time. Reconstructing the evolution process of things over time helps people obtain more comprehensive data. However, because Web data is varied and incomplete, it is challenging for WIS to determine the temporal order of different values of the same thing. (3) Things in WIS are inter-related, for example through COMPETE and COOPERATE relations between two companies, and these relations are valuable for follow-up analysis and decision-making. Nevertheless, because the data in WIS comes from a limited number of high-quality Deep Web sources, it is hard to obtain such relations from the structured data residing in the WIS. Aiming to enhance data quality in Web Data Integration Systems and to provide comprehensive information about entities in WIS, this thesis studies the problems of entity evolution and relation discovery.
The contributions are as follows.

(1) This thesis puts forward a method that combines a matching strategy with machine learning to dynamically discover synonyms of predefined attribute labels, as well as new attribute labels, for a specified type of Web entity. On the one hand, it meets the need of WIS to collect data from various sources efficiently and comprehensively; on the other hand, it solves the problem of matching Web entity schemas found in Web pages against the predefined ones in WIS. For each website, the proposed approach works in four steps. First, we extract content blocks from the original Web pages and use a clustering method to find object values and, where they exist, their corresponding description labels; the description labels are used to pre-annotate the values, yielding annotation sequence r1. Second, we use a conditional random field (CRF) to assign a label to every value of the entity from the predefined labels in WES, yielding annotation sequence r2. Third, we match each label pair in r1 and r2 by both semantic and content similarity; if the match score is higher than a threshold, the annotation is fixed, otherwise the pair is passed to the next step. Finally, for the pairs left undetermined by the previous step, we check the confidence value of the CRF annotation, which is in fact a probability. If it is higher than a threshold, the description label is taken to be a synonym of the CRF annotation from WES; otherwise, we match the values from both sides. If they match, the description label from the Web page is taken to be a synonym of the annotation from WES; if not, it is taken to be a new attribute label for the entity.

(2) To record the evolution process of entity attributes, this thesis proposes a method based on a Markov Logic Network (MLN) to determine the temporal order of entity attribute values.
In this method, we analyze Web sources and Web data, and use the currency of Web sources, the inter-dependency among Web sources, and the currency of attribute data within a given Web source as predicates in the MLN. We define five rules (new rules can be added) to infer the currency of the different values provided by different sources. On the one hand, this method treats currency at the level of individual entity attributes rather than entire entities, which is critical to improving the quality of data provided by WIS; on the other hand, it summarizes the characteristics of Web sources and Web data on the basis of careful analysis. To fit these characteristics into the MLN, the method designs MLN rules according to the characteristics of Web sources and the semantic constraints of the data, which ensures the correctness of the temporal orders of attribute values. It is also noteworthy that incorporating new rules into the MLN model is not complicated, which shows that the proposed method is extensible.

(3) This thesis proposes a two-stage clustering method to mine the semantic relationships of a target entity in WIS. The method uses a search engine to retrieve related Web documents from which rich semantic relations can be mined, and its two-stage clustering specifically targets multi-semantic relationships. Given a text corpus, the proposed approach first extracts all mentions of entities that co-occur with the given entity in the same sentence, together with the context surrounding each pair of entities. It then filters out entities that do not appear in the WIS. In the first clustering stage, it groups the contexts of each related entity into sets by edit distance and WordNet-based semantic similarity, and then clusters these sets with a hierarchical clustering method. The second clustering stage adjusts the results of the first, because some contexts may be dissimilar in both edit distance and WordNet similarity yet still express the same semantic relationship.
The second-stage clustering uses the distributional hypothesis to merge such context clusters. The results show that the proposed approach is close to the state-of-the-art approach in clustering entity pairs with a single semantic relationship and achieves good precision in clustering those with multiple semantic relationships.
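
To make the first clustering stage of contribution (3) more concrete, the following is a minimal sketch of grouping relation contexts by edit distance only. It is an illustration under stated assumptions, not the implementation used in this thesis: the function names, the token-level distance, the greedy grouping strategy, and the threshold of 0.6 are all hypothetical, and the WordNet-based similarity, the hierarchical clustering step, and the second-stage merging are omitted.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(ctx_a, ctx_b):
    """Edit-distance similarity in [0, 1] over lowercased whitespace tokens."""
    ta, tb = ctx_a.lower().split(), ctx_b.lower().split()
    longest = max(len(ta), len(tb))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(ta, tb) / longest


def group_contexts(contexts, threshold=0.6):
    """Greedy grouping (an assumption of this sketch): each context joins the
    first group whose first member is similar enough, else starts a new group."""
    groups = []
    for ctx in contexts:
        for group in groups:
            if similarity(ctx, group[0]) >= threshold:
                group.append(ctx)
                break
        else:
            groups.append([ctx])
    return groups


# Hypothetical contexts found between a target company and related entities.
contexts = [
    "competes with",
    "competes directly with",
    "is a partner of",
    "is a close partner of",
]
groups = group_contexts(contexts)
```

With these made-up contexts, the two "competes" phrases fall into one group and the two "partner" phrases into another. Contexts that share a relation but few tokens (for example, "rivals" and "competes with") would stay apart under this measure, which is the gap the second clustering stage is meant to close.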