Research on Data Elimination in Forestry Web Yellow Page Information Integration

Posted on: 2014-01-23
Degree: Master
Type: Thesis
Country: China
Candidate: X Q Liu
Full Text: PDF
GTID: 2248330398456782
Subject: Management Science and Engineering
Abstract/Summary:
Integrating the massive, heterogeneous, dynamic and scattered information on the Internet is the basis of vertical search engines and other information services. Because information changes frequently and is often entered incorrectly, integrated Internet information contains a large amount of "dirty information", which degrades its subsequent use. Methods for cleaning integrated information, and the key technologies behind them, are therefore important research topics.

Forestry yellow pages are an important forestry information resource. Integrating web yellow pages from different websites and building a topic-oriented forestry web yellow page database are of significant value. Existing research has preliminarily integrated some information from different websites, but the integrated information contains a large amount of "dirty information", including abnormal data, incomplete data and duplicated data. Eliminating duplicated data is the main problem.

This thesis summarizes the principles of data elimination and its common methods, analyzes their advantages and disadvantages, and presents a stepwise clustering data elimination (SCDE) method. The method first divides the whole record set into subsets, using both key-attribute division and the Canopy clustering technique, and then performs precise duplicate elimination on the records within each subset. For the precise clustering of similar records, a fuzzy entity matching method based on dynamic weights is proposed, and company names receive special treatment to improve matching accuracy. A forestry web yellow page data elimination system is designed and implemented. The effectiveness of the SCDE method is verified through multiple experiments. The method is practical and successfully solves the data elimination problem in forestry yellow page integration.
Keywords/Search Tags: Web information integration, stepwise clustering data elimination (SCDE), approximately duplicate records, Canopy clustering, fuzzy entity matching
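
The two-stage idea described in the abstract (coarse grouping with Canopy clustering, then precise fuzzy matching inside each group) can be illustrated with a minimal Python sketch. This is not the thesis's implementation. The abstract does not give the similarity measure, the thresholds, the field weights or the exact meaning of "dynamic weight", so everything below is an assumption: `similarity` uses the standard library's `difflib.SequenceMatcher` as a stand-in, the thresholds `t1`, `t2` and `threshold` are arbitrary, and "dynamic weight" is interpreted here as dropping fields that are missing in either record and renormalizing the remaining weights. The key-attribute division step and the special handling of company names are omitted. All function and field names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Cheap string similarity in [0, 1]; a stand-in, not the thesis's measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def canopy_clusters(records, key, t1=0.5, t2=0.8):
    """Group record indices into (possibly overlapping) canopies.

    t1 is the loose threshold: records at least this similar to the
    canopy center join the canopy. t2 >= t1 is the tight threshold:
    records at least this similar are removed from the candidate pool.
    """
    remaining = list(range(len(records)))
    canopies = []
    while remaining:
        center = remaining[0]
        canopy = [center]
        still_remaining = []
        for i in remaining[1:]:
            s = similarity(records[center][key], records[i][key])
            if s >= t1:
                canopy.append(i)
            if s < t2:
                still_remaining.append(i)
        canopies.append(canopy)
        remaining = still_remaining
    return canopies


def match_score(r1, r2, base_weights):
    """Weighted field similarity with weights adjusted per record pair.

    Assumption: a field that is empty or absent in either record is
    skipped, and the remaining weights are renormalized.
    """
    total, weight_sum = 0.0, 0.0
    for field, w in base_weights.items():
        v1, v2 = r1.get(field), r2.get(field)
        if not v1 or not v2:
            continue
        total += w * similarity(v1, v2)
        weight_sum += w
    return total / weight_sum if weight_sum else 0.0


def find_duplicates(records, base_weights, key="name", threshold=0.85):
    """Return index pairs judged to be duplicates, comparing only within canopies."""
    pairs = set()
    for canopy in canopy_clusters(records, key):
        for a in range(len(canopy)):
            for b in range(a + 1, len(canopy)):
                i, j = sorted((canopy[a], canopy[b]))
                if match_score(records[i], records[j], base_weights) >= threshold:
                    pairs.add((i, j))
    return pairs


# Made-up example records, for illustration only.
records = [
    {"name": "Beijing Forestry Machinery Co., Ltd.", "phone": "010-12345678",
     "address": "Haidian District, Beijing"},
    {"name": "Beijing Forestry Machinery Co. Ltd", "phone": "010-12345678",
     "address": ""},
    {"name": "Yunnan Timber Trading Company", "phone": "0871-7654321",
     "address": "Kunming, Yunnan"},
]
weights = {"name": 0.5, "phone": 0.3, "address": 0.2}
print(find_duplicates(records, weights))
```

Restricting the expensive pairwise comparison to records that share a canopy is what keeps the second stage tractable on a large record set. The cost is that true duplicates whose key attribute differs too much never get compared, so the loose threshold has to be chosen conservatively.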