Automatic Extraction And Integration For Academic Achievement

Posted on:2015-04-19

Degree:Master

Type:Thesis

Country:China

Candidate:P Hong

Full Text:PDF

GTID:2298330434953177

Subject:Electronics and Communications Engineering

Abstract/Summary:

Extracting and integrating academic achievements from Web pages not only facilitates the scientific management of those achievements, but also provides fundamental resources for further mining of experts' academic tracks. Existing information extraction systems have low robustness and cannot adapt to frequent changes in Web pages. In addition, because the resources are large in volume, highly redundant, of low confidence, and inconsistently described, the accuracy of the results cannot be guaranteed. This thesis focuses on expert academic achievements and studies extraction and data deduplication, two of the key techniques of Web information integration.

Although several Web information extraction methods exist, they have low robustness: they either depend strongly on extraction templates or place strict requirements on the Web page structure. This thesis proposes a Spatial Relation Based DOM (SRB-DOM) Web information extraction algorithm that extracts academic achievements from Web pages. In SRB-DOM, each node in the DOM tree is mapped to an object in 2-D space, and the spatial relationships among the objects are described using rectangle algebra. The algorithm uses these spatial relationships among nodes to extract tuple data. Finally, the academic achievements are extracted according to records formed from the largest connectionless boundary tuples. Analysis and simulation results demonstrate that SRB-DOM is more robust than the existing path-based extraction algorithm.

The diversity of information sources and descriptions produces a large number of similar or repeated extraction results, so the data must be cleaned before the achievement information can be further integrated and mined.
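
The spatial description that SRB-DOM relies on can be illustrated with a small sketch. The abstract does not say how the thesis computes node positions or which relations it uses, so everything here is an assumption: the `Rect` type, the coordinates, and the function names are hypothetical. The sketch only shows the general idea of rectangle algebra, in which the relation between two axis-aligned rectangles is the pair of Allen interval relations between their projections on the x and y axes.

```python
from typing import NamedTuple, Tuple


class Rect(NamedTuple):
    """Bounding box of a rendered DOM node (hypothetical layout data)."""
    left: float
    top: float
    right: float
    bottom: float


def allen(a1: float, a2: float, b1: float, b2: float) -> str:
    """Allen relation of interval [a1, a2] to [b1, b2]; assumes a1 < a2 and b1 < b2."""
    if a2 < b1:
        return "before"
    if a2 == b1:
        return "meets"
    if b2 < a1:
        return "after"
    if b2 == a1:
        return "met-by"
    # From here on the two intervals share interior points.
    if a1 == b1 and a2 == b2:
        return "equals"
    if a1 == b1:
        return "starts" if a2 < b2 else "started-by"
    if a2 == b2:
        return "finishes" if a1 > b1 else "finished-by"
    if b1 < a1 and a2 < b2:
        return "during"
    if a1 < b1 and b2 < a2:
        return "contains"
    return "overlaps" if a1 < b1 else "overlapped-by"


def rect_relation(a: Rect, b: Rect) -> Tuple[str, str]:
    """Rectangle-algebra relation: one Allen relation per axis (x, then y)."""
    return (
        allen(a.left, a.right, b.left, b.right),
        allen(a.top, a.bottom, b.top, b.bottom),
    )


# Two stacked blocks of equal width, e.g. a title line directly above an author line.
title_box = Rect(0, 0, 100, 20)
author_box = Rect(0, 20, 100, 40)
print(rect_relation(title_box, author_box))  # ('equals', 'meets')
```

Because such relations depend only on where nodes are rendered, not on their path in the DOM tree, a description of this kind can stay stable when the markup around a record changes, which is consistent with the robustness claim made for SRB-DOM.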
This thesis presents an Entropy Incremental Based Classification algorithm, which allocates a weight to each data item, calculates the similarity among achievements, and classifies the achievements according to the degree of importance in the entropy incremental records. A Data Standardization Based Record Combine (DSBRC) algorithm is also presented. DSBRC first normalizes the achievement records based on their features; the data state of each record is then labeled, and a data state matrix is built. From this matrix, a complete description of each achievement record can be obtained. Analysis and simulation results indicate that the proposed algorithms surpass other algorithms in accuracy and completeness.

The robustness of Web information extraction is important to the practicality of a system. By eliminating the dependence on paths, the SRB-DOM algorithm proposed in this thesis is considerably more robust than the traditional path-based extraction method. Moreover, the Entropy Incremental Based Classification and DSBRC algorithms improve the accuracy of classification and the completeness of merging for the achievement records, respectively, which is valuable for further data mining and knowledge exploration.
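
The record-combination step described for DSBRC (normalize, label the data state, build a state matrix, derive a complete record) can be sketched roughly as follows. This is not the thesis's algorithm: the abstract gives no details, so the field schema, the normalization rules, the binary filled/missing state, and the "keep the longest value" rule are all assumptions made for illustration, and the sample records are invented.

```python
import re

FIELDS = ("title", "authors", "venue", "year")  # hypothetical schema


def normalize(record):
    """Collapse whitespace and lowercase each field; missing or blank fields become None."""
    out = {}
    for field in FIELDS:
        value = record.get(field)
        text = "" if value is None else re.sub(r"\s+", " ", str(value)).strip().lower()
        out[field] = text or None
    return out


def state_matrix(records):
    """One row per normalized record, one column per field: 1 if filled, else 0."""
    return [[1 if r[f] is not None else 0 for f in FIELDS] for r in records]


def merge(records):
    """Combine records that are assumed to describe the same achievement."""
    normalized = [normalize(r) for r in records]
    matrix = state_matrix(normalized)
    merged = {}
    for col, field in enumerate(FIELDS):
        filled = [normalized[row][field]
                  for row in range(len(normalized)) if matrix[row][col]]
        # Assumption: when several records fill a field, keep the longest value.
        merged[field] = max(filled, key=len) if filled else None
    return merged


# Invented sample data: two partial descriptions of the same paper.
rec_a = {"title": "Web  Information Extraction", "year": 2014}
rec_b = {"title": "web information extraction", "authors": "A. Author", "venue": ""}
print(state_matrix([normalize(rec_a), normalize(rec_b)]))  # [[1, 0, 0, 1], [1, 1, 0, 0]]
print(merge([rec_a, rec_b]))
```

In this sketch the state matrix makes it explicit which record can supply which field, so the merged record is more complete than either input. Deciding which records belong together in the first place is the job of the classification step and is not modeled here.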

Keywords/Search Tags:

information integration, Web information extraction, spatial connection, data classification, record deduplication

Related items

1	Study On Key Techniques And System For Accurate Web Information Extraction
2	Research On Key Issues Of Web Information Integration Oriented Web Information Extraction
3	Research On Key Technologies Of Deep Web Information Integration
4	Research On Key Techniques Of Web Information Extraction For Online Public Opinion Analysis
5	Research Of Web Information Extraction Technology Based On Tree Structure
6	Research On Web Information Extraction For Domain In Information Integration System
7	Research On Techniques Of Automatic Data Record Analysis And Recognition For Accurate Web Information Extraction
8	Web Information Extraction And Integration Research Based On XML
9	Research On Automated Web Navigation And Data Integration Rules For Web Information Extraction
10	Design And Implementation Of A Tax-related Information Integration System For Local Taxation Bureau