Automatic Extraction And Integration For Academic Achievement

Posted on: 2015-04-19
Degree: Master
Type: Thesis
Country: China
Candidate: P Hong
Full Text: PDF
GTID: 2298330434953177
Subject: Electronics and Communications Engineering
Abstract/Summary:
Extracting and integrating academic achievements from Web pages not only facilitates the scientific management of those achievements, but also provides a fundamental resource for further mining of experts' academic tracks. Because of their low robustness, existing information extraction systems cannot adapt to frequent changes in Web pages. In addition, the huge volume of resources, their high redundancy and low confidence, and the inconsistency of their descriptions mean that the accuracy of the results cannot be guaranteed. This thesis focuses on the academic achievements of experts and studies extraction and data deduplication, two of the key techniques of Web information integration.

Although several Web information extraction methods exist, they have low robustness: they either depend strongly on extraction templates or impose strict requirements on the structure of the Web page. In this thesis, a Spatial Relation Based DOM (SRB-DOM) Web information extraction algorithm is proposed to extract academic achievements from Web pages. In the SRB-DOM algorithm, each node in the DOM tree is mapped to an object in 2-D space, and the spatial relationships among the objects are described using rectangle algebra. The algorithm uses the spatial relationships among the nodes to extract the tuple data. Finally, the academic achievements are extracted from records formed from the largest connectionless boundary tuples. Analysis and simulation results demonstrate that the SRB-DOM algorithm is more robust than the existing path-based extraction algorithm.

The diversity of information sources and descriptions results in a large number of similar or repeated extraction results, so some cleaning work must be done before further integration and mining of the achievement information.
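The spatial description used by SRB-DOM can be illustrated with a small sketch. It assumes the common formulation of rectangle algebra, in which the relation between two axis-aligned rectangles is the pair of Allen interval relations between their projections on the x and y axes. The `Box` type, the function names, and the sample coordinates are hypothetical; the abstract does not specify how node geometry is obtained or which relations the thesis actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Rendered bounding box of a DOM node (hypothetical representation)."""
    x: float
    y: float
    width: float
    height: float


def allen_relation(a, b):
    """Return the Allen interval relation of interval a with respect to b.

    Each interval is a (start, end) pair with start < end.
    """
    a1, a2 = a
    b1, b2 = b
    if a2 < b1:
        return "before"
    if b2 < a1:
        return "after"
    if a2 == b1:
        return "meets"
    if b2 == a1:
        return "met-by"
    # From here on the two intervals share interior points.
    if a1 == b1 and a2 == b2:
        return "equals"
    if a1 == b1:
        return "starts" if a2 < b2 else "started-by"
    if a2 == b2:
        return "finishes" if a1 > b1 else "finished-by"
    if b1 < a1 and a2 < b2:
        return "during"
    if a1 < b1 and b2 < a2:
        return "contains"
    return "overlaps" if a1 < b1 else "overlapped-by"


def rect_relation(a, b):
    """Rectangle-algebra relation: one Allen relation per axis (x, y)."""
    rel_x = allen_relation((a.x, a.x + a.width), (b.x, b.x + b.width))
    rel_y = allen_relation((a.y, a.y + a.height), (b.y, b.y + b.height))
    return rel_x, rel_y


# Two adjacent cells in the same row, then a block placed above another.
left_cell = Box(0, 0, 100, 20)
right_cell = Box(100, 0, 80, 20)
title = Box(0, 0, 200, 20)
authors = Box(0, 30, 150, 20)

print(rect_relation(left_cell, right_cell))
print(rect_relation(title, authors))
```

Because the relation depends only on where nodes are rendered, not on their path in the DOM tree, a description of this kind would be unaffected by changes to the markup that leave the layout intact, which is consistent with the robustness argument made in the abstract.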
In this thesis, an Entropy Incremental Based Classification algorithm is proposed, which allocates a weight to each data item, calculates the similarity among achievements, and classifies the achievements using the degree of importance given by the entropy increment of the records. A Data Standardization Based Record Combine (DSBRC) algorithm is also presented. The DSBRC algorithm first normalizes the achievement records based on their features, then labels the data state of each record to build a data state matrix. From this matrix, a complete description of each achievement record can be obtained. Analysis and simulation results indicate that the proposed algorithms surpass other algorithms in accuracy and completeness.

The robustness of Web information extraction determines how practical an extraction system is, so it should be improved. Compared with the traditional path-based extraction method, the SRB-DOM algorithm proposed in this thesis greatly enhances the robustness of the system by eliminating the dependence on paths. Moreover, the Entropy Incremental Based Classification and DSBRC algorithms improve the accuracy of classification and the completeness of merging for the achievement records, respectively, which is of significant research value for further data mining and knowledge discovery.
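The DSBRC steps described above (normalize, label data states, build a state matrix, combine) can be sketched as follows. This is only an illustration under stated assumptions: the field list, the two-valued state (present or missing), and the rule of keeping the longest available value are all hypothetical, since the abstract does not give the thesis's actual states or merge rules. The sample records are made up.

```python
# Hypothetical schema for an achievement record.
FIELDS = ("title", "authors", "venue", "year")


def normalize(record):
    """Collapse whitespace and treat empty or absent fields as missing."""
    out = {}
    for field in FIELDS:
        value = record.get(field)
        if value is not None:
            value = " ".join(str(value).split())
        out[field] = value or None
    return out


def state_matrix(records):
    """One row per record, one column per field: 1 = present, 0 = missing."""
    return [[1 if r[f] is not None else 0 for f in FIELDS] for r in records]


def merge(records):
    """Combine duplicate records into one description that is as complete
    as the inputs allow (assumed rule: keep the longest present value)."""
    normalized = [normalize(r) for r in records]
    matrix = state_matrix(normalized)
    merged = {}
    for col, field in enumerate(FIELDS):
        candidates = [
            rec[field] for rec, row in zip(normalized, matrix) if row[col]
        ]
        merged[field] = max(candidates, key=len) if candidates else None
    return merged, matrix


# Two made-up records assumed to describe the same achievement.
duplicates = [
    {"title": "Web  information extraction", "authors": "A. Author", "year": ""},
    {"title": "Web information extraction", "venue": "Example Journal", "year": 2014},
]

merged, matrix = merge(duplicates)
print(matrix)
print(merged)
```

The state matrix makes it explicit which record can supply each field, so the merged description fills the gaps of one record with values from another instead of discarding either.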
Keywords/Search Tags: information integration, Web information extraction, spatial connection, data classification, record deduplication