
Research On Full-Text Clustering Analysis And Information Extraction Of Geological Data

Posted on: 2015-03-11  Degree: Master  Type: Thesis
Country: China  Candidate: Z K Fan  Full Text: PDF
GTID: 2268330428969733  Subject: Computer technology
Abstract/Summary:
With the development and spread of computer and Internet technology, many industries have accumulated large amounts of digital information, and the geological industry is no exception. At this scale, the traditional geological data service system shows many deficiencies: the retrieval system recommends unrelated or weakly related content, and retrieval results are whole documents rather than specific pieces of information. This study aims to improve existing text mining technologies, based on the characteristics of geological data, so that the current system works better.

The existing recommendation function for geological data relies only on metadata in the database and cannot determine the correlation between full texts, so it often recommends irrelevant data to the user. This paper uses text clustering to address the problem; in the text feature selection phase, the weights of geology-related words are manually increased to highlight the theme. Data in one cluster usually represent the same or similar topics, so recommending information by category yields higher relevance. In addition, retrieved documents are clustered first, and the user can then inspect the data by cluster theme, which improves query efficiency by narrowing the search range.

The task of information extraction is to extract information from text according to a specified pattern and store it in a structured way. This paper uses it to extract metadata and store it in the database automatically, which improves work efficiency, reduces manual effort, and substantially increases the accuracy of the database by avoiding human errors. In addition, a thematic information extraction module is built with the relevant components of GATE: extraction rules are written based on the characteristics of each topic, and vocabularies for the relevant fields are provided. The final extraction results can be as fine-grained as words, phrases, sentences, and paragraphs, which makes them more concise to use.

Because Chinese text clustering and information extraction both use Chinese word segmentation for preprocessing, this paper optimizes and improves the ICTCLAS segmentation function for geological data. On the one hand, professional geological vocabularies are added to improve segmentation accuracy. On the other hand, word frequency statistics are used to identify candidate new words, which are fed back to the user to determine the final set of new words; segmentation is then completed with the new words included. In testing, the segmentation results improved enough to meet the requirements of subsequent processing.

For a long time, research on geological data informatization has focused on digitization, with few studies on the efficient use of digital data. Full-text clustering analysis, which achieves a rough classification of documents, and information extraction, which accesses the specific content of documents by topic, can improve the efficiency of data utilization, and they have been affirmed by the geological data service sector.
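The new-word step mentioned in the abstract (using word frequency statistics to propose candidates that a user then confirms) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the abstract does not say how candidates are formed, so the sketch assumes that a candidate is a pair of adjacent tokens from the segmenter output that recurs at least `min_count` times and is missing from the lexicon. The function name `new_word_candidates`, the threshold, and the sample tokens are all hypothetical, and ICTCLAS itself is not called.

```python
from collections import Counter


def new_word_candidates(token_lists, lexicon, min_count=2):
    """Propose new-word candidates from already segmented text.

    Assumption (not from the thesis): a candidate is the concatenation of
    two adjacent tokens that occurs at least `min_count` times and is not
    yet in `lexicon`. The result is only a suggestion list for a user to
    review; nothing is added to the lexicon here.
    """
    counts = Counter()
    for tokens in token_lists:
        # Count every pair of neighbouring tokens within one sentence.
        for left, right in zip(tokens, tokens[1:]):
            counts[left + right] += 1
    candidates = [
        word for word, n in counts.items()
        if n >= min_count and word not in lexicon
    ]
    # Most frequent first; ties broken alphabetically for a stable order.
    return sorted(candidates, key=lambda w: (-counts[w], w))


# Hypothetical segmenter output in which "页岩气" (shale gas) was split
# into "页岩" (shale) + "气" (gas) because the lexicon lacks the compound.
segmented = [
    ["页岩", "气", "勘探"],   # shale / gas / exploration
    ["页岩", "气", "储量"],   # shale / gas / reserves
    ["勘探", "报告"],         # exploration / report
]
lexicon = {"页岩", "气", "勘探", "储量", "报告"}

print(new_word_candidates(segmented, lexicon))  # ['页岩气']
```

Words confirmed by the user would then be added to the segmenter's user dictionary before the text is segmented again; the frequency threshold trades recall of rare terms against the number of spurious candidates the user has to review.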
Keywords/Search Tags: geological data, Chinese word segmentation, text clustering, K-Means, information extraction