
The Design And Implementation Of The Industry Literature Automatic Analysis System

Posted on: 2015-09-07  Degree: Master  Type: Thesis
Country: China  Candidate: H Yang
GTID: 2298330452950768  Subject: Computer application technology
Abstract/Summary:
The development of the digital age has put a vast amount of information in front of people, especially electronic information disseminated over the network. As a result, people face two problems: information is poorly utilized, and massive volumes of it are hard to browse quickly. How to obtain usable data from a large amount of information quickly and efficiently is a serious problem. In practice, people cannot get the information they need by reading through all of the electronic information available, and traditional information acquisition is just simple, non-intelligent search, so there is an urgent need for tools that extract and condense information and provide efficient access to it. With the development of data mining technology, people have more means of accessing information, in particular text parsing, fragmentation, and information retrieval techniques. The automatic analysis system for industry literature described in this thesis is built around the main flow of text mining. It proposes a new method of fragmenting PDF documents with pdf2htmlEX, uses tesseract-OCR to overcome the limits on parsing image-based text, and, through text parsing, word segmentation, structured analysis, and storage, finally achieves keyword extraction and annotated text browsing. The thesis focuses on text parsing, Chinese word segmentation, information retrieval, and keyword extraction.
The system is built with MyEclipse and the compatible EOS development platform. Its main modules are the text management module, the text analysis module, and the text analysis report display module; users can search the literature list directly or browse an item by clicking on it. The thesis summarizes text mining technology and studies document parsing techniques in detail. It runs comparative experiments on popular Java implementations of Chinese analyzers and on their compatibility with Lucene, draws the relevant conclusions, and compares keyword extraction with Ansj against keyword extraction with Lucene. The system parses PDF files, extracts their text, and then performs structural analysis and fragmentation, word segmentation and indexing, and synonym merging before finally extracting the text keywords. The results are shown in a visual display interface, where the user can retrieve the relevant keywords according to specified conditions. The system provides annotated PDF document browsing and supports in-depth analysis and literature mining.
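The last stages of the pipeline described above (word segmentation, synonym merging, keyword extraction) can be illustrated with a minimal sketch. It is not the thesis's implementation: the thesis works with Java analyzers such as Ansj and Lucene on Chinese text, whereas this sketch assumes a plain whitespace tokenizer as a stand-in for segmentation, a hypothetical hand-written synonym table, and simple term frequency as the ranking criterion. All names (`SYNONYMS`, `STOPWORDS`, `tokenize`, `merge_synonyms`, `extract_keywords`) are illustrative.

```python
from collections import Counter

# Hypothetical synonym table: maps variant terms to one canonical form.
SYNONYMS = {
    "pdfs": "pdf",
    "documents": "document",
    "parsing": "parse",
    "parsed": "parse",
}

# Hypothetical stop-word list; a real system would use a much larger one.
STOPWORDS = {"the", "a", "of", "and", "is", "to", "in", "are", "for"}


def tokenize(text):
    """Whitespace tokenizer, a stand-in for a real Chinese word segmenter."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(".,;:!?()\"'")
        if word and word not in STOPWORDS:
            tokens.append(word)
    return tokens


def merge_synonyms(tokens, synonyms=SYNONYMS):
    """Replace each token with its canonical form when one is defined."""
    return [synonyms.get(token, token) for token in tokens]


def extract_keywords(text, top_n=3):
    """Return the top_n most frequent terms after synonym merging."""
    counts = Counter(merge_synonyms(tokenize(text)))
    # Sort by descending frequency, then alphabetically so ties are stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:top_n]]


if __name__ == "__main__":
    sample = (
        "Parsing PDFs is hard. The parsed PDF documents are indexed; "
        "each document is searchable."
    )
    print(extract_keywords(sample))
```

Merging synonyms before counting means that variants of one concept reinforce each other in the ranking instead of competing, which is presumably why the system places that step ahead of keyword extraction.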
Keywords/Search Tags: Chinese Word Segmentation, Text Mining, Structuring, Keyword Extraction, Information Retrieval