The Design And Implementation Of The Industry Literature Automatic Analysis System

Posted on:2015-09-07

Degree:Master

Type:Thesis

Country:China

Candidate:H Yang

Full Text:PDF

GTID:2298330452950768

Subject:Computer application technology

Abstract/Summary:

The digital age has put a vast amount of information in front of people, much of it electronic information disseminated over the network. This creates a problem: information is poorly utilized, and massive collections are difficult to browse quickly. Obtaining usable data quickly and efficiently from a large body of information has become a serious challenge. In practice, people cannot find what they need by reading through all of the available electronic information, and traditional information acquisition amounts to simple, non-intelligent search, so there is an urgent need for tools that extract and condense information and give efficient access to it. With the development of data mining technology, more means of accessing information have become available, in particular text parsing, fragmentation, and information retrieval techniques.
The automatic analysis system for industry literature described in this thesis is built around the main workflow of text mining. It proposes a new method of fragmenting PDF documents with pdf2htmlEX and uses Tesseract OCR to overcome the limits on parsing image-based text. Through text parsing, word segmentation, structured analysis, and storage, it then achieves keyword extraction and annotated text browsing. The thesis focuses on text parsing, Chinese word segmentation, information retrieval, and keyword extraction.
The system is built with MyEclipse and the compatible EOS development platform. Its main modules are a text management module, a text analysis module, and a text analysis report display module. Users can search the literature list directly or browse by clicking on a document.
This thesis summarizes text mining technology and studies document parsing techniques in detail. It runs comparative experiments on popular Java implementations of Chinese analyzers and their compatibility with Lucene, draws conclusions from them, and compares keyword extraction with Ansj and with Lucene. The system parses each PDF file and extracts its text word by word, then performs structural analysis and fragmentation, word segmentation and indexing, and synonym merging, and finally extracts the text keywords. The results are shown in a visual display interface, where users can retrieve relevant keywords according to search conditions. The system provides annotated PDF document browsing and supports in-depth analysis and literature mining.
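
The last stages of the pipeline described above (segmentation, synonym merging, keyword extraction) can be illustrated with a minimal Java sketch. It is not the thesis's implementation: the class name `KeywordSketch`, the stop-word list, and the synonym table are hypothetical, splitting on non-letter characters stands in for a real Chinese analyzer such as Ansj, and plain term frequency stands in for whatever scoring the system actually uses.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class KeywordSketch {

    // Hypothetical stop-word list; a real system would load a much larger one.
    private static final Set<String> STOP_WORDS =
            Set.of("the", "a", "of", "and", "is", "for", "to", "in");

    /**
     * Returns the k most frequent terms after synonym merging.
     * Ties are broken alphabetically so the result is deterministic.
     */
    public static List<String> topKeywords(String text, Map<String, String> synonyms, int k) {
        Map<String, Integer> counts = new HashMap<>();
        // Stand-in for word segmentation: lower-case, then split on
        // anything that is not a letter or a digit.
        for (String token : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            // Synonym merge: map each variant to its canonical form before counting.
            String canonical = synonyms.getOrDefault(token, token);
            counts.merge(canonical, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Sort by descending count, then alphabetically by term.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed synonym table: variants on the left, canonical term on the right.
        Map<String, String> synonyms = Map.of("documents", "document", "docs", "document");
        String text = "PDF parsing: the parser reads PDF documents, "
                + "segments each document, and indexes docs.";
        System.out.println(topKeywords(text, synonyms, 2)); // prints [document, pdf]
    }
}
```

Merging synonyms before counting means "documents", "document", and "docs" contribute to a single keyword, which is why "document" outranks "pdf" in this example.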

Keywords/Search Tags:

Chinese Word Segmentation, Text Mining, Structuring, Keyword Extraction, Information Retrieval

Related items

1	Research On The Key Techniques Of Web Information Intelligent Acquisition
2	Web Text Mining Research
3	Data Mining Research In Web Information Retrieval And Classification
4	Keyword Extraction Based On Sequential Pattern Mining
5	Text Mining And Its Application In Text Retrieval
6	Semantics-based language models for information retrieval and text mining
7	Applying text mining to multi-level indexing and searching for enhancing probabilistic information retrieval
8	Geographic Information Changes Recognition Based On Massive Text Information Mining
9	Surveillance Video Structuring And Retrieval In Camera Networks
10	Research On The Methods Of Web Text Mining For Information Retrieval