
Extraction of Effective Information from, and Classification of, Chinese Scientific Papers in PDF Format

Posted on: 2012-12-17
Degree: Master
Type: Thesis
Country: China
Candidate: L T Ren
Full Text: PDF
GTID: 2178330332499626
Subject: Software engineering
Abstract/Summary:
As information technology evolves, people have access to many types of information resources, which bring considerable convenience to daily life and work. Documents are among the most common of these resources. Because PDF offers platform independence, separation of display from information, and good security, it has gradually gained wide acceptance and is becoming the main format for releasing and disseminating electronic documents. For these reasons, scientific papers are generally saved in PDF format. At present, when PDF papers are uploaded and submitted, their useful information is usually entered and processed manually, which is both inefficient and error-prone. How well information can be extracted from PDF documents therefore determines how effectively people can access and use that information to solve their problems.

This thesis discusses the extraction of useful information from Chinese-language papers in PDF format, the extraction of a topic sentence from the abstract, and the classification of papers according to that topic sentence. A paper's useful information refers to its title, authors, keywords, abstract, and similar fields. The main research work covers the following areas:

1. Extraction of useful information from PDF papers. The format of Chinese papers is generally fixed, so useful information can be extracted by combining layout and font characteristics. This method takes full advantage of the characteristics of PDF documents and extracts the useful information accurately and efficiently.

2. Extraction of the topic sentence from the paper abstract. This part is carried out through semantic processing of the abstract. The keywords are a concise summary of the paper, and the abstract is an overview of its central idea. However, abstracts often contain redundant information. We therefore discuss how to extract the topic sentence from the abstract using the keywords, based on a genetic algorithm, in order to streamline the abstract and expand the meaning of the keywords.

3. Classification of papers using the topic sentence of the abstract. The topic sentence is the most refined statement of the paper's central theme, so it can be used to classify papers. This section discusses how to use Lucene.Net and ICTCLAS, based on the naive Bayes algorithm, to classify papers.
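
To make the classification step concrete, the following is a minimal Python sketch of a multinomial naive Bayes classifier with add-one smoothing. It is an illustration only and not the thesis implementation, which is described as using Lucene.Net and ICTCLAS. The class name, the category labels, and the toy training data are all hypothetical, and the sketch assumes each topic sentence has already been segmented into a list of word tokens (the role a segmenter such as ICTCLAS would play for Chinese text).

```python
import math
from collections import Counter, defaultdict


class NaiveBayesClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing.

    Hypothetical illustration; input is a list of already-segmented tokens.
    """

    def __init__(self):
        self.class_doc_counts = Counter()               # training sentences per class
        self.class_term_counts = defaultdict(Counter)   # token counts per class
        self.vocabulary = set()                         # all tokens seen in training

    def train(self, tokens, label):
        """Record one segmented topic sentence under the given class label."""
        self.class_doc_counts[label] += 1
        self.class_term_counts[label].update(tokens)
        self.vocabulary.update(tokens)

    def classify(self, tokens):
        """Return the class with the highest log posterior for the tokens."""
        total_docs = sum(self.class_doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, float("-inf")
        for label, doc_count in self.class_doc_counts.items():
            term_counts = self.class_term_counts[label]
            total_terms = sum(term_counts.values())
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / total_docs)
            for token in tokens:
                score += math.log(
                    (term_counts[token] + 1) / (total_terms + vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, made-up training data: token lists stand in for segmented topic sentences.
clf = NaiveBayesClassifier()
clf.train(["pdf", "text", "extraction", "font"], "document_processing")
clf.train(["pdf", "layout", "title", "extraction"], "document_processing")
clf.train(["genetic", "algorithm", "fitness", "population"], "optimization")
clf.train(["genetic", "algorithm", "crossover", "mutation"], "optimization")

predicted = clf.classify(["pdf", "title", "extraction"])
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and add-one smoothing keeps a token unseen in one class from driving that class's probability to zero.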
Keywords/Search Tags: Chinese PDF papers, useful information extraction, topic sentence extraction, classification