
Research On Information Extraction And Automatic Classification Method For Open Access Journal Websites

Posted on: 2013-06-02
Degree: Master
Type: Thesis
Country: China
Candidate: Y Li
Full Text: PDF
GTID: 2248330392954744
Subject: Computer application technology
Abstract/Summary:
With the growth of Open Access (OA) resources on the Internet, the “isolated island” state of OA resources has become increasingly obvious. To enable rapid sharing of and access to OA journals, integrating OA resources into a digital repository organized by subject has become a research focus. Acquiring and classifying OA journal websites is the prerequisite for this integration, and it is the main content of this thesis.

Firstly, we propose a method to extract information from OA journal websites. It includes a method for extracting URLs from the seed website and a method for extracting the title and text of OA journal websites. The system uses HTML tags to locate the region containing URLs, and then uses keywords to recognize the URLs of OA journal websites. After acquiring these URLs, the system parses the homepage of each OA journal website to extract text. If no text can be extracted, a deeper page is located and parsed according to keywords in the anchor text. The extraction result is saved in a format suitable for the subsequent classification process.

Secondly, based on the extracted title and text, we analyze the characteristics of OA journal sites and design an algorithm to obtain subject keywords of OA journal websites as a corpus. We then propose a title-based classification method for OA journal websites. To compare it with SVM-based classification, we analyze the characteristics of SVM-based multi-class classification methods. We propose a method to build nodes based on the minimum Euclidean distance between categories. This improves the construction of the directed acyclic graph support vector machine (DAG-SVM) and reduces the cumulative error.

Finally, based on the methods above, we develop an extraction and classification system for OA journal websites and outline the next steps of the work.
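
The URL extraction step described above (locate a region by HTML tag, then keep links that match keywords) can be illustrated with a minimal sketch. This is not the thesis's implementation: the choice of `ul` as the region tag, the keyword `journal`, and the names `RegionLinkParser` and `extract_journal_urls` are all assumptions made for illustration, and only Python's standard `html.parser` module is used.

```python
from html.parser import HTMLParser


class RegionLinkParser(HTMLParser):
    """Collect (href, anchor text) pairs found inside a chosen container tag.

    Hypothetical helper: the region tag is an assumption, not taken from the thesis.
    """

    def __init__(self, region_tag="ul"):
        super().__init__()
        self.region_tag = region_tag
        self.depth = 0            # nesting depth inside the region tag
        self.current_href = None  # href of the <a> currently being read
        self.text_parts = []      # anchor text fragments of that <a>
        self.links = []           # collected (href, anchor_text) pairs

    def handle_starttag(self, tag, attrs):
        if tag == self.region_tag:
            self.depth += 1
        elif tag == "a" and self.depth > 0:
            # Only links inside the region are considered.
            self.current_href = dict(attrs).get("href")
            self.text_parts = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_href is not None:
            text = "".join(self.text_parts).strip()
            self.links.append((self.current_href, text))
            self.current_href = None
        elif tag == self.region_tag and self.depth > 0:
            self.depth -= 1


def extract_journal_urls(html, region_tag="ul", keywords=("journal",)):
    """Return hrefs inside the region whose URL or anchor text contains a keyword."""
    parser = RegionLinkParser(region_tag)
    parser.feed(html)
    urls = []
    for href, text in parser.links:
        haystack = (href + " " + text).lower()
        if any(keyword in haystack for keyword in keywords):
            urls.append(href)
    return urls


if __name__ == "__main__":
    # Made-up seed page fragment, for illustration only.
    sample = (
        '<div><a href="/about">About this journal list</a></div>'
        '<ul>'
        '<li><a href="http://example.org/journal-of-x">Journal of X</a></li>'
        '<li><a href="http://example.org/contact">Contact</a></li>'
        '</ul>'
    )
    print(extract_journal_urls(sample))
```

The same keyword test on anchor text could, under the same assumptions, be reused for the fallback step, where a deeper page is chosen when the homepage yields no text.
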
Keywords/Search Tags: Open access journal, Information extraction, Text classification, Support vector machine, Directed acyclic graph