
The Design and Development of a TextRank and Log-Likelihood Based Chrome Extension for Chinese Keyword Clouds

Posted on: 2016-01-28
Degree: Master
Type: Thesis
Country: China
Candidate: H F Ji
GTID: 2298330467990812
Subject: Foreign Linguistics and Applied Linguistics
Abstract/Summary:
With the rapid development of web technology, people can access information on the Internet more easily than before. At the same time, side effects such as information redundancy and overload have appeared. Under these circumstances, extracting key information through text mining becomes important for improving both the user experience on the web and working and reading efficiency.

This research designed and implemented a Google Chrome extension for Chinese keyword extraction using the TextRank and Log-Likelihood algorithms. The extension retrieves the user's current webpage and generates a keyword cloud according to some business logic.

The TextRank algorithm is a graph-based model: it calculates the weights of all the vertices in a graph and sorts them by weight. This research adapted TextRank to text so that it returns the keywords of the webpage the user is visiting. The Log-Likelihood algorithm, by contrast, is based on word frequency and a reference corpus; it generates and returns keywords by calculating their log-likelihood ratio. The word cloud displays keywords explicitly and directly to the user by varying the font size and relative position of the words. The program calculates the weights of all the keywords obtained from the two algorithms and generates the keyword cloud accordingly.

For the web architecture, the research used the Nginx web server as the foundation of the backend and followed an event-driven programming model throughout development. The backend logic was handled by Node.js. The program transferred data in a valid but lightweight way. For text processing and keyword extraction, the research deployed a Python script on the server, which handled the text cleaning, decoding, segmentation, and keyword extraction for the given webpage. An asynchronous mechanism improved the server's response speed and reduced the server load.
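The graph-based ranking described for TextRank can be illustrated with a minimal sketch. It is not the thesis's implementation: the function name `textrank_keywords`, the co-occurrence window, the damping factor of 0.85, and the fixed iteration count are all assumptions chosen for illustration, and the input is assumed to be a list of already segmented words (Chinese text would first need a word segmenter).

```python
from collections import defaultdict


def textrank_keywords(tokens, window=2, damping=0.85, iterations=50, top_k=5):
    """Rank words with a TextRank-style walk over a co-occurrence graph.

    All parameter names and defaults are illustrative assumptions.
    """
    # Build an undirected graph: two distinct words are linked when they
    # occur within `window` positions of each other.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            other = tokens[j]
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Start every vertex at the same score, then repeat the update
    # score(v) = (1 - d) + d * sum(score(u) / degree(u)) over neighbors u.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        new_scores = {}
        for word in neighbors:
            incoming = sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            new_scores[word] = (1 - damping) + damping * incoming
        scores = new_scores

    # Sort vertices by weight, highest first.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]
```

Calling `textrank_keywords(segmented_words)` on a page's segmented text returns `(word, score)` pairs, with words that co-occur with many other words ranked highest. Words that never co-occur with a different word do not enter the graph and are not returned.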
The research also discussed the data security, stability, and scalability of the program.

After design and development, the research completed an extension that provides a word cloud to the user. Testing shows that the extension offers a decent keyword list. However, the extension still needs improvement in aspects such as its range of functions, user interface, and processing speed.

As an extension and application of Corpus Linguistics research, the browser extension helps users grasp the content of a webpage quickly by showing them its keywords, thereby improving their web experience. Finally, the extension is expected to serve as a case of applying Corpus Linguistics research and web technology together, and to prompt more similar research in the future.
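The log-likelihood step mentioned in the abstract can be sketched as follows. This assumes the two-corpus log-likelihood (keyness) statistic commonly used in corpus linguistics, which compares a word's frequency on the page with its frequency in a reference corpus; the abstract does not state the exact formula used, and the function names, the overuse filter, and the count dictionaries here are hypothetical.

```python
import math


def log_likelihood(freq_page, size_page, freq_ref, size_ref):
    """Two-corpus log-likelihood ratio for one word (assumed formula)."""
    total = size_page + size_ref
    # Expected frequencies if the word were equally common in both corpora.
    expected_page = size_page * (freq_page + freq_ref) / total
    expected_ref = size_ref * (freq_page + freq_ref) / total
    ll = 0.0
    if freq_page > 0:
        ll += freq_page * math.log(freq_page / expected_page)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * ll


def keyness_ranking(page_counts, ref_counts, top_k=5):
    """Rank words that are relatively more frequent on the page than in the reference."""
    size_page = sum(page_counts.values())
    size_ref = sum(ref_counts.values())
    scored = []
    for word, freq_page in page_counts.items():
        freq_ref = ref_counts.get(word, 0)
        # Keep only words overused on the page relative to the reference corpus.
        if freq_page / size_page > freq_ref / size_ref:
            score = log_likelihood(freq_page, size_page, freq_ref, size_ref)
            scored.append((word, score))
    scored.sort(key=lambda kv: kv[1], reverse=True)
    return scored[:top_k]
```

Here `page_counts` and `ref_counts` map each word to its raw frequency, and both are assumed to be non-empty. A word that is common everywhere scores near zero, while a word that is frequent on the page but rare in the reference corpus scores high, which is what makes it a keyword candidate.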
Keywords/Search Tags: keyword extraction, TextRank, Log-Likelihood, word cloud, Chrome extension