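Overview of Automatic Keyword Extraction:
Automatic keyword extraction is an automated technique for identifying meaningful and representative fragments or words (曾元显, 1997). Because keywords are the smallest units that express a document's topical meaning, most automatic processing of unstructured documents, such as automatic summarization, automatic classification, automatic clustering, relevance feedback, automatic filtering, event detection and tracking, knowledge mining, information visualization, concept retrieval, search suggestion, associated knowledge analysis, and automatic question answering, can begin with keyword extraction before any further processing. Keyword extraction can therefore be regarded as a foundational, core technique for automatic document processing. For a detailed discussion of the uses of keywords, see (A. Montejo-Raez & R. Steinberger, 2004).
Most documents do not come with keywords, and manual indexing is laborious, time-consuming, and rather subjective, so automatic keyword indexing is a technique worth studying (李素建, 2004). Researchers in three main fields (see Figure 1) have studied keyword extraction (also called automatic indexing) from different angles: ① library and information science, which approaches the problem mainly from the perspective of resource construction and has provided rich vocabulary resources for subject indexing; ② linguistics, which studies the mechanisms and methods of topic extraction from the perspective of language analysis, using lexical, syntactic, semantic, and discourse knowledge to extract topics at different levels; ③ artificial intelligence, which has studied automatic indexing extensively from the perspective of machine learning, for example by applying heuristic knowledge, machine learning on labeled data, machine learning on unlabeled data, and ensemble learning.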
Figure 1. Research fields of keyword extraction (See: 自动标引研究的回顾与展望)
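As a rough illustration of the unsupervised, statistics-based end of the machine learning line of work, the sketch below ranks the terms of one document by TF-IDF weight against a small collection. The text above does not prescribe this weighting; it is used here only as a familiar baseline. The function names `tokenize` and `tfidf_keywords`, the sample documents, and the regular-expression tokenizer are assumptions made for the example, and Chinese text would need a real word segmenter in place of the tokenizer.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of ASCII letters (a placeholder for real word segmentation)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(docs, doc_index, top_k=3):
    """Rank the terms of docs[doc_index] by TF-IDF against the whole collection."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: the number of documents in which each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Relative term frequency inside the target document.
    tf = Counter(tokenized[doc_index])
    total = sum(tf.values())

    scores = {
        term: (count / total) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }
    # Sort by score (descending), then alphabetically so ties are stable.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_k]]


# Hypothetical toy collection, for illustration only.
docs = [
    "keyword extraction finds keyword candidates in the text",
    "the cat sat on the mat",
    "the dog chased the cat in the park",
]
print(tfidf_keywords(docs, 0, top_k=3))
```

A term that occurs in every document (here "the") gets an inverse document frequency of zero and so falls to the bottom of the ranking, which is the intuition behind treating frequent but widespread words as poor keywords.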
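Within library and information science, automatic keyword extraction belongs to the research area of automatic indexing. Automatic indexing can be subdivided into the following two types:
Table 1 classifies and briefly describes automatic indexing methods (including the construction of dictionary resources used for indexing).
Table 1. Classification and brief description of automatic indexing methods (See: 自动标引研究的回顾与展望)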
Our Works:
Figure 5. Comparison of the traditional evaluation method (a) and the improved evaluation method (b) (See: 章成志, 周冬敏, 2009)
References:
• Arturo Montejo Raez, R. Steinberger. Why keywording matters. High Energy Physics Libraries Webzine, Issue 10, December 2004.
• Anjewierden A, Kabel S. Automatic Indexing of Documents with Ontologies. In: Proceedings of the 13th Belgian/Dutch Conference on Artificial Intelligence (BNAIC-01). Amsterdam, Netherlands, 2001: 23-30.
• Ariel Fuxman, Panayiotis Tsaparas, Kannan Achan, Rakesh Agrawal. Using the wisdom of the crowds for keyword generation. In: Proceeding of the 17th international conference on World Wide Web (WWW2008), Beijing, China, 2008: 61-70.
• Baxendale P E. Machine-made Index for Technical Literature--an Experiment. IBM Journal of Research and Development, 1958, 2(4): 354-361.
• Bendersky M, Croft W B. Discovering Key Concepts in Verbose Queries. In: Proceedings of the 31st Annual ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 08), Singapore, 2008, 491-49.
• Boris L, Andreas H. Automatic Multi-label Subject Indexing in a Multilingual Environment. In: Proceedings of 7th European Conference in Research and Advanced Technology for Digital Libraries (ECDL 2003). Trondheim, Norway, 2003: 140-151.
• Chien L F. PAT-tree-based Keyword Extraction for Chinese Information Retrieval. In: Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR1997). Philadelphia, PA, USA, 1997: 50-59.
• Cohen J D. Highlights: Language and Domain-independent Automatic Indexing Terms for Abstracting. Journal of the American Society for Information Science, 1995, 46(3): 162-174.
• Deerwester S, Dumais S T, Landauer T K, Furnas G W, Harshman R A. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 1990, 41(6): 391-407.
• Dennis S F. The Design and Testing of a Fully Automatic Indexing-searching System for Documents Consisting of Expository Text. In: G. Schecter eds. Information Retrieval: a Critical Review, Washington D. C.: Thompson Book Company, 1967: 67-94.
• Devadason F. Computerization of Deep Structure Based Indexes. International Classification, 1985, 12(2): 87-94.
• Dillon M, Gray A S. FASIT: A Fully Automated Syntactically Based Indexing System. Journal of the American Society for Information Science, 1983, 34(2): 99-108.
• Edmundson H P, Oswald V A. Automatic Indexing and Abstracting of the Contents of Documents. Planning Research Corp, Document PRC R-126, ASTIA AD No. 231606, Los Angeles, 1959: 1-142.
• Edmundson H P. New Methods in Automatic Extracting. Journal of the Association for Computing Machinery, 1969, 16(2): 264-285.
• Ercan G, Cicekli I. Using Lexical Chains for Keyword Extraction. Information Processing and Management, 2007, 43(6): 1705-1714.
• Frank E, Paynter G W, Witten I H. Domain-Specific Keyphrase Extraction. In: Proceedings of the 16th International Joint Conference on Artificial Intelligence. Stockholm, Sweden, Morgan Kaufmann, 1999: 668-673.
• Hoyeon Ryu, Gunhee Kim, Kyeongjong Yoo, Sungdo Ha. n-Keyword based Automatic Query Generation. In: Proceedings of International Conference on Hybrid Information Technology - Vol2 (ICHIT'06), 2006: 90-96.
• Hulth A, Megyesi B. A study on automatically extracted keywords in text categorization. In: Proceedings of 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (CoLing/ACL 2006), Sydney, 2006: 537-544.
• Hulth A. Improved Automatic Keyword Extraction Given More Linguistic Knowledge. In: Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing. Sapporo, Japan, 2003: 216-223.
• Keith Humphreys J B. Phraserate: An Html Keyphrase Extractor. Technical Report, University of California, Riverside, 2002: 1-16.
• Lahtinen T. Automatic Indexing: an Approach Using an Index Term Corpus and Combining Linguistic and Statistical Methods. Academic Dissertation, University of Helsinki, Finland, 2000: 34.
• Li Sujian, Wang Houfeng, Yu Shiwen, Xin Chengsheng. News-Oriented Automatic Chinese Keyword Indexing. In: Proceedings of Sighan workshop ACL2003, Sapporo, Japan, 2003: 92-97.
• Lois L E. Experiments in Automatic Indexing and Extracting. Information Storage and Retrieval, 1970, 6: 313-334.
• Luhn H P. A Statistical Approach to Mechanized Encoding and Searching of Literary Information. IBM Journal of Research and Development, 1957, 1(4): 309-317.
• Luhn H P. The Automatic Creation of Literature Abstracts. IBM Journal of Research and Development, 1958, 2(2): 159-165.
• Maron M E, Kuhns J L. On Relevance, Probabilistic Indexing and Information Retrieval. Journal of the Association for Computing Machinery, 1960, 7(3): 216-244.
• Matsuo Y, Ishizuka M. Keyword Extraction from a Single Document Using Word Co-occurrence Statistical Information. International Journal on Artificial Intelligence Tools, 2004, 13(1): 157-169.
• Medelyan O. Automatic Keyphrase Indexing with a Domain-Specific Thesaurus. Master Thesis, University of Freiburg, Germany, 2005: 23-26.
• Salton G, Buckley C. Automatic Text Structuring and Retrieval - Experiments in Automatic Encyclopaedia Searching. In: Proceedings of the Fourteenth SIGIR Conference. New York: ACM, 1991: 21-30.
• Salton G, Wong A, Yang C S. A Vector Space Model for Automatic Indexing. Communications of the ACM, 1975, 18(11): 613-620.
• Salton G, Yang C S, Yu C T. A Theory of Term Importance in Automatic Text Analysis. Journal of the American Society for Information Science, 1975, 26(1): 33-44.
• Salton G, Yang C S. On the Specification of Term Values in Automatic Indexing. Journal of Documentation, 1973, 29(4): 351-72.
• Silva W T, Milidiu R L. Belief Function Model for Information Retrieval. Journal of the American Society for Information Science, 1993, 44(1): 10-18.
• Somol P, Pudil P. Multi-Subset Selection for Keyword Extraction and Other Prototype Search Tasks Using Feature Selection Algorithms. In: Proceedings of the 18th International Conference on Pattern Recognition - Volume 02, 2006: 736-739.
• Tomokiyo T, Hurst M. A Language Model Approach to Keyphrase Extraction. In: Proceedings of the ACL Workshop on Multiword Expressions: Analysis, Acquisition & Treatment. Sapporo, Japan, 2003: 33-40.
• Turney P D. Learning to Extract Keyphrases from Text. NRC Technical Report ERB-1057, National Research Council, Canada, 1999: 1-43.
• Vibhanshu Abhishek, Kartik Hosanagar. Keyword generation for search engine advertising using semantic similarity between terms. In: Proceedings of the ninth international conference on Electronic commerce. Minneapolis, MN, USA, 2007: 89-94.
• Wen-tau Yih, Joshua Goodman, Vitor R. Carvalho. Finding advertising keywords on web pages. In: Proceedings of the 15th international conference on World Wide Web (WWW2006), Edinburgh, Scotland, 2006: 213-222.
• Wenfeng Yang. Chinese keyword extraction based on max-duplicated strings of the documents. In: Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR2002). Tampere, Finland, 2002: 439-440.
• Witten I H, Paynter G W, Frank E, Gutwin C, Nevill-Manning C G. KEA: Practical Automatic Keyphrase Extraction. In: Proceedings of the 4th ACM Conference on Digital Library (DL’99). Berkeley, CA, USA, 1999: 254-26.
• Xiaoyuan Wu, Alvaro Bolivar. Keyword Extraction for Contextual Advertisement. In: Proceeding of the 17th international conference on World Wide Web (WWW2008), Beijing, China, 2008: 1195-1196.
• Yaakov H-K. Automatic Extraction of Keywords from Abstracts. In: Proceedings of the 7th International Conference on Knowledge-Based Intelligent Information and Engineering Systems (KES2003), Oxford, UK, 2003: 843-946.
• Zhang K, Xu H, Tang J, Li J Z. Keyword Extraction Using Support Vector Machine. In: Proceedings of the Seventh International Conference on Web-Age Information Management (WAIM2006). Hong Kong, China, 2006: 85-96.
• 韩客松, 王永成. 中文全文标引的主题词标引和主题概念标引方法. 情报学报, 2001, 20(2): 212-216.
• 侯汉清, 章成志, 郑红. Web概念挖掘中标引源加权方案初探. 情报学报, 24(1): 87-92.
• 李素建, 王厚峰, 俞士汶, 辛乘胜. 关键词自动标引的最大熵模型应用研究. 计算机学报, 2004, 27(9): 1192-1197.
• 马颖华, 王永成, 苏贵洋, 张宇萌. 一种基于字同现频率的汉语文本主题抽取方法. 计算机研究与发展, 2004, 40(6): 874-878.
• 索红光, 刘玉树, 曹淑英. 一种基于词汇链的关键词抽取方法. 中文信息学报, 2006, 20(6): 25-30.
• 曾元显. 关键词自动提取技术与相关词反馈. 中国图书馆学会会报, 1997, 59: 59-64.
• 张庆国, 薛德军, 张振海, 张君玉. 海量数据集上基于特征组合的关键词自动抽取. 情报学报, 2006, 25(5): 587-593.






