Automatic Keyword Extraction

 
        Overview of Automatic Keyword Extraction:
       
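       Automatic keyword extraction is an automated technique for identifying meaningful, representative fragments or words in a text (曾元显, 1997). Because keywords are the smallest units that express the topic of a document, most automatic processing of unstructured documents can begin with keyword extraction before any further processing. This includes automatic summarization, classification, clustering, relevance feedback, filtering, event detection and tracking, knowledge mining, information visualization, concept retrieval, search suggestion, associated-knowledge analysis, and question answering. In this sense, keyword extraction can be regarded as a foundational, core technique for automatic document processing. For a detailed discussion of the uses of keywords, see (A. Montejo-Raez & R. Steinberger, 2004).
       Most documents come without keywords, and manual indexing is laborious, time-consuming, and rather subjective, so automatic keyword indexing is a technique worth studying (李素建, 2004). Researchers in three fields (see Figure 1) have studied keyword extraction (also called automatic indexing) from different angles. Library and information science approaches it mainly from the perspective of resource construction and has supplied rich vocabulary resources for subject indexing. Linguistics studies the mechanisms and methods of topic extraction from the perspective of language analysis, drawing on lexical, syntactic, semantic, and discourse knowledge to extract topics at different levels. Artificial intelligence has studied automatic indexing extensively from the perspective of machine learning, using heuristic knowledge, learning from labeled data, learning from unlabeled data, and ensemble learning.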
 
                                                    
Figure 1.  Research fields of keyword extraction (See: Review and Prospect of Automatic Indexing)
 
       
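        In library and information science, automatic keyword extraction falls within the scope of automatic indexing research. Automatic indexing can be divided into two types:
  •   Automatic keyword extraction, also called automatic extraction indexing: words that express the topic (or content) of a text are extracted directly from the text;
  •   Automatic keyword assignment: the text is annotated with subject headings drawn from a thesaurus.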
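The difference between the two types can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list, the tiny thesaurus, and the function names `extract_keywords` and `assign_keywords` are all hypothetical, and raw term frequency stands in for whatever scoring a real indexer would use.

```python
import re
from collections import Counter

# Hypothetical stop-word list and thesaurus, invented for illustration only.
STOP_WORDS = {"the", "of", "a", "and", "to", "in", "is", "for", "from"}
THESAURUS = {
    # descriptor -> entry terms that trigger it
    "Information retrieval": {"retrieval", "search", "query"},
    "Machine learning": {"learning", "classifier", "training"},
}

def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def extract_keywords(text, top_n=3):
    """Keyword extraction: pick the most frequent non-stop-words of the text itself."""
    counts = Counter(t for t in tokenize(text) if t not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_n)]

def assign_keywords(text):
    """Keyword assignment: return thesaurus descriptors whose entry terms occur in the text."""
    tokens = set(tokenize(text))
    return sorted(d for d, entry_terms in THESAURUS.items() if tokens & entry_terms)

doc = "Query expansion improves search. A query log helps the search engine rank query results."
print(extract_keywords(doc, top_n=2))  # ['query', 'search']
print(assign_keywords(doc))            # ['Information retrieval']
```

Note that the assigned descriptor "Information retrieval" never appears in the text; it comes from the controlled vocabulary, whereas the extracted keywords are always words of the text.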

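        Table 1 classifies and briefly describes automatic indexing methods (including the construction of dictionary resources used for indexing).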

 
Table 1.  Classification and brief description of automatic indexing methods (See: Review and Prospect of Automatic Indexing)

 
 
        Our Works:
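  • Research on a keyword extraction model based on Conditional Random Fields (CRF) (See: Zhang Chengzhi & Wang Huilin et al., 2008). Table 2 lists the features on which keyword extraction is based, Figure 2 shows sample CRF training data for keyword extraction, and Figure 3 shows sample results of extracting keywords from text with the CRF model;
 Table 2.  Features for automatic keyword extraction (Zhang Chengzhi & Wang Huilin et al., 2008)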

 

Figure 2.   Sample CRF training data for keyword extraction
Figure 3.   Sample CRF keyword extraction results
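A CRF treats keyword extraction as sequence labeling: each token of a document receives a label, and the model is trained on token sequences paired with features. The sketch below shows one way such training rows could be prepared. It is not the feature set of Table 2 or the data format of Figure 2, neither of which is reproduced on this page; the B-KW / I-KW / O label scheme, the four features, and all function names are assumptions made for illustration.

```python
def label_tokens(tokens, keywords):
    """Tag tokens with B-KW / I-KW / O, given gold keywords as token lists."""
    labels = ["O"] * len(tokens)
    for kw in keywords:
        n = len(kw)
        for i in range(len(tokens) - n + 1):
            span_is_free = all(lab == "O" for lab in labels[i:i + n])
            if tokens[i:i + n] == kw and span_is_free:
                labels[i] = "B-KW"              # first token of a keyword
                for j in range(i + 1, i + n):
                    labels[j] = "I-KW"          # continuation tokens
    return labels

def token_features(tokens, i, title_tokens):
    """Hypothetical per-token features: word, length, relative position, title membership."""
    return {
        "word": tokens[i],
        "length": len(tokens[i]),
        "rel_pos": round(i / len(tokens), 2),
        "in_title": tokens[i] in title_tokens,
    }

def to_training_rows(tokens, keywords, title_tokens):
    """Pair every token's feature dict with its label, one row per token."""
    labels = label_tokens(tokens, keywords)
    title = set(title_tokens)
    return [(token_features(tokens, i, title), labels[i]) for i in range(len(tokens))]

title = ["keyword", "extraction", "with", "random", "fields"]
body = ["we", "study", "keyword", "extraction", "using", "conditional", "random", "fields"]
gold = [["keyword", "extraction"], ["conditional", "random", "fields"]]

rows = to_training_rows(body, gold, title)
for feats, label in rows:
    print(feats["word"], feats["length"], feats["rel_pos"], feats["in_title"], label)
```

Rows of this kind (one token per line, features followed by the label) are the usual input to CRF toolkits; at prediction time, adjacent B-KW / I-KW tokens are joined back into keyword phrases.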
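  • Research on automatic indexing methods based on ensemble learning (See: Zhang Chengzhi, 2009); Figure 4 shows sample keyword extraction results from ensemble learning;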

    Figure 4.   Sample keyword extraction results from ensemble learning (See: Zhang Chengzhi, 2009)
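One common way to combine several keyword extractors is simple voting: a candidate is kept when enough individual models propose it. The sketch below illustrates that idea only; the combination scheme actually used in the cited work is not described on this page, and the function name `vote_keywords`, the `min_votes` threshold, and the three model outputs are hypothetical.

```python
from collections import Counter

def vote_keywords(candidate_lists, min_votes=2):
    """Keep keywords proposed by at least `min_votes` extractors, most-voted first."""
    votes = Counter()
    for candidates in candidate_lists:
        votes.update(set(candidates))  # each extractor votes at most once per keyword
    kept = [kw for kw, n in votes.items() if n >= min_votes]
    # Sort by descending vote count, then alphabetically for a stable order.
    return sorted(kept, key=lambda kw: (-votes[kw], kw))

# Outputs of three hypothetical base models for the same document.
model_a = ["automatic indexing", "keyword extraction", "thesaurus"]
model_b = ["keyword extraction", "machine learning"]
model_c = ["keyword extraction", "automatic indexing", "corpus"]

print(vote_keywords([model_a, model_b, model_c]))
# ['keyword extraction', 'automatic indexing']
```

Raising `min_votes` trades recall for precision: with `min_votes=3` only candidates that every model agrees on survive.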
     
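  • Research on a general evaluation model for automatic indexing (See: Zhang Chengzhi & Zhou Dongmin, 2009); Figure 5 illustrates the evaluation principle when reference indexing results are available;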

Figure 5.   Comparison of the traditional evaluation method (a) and the improved evaluation method (b) (See: Zhang Chengzhi & Zhou Dongmin, 2009)
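When reference indexing results are available, the conventional evaluation compares the automatically produced keywords with the reference keywords by exact match and reports precision, recall, and F1. The sketch below shows that conventional calculation only; the improved method of Figure 5(b) is not described on this page and is not reproduced here. The function name `evaluate` and the sample keyword lists are illustrative assumptions.

```python
def evaluate(extracted, reference):
    """Exact-match precision, recall, and F1 of extracted keywords against a reference set."""
    extracted_set, reference_set = set(extracted), set(reference)
    correct = len(extracted_set & reference_set)
    precision = correct / len(extracted_set) if extracted_set else 0.0
    recall = correct / len(reference_set) if reference_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

extracted = ["keyword extraction", "automatic indexing", "corpus", "thesaurus"]
reference = ["keyword extraction", "automatic indexing"]

precision, recall, f1 = evaluate(extracted, reference)
print(precision, recall, round(f1, 3))  # 0.5 1.0 0.667
```

Exact matching gives no credit to near-synonyms or partial matches of the reference keywords, which is one known limitation of this style of evaluation.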



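  • A study of how strongly document-structure information affects the expression of topics, chiefly the choice of weights for structure information during indexing; structure information mainly includes paragraphs and the positions in which words occur (See: Hou Hanqing, Zhang Chengzhi & Zheng Hong, 2005);
  • Research on extracting implicit subject headings with KNN and Citation-KNN (automatic keyword assignment): KNN or Citation-KNN finds the set of documents similar to the text to be indexed, and the subject headings that occur frequently in that set are assigned to the text (See: Zhang Chengzhi & Xu Hongjiao, 2009; Zhang Qingguo & Zhang Chengzhi, 2008);
  • Automatic indexing systems, including the automatic indexing and classification system (V1.5) for the National Index to Chinese Newspapers & Periodicals (《全国报刊索引》);
  • A keyword extraction test set for evaluating automatic keyword extraction results, consisting of keyword-annotated text corpora that carry document-structure information such as paragraphs.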
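The KNN-based assignment described in the list above can be sketched as follows: find the indexed documents most similar to the new text, then assign the subject headings that recur among those neighbours. This is a minimal illustration with plain KNN over bag-of-words cosine similarity; Citation-KNN is not reproduced, and the function names, the `k` and `min_support` parameters, and the toy collection are all assumptions.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def knn_assign(query_tokens, indexed_docs, k=2, min_support=2):
    """Assign subject headings shared by at least `min_support` of the k nearest documents."""
    query_vec = Counter(query_tokens)
    ranked = sorted(
        indexed_docs,
        key=lambda d: cosine(query_vec, Counter(d["tokens"])),
        reverse=True,
    )
    neighbours = ranked[:k]
    support = Counter(term for d in neighbours for term in set(d["subjects"]))
    return sorted(term for term, n in support.items() if n >= min_support)

# Toy collection of already-indexed documents (hypothetical data).
indexed_docs = [
    {"tokens": ["crf", "sequence", "labeling", "keyword"],
     "subjects": ["Machine learning", "Automatic indexing"]},
    {"tokens": ["keyword", "labeling", "model", "training"],
     "subjects": ["Machine learning"]},
    {"tokens": ["library", "catalog", "shelf"],
     "subjects": ["Library management"]},
]

print(knn_assign(["keyword", "labeling", "crf"], indexed_docs))
# ['Machine learning']
```

The assigned heading "Machine learning" does not occur among the tokens of the new text, which is what makes it an implicit subject heading: it is inferred from similar documents rather than extracted.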
      
        Our Projects: 
  • National Key Project of Scientific and Technical Supporting Programs funded by Ministry of Science & Technology of China: Information Service System of Scientific and Technical Documents: Key Techniques and Application Demonstration (No. 2006BAH03B02, 2006BAH03B04) (2006-2009)
  • Youth Research Support Fund funded by Nanjing University of Science & Technology (No. JGQN0701): Domain Ontology Learning (2007-2009), PI
  • Scientific Research Foundation funded by Nanjing University of Science & Technology (No. AB41123): Key Techniques in Topic Clustering (2007-2009), PI
  • Graduate Innovation Program of Jiangsu Province: Topic Clustering and Its Application (2006-2007), PI
  • The National Philosophy and Social Sciences Fund Project (No. 02BTQ012): Automatic Indexing & Classification System for Web Pages Based on the Knowledge Database (2002-2004)
  • Project funded by Shanghai Library: Automatic Indexing & Classification System for National Index to Chinese Newspapers & Periodicals (2004-2005)
        Our Publications:    
  1. Zhang Chengzhi, Bai Zhentian. Automatic Indexing and Classification of Documents. Nanjing, China: Southeast University Press, 2009. (in Chinese)
  2. Zhang Chengzhi. Topic Clustering and Its Application. Ph.D. Dissertation, Department of Information Management, Nanjing University, Nanjing, China, 2007.  (in Chinese with English abstract) [Abstract].
  3. Zhang Chengzhi. Web Concept Mining Based on Text Layer Model. M.S. Thesis, Department of Information Management, Nanjing Agricultural University, Nanjing, China, 2002. (in Chinese with English abstract)  [Abstract] [PPT]. 
  4. Zhang Chengzhi. Combining Statistical Machine Learning Models to Extract Keywords from Chinese Documents. In: Proceedings of the 5th International Conference on Advanced Data Mining and Applications (ADMA2009). Lecture Notes in Artificial Intelligence, 5678/2009. Springer Berlin / Heidelberg, Beijing, China, 2009: 745-754.
  5. Zhang Chengzhi, Wang Huilin, Liu Yao, Wu Dan, et al. Automatic Keyword Extraction from Documents Using Conditional Random Fields. Journal of Computational Information Systems, 2008, 4(3): 1169-1180. [PPT]
  6. Zhang Chengzhi & Xu Hongjiao. Using Citation-KNN for Automatic Keyword Assignment. In: Proceedings of 2009 International Conference on Electronic Commerce and Business Intelligence (ECBI2009), Beijing, China, 2009: 131-134.
  7. Zhang Qingguo,  Zhang Chengzhi. Automatic Chinese Keyword Extraction Based on KNN for Implicit Subject Extraction. In: Proceedings of the International Symposium on Knowledge Acquisition and Modeling (KAM 2008), Wuhan, China, 2008: 689-692.
  8. Zhang Chengzhi, Liu Yao, Wang Huilin. Automatic Implicit Semantic Subject Extraction Based on Citation-KNN. In: Proceedings of the 9th Chinese Lexical Semantics Workshop (CLSW2008), Singapore, 2008: 371-379. (in Chinese with English abstract)
  9. Zhang Chengzhi, Zhou Dongmin. General Evaluation Model for Automatic Indexing. Journal of the China Society for Scientific and Technical Information, 2009, 28(1): 40-47. (in Chinese with English abstract)
  10. Zhang Chengzhi, Su Xinning. Automatic Indexing Model Based on Conditional Random Fields. Journal of Library Science in China, 2008, 34(5): 89-94, 99. (in Chinese with English abstract)
  11. Zhang Qingguo, Zhang Chengzhi, Xue Dejun, Zhang Junyu. Automatic Keyword Extraction Based on KNN for Implicit Subject Extraction. Journal of the China Society for Scientific and Technical Information, 2009, 28(2): 163-168. (in Chinese with English abstract)
  12. Hou Hanqing, Zhang Chengzhi, Zheng Hong. Research on the Weighting of Indexing Sources for Web Concept Mining. Journal of the China Society for Scientific and Technical Information, 2005, 24(1): 87-92. (in Chinese with English abstract)
  13. Zhang Chengzhi. Review and Prospect of Automatic Indexing. New Technology of Library and Information Service, 2007, (11): 33-39. (in Chinese with English abstract)
  14. Zhang Chengmin, Xu Xin, Zhang Chengzhi. Analysis of the Factors Affecting the Performance of CRF-based Keywords Extraction Model. New Technology of Library and Information Service, 2008, (06): 34-40. (in Chinese with English abstract)
        References:

  • Arturo Montejo Raez & R. Steinberger, "Why keywording matters". High Energy Physics Libraries Webzine, Issue 10, December 2004. URL: <http://library.web.cern.ch/library/Webzine/10/papers/2/>
  • Anjewierden A, Kabel S. Automatic Indexing of Documents with Ontologies. In: Proceedings of the 13th Belgian/Dutch Conference on Artificial Intelligence (BNAIC-01). Amsterdam, Netherlands, 2001: 23-30.
  • Ariel Fuxman, Panayiotis Tsaparas, Kannan Achan, Rakesh Agrawal. Using the wisdom of the crowds for keyword generation. In: Proceedings of the 17th international conference on World Wide Web (WWW2008), Beijing, China, 2008: 61-70.
  • Baxendale P E. Machine-made Index for Technical Literature--an Experiment. IBM Journal of Research and Development, 1958, 2(4): 354-361.
  • Bendersky, M. and Croft, W. B. Discovering Key Concepts in Verbose Queries. In: Proceedings of the 31st Annual ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 08), Singapore, 2008: 491-49.
  • Boris L, Andreas H. Automatic Multi-label Subject Indexing in a Multilingual Environment. In: Proceedings of 7th European Conference in Research and Advanced Technology for Digital Libraries (ECDL 2003). Trondheim, Norway, 2003: 140-151.
  • Chien L F. PAT-tree-based Keyword Extraction for Chinese Information Retrieval. In: Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR1997). Philadelphia, PA, USA, 1997: 50-59.
  • Cohen J D. Highlights: Language and Domain-independent Automatic Indexing Terms for Abstracting. Journal of the American Society for Information Science, 1995, 46(3): 162-174.
  • Deerwester S, Dumais S T, Landauer T K, Furnas G W, Harshman R A. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 1990, 41(6): 391-407.
  • Dennis S F. The Design and Testing of a Fully Automatic Indexing-searching System for Documents Consisting of Expository Text. In: G. Schecter eds. Information Retrieval: a Critical Review, Washington D. C.: Thompson Book Company, 1967: 67-94.
  • Devadason F. Computerization of Deep Structure Based Indexes. International Classification, 1985, 12(2): 87-94.
  • Dillon M, Gray A S. FASIT: A Fully Automated Syntactically Based Indexing System. Journal of the American Society for Information Science, 1983, 34(2): 99-108.
  • Edmundson H P, Oswald V A. Automatic Indexing and Abstracting of the Contents of Documents. Planning Research Corp, Document PRC R-126, ASTIA AD No. 231606, Los Angeles, 1959: 1-142.
  • Edmundson H P. New Methods in Automatic Extracting. Journal of the Association for Computing Machinery, 1969, 16(2): 264-285.
  • Ercan G, Cicekli I. Using Lexical Chains for Keyword Extraction. Information Processing and Management, 2007, 43(6): 1705-1714.
  • Frank E, Paynter G W, Witten I H. Domain-Specific Keyphrase Extraction. In: Proceedings of the 16th International Joint Conference on Artificial Intelligence. Stockholm, Sweden, Morgan Kaufmann, 1999: 668-673.
  • Hoyeon Ryu, Gunhee Kim, Kyeongjong Yoo, Sungdo Ha. n-Keyword based Automatic Query Generation. In: Proceedings of International Conference on Hybrid Information Technology - Vol2 (ICHIT'06), 2006: 90-96.
  • Hulth, A. & Megyesi, B. (2006). A study on automatically extracted keywords in text categorization. In: Proceedings of 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (CoLing/ACL 2006), Sydney, 2006: 537-544.
  • Hulth A. Improved Automatic Keyword Extraction Given More Linguistic Knowledge. In: Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing. Sapporo, Japan, 2003: 216-223.
  • Keith Humphreys J B. Phraserate: An Html Keyphrase Extractor. Technical Report, University of California, Riverside, 2002: 1-16.
  • Lahtinen T. Automatic Indexing: an Approach Using an Index Term Corpus and Combining Linguistic and Statistical Methods. Academic Dissertation, University of Helsinki, Finland, 2000: 34.
  • Li Sujian, Wang Houfeng, Yu Shiwen, Xin Chengsheng. News-Oriented Automatic Chinese Keyword Indexing. In: Proceedings of Sighan workshop ACL2003, Sapporo, Japan, 2003: 92-97.
  • Lois L E. Experiments in Automatic Indexing and Extracting. Information Storage and Retrieval, 1970, 6: 313-334.
  • Luhn H P. A Statistical Approach to Mechanized Encoding and Searching of Literary Information. IBM Journal of Research and Development, 1957, 1(4): 309-317.
  • Luhn H P. The Automatic Creation of Literature Abstracts. IBM Journal of Research and Development, 1958, 2(2): 159-165.
  • Maron M E, Kuhns J L. On Relevance, Probabilistic Indexing and Information Retrieval. Journal of the Association for Computing Machinery, 1960, 7(3): 216-244.
  • Matsuo Y, Ishizuka M. Keyword Extraction from a Single Document Using Word Co-occurrence Statistical Information. International Journal on Artificial Intelligence Tools, 2004, 13(1): 157-169.
  • Medelyan O. Automatic Keyphrase Indexing with a Domain-Specific Thesaurus. Master Thesis, University of Freiburg, Germany, 2005: 23-26.
  • Salton G, Buckley C. Automatic Text Structuring and Retrieval: Experiments in Automatic Encyclopaedia Searching. In: Proceedings of the Fourteenth SIGIR Conference. New York: ACM, 1991: 21-30.
  • Salton G, Wong A, Yang C S. A Vector Space Model for Automatic Indexing. Communications of the ACM, 1975, 18(11): 613-620.
  • Salton G, Yang C S, Yu C T. A Theory of Term Importance in Automatic Text Analysis. Journal of the American Society for Information Science, 1975, 26(1): 33-44.
  • Salton G, Yang C S. On the Specification of Term Values in Automatic Indexing. Journal of Documentation, 1973, 29(4): 351-72.
  • Silva W T, Milidiu R L. Belief Function Model for Information Retrieval. Journal of the American Society for Information Science, 1993, 44(1): 10-18.
  • Somol P, Pudil P. Multi-Subset Selection for Keyword Extraction and Other Prototype Search Tasks Using Feature Selection Algorithms. In: Proceedings of the 18th International Conference on Pattern Recognition - Volume 02, 2006: 736-739.
  • Tomokiyo T, Hurst M. A Language Model Approach to Keyphrase Extraction. In: Proceedings of the ACL Workshop on Multiword Expressions: Analysis, Acquisition & Treatment. Sapporo, Japan, 2003: 33-40.
  • Turney P D. Learning to Extract Keyphrases from Text. NRC Technical Report ERB-1057, National Research Council, Canada. 1999: 1-43.
  • Vibhanshu Abhishek, Kartik Hosanagar. Keyword generation for search engine advertising using semantic similarity between terms. In: Proceedings of the ninth international conference on Electronic commerce. Minneapolis, MN, USA, 2007: 89-94. 
  • Wen-tau Yih, Joshua Goodman, Vitor R. Carvalho. Finding advertising keywords on web pages. In: Proceedings of the 15th international conference on World Wide Web (WWW2006), Edinburgh, Scotland, 2006: 213-222. 
  • Wenfeng Yang. Chinese keyword extraction based on max-duplicated strings of the documents. In: Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR2002). Tampere, Finland, 2002: 439-440.
  • Witten I H, Paynter G W, Frank E, Gutwin C, Nevill-Manning C G. KEA: Practical Automatic Keyphrase Extraction. In: Proceedings of the 4th ACM Conference on Digital Library (DL’99). Berkeley, CA, USA, 1999: 254-26.
  • Xiaoyuan Wu & Alvaro Bolivar. Keyword Extraction for Contextual Advertisement. In: Proceedings of the 17th international conference on World Wide Web (WWW2008), Beijing, China, 2008: 1195-1196.
  • Yaakov H-K. Automatic Extraction of Keywords from Abstracts. In: Proceedings of the 7th International Conference on Knowledge-Based Intelligent Information and Engineering Systems (KES2003), Oxford, UK, 2003: 843-946.
  • Zhang K, Xu H, Tang J, Li J Z. Keyword Extraction Using Support Vector Machine. In: Proceedings of the Seventh International Conference on Web-Age Information Management (WAIM2006). Hong Kong, China, 2006: 85-96.
  • 韩客松, 王永成. 中文全文标引的主题词标引和主题概念标引方法. 情报学报, 2001, 20(2): 212-216.
  • 侯汉清, 章成志, 郑红. Web概念挖掘中标引源加权方案初探. 情报学报, 2005, 24(1): 87-92.
  • 李素建, 王厚峰, 俞士汶, 辛乘胜. 关键词自动标引的最大熵模型应用研究. 计算机学报, 2004, 27(9): 1192-1197.
  • 马颖华, 王永成, 苏贵洋, 张宇萌. 一种基于字同现频率的汉语文本主题抽取方法. 计算机研究与发展, 2004, 40(6): 874-878.
  • 索红光, 刘玉树, 曹淑英. 一种基于词汇链的关键词抽取方法. 中文信息学报, 2006, 20(6): 25-30.
  • 曾元显. 关键词自动提取技术与相关词反馈. 中国图书馆学会会报, 1997, 59: 59-64.
  • 张庆国, 薛德军, 张振海, 张君玉. 海量数据集上基于特征组合的关键词自动抽取. 情报学报, 2006, 25(5): 587-593.

 

 

 
 
 
 
 