Overview of Topic Clustering:
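Topic clustering combines the subject (topic) method from information organization with the clustering methods of data mining. Applied to text clustering, it reduces feature dimensionality, places the clustering features under semantic control, and describes the clustering results by topic, which improves clustering quality and makes clustering methods more practical. The main steps are topic extraction, topic-based clustering, and topic-based description of the clustering results. Topic clustering therefore involves three key technical problems: topic extraction, text clustering, and cluster description.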
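The three steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the method used in the publications listed on this page: TF-IDF top-k selection stands in for topic extraction, a single-pass cosine-threshold procedure stands in for topic-based clustering, and the heaviest centroid terms stand in for the cluster description. The function names, the parameters `k`, `threshold` and `m`, and whitespace tokenization (which would not work for unsegmented Chinese text) are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def extract_topics(docs, k=3):
    """Step 1 (topic extraction): keep each document's k highest TF-IDF terms.

    Assumption: TF-IDF is a simple stand-in for a real topic extractor.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores = {t: tf[t] * math.log(n / df[t]) for t in tf}
        # Sort by descending score, then alphabetically, for deterministic output.
        ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
        # Keeping only k positive-weight terms is the dimensionality reduction.
        vectors.append({t: s for t, s in ranked[:k] if s > 0})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_by_topics(vectors, threshold=0.3):
    """Step 2 (topic-based clustering): single-pass clustering in topic space.

    Each document joins the most similar existing cluster if the similarity
    reaches the threshold; otherwise it starts a new cluster.
    """
    clusters = []
    for index, vec in enumerate(vectors):
        best = max(clusters, key=lambda c: cosine(vec, c["centroid"]), default=None)
        if best is not None and cosine(vec, best["centroid"]) >= threshold:
            best["members"].append(index)
            best["centroid"].update(vec)  # centroid = sum of member vectors
        else:
            clusters.append({"members": [index], "centroid": Counter(vec)})
    return clusters


def describe_clusters(clusters, m=2):
    """Step 3 (cluster description): label each cluster with its m heaviest terms."""
    labels = []
    for cluster in clusters:
        ranked = sorted(cluster["centroid"].items(), key=lambda item: (-item[1], item[0]))
        labels.append([term for term, _ in ranked[:m]])
    return labels
```

Calling `extract_topics`, then `cluster_by_topics`, then `describe_clusters` on a list of strings gives the document indices of each cluster and a short term list describing it. Because the clustering runs on the reduced topic-term vectors rather than on the full vocabulary, the same terms that drive the grouping are available afterwards as the description.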
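The basic principle of topic clustering is shown in Fig. 1. Our results in this research direction have been applied to the construction of the CNKI topic digital library (Fig. 2) and to the detection of disciplinary research hotspots and the prediction of their research trends (Fig. 3). Ongoing work includes refining the theory and methods of topic clustering and studying its application to tasks such as online public opinion analysis and patent content analysis.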

Fig. 1. Mapping among Topic, Word and Document Space

Fig. 2. Sample of Topic Digital Library (TDL)

Fig. 3. Samples of Hotspots and Their Trends
Our Projects:
- National Key Project of Scientific and Technical Supporting Programs funded by Ministry of Science & Technology of China: Information Service System of Scientific and Technical Documents: Key Techniques and Application Demonstration (No. 2006BAH03B02, 2006BAH03B04) (2006-2009)
- Scientific Research Foundation funded by Nanjing University of Science & Technology (No. AB41123): Key Techniques in Topic Clustering (2007-2009), PI
- Graduate Innovation Program of Jiangsu Province: Topic Clustering and Its Application (2006-2007), PI
Our Publications:
- Zhang Chengzhi. Topic Clustering and Its Application. Ph.D. Dissertation, Department of Information Management, Nanjing University, Nanjing, China, 2007. [Abstract].
- Zhang Chengzhi, Wang Huilin, Liu Yao. Document Clustering Description Extraction and Its Application. In: Proceedings of the 22nd International Conference on the Computer Processing of Oriental Languages (ICCPOL2009). Lecture Notes in Computer Science, 5459/2009. Springer Berlin / Heidelberg, Hong Kong, China, 2009: 370-377.
- Zhang Chengzhi, Wang Huilin, Xu Hongjiao, Wu Dan. Clustering Description Learning: A Comparative Study. Journal of Information and Computation Science, 2009, 6(3): 1181-1192. [PPT]
- Zhang Chengzhi, Wang Huilin, Yao Liu, Wu Dan, et al. Automatic Keyword Extraction from Documents Using Conditional Random Fields. Journal of Computational Information Systems, 2008, 4(3): 1169-1180. [PPT]
- Zhang Chengzhi, Wu Dan. Concept Extraction and Clustering for Topic Digital Library Construction. In: Proceedings of Workshop on Natural Language Processing and Ontology Engineering (NLPOE 2008) in conjunction with Conference on Web Intelligence (WI-08). Sydney, Australia, 2008: 299-302. [PPT]
- Zhang Chengzhi, Song Wei. Self-adaptive GA, quantitative semantic similarity measures and ontology-based text clustering. In: Proceedings of the 2008 IEEE International Conference on Natural Language Processing and Knowledge Engineering (NLP-KE08). Beijing, China, 2008: 95-102. [PPT]
- Zhang Chengzhi, Xu Hongjiao, Wang Huilin. Clustering Description Extraction Based on Statistical Machine Learning. In: Proceedings of the 2008 International Symposium on Intelligent Information Technology Application (IITA2008), Shanghai, China, 2008: 22-26.
- Zhang Chengzhi, Zhang Qingguo. Topic Navigation Based on Topic Extraction and Clustering. In: Proceedings of the International Symposium on Knowledge Acquisition and Modeling (KAM 2008), Wuhan, China, 2008: 333-339.
- Zhang Chengzhi, Su Xinning & Zhou Dongmin. Document Clustering Using Sample Weighting. In: He YX, Xiao GZ, Sun MS eds. Recent Advance of Chinese Computing Technologies. Singapore: Chinese and Oriental Languages Information Processing Society, Singapore, 2007: 260-265.
- Zhang Chengzhi, Xu Xiaoqing & Su Xinning. Query Similarity Computing Based on System Similarity Measurement. In: Proceedings of the 21st International Conference on the Computer Processing of Oriental Languages (ICCPOL2006), Singapore, 2006: 42-50.
- Zhang Chengzhi, Zhou Dongmin. General Evaluation Model for Automatic Indexing, Journal of the China Society for Scientific and Technical Information, 2009, 28(1): 40-47. (in Chinese with English abstract)
- Zhang Chengzhi, Shi Qinghui, Xue Dejun. Document Clustering Algorithm Based on Sample Weighting, Journal of the China Society for Scientific and Technical Information, 27 (01): 42-48. (in Chinese with English abstract)
- Zhang Chengzhi. Document Clustering Description Algorithm Based on Machine Learning. Journal of the China Society for Scientific and Technical Information, 2009, 28(2): 225-232. (in Chinese with English abstract)
- Zhang Chengzhi. A Model for Chinese String Similarity Based on Multi-Level Features, Journal of the China Society for Scientific and Technical Information, 2005, 24(06): 696-701. (in Chinese with English abstract)
- Zhang Chengzhi, Zhang Qingguo, Shi Qinghui. Construction of Subject Digital Libraries Based on Subject Clustering, Journal of Library Science in China, 2008, 34 (06): 64-69. (in Chinese with English abstract)
- Zhang Chengzhi, Su Xinning. Automatic Indexing Model Based on Conditional Random Fields, Journal of Library Science in China, 2008, 34 (05): 89-94,99. (in Chinese with English abstract)
- Zhang Chengzhi, Hou Hanqing. A Study of Text Layer Model Oriented to Concept Mining, Journal of Library Science in China, 2005, 31 (02): 58-61. (in Chinese with English abstract)
- Zhang Qingguo, Zhang Chengzhi, Xue Dejun, Zhang Junyu. Automatic Keyword Extraction Based on KNN for Implicit Subject Extraction. Journal of the China Society for Scientific and Technical Information, 2009, 28(2): 163-168. (in Chinese with English abstract)
- Hou Hanqing, Zhang Chengzhi, Zheng Hong. Research on the Weighting of Indexing Sources for Web Concept Mining, Journal of the China Society for Scientific and Technical Information, 2005, 24 (1): 87-92. (in Chinese with English abstract)
- Zhang Chengzhi. Survey on Document Clustering Description. New Technology of Library and Information Service, 2009, (02): 1-8. (in Chinese with English abstract)
References:
- Rauber A., Merkl D. SOMLib: A Digital Library System Based on Neural Networks [C]. Proceedings of the Fourth ACM conference on Digital Libraries, Berkeley, CA, USA, 1999: 240-241.
- Chang H-C, Hsu C-C. Using Topic Keyword Clusters for Automatic Document Clustering [J]. IEICE Transactions on Information and Systems, 2005, E88-D: 1852-1860.
- Cutting, D. R., Karger, D. R., Pedersen, J. O. and Tukey, J. W. Scatter/Gather: A cluster-based approach to browsing large document collections [C]. Proceedings of the 15th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'92), Copenhagen, Denmark, 1992: 318-329.
- Hulth A. Improved Automatic Keyword Extraction Given More Linguistic Knowledge [C]. Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, Sapporo, Japan, 2003: 216-223.
- Kang S S. Keyword-based Document Clustering [C]. Proceedings of the 6th International Workshop on Information Retrieval with Asian Languages, Sapporo, Japan, 2003: 132-137.
- Tseng Y-H, Lin C-J, Chen H H, Lin Y-H. Toward Generic Title Generation for Clustered Documents [C]. Proceedings of the 3rd Asia Information Retrieval Symposium, Singapore, 2006: 145-157.
- Turney P D. Learning to Extract Keyphrases from Text. NRC Technical Report ERB-1057[R]. National Research Council, Canada. 1999: 1-43.
- Zhao Y, Karypis G. Topic-driven Clustering for Document Datasets [C]. Proceedings of the Fifth SIAM International Conference on Data Mining, St.Louis, Missouri, 2005: 358-369.
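- Ma Zhanghua, Chen Wenguang, Jin Haiyan, et al. Research on Dynamic Automatic Clustering of Chinese Information Based on a Controlled Vocabulary [J]. Journal of Academic Libraries, 2006, 24(6): 54-60. (in Chinese)
- Sun Xuegang, Chen Qunxiu, Ma Liang. Research on Topic-Based Web Document Clustering [J]. Journal of Chinese Information Processing, 2003, 17(3): 21-26. (in Chinese)
- Zhao Shiqi, Liu Ting, Li Sheng. A Topic-Based Document Clustering Method [J]. Journal of Chinese Information Processing, 2007, 21(2): 58-62. (in Chinese)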