Topic Clustering

 
   Description: 
 
    Information organization methods are the methods used to process, organize, store and retrieve information, and they play an important role in Internet application services. Traditional information organization methods include subject methods, classification methods, and combinations of the two. Because these traditional methods require a great deal of manual intervention, many institutions lack the human, material and financial resources needed to apply them. In the Internet environment, with its massive volume of data, the traditional methods cannot meet users' information needs adequately or in a timely way. On the other hand, artificial intelligence techniques such as data mining and machine learning play an irreplaceable role in Internet application services, but they have limitations of their own. The high-dimensional computation involved makes it difficult to respond to service requests in time. Moreover, because these techniques lack a mechanism for subject control or semantic understanding, their results contain a great deal of noise, which leads to unsatisfactory service quality. The goal of the Semantic Web is to solve this problem. However, ontology construction, the foundational work of the Semantic Web, faces the same difficulties as the traditional information organization methods.
    To resolve these difficulties in applying information organization methods and text mining, and to improve the quality of Internet application services, it is urgent to integrate information organization methods with the learning methods of artificial intelligence. The topic (or subject) clustering method arises from this need by combining the subject method with cluster analysis. Topic extraction is one of the basic tasks in information extraction, and topic clustering is the process of clustering information on the basis of topic extraction. Improving the quality and practical utility of topic extraction and clustering is a pressing problem.
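    As a rough illustration of the two steps described above (topic extraction, then clustering on the extracted topics), the following minimal sketch may help. It is not the method used in this project: it assumes that a "topic" can be approximated by a document's top TF-IDF terms and that documents can be grouped by a simple single-pass pass over the Jaccard overlap of those terms. The function names, the sample documents, and the `top_k` and `threshold` parameters are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def extract_topics(docs, top_k=3):
    """Topic extraction (assumed here to be TF-IDF ranking):
    return the set of top_k highest-scoring terms for each document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    topics = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores = {t: tf[t] * math.log(n / df[t]) for t in tf}
        # Sort by descending score; break ties alphabetically so the
        # result is deterministic.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        topics.append(set(ranked[:top_k]))
    return topics


def jaccard(a, b):
    """Jaccard similarity of two term sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_topic(topics, threshold=0.2):
    """Single-pass clustering: each document joins the first cluster whose
    accumulated topic terms overlap enough, else it starts a new cluster."""
    clusters = []  # each cluster: {"terms": set of terms, "members": doc indices}
    for i, terms in enumerate(topics):
        for c in clusters:
            if jaccard(terms, c["terms"]) >= threshold:
                c["members"].append(i)
                c["terms"] |= terms
                break
        else:
            clusters.append({"terms": set(terms), "members": [i]})
    return clusters


# Hypothetical sample documents, used only for illustration.
docs = [
    "clustering documents clustering algorithm",
    "clustering algorithm documents similarity",
    "ontology semantic web ontology",
    "semantic web ontology construction",
]
topics = extract_topics(docs)
clusters = cluster_by_topic(topics)
print([c["members"] for c in clusters])  # prints [[0, 1], [2, 3]]
```

    In this sketch the accumulated term set of each cluster doubles as a crude cluster description. A real system would likely replace both steps with stronger models, for example a trained keyword extractor and a clustering algorithm that does not depend on document order.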
    The basic principle of topic clustering is shown in Figure 1. Two applications based on topic clustering have been designed and implemented: the Topic Digital Library (TDL, see Figure 2) and Detecting Hotspots of Discipline (DHD, see Figure 3).
          
 

 Fig1.  Mapping among Topic, Word and Document Space

                                                                                             

 

       Fig2.   Sample of Topic Digital Library (TDL)

       Fig3.   Samples of Hotspots and Their Trends

 
       Related Links:
 
        System:

  

        Our Projects:  
  • National Key Project of Scientific and Technical Supporting Programs funded by Ministry of Science & Technology of China: Information Service System of Scientific and Technical Documents: Key Techniques and Application Demonstration  (No. 2006BAH03B02, 2006BAH03B04) (2006-2009)
  • Scientific Research Foundation funded by Nanjing University of Science & Technology (No. AB41123): Key Techniques in Topic Clustering (2007-2009), PI
  • Graduate Innovation Program of Jiangsu Province: Topic Clustering and Its Application (2006-2007), PI

        Our Publications:    

  1. Zhang Chengzhi. Topic Clustering and Its Application. PhD. Dissertation, Department of Information Management, Nanjing University, Nanjing, China, 2007. [Abstract].
  2. Zhang Chengzhi, Wang Huilin, Liu Yao. Document Clustering Description Extraction and Its Application. In: Proceedings of the 22nd International Conference on the Computer Processing of Oriental Languages (ICCPOL2009). Lecture Notes in Computer Science, 5459/2009. Springer Berlin / Heidelberg, Hong Kong, China, 2009: 370-377.
  3. Zhang Chengzhi, Wang Huilin, Xu Hongjiao, Wu Dan. Clustering Description Learning: A Comparative Study. Journal of Information and Computation Science, 2009, 6(3): 1181-1192. [PPT]
  4. Zhang Chengzhi, Wang Huilin, Liu Yao, Wu Dan, et al. Automatic Keyword Extraction from Documents Using Conditional Random Fields. Journal of Computational Information Systems, 2008, 4(3): 1169-1180. [PPT]
  5. Zhang Chengzhi, Wang Huilin, Liu Yao. Document Clustering Description Extraction and Its Application. In: Proceedings of the 22nd International Conference on the Computer Processing of Oriental Languages (ICCPOL2009). Lecture Notes in Computer Science, 5459/2009. Springer Berlin / Heidelberg, Hong Kong, China, 2009: 370-377.
  6. Zhang Chengzhi, Wu Dan. Concept Extraction and Clustering for Topic Digital Library Construction. In: Proceedings of Workshop on Natural Language Processing and Ontology Engineering (NLPOE 2008) in conjunction with Conference on Web Intelligence (WI-08). Sydney, Australia, 2008: 299-302. [PPT]
  7. Zhang Chengzhi, Song Wei. Self-adaptive GA, quantitative semantic similarity measures and ontology-based text clustering. In: Proceedings of the 2008 IEEE International Conference on Natural Language Processing and Knowledge Engineering (NLP-KE08). Beijing, China, 2008: 95-102. [PPT]
  8. Zhang Chengzhi, Xu Hongjiao, Wang Huilin. Clustering Description Extraction Based on Statistical Machine Learning. In: Proceedings of the 2008 International Symposium on Intelligent Information Technology Application (IITA2008), Shanghai, China, 2008: 22-26.
  9. Zhang Chengzhi, Zhang Qingguo. Topic Navigation Based on Topic Extraction and Clustering. In: Proceedings of the International Symposium on Knowledge Acquisition and Modeling (KAM 2008), Wuhan, China, 2008: 333-339.
  10. Zhang Chengzhi, Su Xinning & Zhou Dongmin. Document Clustering Using Sample Weighting. In: He YX, Xiao GZ, Sun MS eds. Recent Advance of Chinese Computing Technologies. Singapore: Chinese and Oriental Languages Information Processing Society, Singapore, 2007: 260-265.
  11. Zhang Chengzhi, Xu Xiaoqing & Su Xinning. Query Similarity Computing Based on System Similarity Measurement. In: Proceedings of the 21st International Conference on the Computer Processing of Oriental Languages (ICCPOL2006), Singapore, 2006: 42-50.
  12. Zhang Chengzhi, Zhou Dongmin. General Evaluation Model for Automatic Indexing, Journal of the China Society for Scientific and Technical Information, 2009, 28(1): 40-47.   (in Chinese with English abstract)   
  13. Zhang Chengzhi, Shi Qinghui, Xue Dejun. Document Clustering Algorithm Based on Sample Weighting, Journal of the China Society for Scientific and Technical Information, 27 (01): 42-48. (in Chinese with English abstract)
  14. Zhang Chengzhi. Document Clustering Description Algorithm Based on Machine Learning. Journal of the China Society for Scientific and Technical Information, 2009, 28(2): 225-232. (in Chinese with English abstract)
  15. Zhang Chengzhi. A Model for Chinese String Similarity Based on Multi-Level Features, Journal of the China Society for Scientific and Technical Information, 2005, 24(06): 696-701. (in Chinese with English abstract)
  16. Zhang Chengzhi, Zhang Qingguo, Shi Qinghui. Construction of Subject Digital Libraries Based on Subject Clustering, Journal of Library Science in China, 2008, 34 (06): 64-69. (in Chinese with English abstract)
  17. Zhang Chengzhi, Su Xinning. Automatic Indexing Model Based on Conditional Random Fields, Journal of Library Science in China, 2008, 34 (05): 89-94,99. (in Chinese with English abstract)  
  18. Zhang Chengzhi, Hou Hanqing. A Study of Text Layer Model Oriented to Concept Mining, Journal of Library Science in China, 2005, 31 (02): 58-61. (in Chinese with English abstract)
  19. Zhang Qingguo, Zhang Chengzhi, Xue Dejun, Zhang Junyu. Automatic Keyword Extraction Based on KNN for Implicit Subject Extraction. Journal of the China Society for Scientific and Technical Information, 2009, 28(2): 163-168. (in Chinese with English abstract)
  20. Hou Hanqing, Zhang Chengzhi, Zheng Hong. Research on the Weighting of Indexing Sources for Web Concept Mining, Journal of the China Society for Scientific and Technical Information, 2005, 24 (1): 87-92. (in Chinese with English abstract)
  21. Zhang Chengzhi. Survey on Document Clustering Description. New Technology of Library and Information Service, 2009, (02): 1-8. (in Chinese with English abstract)
  
       References:
  • Rauber A., Merkl D. SOMLib: A Digital Library System Based on Neural Networks [C]. Proceedings of the Fourth ACM conference on Digital Libraries, Berkeley, CA, USA, 1999: 240-241.
  • Chang H-C, Hsu C-C. Using Topic Keyword Clusters for Automatic Document Clustering [J]. IEICE Transactions on Information and Systems, 2005, E88-D: 1852-1860.
  • Cutting, D. R., Karger, D. R, Pedersen, J. O. and Tukey, J. W. Scatter/Gather: A cluster-based approach to browsing large document collections [C]. Proceedings of the 15th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'92), Copenhagen, Denmark, 1992: 318-329.
  • Hulth A. Improved Automatic Keyword Extraction Given More Linguistic Knowledge[C]. Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, Sapporo, Japan, 2003: 216-223.
  • Kang S S. Keyword-based Document Clustering [C]. Proceedings of the 6th International Workshop on Information Retrieval with Asian Languages, Sapporo, Japan, 2003: 132-137.
  • Tseng Y-H, Lin C-J, Chen H H, Lin Y-H. Toward Generic Title Generation for Clustered Documents[C]. Proceedings of the 3rd Asia Information Retrieval Symposium, Singapore, 2006: 145-157.
  • Turney P D. Learning to Extract Keyphrases from Text. NRC Technical Report ERB-1057[R]. National Research Council, Canada. 1999: 1-43.
  • Zhao Y, Karypis G. Topic-driven Clustering for Document Datasets [C]. Proceedings of the Fifth SIAM International Conference on Data Mining, St.Louis, Missouri, 2005: 358-369.
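  • Ma Zhanghua, Chen Wenguang, Jin Haiyan, et al. Research on Dynamic Automatic Clustering of Chinese Information Based on a Controlled Vocabulary Set [J]. Journal of Academic Libraries, 2006, 24(6): 54-60. (in Chinese)
  • Sun Xuegang, Chen Qunxiu, Ma Liang. Research on Topic-based Web Document Clustering [J]. Journal of Chinese Information Processing, 2003, 17(3): 21-26. (in Chinese)
  • Zhao Shiqi, Liu Ting, Li Sheng. A Topic-based Document Clustering Method [J]. Journal of Chinese Information Processing, 2007, 21(2): 58-62. (in Chinese)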
 
 
 
 
          