Introduction

Commonsense

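Common sense (widely shared, general, non-specialist knowledge) is a basic element of communication and problem solving among people. Unfortunately, although the computing power and storage capacity of modern computers have grown rapidly, computers' lack of common sense remains a well-known shortcoming. Converting millions of pieces of human knowledge into a machine-processable format is a time-consuming and expensive task. After twenty-five years of effort, OpenCyc 2.0 was officially released in July 2009; its knowledge base contains 47,000 "concepts" and 306,000 "facts" carefully authored by knowledge engineers. By contrast, the Open Mind Common Sense project at the MIT Media Lab collected more than one million English sentences from 15,000 contributors within ten years. At present, the content of both knowledge bases is mainly in English and is still far from complete. This research project takes on the development of data collection, verification, and reasoning techniques for multilingual common sense knowledge bases, in order to improve the coverage and correctness of common sense data and the ability to reason over it effectively. In particular, this research aims to combine machine learning techniques with productive social games to build a Chinese common sense knowledge base. The former automatically extract structured knowledge from unstructured and semi-structured online documents, while the latter accumulate the common sense of online social game players. The resulting knowledge base may contain erroneous or contradictory statements.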

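This project will focus on the following three issues:

Common sense knowledge collection:

• Design and build productive social games that collect common sense through crowdsourcing

• Extract data automatically from structured formats, for example with language-independent set expansion techniques

• Learn knowledge automatically with ontology-driven macro-reading of large corpora

Common sense knowledge verification:

• Design and build productive social games that verify automatically learned common sense

• Use semi-supervised learning coupled across related concepts to improve correctness, for example CBL

• Use a multi-agent architecture to cross-verify and improve the common sense produced by machine learning and by social games

Common sense knowledge reasoning:

• Use game-theoretic analysis to improve the quality and quantity of common sense collected by productive social games

• Develop analogical analysis techniques for cross-language concept mapping

• Apply common sense to affective computing, for example an emotion-aware music player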

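This research will draw on the laboratory's past experience in human computation, semantic content analysis, and knowledge representation and reasoning.

The project will produce the first knowledge base and toolset that provide Chinese common sense. All results will be released as open-source software for worldwide use, so that the knowledge base keeps improving and useful applications are built on it.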

Common sense (beliefs or propositions that most people consider prudent and of sound judgment, without reliance on esoteric knowledge, study, or research, but based upon what they see as knowledge held by people "in common", as defined by Merriam-Webster Online) is the fundamental framework of communication and problem solving for human beings. A person who lacks common sense may be considered dull, even ridiculous. Unfortunately, this is just how computers appear today when you try to interact with them in the “human way”.

Translating millions of pieces of human knowledge into a knowledge base can be an expensive, labor-intensive, and time-consuming procedure. The Cyc project was started in 1984 by Douglas Lenat at the Microelectronics and Computer Technology Corporation (MCC) and is now developed by Cycorp. Its objective is to codify, in machine-usable form, the millions of pieces of knowledge that comprise human common sense. Parts of the project are released as OpenCyc; OpenCyc 2.0, released in July 2009, included 47,000 “concepts” and 306,000 assertions encoded by knowledge engineers. The Open Mind Common Sense (OMCS) project, based at the Massachusetts Institute of Technology (MIT) Media Lab, takes a different approach: it aims to build and utilize a large commonsense knowledge base from the contributions of many thousands of people across the Web. Since its founding in 1999, it has accumulated more than a million English facts from over 15,000 contributors.

Our research interests include knowledge collection, verification, and reasoning in multi-language common sense knowledge bases. We currently focus on developing technologies that enhance the coverage, correctness, and reasoning ability of such knowledge bases. We leveraged existing English knowledge as a “guiding knowledge base” to generate high-quality questions and proposed an efficient crowdsourcing mechanism for knowledge collection. Moreover, by coupling text mining with games with a purpose (GWAP), we constructed a self-sustaining knowledge collection loop that improved the precision and coverage of the collected knowledge. In addition, we built an online language learning system on top of our common sense knowledge base.
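As a rough illustration of the “guiding knowledge base” idea, the sketch below turns existing triples into open questions and keeps only the answers that several players agree on. It is a minimal sketch under stated assumptions, not the lab's actual system: the relation names, the `TEMPLATES` table, and the functions `generate_questions` and `accept_answers` are all hypothetical, and the agreement threshold is an arbitrary choice.

```python
from collections import Counter

# Hypothetical question templates keyed by relation name (illustrative only).
TEMPLATES = {
    "UsedFor": "What is {concept} used for?",
    "AtLocation": "Where would you find {concept}?",
    "IsA": "{concept} is a kind of what?",
}


def generate_questions(guiding_triples):
    """Turn (concept, relation, answer) triples from a guiding knowledge
    base into open questions, discarding the already-known answer."""
    seen = set()
    questions = []
    for concept, relation, _answer in guiding_triples:
        key = (concept, relation)
        # Skip relations without a template and repeated (concept, relation) pairs.
        if relation not in TEMPLATES or key in seen:
            continue
        seen.add(key)
        text = TEMPLATES[relation].format(concept=concept)
        questions.append((concept, relation, text))
    return questions


def accept_answers(player_answers, min_votes=2):
    """Keep answers that at least `min_votes` players gave, after
    normalizing case and surrounding whitespace."""
    counts = Counter(answer.strip().lower() for answer in player_answers)
    return sorted(a for a, votes in counts.items() if votes >= min_votes)


if __name__ == "__main__":
    triples = [
        ("knife", "UsedFor", "cutting"),
        ("knife", "UsedFor", "slicing"),
        ("book", "AtLocation", "library"),
    ]
    for _concept, _relation, text in generate_questions(triples):
        print(text)
    print(accept_answers(["Cutting", "cutting ", "stabbing"]))
```

Requiring agreement between players is one simple way to filter noisy crowdsourced input; a real game would likely need richer answer normalization and per-player reliability estimates.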

Our research topics and related mid-term plans are:

Common sense knowledge collection:

• Design and release more GWAPs to gather knowledge through crowdsourcing

• Extract knowledge from the Internet automatically by leveraging text mining technology

• Generate more common sense knowledge using ontology-driven methods

Common sense knowledge verification:

• Design GWAPs that can verify the collected knowledge

• Design a semi-supervised learning algorithm to improve the accuracy of knowledge in the knowledge base

• Construct a mutual verification mechanism, based on multi-agent technology, to check the knowledge generated by different components

Common sense knowledge reasoning:

• Improve the quality of GWAPs through game analysis (leveraging game theory)

• Develop cross-language concept mapping techniques (analogy analysis)

• Apply common sense knowledge to affective computing

We aim to develop the first Chinese common sense knowledge base and its toolkits, and we are willing to share our results with researchers and engineers who want to make computer systems more considerate!