UWIDE

Title: Unsupervised Wrapper Induction and Data Extraction

Leader: Prof. Chia-Hui Chang

Team members: Tian-Sheng Chen, Ming-Chuang Chen, Jhong-Li Ding

Abstract

The problem of web data extraction has been studied for more than ten years. Because of the structural complexity and diversity of web pages, most existing research is limited to record-level data extraction. In addition, the need to extract data from large numbers of web pages makes the task challenging. Although a page-level approach extracts more complete data than a record-level approach, very little research focuses on it because of the difficulty and complexity of the problem. Moreover, existing web data extraction systems require users with an IT background, because they do not provide a user-friendly GUI. In this project, we provide a web data extraction system based on the work of M.-C. Chen and T.-S. Chen, with a user-friendly GUI that improves the training procedure of the schema induction process. The experimental results show that performance on list pages remains high, while on detail pages precision improves by 33.08% and recall by 32.4%. The improved system also achieves the highest recall among the compared systems. In terms of accuracy, our system outperforms TEX with the default threshold. By adjusting the threshold of the models, we can raise the overall accuracy from 94.5% to 98.8%, which is 27% higher than TEX.

Download Program: Download

Demo: (In Chinese)

Publication

  • 陳明權, 陳天盛, 張嘉惠, "A Study of Page-Level Data Extraction Using Path Information to Assist Template Mining", Conference on Technologies and Applications of Artificial Intelligence, 2013. (pdf)

  • 陳天盛, 陳明權, 張嘉惠, "Fast Page-Level Web Data Extraction and Schema Verification", Conference on Technologies and Applications of Artificial Intelligence, 2014. (pdf)

  • 丁中立, 張嘉惠, 張淵琮, "Improving Full Schema Induction for Detail Pages", National Conference on Web Intelligence and Applications, 2015. (pdf)

  • Chia-Hui Chang, Tian-Sheng Chen, Ming-Chun Chen, Jhong-Li Ding, "Efficient Page-Level Data Extraction Via Schema Verification", PAKDD 2016.

  • Example set: s1~s9 from ExAlg, s10~s49 from WIER