A Study of Annotation-Free and Efficient Full Schema Induction and Data Extraction Methods

Post date: Jul 27, 2016 3:40:05 AM

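Summary: Web data extraction is a key technology for Web intelligence and information integration, and annotation-free data extraction is one of the important research topics at top conferences such as WWW and KDD. Compared with record-level data extraction, page-level schema induction can produce a complete page data schema for Deep Web pages that share the same template. It therefore meets a wide variety of extraction needs and can be regarded as a complete solution for data extraction. However, because the problem is relatively complex, there has been relatively little related research. On the other hand, previous page-level extraction systems have focused only on achieving unsupervised extraction, that is, extracting embedded data directly from unlabeled input pages, and have paid little attention to maintaining and verifying the page schema. In fact, extracting data from test pages and verifying the page schema are two sides of the same process, and schema verification can also save extraction time when a large number of pages must be processed. In this project, we plan to propose an annotation-free, page-level schema induction system that induces the complete schema from a subset of a large collection of pages and then applies schema verification to extract the data embedded in the pages. In the first year, we will target full schema induction for list pages. In the second year, for detail pages, we will use information such as HTML5 class, ID, and other attributes to propose a more accurate multiple sequence alignment. In the final year, we will build a Web Data Manipulation Service that integrates the extraction techniques developed, so that general users can conveniently extract, transform, and store Deep Web data.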

Efficient and Scalable Web Data Extraction for Annotation-Free Full Schema Induction

Abstract: Unsupervised Web data extraction from annotation-free Web pages has been one of the main research topics at the WWW and KDD conferences. Page-level Web data extraction provides a complete solution for various kinds of extraction needs; however, very little research focuses on this task because of the difficulty and complexity of the problem. On the other hand, previous page-level systems focus on how to achieve unsupervised data extraction and pay less attention to schema verification, i.e., how to extract data by matching testing pages against an existing schema.

In this project, we emphasize the importance of schema verification for large-scale extraction tasks. Given a large number of Web pages for data extraction, the system uses part of the input pages to train the schema without supervision, and then extracts data from the rest of the input pages through schema verification. While the process resembles supervised training, the approach is actually unsupervised, since users do not need to label the input pages. Therefore, we also call it annotation-free schema training. The benefits of such annotation-free schema training are quick extraction of data from testing pages with the same template and immediate reporting of schema changes if the website has changed its template or schema. Thus, annotation-free schema training and verification could achieve efficient and scalable Web data extraction.
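The train-then-verify workflow described above can be illustrated with a minimal Python sketch. It is not the project's induction algorithm: it assumes each page has already been reduced to a list of (leaf path, text) pairs, that all training pages have identical leaf paths, and that a leaf whose text never varies across training pages is template text. The names `induce_schema`, `extract_with_schema`, and `SchemaMismatch` are hypothetical.

```python
class SchemaMismatch(Exception):
    """Raised when a page does not fit the induced schema."""


def induce_schema(training_pages):
    """Induce a toy schema from unlabeled pages (no annotation needed).

    Each page is a list of (leaf_path, text) pairs. Assumption: all
    training pages share the same sequence of leaf paths.
    """
    paths = [path for path, _ in training_pages[0]]
    for page in training_pages[1:]:
        if [path for path, _ in page] != paths:
            raise ValueError("toy inducer needs identical leaf paths")
    schema = []
    for i, path in enumerate(paths):
        texts = {page[i][1] for page in training_pages}
        # One shared text means template; varying text means a data slot.
        fixed = texts.pop() if len(texts) == 1 else None
        schema.append((path, fixed))
    return schema


def extract_with_schema(schema, page):
    """Verify a page against the schema and return its data-slot values."""
    if len(page) != len(schema):
        raise SchemaMismatch("leaf count differs from schema")
    values = []
    for (path, fixed), (page_path, text) in zip(schema, page):
        if page_path != path:
            raise SchemaMismatch(f"unexpected leaf path: {page_path}")
        if fixed is not None and text != fixed:
            raise SchemaMismatch(f"template text changed at {path}")
        if fixed is None:
            values.append(text)
    return values


# Hypothetical pages, already flattened to (leaf_path, text) pairs.
train = [
    [("body/h1", "Book A"), ("body/p", "Price:"), ("body/span", "10")],
    [("body/h1", "Book B"), ("body/p", "Price:"), ("body/span", "12")],
]
schema = induce_schema(train)
new_page = [("body/h1", "Book C"), ("body/p", "Price:"), ("body/span", "15")]
record = extract_with_schema(schema, new_page)
```

In this sketch a `SchemaMismatch` plays the role of the schema-change report: a page that no longer fits the trained schema is flagged at once instead of being silently mis-extracted.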

In addition to extraction efficiency, schema induction for detail pages is also challenging, since the number of data items is much larger than in record alignment for list pages. In this project, we use the leaf nodes of the input DOM trees as the basic processing units and dynamically adjust their encoding for better alignment and faster processing. Meanwhile, the system also needs to deal with multi-order attribute-value pairs, which are rare in list pages and seriously affect the design of wrapper generation and verification.
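As a rough illustration of using DOM leaf nodes as the processing unit, the following Python sketch collects the text leaves of two pages and aligns them by tag path. It is a simplification under stated assumptions: it uses the standard library's `html.parser` and a pairwise `difflib.SequenceMatcher` alignment in place of the project's multiple sequence alignment, it encodes each leaf only by its tag path, and it does not model the dynamic encoding adjustment. The names `LeafCollector`, `leaf_nodes`, and `align_leaves` are hypothetical.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Tags that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class LeafCollector(HTMLParser):
    """Collect (tag_path, text) pairs for the text leaves of a page."""

    def __init__(self):
        super().__init__()
        self.stack = []   # open tags from the root to the current node
        self.leaves = []  # collected (tag_path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.leaves.append(("/".join(self.stack), text))


def leaf_nodes(html):
    """Return the text leaves of an HTML string in document order."""
    parser = LeafCollector()
    parser.feed(html)
    parser.close()
    return parser.leaves


def align_leaves(page_a, page_b):
    """Pair up leaves of two pages whose tag paths match in sequence."""
    leaves_a, leaves_b = leaf_nodes(page_a), leaf_nodes(page_b)
    paths_a = [path for path, _ in leaves_a]
    paths_b = [path for path, _ in leaves_b]
    matcher = SequenceMatcher(None, paths_a, paths_b, autojunk=False)
    aligned = []
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            aligned.append((leaves_a[block.a + k], leaves_b[block.b + k]))
    return aligned


# Hypothetical detail pages that share one template.
page_a = "<html><body><h1>Book A</h1><p>Price</p><p>10</p></body></html>"
page_b = "<html><body><h1>Book B</h1><p>Price</p><p>12</p><p>Note</p></body></html>"
pairs = align_leaves(page_a, page_b)
```

Aligned leaves whose texts agree across pages are candidates for template text, while leaves whose texts differ are candidates for data items. Leaves left unaligned, such as the extra paragraph in `page_b`, hint at optional content.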

In the last year of this project, we plan to provide a Web-based service, the Web Data Manipulation Service (WDMS), to allow better manipulation of Web data for data reuse. By integrating the full schema induction from the first two years, we can facilitate any desired data extraction and crawling for general users. We expect the proposed system to outperform other page-level extraction systems in terms of schema accuracy and extraction efficiency. The proposed WDMS can also serve as a Web ETL tool that provides efficient and effective extraction, transformation, and loading of data from the Web.