Selected Abstracts

Abstracts for Selected Publications

BigDAS 2020

Borin Min, Hyoseok Oh, Ga-Ae Ryu, Sang Hyun Choi, Carson Kai-Sang Leung, Kwan-Hee Yoo: Image classification for agricultural products using transfer learning. The Eighth International Conference on Big Data Applications and Services (BigDAS 2020) (26-28 November 2020, Busan, South Korea): 48-52

Abstract. We aimed to detect and classify normal and abnormal types from images of agricultural products using computer vision methods. The purpose of this work was to apply a computer vision algorithm to an input image and determine whether the product shown (corn, cucumber, pepper, rice, or strawberry) is normal or abnormal. To this end, we propose a label classification method based on a transfer learning model, which can be applied to similar tasks, works with a small dataset, and requires considerably less training time.

BigDAS 2019

Oluwafemi Sarumi, Carson Leung: Scalable data science and machine learning algorithm for gene prediction. The Seventh International Conference on Big Data Applications and Services (BigDAS 2019) (21-24 August 2019, Jeju Island, South Korea): 118-126. Best Paper of BigDAS 2019.

Abstract. Recent technological advances and scientific discoveries have revolutionized the current era of genomics. The use of next-generation sequencing (NGS) technologies has tremendously reduced the sequencing time and given rise to the production and collection of high volumes of genomic datasets. Predicting protein-coding genes from these copious genomic datasets is significant for the synthesis of protein and the understanding of the regulatory function of the non-coding region. Over the past few years, researchers have developed methods for finding protein-coding genes in the genome of organisms. Notwithstanding, the recent data explosion in genomics accentuates the need for more efficient algorithms for gene prediction. In this paper, we propose a scalable naive Bayes-based machine learning algorithm that is deployed over an Apache Spark cluster for efficient prediction of genes in the genome of eukaryotic organisms. Evaluation results on discovering the protein-coding genes from the human genome chromosome GRCh37 show that our algorithm led to sensitivity at 92.01%, specificity at 94.00%, and accuracy at 97.01%.

ICDM 2019

Umama Dewan, Chowdhury Farhan Ahmed, Carson K. Leung, Redwan Ahmed Rizvee, Deyu Deng, Joglas Souza: An efficient approach for mining weighted frequent patterns with dynamic weights. 19th Industrial Conference on Data Mining (ICDM 2019) (17-21 July 2019, New York, NY, USA): 13-27

Abstract. Weighted frequent pattern (WFP) mining is considered to be more effective than traditional frequent pattern mining because of its consideration of different semantic significance (weights) of items. However, most existing WFP algorithms assume a static weight for each item, which may not realistically hold in many real-life applications. In this paper, we consider the concept of a dynamic weight for each item and address the situations where the weight of an item can change dynamically. We propose a novel tree structure called compact pattern tree for dynamic weights (CPTDW) to mine frequent patterns from databases containing items with dynamic weights. The CPTDW-tree leads to the concept of dynamic tree restructuring to produce a frequency-descending tree structure at runtime. CPTDW also ensures that no non-candidate item can appear before candidate items in any branch of the tree, and thus shortens the construction time for the prefix tree and its conditional trees during the mining process. Furthermore, as it requires only one database scan, it is applicable to interactive, incremental, and/or stream data mining. Evaluation results show that our proposed tree structure and mining algorithm outperform previous methods for dynamic weighted frequent pattern mining.

Redwan Ahmed Rizvee, Md Shahadat Hossain Shahin, Chowdhury Farhan Ahmed, Carson K. Leung, Deyu Deng, Jiaxing Jason Mai: Sliding window based weighted periodic pattern mining over time series data. 19th Industrial Conference on Data Mining (ICDM 2019) (17-21 July 2019, New York, NY, USA): 118-132

Abstract. Sliding windows have been crucial in mining time series. Many existing studies focus on reconstruction of the underlying structure (e.g., suffix tree) for each new window. However, when the window size is large or when the window slides frequently, reconstruction may perform poorly. In this paper, we propose a solution that dynamically updates the structure (rather than reconstructing it for each modification or slide). Moreover, many existing studies rely on the weight of the maximum-weighted item in the database to avoid testing unnecessary patterns when mining weighted periodic patterns from time series, but this may still require many weight checks to determine whether a pattern is a candidate. In this paper, we also propose an additional solution to address this problem by discarding unimportant patterns beforehand so as to speed up the candidate generation process. Evaluation results on real-life datasets show the effectiveness of our two solutions in handling sliding windows and pruning redundant candidate patterns.

Nahian Ashraf, Riddho Ridwanul Haque, Md. Ashraful Islam, Chowdhury Farhan Ahmed, Carson K. Leung, Jiaxing Jason Mai, Bryan H. Wodi: WeFreS: weighted frequent subgraph mining in a single large graph. 19th Industrial Conference on Data Mining (ICDM 2019) (17-21 July 2019, New York, NY, USA): 201-215

Abstract. Considering edge weights during frequent subgraph mining can help us discover more interesting and useful subgraph patterns than its unweighted counterpart. Although some recent works have proposed weight adaptation in frequent subgraph mining from transactional graph databases, the consideration of edge weights in mining subgraph patterns from single large graphs is mostly unexplored. However, such graph structures appear frequently, with instances being found in social networks, citation and collaboration graphs, chemical and biological networks, etc. In this paper, we propose WeFreS, an efficient algorithm for mining weighted frequent subgraphs in edge-weighted single large graphs. WeFreS takes into consideration the weight, or significance, of the interactions between different types of entities, and only outputs subgraphs whose weighted support is greater than a given user-defined threshold. The resulting subgraph patterns are both frequent and significant from the application perspective. Moreover, for efficiency, WeFreS is also equipped with various pruning techniques and optimizations.

HPCS 2017

Peter Braun, Alfredo Cuzzocrea, Carson K. Leung, Adam G.M. Pazdor, Syed K. Tanbeer: Mining frequent patterns from IoT devices with fog computing. 2017 International Conference on High Performance Computing & Simulation (HPCS 2017) (17-21 July 2017, Genoa, Italy): 691-698

Abstract. In the current era of big data, high volumes of a wide variety of data of different veracity can be easily generated or collected at a high velocity from rich sources of data, including devices from the Internet of Things (IoT). Embedded in these big data are useful information and valuable knowledge. Hence, frequent pattern mining and its related research problem of association rule mining, which aim to discover implicit, previously unknown, and potentially useful information and knowledge (in the form of sets of frequently co-occurring items or rules revealing relationships between these frequent sets) from these big data, have drawn the attention of many researchers. For instance, since the introduction of the research problems of association rule mining and frequent pattern mining, numerous information system and engineering approaches have been developed. These include the development of serial algorithms, distributed and parallel algorithms, as well as MapReduce-based big data mining algorithms. These algorithms can be run on local computers, in distributed and parallel environments, as well as on clusters, grids, and clouds. In this paper, we describe some of these algorithms and discuss how to mine frequent patterns or association rules in fogs (i.e., at the edges of the computing network).

CMU J 16(2) (2017)

Prawit Buayai, Tatpong Kantanukul, Carson K. Leung, Kanda Runapongsa Saikaew: Boundary detection of pigs in pens based on adaptive thresholding using an integral image and adaptive partitioning. Chiang Mai University Journal of Natural Sciences (CMU J) 16(2): 145-155 (April-June 2017) EISSN 2465-4337

ABSTRACT

Boundary detection of pigs is important to pig weight estimation, pig feeding behavior analysis, and thermal comfort control. This research proposes a boundary detection method for pigs in a feeder zone with a high-density pen under insufficient and varied lighting, a dirty pen scene, and a small field of view. The method is based on adaptive thresholding using an integral image and adaptive partitioning. First, we segment an original grayscale image with adaptive thresholding using an integral image, and then apply adaptive partitioning with connected components. Afterwards, we utilize the maximum entropy threshold of each partition and merge the results. Our experimental results using 230 images showed that the proposed method led to a high average detection rate in a short execution time. Moreover, to the best of our knowledge, our study is the first attempt to investigate pig boundary detection in a practical farm environment, which involved dirty pen scenes with insufficient and varied lighting.
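
For readers unfamiliar with the first step named in this abstract, the following is a minimal sketch of generic adaptive thresholding with an integral image (summed-area table). It is not the paper's implementation: the window size, the sensitivity parameter t, and all function names are assumptions chosen for illustration, and the adaptive partitioning and maximum entropy steps are not reproduced.

```python
def integral_image(img):
    """Return an (h+1) x (w+1) summed-area table for a 2D list of gray values."""
    h, w = len(img), len(img[0])
    sat = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum
    return sat


def adaptive_threshold(img, window=3, t=0.15):
    """Mark a pixel 1 if it is brighter than (1 - t) times its local mean.

    `window` and `t` are illustrative assumptions, not values from the paper.
    """
    h, w = len(img), len(img[0])
    sat = integral_image(img)
    r = window // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Clip the window to the image borders.
            y0, y1 = max(0, y - r), min(h, y + r + 1)
            x0, x1 = max(0, x - r), min(w, x + r + 1)
            area = (y1 - y0) * (x1 - x0)
            # Window sum from four table lookups, independent of window size.
            total = sat[y1][x1] - sat[y0][x1] - sat[y1][x0] + sat[y0][x0]
            out[y][x] = 1 if img[y][x] * area > total * (1 - t) else 0
    return out


# Hypothetical 3x3 grayscale patch with one bright pixel in the center.
patch = [[10, 10, 10],
         [10, 200, 10],
         [10, 10, 10]]
mask = adaptive_threshold(patch)
```

The integral image lets each local sum be computed from four table lookups, so the cost per pixel does not grow with the window size.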

ACM SAC 2015

Alfredo Cuzzocrea, Carson K. Leung: Upper bounds to expected support for frequent itemset mining of uncertain big data. 30th Annual ACM Symposium on Applied Computing (ACM SAC 2015), Vol. 1 (Artificial intelligence and agents, distributed systems, and information systems) - Data Mining (DM) track (13-17 April 2015, Salamanca, Spain): 919-921

ABSTRACT

Frequent itemset mining aims to discover implicit, previously unknown, and useful knowledge in the form of sets of frequently co-occurring items, events, or objects. To mine frequent itemsets from probabilistic datasets of uncertain data (in which each item in a transaction is usually associated with an existential probability expressing the likelihood of its presence in that transaction), the UF-growth algorithm captures important information about uncertain data in a UF-tree structure so that expected support can be computed for each itemset. An itemset is considered frequent if its expected support meets or exceeds the user-specified threshold. However, a challenge is that the UF-tree can be large. To handle this challenge, several algorithms use smaller trees such that upper bounds to expected support can be computed. In this paper, we examine these upper bounds, and determine which ones provide tighter upper bounds to expected support for frequent itemset mining of uncertain big data.
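
As an illustration of the expected support mentioned in this abstract, the sketch below computes it directly from a toy probabilistic dataset, assuming (as is common in this setting) that items within a transaction are independent. It does not build a UF-tree or compute any upper bound; the dataset, the threshold, and the function name are hypothetical.

```python
from math import prod


def expected_support(itemset, transactions):
    """Sum, over all transactions, of the product of the existential
    probabilities of the items in `itemset` (independence assumed)."""
    total = 0.0
    for t in transactions:
        # An item missing from a transaction has probability 0 of being present.
        total += prod(t.get(item, 0.0) for item in itemset)
    return total


# Hypothetical dataset: each transaction maps an item to its existential probability.
transactions = [
    {"a": 0.9, "b": 0.5},
    {"a": 0.6, "c": 0.8},
    {"a": 0.5, "b": 1.0, "c": 0.4},
]

minsup = 0.9  # hypothetical user-specified threshold

for itemset in (("a",), ("a", "b"), ("b", "c")):
    esup = expected_support(itemset, transactions)
    print(itemset, round(esup, 2), "frequent" if esup >= minsup else "infrequent")
```

Here {a, b} has expected support 0.9 × 0.5 + 0 + 0.5 × 1.0 = 0.95, which meets the threshold, whereas {b, c} reaches only 0.4.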

LWDM 2015

Alfredo Cuzzocrea, Fan Jiang, Carson K. Leung: Frequent subgraph mining from streams of linked graph structured data. 18th EDBT/ICDT Workshops (CEUR-WS 1330) - Fifth International Workshop on Linked Web Data Management (LWDM 2015) (27 March 2015, Brussels, Belgium): 237-244

ABSTRACT

Nowadays, high volumes of high-value data (e.g., semantic web data) can be generated and published at a high velocity. A collection of these data can be viewed as a big, interlinked, dynamic graph structure of linked resources. Embedded in them is implicit, previously unknown, and potentially useful knowledge. Hence, efficient knowledge discovery algorithms for mining frequent subgraphs from these dynamic, streaming graph structured data are in demand. Some existing algorithms require very large memory space to discover frequent subgraphs; some others discover collections of frequently co-occurring edges (which may be disjoint). In contrast, in this paper, we propose algorithms that use limited memory space for discovering collections of frequently co-occurring connected edges. Evaluation results show the effectiveness of our algorithms in frequent subgraph mining from streams of linked graph structured data.

GraphQ 2014

Juan J. Cameron, Alfredo Cuzzocrea, Fan Jiang, Carson K. Leung: Frequent pattern mining from dense graph streams. 17th EDBT/ICDT Workshops (CEUR-WS 1133) - Third International Workshop on Querying Graph Structured Data (GraphQ 2014) (28 March 2014, Athens, Greece): 240-247

ABSTRACT

As technology advances, streams of data can be produced in many applications such as social networks, sensor networks, bioinformatics, and chemical informatics. These kinds of streaming data share a common property: they can be modeled as graph-structured data. Here, the data streams generated by graph data sources in these applications are graph streams. To extract implicit, previously unknown, and potentially useful frequent patterns from these streams, efficient data mining algorithms are in demand. Many existing algorithms capture important streaming data and assume that the captured data can fit into main memory. However, problems arise when such an assumption does not hold (e.g., when the available memory is limited). In this paper, we propose a data structure called DSMatrix for capturing important data from the streams (especially dense graph streams) onto the disk when the memory space is limited. In addition, we also propose two stream mining algorithms that use DSMatrix to mine frequent patterns. The tree-based horizontal mining algorithm applies an effective frequency counting approach to avoid recursive construction of sub-trees as in many tree-based mining algorithms. The vertical mining algorithm makes good use of the information captured in the DSMatrix for mining.
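
To give a feel for vertical mining over a matrix-like structure, the sketch below stores one bit vector per edge and counts the support of an edge set by intersecting those vectors. This is a generic bit-vector illustration, not the DSMatrix layout or the paper's algorithm: the row encoding, the brute-force enumeration, the in-memory storage, and all names are assumptions.

```python
from itertools import combinations


def support(edges, rows, n):
    """Count the graphs (columns) that contain every edge in `edges`."""
    vec = (1 << n) - 1  # start with all n columns set
    for e in edges:
        vec &= rows[e]  # keep only columns where this edge also occurs
    return bin(vec).count("1")


def frequent_edge_sets(rows, n, minsup):
    """Brute-force enumeration of edge sets whose support is at least minsup."""
    result = {}
    names = sorted(rows)
    for k in range(1, len(names) + 1):
        for combo in combinations(names, k):
            s = support(combo, rows, n)
            if s >= minsup:
                result[combo] = s
    return result


# Hypothetical window of 4 graphs; bit i of a row is 1 if the edge occurs in graph i.
rows = {
    "e1": 0b1011,
    "e2": 0b1110,
    "e3": 0b0111,
}
patterns = frequent_edge_sets(rows, n=4, minsup=2)
```

With this toy data, every single edge and every pair of edges is frequent at a threshold of 2, while the set of all three edges occurs in only one graph and is dropped.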

LWDM 2014

Alfredo Cuzzocrea, Carson K. Leung, Syed K. Tanbeer: Mining of diverse social entities from linked data. 17th EDBT/ICDT Workshops (CEUR-WS 1133) - Fourth International Workshop on Linked Web Data Management (LWDM 2014) (28 March 2014, Athens, Greece): 269-274

ABSTRACT

Nowadays, high volumes of valuable data can be easily generated or collected from various data sources at high velocity. As these data are often related or linked, they form a web of linked data. Examples include semantic web and social web. The social web captures social relationships that link people (i.e., social entities) through the World Wide Web. Due to the popularity of social networking sites, more people have joined and more online social interactions have taken place. With a huge number of social entities (e.g., users or friends in social networks), it becomes important to analyze high volumes of linked data and discover those diverse social entities. In this paper, we present (i) a tree-based mining algorithm called DF-growth, along with (ii) its related data structure called DF-tree, which allow users to effectively and efficiently mine diverse friends from social networks. Results of our experimental evaluation showed both the time- and space-efficiency of our scalable DF-growth algorithm, which makes good use of the DF-tree structure.

EDB 2013

Carson Kai-Sang Leung: Mining frequent itemsets from probabilistic datasets. Fifth International Conference on Emerging Databases (EDB 2013) (19-21 August 2013, Jeju Island, South Korea): 137-148

Frequent itemset mining aims to discover implicit, previously unknown, and potentially useful knowledge, in the form of sets of frequently co-occurring items, that is embedded in data. Many algorithms developed in the early days mined frequent itemsets from traditional transaction databases of precise data such as shoppers' market basket data, in which the contents of databases are known. However, we are living in an uncertain world, in which uncertain data can be found in many real-life applications. Hence, in recent years, researchers have paid more attention to frequent itemset mining from probabilistic datasets of uncertain data. In this paper, we present some algorithms for mining frequent itemsets from these probabilistic datasets.

SEBD 2012

Alfredo Cuzzocrea, Carson K. Leung: Frequent itemset mining of distributed uncertain data under user-defined constraints. 12th Italian Symposium on Advanced Database Systems / Sistemi Evoluti per Basi di Dati (SEBD 2012) (24-27 June 2012, Venice, Italy): 243-250

Abstract. Many existing distributed data mining algorithms do not allow users to express the patterns to be mined according to their intention via the use of constraints. Consequently, these unconstrained mining algorithms can yield numerous patterns that are not interesting to users. Moreover, due to inherited measurement inaccuracies and/or network latencies, data are often riddled with uncertainty. These call for constrained mining and uncertain data mining. In this paper, we propose a tree-based system for mining frequent itemsets that satisfy user-defined constraints from a distributed environment such as a wireless sensor network of uncertain data.

RSC 1(1) (2011)

Juan J. Cameron, Carson K. Leung: Mining frequent patterns from precise and uncertain data. Journal of Systems and Computer / Revista de Sistemas e Computação (RSC) 1(1): 3-22 (January-June 2011) ISSN 2237-2903

Abstract

Data mining has gained popularity over the past two decades and has been considered one of the most prominent areas of current database research. Common data mining tasks include finding frequent patterns, clustering and classifying objects, as well as detecting anomalies. To handle these tasks, techniques from different fields (such as database systems, machine learning, statistics, information retrieval, and data visualization) are applied to provide business intelligence (BI) solutions to various real-life problems. In this survey, we focus on the task of frequent pattern mining, which non-trivially extracts implicit, previously unknown, and potentially useful information in the form of frequently occurring sets of items. Mined frequent patterns can be considered as building blocks for association rules, which help reveal associative relationships between items or events on the antecedent and the consequent of rules. Here, we describe some classical algorithms, as well as some recent innovative algorithms, for mining precise data (in which users are certain about the presence or absence of data items) and uncertain data (in which users are uncertain about the presence or absence of data items and they only know that data items probably occur).

JCIS (CIEF) 2005

Carson K.-S. Leung, Ruppa K. Thulasiram, Dmitri A. Bondarenko: Using data mining techniques for detecting noises and pre-processing financial time series. Eighth Joint Conference on Information Sciences (JCIS 2005, Vol. 2) - Fourth International Conference on Computational Intelligence in Economics and Finance (CIEF/CEF 2005) (21-26 July 2005, Salt Lake City, UT, USA): 1177-1180 (or conference disc 1138-1141)

Abstract

In this paper, we propose a system to detect noises and to pre-process financial time series. This novel system combines a statistical algorithm with a data mining algorithm. We implemented and tested both algorithms on real-life historical financial time series consisting of security prices with outliers. We observed the strengths and weaknesses of each of the two algorithms, and then developed a hybrid algorithm to overcome the weaknesses of the two algorithms. Consequently, the resulting (processed) datasets can be used as input for models used in forecasting future security prices and in predicting future market behaviour.

HPCS 2004

Carson Kai-Sang Leung: Efficient parallel mining of constrained frequent patterns. 18th Annual International Symposium on High Performance Computing Systems and Applications (HPCS 2004) (16-19 May 2004, Winnipeg, MB, Canada): 73-82

Abstract—Since its introduction, frequent-pattern mining has been generalized to many forms, which include parallel mining and constrained mining. The use of constraints permits user focus and guidance, enables user exploration and control, and leads to effective pruning of the search space and efficient mining of frequent patterns. In this paper, we focus on an important class of constraints, called succinct constraints, that comprises a majority of constraints. Specifically, we propose a novel parallel algorithm, called ParFPS, for Parallel tree-based mining of Frequent Patterns satisfying Succinct constraints. The proposed algorithm is a non-trivial integration of parallel computing with constraint pushing in a tree-based mining framework. ParFPS avoids the generate-and-test paradigm by exploiting succinctness properties of the constraints in a parallel environment. As a result, in terms of functionality, our algorithm is capable of handling not only succinct aggregate constraints but also many other succinct constraints in general. In terms of performance, our algorithm is more efficient and effective than many existing tree-based frequent-pattern mining algorithms.