Thesis supervision

By Ninh Pham

The following postgraduate research topics are intended for BSc(Hons) and MSc students enrolled at the University of Auckland. Each topic contains both theoretical and practical components and requires a solid background in algorithms (engineering and/or theoretical aspects) and big data analytics (machine learning and/or data mining techniques).

You are also welcome to present your own research proposal addressing computational or statistical challenges of big data.

In preparation, you should have high grades in algorithms or machine learning papers, e.g. CS 320, 752, 753, 760, or STATS 762, 784. If you wish to pursue one of the following topics:

1) Send me your CV, transcript, and a short description of the kind of thesis you imagine writing, and

2) Contact me to set up a meeting well in advance of the start of the work.

Scale up density-based clustering with GPU

Density-based clustering (e.g. DBSCAN) is a popular family of clustering algorithms because it supports arbitrary distance measures. While popular implementations of these clustering algorithms are CPU-based, this project will exploit the parallel computing power of GPUs to accelerate clustering performance.

We will study and advance state-of-the-art randomized algorithms that parallelize well. We will design and implement these solutions with the CUDA library, and compare their performance with a state-of-the-art GPU-based clustering library [1].
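To make the computational bottleneck concrete, the following is a minimal CPU sketch of DBSCAN in NumPy. The O(n²) pairwise-distance step is exactly the work a GPU implementation would spread across CUDA threads; parameter names and structure here are illustrative, not taken from any particular library.

```python
import numpy as np

def dbscan(X, eps=0.5, min_pts=3):
    """Minimal DBSCAN: returns cluster ids (>= 0) or -1 for noise.

    The full pairwise distance matrix below is the O(n^2) bottleneck
    that a GPU implementation would parallelize."""
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    core = np.array([len(nb) >= min_pts for nb in neighbors])

    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue
        # grow a new cluster from core point i by breadth-first expansion
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if core[j]:
                    queue.extend(neighbors[j])
        cluster += 1
    return labels
```

On two well-separated blobs plus an isolated point, this assigns one label per blob and marks the isolated point as noise.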

[1] https://developer.nvidia.com/blog/gpu-accelerated-hierarchical-dbscan-with-rapids-cuml-lets-get-back-to-the-future/

Prerequisites: C/C++, Python, CUDA library

Approximate nearest neighbor search with GPU

Approximate nearest neighbor search (ANNS) is a central problem in many areas of computer science, e.g. recommender systems, large-scale classification, and information retrieval. While most ANNS solvers are CPU-based, this project will exploit the parallel computing power of GPUs to accelerate search performance.

We will study and advance state-of-the-art randomized algorithms that parallelize well. We will design and implement these solutions with the CUDA library, and compare their performance with the state-of-the-art GPU Faiss library [1].
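A minimal sketch of one such randomized algorithm, random-hyperplane locality-sensitive hashing: points that hash to the same bucket are likely close in angle, so a query only compares against its own bucket rather than the whole dataset. This is a toy single-table CPU version (function names are illustrative); a GPU version would batch the hashing and candidate ranking across threads.

```python
import numpy as np

def build_lsh_index(X, n_bits=8, seed=0):
    """Hash every point to a bucket via random hyperplanes (SimHash)."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((X.shape[1], n_bits))
    bits = (X @ planes > 0).astype(int)
    codes = bits @ (1 << np.arange(n_bits))  # pack sign bits into one int
    buckets = {}
    for i, c in enumerate(codes):
        buckets.setdefault(int(c), []).append(i)
    return planes, buckets

def query_lsh(q, X, planes, buckets):
    """Probe only the query's bucket, then rank candidates exactly."""
    n_bits = planes.shape[1]
    code = int((q @ planes > 0).astype(int) @ (1 << np.arange(n_bits)))
    cand = buckets.get(code, [])
    if not cand:
        return None  # in practice: probe nearby buckets or use more tables
    d = np.linalg.norm(X[cand] - q, axis=1)
    return cand[int(np.argmin(d))]
```

A single table misses some true neighbors; real systems use multiple tables or multi-probing to trade memory for recall.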

[1] https://github.com/facebookresearch/faiss/wiki/Faiss-on-the-GPU

Prerequisites: C/C++, Python, CUDA library, Eigen library

Private approximate nearest neighbor search

Approximate nearest neighbor search (ANNS) is a central problem in many areas of computer science, e.g. recommender systems, large-scale classification, and information retrieval. In these applications, queries and data points often constitute personal information, and growing privacy concerns limit the benefits of ANNS on sensitive data. We will study and implement lightweight cryptographic tools that execute ANNS while preserving the privacy of both queries and data points.

We will use scalable private set-intersection cardinality protocols [1] and order-preserving encryption techniques [2] to implement a private ANNS solver.

[1] https://github.com/osu-crypto/MiniPSI

[2] https://github.com/kevinlewi/fastore

Prerequisites: C/C++, Python, CUDA library, Eigen library

Scale up unsupervised learning with hashing

Many unsupervised learning tasks on big data, including clustering and outlier detection, suffer from a significant computational bottleneck: they must perform a huge number of pairwise distance computations. This project will study state-of-the-art randomized techniques (e.g. locality-sensitive hashing, randomized sketching) to scale up unsupervised learning with a negligible loss of accuracy.

We will exploit and advance similarity-preserving summaries of the data in the hashing/projection space to significantly reduce the number of pairwise distance computations for these unsupervised learning tasks.
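A small sketch of one such similarity-preserving summary: random-hyperplane signatures, where two points' sign bits agree with probability 1 − θ/π (θ the angle between them). Comparing short bit signatures then replaces full inner products; function names here are illustrative.

```python
import numpy as np

def simhash_codes(X, n_bits=4096, seed=0):
    """Random-hyperplane sign bits: a similarity-preserving summary."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((X.shape[1], n_bits))
    return X @ planes > 0

def estimated_cosine(code_a, code_b):
    """Estimate cos(theta) from the fraction of agreeing bits.

    P[bits agree] = 1 - theta/pi, so theta ~ pi * (1 - agreement);
    a cheap bit comparison stands in for a full distance computation."""
    agreement = np.mean(code_a == code_b)
    return np.cos(np.pi * (1.0 - agreement))
```

With a few thousand bits the estimate is already close enough to feed clustering or outlier scores, at a fraction of the cost of exact pairwise distances.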

Prerequisites: C/C++, Python

Scalable deep learning with maximum inner product search

Maximum inner product search (MIPS) and its variant top-k MIPS, which finds the k points with the largest inner products with a query, are central tasks in many machine learning applications. It has been shown that efficient MIPS can significantly reduce the training and testing cost of deep learning and of multi-task learning in NLP. This project will exploit MIPS to reduce the cost of training and inference in deep learning models.

This project studies state-of-the-art MIPS solvers to scale up deep learning models (e.g. DNNs).
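One classical building block behind many MIPS solvers is the reduction of inner product search to nearest-neighbor search by augmenting each point with one extra coordinate. A minimal NumPy sketch of this transform (names are illustrative):

```python
import numpy as np

def mips_to_nns(X):
    """Append sqrt(M^2 - ||x||^2) to each point, where M = max norm.

    All augmented points then have norm M, so for a unit-norm query
    (zero-padded) the squared distance is M^2 + 1 - 2<q, x>: maximizing
    the inner product becomes minimizing Euclidean distance."""
    norms = np.linalg.norm(X, axis=1)
    extra = np.sqrt(norms.max() ** 2 - norms ** 2)
    return np.hstack([X, extra[:, None]])

def top_k_mips(q, X, k=3):
    """Exact top-k MIPS solved as a nearest-neighbor problem."""
    Xa = mips_to_nns(X)
    qa = np.append(q / np.linalg.norm(q), 0.0)  # unit query, zero-padded
    d = np.linalg.norm(Xa - qa, axis=1)
    return np.argsort(d)[:k]
```

After the transform, any Euclidean ANNS index (such as the ones studied in the topics above) can answer top-k MIPS queries.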
Prerequisites: C/C++, CUDA library, Python, GPU