DS & CS Capstone Symposium

Program 2021

Track 1

Computer Music, Computer Vision

Zhongwen Zhou, Li Guo and Haoming Liu

RGB-D Semantic Segmentation Based on CNN with Attention Module (presentation) (report)

By incorporating depth information alongside RGB images, semantic segmentation models can understand scenes better. However, fusing depth and RGB information remains a challenge because of their different properties. In this paper, we propose a Bi-PAM network to capture the relevance between the RGB-D inputs. One position attention module (PAM) extracts features from the RGB input while the other works on the depth channel. By modeling semantic spatial interdependencies, they can relate features regardless of their distances. By combining the two attention maps, we fuse RGB and depth information at a higher level. On the challenging SUN RGB-D scene segmentation dataset, our model outperforms popular models that use pure RGB inputs or naively concatenate RGB-D information.
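
For readers unfamiliar with position attention, here is a minimal PyTorch sketch of a PAM's core computation; the channel sizes and the final fusion step are illustrative assumptions, not the authors' Bi-PAM implementation.

```python
import torch
import torch.nn as nn

class PAM(nn.Module):
    """Position attention: relate every spatial location to every other."""
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.key = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual weight

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)  # (b, hw, c//8)
        k = self.key(x).flatten(2)                    # (b, c//8, hw)
        attn = torch.softmax(q @ k, dim=-1)           # (b, hw, hw) spatial affinities
        v = self.value(x).flatten(2)                  # (b, c, hw)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return self.gamma * out + x

# One PAM per modality; a Bi-PAM design would fuse the two outputs.
rgb_feats, depth_feats = torch.rand(2, 64, 32, 32), torch.rand(2, 64, 32, 32)
fused = PAM(64)(rgb_feats) + PAM(64)(depth_feats)  # naive fusion, for illustration
```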


Oscar Wan

Deep Image Matting Improvement (presentation) (report)

The output of semantic segmentation is similar in concept to the trimap used in image matting, so it can be borrowed to improve matting results. In this capstone, I reproduce the DIM and PSP models, then treat the PSP output as a trimap and feed it into the image matting model. The broader hypothesis is that semantic segmentation can automate image matting.
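
One common way to derive a trimap from a binary segmentation mask, sketched below, is to erode for sure-foreground, dilate for sure-background, and mark the band in between as unknown; the kernel size here is an illustrative choice, not the project's setting.

```python
import cv2
import numpy as np

def mask_to_trimap(mask, kernel_size=15):
    """mask: uint8 array with foreground=255, background=0."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    fg = cv2.erode(mask, kernel)          # definitely foreground
    fg_plus = cv2.dilate(mask, kernel)    # foreground plus uncertain band
    trimap = np.full(mask.shape, 128, np.uint8)  # unknown region = 128
    trimap[fg == 255] = 255
    trimap[fg_plus == 0] = 0              # definitely background
    return trimap
```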


Ningran Song

Optimization of Piano Performance Control Calculation (presentation) (report)

In the research “Transferring Piano Performance Control Across Environments”, recording notes is laborious and time-consuming. To address this, I build and train a prediction model that optimizes velocity transfer in the note-transfer algorithm, reducing the recording workload while maintaining the quality of the transferred performance.


Yiming Huang

Multi-modal Personality Analysis for Mock Interview Performance Assessment (presentation) (report)

Computational personality analysis is becoming a popular tool for job recruiters hoping to efficiently identify suitable candidates from a large pool of applicants. However, automatically interpreting personality traits from an encounter during recruitment, such as a mock interview, is a complex process that relies on information gathered from many channels, including speech, facial expressions, and body language. In addition, when near real-time analysis is required, such interpretations must be repeatedly generated from encounters of very short duration. In light of these two complications, this project extends previous uni-modal approaches and aims to establish a CNN-based classifier that uses multiple modal features, including lexical features of speech transcripts, prosodic features of recorded audio, and visual features, all extracted from short YouTube clips of individuals speaking, to determine levels of Big-5 personality traits and the interview performance of the individual. Results indicate that although the short duration of the video clips made it difficult to push the models' binary classification accuracy beyond 70%, incorporating multi-modal data helped avoid the overfitting that occurred in the uni-modal solution.


Fan Yuan, Jianchen Tian and Peixuan Yu

AI Movie Casting Director (presentation) (report)

An interactive digital art piece revealing the correlation between Hollywood movie characters' physical appearance and their textual representation in movie scripts, AI Movie Casting Director is a machine-learning-based system that generates movie-script-like descriptions of human portrait images. Because of the lack of ready-made movie-related paired image-text data and the low quality of current image captioning models, we assembled our own dataset and pursued alternatives to direct image captioning, including image retrieval and text generation. Our system can now caption a given human portrait in the style of Hollywood movie-script language, with physical details and reasonable accuracy.

Track 2

Theoretical & Applied Data Science

Yukun Jiang and Junhai Ma

Purchaser Prediction and Product Recommendation for HSBC Retail Banking (presentation) (report)

In this capstone project, we put forward several approaches to help HSBC predict likely purchasers and make recommendations accordingly. These problems are hard to tackle because of their tight coupling to the business and the diversity of customers. We propose to solve them in two steps: (i) predict who is likely to purchase; (ii) make the best-fitting recommendation for each customer. We succeed in building a well-performing machine learning prediction model and designing a business-applicable recommendation strategy.


Zining Mao

Comparison of Methods for Robust Quadruped Robot Policy (presentation) (report)

We compare two methods for improving the robustness of a reinforcement learning-based policy for a quadruped robot. The domain randomization method randomizes the robot's structure during training to increase the coverage of models in the training data set, whereas the system identification method identifies the target robot model to adapt the RL-based policy obtained during training. We design experiments in a simulation environment to examine their performance and stability. The results show that both methods improve on the pure RL baseline, though each has different characteristics; combining the two improves the policy further. We provide qualitative analysis of the performance of these two methods.
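
As a concrete illustration of domain randomization, the self-contained sketch below resamples physical parameters at the start of each training episode so the policy trains across a family of robot models; the parameters and ranges are illustrative assumptions, not the project's configuration.

```python
import random
from dataclasses import dataclass

@dataclass
class RobotModel:
    mass_scale: float      # scales all link masses
    friction: float        # foot-ground friction coefficient
    motor_strength: float  # actuator gain

def sample_robot() -> RobotModel:
    """Draw a randomized robot model for one training episode."""
    return RobotModel(
        mass_scale=random.uniform(0.8, 1.2),
        friction=random.uniform(0.5, 1.5),
        motor_strength=random.uniform(0.8, 1.2),
    )

# Each episode trains the policy on a freshly sampled model.
for episode in range(3):
    robot = sample_robot()
    print(f"episode {episode}: {robot}")
```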


Jingyi Huang

Investigating Characteristics and Predictors of Ridesourcing and Ride-splitting Adoption: A Case Study of Chengdu, China (presentation) (report)

The paper investigates the adoption and frequency of ridesourcing and ride-splitting trips at a disaggregated level and examines ridesourcing services' substitution and complementary effects on public transit from both spatial and temporal perspectives. Using DiDi Chuxing data from Chengdu, China, we develop a prediction model of ride-splitting adoption that integrates various built-environment variables and trip features. We train Random Forest classifiers to discover predictors of ride-splitting adoption and examine the nonlinear relationships between variables, achieving an accuracy of 84.25%. We find that a trip's start time, trip distance, and transit accessibility, among other factors, influence people's choices about splitting rides. Based on our findings, policy implications can be drawn concerning how to implement spatially and temporally distinctive urban policies to promote the integration of public transit and ridesourcing and to encourage ride-splitting.


Zhirui Yao and Kaiwen Dai

Spending Patterns of High Net Worth Individuals (presentation) (report)



Robert Melikyan and Houze Liu

Deep Learning Applied in Dynamic Aircraft Valuation (presentation) (report)

Efficient aircraft valuation and prediction are vital for aircraft appraisers, as well as other participants in the aviation industry. However, the industry's monopolistic structure, entrenched theoretical valuation approaches, and lack of data transparency have hindered the application of quantitative modeling to aircraft valuation. This project addresses those issues by pooling data from multiple sources, such as reports from the FAA, OEMs, airlines, and appraisers. Through a quantitative exploration of multiple models, including Ridge Regression, Random Forest, and XGBoost, this project eventually arrives at a Stacking Model. In addition to building upon these previous models, our team also explored Deep Feed-Forward Neural Networks, with mixed results.
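
As a sketch of the stacking idea with the models named above, using scikit-learn and xgboost; the hyperparameters and meta-learner are illustrative assumptions:

```python
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from xgboost import XGBRegressor

stack = StackingRegressor(
    estimators=[
        ("ridge", Ridge(alpha=1.0)),
        ("rf", RandomForestRegressor(n_estimators=200)),
        ("xgb", XGBRegressor(n_estimators=200)),
    ],
    final_estimator=Ridge(),  # meta-learner combines base-model predictions
)
# stack.fit(X_train, y_train); preds = stack.predict(X_test)
```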


Siyi Lyu and Zixuan Xie

Optimal Deployment of Quarantine Facilities during the Pandemic (presentation) (report)

Our paper focuses on the optimal deployment of centralized quarantine facilities. We set the scope to Shanghai, the city the Chinese government expects to handle the most international flights. The objective is to minimize the total cost of the quarantine facilities, which we divide into three parts: construction cost, transportation cost for all incoming travelers, and transportation cost for identified patients. We solve the problem with a mixed integer programming approach. This paper can serve as a reference for the government in determining the locations of quarantine facilities in the coming years.
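
A minimal sketch of such a facility-location MIP, written with PuLP; the sites, costs, and demand are toy numbers, and the two transport-cost terms are collapsed into one for brevity:

```python
import pulp

sites = ["A", "B", "C"]
build_cost = {"A": 100, "B": 80, "C": 120}   # construction cost per site
transport_cost = {"A": 3, "B": 5, "C": 2}    # per-person transport cost
capacity = {"A": 500, "B": 300, "C": 700}
demand = 900                                  # incoming travelers to house

prob = pulp.LpProblem("quarantine_deployment", pulp.LpMinimize)
open_site = pulp.LpVariable.dicts("open", sites, cat="Binary")
assigned = pulp.LpVariable.dicts("assigned", sites, lowBound=0)

# Objective: construction cost plus transport cost.
prob += pulp.lpSum(build_cost[s] * open_site[s] + transport_cost[s] * assigned[s]
                   for s in sites)
# Everyone must be housed; a site takes people only if it is open.
prob += pulp.lpSum(assigned[s] for s in sites) == demand
for s in sites:
    prob += assigned[s] <= capacity[s] * open_site[s]

prob.solve()
print({s: (open_site[s].value(), assigned[s].value()) for s in sites})
```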


Yijian Liu and Yijie Wang

Applying Deep Reinforcement Learning to the Control Scheme for Multiple Elevators (presentation) (report)

Queuing for an elevator to reach an office on the 30th floor is a daily headache for many office workers. If the average waiting time could be reduced, overall productivity would be greatly enhanced. This problem is known as elevator group control (EGC). Recently, Deep Reinforcement Learning (DRL) has received much attention for its ability to solve decision problems with large state and action spaces. As many new DRL algorithms have been proposed, our project tests the feasibility of applying DRL algorithms to the EGC problem. We show that current DRL algorithms, while still limited in many ways, point to a promising future for DRL in real-world applications.


Yuhao Ding

Improving Fairness in Machine Learning Predictions (presentation) (report)

As machine learning methods are increasingly applied to real-world scenarios, it is crucial to ensure that the models we use are fair. Historical biases in datasets can easily lead models to discriminate against certain groups of people by race, gender, and other attributes. In this work, we present a novel approach to reducing bias that can be easily applied to machine learning applications. We find that by training a post-correction model that focuses on learning and correcting bias patterns, bias can be reduced by around 80% while accuracy drops by less than 1%.
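
A heavily simplified sketch of one plausible reading of the post-correction idea, assuming binary 0/1 labels: train a second model on the base model's held-out mistakes and flip suspect predictions. This is not the author's exact method.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def fit_with_correction(X, y):
    X_base, X_corr, y_base, y_corr = train_test_split(X, y, test_size=0.5)
    base = LogisticRegression(max_iter=1000).fit(X_base, y_base)
    # The correction model learns where the base model errs.
    errors = (base.predict(X_corr) != y_corr).astype(int)
    corrector = LogisticRegression(max_iter=1000).fit(X_corr, errors)
    return base, corrector

def predict_corrected(base, corrector, X):
    pred = base.predict(X)
    likely_wrong = corrector.predict(X) == 1
    pred[likely_wrong] = 1 - pred[likely_wrong]  # flip suspect binary labels
    return pred
```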

Track 3

Software Approaches for Real-World Applications

Adrien Ventugol

2.5D Action Game (presentation) (report)

The quality of video games has traditionally been judged solely by their entertainment value, much like toys. Recent studies in the fields of video game research (ludology) and media research suggest otherwise: players seek both entertainment and pleasure (Hedonia) as well as emotional investment and exposure to meaningfulness (Eudaimonia). By building a custom 2.5D action game and submitting it to beta testing, this project validates the claim that Hedonia and Eudaimonia both play a role in a player's gaming experience.


Gina Joerger and Oli Chen

An Algorithmic Stablecoin on the Internet Computer (presentation) (report)

As the distributed ledger technology industry grows, there is increasing demand for fast, open, and secure blockchains that can support a diverse array of protocols. However, impediments such as long finality times, low scalability, and low throughput remain great inhibitors. Furthermore, Bitcoin in its present state is simply too volatile to be used as a secure financial asset; an alternative is required. By building an Ampleforth implementation on the Internet Computer, we provide a price-stable digital asset on a blockchain platform other than Ethereum. In doing so, we increase the protocol's degree of interoperability while also creating a way to measure scalability, security, and decentralization on the Internet Computer. In this report, we present our prototype implementation and the tests we conducted on both Ethereum's Ganache network and the Internet Computer's Sodium network. Our main findings show that on these local networks the Ampleforth implementation on Ganache executed faster than our Internet Computer prototype. However, our findings also suggest that the Ampleforth implementation on Ethereum is unlikely to scale as efficiently as a deployed version of our prototype on the Internet Computer.
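
For context, a minimal sketch of Ampleforth-style rebase arithmetic: supply expands or contracts toward the price target, damped by a smoothing factor. The numbers and damping constant are illustrative of the published protocol, not our canister code.

```python
def rebase(total_supply: float, oracle_price: float,
           target_price: float = 1.0, damping: int = 30) -> float:
    """Adjust supply proportionally to the price's deviation from target."""
    deviation = (oracle_price - target_price) / target_price
    return total_supply * (1 + deviation / damping)

print(rebase(1_000_000, oracle_price=1.06))  # supply grows by about 0.2%
```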


Tyler Wilson

Bi-Parental Moran Model Simulations (presentation) (report)

In genealogy, researchers use what is called a Moran model to study the gene pools of finite populations over time. In this project, I support the theory researched by my supervisor, Professor Le Jan, which aims to explain genetic behavior under the conditions of a Bi-Parental Moran model. This type of model is defined by a fixed number N of individuals in the population and n generations used in the process. His work focuses particularly on the effects observed as N and n go to infinity. To support the theory, I built a Python module that accurately emulates the Bi-Parental Moran model and created a series of simulations that can be used to further validate the theory, display the phenomena it describes, explore its explanatory scope, and raise new questions to consider. I find that the results from the Moran module and simulations match expected behavior and serve as a new tool for further research.
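
A minimal sketch of one bi-parental generation step: each individual in the next generation inherits, locus by locus, from two parents drawn at random from the current generation. The genome representation and inheritance rule are illustrative assumptions, not Professor Le Jan's exact model.

```python
import random

def next_generation(population):
    new_pop = []
    for _ in range(len(population)):
        p1, p2 = random.sample(population, 2)  # two distinct parents
        child = [random.choice(pair) for pair in zip(p1, p2)]  # one allele per locus
        new_pop.append(child)
    return new_pop

# N individuals, each with 5 loci labelled by founder index.
N = 10
pop = [[i] * 5 for i in range(N)]
for generation in range(20):
    pop = next_generation(pop)
print(pop[0])  # ancestry of one individual after 20 generations
```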


Qing Deng

An Investigation of the Applicability of Automated Testing Techniques in the FinTech Industry (presentation) (report)

In the financial technology (FinTech) industry, the reliability of software systems is especially important given the large sums invested. Unfortunately, according to our field research, the current state of practice in the FinTech industry still relies heavily on manual testing, largely due to the high complexity of the systems involved. Manual testing is a time- and labor-intensive approach for testers, so there is a pressing practical need for algorithms that automate the testing process. In this paper, I present a new testing methodology consisting of combinatorial testing, an equivalent set, and an execution path set to determine, with accuracy above 69% and good reliability, whether there is a potential bug within a software program.
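
As one illustration of combinatorial testing, the sketch below enumerates the parameter-value pairs a pairwise test suite must cover; the parameters are toy examples, not the methodology's actual inputs.

```python
from itertools import combinations, product

params = {
    "account_type": ["retail", "corporate"],
    "currency": ["USD", "CNY", "EUR"],
    "channel": ["web", "mobile"],
}

def required_pairs(params):
    """All parameter-value pairs a pairwise test suite must exercise."""
    pairs = set()
    for (p1, v1s), (p2, v2s) in combinations(params.items(), 2):
        for v1, v2 in product(v1s, v2s):
            pairs.add(((p1, v1), (p2, v2)))
    return pairs

print(len(required_pairs(params)), "parameter-value pairs to cover")
```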


Alexander Gonsalves

An Analysis of Genetic Algorithms in Training Snake (presentation) (report)

Computer agents controlled by neural networks can use machine learning methods to accomplish a great variety of tasks. In this research project, agents trained by Genetic Algorithms learn to play the game Snake and are evaluated against each other and a simple heuristic, in an effort to determine the better training method among the Genetic Algorithm variants and to identify shortcomings of the algorithm. The Elitism and Coarse Grained Genetic Algorithms are trained 200 times for 2000 generations each, then evaluated 1000 times per training run. They are compared against a Manhattan Distance model evaluated 200,000 times. The Elitism Genetic Algorithm slightly outperformed the Manhattan Distance model, with a high score of 62 apples, and adapted better to the model's environment. The Coarse Grained Parallel Genetic Algorithm underperformed both models due to a constricting parameter set that inhibited training. The Elitism Genetic Algorithm shows the most promise as an effective Genetic Algorithm, but future work is needed to fully investigate the efficacy of the Coarse Grained Parallel Genetic Algorithm.
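
A minimal sketch of one elitism GA generation: the top performers survive unchanged and the population is refilled with mutated copies. The fitness function, rates, and sizes are illustrative, not the project's configuration.

```python
import random

def evolve(population, fitness, elite_frac=0.1, mutation_rate=0.05):
    ranked = sorted(population, key=fitness, reverse=True)
    n_elite = max(1, int(elite_frac * len(population)))
    elites = ranked[:n_elite]                      # survive unchanged
    children = []
    while len(elites) + len(children) < len(population):
        parent = random.choice(elites)
        child = [g + random.gauss(0, 1) if random.random() < mutation_rate else g
                 for g in parent]                  # perturb "network weights"
        children.append(child)
    return elites + children

# Toy run: maximize the sum of a 5-gene genome.
pop = [[random.gauss(0, 1) for _ in range(5)] for _ in range(20)]
for _ in range(50):
    pop = evolve(pop, fitness=sum)
print(max(map(sum, pop)))
```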

Track 4

Computer Languages, Systems and Networks


Quang Luong

Compress Encapsulated Headers in WireGuard (presentation) (report)

This paper defines a header compression algorithm for use in WireGuard. WireGuard is a relatively new communication protocol focused on performance and security. However, because it encapsulates Internet Protocol packets in a User Datagram Protocol tunnel, a protocol data unit carries redundant information. Removing that redundancy reduces the packet size, making the protocol more efficient. The compression uses differential encoding and a piggybacked feedback channel to compress encapsulated packets. We designed two compression profiles, for UDP and TCP, that eliminate the overhead of inner IP headers and more, saving around 40 bytes per IPv6 packet.
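
A minimal sketch of the differential-encoding idea: once both ends share a reference header, only the fields that changed cross the tunnel. The field layout and dict-based encoding are illustrative, not the paper's compression profiles.

```python
def compress(header: dict, reference: dict) -> dict:
    """Send only fields that differ from the shared reference header."""
    return {k: v for k, v in header.items() if reference.get(k) != v}

def decompress(delta: dict, reference: dict) -> dict:
    return {**reference, **delta}

ref = {"src": "2001:db8::1", "dst": "2001:db8::2", "hop_limit": 64, "flow": 7}
nxt = {"src": "2001:db8::1", "dst": "2001:db8::2", "hop_limit": 64, "flow": 8}
delta = compress(nxt, ref)          # only {"flow": 8} crosses the tunnel
assert decompress(delta, ref) == nxt
```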


Leyi Sun and Yifan Zhuo

Deployment of Scientific Workflows in the Cloud (presentation) (report)

Our project produces a Domain Specific Language (DSL) and constructs a framework to deploy and execute scientific workflows on cloud virtual machines (VMs). The DSL syntax is simple to learn and use, and its compilation exploits potential data-parallelism opportunities for the user. We built a master-slave architecture that controls and monitors workflow execution and seeks to achieve load balancing. Our framework prototype correctly solves the Word Count problem.


Xiaonan Li

Long Short-Term Memory Neural Network Based Failure Detectors (presentation) (report)

This work studies whether and how machine learning can be applied to problems in the field of failure detection (FD). Few failure detector algorithms use machine learning techniques, so I set out to discover whether machine learning works well for this problem. The difficulty is that machine learning performs well on highly regular data, but failures or delays may not exhibit apparent long-term regularity; the regularity exhibited during a failure period is relatively short-lived. To continuously maintain high accuracy, the algorithm must constantly train on newly arriving data. This approach presumably yields low accuracy, because of the small amount of real-time data, while also consuming many computing resources. My main goal is therefore to design a machine learning algorithm that requires minimal resources and time during real-time training while maintaining high accuracy. My work uses Chen's FD algorithm as the baseline and starts by applying a basic long short-term memory (LSTM) neural network. After obtaining preliminary results, I further optimize the model's performance by adjusting its parameters and structure. In this paper, I present an LSTM algorithm that, taken as a whole, outperforms Chen's in accuracy and computing time, suggesting that machine learning techniques are viable in this area.
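
A minimal PyTorch sketch of the idea: an LSTM predicts the next heartbeat inter-arrival time from recent history, and a failure is suspected when no heartbeat arrives within the prediction plus a safety margin. The sizes and margin are illustrative assumptions, not the paper's tuned model.

```python
import torch
import torch.nn as nn

class ArrivalLSTM(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, intervals):              # (batch, seq_len, 1)
        out, _ = self.lstm(intervals)
        return self.head(out[:, -1])           # predicted next interval

model = ArrivalLSTM()
history = torch.rand(1, 50, 1)                 # last 50 inter-arrival times
predicted = model(history).item()
timeout = predicted + 0.1                      # safety margin before suspecting
```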


Tiange Wang

Prolog Proof Constructor (presentation) (report)

Prolog is a logic programming language that lets programmers state logical relations and leaves most of a program's control to the Prolog compiler. It is therefore interesting to understand: how is the control handled by the compiler in the search for an answer, and how do we know that its answers are correct? For the first question, we track the control flow of a Python Prolog interpreter in its search for an answer and plot a search tree. For the second, we construct proof trees and formal proofs for each step of the search in the search tree. This project also shows that the two questions are related: a proof or proof tree is a different articulation of the search tree.


Eden Wu and Jingyi Zhu

Frequent Itemsets Mining in the Cloud (presentation) (report)

Frequent itemset mining (FIM) aims to discover frequently correlated items. However, it is hard to design FIM algorithms that offer both scalability and the ability to handle incremental database updates. In this paper, we propose Freno, an incremental and distributable prefix tree of frequent itemsets, together with the corresponding building and mining algorithms. Through empirical experiments on the large datasets Retail, Kosarak, Chainstore, and RecordLink, we find that, although Freno cannot outpace FP-growth, it mines incremental and large datasets with good performance. It could be a good basis for future FIM algorithms.
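
A minimal sketch of an itemset prefix tree supporting incremental insertion: each transaction's items, in canonical order, form a path, and node counts accumulate support. This shows only the generic structure, not Freno's internals.

```python
class Node:
    def __init__(self):
        self.count = 0
        self.children = {}

def insert(root, transaction):
    """Add one transaction; counts give the support of each prefix itemset."""
    node = root
    for item in sorted(transaction):   # canonical order shares prefixes
        node = node.children.setdefault(item, Node())
        node.count += 1

root = Node()
for t in [{"milk", "bread"}, {"milk", "bread", "eggs"}, {"bread"}]:
    insert(root, t)
print(root.children["bread"].count)    # support of {bread} = 3
```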

Track 5

Natural Language Processing

Susan Chen

Classifying Tweet Data Using Topic Modeling Probabilities as Features (presentation) (report)

Is it depression or something else? This paper aims to provide a framework to supplement professional examinations when diagnosing patients for depression and COVID fatigue. Like many other mental and physical illnesses, COVID fatigue shares very similar symptoms with depression, making misdiagnosis likely. However, additional mechanisms could be used in a diagnostic exam, such as patient social media data, specifically Twitter data. Twitter allows users to assign hashtags to their tweets, and using these hashtags as labels, it is possible to identify the correct label using topic modeling probabilities as features for a classifier. This paper examines two groups of hashtags, COVID fatigue and depression, applies topic modeling, and uses the resulting topic probabilities to classify the tweets. The best-performing classifier achieved an accuracy of 79%.
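
A minimal scikit-learn sketch of the pipeline: LDA topic probabilities become the feature vector for a downstream classifier. The corpus, topic count, and classifier choice are toy assumptions.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

tweets = ["so tired of lockdowns again", "can't get out of bed these days",
          "another quarantine week", "everything feels hopeless lately"]
labels = [0, 1, 0, 1]   # 0 = COVID fatigue, 1 = depression (toy labels)

counts = CountVectorizer().fit_transform(tweets)
topics = LatentDirichletAllocation(n_components=2, random_state=0)
features = topics.fit_transform(counts)        # per-tweet topic probabilities
clf = LogisticRegression().fit(features, labels)
print(clf.predict(features))
```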


Xinyi Chen and Angela Zheng

Deep Classification of Reddit Posts (presentation) (report)

Reddit requires users to select a subreddit when uploading a post, but provides no recommendation system to assist with this, and no solution has yet been implemented to address the issue. Deep classification of Reddit posts is therefore interesting and meaningful. We approached the problem with a fastText model, an LSTM model, an SVM model with a TF-IDF vectorizer, and a neural network model with a TF-IDF vectorizer. We evaluate our results using top-k accuracy and confusion matrices. The NN model gives the best top-1 result, with an accuracy of 74.32%, while the best top-3 accuracy of 88.35% is achieved by the SVM model. Our approach provides insight into building a recommendation system for Reddit posts.
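
A minimal sketch of one of the approaches above, a TF-IDF vectorizer feeding a linear SVM; the subreddit names and posts are toy examples.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

posts = ["my sourdough starter finally rose", "patch notes for the new season",
         "best flour for baguettes?", "ranked queue is broken again"]
subreddits = ["Breadit", "gaming", "Breadit", "gaming"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(posts, subreddits)
print(model.predict(["flour recommendations for a beginner?"]))  # likely 'Breadit'
```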


Bilal Munawar and Abubakar Zahid

Urdu Extractive Summarization (presentation) (report)

A huge amount of online Urdu news data is generated daily and summarized by human experts, which is tedious and limits coverage to small volumes of text. Urdu is a low-resource language with no prior deep-learning-based implementation of text summarization. In this paper, we explore two extractive text summarization approaches for Urdu: 1) an unsupervised approach based on K-Means clustering, and 2) a supervised approach called RoBERTaSUM based on the Transformer (Vaswani et al., [6]). Our attempt is the first deep-learning-based implementation of text summarization for the Urdu language. We also create a summarization dataset containing 100 Urdu news articles to fine-tune and test our transformer models. Our experiments show that our transformer-based supervised approach beats the prior state-of-the-art unsupervised approach.
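
A minimal sketch of the unsupervised approach: embed sentences, cluster with K-Means, and pick the sentence nearest each centroid as the summary. The TF-IDF embedding and English toy sentences stand in for the actual Urdu sentence encoder.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ["The council met on Monday.", "Budget cuts were announced.",
             "Protests followed downtown.", "Officials promised a review.",
             "Schools remain affected by the cuts."]

X = TfidfVectorizer().fit_transform(sentences).toarray()
k = 2                                             # summary length in sentences
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
summary_idx = sorted(
    np.argmin(np.linalg.norm(X - km.cluster_centers_[c], axis=1)
              + (km.labels_ != c) * 1e9)          # nearest in-cluster sentence
    for c in range(k))
print([sentences[i] for i in summary_idx])
```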


Qiyu Long, Zhichen Wang and Aiqing Li

A Deep Classification of Reddit Posts (presentation) (report)

This project aims to improve the current subreddit recommendation mechanism, which is based on popularity: when users search for certain topics on Reddit, most want the most relevant posts rather than trending ones. We constructed a model that learns to associate content with the most relevant subreddit, using Word2Vec as the embedding method. We selected 20 subreddits as our dataset, each with 200,000 to 500,000 posts. The best accuracy achieved by TextCNN is 65.1%, and the best accuracy achieved by TextRNN is 55.2%.


Frederick Morlock

Exploring the Limitations of t-SNE (presentation) (report)

Introduced in 2008, t-SNE is a dimensionality reduction algorithm that has attracted the attention of many researchers due to its remarkable performance. However, there is no clear consensus in the research community as to how t-SNE achieves its impressive empirical results. One important question remains: what are the limitations of t-SNE? In this paper, we explore the limitations of this popular algorithm by constructing datasets based on the challenges other researchers have faced in their use of t-SNE. By exposing t-SNE's weak points, we open a door to further improvement of dimensionality reduction algorithms.
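
A minimal sketch of probing t-SNE on a constructed dataset: two Gaussian clusters of very different densities, a case where t-SNE's tendency to equalize apparent cluster sizes can mislead. The data is illustrative, not one of the paper's datasets.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
tight = rng.normal(0, 0.1, size=(100, 10))     # dense cluster
loose = rng.normal(5, 2.0, size=(100, 10))     # diffuse cluster
X = np.vstack([tight, loose])

emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
# In the embedding the two clusters tend to look similar in size, even
# though their original densities differ by a factor of 20.
print(emb.shape)
```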


Jingxian Xu, Weiran Zhang and Yian Zhang

News Recommendation with Document Understanding (presentation) (report)

News recommendation is one of the most popular and influential applications of modern machine learning techniques. The recently released MIND dataset includes millions of real-world news click-through examples extracted from Microsoft News, with comprehensive textual and behavioral information. In this work, we attempt to improve the performance of the NRMS model by using pretrained language models and by incorporating and modeling more features from the dataset. We also propose a potential enhancement method based on named entities that can be directly applied to any existing model.


Kende Orban and Jorge Barreno

Financial Sentiment Analysis on Twitter Posts (presentation) (report)

As the race to generate abnormal returns in securities markets intensifies, many investors are looking to alternative sources of information to gain an edge. Recently, one such source has been the massive collection of social media data, which allows careful observers to gauge sentiment around various ticker symbols through NLP analysis of tweets. This paper tackles the classification of finance-related tweets as either “Bearish” or “Bullish”. We started with traditional classification methods such as logistic regression and SVMs, then moved to LSTM models of varying sizes, and finally to a further pre-trained BERT model, which yields a classification accuracy of 80%.
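
A minimal sketch of the final stage: loading a pre-trained BERT for two-class sentiment with Hugging Face transformers. The checkpoint and label mapping are illustrative assumptions, and the model would still need fine-tuning on labeled tweets.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)           # 0 = Bearish, 1 = Bullish

batch = tok(["$AAPL breaking out, loading calls"], return_tensors="pt")
logits = model(**batch).logits                   # meaningful only after fine-tuning
print(torch.softmax(logits, dim=-1))
```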

Track 6

Pedagogical Tools

Andrew Liu and Zane Fadul

Web-App for the Visualization and Calculation of Schwarz-Christoffel Conformal Mapping (presentation) (report)

While computational solutions for the Schwarz-Christoffel conformal mapping method have existed for almost 30 years, only one toolkit is readily available. Because this toolkit is a MATLAB implementation, there is a considerable barrier for those without a MATLAB license or with little technical experience. Our goal is to create a user-friendly web application that acts as a supplemental educational tool. This report documents the process of, and findings from, creating a web app dedicated to the interactive plotting and visualization of polygons using Schwarz-Christoffel mapping. The project is divided into two parallel stages: numerical analysis and app development.
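
For reference, the classical Schwarz-Christoffel formula maps the upper half-plane onto a polygon with interior angles \(\alpha_k \pi\) at prevertices \(z_k\):

```latex
f(z) = A + C \int^{z} \prod_{k=1}^{n} (\zeta - z_k)^{\alpha_k - 1} \, d\zeta
```

Here \(A\) and \(C\) fix the polygon's position, scale, and orientation; locating the prevertices \(z_k\) is the numerical crux any such toolkit must solve.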


Yitao Zhou and Shuxuan Liu

Game-Based Popularization of Blockchain Technology (presentation) (report)

Our work focuses on popularizing concepts in blockchain technology, mainly the proof-of-work consensus algorithm and distributed ledgers. As blockchain technology gains popularity and widespread application, it becomes crucial to develop pedagogical tools that spark people's interest and educate non-professionals about the technology. We do so through a web-based puzzle game in which players can experience and understand the key concepts. We also designed an evaluation method to assess the performance of our teaching tool. Our work has made some initial achievements, but there are still many areas in which the game can improve.
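
A minimal sketch of the proof-of-work idea the game teaches: search for a nonce whose block hash starts with a required number of zeros. The difficulty and block contents are illustrative, not the game's parameters.

```python
import hashlib

def mine(block_data: str, difficulty: int = 4) -> int:
    """Find a nonce giving a SHA-256 digest with `difficulty` leading zeros."""
    prefix = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{block_data}{nonce}".encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce
        nonce += 1

print("found nonce:", mine("alice pays bob 5"))
```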


Chelsea Polanco and Jia Zhao

Popularizing Blockchain Technology (presentation) (report)

The problem we tried to solve is to clearly explain three key components of blockchain technology: validation, proof-of-work, and consensus. This problem is relevant because Bitcoin, as an application of blockchain technology, has become increasingly popular and has garnered many misconceptions among the general public. As students ourselves, we wanted to teach other students with no prior knowledge of blockchain technology, since we ran into many misunderstandings when first learning about it. The most difficult issue was to explain everything clearly, without obscuring technical details, while retaining the original meaning. We designed an educational tool with an online storytelling approach that uses proofs and refutations to explain the three key components, focusing on how Bitcoin prevents double-spending, and we evaluated the tool's pedagogical quality through pre- and post-questionnaires. Based on self-assessment, the majority of post-questionnaire takers stated they could explain the key components of blockchain technology well. However, the p-value from our hypothesis testing shows that there is no statistically significant evidence that our story improves students' understanding of the three key components.