Welcome to my personal website!

Junzhi Cao

PhD candidate at New York University, Data Scientist. Cosmology, statistics, machine learning.

I'm a PhD candidate at New York University; I joined NYU in 2015. My research focuses on the large-scale structure of the universe. We study the properties of dark matter and galaxies using Bayesian statistical methods. The data we use come from the SDSS survey, and we build comparable mock catalogs from hydrodynamic simulations (e.g., TNG100/300) and N-body simulations (e.g., the Bolshoi-Planck simulation).

In the summer of 2020, I started an internship at Ant Financial (Alibaba) as a machine learning engineer. My work focuses on anomaly detection in a large time-series data set, with data arriving at up to 10 TB per minute. Our goal is to build a data-driven model that can handle multiple diagnostic parameters and that runs fast enough to keep up with the rapid growth of the data. So far I have built both supervised and unsupervised models for anomaly detection in time series, including an LSTM-based model that fits the data well. Both models predict anomalies well: the supervised model reaches AUROC = 0.98 and accuracy = 95%, and its number of false alarms (type I errors) is only about half its number of missed detections (type II errors), which is good because a false alarm has a higher cost than a missed detection in our system.
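
To give a flavor of the unsupervised side, here is a minimal sketch of an LSTM autoencoder that flags windows with large reconstruction error as anomalies. The layer sizes, window length, and threshold rule are illustrative assumptions, not the production settings.

# Minimal sketch: unsupervised anomaly detection with an LSTM autoencoder (PyTorch).
# Layer sizes and the threshold rule are illustrative, not the production values.
import torch
import torch.nn as nn

class LSTMAutoencoder(nn.Module):
    def __init__(self, n_features=8, hidden=64):
        super().__init__()
        self.encoder = nn.LSTM(n_features, hidden, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_features)

    def forward(self, x):                      # x: (batch, time, features)
        _, (h, _) = self.encoder(x)            # summarize the window into the last hidden state
        z = h[-1].unsqueeze(1).repeat(1, x.size(1), 1)  # repeat it along the time axis
        dec, _ = self.decoder(z)
        return self.out(dec)                   # reconstruction of the input window

def anomaly_scores(model, windows):
    """Mean reconstruction error per window; large scores are flagged as anomalies."""
    with torch.no_grad():
        recon = model(windows)
        return ((recon - windows) ** 2).mean(dim=(1, 2))

# Usage: train with nn.MSELoss() on normal traffic only, then flag windows whose
# score exceeds, say, the 99th percentile of scores on a clean validation set.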

I have also applied an unsupervised model for predicting trends in time-series data, based on a modified version of the Transformer. It predicts the data well. If you are interested, please check this link: https://www.kaggle.com/peraktong/transformer-ts.
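
To give an idea of the approach, here is a minimal sketch of an encoder-only Transformer with a linear forecasting head. It is a simplified stand-in for the model in the notebook: the dimensions are made up, and positional encodings are omitted for brevity.

# Minimal sketch of a Transformer encoder for time-series forecasting (PyTorch).
# This only illustrates the idea; the model in the Kaggle notebook differs in its details.
import torch
import torch.nn as nn

class TSTransformer(nn.Module):
    def __init__(self, n_features=1, d_model=64, n_heads=4, n_layers=2, horizon=24):
        super().__init__()
        self.embed = nn.Linear(n_features, d_model)   # project each time step to d_model
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=128,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, horizon)       # predict the next `horizon` values

    def forward(self, x):                             # x: (batch, time, features)
        # (positional encodings omitted for brevity; a real model needs them)
        h = self.encoder(self.embed(x))               # self-attention over the window
        return self.head(h[:, -1])                    # forecast from the last position

model = TSTransformer()
past = torch.randn(32, 96, 1)        # 96 observed steps per series
forecast = model(past)               # shape (32, 24): the predicted trend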

I'm interested in data analysis using Bayesian statistics. With the rapid growth of computational power these days, more and more high-resolution simulations are available, which means we can study the large-scale structure of the universe with high precision.
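
A toy example of the workflow I have in mind: a Gaussian likelihood with flat priors, sampled with emcee. The straight-line model and the fake data are placeholders for a real clustering analysis.

# Minimal sketch of a Bayesian fit: Gaussian likelihood, flat priors, MCMC with emcee.
# The straight-line model and fake data stand in for a real analysis.
import numpy as np
import emcee

x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + np.random.normal(0, 1.0, size=x.size)   # fake data with a known truth
yerr = np.ones_like(y)

def log_prob(theta):
    slope, intercept = theta
    if not (-10 < slope < 10 and -10 < intercept < 10):      # flat prior with hard bounds
        return -np.inf
    model = slope * x + intercept
    return -0.5 * np.sum(((y - model) / yerr) ** 2)          # Gaussian log-likelihood

ndim, nwalkers = 2, 32
p0 = np.random.normal([2.0, 1.0], 0.1, size=(nwalkers, ndim))
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob)
sampler.run_mcmc(p0, 2000, progress=True)
samples = sampler.get_chain(discard=500, flat=True)          # posterior samples of (slope, intercept)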

I'm also interested in machine learning. More and more physicists are applying machine learning in their research, and there are already some promising results, for example in galaxy morphology. Galaxies are huge objects in our sky. Take our Milky Way as an example: its mass is around 10^12 M_sun, where M_sun is the mass of our Sun, and its virial radius is hundreds of kiloparsecs, where one parsec equals 3.26 light years! Fortunately, our telescopes are able to capture images of these huge objects. The shapes of galaxies differ greatly from one another, and so does the number of spiral arms each galaxy has. Our Milky Way has four spiral arms, and our Solar System lies within the Orion arm. The question is: there are so many galaxies in the sky, so how can we count their arms one by one? And sometimes we can't even tell how many arms there are, because the imaging data are not good enough or the image is misleading. One effective way to deal with this is a Generative Adversarial Network (GAN). Similar to identifying handwritten digits from 0 to 9, the number of spiral arms can be identified by a well-trained GAN model (although there are a lot of details inside, and my statement here is a simple toy picture).
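
The paragraph above mentions GANs; a full semi-supervised GAN is too much for a toy sketch, so here is the simpler, purely supervised version of the same idea: a small CNN that classifies galaxy cutouts by arm count, directly analogous to classifying handwritten digits. The image size and number of classes are made-up assumptions.

# Toy sketch of "count the arms like you count handwritten digits": a small CNN classifier
# over galaxy cutouts. This is the plain supervised version of the idea, not a GAN.
import torch
import torch.nn as nn

class ArmCounter(nn.Module):
    def __init__(self, n_classes=5):                  # e.g. 0-4 visible spiral arms (assumed)
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 16 * 16, n_classes)

    def forward(self, x):                             # x: (batch, 1, 64, 64) grayscale cutouts
        h = self.features(x)
        return self.classifier(h.flatten(1))          # logits over arm counts

model = ArmCounter()
images = torch.randn(8, 1, 64, 64)                    # a fake batch of galaxy cutouts
logits = model(images)                                # train with nn.CrossEntropyLoss()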

Recently I finished a project on constraining the scatter in the galaxy-halo connection using small-scale clustering. Our next goal is to look at the total luminosity of the satellites of each central galaxy and study the relation between satellite luminosity and secondary halo properties. We are also working with the TNG data, a large hydrodynamic simulation released recently, to dig out the scatter between galaxies and halos at different redshifts. At the same time, I'm thinking about how to apply deep learning techniques to my current (or future) projects. It's always good to try something new, but there is also a big challenge: the precision of a data-driven model is, of course, not as high as that of a physical model, and we need to use these data-driven models carefully. Doing academic research is different from doing a face swap in a video: it's okay to have a few broken frames in a video, but in serious academic research we need to explain why those frames are broken. Knowing why it works is much more valuable than knowing that it works. These are the problems we need to think about.
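
As a rough illustration of where the scatter lives in this kind of model, here is a toy abundance-matching sketch: rank-match halo mass to galaxy luminosity and add lognormal scatter. The catalogs and the 0.2 dex scatter are invented for illustration; this is not the pipeline used in the project.

# Minimal sketch of how scatter enters an abundance-matching style galaxy-halo model:
# rank halos by mass, rank galaxies by luminosity, match ranks, then add lognormal scatter.
import numpy as np

rng = np.random.default_rng(0)
halo_mass = 10 ** rng.uniform(11, 14, size=100000)        # fake halo catalog [M_sun]
luminosity = 10 ** rng.uniform(9, 11.5, size=100000)      # fake galaxy luminosities [L_sun]

def abundance_match(halo_mass, luminosity, sigma_dex=0.2):
    """Assign luminosities to halos by rank, with lognormal scatter of width sigma_dex."""
    order = np.argsort(halo_mass)[::-1]                   # most massive halo first
    matched = np.empty_like(luminosity)
    matched[order] = np.sort(luminosity)[::-1]            # brightest galaxy in the most massive halo
    scatter = rng.normal(0.0, sigma_dex, size=matched.size)
    return matched * 10 ** scatter                        # scatter in dex around the mean relation

gal_lum = abundance_match(halo_mass, luminosity, sigma_dex=0.2)
# Changing sigma_dex changes the clustering of luminosity-threshold samples,
# which is what lets small-scale clustering constrain the scatter.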