Blog

Xiaoyi Yang

Khoury College of Computer Sciences

Northeastern University

All the posts are my personal thoughts, ideas and experience.

Lego Art and How your computers learn pictures

I am a huge Lego fan.

Even though my hobby usually makes moving extremely hard, I never stop buying and building more. On the bright side, all my friends know my passion, so it is much easier for them to pick a gift for me.

A couple of months before I left Pittsburgh, one of my friends gave me Andy Warhol's Marilyn Monroe from the Lego Art series. Yes, the one in my office now, which took me half a day to build. Frankly speaking, it is not an exciting piece compared to the rest of my collection, but I do love the theme. The Andy Warhol Museum in Pittsburgh taught me a lot about modern art, and the set is a good souvenir of my time in Pittsburgh.

One morning when I stared at that Lego set, I suddenly realized that the whole Lego Art series, those mosaic pieces, is a really good example of how computers understand pictures. If you know some basic programming, you will know that computers only read two digits, "0" and "1". When we take a digital picture, the computer divides it into small squares and stores it as a grid of pixels, the smallest units of an image. Each pixel stores color information, such as RGB values, coded in binary. I am not sure how the human mind remembers pictures, but computers remember the information through those numbers. The idea of a neural network is to feed the computer a large number of such lists of numbers so that it may be able to recognize a similar one later.
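Here is a minimal sketch of that idea in Python. The 2x2 "mosaic" and its colors are made up for illustration, and the helper names (`pixel_to_bits`, `flatten`) are my own, not part of any library. Real image files add compression and metadata on top of this.

```python
# A tiny 2x2 "mosaic": each pixel is an (R, G, B) triple with values 0-255.
# The colors are invented for illustration only.
mosaic = [
    [(255, 0, 0), (255, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]


def pixel_to_bits(pixel):
    """Encode one RGB pixel as a 24-character string of 0s and 1s."""
    return "".join(format(channel, "08b") for channel in pixel)


def flatten(image):
    """Turn the grid of pixels into one flat list of numbers."""
    return [channel for row in image for pixel in row for channel in pixel]


red_bits = pixel_to_bits(mosaic[0][0])  # the top-left (red) pixel in binary
numbers = flatten(mosaic)               # the kind of list a model is fed

print(red_bits)
print(numbers)
```

The flat list is roughly what gets handed to a model: the picture itself is gone and only the numbers remain.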

See, the world is connected. Even a toy can teach you something.

More importantly, as you can see, data can come in various forms. Pictures, videos, text... All of those things are informative. In data science, we not only run regressions and analyze numerical data matrices, we also think about how to convert information from various sources into numerical form, which is easier to understand and analyze.

Maybe we are truly living in a Matrix!

Statistics and Panda

When I was a kid, I used to argue with my father about whether the panda belongs to the bear family or the cat family (see here for the definition of taxonomic rank in biology), since the Chinese name for the panda translates to something like "bear-cat". In the end, we could not persuade each other, so we turned to the internet for the answer.

Guess what, it belongs to the Panda family!

A similar conversation happened between me and my roommate when we were in college. She was a bio-stats major, and she strongly disagreed when I tried to classify Statistics as a natural science. She claimed that natural science should only cover majors with an actual lab (not a computing one), and she believed Statistics was more like a social science. Again, we looked it up on Wikipedia, and Wikipedia says Statistics belongs to the formal sciences, the same as Math. Great, I learned a new phrase~

People are funny. We try to classify things to help us remember and organize, but in the end we start to argue with each other about the classification itself. Wait, is this even a classification problem?

In most statistical settings, it is not hard to distinguish classification from clustering. If there are pre-defined labels for the groups, it is classification. Otherwise, it is clustering, which only measures the similarities between the subjects. In most cases, we do have a pre-defined structure, whether for living species or for university majors. I am pretty sure there are reasons behind each structure, but is it the universal truth?
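To make the distinction concrete, here is a small sketch in plain Python. The measurements and the "cat"/"bear" labels are invented, and both methods (a nearest-mean classifier and a bare-bones k-means) are just simple stand-ins I picked for illustration, not the only way to do either task.

```python
def mean(values):
    return sum(values) / len(values)


# Made-up 1-D measurements (say, body length in cm), for illustration only.
labeled = [(20.0, "cat"), (25.0, "cat"), (30.0, "cat"),
           (150.0, "bear"), (160.0, "bear"), (170.0, "bear")]


def classify(x, training):
    """Classification: labels are given, so assign x to the nearest label mean."""
    by_label = {}
    for value, label in training:
        by_label.setdefault(label, []).append(value)
    centroids = {label: mean(vals) for label, vals in by_label.items()}
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def cluster(values, k=2, rounds=10):
    """Clustering: no labels, so group values by similarity (needs k >= 2)."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centers across the sorted data.
    centers = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    for _ in range(rounds):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        centers = [mean(g) if g else centers[i] for i, g in enumerate(groups)]
    return groups


new_animal = 120.0
predicted = classify(new_animal, labeled)      # must pick an existing label
groups = cluster([value for value, _ in labeled])  # groups come back unnamed

print(predicted)
print(groups)
```

Notice that `classify` can only ever answer with a label we already have, while `cluster` returns groups with no names at all, so the interpretation is left to us.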

We can only describe the world with our current knowledge of it. Even if the structure at this moment is reasonable for everything we have learned, what if in the next second we find a completely new species that is totally different from everything else? How do we decide whether to stuff it into an existing group or open a new group just for it? With more and more new species coming in, when does the old structure collapse so that we need to build a new one?

From a historical perspective, we have examples like the transition from the geocentric model to heliocentrism. However, from a statistical perspective, I find the problem extremely hard. It is a question of dynamic classification in which the number of groups is also changing. Converting it into a clustering problem may be easier on the algorithm side (there are a lot of studies on how to decide the number of clusters directly from the data), but the result will also be harder to interpret.

I don't have the answer at this moment, but I think it is a really interesting question when the data is not static. Let me know if you have any ideas.

Bayesian and Original sin

I work in a Catholic Jesuit school but I am not religious.

Trust me. I did careful research before applying for the job. Even though I am not religious, I want to be respectful of others' choices. Never lie about yourself to get a job or a relationship.

First, I confess I probably have not learned the Catholic and Biblical traditions well enough, so it is possible that my interpretation is not fully correct. A couple of years ago, I was told that, in general, it is hard for a Chinese person to believe in Catholicism, and one of the reasons is that some parts of it go against our traditional ideas. When I was very little, we all had to read and recite some traditional books and passages. Almost the very first of those is “Human nature is good at the beginning” (Well, I swear that in Chinese it is a really beautiful sentence, but I am not really good at translation). If you have this idea implanted in your mind, it becomes very hard to accept the idea of original sin in the Bible later.

I probably will never be religious, but after accepting the job offer, I started to think about how different schools of thought affect people. The ability to think distinguishes us from animals, and throughout the long history of humanity, new ideas and beliefs have sprung up at every moment. So, how do those ideas affect us?

In Bayesian theory, we learn the posterior from the prior and the data, but with enough data, the data ends up mattering much more than the prior. You can start with different assumptions and update them with more and more evidence and information later. In the end, the assumptions become much less important. It means that the way you assume the world to be is less important than your true experience of interacting with the world. You can assume people come with different purposes. As long as we keep learning about the world, eventually we will become more similar.

We all come from different backgrounds with different priors, but in the end we can all arrive at a reasonable posterior, aka a good person, as long as our experience is positive.
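For readers who want to see the prior wash out, here is a small sketch using the Beta-Binomial model, which is just one convenient textbook choice. The two priors and the "80% positive experiences" data are invented for illustration.

```python
def posterior_mean(prior_a, prior_b, successes, failures):
    """Mean of the Beta(prior_a + successes, prior_b + failures) posterior."""
    a = prior_a + successes
    b = prior_b + failures
    return a / (a + b)


# Two people with opposite assumptions about how often strangers are kind.
pessimist = (1, 9)  # Beta(1, 9), prior mean 0.1
optimist = (9, 1)   # Beta(9, 1), prior mean 0.9

gaps = []
for n in (0, 10, 100, 1000):
    good = n * 8 // 10  # assume 80% of the n encounters go well
    bad = n - good
    low = posterior_mean(*pessimist, good, bad)
    high = posterior_mean(*optimist, good, bad)
    gaps.append(high - low)
    print(n, round(low, 3), round(high, 3))
```

Both people see the same encounters, and the gap between their beliefs shrinks as the evidence accumulates. The caveat is the one in the text: this only works if they keep collecting experience.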

Fighting the majority illusion

During my PhD years, I spent a lot of time on social networks.

Sometimes I try to tell people this is only because my thesis is social network related, but to be honest, I just need something outside my academic life. I love my thesis and my advisors, but sometimes I need a break. Yes, this is similar to those middle-aged men who do love their wives but still want to be alone for a moment in the car, the basement or the bathroom.

In fact, I have made friends with a lot of great people online. Most of them are also PhD students who come from very different backgrounds and work in various fields. We still support each other in every aspect even though most of us have graduated. Let me add a disclaimer here: even though my experience of online socializing is in general positive, it is not always delightful. Please always protect your private information and yourself.

An online social network is more like a high school social network, in which connections are not always mutual. More importantly, the degree distribution is in general highly right-skewed. There are some popular kids who have many more connections than their peers, while most of the rest tend to socialize within a few close friends. In 1991, the sociologist Scott L. Feld observed that, on average, most people have fewer friends than their friends have, which is what we call the "friendship paradox". In 2015, Kristina Lerman and her co-authors found that the friendship paradox can lead to a "majority illusion": people tend to think a behavior is far more common than it is if the behavior comes from those with more connections. For example, suppose your Twitter account follows some of your everyday friends and a couple of famous vegan bloggers. You may suddenly realize that half of the accounts you follow are vegan, so veganism must be the majority diet, even though that is not true. (paper here)
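Both effects show up even in a toy network. The sketch below uses a made-up group of six people with one popular hub "H", and it treats friendships as mutual for simplicity, which is an assumption of mine and not how Twitter works. Only the hub is vegan.

```python
# A made-up friendship network: "H" is the popular kid.
edges = [("H", "A"), ("H", "B"), ("H", "C"), ("H", "D"), ("H", "E"),
         ("A", "B"), ("C", "D")]

friends = {}
for u, v in edges:
    friends.setdefault(u, set()).add(v)
    friends.setdefault(v, set()).add(u)

degree = {node: len(nbrs) for node, nbrs in friends.items()}


def mean_friend_degree(node):
    """Average number of friends that this node's friends have."""
    return sum(degree[f] for f in friends[node]) / degree[node]


# Friendship paradox: who has fewer friends than their friends do on average?
fewer_than_friends = [n for n in friends if degree[n] < mean_friend_degree(n)]

# Majority illusion: only the hub is vegan, but who sees vegans as "half or more"?
vegan = {"H"}


def vegan_share_among_friends(node):
    return len(friends[node] & vegan) / degree[node]


see_vegan_majority = [n for n in friends if vegan_share_among_friends(n) >= 0.5]
true_share = len(vegan) / len(friends)

print(sorted(fewer_than_friends))
print(sorted(see_vegan_majority), round(true_share, 2))
```

In this toy network, five of the six people have fewer friends than their friends do, and the same five see at least half of their friends as vegan, even though only one person in six actually is.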

Believing veganism is popular is not a bad thing, but the majority illusion happens more often when the behaviors are dangerous and seditious. It makes people believe that harmful behaviors, like drinking and drug use, are more common among their peers than they truly are. So, here is the question: as an ordinary person, how do you fight the majority illusion when you are online?

The very first step in fighting an illusion is always recognizing the illusion, which I have already told you about. Second, think while you are reading. I know this is hard, since every day a huge amount of information crashes down on us and it seems easier to just accept it. I ask students to do some extra-credit activities in which they think about whether a piece of news is solid, with a reasonable source, data and analysis methodology. It may take some time to form the habit. Also, try to look at the question at different scopes. Is it still true in your daily life? In the general public? Do I know someone who is totally different from what they say? Every school talks about how to develop the ability to think critically, and the most important part of that is to keep asking questions in your daily life. This is not a trust issue; it is the way we learn about the world.

In the end, please do not blame a knife even if someone uses it in a murder. I don't recommend that people abandon online social apps or even the Internet. View your virtual life as an extension of your real life, not a replacement for it.

Time to go outside for a while!

Probability theorem and Fake news

This is my favorite statistical paradox.

Let's see this question together. Susan is a college student and she is good with math. Which of the following statements is more likely to be true?

Susan is a singer.
Susan's major is Statistics and she is a singer.

The answer is the first one. The reason is that the second statement is included in the first one. Even though the first part of the second statement (a Statistics major) sounds highly likely for someone who is good with math, whenever the first statement is false, the second one cannot be true. In probability, this means P(A) is always greater than or equal to P(A and B).
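If you prefer counting to formulas, here is a small sketch. The head counts are entirely made up (an imaginary group of 100 math-savvy college students). The point is only that everyone counted in "Statistics major and singer" is also counted in "singer".

```python
from fractions import Fraction

# Hypothetical head counts of math-savvy college students, for illustration only.
students = {
    ("statistics", "singer"): 3,
    ("statistics", "not singer"): 47,
    ("other", "singer"): 5,
    ("other", "not singer"): 45,
}
total = sum(students.values())

# A: Susan is a singer (any major).
singers = sum(n for (major, hobby), n in students.items() if hobby == "singer")
p_singer = Fraction(singers, total)

# A and B: Susan is a singer AND her major is Statistics.
p_stats_and_singer = Fraction(students[("statistics", "singer")], total)

print(p_singer, p_stats_and_singer)
```

No matter how you change the four counts, the second probability can never exceed the first, because the second group is a subset of the first.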

The most interesting part is that I was told this is how fake news works. In reality, such a paradox is much harder to detect than in our example here. If I only tell you a ridiculous conclusion, you probably won't believe it. However, if I also tell you a bunch of details that are obviously true before I tell you the conclusion, the ridiculous conclusion now seems more reliable. Yet, as an "and" sentence, every piece of information has to be true for the whole statement to be true.

The story tells us two things. First, if you are the one making the statements, don't let a single act of negligence ruin all your work (In Chinese, we have an idiom that says a piece of mouse shit ruins a pot of porridge. Let me know if there is something similar in English). More importantly, as in the majority illusion post earlier, in the information era, it is sometimes hard to stay rational and logical when you are on the receiving end of that information. You need to realize that not only is each piece of information important, but how the pieces are connected is also decisive. At the very least, you should not read only the headlines of a newspaper.

Learning and forget

Teaching is hard.

I learned this quickly after two months of faculty life. Besides the tons of things I have to prepare every week (why do kids have so much homework...), sometimes it is just hard to explain some concepts clearly. Even though I know those things (Wait, maybe I don't...), it is just hard to say them out loud. My students are very tolerant, but I know I have a lot to improve.

If you have ever studied any theories of learning, you will know that people generally divide the learning process into four steps:

Unconscious Incompetence
Conscious Incompetence
Conscious Competence
Unconscious Competence

However, it is always hard to complete the full cycle within one course. Let me use statistical modeling to explain the process.

Unconscious Incompetence means you have no idea what to do when you receive a dataset, so you do a lot of things and it feels good.
Conscious Incompetence means you realize that you need to fit a model to the data, but when you try a simple linear model, it does not seem to work.
Conscious Competence means you understand the assumptions of the linear model, you can explain why it is not working, and you can even try some other fancy model to improve the performance.
Unconscious Competence means that once you receive a dataset, you can quickly decide on the best modeling approach without carefully checking every assumption, and you can predict which part of the data may lead to a poor fit. You can find the most appropriate feature engineering without trying every option.
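As a small illustration of the middle two steps, the sketch below fits a straight line by ordinary least squares to data that are actually curved, then looks at the residuals. The data (y = x squared) are invented, and `fit_line` is a hand-written helper for this post, not a library function.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Invented data with a curved relationship, so linearity does not hold.
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [x ** 2 for x in xs]

intercept, slope = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

print(intercept, slope)
print(residuals)
```

The residuals are positive at both ends and negative in the middle. Noticing that U-shape is Conscious Incompetence; knowing it means the linearity assumption is violated, and that adding a squared term is worth trying, is Conscious Competence.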

The difference between good students and masters usually lies between steps 3 and 4. One interesting thing I have realized is that the grading scheme is usually based on Conscious Competence, where students have to list their reasoning and write the analysis following a certain template. This is usually a fair approach, since students can clearly see the parts they have missed (if they choose to review the graded homework/project, but...).

However, Unconscious Competence is the final step of learning, and that is why capstones were introduced. When I was in graduate school, a student told me, after finishing a pretty successful capstone project, that the capstone was so different from in-class projects. I think the reason is that in a capstone you have data that may be a total mess. Most of the things you learned in class may not work. However, the good thing is that you now have a whole semester to look at the same data again and again. You also receive feedback every one or two weeks, and that process helps to correct any mistakes.

Once I realized these differences, I decided to mimic some of those measures in my own courses. Instead of having people complete multiple projects in a semester, I slow down the process to allow students to spend more time with their data. In the introductory-level courses, I break the large project down into four smaller pieces:

Find a dataset, do some EDA and give a short presentation of the data
Fit a linear model and write a report
Fit a random forest model and a decision tree model to the data. Compare your models. Do a peer review.
Adjust your final report based on the peer review and complete the final project.

At the advanced level, I am currently updating the strategy to the following:

Provide a dataset with multiple choices for the response variable
Do a peer review and write down what you have learned from each other's work
Try the data with another choice of response variable
Look at your work and decide on the final submission

Both of these designs force students to spend more time thinking about the same problem from different perspectives. So far the introductory-level courses have been a success, and I look forward to continuing to improve the learning strategy in the advanced-level courses.

Maniac killer and Social network

Yes, I understand the title sounds a little scary.

The story starts with my BFF, who accidentally found an online post about an interesting psychological question:

Suppose you are chased by a maniac killer and, unfortunately, you run into a restroom and realize that there is no way out. There are five stalls in this restroom; let's number them from 1 to 5. Number 1 is the one closest to the door and number 5 is the farthest. Question: if you have to choose one stall to hide in, which one are you going to choose, and why?

After spending a whole evening reading 800+ answers online, my BFF shared this question with me. We had different opinions, just like those 800+ unique answers. As in statistical modeling, there is no perfect answer. There is not even a right or wrong. When I shared this question with my collaborator in psychology, she said, you know, different choices actually reflect different personalities.

We the people, not only the psychologists, love to divide people into groups. The Enneagram and MBTI are popular among the younger generation, and rumor has it that some companies even use those tests during interviews. Then I started to have this idea: if we add those psychological groupings to a social network, e.g. our own everyday social network, what can we learn? Can we see patterns in why we like/dislike someone? Are people more likely to connect with the same type or a different type? Are there any connection combinations that are more common than others? This is the background story of my latest student project, so please let me know if you are interested.
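One simple way to start on the "same type or different type" question is to compare how often friends share a type with how often two random people would. The sketch below is only a first pass under made-up data: the names, type labels and friendships are all hypothetical, and a real analysis would need a proper null model.

```python
from itertools import combinations

# Hypothetical people, type labels and friendships, for illustration only.
personality = {"Ann": "INTJ", "Bob": "INTJ", "Cat": "ENFP",
               "Dan": "ENFP", "Eve": "INTJ"}
friendships = [("Ann", "Bob"), ("Ann", "Eve"), ("Bob", "Eve"),
               ("Cat", "Dan"), ("Bob", "Cat")]


def same_type_share(pairs):
    """Fraction of the given pairs whose two members share a type."""
    pairs = list(pairs)
    same = sum(1 for a, b in pairs if personality[a] == personality[b])
    return same / len(pairs)


# Share of actual friendships that connect people of the same type.
observed = same_type_share(friendships)

# Baseline: the same share if every possible pair of people were friends.
baseline = same_type_share(combinations(personality, 2))

print(observed, baseline)
```

If the observed share sits well above the baseline, as it does in this toy data, people are connecting with their own type more than chance alone would suggest.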

How to beat AI painting

The new year is coming, and this year I decided to give my parents a gift.

Four years ago, my parents purchased a super small Teddy Bear dog (less than half the size of my cat...) and soon the dog became their favorite child. Considering the fact that my step siblings and I have spent years overseas, you could even consider the dog their only child.

What I want to do is get a painting of the dog. Two years ago, I met an artist online who did a beautiful painting of my cat Rainne based on the picture I put on my website contact page. I decided to bother him again. However, this time, I didn't really have any specific idea for the painting. My initial idea was to paint the dog as a cute little girl, so I downloaded an app called Wonder, which is a popular one for AI painting. After inputting the dog's picture and the words "a girl in a red dress", I did get a picture that kind of fit my expectations. Well, it is a little weird, like all the other AI paintings.

Then I brought the picture to the artist and described what I wanted. The guy was writing scripts for his next comedy show (he is also a part-time comedian). After seeing the AI painting, he actually quite admired the AI technique. We joked that in his next show, he could say that because of the AI invasion, he had to give up his artist career and become a full-time comedian. Then he told me that he wanted to beat the AI, not by painting a better-looking picture, but by painting a picture with in-depth stories.

Over the next 24 hours, he conducted a series of interviews with me about my parents and the dog. Names, what they like to do, any interesting stories... In the end, we decided to use travel as the theme, since my parents spent a lot of time traveling this year and they always bring the dog with them. The final painting is here.

Great picture. On the other hand, this experience also inspired me to think about AI's role in data science. In the past 5 years, machine learning and deep learning have become so popular in statistics and CS. In academia, every paper wants to include the words "machine learning" in the title even if it just uses a linear model. Ever stronger computational resources now allow us to carry out complicated computations, and in most cases those complicated algorithms do lead to more accurate predictions. In practice, we have seen AI beat people in multiple areas, from chess games to understanding biological structures. If one day we can throw the data and the research question into a machine, and half an hour later the machine returns the best fit to the data, do we still need data scientists? What will the role of a data scientist be?

The future is hard to predict, but at this moment, I want to make a similar choice to my artist friend's. Models and algorithms are tools. They can be really efficient tools, but they are still just tools. We, the humans, can decide whether we want to use a tool based on multiple factors: resources, interpretability, ethical background, and social justice. By bringing those considerations to the model, we can create analyses that are easier to understand and use in practical life.

Another metaphor: while our tech leaders are fascinated by the fancy virtual metaverse, some people may just want to spend their weekend camping in the forest.
