Building Better Cities

Utilizing Computer Vision to perform Urban Analysis

Nikhil Yarram, Roshan Verma, Srinath Srinivasan

CS 639 Final Project: Fall 2021


Goal of the project

In this research project, we will use computer vision to train machine learning models on drone video footage from the VisDrone dataset. Our goal is to perform multi-object tracking, crowd counting, and segmentation on a given scene.

With urban populations growing rapidly, it is becoming progressively more difficult to quantify the effects of urban development plans. Madison, for example, has been one of the fastest-growing cities in Wisconsin, with a 16% population increase over the past decade. Computer vision enables an automated process for evaluating expansion schemes. Moreover, it allows for large-scale estimation of the viability and functionality of proposed changes.

Drone footage can yield invaluable insights when it is processed with the appropriate computer vision methods.


As the world becomes increasingly urban, with more than two-thirds of the global population expected to live in cities in the next 15-30 years, it will become ever more important for elected officials and urban planners to leverage technology to Build Better Cities.

To address this problem, our group will use the VisDrone dataset, available here: https://github.com/VisDrone/VisDrone-Dataset. The dataset contains:

  • 288 video clips, formed by 261,908 frames, and 10,209 static images, all captured by drones

  • The data was collected in 14 different cities in China, covers both urban and country environments, and spans a range of objects and scene densities


This data is well suited to urban analysis, as it gives an accurate depiction of cities and the life inside them. We will be able to analyze usage patterns of infrastructure such as streets, and estimate how many people pass through a certain area and by what means of transportation (for example, walking or biking). We will also perform image segmentation, which gives a visual understanding of how an environment is broken down. By doing this over a time interval, we obtain a temporal urban analysis, which can inform the crafting of city policies and the construction of new cities.
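As a rough illustration of the temporal analysis described above, the sketch below counts how many distinct tracked objects of each category pass through a rectangular region in each time window. It is a minimal example and is not taken from the repo: the `TrackPoint` record, the `usage_by_window` function, the category names, and the fixed-size frame windows are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class TrackPoint(NamedTuple):
    """One tracked object in one frame (hypothetical record layout)."""
    frame: int
    track_id: int
    category: str  # e.g. "pedestrian" or "bicycle" (assumed label names)
    cx: float      # box center, x coordinate in pixels
    cy: float      # box center, y coordinate in pixels


Region = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def usage_by_window(
    points: Iterable[TrackPoint],
    region: Region,
    frames_per_window: int,
) -> Dict[int, Dict[str, int]]:
    """Count distinct track ids per category inside `region`, per time window."""
    x1, y1, x2, y2 = region
    seen: Dict[Tuple[int, str], Set[int]] = defaultdict(set)
    for p in points:
        if x1 <= p.cx <= x2 and y1 <= p.cy <= y2:
            window = p.frame // frames_per_window
            # A set ensures an object seen in many frames is counted once.
            seen[(window, p.category)].add(p.track_id)

    result: Dict[int, Dict[str, int]] = {}
    for (window, category), ids in seen.items():
        result.setdefault(window, {})[category] = len(ids)
    return result
```

Counting distinct track ids rather than raw detections is the design choice that matters here: a pedestrian who stays in the region for 100 frames should count as one person, which is why this kind of analysis depends on tracking rather than on per-frame detection alone.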



Multi-Object Tracking

Predicting the trajectories of various objects (humans, cars, etc.) once their bounding boxes have been defined.
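To make the idea of linking bounding boxes into trajectories concrete, here is a minimal sketch of a greedy tracker that associates boxes across frames by intersection-over-union (IoU). This is a deliberately simple stand-in, not the method implemented in the repo; the class name `GreedyIoUTracker` and the 0.3 threshold are assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


class GreedyIoUTracker:
    """Toy tracker: give each new box the id of the best-overlapping track."""

    def __init__(self, iou_threshold: float = 0.3) -> None:
        self.iou_threshold = iou_threshold
        self.tracks: Dict[int, Box] = {}  # track id -> box from the last frame
        self.next_id = 0

    def update(self, boxes: List[Box]) -> List[int]:
        """Return one track id per box in the current frame."""
        # Score every (track, box) pair and visit the best overlaps first.
        pairs = sorted(
            ((iou(tb, b), tid, i)
             for tid, tb in self.tracks.items()
             for i, b in enumerate(boxes)),
            reverse=True,
        )
        ids = [-1] * len(boxes)
        used_tracks = set()
        for score, tid, i in pairs:
            if score < self.iou_threshold:
                break  # remaining pairs overlap too little to match
            if tid in used_tracks or ids[i] != -1:
                continue
            ids[i] = tid
            used_tracks.add(tid)
        # Boxes that matched no existing track start new tracks.
        for i in range(len(boxes)):
            if ids[i] == -1:
                ids[i] = self.next_id
                self.next_id += 1
        # Keep only tracks seen in this frame (no re-identification here).
        self.tracks = dict(zip(ids, boxes))
        return ids


tracker = GreedyIoUTracker()
print(tracker.update([(0, 0, 10, 10), (50, 50, 60, 60)]))  # [0, 1]
print(tracker.update([(51, 50, 61, 60), (1, 0, 11, 10)]))  # [1, 0]
```

The sequence of boxes that share an id forms that object's trajectory. A practical tracker would also handle occlusion and motion prediction, which this sketch omits.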

Crowd Counting

Identifying the number of people present in a given frame, i.e. at a given moment in the footage.
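One simple way to obtain a per-frame count is to count confident person detections in each frame. The sketch below shows that detection-based approach only; it is not the repo's implementation, and crowd counting is often done with density maps instead. The `Detection` record, the label names in `PERSON_LABELS`, and the 0.5 score threshold are assumptions made for illustration.

```python
from typing import Dict, Iterable, NamedTuple


class Detection(NamedTuple):
    """One detector output (hypothetical record layout)."""
    frame: int
    label: str
    score: float  # detector confidence in [0, 1]


# Assumed label names for people; adjust to the detector actually used.
PERSON_LABELS = {"pedestrian", "people"}


def count_people_per_frame(
    detections: Iterable[Detection],
    min_score: float = 0.5,
) -> Dict[int, int]:
    """Map each frame index to its number of confident person detections."""
    counts: Dict[int, int] = {}
    for det in detections:
        # Register the frame even if it ends up containing no people.
        counts.setdefault(det.frame, 0)
        if det.label in PERSON_LABELS and det.score >= min_score:
            counts[det.frame] += 1
    return counts
```

Frames that contain only vehicles or low-confidence detections still appear in the result with a count of zero, which keeps the time series continuous for later analysis.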


Terrain Segmentation

Given an overhead view of any land footage, perform pixel-level segmentation to classify each pixel into a terrain class, and provide a color-coded output based on the classification.
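The two steps described above, per-pixel classification followed by coloring, can be sketched as follows. The example assumes a model has already produced one score per class for every pixel; the class list and the `PALETTE` colors are hypothetical and are not the ones used in the repo.

```python
from typing import Dict, List, Tuple

Color = Tuple[int, int, int]  # (R, G, B)

# Hypothetical terrain classes and display colors.
PALETTE: Dict[int, Color] = {
    0: (128, 128, 128),  # road
    1: (0, 160, 0),      # vegetation
    2: (0, 0, 200),      # water
    3: (200, 0, 0),      # building
}
UNKNOWN: Color = (0, 0, 0)  # fallback for class ids missing from the palette


def classify_pixels(scores: List[List[List[float]]]) -> List[List[int]]:
    """Pick the highest-scoring class for each pixel of an H x W x C grid."""
    return [
        [max(range(len(px)), key=px.__getitem__) for px in row]
        for row in scores
    ]


def colorize(mask: List[List[int]]) -> List[List[Color]]:
    """Turn an H x W mask of class ids into an H x W grid of RGB colors."""
    return [[PALETTE.get(c, UNKNOWN) for c in row] for row in mask]
```

In practice the scores would come from a segmentation network and would be handled as arrays rather than nested lists; plain lists are used here only to keep the example dependency-free.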

Check out our GitHub repo: https://github.com/roshanverma2001/CS639-Final-Project

The code and implementations for each of our tasks are in the individual branches in the repo.



Project Presentation link:

https://docs.google.com/presentation/d/1hIT3Q8vcFLk5sb9SPPlhR4jhJujhwC4LKCwSqEMALbE/edit?usp=sharing