Introduction to Construction Activity Monitoring

Video recording is becoming ubiquitous on construction sites for project monitoring purposes. Two factors have accelerated the demand for machine understanding of site video to enable better activity analysis: the need to continuously benchmark and improve the amount of time that equipment and craft workers spend on actual construction, and the rate at which site videos are being generated. An automated method for activity analysis allows project management to spend less time assessing the workface and more time on the more important task of identifying opportunities for productivity improvement. However, automating activity analysis using site video streams is not a trivial task. First, it is technically difficult to build a computer vision based method that can segment a jobsite video sequence for each detected resource (e.g., equipment, worker) into temporal parts that each contain a single activity, and then classify those activities. Second, recognizing an activity within each video segment is still challenging due to the high degree of intra-class variability in resources, occlusions, scene clutter, and difficulties in defining visually distinct activities (see Figure below). Third, there is a lack of adequate benchmarking datasets and validation metrics to evaluate vision-based activity analysis algorithms.

To overcome these limitations, research is focusing on developing new automated methods for activity analysis of dynamic construction resources in highly varying, long video sequences obtained from fixed cameras. To be of practical significance, these methods should be capable of analyzing and assigning activity labels at the level of individual video frames in a reasonable time.
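To make the idea of frame-level activity labels concrete, the sketch below collapses a sequence of per-frame labels into temporal segments and totals the time spent on each activity. It is only an illustration under stated assumptions: the function names `frames_to_segments` and `time_per_activity`, and the frame rate passed as `fps`, are hypothetical and not part of any method described here.

```python
from itertools import groupby


def frames_to_segments(frame_labels):
    """Collapse per-frame labels into (label, start_frame, end_frame) runs.

    Frame indices are 0-based and end_frame is inclusive.
    """
    segments = []
    start = 0
    for label, run in groupby(frame_labels):
        length = sum(1 for _ in run)  # number of consecutive frames with this label
        segments.append((label, start, start + length - 1))
        start += length
    return segments


def time_per_activity(segments, fps):
    """Sum the duration in seconds of every segment, grouped by label."""
    totals = {}
    for label, start, end in segments:
        totals[label] = totals.get(label, 0.0) + (end - start + 1) / fps
    return totals


if __name__ == "__main__":
    # Hypothetical per-frame output for a short excavator clip.
    labels = ["Idling"] * 3 + ["Load Bucket"] * 2 + ["Idling"]
    segments = frames_to_segments(labels)
    print(segments)
    print(time_per_activity(segments, fps=2))
```

Totals of this kind are what the benchmarking of time spent on actual construction would consume once reliable frame-level labels are available.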

An end-to-end computer vision based solution for construction activity analysis using video cameras typically involves three main steps: 1) detecting construction resources (equipment and workers) from videos; 2) tracking their location in 2D and/or 3D; and 3) recognizing the time-series of their activities. Over the past few years, many efforts have focused on the task of detecting and tracking construction resources in 2D frames and/or in 3D. While very promising results have been reported on detection and tracking, less attention has been paid to activity recognition, which is more important for construction productivity improvements. As a step toward addressing the problem of identifying sequences of construction activities from a video, several recent methods have focused on inferring construction activities using location information. Using prior knowledge about activity locations on the jobsite, and/or by incorporating accelerometer data, these methods infer the state of the resource activities (e.g., idle vs. non-idle). Still, distinguishing between two activities that may have the same location, for example ``Load Bucket'' versus ``Swing Bucket Loaded'', can be challenging.
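The location-based inference described above can be sketched as follows. This is a minimal illustration and not the method of any particular prior work: it assumes a tracker already supplies one 2D centroid per frame, and the function name `idle_flags` and the pixel threshold `min_move_px` are hypothetical.

```python
import math


def idle_flags(centroids, min_move_px=2.0):
    """Flag each frame-to-frame transition as idle (True) or non-idle (False).

    A transition counts as idle when the tracked centroid moves less than
    min_move_px pixels between two consecutive frames.
    """
    flags = []
    for (x0, y0), (x1, y1) in zip(centroids, centroids[1:]):
        displacement = math.hypot(x1 - x0, y1 - y0)
        flags.append(displacement < min_move_px)
    return flags


if __name__ == "__main__":
    # Hypothetical track: nearly still, then a 5-pixel jump, then still again.
    track = [(0.0, 0.0), (0.0, 0.5), (3.0, 4.5), (3.0, 4.5)]
    print(idle_flags(track))
```

A cue like this only separates moving from stationary resources. An excavator that stays in place while loading and then swinging its bucket can produce nearly identical centroids for both activities, which is why location alone struggles to tell them apart.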

As part of a computer vision based monitoring solution, there is a need for methods that can leverage video sequences captured from construction sites and 'classify' atomic construction activities. By recognizing atomic construction activities (e.g., ``Load Bucket'', ``Swing Bucket Loaded'', ``Dump Bucket'', ``Swing Bucket Empty'', ``Moving'', and ``Idling'' for an excavator), we mean classifying the activities of a single resource (e.g., a piece of equipment) from video sequences wherein each activity is self-contained within a video. In other words, the video starts with one activity of a single resource and ends with the same activity (see Figure 2).
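One way to represent such single-activity clips in code is sketched below. The activity names are the excavator examples listed above; the `ActivityClip` class, its fields, the `label_index` helper, and the file name in the usage example are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass

# Atomic excavator activities named in the text above.
EXCAVATOR_ACTIVITIES = (
    "Load Bucket",
    "Swing Bucket Loaded",
    "Dump Bucket",
    "Swing Bucket Empty",
    "Moving",
    "Idling",
)


@dataclass(frozen=True)
class ActivityClip:
    """One video containing a single resource performing a single activity."""

    video_path: str  # hypothetical path to the clip
    resource: str    # e.g. "excavator"
    activity: str    # must be one of EXCAVATOR_ACTIVITIES

    def __post_init__(self):
        # Reject labels outside the atomic activity set.
        if self.activity not in EXCAVATOR_ACTIVITIES:
            raise ValueError(f"unknown activity: {self.activity!r}")


def label_index(clip):
    """Map a clip's activity name to an integer class index."""
    return EXCAVATOR_ACTIVITIES.index(clip.activity)


if __name__ == "__main__":
    clip = ActivityClip("clip_001.avi", "excavator", "Dump Bucket")
    print(label_index(clip))
```

Because each clip carries exactly one label, recognition reduces to assigning one class per video rather than locating activity boundaries inside it.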

Scope of this machine problem

In this machine problem, we will work on developing supervised machine learning based methods for construction activity analysis from computer vision features captured from video streams that contain one piece of equipment performing a single activity. For training our model, we have collected labeled data from actual construction operations. Our input consists of video sequences recorded using consumer-level cameras, from which we extract visual features that can be fed into learning/inference algorithms. For recording purposes, we assume the camera is set up on a tripod at an approximate distance of 50 to 250 meters so that the resources are within the field-of-view of the camera.
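As a rough picture of the supervised setting, the sketch below trains a nearest-centroid classifier on one feature vector per clip. The choice of classifier is an assumption made purely for illustration, since the text does not prescribe a learning algorithm; the two-dimensional feature vectors are toy values rather than real visual features, and the function names `fit_centroids` and `predict` are hypothetical.

```python
import math


def fit_centroids(features, labels):
    """Compute the mean feature vector of each activity class."""
    sums = {}
    counts = {}
    for x, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def predict(centroids, x):
    """Return the label whose class centroid is closest to x (Euclidean)."""
    return min(centroids, key=lambda y: math.dist(centroids[y], x))


if __name__ == "__main__":
    # Toy training set: one feature vector and one activity label per clip.
    train_x = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
    train_y = ["Idling", "Idling", "Moving", "Moving"]
    model = fit_centroids(train_x, train_y)
    print(predict(model, [1.0, 1.0]))
    print(predict(model, [9.0, 9.0]))
```

In the actual problem, each clip's feature vector would come from the visual features extracted from its video, and any supervised classifier could take the place of the nearest-centroid rule.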