Human Occupancy Prediction

Train Set / Validation Set / Test Set Split

In this section, we describe how we use plug-level and smart meter power consumption data to predict whether anyone is in the house, and we analyze the prediction results obtained with different classification models and data sets.

In our project, we split the data set into a training set, a validation set, and a test set. The training and validation sets are used during training; once training is finished, we run the classifier against the test set to verify that its accuracy is sufficient. Specifically, 80% of the data is used for training and validation, and the remaining 20% is held out as the test set. On the training and validation portion, we perform 10-fold cross-validation and report statistics including the accuracy scores across the 10 folds and the feature importances. On the test set, we apply the trained classifier and report the F1 score, confusion matrix, accuracy score, and feature importances where available.
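To make this procedure concrete, the sketch below shows an 80/20 split followed by 10-fold cross-validation using scikit-learn. The random placeholder data and the choice of Random Forest here are illustrative only; in our pipeline the features are the window statistics described in the next section.

    import numpy as np
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.ensemble import RandomForestClassifier

    # Placeholder feature matrix and occupancy labels (1 = occupied, 0 = empty);
    # the real features are the 15-minute window statistics described below.
    features = np.random.rand(12000, 31)
    labels = np.random.randint(0, 2, size=12000)

    # 80% training/validation, 20% held-out test set
    X_trainval, X_test, y_trainval, y_test = train_test_split(
        features, labels, test_size=0.2, random_state=42)

    # 10-fold cross-validation on the training/validation portion
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    scores = cross_val_score(clf, X_trainval, y_trainval, cv=10)
    print("CV accuracy mean: %.4f, variance: %.6f" % (scores.mean(), scores.var()))

    # Final check on the held-out test set
    clf.fit(X_trainval, y_trainval)
    print("Test accuracy: %.4f" % clf.score(X_test, y_test))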

Feature Selection

We run human occupancy prediction experiments using plug-level and smart meter power consumption data separately.

  • Using plug-level power consumption to predict human occupancy

Plug-level power consumption data is available at Eco_DataSet/house#02/02_plugs_csv. For each of the 12 appliance types, we extract the min, max, range, mean, standard deviation, correlation, and time statistics within a 15-minute window, which yields 12*7 features used as the training features for prediction.
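As an illustration, the sketch below shows how the per-appliance window statistics might be computed, assuming each plug's readings are loaded into a pandas Series of 1 Hz power values indexed by timestamp. The exact definitions of the correlation and time statistics used in the code are our assumptions (correlation of power with its position in the window, and hour of day of the window start).

    import pandas as pd

    def plug_window_features(power: pd.Series, plug_id: int) -> pd.DataFrame:
        """Aggregate one appliance's 1 Hz power readings into 15-minute window statistics."""
        grouped = power.groupby(pd.Grouper(freq="15min"))
        feats = pd.DataFrame({
            "min_%d" % plug_id: grouped.min(),
            "max_%d" % plug_id: grouped.max(),
            "mean_%d" % plug_id: grouped.mean(),
            "std_%d" % plug_id: grouped.std(),
        })
        feats["range_%d" % plug_id] = feats["max_%d" % plug_id] - feats["min_%d" % plug_id]
        # Correlation of the power values with their position in the window (assumed definition)
        feats["cor_%d" % plug_id] = grouped.apply(
            lambda w: w.reset_index(drop=True).corr(pd.Series(range(len(w)))))
        # Hour of day of the window start (assumed definition of the time statistic)
        feats["time_%d" % plug_id] = feats.index.hour
        return feats

Concatenating the per-appliance frames for all 12 plugs column-wise then yields the 12*7 training features.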

  • Using smart meters power consumption to predict human occupancy

Smart meter power consumption data is available at Eco_DataSet/house#02/02_sm_csv and contains total power as well as phase 1, phase 2, and phase 3 power. For each of these four signals, within each 15-minute window, in addition to the min, max, range, mean, standard deviation, and correlation statistics, we also include an onoff feature. On/off events occur when an appliance is switched on or off: if the difference between a sample and its predecessor exceeds a threshold ThA and remains higher than ThA for at least ThT seconds, an on/off event is detected. We set ThA = 30 W and ThT = 30 s. We also calculate the sum of absolute differences between pairs of phases within each 15-minute window, denoted sad12, sad13, and sad23. In total, 31 features ( 4*(min+max+range+mean+std+cor+onoff) + sad12 + sad13 + sad23 ) are used for occupancy prediction.
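The sketch below illustrates the on/off event detection and the sad features; it assumes each phase's readings within a 15-minute window form a 1 Hz numpy array, and the helper names and the exact coding of the ThA/ThT rule are ours.

    import numpy as np

    TH_A = 30.0  # power threshold ThA in watts
    TH_T = 30    # persistence threshold ThT in seconds (= samples at 1 Hz)

    def count_onoff_events(values: np.ndarray) -> int:
        """Count on/off events in one window: a step relative to the previous
        sample larger than TH_A that stays larger than TH_A for TH_T seconds."""
        events = 0
        i = 0
        while i < len(values) - 1:
            if abs(values[i + 1] - values[i]) > TH_A:
                window = values[i + 1:i + 1 + TH_T]
                # the level must stay more than TH_A away from the pre-step sample
                if len(window) == TH_T and np.all(np.abs(window - values[i]) > TH_A):
                    events += 1
                    i += TH_T  # skip past the detected event
                    continue
            i += 1
        return events

    def sad(phase_a: np.ndarray, phase_b: np.ndarray) -> float:
        """Sum of absolute differences between two phases within a window (sad12, sad13, sad23)."""
        return float(np.abs(phase_a - phase_b).sum())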

  • Feature analysis

The features we use for human occupancy prediction are min, max, range, mean, standard deviation, correlation, time, and onoff. With these features we aim to capture both the absolute level and the variability of the electricity consumption.

1. Absolute value of the power consumption

The min, max and mean features capture the absolute level of power consumption in each time slot. A high power consumption may indicate human occupancy.

The sad features (sum of absolute differences) capture the absolute differences between the power phases.

2. Variability of power consumption

Significant power changes are often the result of human actions. We use the std, range, cor, and onoff features to capture this variability.

3. Dependency of occupancy

Time. Occupancy depends heavily on the time of day; for example, the house is much more likely to be occupied in the evening than in the afternoon.

Classification Model Selection

  • K nearest neighbors (KNN): A simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., a distance function).
  • Support vector machine (SVM): A method that constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks.
  • Multilayer perceptron (MLP): A feedforward artificial neural network model that maps sets of input data onto a set of appropriate outputs.
  • Random forests: An ensemble learning method that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes of the individual trees.
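
As a sketch, the four classifiers can be instantiated with scikit-learn as follows; the hyperparameters shown are illustrative defaults, not necessarily the ones used in our experiments.

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC
    from sklearn.neural_network import MLPClassifier
    from sklearn.ensemble import RandomForestClassifier

    classifiers = {
        "KNN": KNeighborsClassifier(n_neighbors=5),                     # distance-based voting
        "SVM": SVC(kernel="rbf"),                                       # hyperplane in kernel space
        "MLP": MLPClassifier(hidden_layer_sizes=(100,), max_iter=500),  # feedforward network
        "Random Forest": RandomForestClassifier(n_estimators=100),      # ensemble of decision trees
    }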


Prediction Results and Evaluations

Using plug-level power consumption to predict human occupancy

Cross-validation results on train/validation set:

To make sure that the classifier does not overfit the data, we divide the data into training, validation, and test sets. We have more than 12,000 samples. First we use 80% of the data to train the model and verify it with 10-fold cross-validation. Then we test the model on the remaining 20% of the data, the test set. The test set provides an unbiased estimate of the classifier's performance on previously unseen data.
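A sketch of this evaluation, reusing the split and the classifiers dictionary from the earlier sketches, is shown below; feature_names is a placeholder for the column names of the feature matrix, used only to report Random Forest feature importances.

    from sklearn.metrics import f1_score, accuracy_score, confusion_matrix

    feature_names = ["feat_%d" % i for i in range(X_trainval.shape[1])]  # placeholder names

    for name, clf in classifiers.items():
        clf.fit(X_trainval, y_trainval)
        y_pred = clf.predict(X_test)
        print(name)
        print("  F1 score:      ", f1_score(y_test, y_pred))
        print("  Accuracy score:", accuracy_score(y_test, y_pred))
        print("  Confusion matrix:\n", confusion_matrix(y_test, y_pred))
        if hasattr(clf, "feature_importances_"):  # Random Forest only
            top3 = sorted(zip(feature_names, clf.feature_importances_),
                          key=lambda p: p[1], reverse=True)[:3]
            print("  Most significant features:", top3)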

The results below show the 10-fold cross-validation performance of MLP, KNN, SVM and Random Forest. Random Forest gives the best prediction performance, with about 87% average accuracy, while the remaining three classifiers perform similarly, with around 70-74% average accuracy.

  • MLP:
    • Mean: 0.738735340387
    • Variance: 1.24969464681e-06
  • KNN:
    • Mean: 0.705162124561
    • Variance: 0.00219734525713
  • SVM:
    • Mean: 0.737848260261
    • Variance: 1.26067878686e-05
  • Random Forest:
    • Mean: 0.874245309898
    • Variance: 0.00130434557418

Prediction results on test set:



  • MLP:
    • F1 score: 0.719834710744
    • Accuracy score: 71.98%
    • Confusion matrix: Shown right



  • KNN:
    • F1 score: 0.810992708918
    • Accuracy score: 72.15%
    • Confusion matrix: Shown right



  • SVM:
    • F1 score: 0.722314049587
    • Accuracy score: 72.23%
    • Confusion matrix: Shown right



  • Random Forest:
    • Three most significant features: time_7 (11.98%), time_12 (8.19%), mean_9 (7.08%)
    • F1 score: 0.909601347558
    • Accuracy score: 86.69%
    • Confusion matrix: Shown right

Using smart meters power consumption to predict human occupancy

Cross-validation results on train/validation set:

We apply the same training and testing process to predict human occupancy using smart meter data. Similarly, Random Forest gives the best performance among the four classifiers, with around 89% average accuracy and 0.0006 variance. The remaining three classifiers give around 74% average accuracy.

  • MLP:
    • Mean: 0.741041875995
    • Variance: 9.39250785765e-08
  • KNN:
    • Mean: 0.7340
    • Variance: 0.00045
  • SVM:
    • Mean: 0.741980136562
    • Variance: 3.58119200187e-06
  • Random Forest:
    • Mean: 0.8890
    • Variance: 0.0006

Prediction results on test set:



  • MLP:
    • F1 score: 0.7175
    • Accuracy score: 71.75%
    • Confusion matrix: Shown right



  • KNN:
    • F1 score: 0.825301204819
    • Accuracy score: 73.42%
    • Confusion matrix: Shown right



  • SVM:
    • F1 score: 0.72125
    • Accuracy score: 72.13%
    • Confusion matrix: Shown right



  • Random Forest:
    • Three most significant features: time_2 (21.38%), max_2 (15.58%) and time_1 (8.94%).
    • F1 score: 0.928509905254
    • Accuracy score: 89.63%
    • Confusion matrix: Shown right

Generalization of the Random Forest Model to house#01 ~ house#05

In this section, we use the Random Forest model trained on house#02 to make predictions for house#01 ~ house#05.

The Random Forest classifier trained on house#02 smart meter data gives the best prediction accuracy. To verify that this classifier can be applied to other houses, we apply it to house#01 through house#05; a sketch of the procedure and the results are shown below.
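A minimal sketch of this generalization check follows; X_house02, y_house02 and the other_houses dictionary are assumed to hold each house's smart meter features and occupancy labels, extracted with the same pipeline as for house#02.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import f1_score, accuracy_score, confusion_matrix

    # Train once on house#02 smart meter features
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(X_house02, y_house02)

    # Apply the same trained model to each of the other houses
    for house, (X_house, y_house) in other_houses.items():
        y_pred = rf.predict(X_house)
        print(house,
              "F1:", f1_score(y_house, y_pred),
              "accuracy:", accuracy_score(y_house, y_pred))
        print(confusion_matrix(y_house, y_pred))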

From the cross-validation accuracy scores, F1 scores, and confusion matrices, we can see that the classifier also works well for house#01 and house#05, but not for house#03 and house#04.



  • House#01
    • Cross-validation accuracy score: 0.870008564124 (mean) +- 0.00577392205639 (var)
    • F1 score: 0.828316953317
    • Accuracy: 72.14%
    • Confusion matrix: Shown right



  • House#02 (The results are as above)
    • Cross-validation accuracy score: 0.8890 (mean) +- 0.0006 (var)
    • F1 score: 0.928509905254
    • Accuracy: 89.63%
    • Confusion matrix: Shown right



  • House#03
    • Cross-validation accuracy score: 0.833767051183 (mean) +- 0.00159029371636 (var)
    • F1 score: 0.756574004508
    • Accuracy: 65%
    • Confusion matrix: Shown right



  • House#04
    • Cross-validation accuracy score: 0.92770465322 (mean) +- 0.000128703396717 (var)
    • F1 score: 0.502538071066
    • Accuracy: 37%
    • Confusion matrix: Shown right



  • House#05
    • Cross-validation accuracy score: 0.878901464357 (mean) +- 0.000645106957656 (var)
    • F1 score: 0.932693795451
    • Accuracy: 87.3%
    • Confusion matrix: Shown right

Comparison

We predict human occupancy using power consumption data from the smart meter and from the plugs separately. The prediction accuracies are almost the same, which leads us to conclude that smart meter data can be used to predict human occupancy without obtaining detailed data from each appliance or plug.

Also, the highest prediction accuracy achieved by our classifiers is close to 90%, which is high enough to help estimate whether anyone is at home, for example to turn appliances on or off for convenience or energy saving.