Forecast the electricity generation in the UK
Introduction
The objective is to forecast electricity generation in the UK using several machine learning algorithms. The data come from the following sources (GenerationData, TemperatureData) and range from 23/02/2015 to 13/06/2019. This is a regression problem in which the electricity generation, in MW, is predicted. Three machine learning algorithms are developed with scikit-learn and compared: decision tree, random forest, and neural network. Google Colab was used as the working environment, with a Python script that pre-processes the data, builds the models, and performs the analysis. The visualisations are produced with the Plotly package.
Data Mining
There is a strong negative correlation, around -0.7, between the electricity generation and the temperature: the lower the temperature, the higher the electricity generation. The generation also varies with the day of the week (0: Monday, 1: Tuesday, ..., 6: Sunday). Graph (2,1) shows that generation is highest on Tuesdays and Thursdays, followed by Wednesdays, then Mondays and Fridays. Saturdays and Sundays are the least demanding days. This difference between weekends and weekdays can be seen on graph (2,2). Generation levels also differ from month to month, which could be due to the temperature and/or the holidays: the summer months require less electricity than the winter months. Days off and holidays are not considered in this dataset. This information might improve the prediction, provided there is enough data for each case. The previous day's consumption cannot be used as an input because the data quality checks remove some days, so the continuity of the days cannot be guaranteed.
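The exploration described above can be sketched with pandas. This is a minimal illustration on synthetic half-hourly data, not the project's dataset: the column names `Sum` and `normal` follow the names used in this document, while the generated values, the date range, and the variable names are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic half-hourly data (illustrative only, not the real dataset).
rng = np.random.default_rng(0)
index = pd.date_range("2018-01-01", periods=48 * 28, freq="30min")
temperature = 10 + 8 * np.sin(np.arange(len(index)) / 200) + rng.normal(0, 1, len(index))
generation = 40000 - 900 * temperature + rng.normal(0, 3000, len(index))
df = pd.DataFrame({"Sum": generation, "normal": temperature}, index=index)

# Pearson correlation between generation and temperature.
corr = df["Sum"].corr(df["normal"])

# Calendar features: 0 = Monday ... 6 = Sunday, as in the text.
df["day"] = df.index.dayofweek
df["weekend"] = (df["day"] >= 5).astype(int)
df["month"] = df.index.month

# Mean generation per day of the week and for weekdays vs weekends.
mean_by_day = df.groupby("day")["Sum"].mean()
mean_by_weekend = df.groupby("weekend")["Sum"].mean()

print(f"correlation: {corr:.2f}")
print(mean_by_day)
print(mean_by_weekend)
```

On the real data, the same `groupby` calls applied to the month column would give the seasonal comparison mentioned above.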
Machine learning
Three machine learning algorithms are developed and compared. The output to predict is the electricity generation (MW) for each 30-minute interval. The inputs are: sequence (30-minute sampling rate), day (0,1,2,3,4,5,6), weekend (0,1), month (1 to 12), normal (normal temperature), low (low temperature), high (high temperature). The data are split into training and test datasets with a ratio of 0.8 to 0.2.
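The input layout and the 0.8/0.2 split can be sketched as follows. The data are synthetic, and two points are assumptions rather than facts from the project: `sequence` is taken here to be the half-hour slot within the day (0 to 47), and `random_state` is fixed only to make the example reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic half-hourly frame (illustrative only).
rng = np.random.default_rng(1)
index = pd.date_range("2018-01-01", periods=48 * 14, freq="30min")
df = pd.DataFrame(index=index)

# Assumption: 'sequence' is the half-hour slot within the day (0..47).
df["sequence"] = df.index.hour * 2 + df.index.minute // 30
df["day"] = df.index.dayofweek               # 0 = Monday ... 6 = Sunday
df["weekend"] = (df["day"] >= 5).astype(int)
df["month"] = df.index.month
df["normal"] = rng.normal(10, 3, len(df))    # made-up temperatures
df["low"] = df["normal"] - 4
df["high"] = df["normal"] + 4
df["Sum"] = 35000 - 800 * df["normal"] + rng.normal(0, 2000, len(df))

features = ["sequence", "day", "weekend", "month", "normal", "low", "high"]
X, y = df[features], df["Sum"]

# 80 % training, 20 % test, shuffled at random.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)
```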
Decision Tree
The left-hand graph compares the real electricity generation ('Sum') with the prediction. Ideally, the scatter plot should follow a straight line. The right-hand graph compares the real generation data with the prediction for 23/08/2018 (a day selected at random). The prediction follows the real data very closely for this specific day. This is not necessarily a good thing: in this example it is a sign of overfitting.
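One common way to check for overfitting is to compare the score on the training data with the score on the test data. The sketch below does this on synthetic data with scikit-learn's `DecisionTreeRegressor`, once fully grown and once with `max_depth=5`. The depth limit and the data are assumptions chosen for illustration, not settings taken from this project.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy daily profile (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 48, size=(2000, 1))
y = 30000 + 5000 * np.sin(X[:, 0] / 48 * 2 * np.pi) + rng.normal(0, 1500, 2000)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scores = {}
for depth in (None, 5):  # None = grow the tree until the leaves are pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    scores[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"max_depth={depth}: train R2={scores[depth][0]:.3f}, "
          f"test R2={scores[depth][1]:.3f}")
```

A large gap between the training and test scores suggests that the tree has memorised noise in the training set.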
Random Forest
The left-hand graph compares the real electricity generation ('Sum') with the prediction. Ideally, the scatter plot should follow a straight line. The right-hand graph compares the real generation data with the prediction for 23/08/2018 (a day selected at random). The prediction follows the real data fairly well for this specific day. Increasing the number of trees increases both the computation time and the prediction accuracy. Beyond n_estimators = 100 the accuracy no longer improves significantly, hence 100 trees are selected.
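The trade-off between the number of trees, the accuracy, and the computation time can be explored with a simple sweep over `n_estimators`. The sketch below uses synthetic data, and the list of tree counts is an assumption for the example. Only the final choice of 100 trees comes from the text above.

```python
import time

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data: half-hour slot and temperature (illustrative only).
rng = np.random.default_rng(0)
slot = rng.integers(0, 48, 1500)
temp = rng.normal(10, 4, 1500)
X = np.column_stack([slot, temp])
y = (30000 + 5000 * np.sin(slot / 48 * 2 * np.pi)
     - 800 * temp + rng.normal(0, 1500, 1500))
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scores = {}
for n in (1, 10, 50, 100):
    start = time.perf_counter()
    forest = RandomForestRegressor(n_estimators=n, random_state=0)
    forest.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    scores[n] = forest.score(X_test, y_test)
    print(f"n_estimators={n:3d}: test R2={scores[n]:.3f}, fit time={elapsed:.2f}s")
```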
Neural Network
The left-hand graph compares the real electricity generation ('Sum') with the prediction. Ideally, the scatter plot should follow a straight line. The right-hand graph compares the real generation data with the prediction for 23/08/2018 (a day selected at random). The prediction follows the general trend of the real data for this specific day, but the model underfits. Increasing the number of layers increases the prediction accuracy. Reducing the number of neurons per layer closer to the output gives better predictions.
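A network whose layers shrink towards the output can be written with scikit-learn's `MLPRegressor` through the `hidden_layer_sizes` parameter. The sketch below is an illustration under assumptions: the layer sizes (64, 32, 16), the scaling of inputs and target with `StandardScaler`, and the synthetic data are all choices made for this example and are not stated in this document.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: half-hour slot and temperature (illustrative only).
rng = np.random.default_rng(0)
slot = rng.integers(0, 48, 2000)
temp = rng.normal(10, 4, 2000)
X = np.column_stack([slot, temp])
y = (30000 + 5000 * np.sin(slot / 48 * 2 * np.pi)
     - 800 * temp + rng.normal(0, 1500, 2000))
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Three hidden layers, narrowing towards the output (assumed sizes).
network = Pipeline([
    ("scale", StandardScaler()),
    ("mlp", MLPRegressor(hidden_layer_sizes=(64, 32, 16),
                         max_iter=500, random_state=0)),
])

# Assumption: the target is standardised as well, because generation
# values are in the tens of thousands of MW.
model = TransformedTargetRegressor(regressor=network, transformer=StandardScaler())
model.fit(X_train, y_train)

test_r2 = model.score(X_test, y_test)
print(f"test R2: {test_r2:.3f}")
```

Neural networks are sensitive to the scale of their inputs, so comparing the scores with and without scaling is one way to investigate an underfitted model.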
Comparison
Comparison of metrics for each machine learning algorithm
| Metric | Decision Tree | Random Forest | Neural Network |
| --- | --- | --- | --- |
| R² (coefficient of determination) | 0.963 | 0.978 | 0.845 |
| Mean Absolute Error (MW) | 730 | 573 | 1919 |
| Mean Squared Error (MW²) | 1469355 | 847535 | 6205901 |
| Root Mean Squared Error (MW) | 1212 | 920 | 2491 |
The neural network model has the worst performance. The decision tree and the random forest perform similarly in terms of absolute error, although the random forest slightly outperforms the decision tree.
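The four metrics compared above are available in `sklearn.metrics`, with the root mean squared error obtained as the square root of the mean squared error. The sketch below uses a handful of made-up values purely to show the calls; the numbers are not taken from the project.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up generation values in MW (illustrative only).
y_true = np.array([30000.0, 32000.0, 35000.0, 31000.0])
y_pred = np.array([30500.0, 31000.0, 35500.0, 31000.0])

r2 = r2_score(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)   # same unit as the target (MW)
mse = mean_squared_error(y_true, y_pred)    # squared unit (MW^2)
rmse = np.sqrt(mse)                         # back to MW

print(f"R2={r2:.3f}  MAE={mae:.0f}  MSE={mse:.0f}  RMSE={rmse:.0f}")
```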
One week of completely new data is used to evaluate the performance of each model. This highlights that the models are not good enough to predict the electricity generation accurately. The error can come from the data, which might be insufficient in quantity and quality. Moreover, the electricity generation depends on more information than temperature, day, and month. This additional information must be identified and taken into account to improve the model accuracy. Finally, the model parameters and training (especially for the neural network) could be tuned better. However, despite the suboptimal accuracy, the two generation peaks (morning and evening) and the general shape are always predicted.
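Keeping a final week aside for this kind of evaluation can be done with a date cutoff rather than a random split. The sketch below is one possible way to do it on synthetic data. The dates, the variable names, and the choice of the last seven days are assumptions for the example and do not describe how the week used here was selected.

```python
import numpy as np
import pandas as pd

# Synthetic half-hourly series covering 35 days (illustrative only).
index = pd.date_range("2019-05-01", periods=48 * 35, freq="30min")
rng = np.random.default_rng(0)
df = pd.DataFrame({"Sum": rng.normal(30000, 3000, len(index))}, index=index)

# Hold out the last 7 calendar days. The models never see them in training.
cutoff = df.index.max().normalize() - pd.Timedelta(days=6)
train = df[df.index < cutoff]
holdout = df[df.index >= cutoff]

print(train.index.min(), "to", train.index.max())
print(holdout.index.min(), "to", holdout.index.max())
```

Because the held-out week lies entirely after the training period, its score reflects how the model behaves on unseen days rather than on shuffled samples from days it has already seen.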
Last update: 11/06/2020