ASHRAE — Great Energy Predictor III: A Machine Learning Self Case Study
I highly recommend checking out the Heroku web app for this study, “ASHRAE great energy predictor”, and the “Github repository” for the full code.
Table Of Contents
- ASHRAE — Great Energy Predictor III: A Machine Learning Case Study Introduction
- About ASHRAE
- Problem Statement
- About The Dataset
- Evaluation Metric
- Challenges With The Dataset
- Exploratory Data Analysis
- First Cut Approach
- Experimentations And Observations
- Data Pre-processing
- Cross-Validation Set, Hyper-parameter Tuning, And Modelling
- Submission Results
- Deploying Streamlit App On Heroku
- Future Work/ Tips To Improve The Score
- EndNote
- References
ASHRAE — Great Energy Predictor III: A Machine Learning Case Study Introduction
Did you know buildings breathe so that you can breathe fresh air? This might sound funny, but they do. The building you are in right now, whether a workspace or any private/commercial building, has some kind of HVAC system installed. It ensures you get fresh air to breathe, thermal comfort, and balanced humidity, which helps you stay healthy (prevent diseases), be in a good mood (which prevents you from getting into a fight ;) ), and be productive (who wants to live or work in a burning building or a freezing one? (let it go :p)).
So What Is “HVAC”?
HVAC is the acronym for Heating, Ventilating, and Air Conditioning. It is a system installed in a building to achieve acceptable indoor air quality by letting outside air in and indoor air out, which ensures the people inside breathe fresh air and don’t get sick. It also provides thermal comfort, keeping the environment neither too warm (which may make people feel tired) nor too cold (which may distract people and make them restless). A proper HVAC system reduces power consumption (the cost of living should not be too high) while serving a comfortable indoor environment to live in.
If you have visited a shopping mall or shopping center, you would have seen big pipes and a centralized air conditioning system; this is a good example of a large HVAC system. If you have an AC, a cooler, or a room heater, congratulations, those are HVAC systems too.
Heating: Heating systems keep us warm when the environment is too cold (in winter). They raise the temperature to meet our comfort.
Ventilation: Ventilation systems let outdoor air in and indoor air out. They are responsible for the distribution and flow of fresh air inside the built environment.
Air conditioning: Air conditioning cools down a space by removing heat from it and moving the heat to some outside area. It is used to control room temperature and humidity to create more comfortable indoor conditions.
About ASHRAE
The official ASHRAE website says:
“ASHRAE, founded in 1894, is a global society advancing human well-being through sustainable technology for the built environment. The Society and its members focus on building systems, energy efficiency, indoor air quality, refrigeration and sustainability within the industry. Through research, standards writing, publishing and continuing education, ASHRAE shapes tomorrow’s built environment today. ASHRAE was formed as the American Society of Heating, Refrigerating and Air-Conditioning Engineers by the merger in 1959 of American Society of Heating and Air-Conditioning Engineers (ASHAE) founded in 1894 and The American Society of Refrigerating Engineers (ASRE) founded in 1904.”
So, in a nutshell, ASHRAE (The American Society of Heating, Refrigerating and Air-Conditioning Engineers) is an American professional association seeking to advance the design and construction of heating, ventilation, air conditioning, and refrigeration systems.
ASHRAE’s Mission and Vision
Mission: To serve humanity by advancing the arts and sciences of heating, ventilation, air conditioning, refrigeration, and their allied fields.
Vision: A healthy and sustainable built environment for all.
How Does ASHRAE Achieve Its Mission And Vision Of Sustainability?
ASHRAE works in the following ways to provide a sustainable built environment:
- Through research
- Standards writing
- Publishing
- Conferences
- Continuing education
- Certification
Problem Statement (A Regression Problem)
The problem we are trying to solve is one of the competitions hosted by ASHRAE on the Kaggle platform.
Question: “How much does it cost to cool a skyscraper in the summer?”
Answer: “A lot! And not just in dollars, but in environmental impact.”
Significant investments are being made to improve building efficiencies to reduce costs and emissions. The question is, are the improvements working?
To answer the above question, we are given a dataset by the ASHRAE community. We need to develop accurate models of metered building energy usage in the following areas: chilled water, electric, hot water, and steam meters. The data comes from over 1,000 buildings over a three-year timeframe.
About The Dataset
Assessing the value of energy efficiency improvements can be challenging as there’s no way to truly know how much energy a building would have used without the improvements. The best we can do is to build counterfactual models. Once a building is overhauled the new (lower) energy consumption is compared against modeled values for the original building to calculate the savings from the retrofit. More accurate models could support better market incentives and enable lower-cost financing.
The dataset includes three years of hourly meter readings from over one thousand buildings at several different sites around the world.
You can download the data from here: Data Source
Files
train.csv
- building_id - Foreign key for the building metadata.
- meter - The meter id code, read as {0: electricity, 1: chilledwater, 2: steam, 3: hotwater}. Not every building has all meter types.
- timestamp - When the measurement was taken.
- meter_reading - The target variable. Energy consumption in kWh (or equivalent). Note that this is real data with measurement error, which we expect will impose a baseline level of modeling error.
building_meta.csv
- site_id - Foreign key for the weather files.
- building_id - Foreign key for train.csv.
- primary_use - Indicator of the primary category of activities for the building, based on EnergyStar property type definitions.
- square_feet - Gross floor area of the building.
- year_built - Year the building was opened.
- floor_count - Number of floors of the building.
weather_[train/test].csv
Weather data from a meteorological station as close as possible to the site.
- site_id
- air_temperature - Degrees Celsius.
- cloud_coverage - Portion of the sky covered in clouds, in oktas.
- dew_temperature - Degrees Celsius.
- precip_depth_1_hr - Millimetres.
- sea_level_pressure - Millibars/hectopascals.
- wind_direction - Compass direction (0-360 degrees).
- wind_speed - Metres per second.
test.csv
The submission files use row numbers for ID codes in order to save space on the file uploads. test.csv has no feature data; it exists so you can get your predictions into the correct order.
- row_id - Row id for your submission file.
- building_id - Building id code.
- meter - The meter id code.
- timestamp - Timestamps for the test data period.
sample_submission.csv
A valid sample submission.
- All floats in the solution file were truncated to four decimal places; we recommend you do the same to save space on your file upload.
- There are gaps in some of the meter readings for both the train and test sets. Gaps in the test set are not revealed or scored.
Evaluation Metric
As we are dealing with a regression problem, we will use a metric similar to squared error: a slight modification of it called Root Mean Squared Logarithmic Error (RMSLE).
By taking the log of the readings and the square root of the mean, we compress the wide range of target values, making the loss function a little more interpretable.
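The metric can be sketched in a few lines of NumPy (a minimal implementation for illustration, not the competition’s official scoring code):

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error.

    log1p compresses the huge range of meter readings so that
    large meters do not dominate the loss; the square root brings
    the result back to the (log) scale of the target.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))
```

Note that predicting log1p(meter_reading) and optimizing plain RMSE on that transformed target is equivalent to optimizing RMSLE on the raw target, which is why the log transform shows up again in the pre-processing section.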
Challenges With The Dataset
- Huge dataset: the train set consists of 20216100 rows (20.21 million, year 2016) and the test set consists of 40+ million rows (2017 and 2018 combined), which is not memory friendly at all. In such cases we must not keep unneeded data in main memory, otherwise memory will be exhausted and the process will be killed.
- Data leaks: target data for part of the test period is already available on the internet (links are given in the References section).
* Note: we have not used the leaked data here
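One common mitigation for the memory problem (a sketch of a trick widely used in this competition’s public kernels, not the exact code from this study) is to downcast every numeric column to the smallest dtype that can hold its value range:

```python
import numpy as np
import pandas as pd

def reduce_mem_usage(df):
    """Downcast numeric columns to smaller dtypes in place.

    On a ~20M-row frame this can cut memory use by more than half,
    since pandas defaults to int64/float64.
    """
    for col in df.select_dtypes(include=[np.number]).columns:
        col_min, col_max = df[col].min(), df[col].max()
        if pd.api.types.is_integer_dtype(df[col]):
            # pick the narrowest integer type that fits the value range
            for dtype in (np.int8, np.int16, np.int32):
                if np.iinfo(dtype).min <= col_min and col_max <= np.iinfo(dtype).max:
                    df[col] = df[col].astype(dtype)
                    break
        else:
            # float32 is precise enough for meter readings and weather
            df[col] = df[col].astype(np.float32)
    return df
```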
Exploratory Data Analysis
Exploratory Data Analysis (EDA) is the initial and crucial phase of any analysis task. In this phase, we get a first glimpse of our data, which gives us a better understanding for building a proper roadmap. We use mathematical, statistical, and visualization methods to better understand the context, test hypotheses, and form assumptions.
Data Loading, Analysing, And Merging
As we have 3 data files (train.csv, building_metadata.csv, weather_train.csv), we will load each as a pandas data frame, do EDA on each, and finally merge them all to get a single data frame for modeling.
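The merge order described above can be sketched as follows; the tiny frames here are stand-ins for the real CSVs, just to show the join keys (building_id for the metadata, then site_id plus timestamp for the weather):

```python
import pandas as pd

# toy stand-ins for the three CSVs, containing only the join keys
# plus one sample column each
train_df = pd.DataFrame({"building_id": [0], "meter": [0],
                         "timestamp": ["2016-01-01 00:00:00"],
                         "meter_reading": [12.5]})
building_meta_df = pd.DataFrame({"site_id": [0], "building_id": [0],
                                 "primary_use": ["Education"],
                                 "square_feet": [7432]})
weather_train_df = pd.DataFrame({"site_id": [0],
                                 "timestamp": ["2016-01-01 00:00:00"],
                                 "air_temperature": [25.0]})

# train -> building metadata on building_id,
# then -> weather on (site_id, timestamp)
merged = (train_df
          .merge(building_meta_df, on="building_id", how="left")
          .merge(weather_train_df, on=["site_id", "timestamp"], how="left"))
```

Left joins keep every meter reading even when a weather row is missing, which matters here because the weather file has gaps.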
Load the data
%%time
import pandas as pd

dir_ = '../input/ashrae-energy-prediction/'
# read the data files into memory
train_df = pd.read_csv(dir_ + 'train.csv')
weather_train_df = pd.read_csv(dir_ + 'weather_train.csv')
building_meta_df = pd.read_csv(dir_ + 'building_metadata.csv')
EDA On “train.csv” File
Getting basic information about the data:
# let's see what features and data types the train set contains
train_df.head()
'''
print basic information about the train data
'''
key_list = ["Min date value in the train set",
            "Max date value in the train set",
            "No of unique buildings",
            "No of rows in the train set",
            "Min meter reading in the train set",
            "Max meter reading in the train set",
            "No of unique site ids (locations)"]
value_list = [train_df.timestamp.min(),
              train_df.timestamp.max(),
              train_df.building_id.nunique(),
              len(train_df),
              train_df.meter_reading.min(),
              train_df.meter_reading.max(),
              building_meta_df.site_id.nunique()]  # site_id lives in the building metadata file
for key, value in zip(key_list, value_list):
    print(key, value)
Observation
- We have details about the building id (a number used to identify buildings), the meter type {0: electricity, 1: chilled water, 2: steam, 3: hot water}, the timestamp (the time at which the reading was recorded), and the meter reading (our target variable).
- Since we are given timestamps, we are dealing with time-series data.
- The train dataset contains hourly meter reading entries (4 meter types) for the year 2016 (366 days) for 1449 unique buildings (the total would be 366 * 24 * 4 * 1449 = 50912064 rows if every building had every meter).
- We actually have 20,216,100 meter reading entries. The reason we have 20M records and not 50M is that not all buildings have all meter types.
- The minimum meter reading is 0 and the maximum is 21904700.0 (the maximum is probably an outlier).
- There are a total of 16 different sites.
- We also have meter reading values of “0”, which is quite unusual. When would a meter read 0? It turns out there could be plenty of reasons.
- Power outage: this could be one reason for a meter reading of 0, though I believe it could also be marked as “nan” (missing) because there is no reading to record.
- Seasonal reasons: we have 4 meter types (0: electricity, 1: chilled water, 2: steam, 3: hot water). A zero reading on any of the last 3 meters (excluding the electricity meter, since some electricity is always used) might just mean chilled water isn’t used in the winter, or hot water and steam devices aren’t used in the summer.
- Closed building, under construction, or under maintenance: this could be another situation producing a meter reading of 0.
- Error in the measuring instrument (a fault in the meter itself): this could be yet another reason, where there is a glitch in the instrument itself.
Looking at the building which has a meter reading of 21904700.0 (an outlier)
# get the building id of the building where the meter reading is 21904700.0
train_df.building_id[train_df.meter_reading == 21904700.0]
import plotly
from plotly import graph_objs as go
from plotly.offline import iplot

plotly.offline.init_notebook_mode(connected=True)
'''
building 1099 has only two meter types, 0 and 2
'''
for meter_id in range(0, 3, 2):  # meters 0 and 2
    # set the data
    bid = 1099
    d1 = go.Scatter(
        x = train_df.timestamp[(train_df.meter == meter_id) & (train_df.building_id == bid)],
        y = train_df.meter_reading[(train_df.meter == meter_id) & (train_df.building_id == bid)],
        mode= "lines"
    )
    # set the appearance
    layout = dict(width = 800,
                  height= 400,
                  title = f'Distribution of meter {meter_id} reading for Building {bid} over time',
                  xaxis= dict(title= 'Date', ticklen= 1, zeroline= False),
                  yaxis= dict(title= f'meter {meter_id} reading', ticklen= 1, zeroline= False))
    fig = dict(data = [d1], layout = layout)
    # plot the data
    iplot(fig)
Observation
- For the electricity meter (meter 0), we see a normal pattern; nothing suspicious.
- For the steam meter (meter 2), readings after around March are in the millions, which is quite bizarre. We will treat these as outliers and do one of two things: 1) remove them, or 2) winsorize them (clip using the mean or some other averaging method).
Distribution of the target variable (electricity meter reading) across 20 randomly selected buildings
# sample building ids
sample_bid = []
for i in range(20):
    # pick a building id at random
    bid = train_df.building_id.sample().values[0]
    sample_bid.append(bid)
    d1 = go.Scatter(
        x = train_df.timestamp[(train_df.meter == 0) & (train_df.building_id == bid)],
        y = train_df.meter_reading[(train_df.meter == 0) & (train_df.building_id == bid)],
        mode= "lines"
    )
    layout = dict(width = 800,
                  height= 400,
                  title = 'Distribution of electricity meter reading for Building {} over time'.format(bid),
                  xaxis= dict(title= 'Date', ticklen= 1, zeroline= False),
                  yaxis= dict(title= 'meter 0 reading', ticklen= 1, zeroline= False))
    fig = dict(data = [d1], layout = layout)
    iplot(fig)  # plot inside the loop so every sampled building is shown
* NOTE: as there were too many graphs and they would take a lot of space, I have decided not to show the output for the above code, but the “Observation” section should make the findings clear.
Observation
- For some buildings, there are zero readings on the electricity meter for many consecutive days (and then a sudden spike).
- There are cases where we have the exact same reading for multiple days in a row.
All of these seem unusual to me. We will try to use time-series imputation techniques to fix these holes.
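One way such holes could be patched (a sketch of the idea only; as discussed later in the Experimentations section, in the end letting LightGBM handle missing values by itself worked best for me) is to mark the implausible zeros as missing and interpolate over time, per building:

```python
import numpy as np
import pandas as pd

# hourly readings for one building, with a hole of zeros in the middle
s = pd.Series([10.0, 0.0, 0.0, 0.0, 12.0, 11.0],
              index=pd.date_range("2016-01-01", periods=6, freq="h"))

# mark the zeros as missing, then fill them by time-based interpolation
cleaned = s.replace(0.0, np.nan).interpolate(method="time")
```

For a real pipeline you would apply this per (building_id, meter) group and only to runs of zeros you actually believe are errors, not legitimate seasonal shut-offs.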
Draw heatmap (of zero meter reading count) to see the big picture
Observation
- The above heatmaps for all the meter types show the following patterns:
- Yellow shows a high count of zero meter readings
- A vertical yellow line shows consecutive buildings having zero meter readings (assuming buildings with consecutive ids are close to each other (neighboring))
- A horizontal yellow line shows the same building having zero meter readings for consecutive days
- Buildings in non-yellow colors contain no zero meter readings
For meter 0, there are consecutive buildings that have zero readings from day 1 to day 14 (they also share the same site id, i.e. are located at the same location)
- Meters 1, 2, and 3 (chilled water, steam, and hot water meters) have many horizontal yellow lines (zero meter readings), which shows those devices are simply not being used. This is normal, as not many people use chilled water in the winter or hot water in the summer.
- Zero values for the meter reading (especially for the electricity meter) could be problematic for the model while learning. We will try to remove buildings from the time frame when the meter reading is 0 for many consecutive days (treating them as outliers)
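The removal described in the last bullet could be sketched like this; the 48-hour threshold and the helper name are my own illustrative choices, not taken from the original pipeline:

```python
import pandas as pd

def drop_long_zero_runs(df, max_run=48):
    """Drop rows that sit inside a run of >= max_run consecutive
    zero readings for the same (building_id, meter) pair.
    Columns follow train.csv: building_id, meter, timestamp, meter_reading.
    """
    df = df.sort_values(["building_id", "meter", "timestamp"])
    is_zero = df["meter_reading"].eq(0)
    # start a new run whenever zero-ness flips or the (building, meter) group changes
    run_start = (df["building_id"].ne(df["building_id"].shift()) |
                 df["meter"].ne(df["meter"].shift()) |
                 is_zero.ne(is_zero.shift()))
    run_id = run_start.cumsum()
    run_len = is_zero.groupby(run_id).transform("size")
    # keep everything except zeros belonging to a long run
    return df[~(is_zero & (run_len >= max_run))]
```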
Distribution of each type of meter reading
import matplotlib.pyplot as plt
import seaborn as sns

fig, axs = plt.subplots(ncols= 2, nrows= 2, figsize=(10, 7), facecolor='w', edgecolor='k')
# pad each plot
fig.tight_layout(pad=4.0)
row = 0
col = 0
# meter id -> name, following {0: electricity, 1: chilledwater, 2: steam, 3: hotwater}
m_type = ["electricity", "chilled water", "steam", "hot water"]
for i in range(4):
    if i == 2:
        row += 1
        col = 0
    # "target" is the meter_reading column (renamed earlier)
    sns.distplot(train_df.target[train_df.meter == i], ax= axs[row][col])
    axs[row][col].title.set_text(f'Distribution of {m_type[i]} meter')
    col += 1
Observation
- All the meter types contain a large number of zero readings, which is fine in every case except the electricity meter (we have discussed measures to tackle this under the plots above).
Let us see if there is any relationship between building id and site id
# scatter plot to capture a relationship (if any exists) between site id and building id
sns.relplot(x= "building_id", y= "site_id", hue= "site_id", data= building_meta_df, legend= "full")
Observation There is a positive relation between site id and building id. What does it mean?
It means that buildings whose ids are close are actually close to each other (building id 45 is much closer to building id 47 than to building id 100 or building id 1). Why is this important?
If any site goes down (for some reason, maybe a power outage or a natural disaster), there is a high probability that the meter readings of buildings located at that site will be affected similarly (because the buildings are close to each other).
It also seems building ids were assigned sequentially, one after another.
You can also see most of the buildings are from site 3, and hence the majority of the train data comes from site 3.
Count the unique building in a particular site
# count of unique buildings per site
building_count = building_meta_df.groupby("site_id").agg({"building_id": "count"}).reset_index()
fig, ax = plt.subplots()
building_count.plot.bar(x= "site_id", y= "building_id", rot=0, ax= ax)
ax.get_legend().remove()
plt.title("Count of buildings site wise")
Observation
- The majority of the buildings are from site 3, followed by site 13.
Meter type count
# count plot of meter types
sns.countplot(train_df.meter)
Median of meter reading week wise per site
'''
for each site id, compute the median of the meter reading across weeks
'''
# dictionary to hold site ids and the building ids of each site
site_id_df = {}
# for each site id
for i in range(16):
    # get all the building ids located at the ith site
    site_id_df["site_id_"+str(i)] = building_meta_df.building_id[building_meta_df.site_id == i]
    # get meter readings of all the buildings located at the current site (site i)
    site_df = train_df[train_df.building_id.isin(site_id_df["site_id_"+str(i)].values)]
    # group the data by building id, week, and meter
    building_week_groupby = site_df[["building_id", "week", "meter", "target"]].groupby(["building_id", "week", "meter"])
    # compute the median of the meter reading
    building_median_week = building_week_groupby.agg({"target": "median"}).reset_index()
    # update the column name
    building_median_week.rename(inplace=True, columns= {"target": "median_week"})
    # create a new plot
    plt.figure()
    sns.pointplot(x= "week", y= "median_week", hue= "meter", data= building_median_week).set_title("site id " + str(i))
    plt.ylabel("median of meter reading")
    plt.show()
* NOTE: As all the graphs would take a lot of space, we only plot graphs for a few sites
Observations These plots give a good picture of how meter readings behave at different sites.
- There is almost always an opposite relation between meter 1 and meters 2 & 3 (combined): when one increases, the others decrease, and vice-versa. This makes sense because meter 1 is the chilled water meter while 2 and 3 are steam and hot water, and people usually use chilled water in the summer and hot water/steam in the winter.
- Site id 15 buildings are missing meter readings from week 6 to week 12 for meters 0, 2, and 3 (what could be the reason?)
- Site id 0 has missing data for meter 1 (chilled water) and zero readings for meter 0 (electricity). We will remove this period, as we have no prior data for these meter types
- At most sites, meter 0 (electricity) doesn’t vary much, except at sites 3, 4, 5, 8 & 12
EDA on “weather_train.csv” file
# looking at data
weather_train_df.head()
# get the count of nan value rows columnwise
weather_null_count = weather_train_df.isna().sum().reset_index()
# alter the column name
weather_null_count.rename(columns={"index": "column_name", 0: "missing_row_count"}, inplace= True)
# compute the percentage and add new percentage column
weather_null_count["missing_row_percentage"] = weather_null_count.missing_row_count/weather_train_df.shape[0]
weather_null_count.plot.bar(x= "column_name", y= "missing_row_percentage")
plt.ylabel("Missing percentage")
Observation The weather file contains weather information about a particular site at a given time.
There are a total of 139773 entries and 15 features.
It contains missing values. Here is a brief summary:
- There are 6 columns which contain missing values (cloud_coverage, dew_temperature, precip_depth_1_hr, sea_level_pressure, wind_direction, wind_speed)
- cloud_coverage has around 50% of its values missing, followed by precip_depth_1_hr with around 36% missing. To address this, we will first try dropping these two columns and see if that improves the metric. In a second phase, we will try imputing them and check whether that gives a significant boost.
- For the rest of the columns with missing values, we will impute them using ML models.
Distribution of air_temperature, dew_temperature, sea_level_pressure, wind_direction, and wind_speed over time
# Distribution of air temperature day wise
sns.lineplot(x= "dayofyear", y= "air_temperature", sea_level_pressure, data= weather_train_df, hue= "site_id")
# Distribution of dew temperature day wise
sns.lineplot(x= "dayofyear", y= "dew_temperature", data= weather_train_df, hue= "site_id")
# Distribution of sea level pressure day wise
sns.lineplot(x= "dayofyear", y= "sea_level_pressure", data= weather_train_df, hue= "site_id", legend= "full")
# Distribution of wind direction day wise
sns.lineplot(x= "dayofyear", y= "wind_direction", data= weather_train_df, hue= "site_id", legend= "full")
# Distribution of cloud coverage day wise
sns.lineplot(x= "dayofyear", y= "cloud_coverage", data= weather_train_df, hue= "site_id")
# Distribution of precip_depth_1_hr day wise
sns.lineplot(x= "dayofyear", y= "precip_depth_1_hr", data= weather_train_df, hue= "site_id")
Observation
- Looking at the plots, there seems to be a pattern in air_temperature and dew_temperature: they increase towards the middle of the year, then start decreasing towards the end (an inverted-U shape). I feel these variables should be helpful at modeling time.
- wind_direction, sea_level_pressure, cloud_coverage, precip_depth_1_hr, and wind_speed seem to have no pattern; they stay around the same (up to random fluctuations) across time.
EDA on “building_meta.csv”
Getting a summary of data
# get summary of the data
building_meta_df.describe()
# get the sum of null rows, columns wise
building_meta_df.isna().sum()
Observation
- Building meta file contains 1449 rows (same as the number of unique buildings in the dataset) and 6 columns
- The oldest building was built in 1900 and the newest in 2017
- The minimum building area is 283 square feet and the maximum is 875000
- The minimum floor count is 1 and the maximum is 26
- year_built is missing for more than 50% of the buildings, and so is floor_count. We will drop these features and observe the metric
- Also, as seen in the EDA notebook, there is a relation between site_id and building_id, so we have to include these 2 features for sure
Count of values in the primary_use column
# count plot
sns.countplot(building_meta_df.primary_use)
# rotate the labels by 80 degrees
plt.xticks(rotation= 80)
Observation
- Most of the buildings in this dataset are used for educational purposes, followed by office, entertainment/public assembly, public services, and lodging/residential
Distribution of building area
# distribution plot of area in square feet
sns.distplot(building_meta_df.square_feet)
Observation
- The distribution is positively skewed; only a few buildings have an area of more than 200000 square feet. To make it more normal, we will apply a log transformation while modeling the data.
Capturing the relationship between meter reading and square feet area
# compute median of meter reading across buildings
df_median_meter_reading = train_df.groupby("building_id").agg({"target": "median"}).reset_index()
# bringing square_feet area to the same scale
building_meta_df["square_feet"] = np.log1p(building_meta_df["square_feet"])
# merge both data frames
df_merged = df_median_meter_reading.merge(building_meta_df[["building_id", "square_feet", "primary_use"]], on="building_id", how='left')
# regplot
sns.regplot(x = "target",y= "square_feet" , data= df_merged)
Observation
- We can clearly see the positive relation between the median meter reading and square feet (on a log scale). The area of the building could be helpful when modeling the problem
First Cut Approach
My first step is to get a baseline score against which to compare more complex ideas and models. This will be a “just enough” approach: pre-process the data just enough to feed it into a model (e.g. LightGBM) and compute the scoring metric. This score lets me judge whether complex ideas (feature engineering) and models actually help, and guides decision making. I will use LightGBM (as used by most Kagglers, at least in this competition) with 2-fold cross-validation data.
Other advanced modeling ideas are listed below:
Divide and conquer: We will train a model for each site id (location) individually. Using LightGBM with a 2-fold cross-validation split, we will have a total of 16 * 2 = 32 models (16 models, one per site, times 2 cross-validation folds). The idea behind training different models for different sites is simple: the weather is the same for all the buildings located at the same site, and if there is a power outage at a site, most of the buildings (if not all) located there will be affected. So there is a direct relationship between site and building, and this can be exploited by training a separate model per site.
Ensembling: Here we may ensemble ML algorithms like random forest, multi-layer perceptron (ANN), LightGBM, CatBoost, and XGBoost. We will either combine predictions by voting/averaging (one idea) or try a meta model (stacking, a second idea) and see which performs better.
The final model and submission will be selected based on the leaderboard score (both public and private).
Experimentations And Observations
This section would normally come after the “Submission Results” section, but as I am only writing up the good stuff (work that worked) in this blog, I feel we should first know what didn’t work. After a lot of experimentation, here is what worked and what didn’t.
- Model-based imputation (using CatBoost) didn’t work, nor did simple mean- or median-based imputation. I also tried group imputation based on relevant columns, e.g. imputing with the mean per day/month/site_id and so on. The best result came when I let the algorithm handle missing data by itself (we are using LightGBM here).
- Adding more, and more complex, features made the leaderboard score worse. Time-series features like rolling and expanding windows didn’t work as expected. Exposing the target value (using the target value as a feature) is probably a bad idea.
- Dividing and training models based on meter type and site_id didn’t seem to work well either. My hypothesis was that training a model for each (meter, site_id) pair would yield good results because buildings located at the same site_id would behave similarly (imagine an outage at a particular site). In practice, training a single model on the whole dataset worked better than training per-pair models.
- Predicting the log(1 + meter_reading) transformation normalized by building area (square_feet) tends to give a better score than predicting meter_reading itself. Here is a quote from this discussion on Kaggle:
Near the end of the competition, we tried standardizing meter_reading by dividing by square_feet; i.e., we predicted log1p(meter_reading/square_feet). Isamu came up with the idea after reading this discussion post by Artyom Vorobyov. The models trained with the standardized target added diversity to our final ensemble and improved our score by about 0.002. If we had more time we would have liked to explore this idea further; for example, we could have tried to predict log1p(meter_reading)/square_feet or created features using the standardized targets.
- I have also trained other ML algorithms like CatBoost and XGBoost, as well as deep neural networks (CNN-LSTM, MLP). They all either overfit or made the score worse.
- Treating it as a time-series problem didn’t work well for me, so I treated it as a plain (tabular) regression problem.
After a lot of experimentation, and too much time spent on feature engineering and data cleaning, my benchmark model turned out to be my best model (and a decent one): 2 LightGBM models over a 2-fold split, each trained on one fold and validated on the other.
The key takeaway here is: first get the best cross-validation set, one which reflects the test/hidden set well, so that you can trust your cross-validation score and keep improving upon it.
You can have a look at my old data pipeline here: Preprocessing And Feature Engineering On ASHRAE
Data Pre-processing
As the benchmark model was the winner, we parse the datetime to get features like day, weekday, month, etc. We added an is_holiday feature (American holidays in 2016, 2017, and 2018), which helps reduce overfitting a bit. We also added a solar horizontal radiation feature given in this Kaggle discussion; it helped me reduce overfitting and get a better private score. We used a label encoder to encode the text categorical feature primary_use, and dropped floor_count and year_built because they were missing more than 60% of their values (removing them yielded better results). Lastly, we apply a log transformation to the target variable normalized by square_feet (building area).
Here is the full pre-processing code:
from sklearn.preprocessing import LabelEncoder

# fit a label encoder and transform the text categorical feature
le = LabelEncoder().fit(df.primary_use)
df.primary_use = le.transform(df.primary_use)
# apply log1p to the area (making it more normal and handling extreme values)
df.square_feet = np.log1p(df.square_feet)
# add dayofyear column
df["dayofyear"] = df.timestamp.dt.dayofyear
# add day column
df["day"] = df.timestamp.dt.day
# add week column
df["weekday"] = df.timestamp.dt.weekday
# add hour column
df["hour"] = df.timestamp.dt.hour
# add month column
df["month"] = df.timestamp.dt.month
# add weekend column
df["weekend"] = df.timestamp.dt.weekday.apply(lambda x: 0 if x <5 else 1)
'''
"It is supposed to calculate the solar horizontal radiation coming into the building". Source:- https://www.kaggle.com/c/ashrae-energy-prediction/discussion/124984
'''
latitude_dict = {0 :28.5383,
1 :50.9097,
2 :33.4255,
3 :38.9072,
4 :37.8715,
5 :50.9097,
6 :40.7128,
7 :45.4215,
8 :28.5383,
9 :30.2672,
10 :40.10677,
11 :45.4215,
12 :53.3498,
13 :44.9375,
14 :38.0293,
15: 40.7128}
df['latitude'] = df['site_id'].map(latitude_dict)
df['solarHour'] = (df['hour']-12)*15 # to be removed
df['solarDec'] = -23.45*np.cos(np.deg2rad(360*(df['day']+10)/365)) # to be removed
df['horizsolar'] = np.cos(np.deg2rad(df['solarHour']))*np.cos(np.deg2rad(df['solarDec']))*np.cos(np.deg2rad(df['latitude'])) + np.sin(np.deg2rad(df['solarDec']))*np.sin(np.deg2rad(df['latitude']))
df['horizsolar'] = df['horizsolar'].apply(lambda x: 0 if x <0 else x)
# Holiday feature
holidays = ["2016-01-01", "2016-01-18", "2016-02-15", "2016-05-30", "2016-07-04",
"2016-09-05", "2016-10-10", "2016-11-11", "2016-11-24", "2016-12-26",
"2017-01-01", "2017-01-16", "2017-02-20", "2017-05-29", "2017-07-04",
"2017-09-04", "2017-10-09", "2017-11-10", "2017-11-23", "2017-12-25",
"2018-01-01", "2018-01-15", "2018-02-19", "2018-05-28", "2018-07-04",
"2018-09-03", "2018-10-08", "2018-11-12", "2018-11-22", "2018-12-25",
"2019-01-01"]
df["is_holiday"] = df.timestamp.dt.date.astype("str").isin(holidays).astype(int)
# Drop the columns which contains lots of missing values and have less or no effect on predicting the target
drop_features = ['floor_count', 'year_built']
df.drop(drop_features, axis=1, inplace=True)
That is all for the pre-processing part.
Cross-Validation Set, Hyper-parameter Tuning, And Modelling
For cross-validation, I split the dataset into 2 folds (without shuffling, half and half).
For this regression problem, I used a boosting algorithm, which is faster to train than an ANN and can handle non-linear data (unlike linear regression, which is fast but only captures linear patterns). We train 2 LightGBM models, each trained on one fold and cross-validated on the other, so each fold is used exactly once as a validation set. In this way, the whole dataset is exposed to the model. The final prediction is simply the average of the values predicted by these two models.
I grid searched manually to find the best hyper-parameter values. Here are the hyper-parameters I searched over:
- num_leaves
- feature_fraction
- learning_rate
- reg_lambda
The best hyperparameters found were:
{'num_leaves': 90,
'feature_fraction': 0.85,
'learning_rate': 0.05,
'reg_lambda': 10}
I saved every model trained during the grid search (not very space efficient, but it saved me from retraining on the best hyperparameter values). Here is how each model performs on the train and the validation set.
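A sketch of that manual grid search, with every trained model pickled to disk. Here `train_fn` is a hypothetical callable that trains one model for the given parameters and returns it together with its validation loss, and the grid values are illustrative (the post only reports the best combination):

```python
import itertools
import pickle
from pathlib import Path

def grid_search_and_save(train_fn, param_grid, out_dir="models"):
    """Try every hyper-parameter combination, pickle each trained model,
    and return (index, params, val_loss) of the best one."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    results = []
    for i, values in enumerate(itertools.product(*param_grid.values())):
        params = dict(zip(param_grid.keys(), values))
        model, val_loss = train_fn(params)  # train and evaluate one model
        with open(Path(out_dir) / f"model_{i}.pkl", "wb") as f:
            pickle.dump(model, f)           # saved, so no retraining is needed later
        results.append((i, params, val_loss))
    return min(results, key=lambda r: r[2])  # lowest validation loss wins

# illustrative grid around the best values found in this study
param_grid = {"num_leaves": [31, 90],
              "feature_fraction": [0.85, 1.0],
              "learning_rate": [0.05, 0.1],
              "reg_lambda": [0, 10]}
```

Pickling inside the loop is what makes it possible to pick the best model afterwards by its index, without rerunning any training.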
The best model seems to be the 38th, where both train and validation losses are at their minimum. We choose the 38th model and submit its predictions on the test set to get the leaderboard score.
* NOTE: we already have the 38th model saved on disk. We pickled it so that we can load and reuse it whenever needed, i.e. at the time of productization of the model (see the Deploying Streamlit App On Heroku section).
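Reloading the pickled model at serving time is then a one-liner (the file name below is an assumption; use whatever path the grid-search step wrote):

```python
import pickle

def load_model(path):
    """Load a previously pickled model so it can be reused for prediction
    (e.g. inside the deployed Streamlit app) without retraining."""
    with open(path, "rb") as f:
        return pickle.load(f)

# hypothetical usage:
#   model = load_model("model_38.pkl")
#   preds = model.predict(X_new)
```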
Find full code here: ASHRAE_Great_Energy_Predictor_III(Modeling).ipynb
Submission Results
Here is the Kaggle leader board score.
Here we get a private score of 1.322 and a public score of 1.137, which in my opinion is a decent score given the challenges of the competition, such as the huge dataset and the data leaks (we did not use the leaked data here, which could be one reason we got this score).
Deploying Streamlit App On Heroku
In this section, we will deploy our best model on Heroku. To make the machine learning application interactive and give users the freedom to input data, I used Streamlit, an easy-to-use Python library for building web apps for machine learning.
Heroku is a platform as a service (PaaS) that lets us run our app quickly and easily, unlike AWS services such as EC2, which is infrastructure as a service (IaaS) and must be configured by us to provide a runtime environment (or platform) for our application.
To deploy the python app on Heroku follow the steps:
- Sign up for a Heroku account here
- After getting your account done, go to the dashboard, or just follow this link.
- Click on “New” -> “Create new app”. Enter an app name and region, then click “Create app”.
4. First, let us see which files we need, besides the project files/folders themselves, to run our streamlit app on Heroku. Here is a list of files Heroku needs in order to run your app. * NOTE: All these files should be in the local workspace; we have not uploaded anything to Heroku just yet.
- requirements.txt: This file lists all the modules required by your python app. It might look like this:
seaborn
matplotlib
streamlit
…
* NOTE: you can pin specific versions as well, like seaborn==0.11.1
To get the dependencies of your current project you can use pipreqs, a python library. Run pipreqs from the project root and it will generate a requirements.txt file for your project.
- Procfile: Contains the command the Heroku app runs at startup to launch your app on its platform. Here is how the “Procfile” might look:
web: sh setup.sh && streamlit run Final.py
It says: this is a web process type (enabling the Heroku app to receive HTTP requests); run the “setup.sh” file with the shell (“sh”) and then start the streamlit server (“streamlit run Final.py”). To learn more about the “Procfile” used by Heroku, read this.
- setup.sh: This file is solely used for streamlit configuration. This is the reason why we first run “setup.sh” and then start streamlit server “streamlit run Final.py”. The configuration would look like this:
mkdir -p ~/.streamlit/
echo "[general]
email = \"youremail@domainName\"
" > ~/.streamlit/credentials.toml
echo "[server]
headless = true
enableCORS = false
port = $PORT
" > ~/.streamlit/config.toml
This script creates a “.streamlit” folder in the home directory and adds two files, “credentials.toml” and “config.toml”, which tell streamlit what port to use, whether to run headless, and whether to allow CORS.
Just copy and paste the above code into the “setup.sh” file and it should work.
Here is how your project folder would look like:
5. After creating a new app, Heroku will ask you to choose one of three deployment methods and will provide the steps to follow to deploy your app.
Here we will just focus on “Heroku Git”. To use Heroku Git, you need to download the Heroku CLI; you can download it here.
Here are the instructions that Heroku would provide if you would like to deploy via “Deploy using Heroku Git”.
Open cmd prompt or terminal and follow the steps:
5.1) Login to Heroku by typing:
heroku login
This will open a tab on a web browser for authentication.
5.2) Locate your project root folder and type:
git init
This will create a local git repository and let you maintain versions of the python app. Next we need to select the Heroku app and push all the main files to Heroku.
5.3) Select the Heroku app where you want to upload your python application and the additional files (you can have multiple Heroku apps, so you need to tell Heroku where to upload the files). To do this, type:
heroku git:remote -a app_name
“app_name” is the name of the Heroku application which you have created in step 3.
5.4) Now add and commit all the files in the local repository by typing:
git add .
git commit -am "my first commit"
5.5) Finally, push the whole project to the Heroku server:
git push heroku master
6. After uploading all the necessary files to the Heroku server, we just need to do one last step, which is to start a “dyno”. Dynos are the heart of Heroku: they are lightweight, virtualized Linux containers that run the code (our streamlit app). You can learn more about Heroku dynos here. If you look at the “Resources” tab of your Heroku app, you will see something like:
This means there are no dynos (containers) running our code (and hence no process). To start a dyno, type:
heroku ps:scale web=1
This command tells Heroku to start 1 “web” dyno, which will run all our pushed code. To see what other dyno configurations are available, read this.
After executing this command, you should see something similar under the “Resources” tab:
Our app is now running on a Heroku web dyno; just visit “yourappname.herokuapp.com” and check that everything is working.
Check out ASHRAE great energy predictor Heroku web app to see how your streamlit app would look.
Future Work/ Tips To Improve The Score
- Using ensembling techniques: Create multiple regression models that differ from each other, i.e. a heterogeneous ensemble (the idea being that each model learns a different aspect of the data, so the overall prediction improves). This may include training different models, such as tree-based models, artificial neural networks, SVMs, etc., and aggregating their results.
- Using data leaks: The data for this competition is publicly available; read this kernel to learn more about the data leaks. Using the leaked targets to guide our model in the right direction can certainly improve the score.
- Feature engineering: As the dataset is huge, engineering new features and selecting features is tricky, difficult, and not memory friendly, but it could move us up the leaderboard.
- Using automated machine learning tools: Auto-ML libraries like auto-sklearn, TPOT, H2O, etc. may help us get better scores or even help with the ensembling process, though the size of the dataset could be a big bottleneck.
- Experimenting with different cross-validation splits: Here we used a 2-fold CV split; choosing splits like 4 folds (one per quarter) or an 8-4 months split, and so on, could capture the patterns better and lead to a better score.
- Using a subset of the data: We may experiment with training the model(s) on a few million rows, since the 20.21 million training rows might also contain irrelevant (noisy) data.
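As a sketch of the cross-validation idea above, here is a small helper (names are my own) that produces contiguous, unshuffled folds; with 4 folds on a year of hourly data, each validation block would roughly correspond to one quarter:

```python
import numpy as np

def contiguous_folds(n_rows, n_splits):
    """Return (train_idx, valid_idx) pairs where each validation fold is a
    contiguous, unshuffled block of rows (e.g. one quarter of the year)."""
    bounds = np.linspace(0, n_rows, n_splits + 1, dtype=int)
    folds = []
    for i in range(n_splits):
        valid = np.arange(bounds[i], bounds[i + 1])
        train = np.concatenate([np.arange(0, bounds[i]),
                                np.arange(bounds[i + 1], n_rows)])
        folds.append((train, valid))
    return folds
```

Keeping the folds contiguous (no shuffling) preserves the time-series structure, so each model is validated on a stretch of time it never saw during training.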
EndNote
It has been an amazing experience to participate in the “ASHRAE Great Energy Predictor III” Kaggle competition and write about it. In the process I learned a lot about handling big data with limited resources, feature engineering, feature selection and removal of collinear features, the importance of hyperparameter tuning, time-series prediction, and much more.
I hope you have learned something from this post. Feel free to provide any feedback you have so that I can stay motivated, improve myself, and fix my ways.
Thank you for reading!
Oh! Also, you can connect with me on LinkedIn for any questions, queries, or suggestions.
References
Simple Exploration Notebook — ASHRAE
Feature engineering for time series
ASHRAE — UCF Spider and EDA (Full Test Labels)
All electricity meter is 0 until May 20