webscraping, python

Webscraping in Energy Sector

Stéphan · Oct 09, 2020 · 126 mins read

In this article, we use web scraping to retrieve energy consumption data and plot a few graphs from it. The data comes from the Statistical Review of World Energy published on bp.com. First, we install and import everything we need.

In [5]:
url = "https://www.bp.com/en/global/corporate/energy-economics/statistical-review-of-world-energy/year-in-review.html"
In [ ]:
!pip install requests
!pip install beautifulsoup4
!pip install ipython
!pip install pandas matplotlib
In [ ]:
from IPython.display import IFrame
# Preview the page we are about to scrape inside the notebook
IFrame(url, width=800, height=450)
In [7]:
import requests
from bs4 import BeautifulSoup
In [8]:
response = requests.get(url)
html = response.content
In [9]:
type(response)
Out[9]:
requests.models.Response
In [10]:
response.status_code
Out[10]:
200
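A status code of 200 means the request succeeded. As a minimal sketch of a more defensive version of the same download, the helper below sets a timeout and raises an exception on HTTP error codes instead of relying on a manual check. The name `fetch_html` and the 10-second timeout are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_html(page_url, timeout=10.0):
    """Download a page and return its raw bytes.

    `fetch_html` and the default timeout are illustrative choices.
    """
    # Without a timeout, requests.get can wait indefinitely on a stalled server
    response = requests.get(page_url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx status codes
    response.raise_for_status()
    return response.content
```

With this helper, `html = fetch_html(url)` would stand in for the `requests.get(url)` and `response.content` lines.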

Here is the beginning of our HTML page:

In [15]:
html[0:1000]
Out[15]:
b'\n    <!DOCTYPE HTML>\n    <html lang="en">\n        <head>\n            \n            <script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push(\n{\'gtm.start\': new Date().getTime(),event:\'gtm.js\'}\n);var f=d.getElementsByTagName(s)[0],\nj=d.createElement(s),dl=l!=\'dataLayer\'?\'&l=\'+l:\'\';j.async=true;j.src=\n\'https://www.googletagmanager.com/gtm.js?id=\'+i+dl;f.parentNode.insertBefore(j,f);\n})(window,document,\'script\',\'dataLayer\',\'GTM-WJFXK46\');</script>\n            \n            \n    <meta charset="utf-8"/>\n    <meta http-equiv="X-UA-Compatible" content="IE=edge"/>\n    <meta http-equiv="content-type" content="text/html; charset=UTF-8"/>\n    <meta name="keywords" content="Statistical Review of World Energy,Advancing the energy transition,Energy economics,Energy industry,Power generation,Spencer Dale"/>\n    <meta name="description" content="Growth in energy markets slowed in 2019 in line with weaker economic growth and a partial unwinding of some of the one-off factors that boosted energy demand in 2018'

We will use BeautifulSoup to parse the page rather than searching the raw bytes by hand.

In [21]:
soup = BeautifulSoup(html, 'html.parser')
type(soup)
Out[21]:
bs4.BeautifulSoup

The parser turns our HTML into a tree of objects that is easier to read and navigate.

In [ ]:
soup # to see the result

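Next, we use CSS selectors to find what we want. Here we want to extract the table "Fuel shares of primary energy and contributions to growth in 2019". Selecting every td cell is enough for this page; the narrower selector div.field-items table tr td matches nothing here and returns an empty list.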

In [22]:
element = soup.select('td')
In [23]:
soup.select("div.field-items table tr td")
Out[23]:
[]
In [24]:
type(element)
Out[24]:
list
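To make the selector behaviour easier to see, here is a small sketch on a hand-written HTML fragment. The fragment and the variable names are made up for illustration and are not the real bp.com markup. It shows a plain `td` selector, a descendant selector that matches, and one that matches nothing.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: NOT the actual markup of the bp.com page
sample_html = """
<div class="field-items">
  <table>
    <tr><td>&nbsp;Oil</td><td>&nbsp;193.0</td></tr>
    <tr><td>&nbsp;Gas</td><td>&nbsp;141.5</td></tr>
  </table>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Every <td> in the document, in document order
all_cells = sample_soup.select("td")
# Only <td> elements nested inside <div class="field-items"> ... <table> ... <tr>
scoped_cells = sample_soup.select("div.field-items table tr td")
# A selector whose ancestor does not exist yields an empty result, not an error
missing_cells = sample_soup.select("div.no-such-class td")

# &nbsp; is decoded to the non-breaking space character '\xa0'
cell_texts = [cell.get_text() for cell in all_cells]
```

An empty result from `select` therefore usually means one part of the selector does not match the page structure.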
In [25]:
texte_data = [elements.get_text() for elements in element]
In [26]:
texte_data = texte_data[0:35]
texte_data
Out[26]:
['\xa0Oil',
 '\xa0193.0',
 '\xa01.6',
 '\xa033.1%\xa0',
 '\xa0-0.2%',
 '\xa0Gas',
 '\xa0141.5',
 '\xa02.8',
 '\xa024.2%',
 '\xa00.2%',
 '\xa0Coal',
 '\xa0157.9',
 '\xa0-0.9',
 '\xa027.0%',
 '\xa0-0.5%',
 '\xa0Renewables*',
 '\xa029.0',
 '\xa03.2',
 '\xa05.0%',
 '\xa00.5%',
 '\xa0Hydro',
 '\xa037.6',
 '\xa00.3',
 '\xa06.4%',
 '\xa00.0%',
 '\xa0Nuclear',
 '\xa024.9',
 '\xa00.8',
 '\xa04.3%',
 '\xa00.1%',
 '\xa0Total',
 '\xa0583.9',
 '\xa07.7',
 '\xa0',
 '\xa0']
In [27]:
# The cells are listed row by row with 5 cells per row, so a step of 5 picks out one column
EnergySource = texte_data[0::5]
ConsumptionExajoules = texte_data[1::5]
AnnualChangeExajoules = texte_data[2::5]
ShareOfPrimaryEnergy = texte_data[3::5]
PercentagePointChangeShare2018 = texte_data[4::5]
In [28]:
EnergySource
ConsumptionExajoules
Out[28]:
['\xa0193.0',
 '\xa0141.5',
 '\xa0157.9',
 '\xa029.0',
 '\xa037.6',
 '\xa024.9',
 '\xa0583.9']
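The column split relies on extended slicing: `data[i::5]` starts at position `i` and takes every fifth item. Here is a minimal sketch on the first two rows of the table, with the non-breaking spaces left out for readability. The names `flat`, `n_cols`, `columns` and `rows` are illustrative.

```python
# First two rows of the scraped table, flattened row by row
flat = ["Oil", "193.0", "1.6", "33.1%", "-0.2%",
        "Gas", "141.5", "2.8", "24.2%", "0.2%"]
n_cols = 5

# Column view: item i of every row
columns = [flat[i::n_cols] for i in range(n_cols)]

# Row view: consecutive chunks of n_cols items
rows = [flat[start:start + n_cols] for start in range(0, len(flat), n_cols)]

print(columns[0])  # ['Oil', 'Gas']
print(rows[1])     # ['Gas', '141.5', '2.8', '24.2%', '0.2%']
```

This only works if every row has exactly five cells, which is why the list is first cut down to the 35 cells of the table.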
In [29]:
import pandas as pd
In [30]:
df = pd.DataFrame({"Energy Source" : EnergySource, "Consumption Exajoules" : ConsumptionExajoules, "AnnualChangeExajoules" : AnnualChangeExajoules, "ShareOfPrimaryEnergy" : ShareOfPrimaryEnergy, "PercentagePointChangeShare2018" : PercentagePointChangeShare2018})
df
Out[30]:
   Energy Source  Consumption Exajoules  AnnualChangeExajoules  ShareOfPrimaryEnergy  PercentagePointChangeShare2018
0  Oil            193.0                  1.6                    33.1%                 -0.2%
1  Gas            141.5                  2.8                    24.2%                 0.2%
2  Coal           157.9                  -0.9                   27.0%                 -0.5%
3  Renewables*    29.0                   3.2                    5.0%                  0.5%
4  Hydro          37.6                   0.3                    6.4%                  0.0%
5  Nuclear        24.9                   0.8                    4.3%                  0.1%
6  Total          583.9                  7.7
In [31]:
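# The Total row has no share values; fill the empty cells with "0" so the columns can be cast to float.
# df.loc avoids the chained assignment df.col[6] = ..., which may not update df itself.
df.loc[6, "PercentagePointChangeShare2018"] = "0"
df.loc[6, "ShareOfPrimaryEnergy"] = "0"
df.ShareOfPrimaryEnergy[0]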
Out[31]:
'\xa033.1%\xa0'
In [32]:
df.ShareOfPrimaryEnergy = df.ShareOfPrimaryEnergy.str.replace("\xa0", "").str.replace("%", "").str.replace(" ", "").astype(float)
In [465]:
df
Out[465]:
   Energy Source  Consumption Exajoules  AnnualChangeExajoules  ShareOfPrimaryEnergy  PercentagePointChangeShare2018
0  Oil            193.0                  1.6                    33.1                  -0.2%
1  Gas            141.5                  2.8                    24.2                  0.2%
2  Coal           157.9                  -0.9                   27.0                  -0.5%
3  Renewables*    29.0                   3.2                    5.0                   0.5%
4  Hydro          37.6                   0.3                    6.4                   0.0%
5  Nuclear        24.9                   0.8                    4.3                   0.1%
6  Total          583.9                  7.7                    0.0                   0
In [466]:
df.dtypes
Out[466]:
Energy Source                      object
Consumption Exajoules              object
AnnualChangeExajoules              object
ShareOfPrimaryEnergy              float64
PercentagePointChangeShare2018     object
dtype: object
In [467]:
df["Consumption Exajoules"]=df["Consumption Exajoules"].astype(float)
In [468]:
df.AnnualChangeExajoules = df.AnnualChangeExajoules.astype(float)
In [469]:
df.PercentagePointChangeShare2018 = df.PercentagePointChangeShare2018.str.replace("%", "").astype(float)
In [470]:
df.PercentagePointChangeShare2018
Out[470]:
0   -0.2
1    0.2
2   -0.5
3    0.5
4    0.0
5    0.1
6    0.0
Name: PercentagePointChangeShare2018, dtype: float64
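The same cleaning steps (drop the non-breaking space, drop the percent sign, cast to float) are repeated for several columns. As a sketch of an alternative, they could be gathered in one helper. The name `to_number` is hypothetical. Unlike the cells above, which fill the blank cells with "0", this version turns blanks into NaN through `errors="coerce"`.

```python
import pandas as pd


def to_number(column):
    """Strip non-breaking spaces and percent signs, then convert to float.

    Hypothetical helper; blank cells become NaN instead of 0.
    """
    cleaned = (
        column.str.replace("\xa0", "", regex=False)
              .str.replace("%", "", regex=False)
              .str.strip()
    )
    # errors="coerce" maps anything that is not a number (such as "") to NaN
    return pd.to_numeric(cleaned, errors="coerce")


# Values copied from the scraped table, plus one blank cell from the Total row
raw = pd.Series(["\xa033.1%\xa0", "\xa0-0.2%", "\xa0"])
numbers = to_number(raw)
```

Whether a blank share for the Total row should read as 0 or as a missing value is a modelling choice. NaN keeps it out of sums and averages.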
In [471]:
# regex=False treats "*" as a literal character rather than a regular-expression quantifier
df["Energy Source"] = df["Energy Source"].str.replace("*", "", regex=False).str.replace("\xa0", "", regex=False).str.replace(" ", "", regex=False)
In [39]:
df["Energy Source"]
Out[39]:
0           Oil
1           Gas
2          Coal
3    Renewables
4         Hydro
5       Nuclear
6         Total
Name: Energy Source, dtype: object
In [44]:
ax = df.plot(kind='bar', x='Energy Source', figsize=(15,3))
ax.set_title("Energy mix by source")
ax.set_ylabel("Exajoule")
Out[44]:
Text(0, 0.5, 'Exajoule')
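The call above draws every numeric column, including the two percentage columns and the Total row, on an axis labelled in exajoules. As a sketch of a narrower chart, the block below plots only consumption for the six individual sources. It rebuilds a small frame from the values shown in the table so that it runs on its own. The name `consumption` and the chart title are illustrative.

```python
import pandas as pd

# Values copied from the scraped table (Total row left out)
consumption = pd.DataFrame({
    "Energy Source": ["Oil", "Gas", "Coal", "Renewables", "Hydro", "Nuclear"],
    "Consumption Exajoules": [193.0, 141.5, 157.9, 29.0, 37.6, 24.9],
})

# y= restricts the plot to one column, so the axis label matches the data
ax = consumption.plot(
    kind="bar",
    x="Energy Source",
    y="Consumption Exajoules",
    legend=False,
    figsize=(15, 3),
)
ax.set_title("Energy consumption by source")
ax.set_ylabel("Exajoule")
```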
In [475]:
df.to_csv('Consumption.csv')
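By default `to_csv` also writes the row index as an unnamed first column. The sketch below uses an in-memory buffer in place of a file, and a two-row frame with values from the table, to show how `index=False` keeps the index out so the data reads back with the same columns. The names `sample`, `buffer` and `restored` are illustrative.

```python
import io

import pandas as pd

sample = pd.DataFrame({
    "Energy Source": ["Oil", "Gas"],
    "Consumption Exajoules": [193.0, 141.5],
})

# Write to an in-memory text buffer instead of a file on disk
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Rewind and read the CSV text back into a new frame
buffer.seek(0)
restored = pd.read_csv(buffer)
```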
In [476]:
df
Out[476]:
   Energy Source  Consumption Exajoules  AnnualChangeExajoules  ShareOfPrimaryEnergy  PercentagePointChangeShare2018
0  Oil            193.0                  1.6                    33.1                  -0.2
1  Gas            141.5                  2.8                    24.2                  0.2
2  Coal           157.9                  -0.9                   27.0                  -0.5
3  Renewables     29.0                   3.2                    5.0                   0.5
4  Hydro          37.6                   0.3                    6.4                   0.0
5  Nuclear        24.9                   0.8                    4.3                   0.1
6  Total          583.9                  7.7                    0.0                   0.0
In [477]:
df[0:6].pivot_table(index='Energy Source', values='Consumption Exajoules')
Out[477]:
               Consumption Exajoules
Energy Source
Coal           157.9
Gas            141.5
Hydro          37.6
Nuclear        24.9
Oil            193.0
Renewables     29.0
In [478]:
# x= and stacked= have no effect on a pie chart, so they are left out
df[0:6].pivot_table(index='Energy Source', values='Consumption Exajoules').plot(kind='pie', subplots=True)
Out[478]:
array([<matplotlib.axes._subplots.AxesSubplot object at 0x0000023603EC70B8>],
      dtype=object)
In [516]:
df[0:6].pivot_table(index='Energy Source', values='Consumption Exajoules').plot(kind='bar')
Out[516]:
<matplotlib.axes._subplots.AxesSubplot at 0x2360871ac50>
In [480]:
df[["Energy Source","Consumption Exajoules"]][0:6]
Out[480]:
   Energy Source  Consumption Exajoules
0  Oil            193.0
1  Gas            141.5
2  Coal           157.9
3  Renewables     29.0
4  Hydro          37.6
5  Nuclear        24.9
In [481]:
df
Out[481]:
   Energy Source  Consumption Exajoules  AnnualChangeExajoules  ShareOfPrimaryEnergy  PercentagePointChangeShare2018
0  Oil            193.0                  1.6                    33.1                  -0.2
1  Gas            141.5                  2.8                    24.2                  0.2
2  Coal           157.9                  -0.9                   27.0                  -0.5
3  Renewables     29.0                   3.2                    5.0                   0.5
4  Hydro          37.6                   0.3                    6.4                   0.0
5  Nuclear        24.9                   0.8                    4.3                   0.1
6  Total          583.9                  7.7                    0.0                   0.0
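As a quick sanity check on the scraped numbers, the shares can be recomputed from the consumption column and compared with the ShareOfPrimaryEnergy values. This is a plain-Python sketch using the figures from the table; the dictionary names are illustrative.

```python
# Consumption in exajoules and reported shares in percent, copied from the table
consumption = {"Oil": 193.0, "Gas": 141.5, "Coal": 157.9,
               "Renewables": 29.0, "Hydro": 37.6, "Nuclear": 24.9}
reported_share = {"Oil": 33.1, "Gas": 24.2, "Coal": 27.0,
                  "Renewables": 5.0, "Hydro": 6.4, "Nuclear": 4.3}
reported_total = 583.9

# The six sources should add up to the Total row
total = sum(consumption.values())

# Share of each source in percent, rounded to one decimal like the table
computed_share = {source: round(100 * value / total, 1)
                  for source, value in consumption.items()}

for source, share in computed_share.items():
    print(source, share, reported_share[source])
```

For example, oil gives 193.0 / 583.9 ≈ 33.05 %, which rounds to the reported 33.1 %.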
Written by Stéphan
Computer science student in Paris.