In this article we use web scraping to retrieve data and plot a few graphs about energy consumption. The website bp.com provides interesting material on the energy sector. First, we install and import everything we need.
In [5]:
url = "https://www.bp.com/en/global/corporate/energy-economics/statistical-review-of-world-energy/year-in-review.html"
In [ ]:
!pip install requests
!pip install beautifulsoup4
!pip install ipython
!pip install pandas matplotlib
In [ ]:
from IPython.display import IFrame
# Preview the page inside the notebook, reusing the url defined above.
IFrame(url, width=800, height=450)
In [7]:
import requests
from bs4 import BeautifulSoup
In [8]:
response = requests.get(url)
html = response.content
In [9]:
type(response)
Out[9]:
In [10]:
response.status_code
Out[10]:
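Checking the status code by eye works in a notebook, but the check can also be made part of the download step. The following is a minimal sketch, assuming a hypothetical helper named `fetch_html` (it is not part of the original notebook) that adds a timeout and raises an exception when the server answers with an error status:

```python
import requests


def fetch_html(page_url, timeout=10):
    """Download a page and return its raw bytes (hypothetical helper).

    Raises requests.HTTPError when the server answers with a 4xx or 5xx status.
    """
    response = requests.get(page_url, timeout=timeout)
    response.raise_for_status()  # does nothing unless the status is 4xx or 5xx
    return response.content


# Usage (requires network access):
# html = fetch_html(url)
```

The timeout value of 10 seconds is an arbitrary choice for illustration.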
Here is the beginning of our HTML page:
In [15]:
html[0:1000]
Out[15]:
We will use BeautifulSoup to parse the page, which is quicker than working with the raw HTML.
In [21]:
soup = BeautifulSoup(html, 'html.parser')
type(soup)
Out[21]:
The parser turns our HTML into a structure that is easier to read and query.
In [ ]:
soup # to see the result
Then we use a CSS selector to find what we want. Here we want to extract the table: Fuel shares of primary energy and contributions to growth in 2019.
In [22]:
element = soup.select('td')
In [23]:
soup.select("div.field-items table tr td")
Out[23]:
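To see how the two selectors above differ, here is a small sketch on a made-up HTML fragment (the markup and cell values are invented for illustration and only imitate the layout the selector assumes). A bare `td` selector matches every cell on the page, while the longer selector only matches cells nested inside a `div` with class `field-items`:

```python
from bs4 import BeautifulSoup

# Made-up HTML that imitates the layout assumed by the selector above.
sample_html = """
<div class="field-items">
  <table>
    <tr><td>Source A</td><td>10.0</td></tr>
    <tr><td>Source B</td><td>20.0</td></tr>
  </table>
</div>
<table><tr><td>unrelated cell</td></tr></table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Every <td> in the document, whatever table it belongs to.
all_cells = sample_soup.select("td")
# Only <td> elements that are descendants of <div class="field-items">.
scoped_cells = sample_soup.select("div.field-items table tr td")

print(len(all_cells))  # 5
print([cell.get_text() for cell in scoped_cells])
# ['Source A', '10.0', 'Source B', '20.0']
```

The scoped selector is less likely to pick up cells from other tables if the page layout changes.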
In [24]:
type(element)
Out[24]:
In [25]:
# Keep only the text of each table cell.
texte_data = [cell.get_text() for cell in element]
In [26]:
texte_data = texte_data[0:35]
texte_data
Out[26]:
In [27]:
# The cells are listed row by row, so each column is every 5th cell.
EnergySource = texte_data[::5]
ConsumptionExajoules = texte_data[1::5]
AnnualChangeExajoules = texte_data[2::5]
ShareOfPrimaryEnergy = texte_data[3::5]
PercentagePointChangeShare2018 = texte_data[4::5]
In [28]:
EnergySource
ConsumptionExajoules
Out[28]:
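The slicing step relies on the cells being stored row by row: with five columns, column `i` is every fifth cell starting at offset `i`. A minimal sketch with a made-up flat list (the names `flat_cells` and `n_columns` are illustrative only):

```python
# Made-up flat list: 2 rows of 5 cells each, read left to right.
flat_cells = [
    "A", "1", "2", "3", "4",
    "B", "5", "6", "7", "8",
]
n_columns = 5

# Column i is every 5th cell starting at offset i.
columns = [flat_cells[i::n_columns] for i in range(n_columns)]
# Row j is a contiguous run of 5 cells.
rows = [flat_cells[j:j + n_columns] for j in range(0, len(flat_cells), n_columns)]

print(columns[0])  # ['A', 'B']
print(columns[1])  # ['1', '5']
print(rows[1])     # ['B', '5', '6', '7', '8']
```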
In [29]:
import pandas as pd
In [30]:
df = pd.DataFrame({"Energy Source" : EnergySource, "Consumption Exajoules" : ConsumptionExajoules, "AnnualChangeExajoules" : AnnualChangeExajoules, "ShareOfPrimaryEnergy" : ShareOfPrimaryEnergy, "PercentagePointChangeShare2018" : PercentagePointChangeShare2018})
df
Out[30]:
In [31]:
# Use .loc to overwrite single cells; chained indexing (df.col[6] = ...) may not modify df.
df.loc[6, "PercentagePointChangeShare2018"] = "0"
df.loc[6, "ShareOfPrimaryEnergy"] = "0"
df.ShareOfPrimaryEnergy[0]
Out[31]:
In [32]:
df.ShareOfPrimaryEnergy = df.ShareOfPrimaryEnergy.str.replace("\xa0", "").str.replace("%", "").str.replace(" ", "").astype(float)
In [465]:
df
Out[465]:
In [466]:
df.dtypes
Out[466]:
In [467]:
df["Consumption Exajoules"]=df["Consumption Exajoules"].astype(float)
In [468]:
df.AnnualChangeExajoules = df.AnnualChangeExajoules.astype(float)
In [469]:
df.PercentagePointChangeShare2018 = df.PercentagePointChangeShare2018.str.replace("%", "").astype(float)
In [470]:
df.PercentagePointChangeShare2018
Out[470]:
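The conversions above repeat the same pattern: strip stray characters, then cast to float. As a sketch only, the pattern could be collected in one hypothetical helper, here called `to_number` (not part of the original notebook), applied to made-up values:

```python
import pandas as pd


def to_number(series):
    """Strip non-breaking spaces, percent signs and blanks, then parse as float.

    Hypothetical helper; cells that still fail to parse become NaN.
    """
    cleaned = (
        series.astype(str)
        .str.replace("\xa0", "", regex=False)
        .str.replace("%", "", regex=False)
        .str.replace(" ", "", regex=False)
    )
    return pd.to_numeric(cleaned, errors="coerce")


# Made-up values; the last one cannot be parsed and becomes NaN.
sample = pd.Series(["33.1%", "\xa024.2 %", "-0.5%", "n/a"])
numbers = to_number(sample)
print(numbers)
```

Using `errors="coerce"` is a design choice: it turns unparseable cells into NaN instead of raising, so whether that is preferable to the explicit `astype(float)` used above depends on how strictly bad cells should be reported.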
In [471]:
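# regex=False treats "*" as a literal character; as a regular expression it is invalid.
df["Energy Source"] = df["Energy Source"].str.replace("*", "", regex=False).str.replace("\xa0", "", regex=False).str.replace(" ", "", regex=False)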
In [472]:
df["Energy Source"]=df["Energy Source"].astype(str)
In [39]:
df["Energy Source"]
Out[39]:
In [44]:
ax = df.plot(kind='bar',x='Energy Source', figsize=(15,3))
ax.set_title("Energy breakdown by source")
ax.set_ylabel("Exajoule")
Out[44]:
In [475]:
df.to_csv('Consumption.csv')
In [476]:
df
Out[476]:
In [477]:
df[0:6].pivot_table(index='Energy Source', values='Consumption Exajoules')
Out[477]:
In [478]:
# After pivot_table, 'Energy Source' is the index, so the pie needs neither x nor stacked.
df[0:6].pivot_table(index='Energy Source', values='Consumption Exajoules').plot(kind='pie', subplots=True)
Out[478]:
In [516]:
df[0:6].pivot_table(index='Energy Source', values='Consumption Exajoules').plot(kind='bar')
Out[516]:
In [480]:
df[["Energy Source","Consumption Exajoules"]][0:6]
Out[480]:
In [481]:
df
Out[481]:
In [46]:
! jupyter nbconvert --to html "Energy_Webscraping.ipynb"