There are many tutorials out there on exploratory data analysis using Python’s popular Pandas library, but very few, if any, use water data as examples. This is the inspiration behind my post: to share some of the useful (often simple, one-liner) Pandas commands I have learnt that can give quick insights into water datasets.
The dataset used for this practice is panel data of physico-chemical and microbiological water quality measurements of rivers in Bristol, UK, from 1995 to 2016. A very rich dataset to dig into. 🙂
For all my practices, I’ll be using the Spyder IDE, which comes with the IPython console and useful library packages such as Pandas pre-installed.
Importing libraries and loading data
First, we need to import Pandas with import pandas as pd, where pd is the conventionally accepted short form for Pandas. By defining the short form at the beginning of the program, we can access Pandas functions with ‘pd.some_command’ instead of writing ‘pandas.some_command’.
The river water quality dataset is a CSV file, so it is loaded in with the read_csv() function, with the full path to the locally stored dataset specified. In the code below, the loaded data is assigned to a variable I named bristol.
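As a minimal sketch of that step, the snippet below reads an in-memory stand-in for the CSV (the column names and values here are made up for illustration; the real call would take the path to the downloaded file instead):

```python
from io import StringIO

import pandas as pd

# Toy stand-in for the Bristol CSV. With the real file, this would be
# something like pd.read_csv("/path/to/bristol_water_quality.csv")
# (hypothetical path).
csv_text = """Date,Site,pH,Turbidity
26/01/1995,Avon,7.2,1.0
27/01/1995,Frome,,>1
"""
bristol = pd.read_csv(StringIO(csv_text))
```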
If we type bristol, we will get a snapshot of the data as the output, but it isn’t a pretty sight. There is not much useful information to glean here, except that the ten earliest entries contain a lot of missing values (NaN). In Spyder’s IPython console, only 4 columns are displayed by default, with the middle section truncated. To increase the number of columns displayed to, say, 20, we can use Pandas’ set_option() function.
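The one-liner for that is:

```python
import pandas as pd

# Raise the number of columns the console prints before truncating
pd.set_option("display.max_columns", 20)
```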
Alternatively, we can get a quick feel for the dataset’s size with the shape attribute. The output indicates that there are 4,372 rows and 19 columns.
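On a toy two-by-two frame (standing in for the real dataset), that looks like:

```python
import pandas as pd

bristol = pd.DataFrame({"pH": [7.2, 7.4], "Turbidity": [1.0, 3.5]})
rows, cols = bristol.shape  # (number of rows, number of columns)
```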
What are these 19 columns, you might wonder. Well, we can find out with bristol.columns. From the output below, we see that the columns contain various water quality parameters, such as pH, conductivity, turbidity, BOD and COD, as well as the sampling date and location of the sample.
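A tiny illustration, again on a made-up frame with a few of the column names from my dataset:

```python
import pandas as pd

bristol = pd.DataFrame({"Date": ["26/01/1995"], "pH": [7.2], "Turbidity": [1.0]})
print(bristol.columns.tolist())
```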
It is often also useful to know the data type of the columns, as the kind of analyses you can perform depends on whether a column has numeric or string values. To check the data types, the dtypes attribute is useful.
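The toy frame below reproduces the effect described next: one stray string such as ‘No data’ is enough to turn an otherwise numeric column into an ‘object’ column.

```python
import pandas as pd

# A single non-numeric entry forces the whole pH column to dtype 'object'
bristol = pd.DataFrame({"pH": ["7.2", "No data"], "Turbidity": [1.0, 3.5]})
print(bristol.dtypes)
```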
It turns out that all columns are ‘objects’, a data type for classifying string or mixed data types in Pandas. This is rather surprising, as we would naturally expect columns such as pH, Conductivity, Dissolved Oxygen, etc. to contain only numeric values. Such unexpected results are possible because the program automatically infers the data type of each column when the dataset is loaded. If we spot discrepancies, we can manually change the data type of specific columns with the astype() function, provided all the entries of the column are indeed of the same data type.
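A minimal example of such a manual conversion, using a made-up pH column where every entry happens to be numeric:

```python
import pandas as pd

bristol = pd.DataFrame({"pH": ["7.2", "6.9"]})  # numbers stored as strings
bristol["pH"] = bristol["pH"].astype(float)     # works because every entry is numeric
```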
Cleaning the data
Upon closer visual inspection of the data, I spotted some minor errors (e.g. turbidity misspelled as ‘Turburdity’) and some inconsistencies in the data entries (e.g. a ‘<’ character among numeric entries, and strings such as ‘No data’ or ‘Faulty Probe’). So, let’s tackle these issues first!
First, renaming the columns: this is achieved with the rename() function.
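A sketch of the fix, using the misspelled column from my dataset on a toy frame:

```python
import pandas as pd

bristol = pd.DataFrame({"Turburdity": [1.0, 3.5]})  # misspelled column name
bristol = bristol.rename(columns={"Turburdity": "Turbidity"})
```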
Next, we need to access the individual daily entries that we wish to edit. For instance, I want to edit the Nitrate measurement on 26 Jan 1995 from ‘>1’ to simply ‘1’. To select individual elements based on their row and column number, we can use the iloc indexer, with iloc[row number, column number] specified. It is crucial to note that the index of the first element is [0, 0] and not [1, 1]. After much debugging, I learnt the hard way that this piece of knowledge is especially important when writing for loops. We learn from the output below that the first element is a missing value (nan), and that its data type is ‘float’ (this is because Pandas represents missing values with NumPy’s floating-point nan).
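As a tiny illustration of that edit, with a one-row toy frame standing in for the real data:

```python
import pandas as pd

bristol = pd.DataFrame({"Date": ["26/01/1995"], "Nitrate": [">1"]})
# Zero-based indexing: row 0, column 1 is the Nitrate reading
bristol.iloc[0, 1] = "1"  # replace '>1' with '1'
```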
It would be quite tedious to check each of the 4,372 × 19 entries one by one, so I have written a small piece of code to do that for me. In essence, the for loop checks every column j within every row i to see if the entry starts with either ‘<’ or ‘>’. If it does, we ask the program to exclude the ‘<’ or ‘>’ from that entry.
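My loop was roughly along these lines (sketched here on a toy frame; the real code runs over the full bristol frame):

```python
import pandas as pd

bristol = pd.DataFrame({"Nitrate": [">1", "2.3"], "Turbidity": ["<0.5", "3.5"]})

# Walk every cell; strip a leading '<' or '>' from string entries
for i in range(bristol.shape[0]):          # rows
    for j in range(bristol.shape[1]):      # columns
        entry = bristol.iloc[i, j]
        if isinstance(entry, str) and entry.startswith(("<", ">")):
            bristol.iloc[i, j] = entry[1:]
```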
Next, we get rid of unwanted string values in the numeric columns. In the code below, I am using a quick and dirty way to pick out non-numeric values with the isdecimal() method and change them to NaN (not a number). In reality, this may or may not affect the analyses we perform later, but for the sake of this practice, this workaround is sufficient. Finally, I create a new dataset named bristol_new with the first row (string values with the measurement units) excluded.
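Here is a sketch of both steps on a toy column. One caveat I’m flagging as my own tweak: '1.0'.isdecimal() alone is False because of the dot, so the dot is stripped before the check; it remains quick and dirty, as noted above.

```python
import numpy as np
import pandas as pd

bristol = pd.DataFrame({"Turbidity": ["NTU", "1.0", "No data", "3.5"]})

# Cell by cell: if a string entry isn't numeric, overwrite it with NaN
for i in range(bristol.shape[0]):
    for j in range(bristol.shape[1]):
        entry = bristol.iloc[i, j]
        if isinstance(entry, str) and not entry.replace(".", "", 1).isdecimal():
            bristol.iloc[i, j] = np.nan

# Drop the first row (measurement units) and convert the rest to numeric
bristol_new = bristol.iloc[1:].astype(float)
```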
Descriptive statistics of the data
To see the descriptive statistics of each column, we use the describe() function. If a column contains numeric values, we get statistical measures such as the mean, median, max, min and percentiles as the output. For columns with mixed data types, the output instead comprises measures such as the count of entries, the number of unique values, the top occurring value and its frequency. For instance, let’s try to get the descriptive stats of the ‘Turbidity’ column.
Well, I was expecting to see the mean, median, percentiles etc., since all entries are now numeric values, but we are getting a more condensed description instead. There are 129 unique values, the top occurring turbidity value is 1.0 NTU (NaNs are automatically ignored), and there are 26 entries with that value.
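The toy series below (values stored as strings, which is what produces the condensed output) reproduces that behaviour on a small scale:

```python
import pandas as pd

# Object-typed column -> describe() reports count/unique/top/freq
turbidity = pd.Series(["1.0", "1.0", "3.5"], name="Turbidity")
desc = turbidity.describe()
print(desc)
```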
To dive into the more specific descriptive statistics, we can try commands such as mean(), median(), max() and min(). The beautiful thing about Pandas’ aggregation functions is that NaN values are automatically ignored. From the output below, we can tell that the turbidity values range from 0 to a whopping 760 NTU! The median value is 41 NTU, while the average is higher at 61.9 NTU (probably skewed by the outliers).
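A small demonstration of these aggregations, on a toy numeric series with one missing value that the functions silently skip:

```python
import pandas as pd

turbidity = pd.Series([0.0, 41.0, 760.0, None])  # None becomes NaN and is ignored
print(turbidity.mean(), turbidity.median(), turbidity.min(), turbidity.max())
```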
Finally, we can plot a histogram and a box plot of the Turbidity column to see what its distribution looks like. Just by eyeballing the plots, we can tell that most turbidity measurements are below roughly 150 NTU; higher values occur much less frequently.
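A sketch of the plotting calls, assuming Matplotlib is installed (the toy values below are made up; the real plots use the cleaned Turbidity column):

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so this runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

turbidity = pd.Series([1.0, 5.0, 41.0, 150.0, 760.0], name="Turbidity")

fig, (ax1, ax2) = plt.subplots(1, 2)
turbidity.plot.hist(ax=ax1, title="Histogram")
turbidity.plot.box(ax=ax2, title="Box plot")
```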
In conclusion, we looked at relatively simple Pandas functions that can give quick insights into water datasets. Please feel free to leave your comments below. In the second part of this post series, I’ll explore deeper questions, such as how the water quality of Bristol’s water bodies has changed over time, or the relationship between certain water parameters (e.g. temperature vs. dissolved oxygen, or turbidity vs. bacterial count). In the meantime, you can also check out these references, which I found immensely helpful in getting started on my Pandas journey.