Water data with Pandas – Part 1

There are many tutorials out there on exploratory data analysis using Python's popular Pandas library, but very few, if any, use water data as teaching examples. This is the inspiration behind my post: to share some of the useful (often simple, one-liner) Pandas commands I have learnt that can give quick insights into water datasets.

The dataset used for this practice is a panel dataset of physico-chemical and microbiological water quality measurements of rivers in Bristol, UK from 1995 to 2016. A very rich dataset to dig into. 🙂

For all my practices, I'll be using the Spyder programming platform, which comes with the IPython console and useful library packages such as Pandas pre-installed.

Importing libraries and loading data

First, we need to import Pandas using import pandas as pd, where pd is the conventionally accepted short form for Pandas. By defining the short form at the beginning of the program, we can access Pandas functions with pd.some_command instead of writing pandas.some_command.

The river water quality dataset is a csv file, so it is loaded with the read_csv() function, with the full path to the locally stored dataset specified. In the code below, the loaded data is assigned to a variable I named bristol.
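A minimal sketch of these first steps (the file name and path are placeholders; point read_csv() at wherever you saved the csv):

import pandas as pd

# Load the locally stored csv file; the path here is a placeholder
bristol = pd.read_csv('path/to/bristol_water_quality.csv')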

If we type bristol, we get a snapshot of the data as the output, but it isn't a pretty sight. There is not much useful information to glean from it, except that the ten earliest entries contain a lot of missing values (NaN). In Spyder's IPython console, only 4 columns are displayed by default, with the middle section truncated. To increase the number of columns displayed to, say, 20, we can use pd.set_option('display.max_columns', 20).

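The two commands from this step:

# Typing the variable name in the IPython console echoes a truncated snapshot
bristol

# Widen the display to show up to 20 columns
pd.set_option('display.max_columns', 20)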

Alternatively, we can get a quick feel for the dataset's size with the shape attribute. The output indicates that there are 4,372 rows and 19 columns.

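For example:

bristol.shape
# (4372, 19)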

What are these 19 columns, you might wonder. Well, we can find that out with bristol.columns. From the output, we see that the columns contain various water quality parameters such as pH, conductivity, turbidity, BOD and COD, as well as the sampling date and location of each sample.

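An illustrative call (the column names sketched in the output are assumptions; the real names come from the csv header):

bristol.columns
# Index(['Sample Date', 'Location', 'pH', 'Conductivity', 'Turburdity',
#        'BOD', 'COD', ...], dtype='object')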

It is often also useful to know the data type of each column, as the kind of analyses you can perform depends on whether the column holds numeric or string values. To check the data types, the dtypes attribute is useful.

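For example (output abridged, with illustrative column names):

bristol.dtypes
# Sample Date     object
# pH              object
# Conductivity    object
# ...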

It turns out that all columns are 'objects', the data type Pandas uses for string or mixed-type columns. This is rather surprising, as we would naturally expect columns such as pH, Conductivity, Dissolved Oxygen, etc. to contain only numeric values. Such unexpected results are possible because Pandas automatically infers the data type of each column when the dataset is loaded. If we spot discrepancies, we can manually change the data type of specific columns with the astype() function, provided all the entries of the column are indeed of the same data type.
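As a one-line sketch of that cast (hedged: it will only succeed once the pH column truly contains nothing but numeric values, which is not yet the case here):

# Cast the pH column to floats; raises an error if non-numeric strings remain
bristol['pH'] = bristol['pH'].astype(float)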

Cleaning the data

Upon closer visual inspection of the data, I spotted some minor errors (e.g. turbidity misspelled as Turburdity) and some inconsistencies in the data entries (e.g. '<' and '>' characters mixed in with numeric values, and strings such as 'No data' or 'Faulty Probe'). So, let's tackle these issues first!


First, renaming the columns. This is achieved with the rename() function.

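A sketch of the fix:

# Correct the misspelled column header in place
bristol.rename(columns={'Turburdity': 'Turbidity'}, inplace=True)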

Next, we need to access the individual entries that we wish to edit. For instance, I want to change the Nitrate measurement on 26 Jan 1995 from '>1' to simply '1'. To select individual elements by their row and column number, we can use the iloc indexer, specified as iloc[row number, column number]. It is crucial to note that the index of the first element is [0, 0] and not [1, 1]; after much debugging, I learnt the hard way that this piece of knowledge is especially important when writing for loops. From the output below, we learn that the first element is a missing value (nan) and that its data type is float. (This puzzled me at first, until I learnt that NaN is defined as a floating-point value in NumPy, which Pandas uses to represent missing data.)

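For example:

bristol.iloc[0, 0]            # first row, first column
# nan
type(bristol.iloc[0, 0])
# <class 'float'>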

It would be quite tedious to check each of the 4,372 x 19 entries one by one, so I have written a small piece of code to do that for me. In essence, the for loop checks every column j within every row i to see if the entry starts with either '<' or '>'. If it does, the program drops the '<' or '>' from that entry.
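A minimal sketch of that loop, assuming non-string entries (such as NaN) should simply be skipped:

# Walk every cell and strip a leading '<' or '>' where present
for i in range(bristol.shape[0]):              # every row
    for j in range(bristol.shape[1]):          # every column
        entry = bristol.iloc[i, j]
        if isinstance(entry, str) and entry.startswith(('<', '>')):
            bristol.iloc[i, j] = entry[1:]     # drop the leading character

(A vectorized alternative is to call .str.lstrip('<>') on each column, but the explicit loop mirrors the description above.)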

Next, we get rid of unwanted string values in the numeric columns. In the code below, I am using a quick and dirty way to pick out non-numeric values with the isdecimal() string method and changing them to NaN (not a number). In reality, this may or may not affect the analyses we perform later, but for the sake of this practice, the workaround is sufficient. Finally, I create a new dataset named bristol_new with the first row, which contains the measurement units as strings, excluded.
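A sketch of this workaround, with two assumptions clearly flagged: isdecimal() on its own rejects strings like '1.0', so I allow a single decimal point before testing, and I assume the first two columns hold the sampling date and location, so they are left untouched:

import numpy as np

def numeric_or_nan(x):
    # Keep numeric-looking strings; turn everything else
    # ('No data', 'Faulty Probe', etc.) into NaN
    if isinstance(x, str) and x.replace('.', '', 1).isdecimal():
        return x
    return np.nan

# Apply to the measurement columns only (assumed to start at the third column)
for col in bristol.columns[2:]:
    bristol[col] = bristol[col].apply(numeric_or_nan)

# New dataset without the first row, which holds the measurement units
bristol_new = bristol.iloc[1:]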

Descriptive statistics of the data

To see the descriptive statistics of each column, we use the describe() method. If a column contains numeric values, we get statistical measures such as the mean, median, max, min and percentiles as the output. For columns with mixed data types, the output instead gives the count of entries, the number of unique values, the top occurring value and its frequency. For instance, let's get the descriptive stats of the 'Turbidity' column.

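The call and its condensed output (the count row is omitted here):

bristol_new['Turbidity'].describe()
# unique     129
# top        1.0
# freq       26
# Name: Turbidity, dtype: object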

Well, I was expecting to see the mean, median, percentiles, etc., since all entries are now numeric, but we get the more condensed description because the column's dtype is still object. There are 129 unique values, the top occurring turbidity value is 1.0 NTU (NaNs are automatically ignored), and there are 26 entries with 1.0 NTU.

To dive into more specific descriptive statistics, we can try the following commands: mean(), median(), nunique(), max() and min(). The beautiful thing about Pandas' aggregation functions is that NaN values are automatically ignored. From the output below, we can tell that the turbidity values range from 0 to a whopping 760 NTU! The median value is 41 NTU, while the average is higher at 61.9 NTU (probably skewed by the outliers).

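The figures quoted above come from calls along these lines (casting to float first, since the column's dtype is still object):

turbidity = bristol_new['Turbidity'].astype(float)   # numeric strings to floats

turbidity.min()       # 0.0
turbidity.max()       # 760.0
turbidity.median()    # 41.0
turbidity.mean()      # 61.9 (approximately)
turbidity.nunique()   # 129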

Finally, we can plot a histogram and a box plot of the Turbidity column to see what its distribution looks like. Just by eyeballing the plots, we can tell that most turbidity measurements are below roughly 150 NTU, with higher values occurring much less frequently.

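A sketch of the two plots using pandas' built-in plotting wrappers (the bin count is my choice):

import matplotlib.pyplot as plt

turbidity = bristol_new['Turbidity'].astype(float)

# Histogram of turbidity values
turbidity.hist(bins=50)
plt.xlabel('Turbidity (NTU)')
plt.show()

# Box plot of the same column
turbidity.plot(kind='box')
plt.show()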

Conclusion

In conclusion, we looked at relatively simple Pandas functions that can give quick insights into water datasets. Please feel free to leave your comments below. In the second part of this series, I'll explore deeper questions, such as how the water quality of Bristol's water bodies has changed over time, and the relationships between certain water parameters (e.g. temperature vs. dissolved oxygen, or turbidity vs. bacterial count). In the meantime, you can also check out these references, which I found immensely helpful in getting started on my Pandas journey.
