Using mongodb to store UK deprivation data

Calum Macdonald
6 min readOct 27, 2020

Storing data in a clean efficient way is key to any analysis, no matter what the data is. It’s time that certain organisations and more people moved away from keeping data in horrible excel files.

In this article I will show you how to get, wrangle and store deprivation data obtained from gov.uk in a NoSQL database — mongodb , teaching you something basic about this data and modern databases!

You’ll need some experience with using the terminal and python to be able to follow this article — don’t be afraid to learn!

https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/833959/IoD2019_Infographic.pdf

They don’t make it easy…

When looking for UK data from a variety of government sources (ONS, geoportal, …) , it’s often only downloadable in horrible .xlsx and .ods formats. This requires a lot of manual manipulation to process the files to even begin to analyse them in python ,javascript etc.

At least in the case of the 2019 Index of Multiple Deprivation (IMD) Data (England Only), the data is packaged in a cleaner .csv format.

In this csv file, a row is given for each Lower Super Output Area (LSOA), scoring deprivation levels. These areas vary in size (Area [Km²] ), in order to contain a total population of about 1500–2000 people, on average, across the UK.

If we want to make some maps of say the IMD score for LSOAs in Sunderland, we could do this in a variety of ways with our favourite tools and modules inpython, R, javascript etc.

It’s pretty straight forward to load up the csv and connect it with geographical boundaries for the LSOAs (e.g. a geojson).

However, if you are working on multiple projects, or sharing data, it can be a pain to keep track of the files you’ve downloaded. Often you duplicate or have to rewrite a lot of the code in different programming languages as well.

Most people use SQL based databases, I personally prefer NoSQL databases, such as mongodb which I find a lot more intuitive when handling smaller databases. Especially, if like me, you have an object-orientated programming background and don’t like SQL (though mongodb is a document-orientated database).

Installing mongodb

To get started, you need to install mongodb , which will also setup a bunch of command line tools.

For a personal machine, you can use:

There are many instructions out there if you want to install the database on a Virtual Private Server (VPS), such as:

Running mongo

When you first try to run the mongo client (which comes with the installation of mongodb) on the command line, you’ll probably see a message telling you that there was an error connecting to mongodb://127.0.0.1:27017 , which is the default IP(localhost) and port it tries to listen to.

You need to start mongodb running first, there are many docs on how to do this, but (on macOS) you can configure the file /usr/local/etc/mongod.conf to select a different IP and change where the database is actually being stored.

To check things are working, we can run the following on the command line:

$ mongod --config /usr/local/etc/mongod.conf &
$ mongo
MongoDB shell version v4.2.3connecting to: mongodb://127.0.0.1:27017/?compressors=disabled&gssapiServiceName=mongodbImplicit session: session { "id" : UUID("4e9e8e4f-77a6-4920-99d2-57dedc86d33b") }MongoDB server version: 4.2.3
> show dbs
admin 0.000GB
config 0.000GB
local 0.000GB

Adding a csv to mongodb

We can download and add the IMD csv file to a new database called MyDataBase in a new collection called LSOA_IMD , with a very simple command:

$ mongoimport \
-d MyDataBase \
-c LSOA_IMD\
--type csv --file /path/to/file.csv\
--headerline

Using the command line client interface (there is also a GUI if you prefer) we can check this has been loaded, then find and print a single object in the database :

$ mongo> show dbs
MyDataBase 0.039GB
admin 0.000GB
config 0.000GB
local 0.000GB
> use MyDataBase
switched to db MyDataBase
> show collections
LSOA_IMD
> db.LSOA_IMD.findOne()
{
"_id" : ObjectId("5f980d97a6598f6b4b064584"),
"LSOA code (2011)" : "E01000002",
"LSOA name (2011)" : "City of London 001B",
"Local Authority District code (2019)" : "E09000001",
"Local Authority District name (2019)" : "City of London",
"Index of Multiple Deprivation (IMD) Score" : 5.143,
...
}

There is extensive documentation learning more about these commands

Using PyMongo

I often use the python modulepymongo to manipulate and load data from mongodb seamlessly.

First you can connect to the database. If you are running this on a VPS or a shared network, you will likely want to configure users with authentications to avoid someone stealing and/or deleting your data.

import pymongo
ip = '127.0.0.1'
port = '27017'
db_name = 'MyDataBase'
client = pymongo.MongoClient('mongodb://%s:%s/%s' % (ip,port,db_name))
client.server_info()

So see all available collections within the database, you can use:

>> db = client[db_name]
>> db.list_collection_names()
['LSOA_IMD']

We can run a very similar command as we did with the mongo command line client, to show us a single entry in the collection. Once you learn the basics on mongo, you can use these commands in multiple languages. In the following example we also choose to only grab two “fields” (columns) from the database collection.

>> db['LSOA_IMD'].find_one({},\
{
'LSOA name (2011)':True,\
'Index of Multiple Deprivation (IMD) Score':True\
})
{'_id': ObjectId('5f980d97a6598f6b4b064584'),
'LSOA name (2011)': 'City of London 001B',
'Index of Multiple Deprivation (IMD) Score': 5.143}

If we want to grab a LSOA associated with the Local Authority District (LAD) of Sunderland, we can do something like:

>>> db['LSOA_IMD'].find_one(\
{\
'Local Authority District name\
2019)':'Sunderland'\
},\
{\
'LSOA name (2011)':True,\
'Index of Multiple Deprivation (IMD)\
Score':True\
})
{'_id': ObjectId('5f980d97a6598f6b4b066698'),
'LSOA name (2011)': 'Sunderland 008A',
'Index of Multiple Deprivation (IMD) Score': 38.378}

The following code will get retrieve all LSOAs associated with the LAD of Sunderland and put them into a pandas dataframe.

Obviously this could have also been performed using pandas.read_csv on the original csv file, the point however, is that now the data can be easily retrieved and saved from any location.

import pandas
df_imd = pandas.DataFrame(list(db['LSOA_IMD'].find({'Local Authority District name (2019)':'Sunderland'},\
{'_id':False,'LSOA code (2011)':True,'LSOA name (2011)':True,'Index of Multiple Deprivation (IMD) Score':True})))
df_imd

Adding to the DB with PyMongo

The beauty of the simplicity of pymongo means that we can easily add more data to the database in the form of these so-called collections.

As an example, this dataset from the ONS which comes in the form of an excel spreadsheet xlsx can be formatted using the python package xlrd and pandas, then inserted into the database with pymongo

First we have to manipulate the horrible spreadsheet we downloaded, before converting and inserting the data into a pandas dataframe:

The dataframe is uploaded into the database in a new collection called LSOA_pop_density_2019 with two additional lines:

#convert the dataframe into a dictionary
data = df_pop_density_2019.reset_index().to_dict(orient='records')
#insert into the database as a new collection
db['LSOA_pop_density_2019'].insert_many(data)

Going back to the mongo client, we can check this collection is stored:

> show collections
LSOA_IMD
LSOA_pop_density_2019
> db.LSOA_pop_density_2019.findOne(){
"_id" : ObjectId("5f981830a9d643007dcceb3f"),
"LSOA Code" : "E01011949",
"LSOA Name" : "Hartlepool 009A",
"Mid-2019 population" : 1954,
"Area Sq Km" : 0.5189,
"People per Sq Km" : 3765.658122952399
}

To remove the collection from the database while using the mongo client, it’s as simple as:

> db.LSOA_pop_density_2019.drop()

In python, with pymongo this would be:

>> db['LSOA_pop_density_2019'].drop()

Stay tuned for this blog series where we show you:

  • How to perform more complex mongodb queries in python and JavaScript (Node.js).
  • How to produce “choropleth maps” from government deprivation data and LSOA map boundaries.
  • How to apply and use some basic Machine Learning techniques with this data.

--

--

Calum Macdonald

Particle Physicist // Data Scientist // Data Analyst