Spatial join
Spatial join is yet another classic GIS problem. Getting attributes from one layer and transferring them into another layer based on their spatial relationship is something you most likely need to do on a regular basis.
The previous materials focused on learning how to perform a Point in Polygon query. We could now apply those techniques and create our own function to perform a spatial join between two layers based on their spatial relationship. We could, for example, join the attributes of a polygon layer into a point layer where each point would get the attributes of the polygon that contains the point.
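To make the idea concrete, here is a minimal sketch of what such a hand-written join could look like. It uses plain Shapely geometries and Python dictionaries; the toy cells, the points, and the function name naive_point_in_polygon_join are all hypothetical and only serve as illustration.

```python
from shapely.geometry import Point, Polygon

# Hypothetical toy data: two square "grid cells" with a population attribute
polygons = [
    {"pop": 10, "geometry": Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])},
    {"pop": 25, "geometry": Polygon([(1, 0), (2, 0), (2, 1), (1, 1)])},
]

# Hypothetical points; the last one falls outside both cells
points = [
    {"id": 1, "geometry": Point(0.5, 0.5)},
    {"id": 2, "geometry": Point(1.5, 0.2)},
    {"id": 3, "geometry": Point(5.0, 5.0)},
]


def naive_point_in_polygon_join(points, polygons):
    """Attach the attributes of the first containing polygon to each point."""
    joined = []
    for pt in points:
        for poly in polygons:
            if poly["geometry"].contains(pt["geometry"]):
                row = dict(pt)            # copy the point's own attributes
                row["pop"] = poly["pop"]  # add the polygon's attribute
                joined.append(row)
                break  # stop at the first matching polygon
    return joined


result = naive_point_in_polygon_join(points, polygons)
print([(row["id"], row["pop"]) for row in result])  # [(1, 10), (2, 25)]
```

Point 3 is dropped because no polygon contains it. The nested loop compares every point with every polygon, so it would be slow on large layers.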
Luckily, spatial join is already implemented in Geopandas, so we do not need to create it ourselves. Three types of spatial relationship can be used for the join. They are selected with the op parameter of the gpd.sjoin() function (Geopandas 0.10 and later call this parameter predicate):
- "intersects"
- "within"
- "contains"
Sounds familiar? Yep, all of those spatial relationships were discussed in the previous materials, thus you should know how they work.
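As a quick reminder, the three relationships can be tried out directly on Shapely geometries. The shapes below are made up for illustration only.

```python
from shapely.geometry import Point, Polygon

# Hypothetical geometries
cell = Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])
neighbour = Polygon([(1, 1), (3, 1), (3, 3), (1, 3)])  # partly overlaps cell
inside = Point(1, 1)
outside = Point(3, 3)

print(inside.within(cell))         # True: the point lies inside the cell
print(cell.contains(inside))       # True: the same relationship seen from the cell
print(outside.within(cell))        # False
print(cell.intersects(neighbour))  # True: the two squares share some area
print(cell.contains(neighbour))    # False: the neighbour is not fully inside
```

Note that within and contains are mirror images of each other, so which one to use depends on which layer is on the left side of the join.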
Let’s perform a spatial join between these two layers:
- Addresses: the address-point Shapefile that we created and reprojected previously
- Population grid: a Polygon layer that is a 250m x 250m grid showing the number of people living in the Helsinki Region.
- The population grid is a dataset produced by the Helsinki Region Environmental Services Authority (HSY) (see this page to access data from different years).
- For this lesson we will use the population grid for year 2015, which can be downloaded from the Helsinki Region Infoshare (HRI) open data portal.
Clean the data
- Let’s read the data into memory and see what we have.
In [2]:
import geopandas as gpd
# Filepath
fp = "L4_data/Vaestotietoruudukko_2015.shp"
# Read the data
pop = gpd.read_file(fp)
# See the first rows
pop.head()
Out[2]:
|   | INDEX | ASUKKAITA | ASVALJYYS | IKA0_9 | IKA10_19 | IKA20_29 | IKA30_39 | IKA40_49 | IKA50_59 | IKA60_69 | IKA70_79 | IKA_YLI80 | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 688 | 8 | 31.0 | 99 | 99 | 99 | 99 | 99 | 99 | 99 | 99 | 99 | POLYGON ((25472499.99532626 6689749.005069185,... |
| 1 | 703 | 6 | 42.0 | 99 | 99 | 99 | 99 | 99 | 99 | 99 | 99 | 99 | POLYGON ((25472499.99532626 6685998.998064222,... |
| 2 | 710 | 8 | 44.0 | 99 | 99 | 99 | 99 | 99 | 99 | 99 | 99 | 99 | POLYGON ((25472499.99532626 6684249.004130407,... |
| 3 | 711 | 7 | 64.0 | 99 | 99 | 99 | 99 | 99 | 99 | 99 | 99 | 99 | POLYGON ((25472499.99532626 6683999.004997005,... |
| 4 | 715 | 19 | 23.0 | 99 | 99 | 99 | 99 | 99 | 99 | 99 | 99 | 99 | POLYGON ((25472499.99532626 6682998.998461431,... |
Okay, so we have multiple columns in the dataset, but the most important one here is the column ASUKKAITA ("inhabitants" in Finnish), which tells the number of inhabitants living within that polygon.
- Let’s change the name of that column to pop15 so that it is more intuitive. Changing column names is easy in Pandas / Geopandas using a function called rename(), where we pass a dictionary to the parameter columns={'oldname': 'newname'}.
In [3]:
# Change the name of a column
pop = pop.rename(columns={'ASUKKAITA': 'pop15'})
# See the column names and confirm that we now have a column called 'pop15'
pop.columns
Out[3]:
Index(['INDEX', 'pop15', 'ASVALJYYS', 'IKA0_9', 'IKA10_19', 'IKA20_29',
'IKA30_39', 'IKA40_49', 'IKA50_59', 'IKA60_69', 'IKA70_79', 'IKA_YLI80',
'geometry'],
dtype='object')
- Let’s also get rid of all unnecessary columns by selecting only the columns that we need, i.e. pop15 and geometry.
In [4]:
# Columns that will be selected
selected_cols = ['pop15', 'geometry']
# Select those columns
pop = pop[selected_cols]
# Let's see the first rows
pop.head()
Out[4]:
|   | pop15 | geometry |
|---|---|---|
| 0 | 8 | POLYGON ((25472499.99532626 6689749.005069185,... |
| 1 | 6 | POLYGON ((25472499.99532626 6685998.998064222,... |
| 2 | 8 | POLYGON ((25472499.99532626 6684249.004130407,... |
| 3 | 7 | POLYGON ((25472499.99532626 6683999.004997005,... |
| 4 | 19 | POLYGON ((25472499.99532626 6682998.998461431,... |
Now we have cleaned the data and have only those columns that we need for our analysis.
Join the layers
Now we are ready to perform the spatial join between the two layers that we have. The aim here is to get information about how many people live in the polygon that contains an individual address-point. Thus, we want to join attributes from the population layer we just modified into the addresses point layer addresses_epsg3879.shp.
- Read the addresses layer into memory
In [5]:
# Addresses filepath
addr_fp = "L4_data/addresses.shp"
# Read data
addresses = gpd.read_file(addr_fp)
# Check the head of the file
addresses.head()
Out[5]:
|   | address | id | geometry |
|---|---|---|---|
| 0 | Kampinkuja 1, 00100 Helsinki, Finland | 1001 | POINT (24.9301701 60.1683731) |
| 1 | Kaivokatu 8, 00101 Helsinki, Finland | 1002 | POINT (24.9418933 60.1698665) |
| 2 | Hermanstads strandsväg 1, 00580 Helsingfors, F... | 1003 | POINT (24.9774004 60.18735880000001) |
| 3 | Itäväylä, 00900 Helsinki, Finland | 1004 | POINT (25.0919641 60.21448089999999) |
| 4 | Tyynenmerenkatu 9, 00220 Helsinki, Finland | 1005 | POINT (24.9214846 60.1565781) |
In order to do a spatial join, the layers need to be in the same coordinate reference system (projection):
In [6]:
# Do they match? - We can test that
addresses.crs == pop.crs
Out[6]:
False
Re-project addresses to the projection of the population layer:
In [7]:
addresses = addresses.to_crs(pop.crs)
- Let’s make sure that the coordinate reference systems of the layers are identical:
In [8]:
# Check the crs of address points
print(addresses.crs)
# Check the crs of population layer
print(pop.crs)
# Do they match now?
addresses.crs == pop.crs
{'proj': 'tmerc', 'lat_0': 0, 'lon_0': 25, 'k': 1, 'x_0': 25500000, 'y_0': 0, 'ellps': 'GRS80', 'units': 'm', 'no_defs': True}
{'proj': 'tmerc', 'lat_0': 0, 'lon_0': 25, 'k': 1, 'x_0': 25500000, 'y_0': 0, 'ellps': 'GRS80', 'units': 'm', 'no_defs': True}
Out[8]:
True
Indeed they are identical. Thus, we can be sure that when doing spatial queries between layers the locations match and we get the right results e.g. from the spatial join that we are conducting here.
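To see why this matters, the following sketch compares one point against one polygon before and after re-projection. The coordinates of the point are those of the first address shown above. We assume they are WGS84 longitude/latitude (EPSG:4326), and the 250 m cell is a hypothetical one drawn around the address. The crs="EPSG:4326" string assumes a reasonably recent Geopandas version.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# First address from the table above, assumed to be lon/lat in EPSG:4326
pt_wgs84 = gpd.GeoSeries([Point(24.9301701, 60.1683731)], crs="EPSG:4326")

# Hypothetical 250 m x 250 m cell in EPSG:3879 metres around that address
cell = box(25496000, 6672750, 25496250, 6673000)

# Degrees compared against metres: the point appears to be far from the cell
print(pt_wgs84.iloc[0].within(cell))

# After re-projecting to EPSG:3879 the same address falls inside the cell
pt_gk25 = pt_wgs84.to_crs(epsg=3879)
print(pt_gk25.iloc[0].within(cell))
```

Shapely itself knows nothing about coordinate systems, so a query between mismatching layers would silently return no matches instead of raising an error.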
- Let’s now join the attributes from the pop GeoDataFrame into the addresses GeoDataFrame by using the gpd.sjoin() function:
In [9]:
# Make a spatial join
join = gpd.sjoin(addresses, pop, how="inner", op="within")
# Let's check the result
join.head()
Out[9]:
|   | address | id | geometry | index_right | pop15 |
|---|---|---|---|---|---|
| 0 | Kampinkuja 1, 00100 Helsinki, Finland | 1001 | POINT (25496123.30852197 6672833.941567578) | 3326 | 173 |
| 1 | Kaivokatu 8, 00101 Helsinki, Finland | 1002 | POINT (25496774.28242895 6672999.698581985) | 3449 | 31 |
| 10 | Rautatientori 1, 00100 Helsinki, Finland | 1011 | POINT (25496808.64582102 6673146.836896984) | 3449 | 31 |
| 3 | Itäväylä, 00900 Helsinki, Finland | 1004 | POINT (25505098.34340289 6677972.568484426) | 5112 | 353 |
| 4 | Tyynenmerenkatu 9, 00220 Helsinki, Finland | 1005 | POINT (25495639.56049686 6671520.343245601) | 3259 | 1397 |
Awesome! Now we have performed a successful spatial join where we got two new columns into our join GeoDataFrame, i.e. index_right, which tells the index of the matching polygon in the pop layer, and pop15, which is the population in the cell where the address-point is located.
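The how parameter decides what happens to points that do not fall inside any polygon. The sketch below illustrates this with a small made-up grid and three made-up points. The helper sjoin_within is hypothetical: it only exists so that the example runs both on Geopandas versions that expect the predicate keyword and on older ones that expect op.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Hypothetical grid of two cells with a population attribute
cells = gpd.GeoDataFrame(
    {"pop15": [100, 250]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
)

# Hypothetical address points; "C" lies outside the grid
pts = gpd.GeoDataFrame(
    {"address": ["A", "B", "C"]},
    geometry=[Point(0.5, 0.5), Point(1.5, 0.5), Point(9, 9)],
)


def sjoin_within(left, right, how):
    """Run gpd.sjoin with 'within' using whichever keyword is accepted."""
    try:
        return gpd.sjoin(left, right, how=how, predicate="within")
    except TypeError:
        # Older Geopandas releases only know the `op` keyword
        return gpd.sjoin(left, right, how=how, op="within")


# how="inner": only points that matched a polygon are kept
inner = sjoin_within(pts, cells, how="inner")
# how="left": all points are kept, unmatched ones get NaN attributes
left = sjoin_within(pts, cells, how="left")

print(inner[["address", "index_right", "pop15"]])
print(left[["address", "index_right", "pop15"]])
```

With how="inner" the point C disappears from the result, whereas how="left" keeps it with missing values in index_right and pop15. The inner join is a reasonable choice when points without population information are of no interest.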
- Let’s save this layer into a new Shapefile
In [10]:
# Output path
outfp = "L4_data/addresses_pop15_epsg3879.shp"
# Save to disk
join.to_file(outfp)
Do the results make sense? Let’s evaluate this a bit by plotting the points where color intensity indicates the population numbers.
- Plot the points and use the pop15 column to indicate the color. The cmap parameter tells to use a sequential colormap for the values, markersize adjusts the size of a point, the scheme parameter can be used to adjust the classification method based on pysal (http://pysal.readthedocs.io/en/latest/library/esda/mapclassify.html), and legend tells that we want to have a legend.
In [12]:
%matplotlib inline
import matplotlib.pyplot as plt
# Plot the points with population info
join.plot(column='pop15', cmap="Reds", markersize=7, scheme='quantiles', legend=True);
# Add title
plt.title("Number of inhabitants living close to the point");
# Remove white space around the figure
plt.tight_layout()
By knowing approximately how population is distributed in Helsinki, it seems that the results do make sense as the points with highest population are located in the south where the city center of Helsinki is.
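To get a feeling for what a quantile classification does to the colors, here is a rough sketch that uses pandas.qcut as a stand-in. It is not what Geopandas calls internally (that is done by the classification library linked above). The population values and the choice of five classes are made up for illustration.

```python
import pandas as pd

# Hypothetical population values for ten address points
pop15 = pd.Series([5, 12, 31, 45, 173, 250, 353, 520, 900, 1397])

# Split the values into 5 classes holding equal numbers of points
classes = pd.qcut(pop15, q=5, labels=False)

# Number of points in each class (0 = lowest values, 4 = highest)
print(classes.value_counts().sort_index())
```

Each class ends up with two points, even though the value ranges of the classes are very different. This is why a quantile scheme shows variation well in skewed data such as population counts.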