Geometric operations¶

Overlay analysis¶

The aim here is to make an overlay analysis where we select only specific polygon cells from the data based on the borders of municipality of Helsinki.

Let’s first read the data.

import geopandas as gpd
import matplotlib.pyplot as plt

# File paths
border_fp = "/home/geo/data/Helsinki_borders.shp"
grid_fp = "/home/geo/data/TravelTimes_to_5975375_RailwayStation.shp"

# Read files
grid = gpd.read_file(grid_fp)
hel = gpd.read_file(border_fp)

Let’s check that the coordinate systems match.

In [1]: hel.crs
Out[1]: {'init': 'epsg:3067'}

In [2]: grid.crs
Out[2]: {'init': 'epsg:3067'}

They do.

Let’s see how our datasets look like, let’s use the Helsinki municipality layer as our basemap and plot the other layer on top of that.

In [3]: basemap = hel.plot()

In [4]: grid.plot(ax=basemap, linewidth=0.02);

# Use tight layout
In [5]: plt.tight_layout()

Let’s do an overlay analysis and select polygons from grid that intersect with our Helsinki layer.

Notice! This can be a slow procedure with less powerful computer. There is a way to overcome this issue by doing the analysis in batches which is explained below, see the performance-tip.

In [6]: result = gpd.overlay(grid, hel, how='intersection')

Let’s plot our data and see what we have.

In [7]: result.plot(color="b")
Out[7]: <matplotlib.axes._subplots.AxesSubplot at 0x12466e10>

# Use tight layout
In [8]: plt.tight_layout()

_images/helsinki_grid_borders_intersect.png

Cool! Now as a result we have only those grid cells included that intersect with the Helsinki borders and the grid cells are clipped based on the boundary.

Whatabout the data attributes? Let’s see what we have.

In [9]: result.head()
Out[9]: 
   car_m_d  car_m_t  car_r_d  car_r_t  from_id  pt_m_d  pt_m_t  pt_m_tt  \
  15981       36    15988       41  6002702   14698      65       73   
  16190       34    16197       39  6002701   14661      64       73   
  15727       33    15733       37  6001132   14256      59       69   
  15975       33    15982       37  6001131   14512      62       73   
  16136       35    16143       40  6001138   14730      65       73   

   pt_r_d  pt_r_t  pt_r_tt    to_id  walk_d  walk_t    GML_ID   NAMEFIN  \
 14698      61       72  5975375   14456     207  27517366  Helsinki   
 14661      60       72  5975375   14419     206  27517366  Helsinki   
 14256      55       62  5975375   14014     200  27517366  Helsinki   
 14512      58       70  5975375   14270     204  27517366  Helsinki   
 14730      61       72  5975375   14212     203  27517366  Helsinki   

       NAMESWE NATCODE                                           geometry  
Helsingfors     091  POLYGON ((391000.0001349226 6667750.00004299, ...  
Helsingfors     091  POLYGON ((390750.0001349644 6668000.000042951,...  
Helsingfors     091  POLYGON ((391000.0001349143 6668000.000042943,...  
Helsingfors     091  POLYGON ((390750.0001349644 6668000.000042951,...  
Helsingfors     091  POLYGON ((392500.0001346234 6668000.000042901,...  

Nice! Now we have attributes from both layers included.

Let’s see the length of the GeoDataFrame.

In [10]: len(result)
Out[10]: 3836

And the original data.

In [11]: len(grid)
Out[11]: 13231

Let’s save our result grid as a GeoJSON file that is another commonly used file format nowadays for storing spatial data.

resultfp = "/home/geo/data/TravelTimes_to_5975375_RailwayStation_Helsinki.geojson"

# Use GeoJSON driver
result.to_file(resultfp, driver="GeoJSON")

There are many more examples for different types of overlay analysis in Geopandas documentation where you can go and learn more.

Hint

Overlay analysis such as ours with around 13 000 polygons is quite CPU intensive task which can be quite slow to execute. Luckily, doing such analysis in batches improves the performance dramatically (spatial lookups are much quicker that way). The code snippet below shows how to do it with batch size of 10 rows which takes only around 1.5 minutes to run the analysis in our computer instance.

import geopandas as gpd
import numpy as np

# File paths
border_fp = "/home/geo/data/Helsinki_borders.shp"
grid_fp = "/home/geo/data/TravelTimes_to_5975375_RailwayStation.shp"

# Read files
grid = gpd.read_file(grid_fp)
hel = gpd.read_file(border_fp)

# Batch size
b = 10

# Number of iterations (round up with np.ceil) and convert to integer
row_cnt = len(grid)
iterations = int(np.ceil(row_cnt / b))

# Final result
final = gpd.GeoDataFrame()

# Set the start and end index according the batch size
start_idx = 0
end_idx = start_idx + b

for iteration in range(iterations):
    print("Iteration: %s/%s" % (iteration, iterations))

    # Make an overlay analysis using a subset of the rows
    result = gpd.overlay(grid[start_idx:end_idx], hel, how='intersection')

    # Append the overlay result to final GeoDataFrame
    final = final.append(result)

    # Update indices
    start_idx += b
    end_idx = start_idx + b

# Save the output as GeoJSON
outfp = "/home/geo/data/overlay_analysis_speedtest.geojson"
final.to_file(outfp, driver="GeoJSON")

Aggregating data¶

Aggregating data can also be useful sometimes. What we mean by aggregation is that we basically merge Geometries into together by some common identifier. Suppose we are interested in studying continents, but we only have country-level data like the country dataset. By aggregation we would convert this into a continent-level dataset.

Let’s aggregate our travel time data by car travel times, i.e. the grid cells that have the same travel time to Railway Station will be merged together.

In [12]: result_aggregated = result.dissolve(by="car_r_t")

# What do we have now
In [13]: result_aggregated.head()
Out[13]: 
                                                  geometry  car_m_d  car_m_t  \
car_r_t                                                                        
-1       (POLYGON ((388000.0001354737 6669000.000042855...       -1       -1   
 0       POLYGON ((386000.0001357812 6672000.000042388,...        0        0   
 7       POLYGON ((386250.0001357396 6671750.000042424,...     1059        7   
 8       (POLYGON ((386250.0001357467 6671500.000042468...     1207        8   
 9       (POLYGON ((387000.0001355996 6671500.000042449...     1768        8   

         car_r_d  from_id  pt_m_d  pt_m_t  pt_m_tt  pt_r_d  pt_r_t  pt_r_tt  \
car_r_t                                                                       
-1            -1  5996387      -1      -1       -1      -1      -1       -1   
 0             0  5975375       0       0        0       0       0        0   
 7          1059  5977007     447       6        6     447       6        6   
 8          1207  5978638    1026       9       11    1026       9       11   
 9          1768  5980269    1274      11       15    1274      11       15   

           to_id  walk_d  walk_t    GML_ID   NAMEFIN      NAMESWE NATCODE  
car_r_t                                                                    
-1            -1      -1      -1  27517366  Helsinki  Helsingfors     091  
 0       5975375       0       0  27517366  Helsinki  Helsingfors     091  
 7       5975375     447       6  27517366  Helsinki  Helsingfors     091  
 8       5975375     774      11  27517366  Helsinki  Helsingfors     091  
 9       5975375    1210      17  27517366  Helsinki  Helsingfors     091  

Let’s compare the number of cells in the layers before and after the aggregation.

# Number of rows before the aggregation
In [14]: len(result)
Out[14]: 3836

# Number of rows after the aggregation
In [15]: len(result_aggregated)
Out[15]: 51

Indeed the number of rows in our data has decreased and the Polygons were merged together.

Simplifying geometries¶

Docs will be added here soon.