Introduction to Geopandas¶
Reading a Shapefile¶
Spatial data can be read easily with geopandas using gpd.from_file()
-function:
# Import necessary modules
In [1]: import geopandas as gpd
# Set filepath (fix path relative to yours)
In [2]: fp = "/home/geo/Data/DAMSELFISH_distributions.shp"
# Read file using gpd.read_file()
In [3]: data = gpd.read_file(fp)
- Let’s see what datatype is our ‘data’ variable
In [4]: type(data)
Out[4]: geopandas.geodataframe.GeoDataFrame
Okey so from the above we can see that our data
-variable is a
GeoDataFrame. GeoDataFrame extends the functionalities of
pandas.DataFrame in a way that it is possible to use and handle
spatial data within pandas (hence the name geopandas). GeoDataFrame have
some special features and functions that are useful in GIS.
- Let’s take a look at our data and print the first 5 rows using the
head()
-function prints the first 5 rows by default
In [5]: data.head()
Out[5]:
id_no binomial origin compiler year \
0 183963.0 Stegastes leucorus 1 IUCN 2010
1 183963.0 Stegastes leucorus 1 IUCN 2010
2 183963.0 Stegastes leucorus 1 IUCN 2010
3 183793.0 Chromis intercrusma 1 IUCN 2010
4 183793.0 Chromis intercrusma 1 IUCN 2010
citation source dist_comm island \
0 International Union for Conservation of Nature... None None None
1 International Union for Conservation of Nature... None None None
2 International Union for Conservation of Nature... None None None
3 International Union for Conservation of Nature... None None None
4 International Union for Conservation of Nature... None None None
subspecies ... kingdom_na \
0 None ... ANIMALIA
1 None ... ANIMALIA
2 None ... ANIMALIA
3 None ... ANIMALIA
4 None ... ANIMALIA
phylum_nam class_name order_name family_nam genus_name \
0 CHORDATA ACTINOPTERYGII PERCIFORMES POMACENTRIDAE Stegastes
1 CHORDATA ACTINOPTERYGII PERCIFORMES POMACENTRIDAE Stegastes
2 CHORDATA ACTINOPTERYGII PERCIFORMES POMACENTRIDAE Stegastes
3 CHORDATA ACTINOPTERYGII PERCIFORMES POMACENTRIDAE Chromis
4 CHORDATA ACTINOPTERYGII PERCIFORMES POMACENTRIDAE Chromis
species_na category testCol \
0 leucorus VU 1
1 leucorus VU 1
2 leucorus VU 1
3 intercrusma LC 1
4 intercrusma LC 1
geometry
0 POLYGON ((-115.6437454219999 29.71392059300007...
1 POLYGON ((-105.589950704 21.89339825500002, -1...
2 POLYGON ((-111.159618439 19.01535626700007, -1...
3 POLYGON ((-80.86500229899997 -0.77894492099994...
4 POLYGON ((-67.33922225599997 -55.6761029239999...
[5 rows x 25 columns]
- Let’s also take a look how our data looks like on a map. If you just
want to explore your data on a map, you can use
.plot()
-function in geopandas that creates a simple map out of the data (uses matplotlib as a backend):
In [6]: data.plot();
Coordinate reference system (CRS)¶
GeoDataFrame that is read from a Shapefile contains always (well not always but should) information about the coordinate system in which the data is projected.
- We can see the current coordinate reference system from
.crs
attribute:
In [7]: data.crs
Out[7]: {'init': 'epsg:4326'}
Okey, so from this we can see that the data is something called epsg:4326. The EPSG number (“European Petroleum Survey Group”) is a code that tells about the coordinate system of the dataset. “EPSG Geodetic Parameter Dataset is a collection of definitions of coordinate reference systems and coordinate transformations which may be global, regional, national or local in application”. EPSG-number 4326 that we have here belongs to the WGS84 coordinate system (i.e. coordinates are in decimal degrees (lat, lon)). You can check easily different epsg-codes from this website.
Writing a Shapefile¶
Writing a new Shapefile is also something that is needed frequently.
- Let’s select 50 first rows of the input data and write those into a
new Shapefile by first selecting the data using index slicing and
then write the selection into a Shapefile with
gpd.to_file()
-function:
# Create a output path for the data
out = r"/home/geo/Data/DAMSELFISH_distributions_SELECTION.shp"
# Select first 50 rows
selection = data[0:50]
# Write those rows into a new Shapefile (the default output file format is Shapefile)
selection.to_file(out)
Task: Open the Shapefile now in QGIS that has been installed into our computer instance, and see how the data looks like.
Geometries in Geopandas¶
Geopandas takes advantage of Shapely’s geometric objects. Geometries are stored in a column called geometry that is a default column name for storing geometric information in geopandas.
- Let’s print the first 5 rows of the column ‘geometry’:
# It is possible to use only specific columns by specifying the column name within square brackets []
In [8]: data['geometry'].head()
Out[8]:
0 POLYGON ((-115.6437454219999 29.71392059300007...
1 POLYGON ((-105.589950704 21.89339825500002, -1...
2 POLYGON ((-111.159618439 19.01535626700007, -1...
3 POLYGON ((-80.86500229899997 -0.77894492099994...
4 POLYGON ((-67.33922225599997 -55.6761029239999...
Name: geometry, dtype: object
Since spatial data is stored as Shapely objects, it is possible to use all of the functionalities of Shapely module that we practiced earlier.
- Let’s print the areas of the first 5 polygons:
# Make a selection that contains only the first five rows
In [9]: selection = data[0:5]
- We can iterate over the selected rows using a specific
.iterrows()
-function in (geo)pandas:
In [10]: for index, row in selection.iterrows():
....: poly_area = row['geometry'].area
....: print("Polygon area at index {0} is: {1:.3f}".format(index, poly_area))
....:
Polygon area at index 0 is: 19.396
Polygon area at index 1 is: 6.146
Polygon area at index 2 is: 2.697
Polygon area at index 3 is: 87.461
Polygon area at index 4 is: 0.001
- Let’s create a new column into our GeoDataFrame where we calculate and store the areas individual polygons:
# Empty column for area
In [11]: data['area'] = None
- Let’s iterate over the rows and calculate the areas
# Iterate rows one at the time
for index, row in data.iterrows():
# Update the value in 'area' column with area information at index
data.loc[index, 'area'] = row['geometry'].area
- Let’s see the first 2 rows of our ‘area’ column
In [12]: data['area'].head(2)
Out[12]:
0 19.3963
1 6.1459
Name: area, dtype: object
- Let’s check what is the min and the max of those areas using familiar functions from our previous numpy lessions
# Maximum area
In [13]: max_area = data['area'].max()
# Minimum area
In [14]: min_area = data['area'].mean()
In [15]: print("Max area: %s\nMean area: %s" % (round(max_area, 2), round(min_area, 2)))
Max area: 1493.2
Mean area: 19.96
Creating geometries into a GeoDataFrame¶
Since geopandas takes advantage of Shapely geometric objects it is possible to create a Shapefile from a scratch by passing Shapely’s geometric objects into the GeoDataFrame. This is useful as it makes it easy to convert e.g. a text file that contains coordinates into a Shapefile.
- Let’s create an empty
GeoDataFrame
.
# Import necessary modules first
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point, Polygon
import fiona
# Create an empty geopandas GeoDataFrame
newdata = gpd.GeoDataFrame()
# Let's see what's inside
In [16]: newdata
Out[16]:
Empty GeoDataFrame
Columns: []
Index: []
The GeoDataFrame is empty since we haven’t placed any data inside.
- Let’s create a new column called
geometry
that will contain our Shapely objects:
# Create a new column called 'geometry' to the GeoDataFrame
In [17]: newdata['geometry'] = None
# Let's see what's inside
In [18]: newdata
Out[18]:
Empty GeoDataFrame
Columns: [geometry]
Index: []
Now we have a geometry column in our GeoDataFrame but we don’t have any data yet.
- Let’s create a Shapely Polygon repsenting the Helsinki Senate square that we can insert to our GeoDataFrame:
# Coordinates of the Helsinki Senate square in Decimal Degrees
In [19]: coordinates = [(24.950899, 60.169158), (24.953492, 60.169158), (24.953510, 60.170104), (24.950958, 60.169990)]
# Create a Shapely polygon from the coordinate-tuple list
In [20]: poly = Polygon(coordinates)
# Let's see what we have
In [21]: poly
Out[21]: <shapely.geometry.polygon.Polygon at 0xf05a160>
Okey, so now we have appropriate Polygon -object.
- Let’s insert the polygon into our ‘geometry’ column in our GeoDataFrame:
# Insert the polygon into 'geometry' -column at index 0
In [22]: newdata.loc[0, 'geometry'] = poly
# Let's see what we have now
In [23]: newdata
Out[23]:
geometry
0 POLYGON ((24.950899 60.169158, 24.953492 60.16...
Now we have a GeoDataFrame with Polygon that we can export to a Shapefile.
- Let’s add another column to our GeoDataFrame called
Location
with text Senaatintori.
# Add a new column and insert data
In [24]: newdata.loc[0, 'Location'] = 'Senaatintori'
# Let's check the data
In [25]: newdata
Out[25]:
geometry Location
0 POLYGON ((24.950899 60.169158, 24.953492 60.16... Senaatintori
Okey, now we have additional information that is useful to be able to recognice what the feature represents.
Before exporting the data it is useful to determine the spatial reference system for the GeoDataFrame.
As was shown earlier, GeoDataFrame has a property called .crs that shows the coordinate system of the data which is empty (None) in our case since we are creating the data from the scratch:
In [26]: print(newdata.crs)
None
- Let’s add a crs for our GeoDataFrame. A Python module called
fiona has a nice function called
from_epsg()
for passing coordinate system for the GeoDataFrame. Next we will use that and determine the projection to WGS84 (epsg code: 4326):
# Import specific function 'from_epsg' from fiona module
In [27]: from fiona.crs import from_epsg
# Set the GeoDataFrame's coordinate system to WGS84
In [28]: newdata.crs = from_epsg(4326)
# Let's see how the crs definition looks like
In [29]: newdata.crs
Out[29]: {'init': 'epsg:4326', 'no_defs': True}
- Finally, we can export the data using GeoDataFrames
.to_file()
-function. The function works similarly as numpy or pandas, but here we only need to provide the output path for the Shapefile. Easy isn’t it!:
# Determine the output path for the Shapefile
outfp = r"/home/geo/Data/Senaatintori.shp"
# Write the data into that Shapefile
newdata.to_file(out)
Now we have successfully created a Shapefile from the scratch using only Python programming. Similar approach can be used to for example to read coordinates from a text file (e.g. points) and create Shapefiles from those automatically.
Task: check the output Shapefile in QGIS and make sure that the attribute table seems correct.
Pro -tips (optional but recommended)¶
Grouping data¶
One really useful function that can be used in Pandas/Geopandas is .groupby(). This function groups data based on values on selected column(s).
- Let’s group individual fishes in
DAMSELFISH_distribution.shp
and export the species to individual Shapefiles.- Note: If your `data` -variable doesn’t contain the Damselfish data anymore, read the Shapefile again into memory using `gpd.read_file()` -function
# Group the data by column 'binomial'
In [30]: grouped = data.groupby('binomial')
# Let's see what we got
In [31]: grouped
Out[31]: <pandas.core.groupby.DataFrameGroupBy object at 0x000000000BCF4828>
groupby
-function gives us an object calledDataFrameGroupBy
which is similar to list of keys and values (in a dictionary) that we can iterate over.
# Iterate over the group object
In [32]: for key, values in grouped:
....: individual_fish = values
....:
# Let's see what is the LAST item that we iterated
In [33]: individual_fish
Out[33]:
id_no binomial origin compiler year \
27 154915.0 Teixeirichthys jordani 1 None 2012
28 154915.0 Teixeirichthys jordani 1 None 2012
29 154915.0 Teixeirichthys jordani 1 None 2012
30 154915.0 Teixeirichthys jordani 1 None 2012
31 154915.0 Teixeirichthys jordani 1 None 2012
32 154915.0 Teixeirichthys jordani 1 None 2012
33 154915.0 Teixeirichthys jordani 1 None 2012
citation source dist_comm island \
27 Red List Index (Sampled Approach), Zoological ... None None None
28 Red List Index (Sampled Approach), Zoological ... None None None
29 Red List Index (Sampled Approach), Zoological ... None None None
30 Red List Index (Sampled Approach), Zoological ... None None None
31 Red List Index (Sampled Approach), Zoological ... None None None
32 Red List Index (Sampled Approach), Zoological ... None None None
33 Red List Index (Sampled Approach), Zoological ... None None None
subspecies ... phylum_nam class_name order_name family_nam \
27 None ... CHORDATA ACTINOPTERYGII PERCIFORMES POMACENTRIDAE
28 None ... CHORDATA ACTINOPTERYGII PERCIFORMES POMACENTRIDAE
29 None ... CHORDATA ACTINOPTERYGII PERCIFORMES POMACENTRIDAE
30 None ... CHORDATA ACTINOPTERYGII PERCIFORMES POMACENTRIDAE
31 None ... CHORDATA ACTINOPTERYGII PERCIFORMES POMACENTRIDAE
32 None ... CHORDATA ACTINOPTERYGII PERCIFORMES POMACENTRIDAE
33 None ... CHORDATA ACTINOPTERYGII PERCIFORMES POMACENTRIDAE
genus_name species_na category testCol \
27 Teixeirichthys jordani LC 1
28 Teixeirichthys jordani LC 1
29 Teixeirichthys jordani LC 1
30 Teixeirichthys jordani LC 1
31 Teixeirichthys jordani LC 1
32 Teixeirichthys jordani LC 1
33 Teixeirichthys jordani LC 1
geometry area
27 POLYGON ((121.6300326400001 33.04248618400004,... 38.6712
28 POLYGON ((32.56219482400007 29.97488975500005,... 37.4457
29 POLYGON ((130.9052090560001 34.02498196400006,... 16.9395
30 POLYGON ((56.32233070000007 -3.707270205999976... 10.127
31 POLYGON ((40.64476131800006 -10.85502363999996... 7.7603
32 POLYGON ((48.11258402900006 -9.335103113999935... 3.43424
33 POLYGON ((51.75403543100003 -9.21679305899994,... 2.40862
[7 rows x 26 columns]
From here we can see that an individual_fish variable now contains all the rows that belongs to a fish called Teixeirichthys jordani
. Notice that the index numbers refer to the row numbers in the
original data -GeoDataFrame.
- Let’s check the datatype of the grouped object and what does the
key
variable contain
In [34]: type(individual_fish)
Out[34]: geopandas.geodataframe.GeoDataFrame
In [35]: print(key)