Geocoding

Overview of Geocoders

Geocoding, i.e. converting addresses into coordinates or vice versa, is a very common GIS task. Luckily, Python has good libraries that make geocoding easy. One of them is geopy, which locates the coordinates of addresses, cities, countries, and landmarks across the globe using third-party geocoders and other data sources.

As mentioned, geopy uses third-party geocoders, i.e. services that do the geocoding, to locate addresses, and it works with many different service providers.

Thus, there are plenty of geocoders to choose from! However, for most of these services you might need to request so-called API access keys from the service provider to be able to use the service.

Luckily, Nominatim, a geocoder based on OpenStreetMap data, does not require an API key for small-scale geocoding jobs; the service is rate-limited to 1 request per second (3600 per hour). As we are only making a small set of queries, we can do the geocoding with Nominatim.
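
To get a feel for what a limit of 1 request per second means in code, here is a minimal sketch of a throttling wrapper. The names throttled and fake_geocode are made up for this illustration and are not part of geopy or Geopandas; fake_geocode only stands in for a real geocoder call so that the example runs without a network connection.

```python
import time


def throttled(func, min_delay_seconds=1.0, clock=time.monotonic, sleep=time.sleep):
    """Wrap func so that consecutive calls start at least min_delay_seconds apart."""
    last_call = None  # time of the previous call, None before the first call

    def wrapper(*args, **kwargs):
        nonlocal last_call
        if last_call is not None:
            wait = min_delay_seconds - (clock() - last_call)
            if wait > 0:
                sleep(wait)  # pause until the minimum delay has passed
        last_call = clock()
        return func(*args, **kwargs)

    return wrapper


def fake_geocode(address):
    """Hypothetical stand-in for a real geocoder call; it only echoes the address."""
    return "geocoded: " + address


# A short delay keeps the demo quick; for Nominatim the delay would be 1.0 second
polite_geocode = throttled(fake_geocode, min_delay_seconds=0.05)

results = [polite_geocode(addr) for addr in ["Kampinkuja 1", "Kaivokatu 8"]]
print(results)
```

In real work it is probably better to use the ready-made helper that geopy ships for this purpose (RateLimiter in geopy.extra.rate_limiter) than to write your own.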

Note

  • Note 1: If you need to do larger-scale geocoding jobs, request an API key from one of the geocoding services that geopy supports and use that service instead.
  • Note 2: There are also other Python modules besides geopy that can do geocoding, such as Geocoder.

Hint

You can get access keys for e.g. the Google Geocoding API from the Google APIs console by creating a Project and enabling that API from the Library. Read a short introduction to using the Google API Console here.

Geocoding in Geopandas

Geopandas integrates the geocoding functionality of geopy. Its geocode() function takes a list of addresses (strings) and returns a GeoDataFrame containing the resulting point objects in the geometry column. Nice, isn’t it! Let’s try this out.

Download a text file called addresses.txt that contains a few addresses around the Helsinki Region. The first rows of the data look like the following:

id;addr
1000;Itämerenkatu 14, 00101 Helsinki, Finland
1001;Kampinkuja 1, 00100 Helsinki, Finland
1002;Kaivokatu 8, 00101 Helsinki, Finland
1003;Hermannin rantatie 1, 00580 Helsinki, Finland

We have an id for each row and an address in the column addr.

  • Let’s first read the data into a Pandas DataFrame using the read_csv() function:
# Import necessary modules
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

# Filepath
fp = r"addresses.txt"

# Read the data
data = pd.read_csv(fp, sep=';')
# Let's take a look at the data
In [1]: data.head()
Out[1]: 
     id                                           addr
0  1000       Itämerenkatu 14, 00101 Helsinki, Finland
1  1001          Kampinkuja 1, 00100 Helsinki, Finland
2  1002           Kaivokatu 8, 00101 Helsinki, Finland
3  1003  Hermannin rantatie 1, 00580 Helsinki, Finland
4  1005     Tyynenmerenkatu 9, 00220 Helsinki, Finland

Now we have our data in a Pandas DataFrame and we can geocode our addresses.

  • Let’s geocode the addresses using the Nominatim backend:
# Import the geocoding tool
In [2]: from geopandas.tools import geocode

# Geocode addresses with Nominatim backend
In [3]: geo = geocode(data['addr'], provider='nominatim')
---------------------------------------------------------------------------
GeocoderTimedOut                          Traceback (most recent call last)
<ipython-input-3-ba8493af24dd> in <module>()
----> 1 geo = geocode(data['addr'], provider='nominatim')

...

C:\ProgramData\Anaconda3\lib\site-packages\geopy\geocoders\base.py in _call_geocoder(self, url, timeout, raw, requester, deserializer, **kwargs)
    161             elif isinstance(error, URLError):
    162                 if "timed out" in message:
--> 163                     raise GeocoderTimedOut('Service timed out')

GeocoderTimedOut: Service timed out

In [4]: geo.head(2)
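
As the traceback above shows, a request to a free public service can time out. One common way to cope with this is to try again after a short wait. Below is a minimal sketch of that idea. The helper call_with_retries, the exception ServiceTimeout and the function flaky_geocode are all made up for this illustration; flaky_geocode simulates a service that times out twice before answering, so the example runs without a network connection.

```python
import time


class ServiceTimeout(Exception):
    """Hypothetical stand-in for a timeout error such as geopy's GeocoderTimedOut."""


def call_with_retries(func, *args, retries=3, wait_seconds=1.0,
                      retry_on=(ServiceTimeout,), sleep=time.sleep, **kwargs):
    """Call func and try again if it raises one of the retry_on exceptions."""
    for attempt in range(1, retries + 1):
        try:
            return func(*args, **kwargs)
        except retry_on:
            if attempt == retries:
                raise  # give up after the last attempt
            sleep(wait_seconds * attempt)  # wait a little longer each time


# Simulated service: the first two calls time out, the third one answers
calls = {"count": 0}


def flaky_geocode(address):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ServiceTimeout("Service timed out")
    return {"address": address, "attempts": calls["count"]}


result = call_with_retries(flaky_geocode, "Kaivokatu 8, 00101 Helsinki, Finland",
                           wait_seconds=0.01)
print(result["attempts"])  # 3
```

With the real service, the same helper could wrap the Geopandas call, for example call_with_retries(geocode, data['addr'], provider='nominatim', user_agent='my_geocoding_app', timeout=10, retry_on=(GeocoderTimedOut,)), where GeocoderTimedOut is imported from geopy.exc. This assumes that geocode() forwards the extra keyword arguments to the geopy geocoder, as its **kwargs signature in the traceback suggests; the user_agent value is a made-up name that you should replace with one identifying your own application.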

When the request succeeds, the result is a GeoDataFrame that contains the address and a ‘geometry’ column with Shapely Point objects, which we can use, for example, to export the addresses to a Shapefile. However, the id column is not there. Thus, we need to join the information from data into our new GeoDataFrame geo by making a table join.

Table join

Table joins are very common in GIS analyses. As you might remember from our earlier lessons, combining data from different tables based on a common key attribute can be done easily in Pandas/Geopandas using the .merge() function.

However, sometimes it is useful to join two tables based on the index of those DataFrames. In that case, we assume that both DataFrames have the same number of records and that the records are in the same order. That is exactly our situation: the order of the geocoded addresses in the geo DataFrame is the same as in our original data DataFrame.

Hence, we can join those tables with the join() function, which merges two DataFrames based on their index by default.

In [5]: join = geo.join(data)

In [6]: join.head()
  • Let’s also check the data type of our new join table.
In [7]: type(join)

As a result we have a new GeoDataFrame called join where we now have all original columns plus a new column for geometry.
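
The index-based join can also be tried without calling any geocoding service. The sketch below uses plain Pandas DataFrames: data holds a few rows from addresses.txt, and geo is a hand-made stand-in for the geocoding result, where the geometry values are placeholder strings and not real coordinates. It first checks the assumption that both tables are aligned row by row and then shows that join() gives the same columns as a merge on the index.

```python
import pandas as pd

# A few rows from addresses.txt
data = pd.DataFrame({
    "id": [1001, 1002, 1003],
    "addr": [
        "Kampinkuja 1, 00100 Helsinki, Finland",
        "Kaivokatu 8, 00101 Helsinki, Finland",
        "Hermannin rantatie 1, 00580 Helsinki, Finland",
    ],
})

# Hand-made stand-in for the geocoding result: same row order, same index.
# The geometry values are placeholders, not real geocoded points.
geo = pd.DataFrame({
    "geometry": ["<point for row 0>", "<point for row 1>", "<point for row 2>"],
    "address": data["addr"].tolist(),
})

# An index-based join is only safe if both tables are aligned row by row
if len(geo) != len(data) or not geo.index.equals(data.index):
    raise ValueError("geo and data are not aligned row by row")

# join() matches rows on the index by default
joined = geo.join(data)

# The same result written as a merge on the index of both tables
merged = geo.merge(data, left_index=True, right_index=True)

print(joined.columns.tolist())  # ['geometry', 'address', 'id', 'addr']
print(joined["id"].tolist())    # [1001, 1002, 1003]
```

Note that join() raises an error if both tables share a column name and no suffix is given, so it helps that the two tables here have different column names.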

  • Now it is easy to save our address points into a Shapefile:
# Output file path
outfp = r"/home/geo/addresses.shp"

# Save to Shapefile
join.to_file(outfp)

That’s it. Now we have successfully geocoded those addresses into Points and made a Shapefile out of them. Easy, isn’t it!

Hint

Nominatim works relatively well if you have well-defined and well-known addresses such as the ones used in this tutorial. However, in some cases you might not have such well-defined addresses; you might have e.g. only the name of a museum. In such cases, Nominatim might not give good results, and you might want to use e.g. the Google Geocoding API (V3). Take a look at last year's materials, where we show how to use the Google Geocoding API in a similar manner as we used Nominatim here.