Make Joins on Geographical Data: Spatial Support in Databricks

Databricks Runtime 17.1 introduces powerful geospatial support, featuring new Delta datatypes: geography and geometry, along with dozens of ST spatial functions. This makes it incredibly easy to perform joins on geographical data, such as connecting delivery locations with their corresponding delivery zones or cities.

Understanding Coordinate Reference Systems

You'll frequently encounter two standard codes in datatypes and error messages: 4326 and CRS84. Both describe the WGS 84 coordinate reference system, which uses latitude and longitude to locate positions on Earth.

EPSG:4326 is the official SRID (Spatial Reference System Identifier), while CRS84 is an equivalent definition used by OGC/WKT, with longitude–latitude axis order. The GEOGRAPHY type in Databricks always uses SRID 4326. The GEOMETRY type can use various SRIDs, or even none at all, since it can represent abstract x–y coordinate systems that aren't tied to the Earth.

Setting Up Delivery Zones

Let's start by creating a map with our cities (delivery zones):

This represents exactly what we define in our temporary view: polygons with latitude/longitude points stored in the GEOGRAPHY type. We can also store this data as a JSON string for other uses.

CREATE OR REPLACE TEMP VIEW cities_geo AS
SELECT
  name,
  ST_GEOMFROMTEXT(wkt, 4326) AS boundary_geog, -- 4326: Spatial Reference System Identifier
  ST_ASGEOJSON(boundary_geog) AS geo_json
FROM VALUES
('New York (Manhattan S)', 'POLYGON((-74.0180 40.7050,-74.0175 40.7400,-74.0060 40.7480,-73.9900 40.7480,-73.9750 40.7350,-73.9800 40.7050,-74.0180 40.7050))'),
('Jersey City', 'POLYGON((-74.1070 40.7050,-74.0950 40.7350,-74.0750 40.7400,-74.0550 40.7350,-74.0450 40.7150,-74.0600 40.6950,-74.0850 40.6950,-74.1070 40.7050))'),
('Hoboken', 'POLYGON((-74.0480 40.7350,-74.0465 40.7550,-74.0200 40.7550,-74.0200 40.7400,-74.0300 40.7350,-74.0480 40.7350))'),
('Newark', 'POLYGON((-74.2100 40.7100,-74.2000 40.7550,-74.1650 40.7550,-74.1450 40.7350,-74.1550 40.7100,-74.1850 40.7000,-74.2100 40.7100))'),
('Brooklyn (Central)', 'POLYGON((-73.9700 40.7050,-73.9650 40.7000,-73.9500 40.6950,-73.9300 40.6950,-73.9100 40.6850,-73.9050 40.6700,-73.9200 40.6600,-73.9450 40.6550,-73.9650 40.6650,-73.9750 40.6850,-73.9700 40.7050))'),
('Queens (LIC/Astoria)', 'POLYGON((-73.9550 40.7350,-73.9550 40.7600,-73.9400 40.7800,-73.9100 40.7800,-73.8950 40.7600,-73.9050 40.7400,-73.9250 40.7350,-73.9550 40.7350))')
AS t(name, wkt);

Notice that our datatype displays as CRS84 in the interface, but after saving in Unity Catalog, it appears as GEOMETRY(4326). Databricks is making an effort to educate users about the relationship between CRS84 and SRID 4326.

Creating Delivery Points

Now that we have our delivery zones, let's create specific delivery locations. These are simple longitude/latitude points representing food delivery addresses.

CREATE OR REPLACE TEMP VIEW deliveries_geo AS
SELECT
  CAST(id AS INT) AS id,
  CAST(lon AS DOUBLE) AS lon,
  CAST(lat AS DOUBLE) AS lat,
  ST_POINT(lon, lat, 4326) AS pt_geog -- GEOGRAPHY data type
FROM VALUES
  (1, -74.0080, 40.7130),
  (2, -74.0005, 40.7305),
  (3, -74.0710, 40.7220),
  (4, -74.0600, 40.7080),
  (5, -74.0320, 40.7450),
  (6, -74.0250, 40.7485),
  (7, -74.1800, 40.7350),
  (8, -74.1950, 40.7180),
  (9, -73.9550, 40.6850),
  (10, -73.9400, 40.6680),
  (11, -73.9300, 40.7550),
  (12, -73.9150, 40.7700) AS t (id, lon, lat);

Performing Spatial Joins

The beauty of Databricks' geospatial support becomes evident when joining delivery locations with their corresponding cities. This spatial join is surprisingly straightforward:

SELECT
  d.id AS delivery_id,
  c.name AS city_name
FROM deliveries_geo AS d
LEFT JOIN cities_geo AS c
  ON st_contains(c.boundary_geog, d.pt_geog);

This query uses the ST_CONTAINS function to determine which delivery points fall within each city's boundary polygon.

Going Nerd

Calculating Inter-City Distances

Here's another useful spatial operation: finding the center of each city (polygon) and calculating distances between them:

SELECT 
  c1.name AS city1,
  c2.name AS city2,
  ROUND(ST_DISTANCESPHEROID(ST_CENTROID(c1.boundary_geog), 
                           ST_CENTROID(c2.boundary_geog)) / 1000, 1) AS distance_km
FROM cities_geo AS c1
CROSS JOIN cities_geo AS c2
WHERE c1.name < c2.name

This query calculates the spheroidal distance between city centers in kilometers.

Advanced Example: Minimum Spanning Tree

For those interested in more complex applications, here's a sophisticated example that combines Databricks' new recursive CTE functionality with geospatial functions to implement Prim's algorithm for building a minimum spanning tree of delivery points:

DECLARE OR REPLACE VARIABLE starting_id INT DEFAULT 1;

WITH RECURSIVE mst AS (
  -- Anchor: start with the ID of our choice
  SELECT
    1 AS step,
    array(starting_id) AS visited,
    CAST(NULL AS INT) AS from_id,
    min(id) AS to_id,
    CAST(0.0 AS DOUBLE) AS edge_km,
    CAST(0.0 AS DOUBLE) AS total_km
  FROM deliveries_geo

  UNION ALL

  -- Step: among all edges from visited -> unvisited, pick the shortest
  SELECT step, visited, from_id, to_id, edge_km, total_km
  FROM (
    SELECT
      t.step + 1 AS step,
      concat(t.visited, array(v.id)) AS visited,
      u.id AS from_id,
      v.id AS to_id,
      ST_DISTANCESPHEROID(u.pt_geog, v.pt_geog) / 1000 AS edge_km,
      t.total_km + ST_DISTANCESPHEROID(u.pt_geog, v.pt_geog) / 1000 AS total_km,
      ROW_NUMBER() OVER (
        PARTITION BY t.step
        ORDER BY ST_DISTANCESPHEROID(u.pt_geog, v.pt_geog)
      ) AS rn
    FROM mst t
    JOIN deliveries_geo u
      ON array_contains(t.visited, u.id)           -- already in tree
    JOIN deliveries_geo v
      ON NOT array_contains(t.visited, v.id)       -- not yet in tree
    WHERE t.step < (SELECT COUNT(*) FROM deliveries_geo)
  ) ranked
  WHERE rn = 1
)
SELECT
  step,
  from_id,
  to_id,
  edge_km,
  total_km,
  visited AS path_ids
FROM mst
ORDER BY step;

Starting from delivery point 1, this algorithm finds the optimal route: [1, 2, 6, 5, 3, 4, 9, 10, 11, 12, 7, 8]. This represents the most efficient path if you had a helicopter for city-to-city deliveries.

Note: While exploring geographical data in Databricks, you might find it surprisingly easy to exceed the recursive limit of 1 million iterations with complex spatial algorithms.

Visualization with Plotly

While I initially attempted to use Spark's native plotting capabilities, the complexity of map visualizations led me to use Plotly Express with toPandas() instead. Here's the code for displaying zones and points in a Databricks notebook:

%python
import json
import plotly.express as px

# Cities polygon data
pdf_cities = df_cities.toPandas()
# Delivery points data
pdf_pts = df_deliveries.toPandas()

# Build FeatureCollection for polygons
features = [
    {"type": "Feature",
     "geometry": json.loads(s),
     "properties": {"name": n}}
    for n, s in zip(pdf_cities["name"], pdf_cities["geo_json"])
]
fc = {"type": "FeatureCollection", "features": features}

# Center the map on the first polygon
ring0 = fc["features"][0]["geometry"]["coordinates"][0]
center = {"lon": sum(p[0] for p in ring0)/len(ring0),
          "lat": sum(p[1] for p in ring0)/len(ring0)}

# Create base map with polygons
fig = px.choropleth_mapbox(
    pdf_cities.assign(loc=pdf_cities["name"]),
    geojson=fc,
    locations="loc",
    featureidkey="properties.name",
    color="loc",
    mapbox_style="open-street-map",
    center=center,
    zoom=10,
    opacity=0.55
)

# Add delivery points as markers
fig.add_scattermapbox(
    lon=pdf_pts["lon"],
    lat=pdf_pts["lat"],
    mode="markers",
    text=pdf_pts["id"].astype(str),
    name="Deliveries",
    marker=dict(
        size=16,
        sizemode="diameter",
        color="red"
    )
)

# Render in Databricks
displayHTML(fig.to_html(include_plotlyjs="cdn", full_html=False))

Final Thoughts

Databricks Runtime 17.1's geospatial capabilities open up powerful possibilities for location-based data analysis. From simple spatial joins to complex routing algorithms, these new features make geographic data processing more accessible and efficient than ever before. Whether you're optimizing delivery routes, analyzing customer distribution, or performing spatial analytics, these tools provide the foundation for sophisticated geospatial applications.

Hubert Dudek

Databricks MVP | Advisor to Databricks Product Board and Technical advisor to SunnyData

https://www.linkedin.com/in/hubertdudek/
Previous
Previous

New Databricks INSERT Features: INSERT REPLACE ON and INSERT REPLACE USING

Next
Next

Dashboards for Nerds: DataFrame Plotting in Databricks