How We Built Scalable Spatial Data & Spatial Indexing in CockroachDB

I haven’t ever worked with spatial data (or PostGIS) before, so this post didn’t make much sense until I found these definitions for “object” and “space” (source):

Primitive types:

  • Points and point clusters
  • Line strings
  • n-point polygons
  • Arc line strings (All arcs are generated as circular arcs.)
  • Arc polygons
  • Compound polygons
  • Compound line strings
  • Circles
  • Optimized rectangle

A geometry object is the representation of a spatial feature, modeled as an ordered set of primitive elements. A geometry can consist of a single element, which is an instance of one of the supported primitive types, or a homogeneous or heterogeneous collection of elements. An example of a geometry might describe the buildable land in a town. This could be represented as a polygon with holes where water or zoning prevents construction.

A spatial index, like any other index, provides a mechanism to limit searches within tables (or data spaces) based on spatial criteria such as intersection, and containment. A spatial index is required to:

  • Find objects within an indexed data space that overlap a given point or area-of-interest (window query)
  • Find pairs of objects from within two indexed data spaces that spatially interact with each other (spatial join)

Two options for spatial indexes were evaluated (are there other options typically available that weren’t considered here?):

  • Divide the objects
  • Divide the space

CockroachDB chose the latter because the former is incompatible with their dynamic sharding strategy.

Current approaches to spatial indexes fall into two categories. One approach is to “divide the objects”. This works by inserting the objects into a balanced tree whose structure depends on the data being indexed. The other approach is to “divide the space”. This works by creating a decomposition of the space being indexed into buckets of various sizes.

when an object/shape is indexed, a “covering” shape(s) (e.g. a bounding box) is constructed that completely encompasses the indexed object. Index queries work by looking for containment/intersection between the covering shape(s) for the query object/shape and the indexed covering shapes. This retrieves false positives but no false negatives.

Divide the Objects

PostGIS is a notable implementation of divide the objects. It maintains an “R tree” (rectangle tree) which is implemented as a Postgres “GiST” index. A search starting from the root can omit sub-trees that have no overlap.

Divide the Space

The other approach for spatial indexes is to “divide the space” into a quad-tree (or a set of quad-trees) with a set number of levels and a data-independent shape. Each node in the quad-tree (a “cell”) represents some part of the indexed space and is divided once horizontally and once vertically to produce 4 children in the next level.

Divide the space algorithms tend to use clever strategies for the unique numeric cell-IDs with important guarantees. The S2 library from Google is an example of this approach for dividing the earth and assigns cell-IDs using a Hilbert curve .

Why We chose the “Divide the Space” Approach for CockroachDB

CockroachDB is a dynamically sharded, horizontally scalable SQL database, that arranges all data into a lexicographic total order . A divide the objects approach is incompatible with this scheme since:

  • The shape of an intermediate node is dependent on many of the indexed shapes, which creates a non-localized dependency.
  • A general multi-dimensional structure cannot be directly represented in a lexicographic totally ordered space.