Optimal database operations over very large yet simple features

GIMA
M-GEO
STAMP
Topic description

Optimal performance of computations is one important promise that databases make to their user community. Specifically, the query optimizer is a key architectural component in any DBMS that aims to ensure best compute time performance for queries that the DBMS runs on behalf of its users. This holds for regular DBMSs as well as spatial DBMSs.

While much attention has been paid to operating on large datasets as measured by the number of records/objects, fairly little attention has been paid to operations on single geometries that are large as measured by the number of vertices that span them up. This thesis work addresses that objective.

The problem surfaces in many spatial algorithms. Commonly applied algorithms are those that test topological conditions ("do two geometries intersect?") and those that compute spatial overlays ("what is the geometric difference of one geometry and another?"). Spatial databases have robust functions to address these questions, but they are (really) slow when you feed them two geometries with each 5 million vertices. You can imagine what happens if you have a table with 2,000 such geometries and perform a spatial join with another ...

The student will study this phenomenon of low performance and think up a way to address it. A classical algorithmic approach is divide-and-conquer: it splits up the original problem case into a number of similar but smaller-sized cases. Perhaps even recursively. In context of this M.Sc. proposal, this can be taken literally: split up the participating geometries spatially, perform the operation on the split parts, and reunite the results after all split parts have been handled. Practical problems may arise here: rounding errors may lead to imperfect geometric reunions.

Of course, the devil is in the details. There are many ways to split, and how to split one geometry may have to depend on the other geometry. Thus, the challenge in this project is to devise smart ways of splitting and smart ways of reuniting. Any solution thought up will need to be carefully evaluated also, as this may lead to smarter solutions again.

We expect the student to execute this project with vanilla spatial SQL of PostgreSQL/PostGIS, or the associated database programming language pl/pgsql. Recent versions of postgresql (>= 13) and postgis (>= 3.0) are offering exciting opportunities in parallel query processing, and novel spatial functions from which the approach can be built.

Our ideal candidate should have interest in and talent for programming with spatial features. In return, we offer supervisor excitement over the project.

Topic objectives and methodology

Spatial databases are fully equipped to operate on large data sets, i.e., on collections of spatial data with high numbers of geometries. This is the predominant factor in data complexity of spatial queries, and it has been given much attention in the literature. Spatial data sets, however, can display another form of data complexity, namely in terms of the number of vertices that span up the geometries in hand. The objective of this thesis work is to develop spatial algorithm tactics that perform well when they are fed with very large single geometries. Performance evaluation will be part of the work.

References for further reading
  • Philippe Rigaux, Michel Scholl & Agnes Voisard, Spatial Databases with Application to GIS, Academic Press/Morgan Kaufmann, San Diego/London, 2002.

  • The PostgreSQL Global Development Group, PostgreSQL 9.6.6 Documentation, 2017. https://www.postgresql.org/docs/9.6/static/index.html