Abstract
Parallel spatial join algorithms are essential for scalable processing and analysis of big spatial data. State-of-the-art algorithms split the data into partitions and replicate objects from one data set into neighboring partitions, so that partitions can be processed independently in parallel without producing duplicate results. However, this universal replication of one data set leads to suboptimal performance for skewed data sets with varying density. Instead, we advocate an approach that adaptively selects which data set to replicate in different local areas of the space, thus minimizing replication and boosting query processing performance. To this end, we devise a graph-based framework for modeling replication between neighboring partitions, and we study the theoretical properties that lead to adaptive replication with correct and duplicate-free results. We then design a data-parallel algorithm in Apache Spark based on adaptive replication, and we demonstrate its performance gain over the state of the art on large real and synthetic data sets under various settings.