Goal: Given two datasets, one containing all the point locations and another containing all the rectangular regions, use PostGIS to find the number of points inside each rectangular region and save the results.
Steps followed:
I first created four fragments of both the points and rectangles tables. To find the best split across the latitudes of the given points, I divided the data into equal parts using three percentiles: 0.25, 0.5, and 0.75. A value at the 0.25 percentile is greater than 25% of all the values in the group, so these three cut points divide the data into four equally distributed parts.
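The quartile split can be sketched in plain Python as follows (in PostGIS itself this corresponds to percentile_cont(ARRAY[0.25, 0.5, 0.75]) WITHIN GROUP (ORDER BY ...) over the latitude column); the function and partition names here are illustrative, not taken from the actual assignment code:

```python
from statistics import quantiles

def latitude_splits(latitudes):
    """Return the 0.25, 0.5, and 0.75 percentile cut points."""
    return quantiles(latitudes, n=4)

def assign_partition(lat, splits):
    """Map a latitude to one of the four partitions p1..p4."""
    q1, q2, q3 = splits
    if lat < q1:
        return "p1"
    if lat < q2:
        return "p2"
    if lat < q3:
        return "p3"
    return "p4"
```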
After partitioning the points into four parts, I used the same split to partition the rectangles using the two given diagonal points: if either diagonal point belongs to a partition, then that rectangle belongs to that partition.
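The rectangle partitioning rule can be sketched like this: a rectangle (given by its two diagonal corners) is placed in every partition band that either corner's latitude falls into, using the same split latitudes as the points. The partition indices 0..3 correspond to r1..r4; names are illustrative, not from the actual code:

```python
import bisect

def rectangle_partitions(lat1, lat2, splits):
    """Return the set of partition indices (0..3) a rectangle belongs to,
    based on which latitude bands its two diagonal corners fall in."""
    return {bisect.bisect_right(splits, lat1),
            bisect.bisect_right(splits, lat2)}
```

Note that a rectangle spanning a split boundary lands in two partitions, which matches the rule stated above.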
After partitioning both the points table into p1, p2, p3, and p4 and the rectangles table into r1, r2, r3, and r4, I created a queue enumerating all the points tables from p1 to p4.
Next, I created a join function to join the given partitioned points table to all the partitioned rectangle tables.
After that, I created four threads using a ThreadPoolExecutor. Each thread read the next index i from the queue, executed the join for that partition, and saved the results into another table named pr[i] (where i is the index of the points table joined).
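The threading scheme above can be sketched as follows; run_join is a placeholder standing in for the real PostGIS join that writes pr[i], and all names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor
from queue import Queue, Empty

def run_join(i):
    # Placeholder for joining points table p{i} against the rectangle
    # tables in PostGIS and saving the output as table pr{i}.
    return f"pr{i}"

def worker(q, results):
    # Each worker pulls partition indices until the queue is drained.
    while True:
        try:
            i = q.get_nowait()
        except Empty:
            return
        results.append(run_join(i))

def parallel_joins(n=4):
    q = Queue()
    for i in range(1, n + 1):
        q.put(i)
    results = []
    with ThreadPoolExecutor(max_workers=n) as ex:
        for _ in range(n):
            ex.submit(worker, q, results)
    return sorted(results)
```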
Goal: Given two datasets, one containing all the point locations and another containing all the rectangular regions, use MapReduce on Sedona to find the number of points inside each rectangular region and save the results.
Steps followed:
I loaded both the points and rectangles datasets with Spark, converted them into DataFrames using the toDF() function, and created a temporary view for each.
After that, I created a point using the ST_Point function with the x and y coordinates from the points dataset. I also made a geometric envelope using the ST_PolygonFromEnvelope function with the two diagonal points given in the rectangles dataset.
I combined the two datasets to get all the sets of points and rectangles wherein the point belongs to the rectangle by using the ST_Contains function.
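As a plain-Python stand-in for the ST_Contains(rectangle, point) predicate used in this join: a point lies in an axis-aligned rectangle given by two diagonal corners (x1, y1) and (x2, y2). This is only a sketch of the containment logic, not Sedona's implementation:

```python
def contains(rect, point):
    """rect: (x1, y1, x2, y2) diagonal corners; point: (px, py)."""
    x1, y1, x2, y2 = rect
    px, py = point
    return (min(x1, x2) <= px <= max(x1, x2)
            and min(y1, y2) <= py <= max(y1, y2))
```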
I converted the resulting DataFrame to an RDD, mapped the RDD to <key, value> pairs of <rectangle, 1> for each row, and then reduced these pairs by key.
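The same map-reduce step can be sketched outside Spark: each joined row is mapped to a (rectangle, 1) pair, and the pairs are then reduced by key into a per-rectangle count. In Spark this corresponds to rdd.map(lambda row: (row.rectangle, 1)).reduceByKey(lambda a, b: a + b):

```python
from collections import defaultdict

def count_points_per_rectangle(joined_rows):
    """joined_rows: iterable of (rectangle_id, point_id) pairs."""
    pairs = ((rect, 1) for rect, _ in joined_rows)   # map step
    counts = defaultdict(int)
    for rect, one in pairs:                          # reduce-by-key step
        counts[rect] += one
    return dict(counts)
```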
After performing this map-reduce, I converted the resultant RDD back to a DataFrame, selected the counts, and sorted them in ascending order.
Lastly, I repartitioned the result into a single partition and returned the result.
Goal: Modify the above Spark code so that it uses the Delta format to store and process the data.
Advantages of the Delta format:
It has two different caches: a delta cache and a result cache. The delta cache keeps local copies of remote data on the worker nodes, which helps avoid remote reads during larger workloads. The result cache stores subsets of the results of queries you run; this is helpful when a use case runs a specific query multiple times.[3]
It supports concurrent read/write operations via snapshot isolation.
It allows background file optimization through compaction and Z-order partitioning, yielding better query performance.
Steps followed:
I read all the points in CSV format, found the set of all points having coordinates >500, and saved them in Delta format.
I read this delta file and printed the count of the rows in this file.
I read all the rectangles in CSV format, found all the rectangles having all coordinates >900, and saved them in Delta format. Then, I read this Delta file and printed the count of the rows in this file.
I read both the saved points and rectangles from the saved Delta files. Then, I found the dataset of all the points belonging to some rectangle using a sub-select query. Lastly, I performed a self-join on this result to obtain all the point pairs with a distance of less than two between them.
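The final self-join step can be sketched as follows: given the points that fall inside some rectangle, find all distinct point pairs less than two units apart. In SQL this is a self-join filtered on ST_Distance(a.geom, b.geom) < 2; the quadratic loop below is only meant to illustrate the logic:

```python
from math import dist

def close_pairs(points, limit=2.0):
    """points: list of (x, y); returns index pairs with distance < limit."""
    return [(i, j)
            for i in range(len(points))
            for j in range(i + 1, len(points))
            if dist(points[i], points[j]) < limit]
```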
Goal: Create and run spatial queries in MongoDB using Python.
Query used to find businesses that have at least one matching category and are less than 2 miles from the given location.
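A sketch of how such a filter document can be built, combining a category match ($in) with a $nearSphere geospatial clause; the 2-mile radius is expressed in meters for $maxDistance. The field names "categories" and "location" are assumptions about the document schema, and a 2dsphere index on the location field is required for the query to run:

```python
MILES_TO_METERS = 1609.34

def business_query(categories, lon, lat, max_miles=2.0):
    """Build a MongoDB filter: any matching category, within max_miles."""
    return {
        "categories": {"$in": categories},
        "location": {
            "$nearSphere": {
                "$geometry": {"type": "Point", "coordinates": [lon, lat]},
                "$maxDistance": max_miles * MILES_TO_METERS,
            }
        },
    }
```

The resulting dict would be passed to collection.find(...) in PyMongo.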
PostGIS preprocesses the data and indexes it geospatially, while Spark uses the raw data directly from its source. Hence, due to the added cost of this indexing, the time PostGIS needs to respond to queries increases exponentially, while Spark's response time remains stable.[8]
MongoDB stores all the data together in a document, which offers a flexible format and helps club related information together. The advantage of this is that all the data you need is in one place, and you don't have to join this data across multiple tables; hence, querying the data is faster. Conversely, you would need large amounts of memory and processing power for this.
Delta tables have a data cache, which would cache the data of the Delta files we loaded as intermediate results and reuse it in the final query. This would decrease the number of reads and improve the performance of our end query.