How to summarize a geolocation table by merging contiguous network IPv4 blocks
Every device on the internet is uniquely addressable via an Internet Protocol (IP) address. The global IP address space is overseen by the Internet Assigned Numbers Authority (IANA). Traditionally, IANA allocates address space in /8 prefix blocks for IPv4, which are subsequently assigned to Internet Service Providers and other organizations. Various databases exist to map these IP blocks to their respective owners, along with information about the originating country and city.
As the Canadian national CSIRT, we, the Canadian Centre for Cyber Security, rely heavily on referencing these databases for looking up a given IP or enhancing entire datasets through SQL JOINs. However, not all use cases necessitate precision up to the city level; sometimes, country information alone suffices.
Within a country, many network blocks are contiguous. Consolidating these into mega blocks can significantly reduce the size of a table mapping mega blocks to countries, leading to improved JOIN operations.
In this article, we will demonstrate how to summarize a geolocation table by merging contiguous network blocks.
Let’s assume our geolocation table contains the following data:
+----------+-------+---------+-----------+-----------+
| start_ip | end_ip| country | city | owner |
+----------+-------+---------+-----------+-----------+
| 1 | 2 | ca | Toronto | Telus |
| 3 | 4 | ca | Quebec | Rogers |
| 5 | 8 | ca | Vancouver | Bell |
| 10 | 14 | ca | Montreal | Telus |
| 19 | 22 | ca | Ottawa | Rogers |
| 23 | 29 | ca | Calgary | Videotron |
+----------+-------+---------+-----------+-----------+
Here, start_ip represents the lowest number in the network block, and end_ip represents the largest. Normally, these numbers are much larger. For example, Google’s DNS server 8.8.8.8 is represented by the number 134744072. We use simple synthetic values for illustration purposes.
To start, let’s perform a simple summarization. For example, counting the number of IP addresses assigned to each country. This can be achieved by grouping the data by country and summing the number of IPs in each network block.
SELECT
country,
SUM(end_ip - start_ip + 1) as num_ip
FROM
geo_table
GROUP BY
country
This statement groups rows by country and applies a SUM aggregation function, calculating the total number of IPs for each country. It’s important to note that the SUM aggregation is associative, meaning the order in which you sum does not matter, similar to additions in mathematics.
+---------+--------+
| country | num_ip |
+---------+--------+
| ca | 24 |
+---------+--------+
Now, let’s delve into the complexities of aggregating contiguous network blocks. Referring to our original table, we need to fuse together the first 3 rows. Bocks 1–2, 3–4, 5–8 should result in the mega block 1–8. We also need to fuse the last 2 rows. Blocks 19–22 and 23–29 result into 19–29. Our goal is to produce the following table:
+----------+-------+---------+
| start_ip | end_ip| country |
+----------+-------+---------+
| 1 | 8 | ca |
| 10 | 14 | ca |
| 19 | 29 | ca |
+----------+-------+---------+
Detecting contiguous blocks requires inter-row information, and the order of rows becomes crucial. Fortunately, windowed analytical functions provide a solution by offering a mechanism for inter-record referencing. These functions, such as LEAD and LAG, enable comparisons with the values of the previous and subsequent rows, facilitating the identification of contiguous IP blocks.
Let’s apply the LEAD and LAG windowing functions to our table. Notice that within the OVER clause we still specify that our data should be grouped by country PARTITION BY country but additionally we specify the ordering of this data ORDER BY start_ip.
SELECT
*,
LAG(end_ip) OVER (
PARTITION BY country
ORDER BY start_ip) AS prev_end_ip,
LEAD(start_ip) OVER (
PARTITION BY country
ORDER BY start_ip) AS next_start_ip
FROM
geo_table
The resulting table view_1 is as follows:
+----------+-------+---------+-------------+---------------+
| start_ip | end_ip| country | prev_end_ip | next_start_ip |
+----------+-------+---------+-------------+---------------+
| 1 | 2 | ca | null | 3 |
| 3 | 4 | ca | 2 | 5 |
| 5 | 8 | ca | 4 | 10 |
| 10 | 14 | ca | 8 | 19 |
| 19 | 22 | ca | 14 | 23 |
| 23 | 29 | ca | 22 | null |
+----------+-------+---------+-------------+---------------+
It’s crucial to distinguish between windowing functions and simple GROUP BY functions. In an OVER() operation, the results of LEAD and LAG are appended to every row, providing context for the previous and next row’s information. This is distinct from functions in a GROUP BY clause that reduce a group of rows into a single aggregation result.
Now that we have access to the details of both the preceding and succeeding rows, we can facilitate inter-row comparisons. This comparison is crucial for identifying contiguous IP blocks, enabling us to determine when to fuse adjacent blocks together.
Each row can fall into one of four states:
1) Remove: The block is contiguous with both the previous and next blocks.
2) Start: The block is contiguous with only the next block.
3) End: The block is contiguous with only the previous block.
4) Keep: The block is not contiguous with either the previous nor the next block.
Let’s add this state column to our table.
SELECT
*,
CASE
WHEN (end_ip = next_start_ip - 1)
AND (start_ip = prev_end_ip + 1) THEN 'remove'
WHEN (end_ip = next_start_ip - 1) THEN 'start'
WHEN (start_ip = prev_end_ip + 1) THEN 'end'
ELSE 'keep'
END AS state
FROM
view_1
We obtain the following view_2 result:
+----------+-------+---------+-------------+---------------+-------+
| start_ip | end_ip| country | prev_end_ip | next_start_ip | state |
+----------+-------+---------+-------------+---------------+-------+
| 1 | 2 | ca | null | 3 | start |
| 3 | 4 | ca | 2 | 5 | remove|
| 5 | 8 | ca | 4 | 10 | end |
| 10 | 14 | ca | 8 | 19 | keep |
| 19 | 22 | ca | 14 | 23 | start |
| 23 | 29 | ca | 22 | null | end |
+----------+-------+---------+-------------+---------------+-------+
We can proceed to remove the rows falling between the start and end blocks, specifically those identified with state remove.
SELECT
start_ip,
end_ip,
country,
state
FROM
view_2
WHERE
state IN ('start', 'end', 'keep')
Resulting in view_3:
+----------+-------+---------+-------+
| start_ip | end_ip| country | state |
+----------+-------+---------+-------+
| 1 | 2 | ca | start |
| 5 | 8 | ca | end |
| 10 | 14 | ca | keep |
| 19 | 22 | ca | start |
| 23 | 29 | ca | end |
+----------+-------+---------+-------+
We are getting close to our goal! All we have to do now is merge the start and end rows, which contain the start_ip and end_ip of the mega blocks we are seeking to produce. To achieve this, we once again use a window function. This time to fetch the end_ip from the end row.
SELECT
*,
LEAD(end_ip) OVER (
PARTITION BY country
ORDER BY start_ip) AS next_end_ip
FROM
view_3
Result in view_4:
+----------+-------+---------+-------+-------------+
| start_ip | end_ip| country | state | next_end_ip |
+----------+-------+---------+-------+-------------+
| 1 | 2 | ca | start | 8 |
| 5 | 8 | ca | end | 14 |
| 10 | 14 | ca | keep | 22 |
| 19 | 22 | ca | start | 29 |
| 23 | 29 | ca | end | null |
+----------+-------+---------+-------+-------------+
Notice that the rows with state start now have a start_ip and a next_end_ip, the information necessary to build a mega block.
The rows with state end are no longer needed and can be removed.
The rows with state keep already have the correct end_ip.
We can now determine the final_end value of the mega blocks. Two cases are possible:
1) for a start row, we obtain the end value from the next_end_ip.
2) for a keep row, we simply use the original end_ip value.
SELECT
start_ip AS final_start,
CASE
WHEN (state = 'start') THEN next_end_ip
WHEN (state = 'keep') THEN end_ip
ELSE NULL
END AS final_end_ip
FROM
view_4
WHERE
state IN ('start', 'keep')
We thus achieve our goal of fusing contiguous IPv4 blocks into mega blocks.
+----------+-------+---------+-------+-------------+------------+----------+
| start_ip | end_ip| country | state | next_end_ip | final_start| final_end|
+----------+-------+---------+-------+-------------+------------+----------+
| 1 | 2 | ca | start | 8 | 1 | 8 |
| 10 | 14 | ca | keep | 22 | 10 | 14 |
| 19 | 22 | ca | start | 29 | 19 | 29 |
+----------+-------+---------+-------+-------------+------------+----------+
Putting all the statements above together, we obtain a final statement:
SELECT
country,
final_start,
IF(state = 'start', next_end_ip, final_end) AS final_end
FROM (
SELECT
country,
start_ip AS final_start,
end_ip AS final_end,
LEAD(end_ip) OVER (
PARTITION BY country
ORDER BY start_ip) AS next_end_ip
FROM (
SELECT
start_ip,
end_ip,
country,
CASE
WHEN (end_ip = next_start_ip - 1)
AND (start_ip = prev_end_ip + 1) THEN 'remove'
WHEN (end_ip = next_start_ip - 1) THEN 'start'
WHEN (start_ip = prev_end_ip + 1) THEN 'end'
ELSE 'keep'
END AS state
FROM (
SELECT
*,
LAG(end_ip) OVER (
PARTITION BY country
ORDER BY start_ip) AS prev_end_ip,
LEAD(start_ip) OVER (
PARTITION BY country
ORDER BY start_ip) AS next_start_ip
FROM
geo_table
)
WHERE
state IN ('start', 'end', 'keep')
)
)
WHERE
state IN ('start', 'keep')
In conclusion, SQL analytical window functions offer a robust framework for complex data analysis. They empower users to perform aggregations while retaining the context of individual rows, facilitating tasks such as running totals, averages, and percentile calculations. Moreover, these functions play a crucial role in ranking, time-series analysis, and detecting anomalies and outliers within datasets. These functions are indispensable assets in the toolkit of data practitioners.
Analytical window functions are very powerful. In this article, we only scratched the surface; for example, we did not make use of a window_frame. A window frame allows you to further refine which rows are considered in the aggregation. Window frames are relative to the current row and can be based on the number of rows or time intervals, making these functions indispensable for a wide range of analyses. You can learn more about these features in the Spark documentation: Spark SQL — Window Operations .
Unleashing the Power of SQL Analytical Window Functions: A Deep Dive into Fusing IPv4 Blocks was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
Originally appeared here:
Unleashing the Power of SQL Analytical Window Functions: A Deep Dive into Fusing IPv4 Blocks