A Practical Guide to Optimizing Non-equi Joins in Spark
Enriching network events with IP geolocation information is a crucial task, especially for organizations like the Canadian Centre for Cyber Security, the national CSIRT of Canada. In this article, we will demonstrate how to optimize Spark SQL joins, specifically focusing on scenarios involving non-equality conditions — a common challenge when working with IP geolocation data.
As cybersecurity practitioners, we routinely enrich network events with IP geolocation databases, which makes efficient handling of non-equi joins essential. While numerous articles shed light on the various join strategies supported by Spark, applying these strategies in practice remains a prevalent concern for professionals in the field.
David Vrba’s insightful article, “About Joins in Spark 3.0”, published on Towards Data Science, serves as a valuable resource. It explains the conditions guiding Spark’s selection of specific join strategies. In his article, David briefly suggests that optimizing non-equi joins involves transforming them into equi-joins.
This write-up aims to provide a practical guide for optimizing the performance of a non-equi join, with a specific focus on joining with IP ranges in a geolocation table.
To exemplify these optimizations, we will revisit the geolocation table introduced in our previous article.
+----------+--------+---------+-----------+-----------+
| start_ip | end_ip | country | city      | owner     |
+----------+--------+---------+-----------+-----------+
| 1        | 2      | ca      | Toronto   | Telus     |
| 3        | 4      | ca      | Quebec    | Rogers    |
| 5        | 8      | ca      | Vancouver | Bell      |
| 10       | 14     | ca      | Montreal  | Telus     |
| 19       | 22     | ca      | Ottawa    | Rogers    |
| 23       | 29     | ca      | Calgary   | Videotron |
+----------+--------+---------+-----------+-----------+
Equi-Join
To illustrate Spark’s execution of an equi-join, we’ll initiate our exploration by considering a hypothetical scenario. Suppose we have a table of events, each event being associated with a specific owner, denoted by the event_owner column.
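For illustration, a hypothetical events table might look like this (the event_id column is invented for this sketch):

+----------+-------------+
| event_id | event_owner |
+----------+-------------+
| e1       | Telus       |
| e2       | Bell        |
| e3       | Telus       |
+----------+-------------+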
Let’s take a closer look at how Spark handles this equi-join:
SELECT * FROM events JOIN geolocation ON (event_owner = owner)
In this example, the equi-join is established between the events table and the geolocation table. The linking criterion is based on the equality of the event_owner column in the events table and the owner column in the geolocation table.
As explained by David Vrba in his blog post:
Spark will plan the join with SMJ if there is an equi-condition and the joining keys are sortable
Spark will execute a Sort Merge Join, distributing the rows of the two tables by hashing the event_owner on the left side and the owner on the right side. Rows from both tables that hash to the same Spark partition will be processed by the same Spark task—a unit of work. For example, Task-1 might receive:
+----------+--------+---------+----------+-------+
| start_ip | end_ip | country | city     | owner |
+----------+--------+---------+----------+-------+
| 1        | 2      | ca      | Toronto  | Telus |
| 10       | 14     | ca      | Montreal | Telus |
+----------+--------+---------+----------+-------+
Notice how Task-1 handles only a subset of the data. The join problem is divided into multiple smaller tasks, where only a subset of the rows from both the left and right sides is required. Furthermore, all rows that could match each other are guaranteed to land in the same task: every occurrence of “Telus” hashes to the same partition, regardless of whether it comes from the events or geolocation table, so we can be certain that no other Task-X will have rows with an owner of “Telus”.
Once the data is divided as shown above, Spark will sort both sides, hence the name of the join strategy, Sort Merge Join. The merge is performed by taking the first row on the left and testing if it matches the right. Once the rows on the right no longer match, Spark will pull rows from the left. It will keep dequeuing each side until no rows are left on either side.
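We can verify which strategy Spark chooses by prefixing the query with EXPLAIN; assuming neither table is small enough for Spark to pick a broadcast hash join instead, the physical plan should contain a SortMergeJoin node:

EXPLAIN
SELECT *
FROM events
JOIN geolocation ON (event_owner = owner)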
Non-equi Join
Now that we have a better understanding of how equi-joins are performed, let’s contrast it with a non-equi join. Suppose we have events with an event_ip, and we want to add geolocation information to this table.
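As before, a hypothetical events table might look like this (event_id is invented for this sketch):

+----------+----------+
| event_id | event_ip |
+----------+----------+
| e1       | 6        |
| e2       | 21       |
+----------+----------+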
To execute this join, we need to determine the IP range within which the event_ip falls. We accomplish this with the following condition:
SELECT * FROM events JOIN geolocation ON (event_ip >= start_ip AND event_ip <= end_ip)
Now, let’s consider how Spark will execute this join. On the right side (the geolocation table), there is no key by which Spark can hash and distribute the rows. It is impossible to divide this problem into smaller tasks that can be distributed across the compute cluster and performed in parallel.
In a situation like this, Spark is forced to employ more resource-intensive join strategies. As stated by David Vrba:
If there is no equi-condition, Spark has to use BroadcastNestedLoopJoin (BNLJ) or cartesian product (CPJ).
Both of these strategies involve brute-forcing the problem; for every row on the left side, Spark will test the “between” condition on every single row of the right side. It has no other choice. If the table on the right is small enough, Spark can optimize by copying the right-side table to every task reading the left side, a scenario known as the BNLJ case. However, if the left side is too large, each task will need to read both the right and left sides of the table, referred to as the CPJ case. In either case, both strategies are highly costly.
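The same EXPLAIN technique makes the problem visible: for the range join, the physical plan will show a BroadcastNestedLoopJoin or CartesianProduct node instead of a SortMergeJoin, depending on the table sizes:

EXPLAIN
SELECT *
FROM events
JOIN geolocation ON (event_ip >= start_ip AND event_ip <= end_ip)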
So, how can we improve this situation? The trick is to introduce an equality in the join condition. For example, we could simply unroll all the IP ranges in the geolocation table, producing a row for every IP found in the IP ranges.
This is easily achievable in Spark; we can execute the following SQL to unroll all the IP ranges:
SELECT country, city, owner, explode(sequence(start_ip, end_ip)) AS ip FROM geolocation
The sequence function creates an array with the IP values from start_ip to end_ip. The explode function unrolls this array into individual rows.
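For example, sequence(5, 8) produces the array [5, 6, 7, 8], and explode turns that single array into four separate rows.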
+---------+-----------+-----------+----+
| country | city      | owner     | ip |
+---------+-----------+-----------+----+
| ca      | Toronto   | Telus     | 1  |
| ca      | Toronto   | Telus     | 2  |
| ca      | Quebec    | Rogers    | 3  |
| ca      | Quebec    | Rogers    | 4  |
| ca      | Vancouver | Bell      | 5  |
| ca      | Vancouver | Bell      | 6  |
| ca      | Vancouver | Bell      | 7  |
| ca      | Vancouver | Bell      | 8  |
| ca      | Montreal  | Telus     | 10 |
| ca      | Montreal  | Telus     | 11 |
| ca      | Montreal  | Telus     | 12 |
| ca      | Montreal  | Telus     | 13 |
| ca      | Montreal  | Telus     | 14 |
| ca      | Ottawa    | Rogers    | 19 |
| ca      | Ottawa    | Rogers    | 20 |
| ca      | Ottawa    | Rogers    | 21 |
| ca      | Ottawa    | Rogers    | 22 |
| ca      | Calgary   | Videotron | 23 |
| ca      | Calgary   | Videotron | 24 |
| ca      | Calgary   | Videotron | 25 |
| ca      | Calgary   | Videotron | 26 |
| ca      | Calgary   | Videotron | 27 |
| ca      | Calgary   | Videotron | 28 |
| ca      | Calgary   | Videotron | 29 |
+---------+-----------+-----------+----+
With a key on both sides, we can now execute an equi-join, and Spark can efficiently distribute the problem, resulting in optimal performance. In practice, however, this approach does not scale: unrolling every IP in a genuine geolocation table would produce billions of rows.
To address this, we can make the mapping coarser. Instead of mapping IP ranges to each individual IP, we can map the IP ranges to segments within the IP space. Let’s assume we divide the IP space into segments of 5, so that segment (or bucket) k covers the IPs from 5k to 5k + 4. The segmented space would look something like this:

+-----------+------------+
| bucket_id | IP segment |
+-----------+------------+
| 0         | 0–4        |
| 1         | 5–9        |
| 2         | 10–14      |
| 3         | 15–19      |
| 4         | 20–24      |
| 5         | 25–29      |
+-----------+------------+
Now, our objective is to map the IP ranges to the segments they overlap with. Similar to what we did earlier, we can unroll the IP ranges, but this time, we’ll do it in segments of 5.
SELECT
  start_ip, end_ip, country, city, owner,
  explode(sequence(CAST(start_ip / 5 AS INT), CAST(end_ip / 5 AS INT))) AS bucket_id
FROM geolocation
In the resulting table below, we observe that certain IP ranges share a bucket_id: ranges 1–2 and 3–4 both fall within the segment 0–4 and are therefore assigned bucket_id 0.
+----------+--------+---------+-----------+-----------+-----------+
| start_ip | end_ip | country | city      | owner     | bucket_id |
+----------+--------+---------+-----------+-----------+-----------+
| 1        | 2      | ca      | Toronto   | Telus     | 0         |
| 3        | 4      | ca      | Quebec    | Rogers    | 0         |
| 5        | 8      | ca      | Vancouver | Bell      | 1         |
| 10       | 14     | ca      | Montreal  | Telus     | 2         |
| 19       | 22     | ca      | Ottawa    | Rogers    | 3         |
| 19       | 22     | ca      | Ottawa    | Rogers    | 4         |
| 23       | 29     | ca      | Calgary   | Videotron | 4         |
| 23       | 29     | ca      | Calgary   | Videotron | 5         |
+----------+--------+---------+-----------+-----------+-----------+
Additionally, we notice that some IP ranges are duplicated. The last two rows show that the IP range 23–29 overlaps two segments, 20–24 and 25–29, and thus appears once with bucket_id 4 and once with bucket_id 5. As in the scenario where we unrolled individual IPs, we are still duplicating rows, but to a much lesser extent.
Now, we can utilize this bucketed table to perform our join.
SELECT *
FROM events
JOIN geolocation ON (
  CAST(event_ip / 5 AS INT) = bucket_id
  AND event_ip >= start_ip
  AND event_ip <= end_ip
)
The equality in the join condition enables Spark to use the Sort Merge Join (SMJ) strategy, while the “between” condition discards candidate rows that share a bucket_id but whose IP range does not actually contain the event_ip.
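A worked example using our hypothetical events table: for the event with event_ip = 21, CAST(21 / 5 AS INT) = 4, so the equi-condition matches the two bucket 4 rows, Ottawa (19–22) and Calgary (23–29). The “between” condition then keeps Ottawa, since 19 <= 21 <= 22, and discards Calgary, since 21 < 23.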
In this illustration, we used segments of 5; however, in reality, we would segment the IP space into segments of 256. This is because the global IP address space is overseen by the Internet Assigned Numbers Authority (IANA), and traditionally, IANA allocates address space in blocks of 256 IPs.
Analyzing the IP ranges in a genuine geolocation table using the Spark approx_percentile function reveals that most records have spans of less than 256, while very few are larger than 256.
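As a sketch, such an analysis can be expressed directly in Spark SQL over the range spans (the percentile levels are illustrative):

SELECT
  approx_percentile(end_ip - start_ip + 1, array(0.50, 0.90, 0.99)) AS span_percentiles
FROM geolocation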
This implies that most IP ranges fall within a single bucket, while the few larger ones are unrolled across several buckets, so the bucketed table ends up containing only about 10% more rows than the original.
A query executed with a genuine geolocation table might resemble the following:
WITH
b_geo AS (
  SELECT
    explode(
      sequence(
        CAST(start_ip / 256 AS INT),
        CAST(end_ip / 256 AS INT))) AS bucket_id,
    *
  FROM geolocation
),
b_events AS (
  SELECT CAST(event_ip / 256 AS INT) AS bucket_id, *
  FROM events
)
SELECT *
FROM b_events
JOIN b_geo ON (
  b_events.bucket_id = b_geo.bucket_id
  AND b_events.event_ip >= b_geo.start_ip
  AND b_events.event_ip <= b_geo.end_ip
);
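Because the bucketed geolocation table is typically reused across many enrichment queries, it can be worth materializing it once instead of recomputing the explode every time. A minimal sketch, reusing the CTE’s name b_geo as a hypothetical table name:

CREATE TABLE b_geo AS
SELECT
  explode(
    sequence(
      CAST(start_ip / 256 AS INT),
      CAST(end_ip / 256 AS INT))) AS bucket_id,
  *
FROM geolocation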
Conclusion
In conclusion, this article has presented a practical demonstration of converting a non-equi join into an equi-join through the implementation of a mapping technique that involves segmenting IP ranges. It’s crucial to note that this approach extends beyond IP addresses and can be applied to any dataset characterized by bands or ranges.
The ability to effectively map and segment data is a valuable tool in the arsenal of data engineers and analysts, providing a pragmatic solution to the challenges posed by non-equality conditions in Spark SQL joins.