A Practical Guide to Optimizing Non-equi Joins in Spark
Enriching network events with IP geolocation information is a crucial task, especially for organizations like the Canadian Centre for Cyber Security, the national CSIRT of Canada. In this article, we will demonstrate how to optimize Spark SQL joins, specifically focusing on scenarios involving non-equality conditions — a common challenge when working with IP geolocation data.
As cybersecurity practitioners, we routinely enrich network events with IP geolocation databases, which makes efficient handling of non-equi joins essential. While numerous articles shed light on the various join strategies supported by Spark, applying these strategies in practice remains a prevalent concern for professionals in the field.
David Vrba’s insightful article, “About Joins in Spark 3.0”, published on Towards Data Science, serves as a valuable resource. It explains the conditions guiding Spark’s selection of specific join strategies. In his article, David briefly suggests that optimizing non-equi joins involves transforming them into equi-joins.
This write-up aims to provide a practical guide for optimizing the performance of a non-equi join, with a specific focus on joining with IP ranges in a geolocation table.
To exemplify these optimizations, we will revisit the geolocation table introduced in our previous article.
+----------+--------+---------+-----------+-----------+
| start_ip | end_ip | country | city      | owner     |
+----------+--------+---------+-----------+-----------+
| 1        | 2      | ca      | Toronto   | Telus     |
| 3        | 4      | ca      | Quebec    | Rogers    |
| 5        | 8      | ca      | Vancouver | Bell      |
| 10       | 14     | ca      | Montreal  | Telus     |
| 19       | 22     | ca      | Ottawa    | Rogers    |
| 23       | 29     | ca      | Calgary   | Videotron |
+----------+--------+---------+-----------+-----------+
Equi-Join
To illustrate Spark’s execution of an equi-join, we’ll initiate our exploration by considering a hypothetical scenario. Suppose we have a table of events, each event being associated with a specific owner, denoted by the event_owner column.
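For illustration, a hypothetical events table might look like this (the event_id column is invented for this sketch):

+----------+-------------+
| event_id | event_owner |
+----------+-------------+
| e1       | Telus       |
| e2       | Bell        |
| e3       | Telus       |
+----------+-------------+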
Let’s take a closer look at how Spark handles this equi-join:
SELECT * FROM events JOIN geolocation ON (event_owner = owner)
In this example, the equi-join is established between the events table and the geolocation table. The linking criterion is based on the equality of the event_owner column in the events table and the owner column in the geolocation table.
As explained by David Vrba in his blog post:
Spark will plan the join with SMJ if there is an equi-condition and the joining keys are sortable
Spark will execute a Sort Merge Join, distributing the rows of the two tables by hashing the event_owner on the left side and the owner on the right side. Rows from both tables that hash to the same Spark partition will be processed by the same Spark task—a unit of work. For example, Task-1 might receive:
+----------+--------+---------+----------+-------+
| start_ip | end_ip | country | city     | owner |
+----------+--------+---------+----------+-------+
| 1        | 2      | ca      | Toronto  | Telus |
| 10       | 14     | ca      | Montreal | Telus |
+----------+--------+---------+----------+-------+
Notice how Task-1 handles only a subset of the data. The join problem is divided into multiple smaller tasks, where only a subset of the rows from both the left and right sides is required. Furthermore, all rows that could match each other are guaranteed to land in the same task: every occurrence of “Telus” hashes to the same partition, regardless of whether it comes from the events or geolocation table, so we can be certain that no other Task-X will have rows with an owner of “Telus”.
Once the data is divided as shown above, Spark will sort both sides, hence the name of the join strategy, Sort Merge Join. The merge is performed by taking the first row on the left and testing if it matches the right. Once the rows on the right no longer match, Spark will pull rows from the left. It will keep dequeuing each side until no rows are left on either side.
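We can verify which strategy Spark chooses by prefixing the query with EXPLAIN; assuming neither table is small enough for Spark to pick a broadcast hash join instead, the physical plan should contain a SortMergeJoin node:

EXPLAIN
SELECT *
FROM events
JOIN geolocation ON (event_owner = owner)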
Non-equi Join
Now that we have a better understanding of how equi-joins are performed, let’s contrast it with a non-equi join. Suppose we have events with an event_ip, and we want to add geolocation information to this table.
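As before, a hypothetical events table might look like this (event_id is invented for this sketch):

+----------+----------+
| event_id | event_ip |
+----------+----------+
| e1       | 6        |
| e2       | 21       |
+----------+----------+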
To execute this join, we need to determine the IP range within which the event_ip falls. We accomplish this with the following condition:
SELECT * FROM events JOIN geolocation ON (event_ip >= start_ip AND event_ip <= end_ip)
Now, let’s consider how Spark will execute this join. On the right side (the geolocation table), there is no key by which Spark can hash and distribute the rows. It is impossible to divide this problem into smaller tasks that can be distributed across the compute cluster and performed in parallel.
In a situation like this, Spark is forced to employ more resource-intensive join strategies. As stated by David Vrba:
If there is no equi-condition, Spark has to use BroadcastNestedLoopJoin (BNLJ) or cartesian product (CPJ).
Both of these strategies involve brute-forcing the problem; for every row on the left side, Spark will test the “between” condition on every single row of the right side. It has no other choice. If the table on the right is small enough, Spark can optimize by copying the right-side table to every task reading the left side, a scenario known as the BNLJ case. However, if the left side is too large, each task will need to read both the right and left sides of the table, referred to as the CPJ case. In either case, both strategies are highly costly.
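The same EXPLAIN technique makes the problem visible: for the range join, the physical plan will show a BroadcastNestedLoopJoin or CartesianProduct node instead of a SortMergeJoin, depending on the table sizes:

EXPLAIN
SELECT *
FROM events
JOIN geolocation ON (event_ip >= start_ip AND event_ip <= end_ip)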
So, how can we improve this situation? The trick is to introduce an equality in the join condition. For example, we could simply unroll all the IP ranges in the geolocation table, producing a row for every IP found in the IP ranges.
This is easily achievable in Spark; we can execute the following SQL to unroll all the IP ranges:
SELECT country, city, owner, explode(sequence(start_ip, end_ip)) AS ip FROM geolocation
The sequence function creates an array with the IP values from start_ip to end_ip. The explode function unrolls this array into individual rows.
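For example, sequence(5, 8) produces the array [5, 6, 7, 8], and explode turns that single array into four separate rows.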
+---------+-----------+-----------+----+
| country | city      | owner     | ip |
+---------+-----------+-----------+----+
| ca      | Toronto   | Telus     | 1  |
| ca      | Toronto   | Telus     | 2  |
| ca      | Quebec    | Rogers    | 3  |
| ca      | Quebec    | Rogers    | 4  |
| ca      | Vancouver | Bell      | 5  |
| ca      | Vancouver | Bell      | 6  |
| ca      | Vancouver | Bell      | 7  |
| ca      | Vancouver | Bell      | 8  |
| ca      | Montreal  | Telus     | 10 |
| ca      | Montreal  | Telus     | 11 |
| ca      | Montreal  | Telus     | 12 |
| ca      | Montreal  | Telus     | 13 |
| ca      | Montreal  | Telus     | 14 |
| ca      | Ottawa    | Rogers    | 19 |
| ca      | Ottawa    | Rogers    | 20 |
| ca      | Ottawa    | Rogers    | 21 |
| ca      | Ottawa    | Rogers    | 22 |
| ca      | Calgary   | Videotron | 23 |
| ca      | Calgary   | Videotron | 24 |
| ca      | Calgary   | Videotron | 25 |
| ca      | Calgary   | Videotron | 26 |
| ca      | Calgary   | Videotron | 27 |
| ca      | Calgary   | Videotron | 28 |
| ca      | Calgary   | Videotron | 29 |
+---------+-----------+-----------+----+
With a key on both sides, we can now execute an equi-join, and Spark can efficiently distribute the problem, resulting in optimal performance. In practice, however, this approach does not scale: unrolling every IP in a genuine geolocation table would produce billions of rows.
To address this, we can make the mapping coarser. Instead of mapping IP ranges to each individual IP, we can map the IP ranges to segments within the IP space. Let’s assume we divide the IP space into segments of 5, so that segment (or bucket) k covers the IPs from 5k to 5k + 4. The segmented space would look something like this:

+-----------+------------+
| bucket_id | IP segment |
+-----------+------------+
| 0         | 0–4        |
| 1         | 5–9        |
| 2         | 10–14      |
| 3         | 15–19      |
| 4         | 20–24      |
| 5         | 25–29      |
+-----------+------------+
Now, our objective is to map the IP ranges to the segments they overlap with. Similar to what we did earlier, we can unroll the IP ranges, but this time, we’ll do it in segments of 5.
SELECT
  start_ip, end_ip, country, city, owner,
  explode(sequence(CAST(start_ip / 5 AS INT), CAST(end_ip / 5 AS INT))) AS bucket_id
FROM geolocation
In the resulting table below, we observe that certain IP ranges share a bucket_id: ranges 1–2 and 3–4 both fall within the segment 0–4 and are therefore assigned bucket_id 0.
+----------+--------+---------+-----------+-----------+-----------+
| start_ip | end_ip | country | city      | owner     | bucket_id |
+----------+--------+---------+-----------+-----------+-----------+
| 1        | 2      | ca      | Toronto   | Telus     | 0         |
| 3        | 4      | ca      | Quebec    | Rogers    | 0         |
| 5        | 8      | ca      | Vancouver | Bell      | 1         |
| 10       | 14     | ca      | Montreal  | Telus     | 2         |
| 19       | 22     | ca      | Ottawa    | Rogers    | 3         |
| 19       | 22     | ca      | Ottawa    | Rogers    | 4         |
| 23       | 29     | ca      | Calgary   | Videotron | 4         |
| 23       | 29     | ca      | Calgary   | Videotron | 5         |
+----------+--------+---------+-----------+-----------+-----------+
Additionally, we notice that some IP ranges are duplicated. The last two rows show that the IP range 23–29 overlaps two segments, 20–24 and 25–29, and thus appears once with bucket_id 4 and once with bucket_id 5. As in the scenario where we unrolled individual IPs, we are still duplicating rows, but to a much lesser extent.
Now, we can utilize this bucketed table to perform our join.
SELECT *
FROM events
JOIN geolocation ON (
  CAST(event_ip / 5 AS INT) = bucket_id
  AND event_ip >= start_ip
  AND event_ip <= end_ip
)
The equality in the join condition enables Spark to use the Sort Merge Join (SMJ) strategy, while the “between” condition discards candidate rows that share a bucket_id but whose IP range does not actually contain the event_ip.
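A worked example using our hypothetical events table: for the event with event_ip = 21, CAST(21 / 5 AS INT) = 4, so the equi-condition matches the two bucket 4 rows, Ottawa (19–22) and Calgary (23–29). The “between” condition then keeps Ottawa, since 19 <= 21 <= 22, and discards Calgary, since 21 < 23.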
In this illustration, we used segments of 5; however, in reality, we would segment the IP space into segments of 256. This is because the global IP address space is overseen by the Internet Assigned Numbers Authority (IANA), and traditionally, IANA allocates address space in blocks of 256 IPs.
Analyzing the IP ranges in a genuine geolocation table using the Spark approx_percentile function reveals that most records have spans of less than 256, while very few are larger than 256.
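As a sketch, such an analysis can be expressed directly in Spark SQL over the range spans (the percentile levels are illustrative):

SELECT
  approx_percentile(end_ip - start_ip + 1, array(0.50, 0.90, 0.99)) AS span_percentiles
FROM geolocation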
This implies that most IP ranges fall within a single bucket, while the few larger ones are unrolled across several buckets, so the bucketed table ends up containing only about 10% more rows than the original.
A query executed with a genuine geolocation table might resemble the following:
WITH
b_geo AS (
  SELECT
    explode(
      sequence(
        CAST(start_ip / 256 AS INT),
        CAST(end_ip / 256 AS INT))) AS bucket_id,
    *
  FROM geolocation
),
b_events AS (
  SELECT CAST(event_ip / 256 AS INT) AS bucket_id, *
  FROM events
)
SELECT *
FROM b_events
JOIN b_geo ON (
  b_events.bucket_id = b_geo.bucket_id
  AND b_events.event_ip >= b_geo.start_ip
  AND b_events.event_ip <= b_geo.end_ip
);
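Because the bucketed geolocation table is typically reused across many enrichment queries, it can be worth materializing it once instead of recomputing the explode every time. A minimal sketch, reusing the CTE’s name b_geo as a hypothetical table name:

CREATE TABLE b_geo AS
SELECT
  explode(
    sequence(
      CAST(start_ip / 256 AS INT),
      CAST(end_ip / 256 AS INT))) AS bucket_id,
  *
FROM geolocation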
Conclusion
In conclusion, this article has presented a practical demonstration of converting a non-equi join into an equi-join through the implementation of a mapping technique that involves segmenting IP ranges. It’s crucial to note that this approach extends beyond IP addresses and can be applied to any dataset characterized by bands or ranges.
The ability to effectively map and segment data is a valuable tool in the arsenal of data engineers and analysts, providing a pragmatic solution to the challenges posed by non-equality conditions in Spark SQL joins.