Grayscale CEO Michael Sonnenshein urged regulators to approve spot Bitcoin exchange-traded fund (ETF) options in a statement on Feb. 5. Sonnenshein wrote in a post on X: “… I think it’s never been more important … to advocate for the development of a robust listed options market for spot Bitcoin ETFs. Although GBTC has been […]
Autonomous delivery startup Starship Technologies announced today it had raised $90mn on its mission to “make science fiction a reality” and roll out its self-driving robots across streets in Europe and the US. There are sci-fi scenarios where robots turn on their makers, hunt them down, and initiate either the enslavement or the extinction of humanity. Then there are the R2-D2s and the BB8s of the genre — resourceful droid helpers that somehow manage to tug on our heartstrings nearly as much as a Pixar puppy. While perhaps not quite as empathy-inducing, Starship Technologies’ delivery robots are decidedly more reminiscent…
Apple Vision Pro users don’t have any recourse for forgetting a passcode beyond visiting an Apple Store or shipping their headset off to be reset.
Optic ID is used for most things, but a passcode is still required
Apple prides itself in providing secure devices locked down by biometrics and passcodes. However, human error gets in the way of any security protocol, no matter how many safeguards may exist.
According to a report from Bloomberg, Apple Support lines are receiving “a lot” of complaints about forgotten passcodes. An anonymous source says support staff have been told to advise users that an in-store or mail-in fix is all that is possible.
There is life beyond music streaming, and installing a Winamp-like app on your favorite distro is the next best thing to spinning vinyl. Here’s what to know.
The U.S. Securities and Exchange Commission (SEC)’s request to have Ripple produce certain information has been granted by a judge. Though the SEC’s case against Ripple has otherwise concluded, the agency requested financial statements, contracts, and a response to an interrogatory on Jan. 11 in order to determine penalties and remedies. U.S. Magistrate Judge Sarah […]
Navigating Slowly Changing Dimensions (SCD) and Data Restatement: A Comprehensive Guide
Strategies for efficiently managing dimension changes and data restatement in enterprise data warehousing
Imagine this, you are a data engineer working for a large retail company that utilizes the incremental load technique in data warehousing. This technique involves selectively updating or loading only the new or modified data since the last update. What could occur when the product R&D department decides to change the name or description of a current product? How would such updates impact your existing data pipeline and data warehouse? How do you plan to address challenges like these? This article provides a comprehensive guide with solutions, utilizing Slowly Changing Dimensions (SCD), to tackle potential issues during data restatement.
Slowly changing dimensions refer to infrequent changes in dimension values, which occur sporadically and are not tied to a daily or regular time-based schedule, as dimensions typically change less frequently than transaction entries in a system. For example, a jewelry company that has its customers placing a new order on their website will become a new row in the order fact table. On the other hand, the jewelry company rarely changes their product name and their product description but that doesn’t mean it will never happen in the future.
Managing changes in these dimensions requires employing Slowly Changing Dimension (SCD) management techniques, which are categorized into defined SCD types, ranging from Type 0 through Type 6, including some combination or hybrid types. We can employ one of the following methods:
SCD Type 0: Ignore
Changes to dimension values are completely disregarded, and the values of dimensions remain unchanged from the time they were initially created in the data warehouse.
SCD Type 1: Overwrite/ Replace
This approach is applicable when the previous value of the dimension attribute is no longer relevant or important. However, historical tracking of changes is not necessary.
SCD Type 2: Create a New Dimension Row
This approach is recommended as the primary technique for addressing changing dimension values, involving the creation of a second row for the dimension with a start date, end date, and potentially a “current/expired” flag. It is suitable for our scenarios like product description or address changes, ensuring a clear partitioning of history. The new dimension row is linked to newly inserted fact rows, with each dimension record linked to a subset of fact rows based on insertion times — those before the change linked to the old dimension row, and those after linked to the new dimension row.
Figure 1 (Image by the author): PRODUCT_KEY = “cd3004” is the restatement for PRODUCT_KEY = “cd3002”
SCD Type 3: Create a “PREV” Column
This method is suitable when both the old and new values are relevant, and users may want to conduct historical analysis using either value. However, it is not practical to apply this technique to all dimension attributes, as it would involve providing two columns for each attribute in dimension tables or more if multiple “PREV” values need preservation. It should be selectively used where appropriate.
Figure 2 (Image by the author): PRODUCT_KEY = “cd3002” is restated with new PRODUCT_NAME, the old PRODUCT_NAME is stored in NAME_PREV column
SCD Type 4: Rapidly Changing Large Dimensions
What if in a scenario you need to capture every change to every dimension attribute for a very large dimension of retail, say a million plus customers of your huge jewelry company? Using type 2 above will very quickly explode the number of rows in the customer dimension table to tens or even hundreds of millions of rows and using type 3 is not viable.
A more effective solution for rapidly changing and large volume dimension tables is to categorize attributes (e.g.,customer age category, gender, purchasing power, birthday, etc.) and separate them into a secondary dimension, like a customer profile dimension. This table, acting as a “full coverage” dimension table all potential values for every category of dimension attributes preloaded into the table, which can better manage the granularity of changes while avoiding excessive row expansion in the main customer dimension.
For example, if we have 8 age categories, 3 different genders, 6 purchasing power categories, and 366 possible birthdays. Our “full coverage” dimension table for customer profiles that contains all the above combinations will be 8 x 3 x 6 x 366 combinations or 52704 rows.
We’ll need to generate surrogate_key for this dimension table and establish a connection to a new foreign key in the fact table. When a modification occurs in one of these dimension categories, there’s no necessity to add another row to the customer dimension. Instead, we generate a new fact row and associate it with both the customer dimension and the new customer profile dimension.
Figure 3 (Image by the author): Entity relationship diagram for a “Full Coverage Dimension” table
SCD Type 5: An Extension to Type 4
To enhance the Type 4 approach mentioned earlier, we can establish a connection between the customer dimension and the customer profile dimension. This linkage enables the tracking of the “current” customer profile for a specific customer. The key facilitates the connection of the customer with the latest customer profile, which allows seamless traversal from the customer dimension to the most recent customer profile dimension without the need to link through the fact table.
Figure 4 (Image by the author): Entity relationship diagram shows the linkage between the customer_dim to the cust_profile_dimension
SCD Type 6: A Hybrid Technique
With this approach, you integrate both Type 2 (new row) and Type 3 (“PREV” column). This blended approach offers the advantages of both methodologies. You can retrieve facts using the “ PREV “ column, which provides historical values and presents facts associated with the product category at that specific time. Simultaneously, querying by the “new” column provides all facts for both the current and all preceding values of the product category.
Figure 5 (Image by the author): PRODUCT_ID = “cd3004” is the restatement for PRODUCT_ID = “cd3002”, which PRODUCT_ID = “cd3001” is marked as “EXPIRED” in LAST_ACTION column
Bonus and Conclusion
Normally, data extraction comes in STAR schema, which includes one fact table and multiple dimension tables in an enterprise. While the dimension tables store all the descriptive data and primary keys, the fact table contains numeric and additive data that references the primary keys of each dimension around it.
Figure 6 (Image by the author): Illustration of Star Schema
However, if your marketing sales data extract is provided as a single denormalized table without distinct dimension tables and lacks the primary key for its descriptive data, future updates to product names may pose challenges. Handling such scenarios in your existing pipeline can be more complicated.
The absence of primary keys in the descriptive data can lead to issues during data restatement, especially when you are dealing with large datasets. For instance, if a product name is updated in the restatement extract without a unique product_key, the incremental load pipeline may treat it as a new product, impacting the historical data in your consumption layer. To address this, creating surrogate_key for the product dimension and a mapping table to link original and restated product names is necessary for maintaining data integrity.
In conclusion, every aspect of data warehouse design should be carefully considered, taking into account potential edge cases.
We use cookies on our website to give you the most relevant experience by remembering your preferences and repeat visits. By clicking “Accept”, you consent to the use of ALL the cookies.
This website uses cookies to improve your experience while you navigate through the website. Out of these, the cookies that are categorized as necessary are stored on your browser as they are essential for the working of basic functionalities of the website. We also use third-party cookies that help us analyze and understand how you use this website. These cookies will be stored in your browser only with your consent. You also have the option to opt-out of these cookies. But opting out of some of these cookies may affect your browsing experience.
Necessary cookies are absolutely essential for the website to function properly. These cookies ensure basic functionalities and security features of the website, anonymously.
Cookie
Duration
Description
cookielawinfo-checkbox-analytics
11 months
This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional
11 months
The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary
11 months
This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others
11 months
This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance
11 months
This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy
11 months
The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.
Functional cookies help to perform certain functionalities like sharing the content of the website on social media platforms, collect feedbacks, and other third-party features.
Performance cookies are used to understand and analyze the key performance indexes of the website which helps in delivering a better user experience for the visitors.
Analytical cookies are used to understand how visitors interact with the website. These cookies help provide information on metrics the number of visitors, bounce rate, traffic source, etc.
Advertisement cookies are used to provide visitors with relevant ads and marketing campaigns. These cookies track visitors across websites and collect information to provide customized ads.