Open-Source Data Observability with Elementary — From Zero to Hero (Part 2)
The guide to take your dbt tests to the next level for free
In the previous part, we set up Elementary in our dbt repository and hopefully also ran it in production. In this part, we will go into more detail, examine the tests available in Elementary with examples, and explain which tests suit which kinds of data scenarios.
Here is the first part if you missed it:
Open-Source Data Observability with Elementary — From Zero to Hero (Part 1)
While exploring the report, we saw a “Test Configuration” tab that is available only in Elementary Cloud. It is a convenient UI section of the report in the cloud, but we can also create test configurations in the OSS version of Elementary in .yaml files. This works like setting up native dbt tests and follows the same dbt hierarchy, where smaller, more specific configurations override higher-level ones.
What are the tests you can set up? Elementary groups them under three main categories: schema tests, anomaly tests, and Python tests. Let’s go through them and understand how they work one by one:
Schema Tests
As the name suggests, schema tests focus on schemas. Depending on the tests you configure, you can detect schema changes, compare the schema against a baseline, validate the contents of a JSON column, or monitor columns that feed downstream exposures.
- Schema changes: These tests monitor and alert on unexpected changes in the schema, such as the addition or deletion of columns or changes in column data types.
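A minimal sketch of such a test could look like this (the model name and tag here are placeholders, not from our actual project):
models:
  - name: customers
    tests:
      # alerts on columns that were added, removed, or changed their data type
      - elementary.schema_changes:
          tags: ["elementary"]
          config:
            severity: warn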
- Schema changes from baseline: Like schema changes tests, schema changes from baseline tests compare the current schema to a defined baseline schema. For this test to work, a baseline schema needs to be defined and added under the columns. Elementary also provides a macro to create this test automatically; running it without arguments would create the test for all sources, so pass the appropriate arguments to generate only the tests you need, then paste the output into the relevant .yml file. The following command generates a configuration with the fail_on_added argument set to true:
# Generating the configuration
dbt run-operation elementary.generate_schema_baseline_test --args '{"name": "sales_monthly", "fail_on_added": true}'

# Output:
models:
  - name: sales_monthly
    columns:
      - name: country
        data_type: STRING
      - name: customer_key
        data_type: INT64
      - name: store_id
        data_type: INT64
    tests:
      - elementary.schema_changes_from_baseline:
          fail_on_added: true
- Both tests appear similar but are designed for different scenarios. The schema_changes test is ideal when dealing with sources where the schema changes frequently, allowing for early detection of unexpected changes like the addition of a new column. On the other hand, the schema_changes_from_baseline test is better suited for situations where the schema should remain consistent over time, such as in regulatory settings or production databases where changes need to be carefully managed.
- JSON schema (currently supported only in BigQuery and Snowflake): Checks whether the values in a given string column match a defined JSON schema. As with schema_changes_from_baseline, Elementary provides a run operation for json_schema to create the test automatically for a given model or source.
# Example usage
dbt run-operation elementary.generate_json_schema_test --args '{"node_name": "customer_dimension", "column_name": "raw_customer_data"}'
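The generated test lands under the given column and uses standard JSON Schema keywords. As a rough illustration only (the properties below are placeholders, not our real schema), the output looks along these lines:
models:
  - name: customer_dimension
    columns:
      - name: raw_customer_data
        tests:
          - elementary.json_schema:
              type: object
              properties:
                name:
                  type: string
                age:
                  type: integer
              required: ["name"]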
- Finally, exposure_schema: Elementary powers up exposures by enabling the detection of changes in a model’s columns that would break downstream exposures. Below is a pseudo example of how we use it for our BI dashboards with multiple dependencies:
...
# full exposure definition
depends_on:
  - ref('api_request_per_customer')
  - ref('api_request_per_client')
owner:
  name: Sezin Sezgin
  email: [email protected]
meta:
  referenced_columns:
    - column_name: "customer_id"
      data_type: "numeric"
      node: ref('api_request_per_customer')
    - column_name: "client_id"
      data_type: "numeric"
      node: ref('api_request_per_client')
Anomaly Detection Tests
These tests monitor significant changes or deviations in a specific metric by comparing current values with historical values over a defined time frame. An anomaly is simply an outlier: a value outside the expected range calculated over that time frame. Elementary uses the Z-score for anomaly detection, and values with a Z-score of 3 or higher are marked as anomalies. This threshold can be raised in the settings with anomaly_score_threshold. Below, I explain each test and which kinds of data it suits best, with examples.
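For instance, a sketch of raising the threshold globally in dbt_project.yml, assuming Elementary’s vars-based configuration (3.5 is an arbitrary example value):
vars:
  # flag values only when their Z-score exceeds 3.5 instead of the default 3
  anomaly_score_threshold: 3.5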
- volume_anomalies: When you ingest data from a source or build tables in your data warehouse, you usually already observe some volume trends, whether daily or weekly. Unexpected deviations from these trends, such as a spike caused by duplication or an unusually small batch of inserts that would still pass freshness tests, can be detected by Elementary’s volume_anomalies test. How does it calculate volume anomalies? Most of the anomaly tests work similarly: the test splits the data into time buckets and calculates the number of rows per bucket over a training_period, then compares the row counts per bucket within the detection period against those from the training period. These tests are particularly useful for data with expected patterns, such as finding unusual trading volumes in financial data, analyzing sales data, or detecting unusual network traffic activity.
models:
  - name: login_events
    config:
      elementary:
        timestamp_column: "loaded_at"
    tests:
      - elementary.volume_anomalies:
          where_expression: "event_type in ('event_1', 'event_2') and country_name != 'unwanted country'"
          time_bucket:
            period: day
            count: 1
          # optional - use tags to run elementary tests on a dedicated run
          tags: ["elementary"]
          config:
            # optional - change severity
            severity: warn
- freshness_anomalies: These tests check the freshness of your table over a time window. dbt has its own freshness tests, but the two serve different purposes. dbt’s freshness tests are straightforward: they check whether data is up to date, validating that it arrived within an expected timeframe. Elementary’s tests focus on detecting anomalies, highlighting less visible issues such as irregular update patterns or unexpected delays caused by problems in the pipeline. They are especially useful when punctuality is important and irregularities might indicate issues.
models:
  - name: ger_login_events
    config:
      elementary:
        timestamp_column: "ingested_at"
      tags: ["elementary"]
    tests:
      - elementary.freshness_anomalies:
          where_expression: "event_id in ('successfull') and country != 'ger'"
          time_bucket:
            period: day
            count: 1
          config:
            severity: warn
      - elementary.event_freshness_anomalies:
          event_timestamp_column: "created_at"
          update_timestamp_column: "ingested_at"
          config:
            severity: warn
- event_freshness_anomalies: Similar to freshness anomalies, event freshness is more granular and focuses on specific events within datasets, and it complements the freshness tests. These tests are ideal for real-time or near-real-time systems where the timeliness of individual events is critical, such as sensor data, real-time user actions, or transactions. For example, if events are normally logged within seconds and suddenly start arriving with minutes of delay, Elementary would detect this and alert.
- dimension_anomalies: These are best suited to tracking the consistency and distribution of categorical data. For example, if you have a table that tracks events across countries, Elementary can track the distribution of events across those countries and alert if there is a sudden drop attributed to one of them, as sketched below.
- all_columns_anomalies: Best used when you need to ensure the overall health and consistency of a dataset. This test checks the data type of each column and runs only the monitors relevant to it. It is useful after major updates, to check whether the changes introduced errors that were missed before, or when the dataset is too large to check each column manually.
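A sketch for the countries scenario above (the model name is hypothetical):
models:
  - name: events_by_country
    config:
      elementary:
        timestamp_column: "loaded_at"
    tests:
      # compares row counts per country between the detection and training periods
      - elementary.dimension_anomalies:
          dimensions:
            - country
          config:
            severity: warn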
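A minimal sketch, assuming a wide table where checking columns individually is impractical (the optional column_anomalies list restricts which monitors run):
models:
  - name: customer_dimension
    tests:
      - elementary.all_columns_anomalies:
          # run only these monitors on each column where they apply
          column_anomalies:
            - null_count
            - missing_count
            - zero_count
          config:
            severity: warn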
Besides all the tests mentioned above, Elementary also enables running Python tests using dbt’s building blocks. This expands your testing coverage considerably, but that topic deserves its own article.
How are we using the tests mentioned in this article? Besides some of the tests above, we use Elementary to write metadata for each dbt execution into BigQuery, which makes it more easily accessible; otherwise, dbt only outputs these artifacts as JSON files (such as run_results.json).
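For reference, the package configuration in dbt_project.yml that routes Elementary’s result models into their own dataset looks roughly like this (the schema name is our choice):
models:
  elementary:
    # Elementary's own models (run results, test results, metadata) are built here
    +schema: "elementary"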
Implementing all the tests mentioned in this article is not necessary; I would even say it is discouraged, if not impossible. Every data pipeline and its requirements are different. Incorrect or excessive alerting can erode the business’s trust in your pipelines and your data. Finding the sweet spot with the right amount of test coverage comes with time.
I hope this article was useful and gave you some insight into how to implement data observability with an open-source tool. Thanks a lot for reading, and if you are already a member of Medium, you can follow me there too! Let me know if you have any questions or suggestions.
References In This Article
- dbt Labs. (n.d.). run-results.json (Version 1.0). Retrieved September 5, 2024, from https://docs.getdbt.com/reference/artifacts/run-results-json
- Elementary Data. (n.d.). Python tests. Retrieved September 5, 2024, from https://docs.elementary-data.com/data-tests/python-tests
- Elementary Data. (n.d.). Elementary Data Documentation. Retrieved September 5, 2024, from https://docs.elementary-data.com
- dbt Labs. (n.d.). dbt Documentation. Retrieved September 5, 2024, from https://docs.getdbt.com
- Elementary Data. (n.d.). GitHub Repository. Retrieved September 5, 2024, from https://github.com/elementary-data