Use density-based clustering and survival analysis to estimate when earthquakes occur
Introduction
Whilst we have a fairly good understanding of where and why earthquakes occur, predicting when an earthquake will happen is very challenging. In this article we will use historical earthquake data and a combination of clustering and survival analysis to answer the following questions:
“How are earthquake events spatially distributed?”
“If an earthquake happens, what’s the probability of another earthquake happening within the next hour?”
“Are there regional differences in the time between earthquake events?”
Getting the data
We will be using data from the USGS[1] that records earthquake events (earthquake data resides in the Public Domain and is provided courtesy of the U.S. Geological Survey). They have a nice API that means we can get earthquake data in a straightforward way in Python:
import datetime
import http.client
import json
import pandas as pd
url = '/fdsnws/event/1/query'
query_params = {
    'format': 'geojson',
    'starttime': "2020-01-01",
    'limit': '10000',
    'minmagnitude': 3,
    'maxlatitude': '47.009499',
    'minlatitude': '32.5295236',
    'maxlongitude': '-114.1307816',
    'minlongitude': '-124.482003',
}
# Build the request path with its query string and make the GET request
full_url = f'{url}?{"&".join(f"{key}={value}" for key, value in query_params.items())}'
conn = http.client.HTTPSConnection('earthquake.usgs.gov')
conn.request('GET', full_url)
response = conn.getresponse()
In our API request we have a few parameters:
- limit — The maximum number of earthquake events we want
- starttime — When should the earliest earthquake event be
- minmagnitude — The minimum magnitude of an earthquake event
- maxlatitude — Maximum latitude
- minlatitude — Minimum latitude
- maxlongitude — Maximum longitude
- minlongitude — Minimum longitude
We will set the minimum magnitude to 3, as this is typically the threshold at which an earthquake can be felt at the surface, and provide latitude and longitude coordinates that form a bounding box around the US state of California. California is home to the famous San Andreas Fault, so there will be plenty of earthquake events for us to analyse.
To handle the GeoJSON response we use the following code:
if response.status == 200:
    data = response.read().decode('utf-8')
    json_data = json.loads(data)
    features = json_data['features']
    df = pd.json_normalize(features)
    if df.empty:
        print('No earthquakes recorded.')
    else:
        # Split the GeoJSON coordinate triplet into separate columns
        df[['Longitude', 'Latitude', 'Depth']] = df['geometry.coordinates'].apply(lambda x: pd.Series(x))
        # USGS event times are in milliseconds since the epoch
        df['datetime'] = df['properties.time'].apply(lambda x: datetime.datetime.fromtimestamp(x / 1000))
        df['datetime'] = df['datetime'].astype(str)
        df.sort_values(by=['datetime'], inplace=True)
else:
    print(f"Error: {response.status}")
This returns a Pandas DataFrame where each row is an earthquake event, with columns describing event properties (e.g. latitude, longitude, magnitude, id).
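As a quick sanity check, we can peek at the columns we will use later. The column names below are those produced by pd.json_normalize on the USGS GeoJSON plus the ones we just created; your exact set may differ slightly depending on the API response:

# Peek at the parsed events (columns from pd.json_normalize plus the ones created above)
print(df.shape)
print(df[['id', 'Latitude', 'Longitude', 'Depth', 'properties.mag', 'datetime']].head())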
For the geospatially-minded out there, you’ll recognise that the API call coordinates define a bounding box (i.e., the query finds earthquakes in a rectangular area), whereas we only care about earthquakes in California. We therefore need to filter out all the earthquakes that did not occur in California (e.g., events in Nevada and Oregon). To do this we will use the convenient OSMNX package to get a MultiPolygon of the California border:
import osmnx
import geopandas as gpd
place = "California, USA"
gdf = osmnx.geocode_to_gdf(place)
# Get the target geometry
gdf = gdf[["geometry", "bbox_north", "bbox_south", "bbox_east", "bbox_west"]]
Next we’ll convert our earthquake coordinates into Point geometries using Shapely, and then perform a spatial join to filter out non-Californian earthquakes:
from shapely.geometry import Point
# Convert to a GeoDataFrame with Point geometry
geometry = [Point(xy) for xy in zip(df['Longitude'], df['Latitude'])]
earthquake_gdf = gpd.GeoDataFrame(df, geometry=geometry, crs='EPSG:4326')
# Spatial join: keep only points that fall within the California polygon
points_within_california = gpd.sjoin(earthquake_gdf, gdf, how='inner', predicate='within')
# Keep only the columns we need going forward
df = points_within_california[['id', 'Latitude', 'Longitude', 'datetime', 'properties.mag']].copy()
Now to see our first map of Californian earthquakes colour-coded by magnitude:
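The original figure is an interactive map, which is not reproduced here. A minimal sketch of how such a map could be drawn with folium follows; the colour thresholds and marker sizing are arbitrary choices for illustration, not taken from the original article:

import folium

def magnitude_colour(mag):
    # Arbitrary thresholds purely for illustration
    if mag < 4:
        return 'green'
    elif mag < 5:
        return 'orange'
    return 'red'

quake_map = folium.Map(location=[df['Latitude'].mean(), df['Longitude'].mean()], zoom_start=6)
for _, row in df.iterrows():
    folium.CircleMarker(
        location=[row['Latitude'], row['Longitude']],
        radius=2 + row['properties.mag'],  # scale marker size with magnitude
        color=magnitude_colour(row['properties.mag']),
        fill=True,
        fill_opacity=0.6,
        popup=f"M{row['properties.mag']} at {row['datetime']}",
    ).add_to(quake_map)
quake_map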
Excellent, we now have a map of California earthquakes!
Spatial Clustering of earthquakes
Looking at the earthquakes plotted on the map, you can see that events are spatially arranged in a SE-NW orientation, with roughly two to three clear linear groupings. This makes sense because:
- Earthquakes occur due to movements along faults, which are linear features (i.e., cracks in the Earth’s crust).
- The orientation of earthquake events matches the orientation of the San Andreas Fault zone.
Our goal is now to cluster these earthquake events based on their location, so that we can produce spatial clusters associated with regional faults.
For clustering we will use HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise), a density-based clustering algorithm that works well on certain types of datasets, such as those with irregularly shaped clusters and varying cluster densities. This is relevant to our dataset, since earthquake events can occur in irregularly shaped clusters (i.e., along faults of different orientations) and at different densities (i.e., some fault zones are more active than others). HDBSCAN is also robust to outliers, which in this case are earthquakes that occur far away from fault zones.
Through some experimentation, the following parameters produce reasonable results (there is a bit of trial and error involved, guided by our prior assumptions about how earthquakes should be clustered):
import folium
import numpy as np
from sklearn.cluster import HDBSCAN  # or `from hdbscan import HDBSCAN`

# Assumed setup: the original snippet references result_df and data_scaled without defining them.
# Here result_df is simply our filtered dataframe, and data_scaled holds lat/lon in radians,
# which is what the haversine metric expects.
result_df = df
data_scaled = np.radians(result_df[['Latitude', 'Longitude']].to_numpy())

# Fit HDBSCAN
clusterer = HDBSCAN(min_cluster_size=200, metric='haversine', min_samples=20, cluster_selection_epsilon=0.05)
result_df['cluster_label_hdbscan'] = clusterer.fit_predict(data_scaled)

# Find the number of unique cluster labels (including the -1 noise label)
num_clusters = result_df['cluster_label_hdbscan'].nunique()

# Create a list of colours (light grey is reserved for noise/outliers)
colors = ['lightgray', 'red', 'blue', 'green'][:num_clusters]

# Create a Folium map centred at the mean coordinates
map_center = [result_df['Latitude'].mean(), result_df['Longitude'].mean()]
mymap = folium.Map(location=map_center, zoom_start=6)

# Add a marker for each earthquake, coloured by its cluster
for _, row in result_df.iterrows():
    cluster_color = colors[row['cluster_label_hdbscan'] + 1]  # Shift by 1 so the -1 noise label maps to grey
    folium.CircleMarker(
        location=[row['Latitude'], row['Longitude']],
        radius=2,
        color=cluster_color,
        fill=True,
        fill_color=cluster_color,
        fill_opacity=0.7,
        popup=f'Cluster: {row["cluster_label_hdbscan"]}'
    ).add_to(mymap)
mymap
Super, we have now used HDBSCAN to cluster our earthquake events into three areas, which corresponds with our prior regional knowledge. Note that some earthquakes are considered outliers and coloured grey.
Examining the time between earthquakes
Since we have the date and time of each earthquake we can calculate the time between each event (i.e., the duration between two consecutive earthquakes), which is required for our survival analysis in the next section. We have already sorted our values by the ‘datetime’ column so they’re in chronological order; now we need to calculate the time difference:
# Convert the 'datetime' strings to proper datetime objects
df['time'] = pd.to_datetime(df['datetime'])
# Sort the dataframe chronologically
df.sort_values(by=['time'], inplace=True)
# Time elapsed between each event and the next one (NaT for the most recent event)
df['time_elapsed'] = df['time'].diff().shift(-1)
Since there will be no time difference for the last earthquake (i.e., the most recent), we can calculate the difference from a time we’ll take as the present day:
# Take the present time to be '2024-01-08 16:06:00.000'
present_time = pd.to_datetime('2024-01-08 16:06:00.000')
# For the last (most recent) event, replace NaT with the time between the event and the present
df['time_elapsed'] = df['time_elapsed'].fillna(present_time - df['time'])
Finally we will add a new column indicating whether an earthquake was followed by another earthquake. Whilst this seems slightly redundant, given that only the last earthquake is affected by this, it highlights an interesting point about using survival analysis for these types of tasks (which we’ll cover later):
# Label whether each event was followed by another earthquake (1) or is censored (0);
# only the most recent event has no follow-up, so it is the only censored observation
df['event_happened'] = df['time'].shift(-1).notna().astype(int)
Here is a histogram of the elapsed time between earthquake events:
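The figure itself is not reproduced here, but a short sketch to generate it, using Plotly as elsewhere in the article and converting the Timedelta column to hours, might look like this:

import plotly.express as px

# Elapsed time between consecutive earthquakes, in hours
hours_between = df['time_elapsed'].dt.total_seconds() / 3600
fig = px.histogram(x=hours_between, nbins=50, labels={'x': 'Elapsed Time (hours)'})
fig.update_layout(yaxis_title='Frequency')
fig.show()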
We can see that the majority of earthquakes occur in short succession (within roughly 2.5 hours of one another), with far fewer occurring at longer intervals (the maximum time between earthquakes in this dataset is 350 hours).
Survival Analysis of Earthquake Time Intervals
We have our earthquake clusters, as well as the time between earthquakes, so now we use Survival Analysis to tell us about the probability of earthquake occurrences. Our goals will therefore be to:
- Use survival analysis to create curves that show the probability of an earthquake occurring over time, given that an earthquake has just occurred.
- Compare our probability curves across our already clustered earthquake regions, to see if there are any similarities/differences in regional earthquake activity.
What is survival analysis?
There are plenty of well-written articles [2,3,4] covering this, but in short, survival analysis is a statistical technique used to analyse time-to-event data, typically in the context of biomedical or observational studies. It focuses on estimating and modelling the time until an event of interest occurs, such as death, failure, or another specific outcome, providing insights into the probability of events happening over time and the impact of covariates on the event occurrence.
In our case, an earthquake has happened and we want to know the probability of another earthquake occurring over time. Note: we assume that earthquake events are independent, and we only retrieve events with a minimum magnitude of 3; both are limitations of this work, since they oversimplify earthquake dynamics.
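For reference (these are standard definitions, not specific to this dataset): the survival function is S(t) = P(T > t), the probability that no event has occurred by time t, and the Kaplan-Meier estimator approximates it as Ŝ(t) = ∏ᵢ (1 − dᵢ/nᵢ), taking the product over the observed event times tᵢ ≤ t, where dᵢ is the number of events at tᵢ and nᵢ is the number of observations still at risk just before tᵢ. The curves plotted below are 1 − Ŝ(t), i.e. the probability that another earthquake has occurred within time t.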
The Kaplan-Meier estimator is a non-parametric method used in survival analysis to estimate the survival function, which will give us our probabilities. We will use the package ‘lifelines’ to fit the Kaplan-Meier estimator to the earthquakes in each cluster, to produce a probability curve:
import plotly.graph_objects as go
from lifelines import KaplanMeierFitter

# Initialise the Kaplan-Meier fitter
kmf = KaplanMeierFitter()

# Create a Plotly figure
fig = go.Figure()
color_map = {0: 'red', 1: 'blue', 2: 'green', -1: 'gray'}

# Fit and plot a curve for each cluster, excluding the noise cluster (-1)
for label, group in result_df.groupby('cluster_label_hdbscan'):
    if label != -1:
        durations = group['time_elapsed'].dt.total_seconds() / 3600  # Convert seconds to hours
        event_observed = group['event_happened'].values
        kmf.fit(durations=durations, event_observed=event_observed)
        # Plot 1 - S(t): the probability that another earthquake has occurred by time t
        fig.add_trace(go.Scatter(
            x=kmf.survival_function_.index,
            y=1 - kmf.survival_function_.KM_estimate,
            mode='lines',
            name=f'Cluster {label}',
            line=dict(color=color_map[label])
        ))

# Customise the plot
fig.update_layout(
    xaxis_title='Elapsed Time (hours)',
    yaxis_title='Probability of experiencing an earthquake',
    legend_title='Cluster Labels',
    width=800,
    height=500,
    margin=dict(l=50, r=20, t=50, b=50),
    legend=dict(x=0.8, y=0.99),
)

# Show the plot
fig.show()
Before seeing the results, notice the ‘event_observed’ parameter, which uses the ‘event_happened’ column we created earlier. Our final earthquake is an example of a right-censored data point: there was no following earthquake, so it is treated as a partial observation.
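If right-censoring is unfamiliar, here is a tiny toy example (the durations are made up purely for illustration) showing how lifelines treats it: a 0 in event_observed tells the estimator only that the true waiting time is at least that long, rather than exactly that long.

from lifelines import KaplanMeierFitter

toy_durations = [2, 5, 5, 8, 12]   # hypothetical waiting times in hours
toy_observed = [1, 1, 1, 1, 0]     # the last observation is censored: no follow-up event (yet)
toy_kmf = KaplanMeierFitter()
toy_kmf.fit(durations=toy_durations, event_observed=toy_observed)
print(toy_kmf.survival_function_)  # the estimate only drops at observed (uncensored) event times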
Now to view the curves:
Each cluster of earthquakes has its own coloured curve, where the X axis represents the time elapsed after an earthquake, and the Y axis represents the probability of an earthquake occurring. Some interesting observations:
- The shapes of the curves are similar.
- The initial steepness of the curves varies by cluster location.
Let’s focus on the initial time after an earthquake has occurred, since most of our data shows that earthquakes happen in relatively quick succession:
This suggests that the time between earthquakes is not the same regionally. Looking at the graph, we can say that once an earthquake has happened, the probability of another earthquake happening within ten hours is approximately 35% for cluster 0, 60% for cluster 1, and 48% for cluster 2. This shows that earthquakes in the region belonging to cluster 1 happen in quicker succession than in the other two areas.
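To read these values off the fitted estimators rather than off the chart, we can query each cluster’s Kaplan-Meier fit directly. A sketch, refitting per cluster as above (the exact percentages will depend on your data pull):

from lifelines import KaplanMeierFitter

for label, group in result_df.groupby('cluster_label_hdbscan'):
    if label == -1:
        continue  # skip the noise cluster
    kmf = KaplanMeierFitter()
    durations = group['time_elapsed'].dt.total_seconds() / 3600  # hours
    kmf.fit(durations=durations, event_observed=group['event_happened'])
    # Probability that another earthquake occurs within 10 hours = 1 - S(10)
    print(f"Cluster {label}: {1 - kmf.predict(10):.0%}")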
We can double check this by looking at a histogram of elapsed time between earthquakes for each cluster location:
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import numpy as np

# Exclude the noise cluster (-1)
filtered_df = result_df[result_df['cluster_label_hdbscan'] != -1]

# Create subplots in a 2 x 2 grid (one panel per cluster)
fig = make_subplots(rows=2, cols=2,
                    subplot_titles=[f'Cluster {label}' for label in filtered_df['cluster_label_hdbscan'].unique()],
                    shared_xaxes='all', shared_yaxes='all')

# Create a histogram of elapsed time for each cluster
for i, (label, group) in enumerate(filtered_df.groupby('cluster_label_hdbscan')):
    # Calculate the histogram (elapsed time in hours)
    hist_data, bin_edges = np.histogram(group['time_elapsed'].dt.total_seconds() / 3600, bins=20)

    # Add the histogram trace to its subplot (use the left bin edges so x and y lengths match)
    fig.add_trace(go.Bar(
        x=bin_edges[:-1],
        y=hist_data,
        opacity=0.5,
        name=f'Cluster {label}',
        marker=dict(color=color_map[label])
    ), row=(i // 2) + 1, col=(i % 2) + 1)

    # Customise the subplot axes
    fig.update_xaxes(title_text='Elapsed Time (hours)', row=(i // 2) + 1, col=(i % 2) + 1)
    fig.update_yaxes(title_text='Frequency', row=(i // 2) + 1, col=(i % 2) + 1)

# Update the overall layout
fig.update_layout(
    showlegend=False,
    margin=dict(l=50, r=50, t=50, b=50),
    height=600,
    width=800,
)

# Show the plot
fig.show()
Consistent with the probability curves, cluster 1 has shorter intervals between earthquake events relative to clusters 0 and 2.
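If we want more than a visual comparison, a log-rank test is a standard way to check whether survival curves differ across groups. A sketch using lifelines follows; it still inherits the independence assumption noted earlier, and interpreting the p-value is left to the reader:

from lifelines.statistics import multivariate_logrank_test

test_df = result_df[result_df['cluster_label_hdbscan'] != -1]
lr_result = multivariate_logrank_test(
    event_durations=test_df['time_elapsed'].dt.total_seconds() / 3600,
    groups=test_df['cluster_label_hdbscan'],
    event_observed=test_df['event_happened'],
)
lr_result.print_summary()  # a small p-value suggests the curves differ between clusters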
Conclusion
This work used spatial clustering and survival analysis to uncover temporal and geographic patterns in earthquake data. By employing HDBSCAN for spatial clustering and the Kaplan-Meier estimator for survival analysis, we gained insight into regional variations in the time between earthquakes, which we translated into probability curves that could be useful for risk assessment and preparedness in earthquake-prone regions.
Thanks for reading!
Disclaimer: This article should only be used for learning purposes.