In this post, we review high-level details from our recent work on image inpainting of radar blind zones. We discuss the main science problems, inpainting techniques, model architecture decisions, fidelity metrics, and uncertainties, and finish with an analysis of model explainability (XAI), in the hope that this information can help others when planning future, similar projects. This work was recently published in the American Meteorological Society's Artificial Intelligence for Earth Sciences (AIES) journal (https://doi.org/10.1175/AIES-D-23-0063.1), which we recommend readers consult for additional project details.
Radar observations are powerful sources of information for tasks like precipitation forecasting. These systems are used by millions of people daily to help plan their lives, and their predictive accuracy has large economic impacts on agriculture, tourism, and the outdoor recreation industry. But how do these systems work? In a nutshell, these predictive models relate precipitation rates to power backscatter measurements from a radar signal interacting with falling hydrometeors in the atmosphere (Fig. 1). With a large enough reference dataset, we can take the information produced by the radar profile (along with some atmospheric state variables) and back out an estimate of surface precipitation.
Unlike vertically pointing surface radar instruments, which are stationary, satellites are unrestricted in space and, thanks to their orbits, can provide a much richer, more comprehensive view of global precipitation patterns. However, because satellite instruments point down towards the Earth, they exhibit a unique measurement problem that surface radars do not: a radar blind zone. As implied by its name, the blind zone is the portion of the radar profile that the satellite cannot directly observe. As the downward-pointing radar signal reaches the Earth's surface, backscatter from the ground saturates the attenuated return with noise, leaving the lowest part of the profile effectively unobserved. A comparison between surface and spaceborne radars (and the corresponding blind zone) is illustrated in Fig. 2.
While the size of the blind zone region can vary, it is a common issue on active spaceborne systems (e.g. GPM, CloudSat) that will persist as a source of uncertainty in future Earth observing missions (e.g. EarthCARE and AOS). Additionally, although this region is just a small subset of the full profile (only about 4% of CloudSat’s vertical extent, for instance), this 0–2 km of the atmosphere can contain large quantities of precipitating cloud (Lamer et al., 2020). Therefore, by masking this region, we are potentially missing a large quantity of snow (or overestimating snow in the case of virga), further increasing the uncertainty of an already uncertain estimate of surface snow accumulation.
… the blind zone leads to reflectivity being underestimated by up to 1 dB, the number of events being altered by +/- 5% and the precipitation amount being underestimated by 9 to 11 percentage points (Maahn et al., 2014).
What can we do about this?
At its core, the ideas behind image inpainting have been around for decades for use in things like image restoration (e.g. removing a scratch on a family photo) or object removal (removing a tourist from a picture of your last vacation) (Guillemot and Le Meur, 2014). Originally, this type of repair was a costly endeavour done by hand by a trained artist (Fig. 3), but as has become increasingly clear in recent years, computers are quite skilled at this task too (and require much less training time than a human)!
While the first iterations of these techniques, like structure-based linear interpolation, texture-synthesis-based Efros interpolation, and diffusion-based Laplace or Navier-Stokes interpolation, can work well in certain contexts, they often fall short when dealing with large gaps. The differences between these techniques are shown in Fig. 4, where an elephant-shaped mask is inpainted in the center of the image. These techniques rely heavily on information/patterns from the edges of the targeted inpainting region and are often unable to effectively make use of information from the scene's global context when making intelligent predictions.
However, the field of computer vision has flourished in recent decades, primarily driven by advances in computing power and the development of newer, more efficient machine learning techniques. Generative approaches, from Generative Adversarial Networks (GANs) to diffusion models such as OpenAI's DALL-E and Stability AI's Stable Diffusion, have been very popular recently, producing incredibly realistic images from a simple text prompt. There has also been previous work attempting to use similar generative methods for inpainting, but there exists a trade-off between realism and fidelity/stability for inpainting tasks (Lugmayr et al., 2022; Geiss and Hardin, 2021).
For instance, while you might generate a region of an image that looks great to the human eye, the actual pixel values are not necessarily correct if compared to a reference image and may vary quite a bit based on the provided random noise/seed (Fig. 5). This is not unexpected, though, as these techniques are not necessarily constructed with such constraints in mind and exist for other purposes.
Instead, we focus on another machine learning architecture in this work: the U-Net. U-Nets are a class of Convolutional Neural Networks (CNNs) that (typically) ingest information in the form of an image and produce an output of the same dimensions. Often used for image segmentation, the encoder-decoder architecture of the U-Net allows the model to learn both local and global features of an image (context that is often quite valuable for correctly interpreting an image's content during inpainting). We will use this architecture to teach the model to learn about latent features in the cloud aloft in order to predict near-surface reflectivity in the aforementioned radar blind zone.
The data used in this project comes primarily from two sources:
Both datasets are publicly available (governed by the Creative Commons Attribution 4.0 International licence (CC BY 4.0)) and are collocated at two Arctic locations in northern Alaska, USA: the North Slope (NSA) and Oliktok Point (OLI) sites (Fig. 6). Since we are focused on snowfall, we limit our observations to cold periods below 2 degrees C. Further, the data is split into contiguous training/validation/testing chunks to avoid overfitting from autocorrelation, as shown in Fig. 6.c. We use radar reflectivity data from the ARM KaZR, along with temperature, specific humidity, u-component wind, and v-component wind from ERA-5.
What type of U-Net should be used for this task? We experimented with a variety of different U-Nets in this project, including the UNet++ model from Zhou et al. (2018) and the 3Net+ from Huang et al. (2020). But let's talk about the model architecture for a moment. Why use these variants over just a traditional U-Net? First, let's review how U-Nets work.
U-Nets are usually thought of as having three parts: an encoder path, a bottleneck layer, and a decoder path. The encoder is responsible for taking your initially high-resolution image and, through a series of convolutions and pooling steps, trading spatial information for rich feature information (learning about the latent context of your image). Depending on how deep your network is, these latent features are encoded most densely in the low-dimensionality bottleneck layer at the base of the U. At this point, the image array may be only a fraction of the size of the original, and while you have lost most of the spatial information, the model has identified a set of embeddings that represent what it sees as key elements. Then, in the decoder path, the reverse occurs: the low-resolution, feature-rich arrays from the bottleneck layer are progressively upsampled until the feature information has been transformed back into spatial information (resulting in a final image with the same dimensions as the original).
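To make the encoder-bottleneck-decoder idea concrete, here is a minimal sketch of a toy, single-level U-Net in Keras. This is purely illustrative (the input shape and filter counts are arbitrary assumptions), not the architecture used in our study.

import tensorflow as tf
from tensorflow import keras

def tiny_unet(input_shape=(128, 128, 1)):
    """Toy U-Net: one encoder level, a bottleneck, and one decoder level."""
    inputs = keras.Input(shape=input_shape)

    # Encoder: trade spatial resolution for richer feature information
    e1 = keras.layers.Conv2D(16, 3, padding='same', activation='relu')(inputs)
    p1 = keras.layers.MaxPool2D(2)(e1)                   # 128 -> 64

    # Bottleneck: lowest resolution, densest latent features
    b = keras.layers.Conv2D(32, 3, padding='same', activation='relu')(p1)

    # Decoder: upsample back to the original resolution
    u1 = keras.layers.UpSampling2D(2)(b)                 # 64 -> 128
    u1 = keras.layers.concatenate([u1, e1])              # skip connection
    d1 = keras.layers.Conv2D(16, 3, padding='same', activation='relu')(u1)

    outputs = keras.layers.Conv2D(1, 1, activation='linear')(d1)
    return keras.Model(inputs, outputs)

The concatenate call in the sketch is a simple skip connection, which brings us to the main difference between the U-Net variants.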
One of the key differences between a U-Net, U-Net++ and 3Net+, is how each variation handles skip connections (Fig. 7). Skip connections allow these models to “skip” some data directly across the encoder path to be used in the decoder, which helps in conveying low-level feature information during decoding and produces a more stable model that converges in meaningful ways. In a vanilla U-Net for example, these connections simply concatenate feature maps from the contracting encoder path to the corresponding level in the decoding expansive path.
The UNet++ introduces a series of nested and dense skip pathways that attempt to address the unknown depth of the optimal architecture in a traditional U-Net. Instead of just having direct connections from the encoder to the decoder, the UNet++ has multiple skip pathways: for a given level in the encoder, there are skip connections to all subsequent levels in the decoder, creating a dense set of connections. These nested skip connections are designed to capture and fuse features at various semantic levels more effectively than the vanilla U-Net; however, this comes at the cost of a larger model (more parameters) and increased training times.
The 3Net+ builds on the ideas of both previous techniques and is the architecture used in our final model (Fig. 8). This method breaks the skip connections into a series of across-skip connections (similar to the vanilla U-Net), inter-skip connections, and intra-skip connections. These inter- and intra-skip connections make full use of multi-scale features in the scene by passing information in a manner that incorporates low-level details with high-level semantics from full-scale feature maps, but with fewer parameters than the U-Net++ model.
Further, our model makes use of deep supervision to learn hierarchical representations from the full-scale aggregated feature maps at each level of the decoder. This helps the model learn to correctly position clouds within the blind zone by examining the wider context of the scene. In the remaining sections we will compare the skill of this 3Net+ model trained on reflectivity only (herein denoted 3+_1), another version trained on reflectivity and ERA-5 data (3+_5), and two linear inpainting techniques using repeating extrapolation (REP) and marching average (MAR) methods. For additional details on how these are implemented, please refer to our AIES paper.
To comprehensively evaluate model performance in blind zone reconstructive accuracy, we will first examine a handful of common cases, followed by a more general statistical analysis of the full test dataset. Note that all results shown from here on out are strictly taken from the unseen test set of observations.
Examples of inpainted blind zone reflectivity values (taken from both NSA and OLI) for the REP, MAR, 3+_1 and 3+_5 models, along with the corresponding target KaZR VAP product (i.e. the ground truth) in the far-left column, are shown below in Figs. 9/10.
This first set of examples highlights cases of near surface reflectivity gradients and cloud gaps that are common at both locations. The black dashed line indicates the 1.2 km blind zone threshold (values below this are masked and reconstructed by each model), and the shaded regions indicate areas of high uncertainty in the inpainted U-Net predictions (more on this later in the Monte Carlo Dropout section).
We find that the linear models (REP and MAR) perform well for deep, homogeneous systems but fall short in more complex cases. Further, due to their reliance on blind zone threshold reflectivities, reflectivity gradients (vertical and horizontal) are missed by REP and MAR (and usually captured by the U-Nets). Finally, shallow Arctic mixed-phase clouds can also be resolved using the U-Nets, along with cloud gaps and cases of virga (Fig. 10), which is exciting as this has substantial implications for surface snowfall quantities.
The most challenging cases to accurately model are those with sparse reflectivity profiles and distant cloud. For instance, consider Fig. 11 showing two similar cases at NSA occurring in different years. To the human eye, both locations have very similar cloud structures, however one has a shallow near surface cloud in the blind zone while the other does not. The linear inpainting techniques are clearly well outside of their comfort zone here and will always produce the same “no cloud” output as we see in a). However, the U-Net models are still able to resolve cloud presence in cases such as this, with the 3+_5 model using the additional context from ERA-5 to better understand that the atmospheric conditions in this case likely result in blind zone cloud.
As we can see in Fig. 12 a), the U-Net PSD curves are closer to observations compared to the linear methods. Further, the nonlinear reconstructions produce a more realistic lowest cloud echo height (shown in the probability density plots of b), suggesting that we are better capturing cloud positions. Finally, the entire blind zone structure is summarized in c)-g), where the linear methods are able to capture general macro-scale trends in reflectivity but fall short of capturing the fine-scale variability. We do note that the U-Net models have a slight “cold” bias towards -60 dBZ due to their tendency to produce “safer” estimates closer to “no cloud” over much rarer, high intensity snowfall events.
Further, improving our predictions of surface snowfall and virga cases would have substantial hydrologic implications and reduce our uncertainty in snow accumulation quantities. So, we performed checks to see how well three cases were reconstructed with our model using probability of detection (POD) and false alarm rate (FAR):
The critical success index (CSI) for each of these metrics is shown in the performance diagram below (Fig. 13), with the 3+_5 model performing the best overall. Shallow snowfall cases were typically the hardest to reconstruct, as we saw these cases can be quite tricky to get right (Fig. 11).
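For reference, these categorical scores come from a standard contingency table of hits, misses and false alarms. The sketch below shows the textbook definitions (an illustration of the standard formulas, not our evaluation code; here FAR is taken as false alarms divided by all predicted events, as is typical for performance diagrams).

import numpy as np

def categorical_scores(pred_cloud, true_cloud):
    """POD, FAR and CSI from boolean presence/absence masks."""
    hits = np.sum(pred_cloud & true_cloud)
    misses = np.sum(~pred_cloud & true_cloud)
    false_alarms = np.sum(pred_cloud & ~true_cloud)

    pod = hits / (hits + misses)                  # probability of detection
    far = false_alarms / (hits + false_alarms)    # false alarm ratio
    csi = hits / (hits + misses + false_alarms)   # critical success index
    return pod, far, csi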
To add further trust in the model's decision-making process, we also performed a series of eXplainable Artificial Intelligence (XAI) tests (i.e., thinking towards a mechanistic interpretation of model behaviour). These tests were aimed at connecting model behaviour to logical physical processes, to both inspire confidence in the model and provide additional insights for potentially enhancing future retrievals. If we can learn about previously unknown connections in the data, that can be very valuable! Each individual XAI method gives a slightly different "local" explanation of the decision-making process, so it is useful to combine multiple tests to derive a more robust understanding.
The first and most basic test we considered was feature/activation maps. By examining the values of the ReLU activations at different levels of the encoder path in Fig. 8.b, we get a rough idea of where the model is looking in an image for a given input. As shown in Fig. 14 below, the e1 encoder layer of the 3+_5 model typically looked along cloud edges, at reflectivity gradients, and directly along the blind zone threshold.
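Extracting these activation maps in Keras only requires a small sub-model that exposes an intermediate layer's output. The snippet below is a generic sketch: 'model' stands for a trained Keras model and the layer name 'e1_conv' is a placeholder, not the actual layer name in our network.

import matplotlib.pyplot as plt
from tensorflow import keras

# Build a sub-model that returns the activations of one intermediate layer
activation_model = keras.Model(inputs=model.input,
                               outputs=model.get_layer('e1_conv').output)

# x is a single input scene with a batch dimension, e.g. shape (1, H, W, C)
feature_maps = activation_model.predict(x)

# Plot the first few channels of the (1, H', W', n_filters) activation array
for i in range(4):
    plt.subplot(1, 4, i + 1)
    plt.imshow(feature_maps[0, :, :, i], cmap='viridis')
    plt.axis('off')
plt.show()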
One of the biggest questions in this project was whether ERA-5 was providing useful context to the model. If we can use a simpler U-Net that relies only on reflectivity (i.e. the 3+_1 model), then we should do so, as this would be more computationally efficient. However, if the additional atmospheric state variables from ERA-5 provide the model with useful context for inpainting complex systems, then the use of this more complex model may be warranted.
Since we only have a handful of inputs in this case (1 mandatory (radar) and 4 supplementary (temperature, humidity, u-wind, and v-wind)), we can do an exhaustive search of the input space to evaluate their marginal contributions to accuracy. More formally, this drop-channel approach uses a pair of formulas (Eq. 1/Eq. 2 in the paper) to calculate the marginal contributions of importance from the provided inputs. Note that this technique does not consider potential nonlinear interactions between inputs.
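As a rough sketch of the procedure (not the exact formulation of Eq. 1/Eq. 2, and assuming a hypothetical train_and_evaluate() helper that trains a model on a given channel subset and returns its validation loss), the drop-channel analysis looks something like this:

from itertools import combinations
import numpy as np

channels = ['radar', 'temperature', 'humidity', 'u_wind', 'v_wind']

# Validation loss for every trained channel subset that includes radar,
# produced by the hypothetical train_and_evaluate(subset) helper
val_loss = {}
for r in range(1, len(channels) + 1):
    for subset in combinations(channels, r):
        if 'radar' in subset:
            val_loss[frozenset(subset)] = train_and_evaluate(subset)

def marginal_contribution(channel):
    """Average drop in validation loss when `channel` is added to a subset."""
    deltas = []
    for subset, loss_without in val_loss.items():
        if channel in subset:
            continue
        with_channel = frozenset(subset | {channel})
        if with_channel in val_loss:
            deltas.append(loss_without - val_loss[with_channel])
    return np.mean(deltas)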
If we perform a series of these test runs (25 epochs each) and examine changes in the validation loss, we can gain some rough insight into which inputs are most useful. The results of these tests are shown below in Fig. 15, where we note a trend of decreasing validation loss as we add more ERA-5 inputs (suggesting that none of the inputs are wholly unimportant). Further, the marginal contributions to validation loss suggest that the wind data is the most influential overall. We believe this importance may stem from the fact that, in the upper troposphere, wind patterns can hint at mesoscale atmospheric dynamics, such as the presence of high- or low-pressure systems, fronts, and jet streams (which are of course linked to cloud/hydrometeor formation).
Finally, we also examined saliency maps for a handful of cases (Fig. 16) to further compare the differences in importance between the 3+_5 and 3+_1 models. These pixel-attribution, vanilla gradient saliency maps are inspired by the work of Simonyan et al. (2014) and provide additional insight into the areas the model identifies as crucial sources of information for inpainting accuracy given an input. They are generated by running an image through the network and then extracting the gradients of the output with respect to the input across all channels. While simplistic, this method is particularly useful for visualizing which parts of the observed image are most valuable when inpainting the blind zone reflectivity values, and allows for direct plotting of the activation gradients.
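A vanilla gradient saliency map of this kind can be produced with a few lines of TensorFlow using GradientTape. The function below is a generic sketch (assuming a single-output model and a binary blind_zone_mask array that selects the inpainted pixels), not our exact plotting code.

import numpy as np
import tensorflow as tf

def vanilla_saliency(model, x, blind_zone_mask):
    """Gradient of the blind zone prediction with respect to the input scene."""
    x = tf.convert_to_tensor(x, dtype=tf.float32)    # shape (1, H, W, C)
    with tf.GradientTape() as tape:
        tape.watch(x)
        y_pred = model(x, training=False)
        # Restrict the saliency target to the inpainted (blind zone) pixels
        target = tf.reduce_sum(y_pred * blind_zone_mask)
    grads = tape.gradient(target, x)
    # Aggregate across input channels by taking the maximum absolute gradient
    return np.max(np.abs(grads.numpy()), axis=-1)[0]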
For multilayer clouds that intersect with the blind zone cut-off (e.g. Fig. 16.a), both models focus on the top of the cloud and the 1.2 km boundary threshold, as these systems often continue to extend down to the surface with similar reflectivity intensities. Both models also typically focus on and around cloud gaps in deeper systems (e.g. Fig. 16.b); however, a distinct halo of importance towards the tropopause is noticeable in the 3+_5 model. This recurring feature likely reflects the model incorporating upper-tropospheric wind and humidity data into its predictions of near-surface reflectivity. Interestingly, the 3+_1 model does not solely focus on the areas of high reflectivity in the scene but also on the regions around the cloud.
The motivating goal of this work is to eventually apply a surface-trained U-Net to spaceborne observations. While additional work needs to be completed on resolution matching between these two systems, we performed early tests against coincident CloudSat-CPR observations near NSA. The idea here is that both systems (while not perfectly overlapped) observe similar portions of the same storm system. We consider a handful of examples and include one below for a shallow cumuliform snowfall case.
In this example, we noted that both the spaceborne and surface radars observed a shallow cloud about 3 km tall (however, CloudSat misses the increased reflectivity gradient below the blind zone due to surface clutter). When this region is reconstructed using both the traditional techniques and our U-Net, we find that the U-Net is the only method that can accurately represent the band of increased reflectivity around 1 km. More formally, if we look at the structure between the nearest CloudSat observation to the site (white dashed line) and each of the closest reconstructed regions, Pearson correlations are substantially improved using the U-Net (from r_MAR=0.13 to r_3+_1=0.74).
While these examples don’t comprise a comprehensive anaylsis allowing us to say something about general performance, they do indicate that we see skill consistent with what we have previous noted when looking at the simulated surface blind zone. Further application work to spaceborne instruments is ongoing.
Before we finish up this already long post, I wanted to highlight a few of the other features we built into the model and provide some training code examples for those interested in developing their own inpainting model.
Unlike traditional Bayesian methods, a U-Net does not directly produce a physically based uncertainty estimate. To get a rough idea of model confidence and stability, we introduced dropout at inference time (Monte Carlo dropout), based on the work of Gal and Ghahramani (2016), which allows us to generate a distribution of inpainted predictions for each test case. These distributions allow us to produce confidence intervals for each inpainted pixel and to further restrict our estimates to regions the model is more certain of when inpainting. An example of this is shown below in Fig. 17.
We typically use N=50 iterations per case, and as we can see above, the areas with the highest uncertainty are typically cloud edges and cloud gaps, as the model often hallucinates when positioning these features.
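In practice, this just means keeping the Dropout layers active at prediction time and stacking many stochastic forward passes. A minimal sketch (assuming a single-output Keras model) looks like this:

import numpy as np

def mc_dropout_predict(model, x, n_iter=50):
    """Run n_iter stochastic forward passes with dropout active at inference."""
    preds = np.stack([model(x, training=True).numpy() for _ in range(n_iter)])
    mean = preds.mean(axis=0)                    # central estimate
    lower = np.percentile(preds, 2.5, axis=0)    # lower bound of 95% interval
    upper = np.percentile(preds, 97.5, axis=0)   # upper bound of 95% interval
    return mean, lower, upper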
Model training for this project was completed on two sets of hardware: a Linux-based GPU computing cluster on Microsoft Azure, and a high-performance desktop running Windows 11 (additional system details in Table 1). An extensive Bayesian hyperparameter sweep was also performed over the course of two days. Further, batch normalization is applied along with early stopping (n=20), dropout, and L2 regularization (ridge regression) to help mitigate overfitting during the training process. Learning rate decay is also applied at two epochs (450 and 475), allowing the model to more easily settle into a local loss minimum near the end of the training phase. All training runs and hyperparameter sweeps are saved online using Weights & Biases cloud storage, to monitor model learning rates and stability over time.
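For those replicating this setup, early stopping and the stepped learning rate decay can be wired up with standard Keras callbacks. The values below are illustrative assumptions (only the patience and decay epochs come from the text); the sweep-selected hyperparameters live in the repository.

from tensorflow import keras

def lr_schedule(epoch, lr):
    """Step the learning rate down near the end of training."""
    if epoch in (450, 475):   # decay epochs mentioned above
        return lr * 0.1       # decay factor is an assumption
    return lr

callbacks = [
    keras.callbacks.EarlyStopping(monitor='val_loss', patience=20,
                                  restore_best_weights=True),
    keras.callbacks.LearningRateScheduler(lr_schedule),
]

# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=500, callbacks=callbacks)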
A link to the GitHub repository is here: https://github.com/frasertheking/blindzone_inpainting
However, I wanted to provide an overview of the actual 3Net+ implementation (with variable depth) in TensorFlow below for those interested in playing around with it.
import tensorflow as tf
from tensorflow import keras as k

# preprocess() and merge_output() are project-specific helpers (see the linked
# GitHub repository); they are not reproduced in this snippet.

def conv_block(x, kernels, kernel_size=(3, 3), strides=(1, 1), padding='same',
               is_bn=True, is_relu=True, n=2, l2_reg=1e-4):
    # Stack n convolutional layers, each optionally followed by batch norm and ReLU
    for _ in range(1, n+1):
        x = k.layers.Conv2D(filters=kernels, kernel_size=kernel_size,
                            padding=padding, strides=strides,
                            kernel_regularizer=tf.keras.regularizers.l2(l2_reg),
                            kernel_initializer=k.initializers.he_normal(seed=42))(x)
        if is_bn:
            x = k.layers.BatchNormalization()(x)
        if is_relu:
            x = k.activations.relu(x)
    return x

def unet3plus(input_shape, output_channels, config, depth=4, training=False, clm=False):
    """ Prep """
    interp = config['interpolation']
    input_layer = k.layers.Input(shape=input_shape, name="input_layer")
    xpre = preprocess(input_layer, output_channels)

    """ Encoder """
    encoders = []
    for i in range(depth+1):
        if i == 0:
            e = conv_block(xpre, config['filters']*(2**i), kernel_size=(config['kernel_size'], config['kernel_size']), l2_reg=config['l2_reg'])
        else:
            e = k.layers.MaxPool2D(pool_size=(2, 2))(encoders[i-1])
            e = k.layers.Dropout(config['dropout'])(e, training=True)
            e = conv_block(e, config['filters']*(2**i), kernel_size=(config['kernel_size'], config['kernel_size']), l2_reg=config['l2_reg'])
        encoders.append(e)

    """ Middle """
    cat_channels = config['filters']
    cat_blocks = depth+1
    upsample_channels = cat_blocks * cat_channels

    """ Decoder """
    decoders = []
    for d in reversed(range(depth+1)):
        if d == 0:
            continue
        loc_dec = []
        decoder_pos = len(decoders)
        for e in range(len(encoders)):
            if d > e+1:
                # Encoder level above: downsample to match the decoder scale
                e_d = k.layers.MaxPool2D(pool_size=(2**(d-e-1), 2**(d-e-1)))(encoders[e])
                e_d = k.layers.Dropout(config['dropout'])(e_d, training=True)
                e_d = conv_block(e_d, cat_channels, kernel_size=(config['kernel_size'], config['kernel_size']), n=1, l2_reg=config['l2_reg'])
            elif d == e+1:
                # Same level: pass the encoder feature map straight across
                e_d = conv_block(encoders[e], cat_channels, kernel_size=(config['kernel_size'], config['kernel_size']), n=1, l2_reg=config['l2_reg'])
            elif e+1 == len(encoders):
                # Bottleneck: upsample to the decoder scale
                e_d = k.layers.UpSampling2D(size=(2**(e+1-d), 2**(e+1-d)), interpolation=interp)(encoders[e])
                e_d = k.layers.Dropout(config['dropout'])(e_d, training=True)
                e_d = conv_block(e_d, cat_channels, kernel_size=(config['kernel_size'], config['kernel_size']), n=1, l2_reg=config['l2_reg'])
            else:
                # Previously built decoder level: upsample its feature map
                e_d = k.layers.UpSampling2D(size=(2**(e+1-d), 2**(e+1-d)), interpolation=interp)(decoders[decoder_pos-1])
                e_d = k.layers.Dropout(config['dropout'])(e_d, training=True)
                e_d = conv_block(e_d, cat_channels, kernel_size=(config['kernel_size'], config['kernel_size']), n=1, l2_reg=config['l2_reg'])
                decoder_pos -= 1
            loc_dec.append(e_d)
        de = k.layers.concatenate(loc_dec)
        de = conv_block(de, upsample_channels, kernel_size=(config['kernel_size'], config['kernel_size']), n=1, l2_reg=config['l2_reg'])
        decoders.append(de)

    """ Final """
    d1 = decoders[len(decoders)-1]
    d1 = conv_block(d1, output_channels, kernel_size=(config['kernel_size'], config['kernel_size']), n=1, is_bn=False, is_relu=False, l2_reg=config['l2_reg'])
    outputs = [d1]

    """ Deep Supervision """
    if training:
        for i in reversed(range(len(decoders))):
            if i == 0:
                e = conv_block(encoders[len(encoders)-1], output_channels, kernel_size=(config['kernel_size'], config['kernel_size']), n=1, is_bn=False, is_relu=False, l2_reg=config['l2_reg'])
                e = k.layers.UpSampling2D(size=(2**(len(decoders)-i), 2**(len(decoders)-i)), interpolation=interp)(e)
                outputs.append(e)
            else:
                d = conv_block(decoders[i - 1], output_channels, kernel_size=(config['kernel_size'], config['kernel_size']), n=1, is_bn=False, is_relu=False, l2_reg=config['l2_reg'])
                d = k.layers.UpSampling2D(size=(2**(len(decoders)-i), 2**(len(decoders)-i)), interpolation=interp)(d)
                outputs.append(d)

    if training:
        # Merge each supervised output back with the input reflectivity
        for i in range(len(outputs)):
            if i == 0:
                continue
            d_e = outputs[i]
            outputs[i] = merge_output(input_layer, k.activations.linear(d_e), output_channels)

    return tf.keras.Model(inputs=input_layer, outputs=outputs, name='UNet3Plus')
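To build the model, you pass a configuration dictionary alongside the input shape. The values below are placeholders for illustration only, not the hyperparameters selected by our sweep:

config = {
    'interpolation': 'bilinear',
    'filters': 32,
    'kernel_size': 3,
    'dropout': 0.2,
    'l2_reg': 1e-4,
}

# Hypothetical single-channel (reflectivity-only) input of 128 x 128 pixels
model = unet3plus(input_shape=(128, 128, 1), output_channels=1,
                  config=config, depth=4, training=True)
model.compile(optimizer='adam', loss='mse')   # loss choice is an assumption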
I know this was a long post, and we covered a lot of material, but I want to give a quick summary for those who have made it this far (or those who skipped to the end).
The satellite radar blind zone is an ongoing problem for satellite-based Earth observing precipitation missions, with critical implications for global water-energy budget calculations. To overcome common issues with traditional linear inpainting methods for filling in this region, we opted to use a nonlinear, deeply supervised U-Net for radar blind zone inpainting. The U-Net outperforms linear techniques across nearly all metrics and can even reconstruct complex cloud structures like multilayer clouds, cloud gaps, and shallow cloud. Further, using a variety of XAI techniques, we found that information directly at the blind zone threshold and along the tropopause (especially wind information) is very useful in the model's decision-making process. While we don't suggest that these models fully replace current physically based solutions, we believe they offer a unique new perspective that can be used to supplement other retrievals in future missions.
We are currently working on a follow-up project to this with direct applications to CloudSat-CPR observations.
Gal, Y., & Ghahramani, Z. (2016). Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning (arXiv:1506.02142). arXiv. https://doi.org/10.48550/arXiv.1506.02142
Geiss, A., & Hardin, J. C. (2021). Inpainting radar missing data regions with deep learning. Atmospheric Measurement Techniques, 14(12), 7729–7747. https://doi.org/10.5194/amt-14-7729-2021
Guillemot, C., & Le Meur, O. (2014). Image Inpainting: Overview and Recent Advances. IEEE Signal Processing Magazine, 31(1), 127–144. https://doi.org/10.1109/MSP.2013.2273004
Huang, H., Lin, L., Tong, R., Hu, H., Zhang, Q., Iwamoto, Y., Han, X., Chen, Y.-W., & Wu, J. (2020). UNet 3+: A Full-Scale Connected UNet for Medical Image Segmentation (arXiv:2004.08790). arXiv. https://doi.org/10.48550/arXiv.2004.08790
Kidd, C., Graham, E., Smyth, T., & Gill, M. (2021). Assessing the Impact of Light/Shallow Precipitation Retrievals from Satellite-Based Observations Using Surface Radar and Micro Rain Radar Observations. Remote Sensing, 13(9), Article 9. https://doi.org/10.3390/rs13091708
King, F., & Fletcher, C. G. (2020). Using CloudSat-CPR Retrievals to Estimate Snow Accumulation in the Canadian Arctic. Earth and Space Science, 7(2), e2019EA000776. https://doi.org/10.1029/2019EA000776
Lamer, K., Kollias, P., Battaglia, A., & Preval, S. (2020). Mind the gap — Part 1: Accurately locating warm marine boundary layer clouds and precipitation using spaceborne radars. Atmospheric Measurement Techniques, 13(5), 2363–2379. https://doi.org/10.5194/amt-13-2363-2020
Lugmayr, A., Danelljan, M., Romero, A., Yu, F., Timofte, R., & Van Gool, L. (2022). RePaint: Inpainting using Denoising Diffusion Probabilistic Models. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 11451–11461. https://doi.org/10.1109/CVPR52688.2022.01117
Maahn, M., Burgard, C., Crewell, S., Gorodetskaya, I. V., Kneifel, S., Lhermitte, S., Van Tricht, K., & van Lipzig, N. P. M. (2014). How does the spaceborne radar blind zone affect derived surface snowfall statistics in polar regions? Journal of Geophysical Research: Atmospheres, 119(24), 13,604–13,620. https://doi.org/10.1002/2014JD022079
Simonyan, K., Vedaldi, A., & Zisserman, A. (2014). Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps (arXiv:1312.6034). arXiv. https://doi.org/10.48550/arXiv.1312.6034
Zhou, Z., Siddiquee, M. M. R., Tajbakhsh, N., & Liang, J. (2018). UNet++: A Nested U-Net Architecture for Medical Image Segmentation (arXiv:1807.10165). arXiv. https://doi.org/10.48550/arXiv.1807.10165
In this story, I'll share the top 5 soft skills that Data Scientists should value highly to distinguish themselves from the crowd. These observations are based on my experiences over the past decade, working with and managing Data Scientists.
I won’t catch you off guard with this one (or the next), yet curiosity remains a key characteristic for Data Scientists (and numerous other professions).
Data Scientists should value curiosity because it fosters continuous learning and adaptation in the rapidly evolving field of data science.
Given the constant influx of innovations from research papers or conferences, a curious mindset encourages Data Scientists to stay abreast of emerging trends, embrace new tools, and refine their skills.
I’ve experienced moments in my career where I would focus on a topic for months, lowering my head and forgetting about everything else. At some point, my curiosity brings me back to earth and urges me to open my eyes and look for what happened in the field during that span. Without it, I would get stuck and become outdated.
Also, curiosity is a source of creativity in problem-solving.
Data Science often presents intricate challenges that demand innovative approaches and is also often about finding new ways to apply old and proven approaches to new domains in ways that have not been tried before.
Thus, curious minds are more likely to think outside the box, widen their options, and try new things out. That’s all you need to solve problems!
Creativity doesn't find its sole source in curiosity. For example, I think that a team composed of people with diverse backgrounds is more likely to foster creativity. So if you are a manager, think about hiring people very different from you or your colleagues, be curious about unorthodox career paths, and you might find gems.
You should read more blogs and more books (or whatever other formats you like to read/listen to). You should pick topics you’re not experienced with, topics going beyond what you usually read. If you have a scientific background, read about business, communication, decision-making, etc.
I like reading about topics I’m passionate about like Basketball, Data Science, or Data Engineering. However, I also try reading more about software engineering, marketing, entrepreneurship, biology, personal finance, etc.
2. Step outside your comfort zone
Seize any opportunity to leave your comfort zone and learn new things. If you adopt a position where everything is of interest, and everyone is worth listening to, you’ll widen your world and start questioning yourself about things outside your domain of expertise.
In any profession, you’ve likely encountered the challenges of collaborating with someone who struggles with effective communication. This issue becomes particularly pronounced in the realm of Data Science, where adeptly conveying technical concepts presents a formidable challenge.
Data Scientists are often at the interface of several teams, working with various stakeholders and people with diverse backgrounds. So bridging the knowledge gap and ensuring everybody has a common understanding of what is being said is important.
Without clear communication, collaboration rapidly becomes impossible and might jeopardize a project, a team, or your job.
In my experience, I've always been impressed by people who can explain their current work without using any technical concepts. To me, it says a lot about how well a person masters the subject and has formed an overview of it before speaking.
Even if just 1% of your project time is dedicated to communicating results, this seemingly small fraction often holds 99% of the value to be conveyed.
Regrettably, I’ve noticed that many junior Data Scientists devote weeks to perfecting their technical methodologies or data analyses, only to stumble when it comes to delivering impactful insights due to inadequate communication.
You should consider every opportunity to interact with people as a chance to practice and refine your communication skills.
For example, if you have daily meetings, prepare them for a few minutes and make sure to deliver clear communication to your audience.
Whenever I have an “important” meeting, I take 5 minutes to prepare. My recipe is the following:
Always assume that they don't know anything about what you're doing, and help them close the knowledge gap so that they feel comfortable with what you are saying.
Also, ask them for feedback: do you remember what my primary objective is today? Was it clear enough?
Resources:
2. Read
Read books about communication! I highly recommend beginning with “Made To Stick” by Chip and Dan Heath, a book that offers a methodology for making your ideas more memorable and impactful.
You can also find interesting resources on the internet for free like https://online.hbs.edu/blog/post/communication-techniques
It is easy to forget about the “scientist” part of the “data scientist” role name, but it is very important to remember that a Data Scientist must apply a scientific approach to solving problems using data.
Scientific rigor forms the bedrock of Data Science, ensuring that analyses are robust, reliable, and reproducible. Applying scientific rigor mainly refers to adhering to a rigorous methodology, and critically evaluating findings.
You can be a great software engineer without applying any scientific methodology. But you won’t be a good Data Scientist without it.
In my opinion, this is not an option. Even if you’re not a research scientist, exercising scientific rigor is vital to mitigate the risk of erroneous conclusions. That’s why, whenever I discuss experiments and results with a colleague, I tend to value the scientific approach over the results in the first phase.
It's easier said than done. But if you are solving the wrong problem, you won't get anywhere and will waste your time. Unfortunately, most people jump on the first available version of a problem because they prefer writing code. So the solution is simple: invest more time in defining the problem, discussing desired outcomes with stakeholders, and setting proper starting hypotheses/constraints.
2. Statistics
Statistics is one of the 3 pillars of a Data Scientist’s hard skills. And, any scientific approach needs a fair use of statistical tools. For example, statistical tests will help you check for feature correlation or data distribution. So, if you’re not comfortable with it, think about leveling up your statistical arsenal.
You can start right now on Medium: https://towardsdatascience.com/ultimate-guide-to-statistics-for-data-science-a3d8f1fd69a7
I recommend also this comprehensive book: https://www.oreilly.com/library/view/practical-statistics-for/9781492072935/
3. Tools
People tend to try multiple things at once to save time. But going step by step, assessing one thing at a time is mandatory to make sure you understand what’s going on and draw the right conclusions.
Using the appropriate tool can significantly facilitate the process. This is precisely the function of “Experiment tracking” tools, which are increasingly used within the Data Science domain.
For personal projects like Kaggle competitions, I like using DVC which introduced experiment tracking features a few years ago. However, much more advanced tools are available on the market like MLFlow or Neptune.ai.
You can find a comprehensive comparison of experiment-tracking tools here: https://towardsdatascience.com/a-comprehensive-comparison-of-ml-experiment-tracking-tools-9f0192543feb
But, you don’t need a sophisticated tool to take notes of your thoughts, questions, and experiments. So I’d recommend at least taking the first step by just writing down things on a notepad.
As any Data Scientist knows all too well, a data analysis can vary significantly based on the narrative one wishes to convey to their audience.
There is no denying that conclusions drawn from a dataset are inherently influenced by the perspective and intentions of the presenter. This simple fact underscores the importance of integrity.
Resources:
With great power comes great responsibility. But, this is not the only reason why integrity is important to me.
Bias in models is now a well-known issue that every Data Science practitioner should care about.
Tackling this problem is a difficult task but everybody should feel concerned because of potential business impacts, and more importantly, because of the potential societal impact one biased model can have.
Interesting resources on the topic:
If you’re an AWS user like me, Sagemaker Clarify provides numerous analyses for bias detection.
If you are interested in fairness for Deep Learning, take an attentive look at DEEL lab publications on the subject and their open-source solutions like Influencia.
Another ethical aspect of Data Science is its environmental impact which is often downplayed due to its complexity in measurement and comprehension.
I guess the research in this area is expanding, and I would appreciate hearing from you if you have materials to share with me on the subject.
It is imperative for Data Scientists to actively assess and mitigate their environmental impact. For instance, they must question whether processing an extensive quantity of data is essential to achieve the desired business objectives.
Additionally, they should explore methods to minimize the environmental repercussions of their models. Then, sharing results and perspectives with stakeholders will broaden awareness of the environmental implications inherent in data-driven decision-making.
There are many more dimensions to this problem and I’ll think about it for a future story.
Interesting resources:
If you want to start thinking about your code carbon footprint:
Integrity is about being honest with yourself, consistent in your values, and acting according to your principles. So, the first thing is questioning yourself about your core values and the best way to incarnate them at work daily.
Then, resist external pressure and stay true to yourself. Also, don’t ignore the ethical challenges of the domain. They are an increasing concern for society, and we are responsible for providing solutions.
All Data Scientists just love exploring data and building models. That’s why Kaggle is so popular. As a professional, it is easy to be trapped by infinite data exploration, unbounded experiments, or model optimizations.
Value-centricity refers to an approach or mindset that places a strong emphasis on delivering value as the primary objective in decision-making, problem-solving, and overall strategy.
So, in the context of Data Science, being value-centric means you’ll have to keep your focus and employ your skills to create value rather than losing your time with the technical issues you’d like to solve to reach an optimal solution.
Once again, the best Data Scientists I worked with are exploring data on purpose, asking and answering the questions that will help them solve the right problem. Then, they conduct the minimum required experiments to draw a solution and build an MVP. They quickly go to production to see what happens, and iterate.
This path is made of numerous tradeoffs between optimizing things and adding incremental value to the end user.
One hard thing about being value-centric is recognizing that even if you’re not building a comprehensive data product by yourself, you’re building a piece of it, so you must conform to a product mindset and focus on the value you’ll ultimately generate for an end user.
Your decisions should always assess the time it will take to do something compared to the value it provides in terms of product. Some things are important but can be postponed for future iterations, other things are not interesting enough to be done.
When building a Data Science model, it is often possible to quickly assess if a model would yield good enough value depending on the expected performance and how it will impact the business.
For example, if you are not familiar with building custom scoring functions for evaluating your model based on nontechnical metrics, take a look at this: https://towardsdatascience.com/calculating-the-business-value-of-a-data-science-project-3b282de9be3c
At some point, you may also face the opportunity to improve model performance. This is the right time to understand what kind of improvements will be valuable. Is increasing your F1 score by 0.01 worth it compared to the effort you need to provide? Does it require collecting 100k new labeled data points?
As product owners and other software engineers might not understand all the technical aspects of developing a model, it is your responsibility to make these decisions.
Other interesting resources:
To become a better Data Scientist, you should focus on developing your curiosity, communication, integrity, scientific mindset, and value-centricity.
Most of the time, I would recommend reading books, but there are also numerous other valuable materials such as online courses and blog articles. Some of these skills can only be developed by being confronted with reality. Thus, be aware and prepared for the opportunities you’ll face.
If you would like to receive notifications for my upcoming posts regarding Data Science and more, please subscribe here or follow me.
Welcome to my series on Causal AI, where we will explore the integration of causal reasoning into machine learning models. Expect to explore a number of practical applications across different business contexts.
In the last article we explored making Causal Discovery work in real-world business settings. This time we will cover de-biasing treatment effects with Double Machine Learning.
If you missed the last article on Causal Discovery, check it out here:
Making Causal Discovery work in real-world business settings
This article will demonstrate why Double Machine Learning is an essential part of the Causal AI toolbox:
Expect to gain a deep understanding of:
The full notebook can be found here:
The Average Treatment Effect (ATE) is the average impact of a treatment or intervention on a population. We can calculate it by comparing the average change in a chosen metric between a treatment group and a control group.
For example, consider a marketing team running a promotion. The treatment group consists of customers who receive an offer, while the control group consists of customers who don't. We can calculate the ATE by comparing the average number of orders in the treatment and control groups.
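In code, this difference-in-means estimate is a one-liner (a toy illustration with hypothetical column names):

import pandas as pd

# df has one row per customer, with a binary 'received_offer' flag and an
# 'orders' count (hypothetical column names)
ate = (df.loc[df['received_offer'] == 1, 'orders'].mean()
       - df.loc[df['received_offer'] == 0, 'orders'].mean())

print(f'Estimated ATE of the offer: {ate:.2f} additional orders per customer')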
The potential outcomes framework was developed by Donald Rubin and has become a foundational concept in causal inference. Let's try to understand it using the marketing example above.
Several assumptions are made to help ensure the estimated effects are valid:
Estimating ATE with experimental data is relatively straightforward.
Randomised Controlled Trials (RCTs) or AB tests are designed to randomly assign participants to treatment and control groups. This ensures that any differences in outcomes can be attributed to the treatment effect rather than pre-existing characteristics of the participants.
Back to the example from the marketing team. If they randomly split customers between the treatment and control group, the average difference in orders is the causal effect of the offer sent.
Estimating ATE using observational data is more challenging.
The most common challenge is confounding variables which effect both the treatment and outcome. Failure to control for confounders will lead to biased estimates of the treatment effect. We will come back to this later in the article in the worked case study.
Other challenges include:
Linear regression can be used to estimate ATE using observational data. The treatment (T) and control features (X) are included as variables in the model.
The coefficient of the treatment variable is the ATE — the average change in the outcome variable associated with a unit change in the treatment variable, while keeping control features constant.
We can use a simple data-generating process with one outcome, treatment and confounder to illustrate how we can use linear regression to estimate ATE.
First of all we can visualise the causal graph:
import numpy as np

# Create node lookup variables
node_lookup = {0: 'Confounder',
               1: 'Treatment',
               2: 'Outcome'}

total_nodes = len(node_lookup)

# Create adjacency matrix - this is the base for our graph
graph_actual = np.zeros((total_nodes, total_nodes))

# Create graph using expert domain knowledge
graph_actual[0, 1] = 1.0  # Confounder -> Treatment
graph_actual[0, 2] = 1.0  # Confounder -> Outcome
graph_actual[1, 2] = 1.0  # Treatment -> Outcome

# plot_graph() is a plotting helper defined in the full notebook (linked above)
plot_graph(input_graph=graph_actual, node_lookup=node_lookup)
And then we can create samples using the simple data-generating process. Pay close attention to the coefficient of the treatment variable (0.75) — this is our ground truth ATE.
import pandas as pd
import seaborn as sns

np.random.seed(123)

# Create dataframe with a confounder, treatment and outcome
df = pd.DataFrame(columns=['Confounder', 'Treatment', 'Outcome'])
df['Confounder'] = np.random.normal(loc=100, scale=25, size=1000)
df['Treatment'] = np.random.normal(loc=50, scale=10, size=1000) + 0.50 * df['Confounder']
df['Outcome'] = 0.25 * df['Confounder'] + 0.75 * df['Treatment'] + np.random.normal(loc=0, scale=5, size=1000)

sns.pairplot(df, corner=True)
We can then train a linear regression model and extract the coefficient of the treatment variable — We can see that it correctly estimates the ATE (0.75).
from sklearn.linear_model import RidgeCV

# Set target and features
y = df['Outcome']
X = df[['Confounder', 'Treatment']]

# Train model
model = RidgeCV()
model = model.fit(X, y)

# Extract the treatment coefficient
ate_lr = round(model.coef_[1], 2)

print(f'The average treatment effect using Linear Regression is: {ate_lr}')
Linear regression can be a very effective method for estimating ATE. However, there are some challenges to be aware of:
Double Machine Learning is a causal method first introduced in 2017 in the paper “Double/Debiased Machine Learning for Treatment and Structural Parameters”:
Double/Debiased Machine Learning for Treatment and Causal Parameters
It aims to reduce bias and improve the estimation of causal effects in situations where we have high-dimensional data and/or complex nuisance parameters.
It is inspired by the Frisch-Waugh-Lovell theorem, so let’s start by understanding this.
The FWL theorem is used to decompose the effects of multiple regressors on an outcome variable, allowing us to isolate effects of interest.
Imagine you have two sets of features, X1 and X2. You could estimate the model parameters using linear regression like we did before. However, you can also recover the same coefficient for X1 by following these steps:
1. Regress X1 on X2 and keep the residuals.
2. Regress the outcome on X2 and keep the residuals.
3. Regress the outcome residuals on the X1 residuals; the resulting coefficient matches the X1 coefficient from the full regression.
At first glance this can be quite hard to follow, so let’s try it out in Python to illustrate. We use the same data as before, but take the treatment column as X1 and the confounder column as X2:
# Set treatment, outcome and confounder samples
treatment = df['Treatment'].to_numpy().reshape(-1,1)
outcome = df['Outcome'].to_numpy().reshape(-1,1)
confounder = df['Confounder'].to_numpy().reshape(-1,1)
# Train treatment model and calculate residuals
treatment_model = RidgeCV()
treatment_model = treatment_model.fit(confounder, treatment)
treatment_pred = treatment_model.predict(confounder)
treatment_residuals = treatment - treatment_pred
# Train outcome model and calculate residuals
outcome_model = RidgeCV()
outcome_model = outcome_model.fit(confounder, outcome)
outcome_pred = outcome_model.predict(confounder)
outcome_residuals = outcome - outcome_pred
# Train residual model and calculate average treatment effect
final_model = RidgeCV()
final_model = final_model.fit(treatment_residuals, outcome_residuals)
ate_fwl = round(final_model.coef_[0][0], 2)

print(f'The average treatment effect using FWL is: {ate_fwl}')
We can see that it correctly estimates the coefficient of the treatment variable (0.75).
Double Machine Learning builds upon FWL by isolating the effects of treatment and control features and by using flexible machine learning models.
The first stage is often referred to as orthogonalization, as the nuisance parameters are estimated independently of the treatment effect estimation.
First stage:
1. Treatment model (debiasing): a flexible machine learning model predicts the treatment from the control features, and we keep the residuals.
2. Outcome model (denoising): a flexible machine learning model predicts the outcome from the control features, and we keep the residuals.
Second stage:
The treatment residuals from the first stage are used to predict the outcome residuals with a linear model.
The coefficient of the second stage model is the ATE. It is worth noting that the second stage model is a linear model, meaning we are assuming our treatment effect is linear (this is why we call DML a partially linear model).
Rather than code it up ourselves, we can use Microsoft's EconML package. EconML implements a wide range of Causal ML techniques, including several implementations of DML:
Welcome to econml’s documentation! – econml 0.15.0 documentation
from econml.dml import LinearDML

# Train DML model
dml = LinearDML(discrete_treatment=False)
dml.fit(df['Outcome'].to_numpy().reshape(-1,1),
        T=df['Treatment'].to_numpy().reshape(-1,1),
        X=None,
        W=df['Confounder'].to_numpy().reshape(-1,1))

# Calculate average treatment effect
ate_dml = round(dml.ate()[0], 2)

print(f'The average treatment effect using the DML is: {ate_dml}')
We again can see that it correctly estimates the coefficient of the treatment variable (0.75).
The Marketing team send attractive offers to selected customers. They don’t currently hold out a randomly selected sample of customers to measure the impact of the offers.
The Data Science team is asked to estimate how the offers affect customer orders.
Naively comparing customers who were and weren’t sent offers is biased. This is driven by confounding factors:
We set up a data generating process with the following characteristics:
The X features are customer characteristics taken before the treatment:
T is a binary flag indicating whether the customer received the offer.
import numpy as np
from scipy.special import expit

np.random.seed(123)

# Set number of observations
n = 100000

# Set number of features
p = 10

# Create features
X = np.random.uniform(size=n * p).reshape((n, -1))

# Nuisance parameters
b = (
    np.sin(np.pi * X[:, 0] * X[:, 1])
    + 2 * (X[:, 2] - 0.5) ** 2
    + X[:, 3]
    + 0.5 * X[:, 4]
    + X[:, 5] * X[:, 6]
    + X[:, 7] ** 3
    + np.sin(np.pi * X[:, 8] * X[:, 9])
)

# Create binary treatment
T = np.random.binomial(1, expit(b))

# Set treatment effect
tau = 0.75

# Calculate outcome
y = b + T * tau + np.random.normal(size=n)
The data-generating process Python code is based on the synthetic data creator from Uber's CausalML package. Being able to create realistic synthetic data is crucial when it comes to assessing causal inference methods, so I highly recommend you check it out:
causalml/causalml/dataset/regression.py at master · uber/causalml
We start by using linear regression to estimate the ATE. Our expectation is that it will struggle to capture the nuisance parameters and will therefore mis-estimate the treatment effect.
# Append features and treatment
X_T = np.append(X, T.reshape(-1, 1), axis=1)
# Train linear regression model
model = RidgeCV()
model = model.fit(X_T, y)
y_pred = model.predict(X_T)
# Extract the treatment coefficient
ate_lr = round(model.coef_[-1], 2)
print(f'The average treatment effect using Linear Regression is: {ate_lr}')
We then train a DML model, using LightGBM models as the flexible first-stage learners. This should allow us to capture the difficult nuisance parameters whilst correctly calculating the treatment effect.
from lightgbm import LGBMRegressor, LGBMClassifier

np.random.seed(123)

# Train DML model using flexible stage 1 models
dml = LinearDML(model_y=LGBMRegressor(), model_t=LGBMClassifier(), discrete_treatment=True)
dml.fit(y, T=T, X=None, W=X)

# Calculate average treatment effect
ate_dml = round(dml.ate(), 2)

print(f'The average treatment effect using the DML is: {ate_dml}')
When we compare the results, we observe that linear regression gives us a biased estimate whilst DML is very close to the ground truth. This really shows the power of DML!
import matplotlib.pyplot as plt

# Plot comparison of results
categories = ['Ground truth', 'DML', 'Linear Regression']

sns.barplot(x=categories, y=[tau, ate_dml, ate_lr])
plt.ylabel('ATE')
plt.title('Average Treatment Effect comparison')
plt.show()
There are several other causal methods we can use to estimate ATE (many of which are implemented in both the EconML and CausalML packages):
If you want to delve into these methods further, I would recommend starting with the S-Learner and T-Learner (often referred to as meta-learners). A couple of key learnings to help you start to work out when and where you could apply them:
Follow me if you want to continue this journey into Causal AI — In the next article we will explore how we can estimate Conditional Average Treatment Effects (CATE) with Double Machine Learning to help us personalise treatments at a customer level.