A beginner’s guide
I. Introduction
Deep learning has spread with great success through Earth Observation (EO). Its achievements have led to ever more complex architectures and methodologies. However, in this process we have lost sight of something important: it is often better to have more high-quality data than a better model.
Unfortunately, the development of EO datasets has been messy. Nowadays there are hundreds of them, and despite several efforts to compile them, it is fair to say that they are scattered all over. Additionally, EO datasets have proliferated to serve very specific needs. Paradoxically, this is the opposite of the direction we should be moving in, especially if we want our deep learning models to generalize better.
For instance, ImageNet compiled millions of images to better train computer vision models. Yet EO data is more complex than the images in ImageNet. Unfortunately, there has been no similar initiative for EO purposes. This forces the EO community to adapt the ImageNet resource to our needs, a process that is time-consuming and prone to errors.
Additionally, EO data has an uneven spatial distribution. Most of the data covers North America and Europe. This is a problem since climate change will affect developing countries more.
In my last article, I explored how computer vision is changing the way we tackle climate change. This new article is motivated by the challenges of choosing EO data. I aim to simplify this important first step when we want to harness the power of AI for good.
This article will answer questions such as:

- What do I need to know about EO data to be able to find what I am looking for?
- In a sea of data resources, where should I start my search?
- Which are the most cost-effective solutions?
- What are my options if I have the resources to invest in high-quality data or computing power?
- Which resources will speed up my results?
- How should I best invest my learning time in data acquisition and processing?

We will start by addressing the following question: what type of image data should I focus on to analyze climate change?
II. The Power of Remote Sensing Data
There are several types of image data relevant to climate change: for example, aerial photographs, drone footage, and environmental monitoring camera feeds. But remote sensing data (e.g., satellite images) offers several advantages. Before describing them, let's define what remote sensing is.
Remote sensors collect information about objects without being in physical contact with them. Remote sensing works on the physical principle of reflectance: sensors capture the ratio of the light reflected by a surface to the light incident on it. Reflectance can reveal the properties of surfaces; for example, it helps us discriminate vegetation, soil, water, and urban areas in an image. Different materials have different spectral reflectance properties, meaning they reflect light at different wavelengths. By analyzing reflectances across various wavelengths, we can infer the composition of the Earth's surface and also detect environmental changes.
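A classic example of comparing reflectance across wavelengths is the Normalized Difference Vegetation Index (NDVI), which contrasts near-infrared (NIR) and red reflectance to highlight vegetation. Here is a minimal sketch with NumPy; the reflectance values are made up for illustration:

```python
import numpy as np

def ndvi(nir, red, eps=1e-9):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red).
    Values near +1 suggest dense vegetation, values near 0 suggest bare
    soil, and negative values often indicate water."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    return (nir - red) / (nir + red + eps)  # eps avoids division by zero

# Toy 2x2 reflectance bands (fractions of incident light)
nir_band = np.array([[0.50, 0.45], [0.10, 0.30]])
red_band = np.array([[0.08, 0.10], [0.09, 0.28]])
print(np.round(ndvi(nir_band, red_band), 2))
```

The top row (high NIR, low red) behaves like vegetation; the bottom row behaves like soil or built surfaces.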
Besides reflectance, there are other remote sensing concepts that we should understand.
Spatial resolution: the size of the smallest observable object in a scene. In other words, we will not be able to see entities smaller than the resolution of the image. For example, imagine we have a satellite image of a city with a resolution of 1 km. This means each pixel in the image represents an area of 1 km by 1 km of the urban area. If there is a park in the scene smaller than this area, we will not see it, at least not clearly. But we will be able to see roads and big buildings.
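The reasoning in the city example can be written as a quick check: an object is only reliably distinguishable when it spans at least a pixel or two. The two-pixel threshold below is a rule of thumb, not a standard:

```python
def is_detectable(object_size_m, resolution_m, min_pixels=2):
    """Rough visibility check: an object should span at least
    `min_pixels` pixels along one side to be distinguishable."""
    return object_size_m / resolution_m >= min_pixels

# With 1 km (1000 m) pixels, a 300 m park disappears,
# but a 5 km stretch of road or a large block does not.
print(is_detectable(300, 1000))    # False
print(is_detectable(5000, 1000))   # True
```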
Spectral resolution: the number and width of the wavebands a sensor measures. The wavebands correspond to ranges of frequencies of electromagnetic radiation. There are three main types of spectral resolution. Panchromatic data capture a single broad waveband, typically spanning the visible range; these are also called optical data. Multispectral data capture several wavebands at the same time; color composites use these data. Hyperspectral data have hundreds of narrow wavebands, which allows much more spectral detail in the image.
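One way to picture these three levels of spectral resolution is as image cubes that differ only in their band count. A toy NumPy sketch; the band counts are illustrative (Sentinel-2 carries 13 multispectral bands, and the AVIRIS hyperspectral sensor records about 224):

```python
import numpy as np

height, width = 64, 64  # toy scene size

# Each cube is (rows, cols, bands); only the number of bands differs.
panchromatic = np.zeros((height, width, 1))     # one broad band
multispectral = np.zeros((height, width, 13))   # Sentinel-2-like band count
hyperspectral = np.zeros((height, width, 224))  # AVIRIS-like band count

for name, cube in [("panchromatic", panchromatic),
                   ("multispectral", multispectral),
                   ("hyperspectral", hyperspectral)]:
    print(f"{name}: {cube.shape[2]} band(s)")
```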
Temporal resolution: also referred to as the revisit cycle. It is the time it takes a satellite to return to the same spot on Earth to collect data again.
Swath width: the width of the ground strip imaged by the satellite in a single pass.
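Among these concepts, temporal resolution translates directly into how many scenes of a location we can collect. A back-of-the-envelope sketch in Python; the 16-day cycle mirrors Landsat's and is used here only for illustration:

```python
from datetime import date, timedelta

def acquisition_dates(start, end, revisit_days):
    """List the dates a satellite images the same spot,
    given its revisit cycle in days."""
    dates, current = [], start
    while current <= end:
        dates.append(current)
        current += timedelta(days=revisit_days)
    return dates

# A 16-day revisit cycle (Landsat-like) over one year
scenes = acquisition_dates(date(2023, 1, 1), date(2023, 12, 31), 16)
print(len(scenes))  # 23 scenes in a year
```

In practice cloud cover thins this out further, which is why temporal resolution matters so much for monitoring change.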
Now that we know the basics of remote sensing, let's discuss its advantages for researching climate change. Remote sensing data allows us to cover large areas. Also, satellite images often provide continuous data over time. Equally important, sensors can capture diverse wavelengths, enabling us to analyze the environment beyond the capabilities of human vision. Finally, the most important advantage is accessibility: remote sensing data is often public, which makes it a cost-effective source of information.
As a next step, we will learn where to find remote sensing data. Here we have to make a distinction: some data platforms provide satellite images, while computing platforms allow us to process data and often also host data catalogs. We will explore data platforms first.
III. Geospatial Data Platforms
Geospatial data is ubiquitous nowadays. The following table describes, to my knowledge, the most useful geospatial data platforms. The table privileges open data but also includes a couple of commercial platforms. These commercial datasets can be expensive but are worth knowing about: they provide high spatial resolution (ranging from 31 to 72 cm) for many applications.
This section presented several data platforms, but it is worth acknowledging something: the size and volume of geospatial data are growing, and everything indicates this trend will continue. Thus, it is becoming impractical to keep downloading images from platforms, since this approach demands local computing resources. Most likely, we will pre-process and analyze data in cloud computing platforms instead.
IV. Geospatial Cloud Computing Platforms
Geospatial cloud platforms offer powerful computing resources. Thus, it makes sense that these platforms provide their own data catalogs. We will review them in this section.
1. Google Earth Engine (GEE)
This platform provides several Application Programming Interfaces (APIs) to interact with. The main APIs run in two programming languages: JavaScript and Python. The original API uses JavaScript. Since I am more of a Pythonista, this was intimidating for me at the beginning, although the JavaScript knowledge you actually need is minimal; it is more important to master the GEE built-in functions, which are very intuitive. The Python API came later, and it is where we can unleash the full power of the GEE platform: it allows us to take advantage of Python's machine-learning libraries. The platform also lets us develop web apps to deploy our geospatial analyses, although the web app functionalities are pretty basic. As a data scientist, I am more comfortable using Streamlit to build and deploy my web apps, at least for minimum viable products.
2. Amazon Web Services (AWS)
AWS offers a range of capabilities. It provides access to many geospatial data sources, including open data and commercial third-party providers, and it can also integrate our own satellite imagery or mapping data. The platform facilitates collaboration by letting us share data with our team. Its robust computing capabilities allow us to efficiently process large-scale geospatial datasets in a standardized environment supported by open-source libraries. Equally important, it accelerates model building by providing pre-trained machine-learning models. Within the AWS environment, we can also generate high-quality labels, deploy our models or containers to start making predictions, and explore those predictions through its visualization tools.
I came across this platform a couple of days ago. The platform displays several geospatial datasets with varied spatial and temporal resolutions. Additionally, it offers an advantage over GEE and AWS as it does not require coding. We can perform our analyses and visualizations on the platform and download the results. The range of analyses is somewhat limited, as one might expect, since it does not require coding. However, it can be enough for many studies or at least for quick preliminary analyses.
4. Colab
This is another fascinating Google product. If you have ever used a Jupyter Notebook on your local computer, you are going to love Colab. As with Jupyter Notebooks, it allows us to perform analyses with Python interactively, but Colab does it in the cloud. I see three main advantages to using Google Colab for our geospatial analyses. First, Colab provides Graphics Processing Unit (GPU) capabilities; GPUs are efficient at the kind of parallel computation that deep learning requires. Additionally, Colab provides current versions of data science libraries (e.g., scikit-learn, TensorFlow, etc.). Finally, it allows us to connect to GEE, so we can take advantage of GEE's computing resources and data catalog.
5. Kaggle
The famous platform for data science competitions also provides capabilities similar to Colab. With a Kaggle account, we can run Python notebooks interactively in the cloud. It also has GPU capabilities. The advantage of Kaggle over Colab is that it provides satellite image datasets.
V. Conclusion
As we have seen, getting started with data acquisition is not a trivial task. There is a plethora of datasets developed for very specific purposes. Since the size and volume of these datasets have increased, it does not make sense to try to run our models locally. Nowadays we have fantastic cloud computing resources. These platforms even provide some free capabilities to get started.
As a gentle reminder, the best thing we can do to improve our modeling is to use better data. As users of these data, we can help pinpoint the gaps in this arena. Two are worth highlighting. First, the lack of a general-purpose benchmark dataset designed for EO. Second, the limited spatial coverage of developing countries.
My next article will explore the preprocessing techniques for image data. Stay tuned!
References
- Lavender, S., & Lavender, A. (2023). Practical handbook of remote sensing. CRC Press.
- Schmitt, M., Ahmadi, S. A., Xu, Y., Taşkın, G., Verma, U., Sica, F., & Hänsch, R. (2023). There are no data like more data: Datasets for deep learning in earth observation. IEEE Geoscience and Remote Sensing Magazine.
Image Data Collection for Climate Change Analysis was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.