According to the NOAA National Centers for Environmental Information (NCEI), the past nine years have been the nine warmest since record-keeping began in 1880, and 2023 is on track to set another record. Making the hard decisions involved in addressing this global crisis requires analyzing and understanding enormous quantities of data using sophisticated computational methods. Luckily, innovations from the open-source software community are helping revolutionize climate science, letting researchers quickly derive insights from data.
Delivering Critical Collaboration
Modern science relies heavily on complex computational and numerical analysis techniques. However, the massive amount of measurement and simulation data available in fields like climate science makes traditional approaches impractical. With petabyte-scale data, researchers can no longer expect to download or obtain physical copies of entire datasets, nor can their local workstations or laptops handle datasets of that size once they arrive. Moving the scientists to the data is also impractical since researchers all over the world need to make use of the same data.
Instead, scientists and engineers from Pangeo.io and other global collaborations have developed a robust set of open-source tools for working efficiently with large, remote datasets, freeing researchers from proprietary software that can only handle small, locally installed datasets. Rather than relying on monolithic commercial applications or inflexible Fortran codebases, modern researchers use flexible, lightweight approaches for distributing computation across cloud resources shared among many collaborators. These innovations are critical for climate science, where every further delay in understanding the data makes combating global warming that much harder.
Deriving Faster, Flexible Data for Greater Output at Scale
Python’s scientific ecosystem, in particular, has become a leading choice among open-source platforms for building climate solutions. Now that Python is the world’s most popular programming language, incoming graduate students are more likely to know it than any other language. Its open-source ecosystem consists of a wide variety of packages, each focused on a specific area of functionality; these components evolve quickly and regularly gain new features while remaining compatible through shared standards and protocols. Compared with a monolithic, centrally designed system, this open ecosystem offers far more flexibility as data volumes and solutions scale.
In high-impact fields like climate science, the goal is to enable scientists and researchers to concentrate on their domain-specific questions rather than becoming entangled in the intricacies of data manipulation and resource management. The numerical Python stack of open-source tools, combined with the ability to access available data in the cloud on a massive scale, facilitates this division of labor. Scientists can focus on writing software for the analysis parts specific to their expertise, meaning faster progress toward climate solutions and products.
With modern cloud-computing approaches, it makes more sense to move computation to the data rather than the other way around, since code is much smaller and more portable than data. With a platform like Jupyter running on a remote cluster and remote Python kernels as the execution environment, scientists can work on the same infrastructure that hosts massive datasets in cloud object stores. Improvements in software and data formats also make it simple to access specific chunks of very large files, such as the temperature records for one particular city out of a global dataset. Data access becomes fast, cost-effective and scalable, replacing traditional workflows that repeatedly download large datasets made up mostly of data not needed for the analysis at hand.
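To make that concrete, here is a minimal sketch of what chunked, remote access can look like using xarray with a Zarr store; the bucket URL, variable names and coordinates are hypothetical placeholders rather than a real dataset.

```python
# A minimal sketch: read one city's temperature series from a large
# cloud-hosted, chunked dataset without downloading the whole thing.
# The store path and variable names below are hypothetical placeholders.
import xarray as xr

# Open a Zarr store lazily: only metadata is fetched at this point.
# (Reading a gs:// path also requires the gcsfs package to be installed.)
ds = xr.open_zarr(
    "gs://example-climate-bucket/surface-temperature.zarr",  # hypothetical store
    consolidated=True,
)

# Select the grid cell nearest to Berlin; only the chunks covering that
# point are actually transferred from object storage.
berlin = ds["t2m"].sel(latitude=52.52, longitude=13.40, method="nearest")

# Reduce the small slice to an annual-mean time series and pull it locally.
annual_mean = berlin.resample(time="1YS").mean().compute()
print(annual_mean)
```

Because the selection happens before any computation is triggered, only the handful of chunks covering that grid cell ever leave the object store.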
For climate science specifically, this flexible remote-computing approach also makes it easier for developers and scientists to collaborate on innovations that move climate studies and solutions forward. Distributed computing technologies such as Dask, Ray and Spark enable parallel and out-of-core computing, allowing larger datasets to be processed at higher speeds. Several remote Jupyter-based services now exist that let even the main Python process (the “kernel”) run remotely while the browser serves as the interactive development environment. Because all these tools are open, researchers can apply emerging machine learning (ML) and artificial intelligence (AI) techniques as soon as they are developed, without waiting for software and hardware vendors to catch up.
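As a rough illustration of the out-of-core model Dask uses, the sketch below builds a chunked array and a lazy task graph that only executes when a result is requested; the array shape, chunk sizes and cluster setup are illustrative assumptions, not a prescription.

```python
# A minimal sketch of parallel, out-of-core computation with Dask.
# Shapes, chunk sizes and the cluster setup here are illustrative only.
import dask.array as da
from dask.distributed import Client

# Connect to a Dask cluster; with no arguments this starts a local test
# cluster (on shared infrastructure you would pass the scheduler address).
client = Client()

# A large array split into chunks that each fit comfortably in memory.
temps = da.random.random((100_000, 5_000), chunks=(5_000, 5_000))

# Operations only build a task graph; nothing runs until .compute() is
# called, and then the work is spread across the available workers.
zonal_mean = temps.mean(axis=1)
print(zonal_mean[:5].compute())

client.close()
```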
Creating Data Visualizations and Dashboarding
Understanding and validating both the data and each analysis step requires visualization at every stage, which is difficult when the data is too large to transfer locally for display. New open-source tools like Datashader and VegaFusion provide server-side rendering of even the largest datasets, efficiently turning distributed data into a viewable image and transferring only that image to a scientist’s local browser for display. Matplotlib, HoloViews, Altair, Plotly and hvPlot in Python, along with Makie.jl for Julia, can now embed server-side rendering into fully interactive plots in a web browser, letting researchers explore datasets, identify interesting features, and spot problems with data quality or analysis steps. The Panel and Dash dashboarding tools in Python allow these plots to be combined into complete web applications that researchers can share as interactive analyses, all without any proprietary tools or restrictions (one public example is a USGS wind turbine application).
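The sketch below shows roughly how such a server-side rendered plot can be wrapped into a shareable Panel app using hvPlot’s Datashader integration; the CSV path and column names are hypothetical.

```python
# A minimal sketch of server-side rendering plus dashboarding with
# hvPlot/Datashader and Panel. The file path and columns are hypothetical.
import pandas as pd
import hvplot.pandas  # noqa: registers the .hvplot accessor on DataFrames
import panel as pn

pn.extension()

df = pd.read_csv("turbines.csv")  # hypothetical point dataset with lon/lat columns

# rasterize=True hands rendering to Datashader, so only an aggregated image
# (not millions of individual points) is sent to the browser.
points = df.hvplot.points(
    "lon", "lat", rasterize=True, cmap="viridis", width=700, height=400
)

# Wrap the interactive plot in a Panel app that can be served to collaborators.
app = pn.Column("## Turbine locations", points)
app.servable()  # run with: panel serve this_script.py
```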
In a field with such a dire need for fast innovation, open-source technology is crucial to providing the creativity and speed required for climate scientists to fight global warming. The collaborative and transparent delivery of innovation is what the scientific community is increasingly counting on.
Dr. Martin Durant, staff software engineer at Anaconda, contributed to this article.