Data, Data Everywhere Without a Way to Sync


"Look before you leap." Some adages make more sense as life goes on and international climate assessment reports come out. Uncertainties are "certain" barriers to prediction, and even the slightest reduction in the uncertainty of climate predictions is a big win.

Here in the United States, almost all of us treat traffic lights the same way, adhering to a common set of rules. Some countries are not as rigorous in their traffic law enforcement, and traffic lights there serve a more decorative than practical function. I have been to those countries and traveled in an environment where everyone has different measures and interpretations; it’s no bed of roses. However, this article is not about traffic lights. It’s about an equally dangerous subject: data standards in climate modeling.

Wait! Don’t close your browser! This is not a story of climate change. It is a tale of thousands of scientists who do their best to take accurate measurements and share that data with other scientists and decision makers.

About every six years, climate model output is published through the Coupled Model Intercomparison Project (CMIP), a community-based way to compare climate models. During the last Intergovernmental Panel on Climate Change (IPCC) cycle, about 27 modeling centers and 58 global climate models participated, resulting in about 750 research articles. These articles lay the foundation for understanding climate change and inform the decision-making bodies, providing the world with a clear scientific view of the current state of knowledge on climate change and its potential environmental and socioeconomic impact.

We expect 33 centers (75 models) to participate in the next CMIP. The last CMIP (the fifth) generated one petabyte of data. For those of you keeping score at home, that’s a million gigabytes! The upcoming CMIP6 is projected to produce 20 petabytes. Data, data everywhere. But if it’s not usable (for example, because it’s not standardized), it’s just not worth it. So, what makes it usable?

Data provenance and documentation

To err is human; to err is machine.

If we cannot trace outputs back to their inputs (that is, establish their provenance), the credibility of the data source weakens. Data provenance is like showing your work on your Physics 101 exam, or footnoting your term paper: the scientific way of saying, "we aren’t making this stuff up."

At the Geophysical Fluid Dynamics Laboratory, the Engility team works towards innovative solutions to address data provenance. Over ten years ago, we helped develop a workflow that revolved around curator technology to harvest metadata from every experiment configuration being run and seamlessly translate it into netCDF file metadata. Think of it as a universal language for climate data. The great vision and resources that went into making this happen are highly commendable. A decade later, the vision and usefulness remain the same, and the technology is being reshaped into a new incarnation.
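To make the idea of carrying experiment configuration into file metadata a little more concrete, here is a minimal sketch in Python. It is not the curator workflow itself: the function name, the configuration keys, and the `config_checksum` attribute are all hypothetical, and the result is a plain dictionary standing in for the global attributes one might write to a netCDF file.

```python
import hashlib
import json


def config_to_global_attrs(config: dict) -> dict:
    """Turn an experiment configuration into netCDF-style global attributes.

    Hypothetical helper for illustration only; the keys read from `config`
    are assumptions, not the schema of any real workflow.
    """
    # Serialize with sorted keys so the same configuration always hashes the same way.
    canonical = json.dumps(config, sort_keys=True)
    checksum = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return {
        "institution": config["institution"],
        "source": config["model"],
        "experiment": config["experiment"],
        # Hypothetical attribute: lets a reader trace the file back to its exact inputs.
        "config_checksum": checksum,
        "history": f"generated from experiment configuration {checksum[:12]}",
    }


# Made-up configuration values, used only to exercise the function.
example_config = {
    "institution": "Example Modeling Center",
    "model": "example-model-1",
    "experiment": "historical",
    "initial_conditions": "restart_1850.nc",
}
attrs = config_to_global_attrs(example_config)
```

The point of the checksum is simply that two files produced from the same configuration carry the same fingerprint, while any change to the inputs shows up in the metadata.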

Quality control

Out goes the outlier.

By applying quality control (QC) to the datasets, researchers like me add significantly to the credibility of the model output that gets published. With multiple dimensions involved, the challenges only get bigger, and so does the size of the data.

We strive to establish best practices in this realm. We continue to find user-friendly and effective ways to enable scientific data QC that spots the outliers and more: live monitoring of climate simulations, and QC frameworks with basic statistics and visualizations for fields to be published in climate assessment reports. We place equal importance on QC of "data standards" and "metadata," as is evident from the automated tools and checkers we employ to spot non-compliant fields. We believe in team effort and collaboration to provide solutions to simple and complex problems.
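As a rough illustration of what "basic statistics" and a metadata checker can look like, here is a small Python sketch. It is an assumption-laden toy, not our QC framework: the function names are hypothetical, the outlier test is a common median-based (modified z-score) rule with a conventional 3.5 cutoff, and the required attributes are just the `standard_name` and `units` attributes familiar from CF-style metadata.

```python
import statistics


def flag_outliers(values, threshold=3.5):
    """Return the indices of values that sit far from the median.

    Uses the modified z-score, which is based on the median absolute
    deviation (MAD) and so is not dragged around by the outliers themselves.
    """
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        # At least half the values equal the median; no usable scale to judge against.
        return []
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - median) / mad > threshold
    ]


def missing_attributes(attrs, required=("standard_name", "units")):
    """Return the required metadata attributes that a field lacks or leaves empty."""
    return sorted(name for name in required if not attrs.get(name))


# Made-up global-mean temperatures (degrees C) with one obviously bad value.
series = [14.1, 14.3, 13.9, 14.2, 14.0, 41.0, 14.1]
suspect_indices = flag_outliers(series)

# A field that declares its units but not what it measures.
problems = missing_attributes({"units": "K"})
```

Here `suspect_indices` points at the 41.0 reading, and `problems` reports that `standard_name` is missing. A real framework would add visualizations and run across many dimensions, but the principle is the same.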

Connecting data standards with climate analytics

Complex and convoluted methods are not necessary to build frameworks that contribute to useful research.

Think simple, think "standards."

The Geophysical Fluid Dynamics Laboratory applies innovative scientific analysis scripts that study various climate phenomena. To build better models and better predictions, the lab has a pressing need to apply these scripts to models developed at different modeling centers around the world. This paves the way to generate metrics for global climate models and perform multi-model comparisons. Innovation is needed to bridge the gap between climate analytics and the petabytes of data from all over the world. The key to designing frameworks that address this lies in the realm of "climate data standards." My mentor once said to me, "Developing data standards is indeed a thankless task, but the benefits are enormous."

What happens when we speak a language whose words have ambiguous meanings? Chaos, just like the insanity of driving in a city where no one obeys traffic signals. Well-documented data that adheres to the Climate and Forecast (CF) conventions, the Data Reference Syntax, and CMIP standards makes our lives much easier: it provides the hooks for distributed data access and visualization solutions that let us apply climate analytics to a multitude of data from other modeling centers. Whether it is global climate model output or statistically downscaled data, data standards are of paramount importance for reproducible and reliable research, uniting the best of all worlds.
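To see why an agreed-upon naming syntax provides those "hooks," consider this small Python sketch. The facet names and their order are only loosely modeled on CMIP-style directory layouts and should be treated as an assumption, not the official specification; the function names and example values are likewise illustrative.

```python
# Assumed facet order for illustration; consult the official Data Reference
# Syntax documents for the authoritative template.
FACETS = (
    "mip_era", "activity_id", "institution_id", "source_id", "experiment_id",
    "member_id", "table_id", "variable_id", "grid_label", "version",
)


def build_path(facets: dict) -> str:
    """Build a directory path from a complete set of facets."""
    missing = [name for name in FACETS if name not in facets]
    if missing:
        raise ValueError(f"missing facets: {', '.join(missing)}")
    return "/".join(facets[name] for name in FACETS)


def parse_path(path: str) -> dict:
    """Recover the facets from a path built with the same template."""
    parts = path.strip("/").split("/")
    if len(parts) != len(FACETS):
        raise ValueError(f"expected {len(FACETS)} path components, got {len(parts)}")
    return dict(zip(FACETS, parts))


# Illustrative values only.
example = {
    "mip_era": "CMIP6", "activity_id": "CMIP", "institution_id": "NOAA-GFDL",
    "source_id": "GFDL-CM4", "experiment_id": "historical",
    "member_id": "r1i1p1f1", "table_id": "Amon", "variable_id": "tas",
    "grid_label": "gr1", "version": "v00000000",
}
path = build_path(example)
facets = parse_path(path)
```

Because every center lays out its files the same way, an analysis script can go from a path to "which model, which experiment, which variable" without a human in the loop, and that is what makes multi-model analytics practical.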

Thank You

This blog post is to thank the numerous people across the world who take part in the "thankless" task of establishing data standards so that we can produce more meaningful scientific findings. We continuously look for opportunities to incorporate newer technologies, and we also strive to credit the data creators who contribute to open-source and reproducible research. Jupyter notebooks and Digital Object Identifiers (DOIs) are two classic examples: Python Jupyter notebooks help us not only develop live code but also document it and share it with collaborators and the public for reproducible research, while data DOIs make our data citable, giving credit to the data creators.

Posted by Aparna Radhakrishnan

I work within Engility’s Space and Mission Systems Group. Since 2009, I have represented the Engility team, working with the great minds at NOAA’s Geophysical Fluid Dynamics Laboratory towards the overarching goal of the betterment of our planet. I have served as the Modeling Systems Group liaison for the PMEL Data Integration Group with NOAA PMEL, GFDL’s Empirical Statistical Downscaling, Ensemble Climate Data Assimilation projects, ExArch-NSF exa-scale initiative and the Earth System Grid Federation. I develop and promote the use of collaborative software and data standards to support climate research.