CZNet

A watershed catchment is a place where all the water in a specific area of a watershed comes together before moving out of that area. Sensors in catchments generate thousands of data points on flow rate, turbidity, and chemical properties. Over the years, that adds up to massive caches of data.

That raw data is referred to as “dirty data”. It’s filled with signals that are not helpful for studying what’s going on in the catchment or in the watershed as a whole. Everything from an unusual storm to an animal stumbling through the water sensor array becomes an anomaly in the data. These anomalies often show up as “peaks” when visualized.

Cluster data scientists have zeroed in on those peaks as they begin developing machine learning models to clean the anomalies out of the big data sets. They have labeled the peaks with names like “skyrocketing” or “plummeting”, and then set to work writing models that recognize those anomalous peaks and remove them from the data sets.
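To make the idea of labeling and removing peaks concrete, here is a minimal sketch in Python. It is not the Cluster's machine learning model: it swaps in a simple statistical rule (distance from the median of neighboring readings, scaled by the median absolute deviation) purely for illustration. The function names, the window size, the threshold, and the sample flow readings are all assumptions made up for this example.

```python
from statistics import median


def label_peaks(values, window=5, threshold=5.0, min_spread=1e-6):
    """Label each reading 'skyrocketing', 'plummeting', or 'normal'.

    Illustrative rule only (an assumption, not the Cluster's model):
    a reading is flagged when it sits more than `threshold` median
    absolute deviations (MAD) away from the median of its neighbors.
    """
    labels = []
    half = window // 2
    for i, value in enumerate(values):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        # Neighbors on both sides, excluding the reading itself.
        neighbors = values[lo:i] + values[i + 1:hi]
        if not neighbors:
            labels.append("normal")
            continue
        center = median(neighbors)
        spread = median(abs(n - center) for n in neighbors)
        # Floor the spread so perfectly flat neighborhoods do not divide by zero.
        score = (value - center) / max(spread, min_spread)
        if score > threshold:
            labels.append("skyrocketing")
        elif score < -threshold:
            labels.append("plummeting")
        else:
            labels.append("normal")
    return labels


def clean(values, labels):
    """Replace flagged readings with None so they can be reviewed or gap-filled."""
    return [v if lab == "normal" else None for v, lab in zip(values, labels)]


# Hypothetical flow-rate readings with one spike and one drop.
flow = [2.1, 2.2, 2.0, 9.8, 2.1, 2.3, 0.1, 2.2, 2.1]
labels = label_peaks(flow)
cleaned = clean(flow, labels)
```

On these made-up readings, the rule flags 9.8 as a “skyrocketing” peak and 0.1 as a “plummeting” one, and leaves the other seven readings untouched. A trained model would aim to make the same kind of call on far messier data, where a fixed threshold like this one breaks down.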

This cleaning is currently done by hand, with painstaking effort, and takes up a significant amount of any researcher’s time. Once reliably developed, the machine learning models will be able to accomplish in minutes what would take a human many hours.

Current research in the Cluster has computer scientists generating synthetic data and then comparing it with actual data from samples collected in the field. So far, the results are promising. Ijaz Ul Haq, Byung Lee, and others in the Big Data Cluster hope that one day an environmental scientist will be able to plug these models into their data sets and spend more time asking important questions and less time cleaning data points by hand.

Machine Learning Research Leads off Big Data Sessions on Day 3 of AGU22