Harnessing big data in Earth science – Transcription
Host: We will get started with our sessions for the afternoon.
I want to make sure that we have all the right pieces to get started with. So we’re going to get started with Brian Johnson, talking about harnessing big data in Earth science.
Brian Johnson: Thank you, this is my first conference, I’m new to the community. I’m Brian Johnson, director of analytics in Earth Lab. I will give you an overview of what Earth Lab is; it is about two years old or so. It was initiated under CU’s Grand Challenge, which kicked off about three years ago; this is one of the initiatives they funded. I will talk about what Earth Lab is, our approach to addressing challenges with big data technologies, and applying those in Earth science using big data from Earth observations, including satellite data and ground systems, for example. So, as I mentioned, Earth Lab started about two years ago, in September 2015. We kicked off Earth Lab under the Grand Challenge initiative, and it is really an Earth system science synthesis center. What we’re trying to accomplish is to take advantage of the strong Earth and social science work that is already being done at the university, and to bring big data technologies and some of the new approaches for data mining, machine learning, deep learning, and statistical analysis to try to uncover new insights, new trends, or spatial patterns in Earth observations. We are focused on trying to understand environmental change and the consequences of those changes, so society can better adapt and manage. And so we’re using a range of different kinds of observations and then trying to integrate those observations with other information. So we’re more of a consumer of maps and map products than a producer of those things. But we’re learning how to bring in that kind of information, together with satellite remote sensing, airborne remote sensing, and other socio-economic data, to gain insights on environmental change, and also societal impacts. So Earth Lab has a director, Jennifer Balch; the deputy director is Bill Travis; they are both in the geography department at CU.
We are a core of researchers; we have staff, faculty, graduate students, and a number of undergraduates and postdocs that are involved in data-intensive research on campus. So there’s about 30 of us or so, a cohort of researchers, all with similar interests in research areas, and also in data and analytics challenges. So Earth Lab formed around this idea, and we have these three main components that are interrelated and highly integrated: the data-intensive science projects and their research areas, an education initiative, and the analytics hub, which I will talk about now. The analytics hub is primarily focused on supporting the science projects and the education initiative. This is where we try to bring together both expertise and data and computing infrastructure to try to facilitate, or advance, some of these scientific research areas. And, on your right, you will see examples of the kinds of research areas we are involved in. We are looking at how ranchers respond to drought, and what they do about insurance if there’s a loss, in terms of selling cattle or buying new feed. We are looking at forest health and resilience and recovery after the disturbances of wildfire, drought, and pine beetle. All of these things have a common thread: we are bringing Earth observations and new ways of analyzing data to try to make progress in understanding processes and feedbacks within the Earth system sciences. So, as I mentioned, the analytics hub is really focused on trying to provide for and facilitate that research. So we provide expertise in remote sensing, data science, computing, and scientific visualization, along with data infrastructure. I will talk a little bit more about that.
And, in the process of working integrated with the science teams, we develop tools that enable data access and some of the analytics work – some of the transformation needed to make different datasets work together. And we make those data and those tools available more generally, sort of as an accelerator within Earth Lab and to the Earth science community. We put that stuff on GitHub, hoping it will be picked up, used, and adapted for other research activities. We have some training as part of the analytics hub, and we support the education initiative to fill in or address the gap in skills and knowledge around big data technologies, coding best practices, and other things, like Docker containers, as a way to build out analytics workflows and to work with Earth observations, for example. We have a visualization studio that we built up as part of Earth Lab, and it is a bit like a decision studio, if you are familiar with that. It has a number of displays around the room. It is a way to bring in a number of different people across different sectors – for example, industry, education, federal agencies – to explore and visualize different data. The key to that is a dedicated vis cluster, we call it. It is a stack of servers running SAGE software that allows interaction in the web browser. What is important is that we can run complex models on large datasets in real time. It gives us the ability to interact with the data in real time, looking at new data and forming new science questions, looking for patterns, and doing analytics on that. And in the center of the diagram is our data and computing infrastructure. In partnership with the university, we have access to the high-performance computing capabilities, which include supercomputing, and we have a couple of nodes dedicated to Earth Lab, so it is high priority.
We have a supercomputing lab we can work in, and we have research data infrastructure, which we refer to as the library. So we have the capability to store large amounts of data. And for us, it is not just about harvesting a lot of data and putting it in the center, but reaching out into other data repositories. One of the challenges that we have is that most Earth observations, especially those collected by federal agencies, sit in remote data repositories, so we face the challenge of making connections to them. We still make subsets or pull parts of that data down and do analytics on it. The goal is not to curate our own data, but to use the other data that is available. And recently, we are trying to move the analytics workflow – and pull the researchers along with us – into the cloud, because there are some really powerful approaches to scalable computing that the cloud enables. It also, and I will talk a bit more about this, enables us to have a collaborative and common platform as we try to build up partnerships with federal agencies and the industry community, as well as across the university. Rather than pulling people into the high-performance computing environment, which has its own environment, we can develop tools, capabilities, and an analytics-like environment that is common among different sectors, different partners. So this is really, I think, a central theme for us. The challenge with working with Earth observations, and I try to represent that here, is not only the volume of data that is available. For example, NASA has two petabytes of data products sitting in the archive right now, spread across 12 different science-theme-oriented data centers, and they are growing the archive at five petabytes per year. When they get to 2021 or 2022 and launch a couple of new satellites that are radar satellites, the data just skyrockets. So there is not only a lot of data sitting in the archive today, but rapid growth ahead.
And if we look at what NOAA is doing in providing weather forecasts, they are collecting something like 20 terabytes of data per day. Those are all mission-related data collections, so they define their own data types, they have their own data formats, and a set of metadata that goes with that. And they are all designed to meet specific scientific objectives or operational needs. There’s a lot of value and re-use in those datasets, and we are tapping those datasets to bring them into new research areas and applications. I mentioned that the other challenge, of course, is that this data is spread around a lot of different repositories. And while there are efforts in NASA and NOAA to bring their large, high-value datasets into the cloud, they are not there yet. There is still the challenge of trying to bring that data in. In addition, we are not only using satellite remote sensing, but a whole number of other ways that observations are collected. If you see in the top right corner there, there are fixed sites – environmental sites constantly collecting data about the environment – airborne sensors collecting information, and field studies. How do we integrate this heterogeneous set of data, both in measurement type and also in format? Also, one of the things that we are striving to do here, rather than building out a single analytics workflow, is to find new ways to bring scalable compute together with some of these new machine learning frameworks and other software approaches that are not really well understood within the natural sciences, or well adopted quite yet. By bringing those together, we can enable open and reproducible science. There are a number of reasons to do that. You want a process that allows transparency and access to some of the intermediate research products, and to the algorithms and the process used to analyze that data, both for review and to demonstrate the correctness of the results.
But also, to extend those methods and those tools to other research areas and new research topics. Within Earth Lab, then, it also allows us to build up a capability: we can capture work products and tools that are well documented and reproducible, as well as the scientific narrative that goes along with them, so we understand what those tools are. As part of our, I guess, business model, we have postdocs that come for a couple of years and move on, and we want to capture some of that understanding and capability within Earth Lab so we can continue to build out capability and knowledge in this data-intensive work we are doing. One of the things this means is that, in the way we do analytics development and the way we try to bring scalable compute to the problem, we try to use open-source software – R and Python – and Git and GitHub as ways of controlling software versioning, so several people can work on the same algorithms with a repository for the software, and things like Docker containers that allow us to capture the processing environments. A container captures the dependencies as well as the versions in an environment, and then, in principle, it is reproducible, so we can deploy it either from the desktop, or into the cloud, or give it to somebody else to deploy in their own software environment and get the same workflow and results. As part of the analytics hub activity, we are getting our feet under us, trying to understand how we leverage and adopt some of these really pretty cool and capable cloud computing technologies, but also some of the software approaches that I just mentioned. And so we have this opportunity, at the university, to think about how to strategize on building out this architecture, or kind of computing enterprise.
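The environment-capture idea described here can be sketched as a minimal Dockerfile. This is an illustrative fragment, not Earth Lab's actual configuration: the base image, pinned package versions, and the `analyze_scene.py` script name are all hypothetical stand-ins for whatever a given project uses.

```dockerfile
# Hypothetical example: freeze an analysis environment so the same
# workflow runs identically on a laptop, an HPC node, or in the cloud.
FROM python:3.10-slim

# Pin exact versions so the dependencies are reproducible.
RUN pip install --no-cache-dir numpy==1.26.4 scikit-learn==1.4.2

# Copy the (hypothetical) analysis script into the image and run it.
COPY analyze_scene.py /app/analyze_scene.py
WORKDIR /app
CMD ["python", "analyze_scene.py"]
```

Because the image records both the code and its exact dependency versions, anyone who builds and runs it should, in principle, reproduce the same workflow and results.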
And so we have, you know, a lot of our researchers are working on the desktop and we are trying to push them into environments that allow them to scale out there research. So not just to be able to do, you know, to wait over the weekend, but rather to extend their research to larger datasets, or longer timeframes, for example. So we are looking at how we interarct with the cloud. In the last couple minutes, I want to touch on a couple examples of the analytics we have been doing, and our approach to bringing analytics to a remote sensing data, and so on. And this first one represents a collaboration that we have been involved in with Digital globe, and with new mining. New Mine is a gold and silver producer, they are a global company. They came and asked if there’s some new techniques that we can apply to high spatial resolution, multi-spectral imagery. And so, we looked at how to bring deep learning into, sort of, into this framework, and Digital Globe provided the imagery, they did all the processing, the atmosphere correction, and the geo referencing of 16 bands. They provided something that tells us about the recoverable gold on the ground and trained up the deep learning algorithm on those points. And then we looked at predictions of recoverable gold in those regions within the Cripple Creek mind, here in Colorado, to see how it was doing. We use 60 percent of the ground points to train the algorithm, and 40 percent as a test for the algorithm. And we are in the 65 to 70 percent accuracy in the category. You can see that the yellow represents high values of recoverable gold. It allows us to see what is going on in that mine, but the plan is to extend it to other regions and mines to see if we can predict, or find gold from space. And the last example is an example that one of our student interns, Jeremy Diaz, is looking at. What he is interested in doing, he is using the machine learning framework and applying that to predicting tornado economic impacts. 
There’s a lot of work in trying to predict the occurrence of tornadoes but, more recently now, people are interested in trying to understand the damages associated with those and document those damages. So we built up some data about the social and economic impact of tornadoes, as well as using the National Weather Service’s storm data database, which characterizes the tornadoes, the intensity, and other characteristics of them. And Jeremy used those pieces of information to train up a machine learning algorithm. And over time, you can see the prediction of tornado damage at each of these grid points. It is in January, it will loop through month by month. You will see the geo-spatial patterns, that was the objective, to see if there are geo-spatial patterns related to tornado damage. And, of course, this just predicts the damage that would occur if a tornado occurred at that location. And so there are some, of course, regions that won’t have tornadoes, or very unlikely. So the next step is to lay on the probability of occurrence here, so we get something more akin to a risk map. So with that, I will wrap up and thank you all for coming, and I guess we will entertain a couple questions, maybe. Audience. Member: Yes, are you accepting new graduate students? Brian: (Laughter), yes, we are. We have 10 undergrad students in an internship, and we have something like four slots for GRAs, graduate research assistantships, and we have post-docs that roll over every few years. There’s a lot of opportunities to get involved in Earth Lab.
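The damage-prediction setup can likewise be sketched in standard-library Python. This is not Jeremy's model or the real Storm Data schema: the features (an EF-scale-like intensity and a path length), the damage formula, and the k-nearest-neighbors regressor are all assumptions made for illustration of training on storm characteristics and evaluating on held-out records.

```python
import math
import random

random.seed(0)

# Synthetic stand-in for storm records: each tornado has an intensity
# and a path length, and a damage value that grows with both (plus noise).
def make_record():
    intensity = random.uniform(0.0, 5.0)   # EF-scale-like rating
    path_km = random.uniform(0.1, 50.0)    # path length in km
    damage = 10.0 * intensity ** 2 + 0.5 * path_km + random.gauss(0.0, 5.0)
    return (intensity, path_km), damage

records = [make_record() for _ in range(300)]
train, test = records[:240], records[240:]

def normalize(features):
    """Scale both features to roughly [0, 1] so neither dominates
    the distance calculation."""
    return (features[0] / 5.0, features[1] / 50.0)

def knn_predict(train, features, k=5):
    """Predict damage as the mean damage of the k most similar
    historical tornadoes (k-nearest-neighbors regression)."""
    target = normalize(features)
    nearest = sorted(train, key=lambda rec: math.dist(normalize(rec[0]), target))[:k]
    return sum(damage for _, damage in nearest) / k

# Mean absolute error on the held-out records.
errors = [abs(knn_predict(train, feats) - damage) for feats, damage in test]
mae = sum(errors) / len(errors)
print(f"held-out mean absolute error: {mae:.1f}")
```

Evaluating on held-out records keeps the damage model honest; overlaying a separate probability-of-occurrence layer, as described above, would then turn these conditional damage predictions into a risk map.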
Audience member: Can you go back to the first slide? Brian: That one? Audience member: The data highlighting focus – spatial. Yes. Brian: You are asking about the spatial analytics we are involved in? With the training tools. There are two aspects to that activity. With the education initiative, Leah Wasser is director of the education program; she handles the Earth Data Analytics course for graduate students – this is the third year we have been running it – and she runs the professional master’s certificate. These courses start with coding skills in R and Python, then some of the concepts of transformation, you know, regridding, reprojection of imagery and bringing images together, and getting your feet wet in machine learning and in spatial and temporal patterns. And then the analytics hub runs data jams, and covers coding best practices, Docker containers, and things like that. Does that answer your question? I can’t find the mouse, (laughter). Any other questions? Oh, you want their email? Host: Yeah, we will have to cut it off and keep moving forward. If there are any other questions for Brian, grab him at the break, or between talks if you could. Thank you. Thanks, everybody. Thank you. (Applause).
Live captioning by Lindsay @stoker_lindsay at White Coat Captioning @whitecoatcapx.