Introduction to Scientific Gateways and Workflows
Overview
Teaching: 10 min
Exercises: 0 min
Questions
What is a scientific gateway?
What is a scientific workflow?
Why are scientific gateways and workflows important?
Objectives
Understand what scientific gateways and workflows are and the benefits they provide to researchers
Motivation
Many sciences have become data intensive: numerous simulations, digital collection methods, and instruments now produce many terabytes of data. Moreover, new, highly complex and very large data sets are expected from the novel and more complex scientific instruments and numerical simulations that will become available in the coming decades.
Handling, exploring, and using these data to make scientific discoveries poses a challenge that requires new approaches to organizing scientific collaboration and to using computing and storage resources. To this end, scientific gateways and workflows have emerged as a paradigm for researchers to collaborate and to formalize and structure complex scientific experiments, enabling and accelerating scientific discovery and reproducibility.
Furthermore, funding agencies increasingly require data management plans to accompany grant applications, and citing data in publications to support reproducibility is becoming the norm.
Together, scientific gateways and workflows provide a framework that gives research communities access to computing resources and lets them orchestrate scientific applications and tools through web-based graphical user interfaces.
What is a Scientific Gateway?
A science gateway is a community-developed set of tools, applications, and data that are integrated via a portal or suite of applications, usually in the form of a graphical interface, to meet the needs of a specific science community. Communities formed from users of a common discipline can use gateways to access resources used for scientific analysis through a common, optimized interface. This removes the barrier of assembling the necessary cyberinfrastructure and of acquiring the thorough understanding of programming languages and software that researchers might otherwise need to carry out computational tasks.
More succinctly, science gateways are portals to computational and data services and resources. They provide these services and resources to a range of science domains for researchers, engineers, educators, and students. Each gateway is an independent project with its own guidelines for access. A gateway is available for use by anyone but is targeted towards specific research communities.
Benefits of using science gateways
Scientific gateways provide many benefits to researchers. A few of the most common are listed below. Gateways:
- Enable researchers to focus more on their scientific research and less on setting up the cyberinfrastructure they would need to carry out heavy computation
- Provide access to large community datasets
- Promote the dissemination of research knowledge and reproducible science
- Foster collaboration among researchers and scientific communities
What is a scientific workflow?
Quantitatively complex science often consists of numerous interconnected computational tasks. In this context, a workflow is the composition of several such complex, data-intensive computing tasks for scientific simulation or data analysis. Common stages in scientific workflows are acquisition, integration, reduction, visualization, and publication of scientific data. Scientific communities use workflows and workflow management systems to manage the complexity associated with such tasks. Workflow technologies can schedule tasks on distributed resources, manage dependencies, and stage data for execution. A workflow management system (WMS) aids in the automation of those operations, namely managing the execution of constituent tasks and the information exchanged between them.
In the context of gateways, computational processes supported by gateways are organized as scientific workflows that explicitly specify dependencies among underlying tasks for orchestrating distributed resources (such as clusters, grids or clouds) appropriately.
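As a rough illustration of what explicitly specifying dependencies among tasks can look like, the sketch below models a workflow as a mapping from each task to the tasks it depends on, and asks Python's standard `graphlib` module for a valid execution order. The task names reuse the common stages listed above; the dependency edges and the `execution_order` helper are assumptions made for illustration, and a real workflow management system would also handle scheduling, data staging, and failures.

```python
from graphlib import TopologicalSorter

# Each key is a task; its value is the set of tasks that must finish first.
# These edges are illustrative assumptions, not taken from a real workflow.
workflow = {
    "acquisition": set(),
    "integration": {"acquisition"},
    "reduction": {"integration"},
    "visualization": {"reduction"},
    "publication": {"reduction", "visualization"},
}


def execution_order(tasks):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(tasks).static_order())


if __name__ == "__main__":
    for step, task in enumerate(execution_order(workflow), start=1):
        print(step, task)
```

Because every stage here depends on the one before it, only one order is valid. When tasks are independent of each other, a workflow system is free to run them in parallel on different resources.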
Benefits of scientific workflows
Workflows have been adopted by scientific communities as valuable tools for performing the data-heavy computational tasks necessary for experimentation. Workflows enable researchers to perform data analysis and computation while:
- hiding the complexities of job submission, resource allocation, and file handling
- handling dependencies between tasks
- providing simple-to-use data pipeline code that requires minimal background knowledge to perform an analysis
These benefits save researchers the time they would otherwise need to learn a programming language.
Why are scientific gateways and workflows important?
As data-intensive research continues to make up a substantial portion of research activity, cyberinfrastructure and access to it help researchers satisfy the data management plans that are increasingly required. Some funding agencies expect proposals to include a data management plan to ensure that the data does not disappear and that the research is properly disseminated. The researcher benefits from the same dissemination, which documents the analysis performed and how conclusions were reached. This can lead to further collaborations, further funding, and extensions of their work. Data management plans have the added benefit of helping scientists keep data resources organized.
Gateways can be substantial components of such data management plans and can satisfy many of the associated requirements. Furthermore, gateways typically give researchers the ability to publish their data sets, further promoting the reproducibility of science. In fact, many publications now require that data sets be cited. Most gateways can serve this need by providing a citable digital object identifier (DOI).
Workflow and science gateway technologies have been adopted by scientific communities as valuable tools for carrying out complex experiments. They make it possible to perform computations for data analysis and simulation while hiding the details of the complex underlying infrastructure.
Much of the information used here was kindly provided by the following cited sources.
Citations
- John Towns, Timothy Cockerill, Maytal Dahan, Ian Foster, Kelly Gaither, Andrew Grimshaw, Victor Hazlewood, Scott Lathrop, Dave Lifka, Gregory D. Peterson, Ralph Roskies, J. Ray Scott, Nancy Wilkins-Diehr, “XSEDE: Accelerating Scientific Discovery”, Computing in Science & Engineering, vol. 16, no. 5, pp. 62-74, Sept.-Oct. 2014, doi:10.1109/MCSE.2014.80
- Scientific Gateways Community Institute. Retrieved from https://sciencegateways.org
- Castelli, Giuliano, et al. “VO-compliant workflows and science gateways.” Astronomy and Computing 11 (2015): 102-108.
- Deelman, Ewa, et al. “The future of scientific workflows.” The International Journal of High Performance Computing Applications 32.1 (2018): 159-175.
Key Points
Scientific gateways are online community spaces providing web-based resources for accessing data, software, computing services, and equipment specific to the needs of a research discipline.
Scientific workflows are computational processes that help automate and manage data-intensive computing tasks while sparing users from handling cyberinfrastructure complexities directly.
Using and Developing Gateways and Some Gateways of Interest
Overview
Teaching: 10 min
Exercises: 0 min
Questions
How are science gateways used?
What are some resources to assist in the development of gateways and data plans for projects?
What are some examples of scientific gateways that may be of interest to researchers?
Objectives
Understand various aspects of gateways, from developing a gateway for a project to using one
Discover some existing scientific gateways that may be of interest to the audience
Using Science Gateways
Using a gateway will be the main focus of this workshop. The audience will get first-hand experience using a scientific gateway with a predetermined dataset. The gateway we will be exploring in the next episode is HydroShare, a gateway created for the water sciences to share and analyze water data.
Scientific Gateway Features
Depending on the needs of the communities, a gateway may provide any of the following features:
- High-performance computation resources
- Workflow tools
- Data storage and metadata/documentation tools
- General or domain-specific analytic and visualization software
- Collaborative interfaces
- HPC job submission tools
- Education modules
Developing and Integrating Science Gateways
Scientific gateways can be helpful when developing a data management plan for a research project. There are resources available to those interested in developing a gateway for their research community.
XSEDE
The Extreme Science and Engineering Discovery Environment, XSEDE, can support science gateways in many ways. XSEDE is a single virtual system that provides advanced research computing resources and services. XSEDE can provide virtual machine hosting for scientific gateways and their services. Gateway developers can also benefit from XSEDE’s collaborative support services, which help gateway providers integrate gateways with XSEDE resources.
More information on how to turn a project into a science gateway using XSEDE can be found at the XSEDE website.
Providers of XSEDE services support community accounts allowing gateways to execute scientific applications on XSEDE resources as a generic gateway user. Users do not need to create XSEDE accounts to use a gateway.
Scientific Gateways Community Institute
In 2016, a team led by the San Diego Supercomputer Center was awarded a National Science Foundation (NSF) grant to establish an institute to accelerate the development of scientific gateways which address the needs of researchers across NSF’s directorates. This institute became the Scientific Gateways Community Institute, SGCI.
Organizations like the Scientific Gateways Community Institute provide great resources to help when building a scientific gateway. Their goal is to facilitate the sharing of experiences, technologies, and practices among those working with science gateways. SGCI hosts workshops on developing, operating, and sustaining gateways, and provides a science gateway catalog that gives users a means to discover gateways within different disciplines. Furthermore, SGCI holds regular tech summits to help developers find solutions to common gateway-related issues.
SGCI Offerings
SGCI offers a full range of services and expertise including:
- Building and running gateways
- Software developer expertise for building a new gateway or enhancing an existing one, in the areas of graphic design, cybersecurity, business and sustainability planning, and user engagement
- Usability and user engagement for sustainability. (Keeping stakeholders in the loop is crucial.)
- Free hosting, allowing gateway builders to test frameworks.
- Community resources and networking
- Education and training
SGCI provides large-scale training, consulting services, learning opportunities, and a community, all of which can be found at the SGCI website.
Examples of Some Existing Scientific Gateways
The following are some examples of existing scientific gateways which may be of interest to the audience. Attendees may visit these gateways and discover communities sharing domain-specific data and workflows which may assist in their research projects.
CyberInfrastructure for Phylogenetic REsearch, CIPRES
CIPRES is a gateway for systematic and population biology and phylogenetics-related research. It allows researchers to explore relationships between species using supercomputers provided by NSF’s XSEDE. Backed by an NSF award, CIPRES was developed to make supercomputing resources more accessible and flexible for phylogenetic researchers and is among the most popular gateways in the XSEDE community. It has supported more than 12,000 users and contributed to 1,300 publications on phylogenetics.
ChemCompute
ChemCompute is a scientific gateway targeted towards the chemistry community and students. According to ChemCompute’s website, it provides computational chemistry software for undergraduate teaching and research. The website contains pages for accessing data and various simulation and solver tools, as well as support for submitting computational jobs to its cloud clusters for workflow applications. ChemCompute’s goal is to enable faculty to incorporate computational chemistry into their undergraduate teaching and research curriculum without the hassle of compiling, installing, and maintaining software and hardware.
Unidata: Data Proximate Services in the Cloud
Funded by the NSF, Unidata is a diverse community of research and educational institutions which share geoscience data and the tools to access and visualize it. They provide cyberinfrastructure, data services, and tools to advance earth sciences. Additionally, Unidata develops, maintains, and supports a variety of software packages.
OpenTopography
The NSF-funded OpenTopography Facility is a gateway which targets the earth science community. It is a web-based system developed to give access to earth science-oriented LIDAR topography data. OpenTopography provides free, online access to LIDAR data in a number of forms, including pre-computed raster data as well as the raw point cloud and associated geospatial-processing tools for customized analysis.
HydroShare
HydroShare is an online collaboration environment for sharing data, models, and code for the water science community. It has extensive capabilities to create and discover new data sets while providing its own web apps which can be used in workflows to visualize, analyze, and run models. HydroShare additionally allows users to publish their data and models to satisfy data management plan requirements.
Some of the information used here was kindly provided by the following cited source.
Citations
- “San Diego Supercomputer Center. https://www.sdsc.edu/services/hpc/science_gateways.html”
Key Points
Scientific gateways can provide researchers various features such as computational resources, workflow tools, and collaborative interfaces.
Resources are available to help researchers interested in developing scientific gateways learn about gateway development and where to get started.
Many scientific gateways aimed at specific research communities already exist. XSEDE and SGCI provide links to such gateways for researchers interested in accessing them.
HydroShare: A Science Gateway for Hydrological Sciences
Overview
Teaching: 30 min
Exercises: 0 min
Questions
What is HydroShare?
How does HydroShare work?
What is a HydroShare resource?
How do you share resources with others?
How do you create a HydroShare group?
What are HydroShare webapps?
What is JupyterHub?
Objectives
To allow the audience some experience using a scientific gateway and existing workflows
To introduce attendees to HydroShare
Exploring HydroShare
HydroShare
According to the HydroShare website, HydroShare is a web-based hydrologic information system. It is developed and maintained by the Consortium of Universities for the Advancement of Hydrologic Science, Inc. (CUAHSI) and was established for users to share and publish data and models. HydroShare makes this information available in a citable, shareable, and discoverable manner.
This collaborative aspect of HydroShare is one of its most useful features. It enables users to work as teams in a web-based environment. Additionally, HydroShare includes tools such as web applications backed by computational resources which can be used to perform tasks on data within HydroShare. These web apps usually provide users with scientific workflows to perform computational analysis.
HydroShare has several functionalities which can advance research and education. These include the following:
- Archiving and Disseminating Data
HydroShare enables users to upload and store data and corresponding metadata. Data and models uploaded into HydroShare can be assigned citation information which can be used to reference them. Once a resource is permanently published and assigned a citable digital object identifier (DOI), it can no longer be edited and remains stored in HydroShare. The data is then discoverable and can be used to reproduce research. This is an important feature for satisfying data management plans when research is finalized.
- Collaboration
Users can control who the data and models are shared with, and can create or join groups with common research interests. By sharing data and models with other researchers in their community, users can connect with colleagues in the same field, collaborate on research, and receive constructive feedback about their work.
- Discover and Use Existing Datasets
HydroShare provides functions to help users discover data and models within its platform. It has a wide range of users, many of whom contribute data and models that are useful to others with particular research interests.
How HydroShare Works
Typical use of HydroShare involves the following steps:
- Create data
This step involves collecting data, whether by using sensors, taking images, or any other method. HydroShare supports a variety of data types, e.g., GIS and NetCDF.
- Upload to HydroShare
By uploading data to HydroShare, the user is provided a place to store the information and given the option to make it open and accessible. Alternatively, the user can retain privacy until ready. Uploading is simple through the web user interface. HydroShare automatically extracts as much metadata as it can from the files uploaded.
- Describe with metadata
HydroShare enables users to annotate data with metadata. This annotation makes data interpretable and discoverable. Together, data and metadata form what HydroShare refers to as a resource.
- Share with colleagues
Users can share data and models with individuals or make them publicly accessible. There are various ways to interact with other researchers. Sharing within a group, in particular, gives all members interested in a specific research topic the ability to access that resource and collaborate.
- Permanently publish
When a HydroShare user publishes data, they receive a digital object identifier (DOI). This enables users to cite their data resource and reference it in related journal publications. Additionally, DOIs provide your readers with a way to find your data.
A visual of the overall process of how HydroShare works is provided on their site.
HydroShare Resources
HydroShare’s website defines a “resource” as the fundamental unit of digital content in HydroShare that contains data and/or model files and their corresponding metadata. Resources act as a container into which the sharable content is placed.
Together, data, models, and their metadata form the resource, the single unit of digital content on which much of HydroShare’s functionality is based. A resource can hold multiple files of different types and enables users to manage access, track versions, and share content with colleagues and collaborative groups. Resources can be either public or private, and each has its own unique identifier and URL, or a DOI if published.
Each resource within HydroShare has its own landing page. This page displays all resource metadata and its content files. Upon logging into a HydroShare account, users can access resources they have created or that others have shared with them in the “My Resources” section.
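To make the idea of a resource as a container more concrete, here is a minimal sketch of how such a unit could be modeled in Python. The `Resource` class, its field names, and the file name are hypothetical and do not reflect HydroShare's actual implementation. The sketch only mirrors what is described in this episode: content files plus metadata, a public or private status, and a DOI that locks the content once the resource is published.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Resource:
    """Hypothetical model of a resource: content files plus their metadata."""

    title: str
    files: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)
    public: bool = False
    doi: Optional[str] = None  # set only when the resource is published

    def add_file(self, name: str) -> None:
        # Content of a formally published resource can no longer be changed.
        if self.doi is not None:
            raise PermissionError("published resources cannot be changed")
        self.files.append(name)


dem = Resource(title="DEM for Hawaii")
dem.add_file("hawaii_dem.tif")  # illustrative file name
dem.metadata["abstract"] = "Digital elevation models for Hawaii."
```

Keeping files and metadata in one object is what lets access control, sharing, and citation apply to the whole unit at once.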
Resource Landing Page
Users can reach a resource’s landing page in different ways:
- By clicking on a resource discovered using HydroShare’s ‘Discover’ feature.
- By using a direct link or citation to a HydroShare resource.
From the landing page, users can also download the content files of a resource using the download buttons in the ‘Content’ section.
The following figures show an example landing page for the resource titled “DEM for Hawaii”, which includes digital elevation models for Hawaii. HydroShare will attempt to fill out as much information as it can extract from the files. The top of the landing page contains information such as the author and owner of the resource, content type, abstract, and keywords associated with the resource.
HydroShare will attempt to extract geographical information about data files if they are associated with a geographic region. Data files, models, and all other accompanying data for the resource are available in the “Content” section.
Creating a Resource and Uploading Data
In this section, readers will learn how HydroShare stores and describes data. They will also get a chance to follow along, create their own resource, and either share it with other members or keep it private.
To upload data to HydroShare, users must first sign up and log into their account. From the user’s profile page:
- Click on the Create button in the top right navigation menu.
- From the dropdown, select Resource.
- Provide a resource title and click Create.
HydroShare will create the new resource and direct the user to its landing page. The resource is now ready for adding files and metadata. Metadata sections within the resource include, but are not limited to, sections for abstract, keywords, geographic coverage, references, and comments, as well as a content section into which files are added.
It is important to emphasize the role of metadata in a resource: it is what makes the content interpretable and discoverable. To make a resource public or discoverable, users must at least provide a descriptive title, an abstract, and one subject keyword.
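The minimum metadata rule above can be expressed as a small check. The following is only a sketch: the function name and the dictionary keys (`title`, `abstract`, `keywords`) are assumptions chosen for illustration, not HydroShare's actual field names or validation logic.

```python
def missing_for_public(metadata):
    """Return the required fields still missing before a resource can be
    made public or discoverable (hypothetical field names)."""
    missing = []
    if not str(metadata.get("title", "")).strip():
        missing.append("title")
    if not str(metadata.get("abstract", "")).strip():
        missing.append("abstract")
    # At least one subject keyword is required.
    if not metadata.get("keywords"):
        missing.append("keywords")
    return missing


draft = {"title": "DEM for Hawaii", "abstract": "", "keywords": []}
print(missing_for_public(draft))
```

A check like this could run before changing a resource's sharing status, so that users learn which fields to fill in rather than simply being refused.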
Users should organize files into folders as needed and use descriptive names that distinguish the data. Also, the “Add content from the web” icon allows users to provide a link to external web resources in case the user wants to store only metadata while the data is stored elsewhere.
Users must have editing privileges on a resource in order to edit it. To verify whether one can edit, navigate to the resource landing page and look for a pencil icon in the top right corner. If no pencil icon is visible, the user has no editing rights. Otherwise, clicking on the pencil icon allows modification of metadata and content files. Users can also delete resources if necessary by clicking the “Delete” button on the resource’s landing page.
Note on published resources
Content in formally published resources can no longer be changed. A limited set of metadata fields can still be edited. However, authors, title, and content files cannot be changed.
More information on how to formally publish data on HydroShare can be found in their help documentation.
Sharing Resources
The owner of a resource can share that resource allowing HydroShare users, user groups, or the public the right to access the content and metadata of a resource. Shared resources may still be private when shared only with individual users or groups. In this case they would not be discoverable by the public.
One can access the sharing permissions by clicking the share icon in the upper right corner of the resource landing page.
Here, users can control access and sharing status. More information on sharing and privacy can be found at HydroShare’s help documentation.
HydroShare Groups and Communities
Groups are a major part of HydroShare’s functionality. A group is a collection of HydroShare users with a common resource landing page that is populated by the resources owned and shared by the users in the group. You can create a group for your research team and share resources within that group. Groups can be public or private, and you have control over what is shared with the group and what access group members have to the resources you share.
Communities are designed for groups to share resources more seamlessly, fostering public data sharing and open access. A community is a set of groups, which allows several differently administered groups to collaborate toward a common goal. Communities are ideal when a project spans administrative domains, e.g., universities, research groups, and/or businesses. Current examples of communities include collaborations between large research networks and collections of universities. As communities consist of groups, an individual user is part of a community through an associated group.
Creating a Group
To create a group:
- Go to the Collaborate tab in HydroShare and click on Groups.
- Click on Create Group. (You can also search for an existing group from here.)
- Fill in fields.
- Select access permissions.
- Click Create.
The next episode will discuss HydroShare groups in more detail.
HydroShare Web Apps
Apps are the software tools that allow you to visualize, analyze, and work with resources, more specifically data and models. Apps are hosted on separate web servers from the HydroShare website and access HydroShare resources using web services via the REST application programming interface (API).
Web apps are how users can work with data that is stored within HydroShare. They usually are applications which can be used to perform computational tasks or scientific workflows. Web apps run from remote servers which act as a computational backend to HydroShare. They act as tools which can be used for exploring and visualizing different types of data or performing general analysis.
Web apps communicate with HydroShare to move data in and out of it by means of a REST API. A set of apps are approved by the HydroShare development team and require account access to use. This is because they can save results that one might want to store back into their HydroShare account. Use of certain apps also requires that users join a HydroShare group associated with that web app.
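As a rough sketch of what moving data in and out over a REST API involves, the example below builds (but does not send) two HTTP requests using Python's standard library: one to read a resource's metadata and one to write results back. The base URL, the URL paths, the token-based authorization header, and the function names are all placeholders assumed for illustration. They are not HydroShare's actual endpoints, which are described in its own API documentation.

```python
import json
from urllib.parse import quote
from urllib.request import Request

# Placeholder base URL for illustration only; not a real gateway endpoint.
BASE_URL = "https://gateway.example.org/api"


def build_metadata_request(resource_id, token):
    """Build a GET request that would fetch a resource's metadata."""
    url = f"{BASE_URL}/resource/{quote(resource_id, safe='')}/"
    headers = {"Accept": "application/json", "Authorization": f"Bearer {token}"}
    return Request(url, headers=headers, method="GET")


def build_results_request(resource_id, token, results):
    """Build a POST request that would store analysis results in a resource."""
    url = f"{BASE_URL}/resource/{quote(resource_id, safe='')}/files/"
    body = json.dumps(results).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}",
    }
    return Request(url, data=body, headers=headers, method="POST")


request = build_metadata_request("abc123", "MY-TOKEN")
print(request.get_method(), request.full_url)
```

The authorization header in the sketch reflects why approved apps require account access: a request that writes results back has to be tied to the user whose account will hold them.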
CUAHSI JupyterHub
The CUAHSI JupyterHub is a web application that allows HydroShare users to execute scientific code in the cloud.
This application supports the execution of code written in several programming languages, including Python and R, and is specifically designed to support the development of research- and education-focused Jupyter Notebooks.
Jupyter notebooks combine narrative text and code into a single document which can be used for creating and disseminating scientific workflows as well as educational tools for classroom exercises and professional workshops. The CUAHSI JupyterHub combines this functionality with the HydroShare data repository to provide a rich computational environment for water sciences.
There are multiple ways to access the CUAHSI JupyterHub web application from HydroShare. The simplest is to launch it from the HydroShare Apps library by clicking the tab labeled ‘APPS’ at the top of the HydroShare webpage.
Any data that is uploaded, downloaded, and created is associated with your HydroShare account and will persist between sessions, meaning that it will be there next time you log in.
HydroShare resources can also be “launched” into the CUAHSI JupyterHub environment. While in a HydroShare resource, click the “Open with …” button as pictured below. This button can be found in the top right corner of any HydroShare resource landing page.
After launching the CUAHSI JupyterHub application, you will be presented with several purpose-built environments to choose from. Each of these environments contains pre-installed software to assist in the rapid development of code. You will receive the following server options:
More information on CUAHSI JupyterHub can be found in HydroShare’s help documentation.
Much of the information used here was kindly provided by the following cited sources.
Citations
- “HydroShare Support. https://help.hydroshare.org”
Key Points
Scientific gateways like HydroShare give researchers access to domain-specific data sets and workflows which can help in the analysis and visualization of data.
HydroShare provides tools to create resources which can contain data, models, workflows and other useful resources researchers can use and expand on.
HydroShare provides users simple ways to openly collaborate with the public community, privately with specific individuals, or research groups consisting of individuals with similar interests.
Webapps in HydroShare can be used to gain access to computational resources and workflows which perform data analysis tasks.
Collaborating and Data Analysis Using HydroShare Workflows
Overview
Teaching: 20 min
Exercises: 2 min
Questions
How do users discover and join an existing research group in HydroShare?
How do users access existing resources and scientific workflows in a group?
How can users use an existing workflow to carry out data analysis?
Objectives
Allow the audience some experience using science gateways and existing workflows for data analysis in a collaborative environment
In the following sections, examples are provided to give hands-on experience discovering groups and resources of interest, joining a collaborative environment, and performing analysis on data whose results can be written back to HydroShare to share with others in the group.
More specifically, users will learn the following:
- How to discover and join an existing group of interest
- How to navigate the group’s landing page and access its resources
- How to use an existing workflow to carry out some data analysis using a JupyterHub workflow
- How to write results back into HydroShare as a new resource to share your analysis with others.
Before starting this activity please be sure to be logged into HydroShare.
Discovering/Finding a HydroShare Group
HydroShare provides a simple search functionality to help users discover and find groups within its site. Use the Collaborate tab at the top of the HydroShare page.
The “Find Groups” page can then be used to find a listing of discoverable and public groups available. Using the search function users can enter keywords which may be associated with group names, purpose, descriptions and other keywords indexed by the group owner.
Discover the HI-DSI Group
Use HydroShare’s search functionality to find the HydroShare group named, “HI-DSI Gateways and Workflows”.
Click on the title of any listed group to view the group’s landing page. The members of a public group are visible to other users, while the members of a discoverable group are not. Either way, users can request group membership.
Joining a Group
Once a group of interest is found, users can request access by clicking the “Ask to join” button. If the owner of the group has not set the “auto accept” option for new requests, users may experience a wait time until their request is processed. Otherwise, access may be granted almost immediately.
Join the HI-DSI Group
Once the “HI-DSI Gateways and Workflows” group is found, request access by clicking, “Ask to join”.
About Groups
A group’s landing page provides information about the group such as its title, purpose, and description. From the landing page, users can see all resources shared with the group as well as all members of the group. It should be noted that resources within the group are not owned by all members. While they are shared, ownership is retained by the individual user who created the resource.
Exercise: Using Resources and Workflows Within a Group to Collaborate On Data Analysis
For this exercise, attendees who have joined the HI-DSI group will use an existing shared resource containing both data and a Jupyter Notebook workflow to perform some visualization and data analysis. The purpose of this exercise is to:
- Demonstrate how a gateway can provide a collaborative environment
- Demonstrate how data workflows can ease the researcher’s burden of programming certain computational tasks from scratch.
- Exemplify the usefulness of gateways and workflows in the reproducibility of scientific results.
HydroShare supports various web apps in which workflows and models can be built. Typically, most hydrological modeling and data analysis is done on a personal computer or on some centralized computing system. The knowledge and resources that these systems require can be barriers to a user’s research. Some examples are:
- Knowledge about high-performance computing systems
- Capacity of a local computer
- Compatibility and dependency requirements for installation on local computers, and
- Time taken to get local model installations properly configured and validated
By providing and supporting web apps such as JupyterHub and its Jupyter Notebook functionality, HydroShare enables a more flexible environment for model execution and data analysis. This gives users a preconfigured environment free of worry about cyberinfrastructure and dependencies, thus lowering the aforementioned barriers to research.
A Jupyter notebook contains live code, equations, visualizations, and explanatory text. Notebooks can be used to implement scientific workflows and other computational tasks. HydroShare users can either create and share their own workflows using Jupyter notebooks or discover and use existing workflows implemented as Jupyter notebooks.
Jupyter Notebook functionality within HydroShare:
- Notebooks can be launched from HydroShare
- Notebooks can access the contents of HydroShare resources
- Notebooks can save results back into HydroShare
Ease of Use For the User
The code within a Jupyter notebook can execute operating-system and language-specific commands within the hosting environment. This provides users with access to the hosting environment’s computational capabilities while eliminating the need for users to install and configure software.
While this does not relieve the user from needing to learn the programming language, operating system commands, or commands associated with the program being used for analysis, it does remove the need for users to install and have the capacity to run these programs locally.
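As a rough illustration of what a notebook shell escape such as `! ls` does, the sketch below runs a command in the hosting environment from Python using the standard `subprocess` module. The helper name `run_command` is hypothetical; it is not part of Jupyter or HydroShare.

```python
import subprocess
import sys

def run_command(args):
    """Run a command in the hosting environment and return its text output."""
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout

# Use the Python interpreter itself as the command so the example runs on any OS.
output = run_command([sys.executable, "-c", "print('hello from the host')"])
print(output.strip())
```

In a notebook the `!` prefix is usually more convenient; calling `subprocess` directly is mainly useful when the output needs to be captured and processed in Python.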
About the Following Exercise
For this exercise attendees will be performing some visualization and data analysis on a raw .tif file of a geographic region. They will use an existing workflow to perform some common computational tasks on the type of data provided using a software package for digital elevation models called TauDEM. Attendees will then perform further analysis on the data and write their results back to HydroShare as a new separate resource. This resource can then be shared with a group for further scientific collaboration.
The Data
The data used will be a single image. The file contains raster data of a surface map saved in .tif format. It is an aerial image of a mountain and water source in Utah.
Rasters are well suited for representing data that changes continuously across a landscape. They provide an effective method of storing that continuity as a surface. Elevation values measured from the Earth’s surface are the most common application of surface maps.
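To make the idea of a raster concrete, here is a minimal sketch that treats a raster as a rectangular grid in which every cell stores one elevation value. The values and the helper name `raster_stats` are made up for illustration and do not come from logan.tif.

```python
# A tiny, made-up elevation raster: each value is the elevation (m) of one grid cell.
elevation = [
    [1510.0, 1495.0, 1480.0],
    [1502.0, 1487.0, 1471.0],
    [1498.0, 1479.0, 1465.0],
]

def raster_stats(grid):
    """Return (rows, cols, min, max) of a rectangular grid of numbers."""
    values = [v for row in grid for v in row]
    return len(grid), len(grid[0]), min(values), max(values)

rows, cols, lowest, highest = raster_stats(elevation)
print(f"{rows} x {cols} cells, elevation range {lowest}-{highest} m")
```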
About TauDEM
Terrain Analysis Using Digital Elevation Models (TauDEM) is a software package for hydrologic terrain analysis, used here from within Jupyter. TauDEM is a free and open source set of Digital Elevation Model (DEM) tools for the extraction and analysis of hydrologic information from topography as represented by a DEM. The software is developed at Utah State University (USU) for hydrologic digital elevation model analysis and watershed delineation.
Further information on the use of TauDEM functions can be found in the TauDEM documentation.
Workflows and the TauDEM Notebook
Existing workflows can be helpful in that they remove the need for a researcher to program some common computational tasks for their data from scratch. The idea behind using the TauDEM notebook for this exercise is to use the existing workflow set in place to perform some data analysis and visualization. Users can then perform further analyses to meet their needs. This exemplifies the collaboration aspect which a gateway can provide its users.
Users will use an existing workflow developed by David Tarboton, Director of Utah Water Research Laboratory, to run some commonly used visualization and computational methods on digital elevation models. The goal will be to perform some additional analysis in Professor Tarboton’s notebook in order to delineate a subwatershed and a stream network using some given TauDEM functions. Users will then write the Jupyter Notebook with their additional analysis into a new HydroShare resource, which can then be shared with their research group.
Workflow for Terrain Analysis Using Digital Elevation Models
To make it easier to follow along with the hands-on portion of this workshop and to understand what each individual cell in the Jupyter Notebook is doing we will break every task into “Steps”.
Step 1: Accessing the JupyterHub Notebook
- Once on the group landing page, select the resource titled “HI-DSI Workshop: Introduction to TauDEM”
- Click “Open with” at the top right.
- Choose CUAHSI JupyterHub as the webapp.
- Agree to the terms of use and click “Sign in with HydroShare”.
- Authorize CUAHSI JupyterHub. The JupyterHub session will open.
- Select the Python environment ‘Python - v3.8’
- Select the file at the top of the list named WorkshopTauDEM.ipynb.
Step 2: Install and Import Libraries
Note that Professor Tarboton’s comments are included in the notebook to help users find documentation and understand what task each cell performs.
The first cell in the notebook is like the first cell of most Jupyter Notebooks. All necessary software is installed in the user’s JupyterHub environment using the pip command. Module imports are also carried out here. The hstools module provides functions for interacting with HydroShare, including resource querying, downloading, and creation. The TauDEM module provides functions for workspace maintenance, data analysis, and visualization.
# Only run this cell if the libraries are not already installed
!pip install rasterio
!pip install geopandas
!pip install hstools
import os
import rasterio as rio
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib import colors
! pwd
! ls
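Since the install commands only need to run when the libraries are missing, one way to check first is sketched below with the standard `importlib` module. The helper name `missing_packages` is hypothetical, and the sketch assumes each package's import name matches its pip name, which holds for the three packages listed here.

```python
import importlib.util

def missing_packages(names):
    """Return the packages from `names` that cannot be imported here."""
    return [name for name in names if importlib.util.find_spec(name) is None]

required = ["rasterio", "geopandas", "hstools"]
to_install = missing_packages(required)
print("Packages still to install:", to_install)
```

Only the names printed would then need a `!pip install` line.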
Step 3: Define a Plotting Function For Visualization
We start with a digital elevation model, DEM, file surrounding the Logan River Watershed in Logan, Utah. The file is named logan.tif. In this section, we illustrate the TauDEM basic grid analysis functions. TauDEM uses files in TIFF format by default.
This workflow first defines a visualization function which can be repeatedly used in order to plot data and resulting figures after analysis. The function is already defined for the user. It can be called on the data file to visualize.
Defining a convenient plotting function
# Define convenience plotting function
cmap='terrain'
label='Elevation (m)'
import math
import numpy as np
def rp(raster,cmap='terrain',label='',bounds=[]):
    # Read the raster and replace nodata cells with NaN so they are not plotted
    with rio.open(raster) as src:
        boundary = src.bounds
        parray = src.read()
        nodata = src.nodata
    parray=np.where(parray == nodata,math.nan,parray)
    # Plot the imported data
    plt.figure(figsize=(20,10))
    # if using bounds
    if(len(bounds) > 0):
        # Add the data minimum and maximum as the outer class edges
        bounds=np.insert(bounds,0,np.nanmin(parray))
        bounds=np.append(bounds,np.nanmax(parray))
        # Accept either a colormap name or a Colormap object
        cmap = plt.get_cmap(cmap)
        norm = colors.BoundaryNorm(bounds, cmap.N)
        imgplot = plt.imshow(parray[0],cmap=cmap,norm=norm)
    else:
        imgplot = plt.imshow(parray[0],cmap=cmap)
    cbar = plt.colorbar()
    cbar.set_label(label)
    # plt.savefig('LoganDEM.png', dpi=450, bbox_inches = 'tight') # activate this command if you want to save the figure. Note that it should be called before plt.show()
    plt.show()
Calling the function on the data provides a visualization of the original file.
# display the raw dem
dem = 'logan.tif'
rp(dem,label="Elevation (m)",)
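The plotting function hides cells without data by replacing the raster's nodata value with NaN before plotting. A minimal pure-Python sketch of that masking step follows; the nodata value of -9999 and the helper names are assumptions for illustration only.

```python
import math

def mask_nodata(grid, nodata):
    """Replace every nodata cell with NaN so it is ignored when plotting."""
    return [[math.nan if v == nodata else v for v in row] for row in grid]

def valid_range(grid):
    """Minimum and maximum over the cells that are not NaN."""
    values = [v for row in grid for v in row if not math.isnan(v)]
    return min(values), max(values)

# Made-up band in which -9999 marks cells without data.
band = [[1500.0, -9999.0], [1480.0, 1475.0]]
masked = mask_nodata(band, nodata=-9999.0)
print(valid_range(masked))
```

Without the mask, the nodata value would be treated as a real elevation and would stretch the color scale.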
Step 4: Removing Pits From Data
Next, this workflow begins grid analysis by calling actual TauDEM functions, the first of which is Pit Remove. Pits are grid cells surrounded by higher terrain that do not drain. Removing these pits produces a hydrologically conditioned DEM from the original, which is needed for further analysis. Pits are removed by raising their elevation.
The output of this function will provide information on how long execution might take along with some other system and file information.
!mpiexec -n 4 pitremove {dem}
After execution is done there will be an additional file which is a hydrologically conditioned DEM as previously mentioned.
! ls # This lists the files present.
# Note you should see the additional output file loganfel.tif that was written by the previous command.
loganfel.tif logan.tif Outlet_meta.xml Outlet.shp
logan_meta.xml logan.vrt Outlet.prj Outlet.shx
logan_resmap.xml Outlet.dbf Outlet_resmap.xml TauDEM.ipynb
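The idea of a pit can be sketched on a tiny made-up grid. The code below finds interior cells that are lower than all eight neighbors and raises each one to the level of its lowest neighbor. It is a simplified illustration with hypothetical function names, not TauDEM's pitremove algorithm, which also handles depressions spanning many cells.

```python
def neighbor_values(dem, r, c):
    """Elevations of the eight cells surrounding interior cell (r, c)."""
    return [dem[r + dr][c + dc]
            for dr in (-1, 0, 1) for dc in (-1, 0, 1)
            if (dr, dc) != (0, 0)]

def find_pits(dem):
    """Return (row, col) of interior cells lower than all eight neighbors."""
    pits = []
    for r in range(1, len(dem) - 1):
        for c in range(1, len(dem[0]) - 1):
            if all(dem[r][c] < n for n in neighbor_values(dem, r, c)):
                pits.append((r, c))
    return pits

def fill_single_cell_pits(dem):
    """Raise each single-cell pit to the elevation of its lowest neighbor."""
    filled = [row[:] for row in dem]
    for r, c in find_pits(dem):
        filled[r][c] = min(neighbor_values(dem, r, c))
    return filled

dem = [
    [10, 10, 10, 10],
    [10,  4,  9, 10],
    [10,  9,  8,  7],
    [10, 10,  7,  6],
]
filled = fill_single_cell_pits(dem)
print(find_pits(dem), "->", filled[1][1])
```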
Step 5: D8 Analysis
The conditioned file with pits removed is then used to perform a D8 analysis. The D8 method calculates the direction of flow from each cell to the one of its eight neighbors with the steepest downward slope. The TauDEM function used is called d8flowdir, and it outputs two new files: loganp.tif, the flow direction grid, and logansd8.tif, the slope grid.
Once executed, the function takes a few minutes to complete calculations and produce output files.
!mpiexec -n 4 d8flowdir -p loganp.tif -sd8 logansd8.tif -fel loganfel.tif
Calling the previously defined plotting function, the resulting files, loganp.tif and logansd8.tif, can be visualized.
rp('loganp.tif',plt.get_cmap('gist_earth', 8),'D8 Flow Direction')
rp('logansd8.tif','terrain','D8 Slope')
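The D8 rule can be sketched for a single cell as follows. The 1 to 8 direction numbering (1 = east, counting counter-clockwise), the made-up elevations, and the function name `d8_direction` are assumptions for illustration; this is not TauDEM's implementation.

```python
import math

# Neighbor offsets (d_row, d_col) keyed by direction code. The numbering
# (1 = east, then counter-clockwise) is an assumption made for this sketch.
D8_OFFSETS = {1: (0, 1), 2: (-1, 1), 3: (-1, 0), 4: (-1, -1),
              5: (0, -1), 6: (1, -1), 7: (1, 0), 8: (1, 1)}

def d8_direction(dem, r, c, cell_size=1.0):
    """Return (direction code, slope) of steepest descent from cell (r, c).

    Returns (None, 0.0) when no neighbor is lower than the cell.
    """
    best_code, best_slope = None, 0.0
    for code, (dr, dc) in D8_OFFSETS.items():
        nr, nc = r + dr, c + dc
        if not (0 <= nr < len(dem) and 0 <= nc < len(dem[0])):
            continue  # neighbor lies outside the grid
        distance = cell_size * math.hypot(dr, dc)  # diagonal neighbors are farther away
        slope = (dem[r][c] - dem[nr][nc]) / distance
        if slope > best_slope:
            best_code, best_slope = code, slope
    return best_code, best_slope

dem = [
    [9.0, 8.0, 7.0],
    [8.0, 6.0, 5.0],
    [7.0, 5.0, 2.0],
]
code, slope = d8_direction(dem, 1, 1)
print(code, round(slope, 3))
```

Here the center cell drains to its south-east neighbor, because the 4 m drop over a diagonal distance is steeper than the 1 m drops to the east and south.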
Step 6: Stream Drop Analysis
The next cell runs a sequence of TauDEM functions to identify the optimal threshold to use to delineate the stream network. The analysis uses TauDEM’s stream drop analysis approach which is based on the weighted contributing area of upward curved grid cells which are mapped using the Peuker Douglas algorithm.
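The outlet shapefile, Outlet.shp, is passed to each of these commands with the -o option, and the results of the drop analysis are written to logandrp.txt.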
The last command in the sequence, “drop analysis”, maps the highest resolution stream network consistent with the DEM’s geographic morphology and determines the drainage density. It does this by testing a number of stream delineation thresholds. It identifies the threshold where the mean stream drop for first order streams is not statistically different from the mean stream drop of higher order streams.
Execution of these commands will take a few minutes and output some runtime information.
!mpiexec -n 4 aread8 -p loganp.tif -o Outlet.shp -ad8 loganad8o.tif
!mpiexec -n 8 peukerdouglas -fel loganfel.tif -ss loganss.tif -par 0.4 0.1 0.05
!mpiexec -n 8 aread8 -p loganp.tif -ad8 loganssa.tif -wg loganss.tif -o Outlet.shp
!mpiexec -n 8 dropanalysis -fel loganfel.tif -p loganp.tif -ad8 loganad8o.tif -ssa loganssa.tif -o Outlet.shp -drp logandrp.txt -par 10 1000 15 0
# Determine threshold from last line of drp.txt file after colon.
with open("logandrp.txt", 'r') as f:
    lines = f.read().splitlines()
thresh = lines[-1].split(":")[1]
print("\nOptimal stream definition threshold:" + thresh)
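The D8 contributing area used in this step can be sketched as a count of how many cells drain through each cell, the cell itself included. The direction codes and the small flow direction grid below are made up for illustration, and the function name `contributing_area` is hypothetical; TauDEM's aread8 is considerably more efficient and handles outlets and weights.

```python
# Neighbor offsets (d_row, d_col) per direction code; the numbering
# (1 = east, then counter-clockwise) is an assumption made for this sketch.
D8_OFFSETS = {1: (0, 1), 2: (-1, 1), 3: (-1, 0), 4: (-1, -1),
              5: (0, -1), 6: (1, -1), 7: (1, 0), 8: (1, 1)}

def contributing_area(flow_dir):
    """Count, for every cell, the cells whose flow path passes through it."""
    rows, cols = len(flow_dir), len(flow_dir[0])
    area = [[0] * cols for _ in range(rows)]
    for r0 in range(rows):
        for c0 in range(cols):
            r, c, steps = r0, c0, 0
            # Walk downstream, crediting the start cell to every cell on its path.
            # The step limit guards against loops in a malformed direction grid.
            while 0 <= r < rows and 0 <= c < cols and steps <= rows * cols:
                area[r][c] += 1
                code = flow_dir[r][c]
                if code not in D8_OFFSETS:  # no direction: the path ends here
                    break
                dr, dc = D8_OFFSETS[code]
                r, c, steps = r + dr, c + dc, steps + 1
    return area

# Made-up directions: everything drains to the bottom-right cell (code 0 = no flow).
flow_dir = [
    [8, 7, 6],
    [1, 8, 7],
    [1, 1, 0],
]
area = contributing_area(flow_dir)
for row in area:
    print(row)
```

Cells with a large count lie along the main flow paths, which is why thresholding an accumulation grid yields a stream network.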
Plot the results using the defined visualization function. The plot shows the high resolution stream network and the portions of the network that contribute more flow within the watershed. This result is the basis for the further stream network analysis with TauDEM.
rp('loganad8o.tif',colors.ListedColormap(['white', 'lightskyblue', 'cyan', 'blue', 'navy']),'D8 Contributing area',bounds=[300,1000,3000,10000])
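The `bounds` argument splits the contributing area values into classes, one per color: four inner bounds plus the data minimum and maximum give five intervals for the five listed colors. The sketch below approximates that binning with the standard `bisect` module; the function name `color_class` is hypothetical, and the exact edge handling of matplotlib's BoundaryNorm may differ at the extremes.

```python
from bisect import bisect_right

def color_class(value, bounds):
    """Index of the class that `value` falls into, given sorted class edges."""
    return bisect_right(bounds, value)

bounds = [300, 1000, 3000, 10000]
names = ['white', 'lightskyblue', 'cyan', 'blue', 'navy']
classes = [color_class(cells, bounds) for cells in (50, 300, 2500, 50000)]
for cells, index in zip((50, 300, 2500, 50000), classes):
    print(cells, '->', names[index])
```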
Step 7: Stream Network Analysis Using TauDEM functions
In the following steps, users will perform their own analysis to add to this workflow. Since workshop attendees are not expected to be familiar with hydro-sciences or TauDEM, some code will be provided to perform further analysis on the stream network obtained using Professor Tarboton’s workflow. Attendees are encouraged to follow along by copying the provided code here into the Jupyter Notebook in the group resource. They will then be able to save their additions as a new resource into HydroShare using given HydroShare tool commands.
Keep in mind that the important point of this exercise is to experience the collaborative nature that scientific gateways can provide, as well as the benefit that scientific workflows bring to researchers’ work by providing code for common data analysis tasks.
Our goal will be to delineate a watershed and the parts of the stream network which contribute most to water flow.
Using the extracted high resolution stream network, we will use the TauDEM function “threshold” to define a stream raster grid named logansrc.tif. The threshold function takes as input the previously obtained file loganssa.tif, which is the weighted contributing area, or stream source accumulation (ssa), and the threshold determined using the “drop analysis” function.
The “streamnet” function provides a number of output files, including the target files of interest. Outputs include the shapefile named logannet.shp, which contains the portion of the stream network carrying the most flow, and a raster of delineated subwatersheds named loganw.tif. This raster identifies the part of the watershed that drains to each link of the stream network.
Inputs to these functions include outputs from previous analysis in Professor Tarboton’s workflow.
!mpiexec -n 4 threshold -ssa loganssa.tif -src logansrc.tif -thresh {thresh}
!mpiexec -n 4 streamnet -fel loganfel.tif -p loganp.tif -ad8 loganad8o.tif -src logansrc.tif -ord loganord3.tif -tree logantree.dat -coord logancoord.dat -net logannet.shp -w loganw.tif -o Outlet.shp
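Conceptually, thresholding turns an accumulation grid into a stream raster by marking the cells whose value reaches the threshold. A minimal sketch with made-up numbers is shown below. The function name `apply_threshold` is hypothetical, and treating "greater than or equal to" as the stream condition is an assumption here rather than a statement about TauDEM's implementation.

```python
def apply_threshold(ssa, thresh):
    """Return a grid with 1 where ssa >= thresh (stream cell) and 0 elsewhere."""
    return [[1 if v >= thresh else 0 for v in row] for row in ssa]

# The threshold read from a text file is a string, so convert it to a number first.
thresh = float(" 10.0 ")

ssa = [
    [1, 2, 1],
    [3, 12, 4],
    [1, 40, 95],
]
src = apply_threshold(ssa, thresh)
for row in src:
    print(row)
```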
Use the plotting function to visualize the delineated watershed.
rp('loganw.tif')
Any resulting output files produced through further analysis will be written to the working directory within JupyterHub and ultimately into the resource written back to HydroShare.
The following is a list of all the output files produced by the previous analysis. These files can be saved and stored in a new HydroShare resource.
!ls
loganad8o.tif logannet.shp loganssa.tif Outlet_meta.xml
logancoord.dat logannet.shx loganss.tif Outlet.prj
logandrp.txt loganord3.tif logan.tif Outlet_resmap.xml
loganfel.tif loganp.tif logantree.dat Outlet.shp
logan_meta.xml logan_resmap.xml logan.vrt Outlet.shx
logannet.dbf logansd8.tif loganw.tif TauDEM.ipynb
logannet.prj logansrc.tif Outlet.dbf
The “geopandas” library is used to visualize the stream network shapefile. The following code imports the library, reads the file, and plots the resulting stream network.
When importing geopandas there may be a compatibility warning. This won’t be a problem and the cell will still run.
import geopandas as gpd
from geopandas import GeoSeries, GeoDataFrame, read_file
streamnet = read_file('logannet.shp')
streamnet
# GeoDataFrame.plot creates its own figure, so pass the size to it directly
streamnet.plot(figsize=(30, 24))
Congratulations
Congratulations! You have successfully delineated subwatersheds and a stream network using TauDEM. If you want to retain your work you could at this point save the contents of your folder back to HydroShare, or download them.
Step 8: Save the Results Back To HydroShare
Note about HydroShare tools
Recall that JupyterHub communicates with HydroShare via the HydroShare REST API. CUAHSI has provided a set of utilities which serve as a means to communicate with HydroShare from within JupyterHub. These tools can be installed from within the Jupyter notebook using the command ‘!pip install hstools’.
Once installed, more information on how to use hstools can be found by executing the command ‘!hs --help’.
!hs --help
usage: hs {get, add, create, delete, list, describe, init} [-h, --help]
HSTools is a humble collection of tools for interacting with data in the
HydroShare repository. It wraps the HydroShare REST API to provide simple
commands for working with resources.
positional arguments:
{get,add,create,delete,list,describe,init}
get Retrieve resource content from HydroShare
add Add files to an existing HydroShare resource
create Create a new HydroShare resource
delete Delete a HydroShare resource
list List HydroShare resources that you own
describe Describe metadata and files
init Initialize a connection with HydroShare
optional arguments:
-h, --help show this help message and exit
The HydroShare “hstools” library will be used to save the changes and additions made to the Jupyter notebook in this exercise back to HydroShare as a new resource. A new HydroShare resource can be created using the ‘!hs create’ command in a separate notebook cell. The ‘!hs create --help’ command provides information on how to use the command.
!hs create --help
usage: hs create [-q] [-v] [-f] [-k] [-t] [-a] [-h]
Create a new HydroShare resource
optional arguments:
-h, --help show this help message and exit
-a ABSTRACT [ABSTRACT ...], --abstract ABSTRACT [ABSTRACT ...]
resource description
-t TITLE [TITLE ...], --title TITLE [TITLE ...]
resource title
-k KEYWORDS [KEYWORDS ...], --keywords KEYWORDS [KEYWORDS ...]
space separated list of keywords
-f FILES [FILES ...], --files FILES [FILES ...]
space separated list of files
-v verbose output
-q suppress output
Before creating the resource, a few notes should be mentioned. First, metadata can be defined within JupyterHub using Python lists and strings. In order to create a resource, certain metadata fields must be included. These fields are the title, abstract, keywords, and content files. It is also important to know and include the files intended to be written to the HydroShare resource.
Users can view all of the files in their directory using the ‘!ls’ command before selecting which to write back to HydroShare as part of the new resource.
!ls
loganad8o.tif logannet.shp loganssa.tif Outlet_meta.xml
logancoord.dat logannet.shx loganss.tif Outlet.prj
logandrp.txt loganord3.tif logan.tif Outlet_resmap.xml
loganfel.tif loganp.tif logantree.dat Outlet.shp
logan_meta.xml logan_resmap.xml logan.vrt Outlet.shx
logannet.dbf logansd8.tif loganw.tif TauDEM.ipynb
logannet.prj logansrc.tif Outlet.dbf
The following code block defines all required metadata for resource creation.
# define metadata variables
keywords = ['TauDEM', 'Logan River']
abstract = "Jupyter Notebook TauDEM was used to define streamflow and subwatersheds in the Logan River Watershed in Utah."
title = "Test TauDEM Result"
files = !find . -maxdepth 1 -type f # shows all the files created
print(files)
# Select the files that you want to save
files = ['./logan.tif', './TauDEM.ipynb',
'Outlet.shp','Outlet.prj',
'Outlet.shx','Outlet.dbf'
]
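A shapefile is stored as several companion files (for example .shp, .shx, .dbf, and .prj), so all of its parts need to be selected together. The sketch below gathers the parts of one shapefile from a directory listing with the standard `pathlib` module. The helper name `shapefile_parts` and the shortened listing are hypothetical.

```python
from pathlib import Path

SHAPEFILE_PARTS = ('.shp', '.shx', '.dbf', '.prj')

def shapefile_parts(names, stem):
    """Return the listed files that belong to the shapefile called `stem`."""
    return sorted(n for n in names
                  if Path(n).stem == stem and Path(n).suffix in SHAPEFILE_PARTS)

# Shortened, illustrative directory listing
listing = ['logan.tif', 'logannet.shp', 'logannet.shx', 'logannet.dbf',
           'logannet.prj', 'loganw.tif', 'Outlet.shp']
selected = ['logan.tif'] + shapefile_parts(listing, 'logannet')
print(selected)
```

The same approach could be used to add the stream network produced in Step 7 to the list of files to save.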
The following code block is used to initialize HydroShare. It provides authorization from the current location within JupyterHub to write into HydroShare. Users will be asked to sign into HydroShare to connect and write their resource.
It should be noted that, according to the ‘hstools’ library, the ‘!hs init’ command can be used to initialize a connection with HydroShare; however, there seemed to be an issue using it in conjunction with this notebook. Professor Tarboton provided this block of code, in which he defined his own initialization function, and it is used here. Users are encouraged to try the HydroShare utilities function ‘!hs init’ in their own projects when writing back to HydroShare.
# Initialize HydroShare connection
# This is an awkward way to initialize the HydroShare connection from inside Jupyter. An alternative is to open
# a terminal and run hs init
import hstools as hs
import os
import sys
import json
import base64
import argparse
from getpass import getpass
from hstools import hydroshare
def init(loc='.'):
    fp = os.path.abspath(os.path.join(loc, '.hs_auth'))
    if os.path.exists(fp):
        print(f'Auth already exists: {fp}')
        remove = input('Do you want to replace it [Y/n]')
        if remove.lower() == 'n':
            sys.exit(0)
        os.remove(fp)
    usr = input('Enter HydroShare Username: ')
    pwd = getpass('Enter HydroShare Password: ')
    dat = {'usr': usr,
           'pwd': pwd}
    cred_json_string = str.encode(json.dumps(dat))
    cred_encoded = base64.b64encode(cred_json_string)
    with open(fp, 'w') as f:
        f.write(cred_encoded.decode('utf-8'))
    try:
        hydroshare.hydroshare(authfile=fp)
    except Exception:
        print('Authentication Failed')
        os.remove(fp)
        sys.exit(1)
    print(f'Auth saved to: {fp}')
init('/home/jovyan')
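The initialization function stores the credentials as base64-encoded JSON. The sketch below reproduces only that encoding step with dummy values to show that base64 is reversible, so the resulting file should be treated like a plain-text password. The function names and the credentials are hypothetical.

```python
import base64
import json

def encode_credentials(usr, pwd):
    """Encode credentials as base64 text, mirroring the encoding step shown above."""
    raw = json.dumps({'usr': usr, 'pwd': pwd}).encode()
    return base64.b64encode(raw).decode('utf-8')

def decode_credentials(text):
    """Recover the original dictionary from the base64 text."""
    return json.loads(base64.b64decode(text))

# Dummy values for illustration only
token = encode_credentials('demo_user', 'not-a-real-password')
print(token)
print(decode_credentials(token))
```

Because anyone who can read the file can decode it, it is worth keeping the auth file out of any resource that is shared.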
Finally, create the new resource in HydroShare using the ‘!hs create’ command. After executing, the command output will provide a link to the HydroShare resource. Users can access the resource directly by clicking on the link or by using the “My Resources” tab in the HydroShare website.
!hs create -t {title} -a {abstract} -k {' '.join(keywords)} -f {' '.join(files)}
+ creating resource
+ adding: ./logan.tif
+ adding: ./TauDEM.ipynb
+ adding: Outlet.shp
+ adding: Outlet.prj
+ adding: Outlet.shx
+ adding: Outlet.dbf
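It can help to preview how the `{...}` substitutions in the create command expand into separate arguments. The sketch below builds a similar command string and splits it the way a POSIX shell would, using the standard `shlex` module; it does not call `hs` itself. Whether a multi-word keyword should be kept together depends on how hstools treats keywords, which is not verified here.

```python
import shlex

title = "Test TauDEM Result"
keywords = ['TauDEM', 'Logan River']
files = ['./logan.tif', './TauDEM.ipynb']

# The same kind of string substitution that the notebook performs inside {...}
command = f"hs create -t {title} -k {' '.join(keywords)} -f {' '.join(files)}"
argv = shlex.split(command)  # split the way a POSIX shell would
print(argv)

# Quoting each keyword keeps a multi-word keyword together as one argument
quoted = ' '.join(shlex.quote(k) for k in keywords)
argv_quoted = shlex.split(f"hs create -k {quoted}")
print(argv_quoted)
```

Without quoting, 'Logan River' reaches the command as two separate keywords, 'Logan' and 'River'.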
Click on the new resource and use the share button to share it with the “HI-DSI Gateways and Workflows” group. Other group members are now able to access the new resource, view the methods used in your analysis, and further add to it or use it for their own data analysis and research.
This exercise was meant to emphasize how scientific gateways and workflows can help researchers collaborate, reproduce science, get started quickly with some data related tasks, and help satisfy data management plans.
Key Points
Scientific gateways like HydroShare enable researchers with similar research interests to form and join groups and communities within the gateway.
Existing workflows can be discovered and used to help researchers perform visualization and analysis on their data, eliminating the need to write all the necessary foundational code.
Acknowledgements
Thank you to the Consortium of Universities for the Advancement of Hydrologic Science, Inc. (CUAHSI) for their work on HydroShare and for their helpful documentation and website, which acted as a source of information for much of this workshop and document. Thank you to CUAHSI’s Anthony Castronova, PhD, Senior Research Hydrologist, for consulting on the use of HydroShare for the workshop. Also, thank you to David Tarboton, ScD, Director of Utah Water Research Laboratory, Utah State University, for providing the “Introduction to TauDEM” JupyterHub notebook used in this workshop. All of the code used in the workshop was provided by Dr. Tarboton in his “Introduction to TauDEM” resource. All explanations of the code and additions to the notebook in the workshop are attributed to Dr. Tarboton’s original notebook. The resource and accompanying notebook used in the workshop are a modified version of his original resource.