Introduction
Overview
Teaching: 8 min
Exercises: 0 min
Questions
Who are we and what are we going to learn?
Objectives
Introduce ourselves and the course
Set up HydroShare, our example FAIR data platform
Better research by better sharing
Introductions
- Sean Cleveland, Cyberinfrastructure Scientist, The University of Hawai’i Cyberinfrastructure & Hawaii Data Science Institute
- Bjarne Bartlett, CI-TRACS Data Science Fellow, The University of Hawai’i Cyberinfrastructure & Hawaii Data Science Institute
Hello everyone, and welcome to the FAIR Data Management Security and Ethics workshop.
Better research by better sharing
For many of us, data management, and output sharing in general, is considered a burden rather than a useful activity. Part of the problem is bad timing and a lack of planning.
Data management is a continuous process
Figure credits: Tomasz Zielinski and Andrés Romanowski from https://carpentries-incubator.github.io/fair-bio-practice/06-being-precise/index.html
When should you engage in data sharing and open practices?
- Data management should be done throughout the duration of your project.
- If you wait till the end, it will take a massive effort on your side and will be more of a burden than a benefit.
- Taking the time to do effective data management will help you understand your data better and make it easier to find when you need it (for example when you need to write a manuscript or a thesis!).
- All the practices that enable others to access and use your outcomes directly benefit you and your group.
In this workshop we will discuss how your research outputs can be made readily available for re-use by others.
Key Points
You can do more impactful research if you plan to share your outputs!
You can more efficiently publish if you plan to share your outputs!
Open Science
Overview
Teaching: 8 min
Exercises: 4 min
Questions
What is Open Science?
How can I benefit from Open Science?
Why has Open Science become a hot topic?
Objectives
Identify parts of the Open Science movement, their goals and motivations
Explain the main benefits of Open Science
Recognize the barriers and risks in the adoption of Open Science practices
Science works best by exchanging ideas and building on them. Most efficient science involves both questions and experiments being made as fully informed as possible, which requires the free exchange of data and information.
All practices that make knowledge and data freely available fall under the umbrella-term of Open Science/Open Research. It makes science more reproducible, transparent, and accessible. As science becomes more open, the way we conduct and communicate science changes continuously.
What is Open Science?
Open science is the movement to make scientific research (including publications, data, physical samples, and software) and its dissemination accessible to all levels of an inquiring society, amateur or professional.
Open Science represents a new approach to the scientific process based on cooperative work and new ways of diffusing knowledge by using digital technologies and new collaborative tools.
Open science is transparent and accessible knowledge that is shared and developed through collaborative networks.
Characteristics:
- Using web-based tools to facilitate information exchange and scientific collaboration
- Transparency in experimental methodology, observation, and collection of data
- Public availability and reusability of scientific data, methods and communications
What is the Open Science movement?
Sharing of information is fundamental for science. This began at a significant scale with the invention of scientific journals in 1665. At that time this was the best available alternative to critique & disseminate research, and foster communities of like-minded researchers.
Whilst this was a great step forward, the journal-driven system of science has led to a culture of ‘closed’ science, where knowledge or data is unavailable or unaffordable to many.
The distribution of knowledge has always been subject to improvement. Whilst the internet was initially developed for military purposes, it was hijacked for communication between scientists, which provided a viable route to change the dissemination of science.
Momentum has been building as the way science is communicated changes to reflect what research communities are calling for: solutions to the problems we face today (e.g. impact factors, data reusability, the reproducibility crisis, trust in the public science sector, etc.).
Open Science is the movement to increase transparency and reproducibility of research, through using the open best practices.
Attribution Gema Bueno de la Fuente
Open Science Building Blocks
- Open Access: Research outputs hosted in a way that makes them accessible to everyone. Traditionally Open Access referred to journal articles, but it now also includes books, chapters, and images.
- Open Data: Data freely and readily available to access, reuse, and share. Smaller datasets were often made accessible as supplemental materials alongside journal articles themselves; however, they should be hosted in dedicated platforms for more convenient and better access.
- Open Software: Software whose source code is made readily available and which others are free to use, change, and share. Examples include the coding language and supporting software R and RStudio, as well as image analysis software such as Fiji/ImageJ.
- Open Notebooks: Lab and other notebooks hosted online, readily accessible to all. These are popular with some of the large funding bodies and allow anyone to comment on any stage of the experimental record.
- Open Peer Review: A system where peer review reports are published alongside the body of work. This can include reviewers’ reports, correspondence between the parties involved, rebuttals, editorial decisions, etc.
- Citizen Science: Lay people participate in various stages of the research process, from project funding to data collection and analysis (e.g. image classification). Platforms such as zooniverse.org connect projects with lay people interested in playing an active role in research, which can help generate and/or process data that would otherwise be unachievable by a single person.
- Scientific social networks: Networks of researchers, often meeting locally in teams but also connected online, that foster open discussion of scientific issues. Many people use traditional social media platforms for this (e.g. Twitter, Instagram, various sub-reddits, discussion channels on Slack/Discord), although there are also more dedicated spaces such as researchgate.net.
- Open Educational Resources: Educational materials that are free for anyone to access and use to learn from. These can be anything from talks, instructional videos, and explanations posted on video hosting websites (e.g. YouTube), to entire digital textbooks written and published freely online.
Exercise 1: Benefits of Open Science
Being open has outcomes/consequences beyond giving free access to information. For example, Open Educational Resources:
- enable collaborative development of courses
- improve teachers’/instructors’ skills through sharing ideas
Select one or two of the following OS parts:
- Open Access
- Open Data
- Open Software
- Open Notebooks
- Open Peer Review
and discuss what benefits they bring or what problems are solved by the adoption of those Open initiatives.
Solution
Possible benefits and consequences for each part:
Open Access
- speed of knowledge distribution
- levels the playing field for underfunded sites that would otherwise be unable to get past the paywall
- prevents articles being paid for ‘thrice’ (first to produce, second to publish, third to access) by institutions
- greater access to work by others, increasing chance for exposure & citations
- access to work by lay audiences, thus increases social exposure of research
Open Data
- ensures data isn’t lost over time - reusability
- acceleration of scientific discovery rate
- value for money/reduced redundancy
- permits statistical re-analysis of the data to validate findings
- gives access to datasets which were not published as papers (e.g. negative results, large screening data sets)
- provides an avenue to generate new hypotheses
- permits combination of multiple data sources to address questions, provides greater power than a single data source
Open Software
- great source to learn programming skills
- the ability to modify creates a supportive community of users and rapid innovation
- saves time
- faster bug fixes
- better error scrutiny
- use of the same software/code allows better reproducibility between experiments
- need funds to maintain and update software
Open Notebooks
- 100% transparent science, allowing input from others at early stages of experiments
- source of learning about the process of how science is actually conducted
- allows access to experiments and data which otherwise never get published
- provides access to ‘negative’ results and failed experiments
- anyone, anywhere around the world, at any time, can check in on projects, including many users simultaneously
- possibility of immediate feedback
- thorough evidence of originality of ideas and experiments, negating effect of ‘scooping’
Open Peer Review
- visibility leads to more constructive reviews
- mitigates against editorial conflicts of interest and/or biases
- mitigates against reviewers conflicts of interest and/or biases
- allows readers to learn/benefit from comments of the reviewers
Open Educational Materials
- Foster collaboration between educators/others
- Show clearly how a method was taught (e.g. Carpentries materials), so it can be reproduced anywhere, anytime
- protects materials from becoming technologically obsolete
- authors who prepare the material or contribute to it all earn credit (e.g. GitHub)
- recycle animations and material that is excellent (why reinvent the wheel?)
Motivation: Money
One also has to consider the moral dimension of the research/publication process: charities and taxpayers pay to fund research, and then they pay again to access the research they already funded.
From an economic point of view, scientific outputs generated by public research are a public good that everyone should be able to use at no cost.
According to the EU report “Cost-benefit analysis for FAIR research data”, €10.2bn is lost every year because data are not accessible (plus an additional €16bn if re-use and research quality are taken into account).
One goal of Open Science is to make research and research data available to, for example, the charities and taxpayers who funded that research.
Motivation: Reproducibility
The inherent transparency of Open Science and the easy access to data, methods, and analysis details naturally help to address part of the reproducibility crisis. The openness of scientific communications and of the actual process of evaluating research (Open Peer Review) increases confidence in research findings.
Personal motivators
Open Science is advantageous to many parties involved in science (including the researcher community, funding bodies, the public, and even journals), which is leading to a push for the widespread adoption of Open Science practices.
Funding bodies are also becoming big supporters of Open Science. We can see with the example of Open Access that once it is enforced by funders (the stick), adoption is wide. But what about the personal motivators, the carrots?
The main difference between the public benefits of Open Science practices and the personal motivators of output creators is that the public can benefit almost instantly from open resources, whereas the advantages for the data creator come with a delay, typically counted in years. For example, building a reputation will not happen with one dataset, and re-use only leads to citations and collaborations after the next research cycle.
Barriers and risks of OS movement:
Exercise 2: Why we are not doing Open Science already
Discuss Open Science barriers, mention the reasons for not already being open:
Solution
- sensitive data (anonymising data from administrative health records can be difficult)
- IP
- misuse (fake news)
- lack of confidence (the fear of critics)
- lack of expertise
- the costs in $ and in time
- novelty of data
- it is not mandatory
- lack of credit (publishing negative results is of little benefit to you)
It may seem obvious that we should adopt open science practices, but there are associated challenges with doing so.
Sensitivity of data is sometimes considered a barrier. Shared data needs to be compliant with data privacy laws, leading many to shy away from hosting it publicly. Anonymising data to desensitise it can help overcome this barrier.
The potential for intellectual property on research can dissuade some from adopting open practices. Again, much can be shared if the data is filtered carefully to protect anything relating to intellectual property.
Another risk can be seen with work on COVID-19 pre-prints: a manuscript hosted publicly prior to peer review may accelerate access to knowledge, but it can also be misused and/or misunderstood. This can result in political and health decisions being made on faulty data, which is counter to society’s best interests.
One concern is that opening up one’s data to the scientific community can lead to the identification of errors, which may cause embarrassment. However, this could be considered an upside: we should want our work to be scrutinized and errors to be pointed out, and accepting this is the sign of a competent scientist. It is better to have errors pointed out than to risk the greater embarrassment and damage that irreproducible data might cause.
One of the biggest barriers is the cost involved in “being Open”. Firstly, making outputs readily available and usable by others takes time and significant effort. Secondly, there are costs of hosting and storage. For example, microscopy datasets reach sizes of terabytes, and making such data accessible for 10 years involves a serious financial commitment.
Attribution
Content of this episode was adapted from:
- Wiki Open Science
- European Open Science Cloud
- Science is necessarily collaborative - The Biochemist article.
- https://carpentries-incubator.github.io/fair-bio-practice/
Key Points
Open Science increases transparency in research
Publicly funded science should be publicly available
Intellectual Property, Licensing, and Openness
Overview
Teaching: 8 min
Exercises: 2 min
Questions
What is intellectual property?
Why should I consider IP in Open Science?
Objectives
Understand that the timeline matters for legal protection
Understand what can and cannot be patented
Understand what licenses to use for re-use of data and software
Open Science and Intellectual property
Intellectual property (IP) is something that you create using your mind - for example, a story, an invention, an artistic work or a symbol.
The timeline of “opening” matters when one seeks legal protection for their IP.
For example, patents are granted only for inventions that are new and were not known to the public in any form. Publishing in a journal or presenting in a conference information related to the invention completely prevents the inventor from getting a patent!
You can benefit from new collaborations, industrial partnerships, and consultations that openness attracts. This can yield greater benefit than patent-related royalties.
(Optional) Intellectual property protection
You can use a patent to protect a non-obvious (technical) invention that provides “technical contribution” or solves a “technical problem”. It gives you the right to take legal action against anyone who makes, uses, sells or imports it without your permission.
In principle, software can be patented, but in practice this is usually settled by the courts on a case-by-case basis.
Software code is copyrighted. Copyright prevents people from:
- copying your code
- distributing copies of it, whether free of charge or for sale.
Data cannot be patented, and in principle, it cannot be copyrighted. It is not possible to copyright facts!
Facts are not patentable, and since machine learning algorithms such as neural networks are basically mathematical methods, they are exempt from protection. However, applied to a specific problem, an algorithm may become part of a patent: if framed in the right way, patenting an algorithm is possible. For example, a deep learning algorithm generating a certain kind of audio may be eligible, but that would not prevent the network from being applied to any other problem.
However, how data are collated and presented (especially if it is a database), can have a layer of copyright protection. Deciding what data needs to be included in a database, how to organize the data, and how to relate different data elements are all creative decisions that may receive copyright protection. Again, it is often a case by case situation and may come down to who has better lawyers.
Exercise 3: Checking common licenses
Open the CC BY license summary (https://creativecommons.org/licenses/by/4.0/). Is it clear how you can use data under this licence, and why it is popular in academia?
Check the MIT license wording (https://opensource.org/licenses/MIT). Is it clear what you can do with software code under this licence?
Compare with the full wording of CC BY (https://creativecommons.org/licenses/by/4.0/legalcode). Can you guess why the MIT licence is currently the most popular for open source code?
Solution
- CC BY license states material can be reproduced, shared, in whole or in part, unless where exceptions and limitations are stated. Attributions must be made to the Licensor.
- MIT license states that the Software can be used without restriction (copied, modified, published, distributed, etc.)
- The MIT license is short, to the point and optimised for software developers as it offers flexibility.
Attribution
Content of this episode was adapted from:
- https://carpentries-incubator.github.io/fair-bio-practice/
Key Points
A license is a promise not to sue - therefore attach license files
For data use Creative Commons Attribution (CC BY) license
For code use open source licenses such as MIT, BSD, or Apache license
FAIR Introduction
Overview
Teaching: 5 min
Exercises: 0 min
Questions
What are the FAIR principles?
Why should I care to be FAIR?
How do I get started?
Objectives
Identify the FAIR principles
Recognize the importance of moving towards FAIR in research
Relate the components of this lesson to the FAIR principles
What is FAIR?
The FAIR principles for research data, originally published in a 2016 Nature paper, are intended as “a guideline for those wishing to enhance the reusability of their data holdings.” This guideline has subsequently been endorsed by working groups, funding bodies and institutions.
FAIR is an acronym for Findable, Accessible, Interoperable, Reusable.
- Findable: others (both human and machines) can discover the data
- Accessible: others can access the data
- Interoperable: the data can easily be used by machines or in data analysis workflows.
- Re-usable: the data can easily be used by others for new research
The FAIR principles have a strong focus on “machine-actionability”. This means that the data should be easily readable by computers (and not only by humans). This is particularly relevant for working with and discovering new data.
What the FAIR principles are not
A standard: The FAIR principles need to be adopted and followed as much as possible by considering the research practices in your field.
All or nothing: making a dataset (more) FAIR can be done in small, incremental steps.
Open data: FAIR data does not necessarily mean openly available. For example, some data cannot be shared openly because of privacy considerations. As a rule of thumb, data should be “as open as possible, as closed as necessary.”
Tied to a particular technology or tool: there might be different tools that enable FAIR data within different disciplines or research workflows.
Why FAIR?
The original authors of the FAIR principles had a strong focus on enhancing reusability of data. This ambition is embedded in a broader view on knowledge creation and scientific exchange. If research data are easily discoverable and re-usable, this lowers the barriers to repeat, verify, and build upon previous work. The authors also state that this vision applies not just to data, but to all aspects of the research process.
What’s in it for you?
FAIR data sounds like a lot of work. Is it worth it? Here are some of the benefits:
- Funder requirements
- It makes your work more visible
- Increase the reproducibility of your work
- If others can use it easily, you will get cited more often
- You can create more impact if it’s easier for others to use your data
- …
Getting started with FAIR (climate) data
As mentioned above, the FAIR principles are intended as guidelines to increase the reusability of research data. However, how they are applied in practice depends very much on the domain and the specific use case at hand.
For the domain of climate sciences, some standards have already been developed that you can use right away. In fact, you might already be using some of them without realizing it. NetCDF files, for example, already implement some of the FAIR principles around data modeling. But sometimes you need to find your own way.
Challenge for yourself - Evaluate one of your own datasets
Pick one dataset that you’ve created or worked with recently, and answer the following questions:
- If somebody gets this dataset from you, would they be able to understand the structure and content without asking you?
- Do you know who has access to this dataset? Could somebody easily have access to this dataset? How?
- Does this dataset need proprietary software to be used?
- Does this dataset have a persistent identifier or usage licence?
Attribution
Content of this episode was adapted from:
- @@(https://esciencecenter-digital-skills.github.io/Lesson-FAIR-Data-Climate/)
Key Points
The FAIR principles state that data should be Findable, Accessible, Interoperable, and Reusable.
FAIR data enhance the impact, reuse, and transparency of research.
FAIRification is an ongoing effort across many different fields.
FAIR principles are a set of guiding principles, not rules or standards.
Findable
Overview
Teaching: 8 min
Exercises: 2 min
Questions
What is a persistent identifier or PID?
What types of PIDs are there?
Objectives
Explain what globally unique, persistent, resolvable identifiers are and how they make data and metadata findable
Articulate what metadata is and how metadata makes data findable
Articulate how metadata can be explicitly linked to data and vice versa
Understand how and where to find data discovery platforms
Articulate the role of data repositories in enabling findable data
For data & software to be findable:
F1. (meta)data are assigned a globally unique and eternally persistent identifier or PID
F2. data are described with rich metadata
F3. (meta)data are registered or indexed in a searchable resource
F4. metadata specify the data identifier
Persistent identifiers (PIDs) 101
A persistent identifier (PID) is a long-lasting reference to a (digital or physical) resource:
- Designed to provide access to information about a resource even if the resource it describes has moved location on the web
- Requires technical infrastructure, governance, and community commitment to provide persistence
- There are many different PIDs available for many different types of scholarly resources e.g. articles, data, samples, authors, grants, projects, conference papers and so much more
Different types of PIDs
PIDs have community support, organizational commitment and technical infrastructure to ensure persistence of identifiers. They often are created to respond to a community need. For instance, the International Standard Book Number or ISBN was created to assign unique numbers to books, is used by book publishers, and is managed by the International ISBN Agency. Another type of PID, the Open Researcher and Contributor ID or ORCID (iD) was created to help with author disambiguation by providing unique identifiers for authors. The ODIN Project identifies additional PIDs along with Wikipedia’s page on PIDs.
Digital Object Identifiers (DOIs)
The DOI is a common identifier used for academic, professional, and governmental information such as articles, datasets, reports, and other supplemental information. The International DOI Foundation (IDF) is the agency that oversees DOIs. Crossref and DataCite are two prominent not-for-profit registries that provide services to create, or mint, DOIs. Both have membership models where their clients are able to mint DOIs distinguished by their prefix. For example, DataCite features a statistics page where you can see registrations by members.
Anatomy of a DOI
A DOI has three main parts:
- Proxy or DOI resolver service
- Prefix which is unique to the registrant or member
- Suffix, a unique identifier assigned locally by the registrant to an object
In the example DOI above, the prefix is used by the Australian National Data Service (ANDS), now the Australian Research Data Commons (ARDC), and the suffix is a unique identifier for an object at Griffith University. DataCite provides DOI display guidance so that DOIs are easy to recognize and use, for both humans and machines.
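To make this anatomy concrete, here is a small Python sketch that splits a DOI URL into its resolver, prefix, and suffix. The DOI used is hypothetical and only for illustration.

```python
# Minimal sketch: splitting a DOI URL into its three parts.
# The DOI below (10.1234/example-5678) is hypothetical.
from urllib.parse import urlparse

def split_doi(doi_url):
    """Return (resolver, prefix, suffix) for a DOI URL."""
    parsed = urlparse(doi_url)
    resolver = f"{parsed.scheme}://{parsed.netloc}/"    # proxy / DOI resolver service
    prefix, _, suffix = parsed.path.lstrip("/").partition("/")
    return resolver, prefix, suffix                     # prefix = registrant, suffix = local object ID

print(split_doi("https://doi.org/10.1234/example-5678"))
# ('https://doi.org/', '10.1234', 'example-5678')
```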
Exercise 4
HydroShare is a data repository for water data from a variety of hydrologic and environmental disciplines. It allows researchers to share and access water data for research. Visit the HydroShare resource search at Discover. Choose any dataset by clicking on the link. Now use Ctrl + F or Command + F and search for ‘http’. Did the author use DOIs or persistent links from HydroShare for their data and software?
Solution
Authors will often link to platforms such as GitHub where they have shared their software and/or they will link to their website where they are hosting the data used in the paper. The danger here is that platforms like GitHub and personal websites are not permanent. Instead, authors can use repositories to deposit and preserve their data and software while minting a DOI. Links to software sharing platforms or personal websites might move but DOIs will always resolve to information about the software and/or data. See DataCite’s Best Practices for a Tombstone Page.
Rich Metadata
More and more services are using common schemas such as DataCite’s Metadata Schema or Dublin Core to foster greater use and discovery. A schema provides an overall structure for the metadata and describes core metadata properties. While DataCite’s Metadata Schema is more general, there are discipline specific schemas such as Data Documentation Initiative (DDI) and Darwin Core.
Thanks to schemas, the process of adding metadata has been standardized to some extent, but there is still room for error. For instance, DataCite reports that the number of links between papers and data is still very low. Publishers and authors are missing this opportunity.
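As a rough illustration of what a schema-structured record looks like, here is a minimal sketch loosely modelled on DataCite’s mandatory properties. The field names are simplified and all values (DOI, author, title) are invented.

```python
# A sketch of a minimal, schema-like metadata record (loosely modelled on
# DataCite's mandatory properties). All values are invented for illustration.
minimal_record = {
    "identifier": {"identifierType": "DOI", "identifier": "10.1234/example-5678"},  # hypothetical DOI
    "creators": [{"name": "Doe, Jane", "nameIdentifier": "0000-0000-0000-0000"}],   # e.g. an ORCID iD
    "titles": [{"title": "Streamflow observations, example catchment, 2018-2020"}],
    "publisher": "HydroShare",
    "publicationYear": 2021,
    "resourceType": {"resourceTypeGeneral": "Dataset"},
}
```

Because the structure is predictable, repositories and search engines can index and link such records automatically.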
Challenges: automatic ORCID profile updates when a DOI is minted; RelatedIdentifiers linking papers, data, and software in Zenodo.
Connecting research outputs
DOIs are everywhere. Examples of resources that can carry a PID include:
- resource IDs (articles, data, software, …)
- researcher IDs
- organisation IDs, funder IDs
- project IDs
- instrument IDs
- ship cruise IDs
- physical sample IDs, DMP IDs
- videos, images, 3D models, grey literature
See https://support.datacite.org/docs/connecting-research-outputs and, for the current state of linking research outputs, https://blog.datacite.org/citation-analysis-scholix-rda/.
Provenance?
Provenance refers to the data lineage (inputs, entities, systems, etc.) that ultimately affects validation and credibility. A researcher should follow good scientific practice and be clear about what should get a PID (and what should not). Metadata are central to visibility and citability, so the metadata behind a PID should be provided with care. The policies behind a PID system ensure persistence on the web: at the very least, the metadata will remain available for a long time. Machine readability will be an essential part of future discoverability, so resources should be checked and formats adjusted where possible. Metrics (e.g. altmetrics) are also supported by PID systems.
Publishing behaviour of researchers
According to:
Technische Informationsbibliothek (TIB) (conducted by engage AG) (2017): Questionnaire and Dataset of the TIB Survey 2017 on information procurement and publishing behavior of researchers in the natural sciences and engineering. Technische Informationsbibliothek (TIB). DOI: https://doi.org/10.22000/54
- responses from 1400 scientists in the natural sciences & engineering (across Germany)
- 70% of the researchers are using DOIs for journal publications
- less than 10% use DOIs for research data
- 56% answered that they don’t know about the option to use DOIs for other publications (datasets, conference papers, etc.)
- 57% stated no need for DOI counselling services
- 40% of the questioned researchers need more information
- 30% cannot see a benefit from a DOI
Choosing the right repository
Some things to check:
- Ask your colleagues & collaborators for help in determining the right repository for your research
- That data are kept safe in a secure environment and data are regularly backed up and preserved (long-term) for future use
- Data can be easily discovered by search engines and included in online catalogues
- Intellectual property rights and licencing of data are managed
- Access to data can be administered and usage monitored
- That visibility of data can be enhanced to enable more use and citation
The decision for or against a specific repository depends on various criteria, e.g.
- Data quality
- Discipline
- Institutional requirements
- Reputation (researcher and/or repository)
- Visibility of research
- Legal terms and conditions
- Data value (FAIR Principles)
Some recommendations:
- look for the usage of PIDs
- look for the usage of standards (DataCite, Dublin Core, discipline-specific metadata)
- look for the licences offered
- look for certifications (DSA / Core Trust Seal, DINI/nestor, WDS, …)
Searching re3data (https://www.re3data.org/): of more than 2115 repository systems listed in re3data.org in July 2018, only 809 (less than 39%!) stated that they provide a PID service, with 524 of them using the DOI system.
Search open access repositories: http://v2.sherpa.ac.uk/opendoar/
FAIRsharing databases: https://fairsharing.org/databases/
Data Journals
Another method available to researchers to cite and give credit to research data is to author works in data journals, or to use supplemental approaches offered by publishers, societies, disciplines, and/or journals.
Articles in data journals allow authors to:
- Describe their research data (including information about process, qualities, etc)
- Explain how the data can be reused
- Improve discoverability (through citation/linking mechanisms and indexing)
- Provide information on data deposit
- Allow for further (peer) review and quality assurance
- Offer the opportunity for further recognition and awards
Examples:
- Nature Scientific data - published by Nature and established in 2013
- Geoscience Data Journal - published by Wiley and established in 2012
- Journal of Open Archaeology Data - published by Ubiquity and established in 2011
- Biodiversity Data Journal - published by Pensoft and established in 2013.
- Earth System Science Data - published by Copernicus Publications and established in 2009
Also, the following study discusses data journals in depth and reviews over 100 data journals: Candela, L. , Castelli, D. , Manghi, P. and Tani, A. (2015), Data Journals: A Survey. J Assn Inf Sci Tec, 66: 1747-1762. doi:10.1002/asi.23358
How does your discipline share data
Does your discipline have a data journal? Or some other mechanism to share data? For example, the American Astronomical Society (AAS) via the publisher IOP Physics offers a supplement series as a way for astronomers to publish data.
Adapted from: Library Carpentry. September 2019. https://librarycarpentry.org/lc-fair-research.
Key Points
Findable means findable long-term. This requires persistent identifiers (PIDs).
DOIs are one of the more common PIDs and can be used to persistently identify software and datasets.
Accessible
Overview
Teaching: 8 min
Exercises: 2 min
Questions
What is a protocol?
What types of protocol are FAIR?
Objectives
Understand what a protocol is
Understand authentication protocols and their role in FAIR
Articulate the value of landing pages
Explain closed, open and mediated access to data
For data & software to be accessible:
A1. (meta)data are retrievable by their identifier using a standardized communications protocol
A1.1 the protocol is open, free, and universally implementable
A1.2 the protocol allows for an authentication and authorization procedure, where necessary
A2. metadata remain accessible, even when the data are no longer available
What is a protocol?
Simply put, a protocol is an access method for exchanging data over a computer network. Each protocol has its own rules for how data is formatted, compressed, and checked for errors. Research repositories often use the OAI-PMH or REST API protocols to interface with data in the repository. The following image from TutorialEdge.net: What is a RESTful API by Elliot Forbes provides a useful overview of how RESTful interfaces work:
HydroShare offers a REST API that enables many of the functions available through the web user interface to be performed programmatically:
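As a minimal sketch of what “programmatically” means here, the Python snippet below queries the public /hsapi/ interface with the requests library. The exact endpoint path, query parameter, and response field names are assumptions and should be checked against HydroShare’s REST API documentation.

```python
# Sketch: listing public HydroShare resources via the REST API.
# Endpoint path, query parameter, and response field names are assumptions;
# consult the HydroShare API documentation for the authoritative interface.
import requests

BASE = "https://www.hydroshare.org/hsapi"

response = requests.get(f"{BASE}/resource/", params={"full_text_search": "streamflow"})
response.raise_for_status()

for resource in response.json().get("results", []):
    print(resource.get("resource_id"), "-", resource.get("resource_title"))
```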
Wikipedia has a list of commonly used network protocols, but check the documentation of the service you are using for the protocols it supports and whether they correspond with the FAIR principles. For instance, see HydroShare’s API instructions page.
Contributor information
Alternatively, for sensitive/protected data, if the protocol cannot guarantee secure access, an e-mail or other contact information of a person/data manager should be provided, via the metadata, with whom access to the data can be discussed. The DataCite metadata schema includes contributor type and name as fields where contact information is included. Collaborative projects such as THOR, FREYA, and ODIN are working towards improving the interoperability and exchange of metadata such as contributor information.
Author disambiguation and authentication
Across the research ecosystem, publishers, repositories, funders, and research information systems have recognized the need to address the problem of author disambiguation. The illustrative example below of the many variations of the name Jens Åge Smærup Sørensen demonstrates the challenge of wrangling the correct name for each individual author or contributor:
Thankfully, a number of research systems are now integrating ORCID into their authentication systems. Zenodo provides the login ORCID authentication option. Once logged in, your ORCID will be assigned to your authored and deposited works.
Exercise to create a HydroShare account
- Register for HydroShare.
- You will receive a confirmation email. Click the link in the email…
- Go to HydroShare and select Log in.
Understanding whether something is open, free, and universally implementable
Exercise 5
ORCID features a principles page where we can assess where it lies on the spectrum of these criteria. Can you identify statements that speak to these conditions: open, free, and universally implementable?
Solution
- ORCID is a non-profit that collects fees from its members to sustain its operations
- The Creative Commons CC0 1.0 Universal (CC0) license releases its data into the public domain, or otherwise grants permission to use it for any purpose
- It is open to any organization and transcends borders
Follow-up questions:
- Where can you download the freely available data?
- How does ORCID solicit community input outside of its governance?
- Are the tools used to create, read, update, and delete ORCID data open?
Tombstones, a very grave subject
There are a variety of reasons why a placeholder with metadata, or tombstone, for a removed research object exists, including but not limited to staff removal, spam, a request from the owner, or the data centre no longer existing. A tombstone page is needed when data and software are no longer accessible. A tombstone page communicates that the record is gone, why it is gone, and, in case you really must know, where a copy of the metadata for the record is kept. A tombstone page should include: the DOI, the date of deaccession, the reason for deaccession, a message explaining the data centre’s policies, and a note that a copy of the metadata is kept for record-keeping purposes, as well as checksums of the files.
DataCite offers statistics where the failure to resolve DOIs after a certain number of attempts is reported (see the DataCite statistics support page for more information). In the case of Zenodo and the GitHub issue above, the hidden field reveals thousands of records that are the result of spam.
If a DOI is no longer available and the data center does not have the resources to create a tombstone page, DataCite provides a generic tombstone page.
See the following tombstone examples:
- Zenodo tombstone: https://zenodo.org/record/1098445
- Figshare tombstone: https://figshare.com/articles/Climate_Change/1381402
Adapted from: Library Carpentry. September 2019. https://librarycarpentry.org/lc-fair-research.
Key Points
Research repositories often use the OAI-PMH or REST API protocols.
Interoperable
Overview
Teaching: 8 min
Exercises: 0 min
Questions
What does interoperability mean?
What is a controlled vocabulary, a metadata schema and linked data?
How do I describe data so that humans and computers can understand?
Objectives
Explain what makes data and software (more) interoperable for machines
Identify widely used metadata standards for research, including generic and discipline-focussed examples
Explain the role of controlled vocabularies for encoding data and for annotating metadata in enabling interoperability
Understand how linked data standards and conventions for metadata schema documentation relate to interoperability
For data & software to be interoperable:
I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation
I2. (meta)data use vocabularies that follow FAIR principles
I3. (meta)data include qualified references to other (meta)data
What is interoperability for data and software?
Shared understanding of concepts, for humans as well as machines.
What does it mean to be machine readable vs human readable?
According to the Open Data Handbook:
Human Readable
“Data in a format that can be conveniently read by a human. Some human-readable formats, such as PDF, are not machine-readable as they are not structured data, i.e. the representation of the data on disk does not represent the actual relationships present in the data.”
Machine Readable
“Data in a data format that can be automatically read and processed by a computer, such as CSV, JSON, XML, etc. Machine-readable data must be structured data. Compare human-readable. Non-digital material (for example printed or hand-written documents) is by its non-digital nature not machine-readable. But even digital material need not be machine-readable. For example, consider a PDF document containing tables of data. These are definitely digital but are not machine-readable because a computer would struggle to access the tabular information - even though they are very human readable. The equivalent tables in a format such as a spreadsheet would be machine readable. As another example scans (photographs) of text are not machine-readable (but are human readable!) but the equivalent text in a format such as a simple ASCII text file can machine readable and processable.”
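A short sketch makes the contrast concrete: structured data such as CSV can be parsed directly into rows and typed values, whereas the same table embedded in a PDF or a scanned image cannot. The file content below is made up for illustration.

```python
# Machine-readable data: a CSV string parses straight into rows and columns.
# The equivalent table inside a PDF or a scanned image would not.
import csv
import io

csv_text = "station,date,discharge_m3s\nA1,2020-01-01,3.2\nA1,2020-01-02,2.9\n"

for row in csv.DictReader(io.StringIO(csv_text)):
    print(row["station"], row["date"], float(row["discharge_m3s"]))
```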
Software should use community-accepted standards and platforms, making it possible for other users to run it. See the Top 10 FAIR things for research software.
Describing data and software with shared, controlled vocabularies
See
- https://librarycarpentry.org/Top-10-FAIR//2018/12/01/research-data-management/#thing-8-controlled-vocabulary
- https://librarycarpentry.org/Top-10-FAIR//2019/09/06/astronomy/#thing-6-terminology
- https://librarycarpentry.org/Top-10-FAIR//2018/12/01/historical-research/#thing-6-controlled-vocabularies-and-ontologies
Representing knowledge in data and software
Beyond the PDF
Publishers, librarians, researchers, developers, and funders have all been working towards a future where we can move beyond the PDF, from “static and disparate data and knowledge representations to richly integrated content which grows and changes the more we learn.” Research objects of the future will capture all aspects of scholarship (hypotheses, data, methods, results, presentations, etc.) that are semantically enriched, interoperable, and easily transmitted and comprehended, supporting attribution, evaluation, archiving, and impact. See https://sites.google.com/site/beyondthepdf/
Beyond the PDF has since grown into FORCE11, working towards a vision where research moves from document-based to knowledge-based information flows: semantic descriptions of research data and their structures, and the aggregation, development, and teaching of subject-specific vocabularies, ontologies, and knowledge graphs. See the Paper of the Future (https://www.authorea.com/users/23/articles/8762-the-paper-of-the-future) and Jupyter Notebooks/Stencila (https://stenci.la/).
Making Metadata Interoperable
- provide machine-readable (meta)data using a well-established formalism
- provide metadata that are as precise and complete as possible
- be aware that metrics to evaluate the FAIRness of a controlled vocabulary / ontology / thesaurus often do not (yet) exist
- clearly identify relationships between datasets in the metadata (e.g. “is new version of”, “is supplement to”, “relates to”, etc.)
- request support for these tasks from the repositories in your field of study
- for software: follow established code style guides
Examples of dataset interoperability:
- Automatic ORCID profile update when a DOI is minted (DataCite – Crossref – ORCID)
If others can use your code, convey the meaning of updates with semantic versioning (SemVer.org, CC BY 3.0): “version number [changes] convey meaning about the underlying code” (Tom Preston-Werner).
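As a sketch of what a semantic version number (MAJOR.MINOR.PATCH) communicates, the snippet below compares two illustrative version strings and reports what kind of change a user of the code should expect.

```python
# Semantic versioning sketch: MAJOR.MINOR.PATCH. The version strings are illustrative.
def parse(version):
    return tuple(int(part) for part in version.split("."))

old, new = parse("1.4.2"), parse("2.0.0")
if new[0] > old[0]:
    print("Major bump: expect breaking changes to the public interface.")
elif new[1] > old[1]:
    print("Minor bump: new, backwards-compatible functionality.")
else:
    print("Patch bump: backwards-compatible bug fixes only.")
```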
Linked Data
- Top 10 FAIR things: Linked Open Data
- Standards: https://fairsharing.org/standards/
- schema.org: http://schema.org/
- ISA framework: ‘Investigation’ (the project context), ‘Study’ (a unit of research) and ‘Assay’ (analytical measurement) - https://isa-tools.github.io/
- Example of schema.org in use: rOpenSci/codemetar
- Modularity: http://bioschemas.org
- CodeMeta crosswalks to other standards: https://codemeta.github.io/crosswalk/
- DCAT: https://www.w3.org/TR/vocab-dcat/
- Use community-accepted code style guidelines such as PEP 8 for Python (PEP 8 itself is FAIR)
- Scholix - related identifiers - Zenodo example linking data/software to papers: https://dliservice.research-infrastructures.eu/#/ and https://authorcarpentry.github.io/dois-citation-data/01-register-doi.html
Key Points
Understand that FAIR is about both humans and machines understanding data.
Interoperability means choosing a data format or knowledge representation language that helps machines to understand the data.
Reusable
Overview
Teaching: 8 min
Exercises: 0 min
Questions
What makes data reusable?
Objectives
Explain machine readability in terms of file naming conventions and providing provenance metadata
Explain how data citation works in practice
Understand key components of a data citation
Explore domain-relevant community standards including metadata standards
Understand how proper licensing is essential for reusability
Know about some of the licenses commonly used for data and software
Exercise 6: Thanks, but no thanks!
In groups discuss:
- Have you ever received data you couldn’t use? Why or why not?
- Have you tried replicating an experiment, yours or someone else’s? What challenges did you face?
For data & software to be reusable:
R1. (meta)data have a plurality of accurate and relevant attributes
R1.1 (meta)data are released with a clear and accessible data usage licence
R1.2 (meta)data are associated with their provenance
R1.3 (meta)data meet domain-relevant community standards
File naming best practices
A file name should be unique, consistent, and descriptive. This allows for increased visibility and discoverability and can be used to easily classify and sort files. Remember, a file name is the primary identifier of the file and its contents.
Do’s and Don’ts of file naming:
Do’s:
- Make use of file naming tools for bulk naming such as Ant Renamer, RenameIT or Rename4Mac.
- Create descriptive, meaningful, easily understood names no less than 12-14 characters.
- Use identifiers to make it easier to classify types of files, e.g. Int1 (interview 1)
- Make sure the 3-letter file format extension is present at the end of the name (e.g. .doc, .xls, .mov, .tif)
- If applicable, include versioning within file names
- For dates use the ISO 8601 standard: YYYY-MM-DD and place at the end of the file number UNLESS you need to organise your files chronologically.
- For experimental data files, consider using the project/experiment name and conditions in abbreviations
- Add a README file in your top directory which details your naming convention, directory structure, and abbreviations
- When combining elements in a file name, use common letter case patterns such as kebab-case, CamelCase, or snake_case; preferably use hyphens (-) or underscores (_)
Don’ts:
- Avoid naming files/folders after individual persons, as it impedes handover and data sharing.
- Avoid long names
- Avoid using spaces, dots, commas and special characters (e.g. ~ ! @ # $ % ^ & * ( ) ; < > ? , [ ] { })
- Avoid repetition, e.g. if the directory is named Electron_Microscopy_Images, the files inside don’t need to be named ELN_MI_Img_20200101.img
Examples:
- Stanford Libraries guidance on file naming is a great place to start.
- Dryad example:
- 1900-2000_sasquatch_migration_coordinates.csv
- Smith-fMRI-neural-response-to-cupcakes-vs-vegetables.nii.gz
- 2015-SimulationOfTropicalFrogEvolution.R
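As a small sketch of how such a convention can be applied consistently, the helper below combines an ISO 8601 date, an experiment abbreviation, and a condition using underscores; the project and condition names are made up for illustration.

```python
# Sketch: generating consistent file names (ISO 8601 date + abbreviations + underscores).
# The project and condition names are invented for illustration.
from datetime import date

def make_filename(project, condition, replicate, extension):
    today = date.today().isoformat()              # ISO 8601: YYYY-MM-DD
    return f"{today}_{project}_{condition}_rep{replicate:02d}.{extension}"

print(make_filename("frogsim", "control", 3, "csv"))
# e.g. 2024-05-01_frogsim_control_rep03.csv
```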
Directory structures and README files
A clear directory structure will make it easier to locate files and versions and this is particularly important when collaborating with others. Consider a hierarchical file structure starting from broad topics to more specific ones nested inside, restricting the level of folders to 3 or 4 with a limited number of items inside each of them.
The UK Data Service offers an example of directory structure and naming: https://ukdataservice.ac.uk/manage-data/format/organising.aspx
For others to reuse your research, it is important to include a README file and to organize your files in a logical way. Consider the following file structure examples from Dryad:
It is also good practice to include README files to describe how the data was collected, processed, and analyzed. In other words, README files help others correctly interpret and reanalyze your data. A README file can include file names/directory structure, glossary/definitions of acronyms/terms, description of the parameters/variables and units of measurement, report precision/accuracy/uncertainty in measurements, standards/calibrations used, environment/experimental conditions, quality assurance/quality control applied, known problems, research date information, description of relationships/dependencies, additional resources/references, methods/software/data used, example records, and other supplemental information.
- Dryad README file example: https://doi.org/10.5061/dryad.j512f21p
- Awesome README list (for software): https://github.com/matiassingers/awesome-readme
- Different format types: https://data.library.virginia.edu/data-management/plan/format-types/
Disciplinary Data Formats
Many disciplines have developed formal metadata standards that enable the re-use of data; however, these standards are not universal, and background knowledge is often required to identify, contextualize, and interpret the underlying data. Interoperability between disciplines remains a challenge because of the continued use of custom metadata schemes and the development of new, incompatible standards. Thankfully, DataCite provides a common, overarching metadata standard across disciplinary datasets, albeit at a generic rather than granular level.
In the meantime, the Research Data Alliance (RDA) Metadata Standards Directory - Working Group developed a collaborative, open directory of metadata standards, applicable to scientific data, to help the research community learn about metadata standards, controlled vocabularies, and the underlying elements across the different disciplines, to potentially help with mapping data elements from different sources.
Metadata Standards Directory
Features: Standards, Extensions, Tools, and Use Cases
Quality Control
Quality control is a fundamental step in research: it ensures the integrity of the data, can affect its use and reuse, and is required in order to identify potential problems.
It is therefore essential to outline how data quality will be controlled at the various stages of the project (data collection, digitisation or data entry, checking, and analysis).
Versioning
In order to keep track of changes made to a file or dataset, versioning can be an efficient way to see who did what and when; in collaborative work this is especially useful.
A version control strategy will allow you to easily identify the most current/final version and to organize, manage, and record any edits made while working on the document/data, through drafting, editing, and analysis.
Consider the following practices:
- Identify the master file and major milestone versions, for instance: original, pre-review, 1st revision, 2nd revision, final revision, submitted.
- Outline a strategy for archiving and storing: where to store the minor and major versions, and how long you will retain them.
- Maintain a record of file locations, a good place is in the README files
Example: UK Data Service version control guide: https://www.ukdataservice.ac.uk/manage-data/format/versioning.aspx
Research vocabularies
- Research Vocabularies Australia: https://vocabs.ands.org.au/
- AGROVOC & VocBench: http://aims.fao.org/vest-registry/vocabularies/agrovoc
- Dimensions Fields of Research: https://dimensions.freshdesk.com/support/solutions/articles/23000012844-what-are-fields-of-research-
- Versioning/SHA: https://swcarpentry.github.io/git-novice/reference
- Binder - an executable environment, making your code immediately reproducible by anyone, anywhere: https://blog.jupyter.org/binder-2-0-a-tech-guide-2017-fd40515a3a84
- Narrative & documentation with Jupyter Notebooks: https://www.contentful.com/blog/2018/06/01/create-interactive-tutorials-jupyter-notebooks/
Licenses: open source license usage on GitHub: https://blog.github.com/2015-03-09-open-source-license-usage-on-github-com/
A lack of licenses creates friction and uncertainty about whether outputs can be reused. See Peter Murray-Rust’s ContentMine project (“The Right to Read is the Right to Mine”), OpenMinTeD, and the Creative Commons and GitHub software licensing wizards (which highlight attribution and non-commercial options).
Useful content on licenses: TIB Hannover slides: https://docs.google.com/presentation/d/1mSeanQqO0Y2khA8KK48wtQQ_JGYncGexjnspzs7cWLU/edit#slide=id.g3a64c782ff_1_138
Resources
- Choose an open source license: https://choosealicense.com/
- 4 Simple recommendations for Open Source Software https://softdev4research.github.io/4OSS-lesson/
- Top 10 FAIR Imaging https://librarycarpentry.org/Top-10-FAIR//2019/06/27/imaging/
- Licensing your work: https://librarycarpentry.org/Top-10-FAIR//2019/06/27/imaging/#9-licensing-your-work
- The Turing Way a Guide for reproducible Research: https://the-turing-way.netlify.app/welcome
- The Open Science Training Handbook: https://open-science-training-handbook.gitbook.io/book/
- Open Licensing and file formats https://open-science-training-handbook.gitbook.io/book/open-science-basics/open-licensing-and-file-formats#6-open-licensing-and-file-formats
- DCC How to license research data https://www.dcc.ac.uk/guidance/how-guides/license-research-data
Adapted from: Library Carpentry. September 2019. https://librarycarpentry.org/lc-fair-research.
Key Points
It is possible to publish public data that does not meet FAIR standards.
Different fields have variable standards for metadata.
Metadata
Overview
Teaching: 8 min
Exercises: 14 min
Questions
What is metadata?
What do we use metadata for?
Objectives
Recognise what metadata is
Distinguish different types of metadata
Understand what makes metadata interoperable
Know how to decide what to include in metadata
(5 min teaching)
What is (or are) metadata?
Simply put, metadata is data about the data. Sound confusing? Let’s clarify: metadata is the description of your data. It allows a deeper understanding of the data and provides insight for its interpretation. Hence, your metadata should be considered as important as your data. Furthermore, metadata plays a very important role in making your data FAIR. It should be added continuously to your research data (not just at the beginning or end of a project!). Metadata can be produced in an automated way (e.g. when you capture a microscopy image, the accompanying software usually saves metadata as part of the file) or manually.
Let’s take a look at an example:
This is a confocal microscopy image of a C. elegans nematode strain used as a proteostasis model (pretty, isn’t it?). The image is part of the raw data associated with Goya et al., 2020, which was deposited in a public OMERO server (Project, Figure1 set). Figure credits: María Eugenia Goya. What information can you get from the image, without the associated description (metadata)?
Let’s see the associated metadata of the image and the dataset to which it belongs:
Image metadata
Name: OP50 D10Ad_06.czi
Image ID: 3485
Owner: Maria Eugenia Goya
ORCID: 0000-0002-5031-2470
Acquisition Date: 2018-12-12 17:53:55
Import Date: 2020-04-30 22:38:59
Dimensions (XY): 1344 x 1024
Pixels Type: uint16
Pixels Size (XYZ) (µm): 0.16 x 0.16 x 1.00
Z-sections/Timepoints: 56 x 1
Channels: TL DIC, TagYFP
ROI Count: 0
Tags: time course; day 10; adults; food switching; E. coli OP50; NL5901; C. elegans
Dataset metadata
Name: Figure2_Figure2B
Dataset ID: 263
Owner: Maria Eugenia Goya
ORCID: 0000-0002-5031-2470
Description: The dataset contains a time course of α-syn aggregation in NL5901 C. elegans worms after a food switch at the L4 stage:
- E. coli OP50 to OP50: Day 01, Day 03, Day 05, Day 07, Day 10, Day 13 adults
- E. coli OP50 to B. subtilis PXN21: Day 01, Day 03, Day 05, Day 07, Day 10, Day 13 adults
Images were taken at 6 developmental timepoints (D1Ad, D3Ad, D5Ad, D7Ad, D10Ad, D13Ad)
* Some images contain more than one nematode.
Each image contains ~30 (or more) Z-sections, 1 µm apart. The TagYFP channel is used to follow the alpha-synuclein particles. The TL DIC channel is used to image the whole nematode head.
These images were used to construct Figure 2B of the Cell Reports paper (https://doi.org/10.1016/j.celrep.2019.12.078).
Creation date: 2020-04-30 22:16:39
Tags: protein aggregation; time course; E. coli OP50 to B. subtilis PXN21; food switching; E. coli OP50; 10.1016/j.celrep.2019.12.078; NL5901; C. elegans
This is a lot of information!
Types of metadata
According to How to FAIR we can distinguish between three main types of metadata:
- Administrative metadata: data about a project or resource that are relevant for managing it; E.g. project/resource owner, principal investigator, project collaborators, funder, project period, etc. They are usually assigned to the data, before you collect or create them.
- Descriptive or citation metadata: data about a dataset or resource that allow people to discover and identify it; E.g. authors, title, abstract, keywords, persistent identifier, related publications, etc.
- Structural metadata: data about how a dataset or resource came about, but also how it is internally structured. E.g. the unit of analysis, collection method, sampling procedure, sample size, categories, variables, etc. Structural metadata have to be gathered by the researchers according to best practice in their research community and will be published together with the data.
Descriptive and structural metadata should be added continuously throughout the project.
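As a sketch of how the three types differ in practice, here is a made-up metadata record for a hypothetical dataset, grouped by type; every value is invented for illustration.

```python
# Sketch: administrative, descriptive, and structural metadata for a hypothetical dataset.
# All values are invented for illustration.
dataset_metadata = {
    "administrative": {
        "project": "Example catchment monitoring",
        "principal_investigator": "Doe, Jane",
        "funder": "Example Funding Agency",
        "project_period": "2020-2023",
    },
    "descriptive": {
        "title": "Daily streamflow observations, example catchment",
        "keywords": ["streamflow", "hydrology", "time series"],
        "identifier": "10.1234/example-5678",    # hypothetical DOI
    },
    "structural": {
        "collection_method": "Automated gauging station, 15-minute intervals",
        "variables": {"discharge": "m3/s", "water_temperature": "degC"},
        "sample_size": "3 stations, 3 years of daily aggregates",
    },
}
```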
Exercise 7: Identifying metadata types (4 min)
Here we have an Excel spreadsheet that contains project metadata for a made-up experiment on plant metabolites. Figure credits: Tomasz Zielinski and Andrés Romanowski
In groups, identify different types of metadata (administrative, descriptive, structural) present in this example.
Solution
- Administrative metadata marked in blue
- Descriptive metadata marked in orange
- Structural metadata marked in green
Figure credits: Tomasz Zielinski and Andrés Romanowski
(6 min teaching)
Where does data end and metadata start?
What is “data” and what is “metadata” can be a matter of perspective: Some researchers’ metadata can be other researchers’ data.
For example, a funding body is typically categorised as administrative metadata; however, it can also be used to calculate the number of public datasets per funder and then to compare the effects of different funders’ policies on open practices.
Adding metadata to your experiments
Good metadata are crucial for assuring the re-usability of your outcomes. Adding metadata manually is a very time-consuming process, so metadata should be collected incrementally throughout your experiment.
As we saw, metadata can take many forms: from something as simple as including a ReadMe.txt file, through metadata embedded inside Excel files, to domain-specific metadata standards and formats.
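Even a minimal ReadMe-style description saved next to the data already captures the essentials. A small sketch (all names and values below are made up):

```python
# Sketch: write a minimal ReadMe-style metadata file next to a dataset.
# All names and values are made up for illustration.
from pathlib import Path

readme = """Dataset: drought_metabolites_2024
Purpose: metabolite concentrations in Arabidopsis under drought stress
Collected by: A. Researcher, 2024-03-15
Folders: raw/ (LC-MS output), processed/ (normalised tables)
Units: concentrations in nmol per g fresh weight
Contact: a.researcher@example.org
"""

Path("ReadMe.txt").write_text(readme)
```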
But,
- What should be included in metadata?
- What terms should be used in descriptions?
For many assay methods and experiment types, there are defined recommendations and guidelines called Minimal Information Standards.
Minimal Information Standard
A minimum information standard is a set of guidelines for reporting data derived by the relevant methods in the biosciences. If followed, it ensures that the data can be easily verified, analysed and clearly interpreted by the wider scientific community. Keeping to these recommendations also facilitates the development of structured databases, public repositories and data analysis tools. Individual minimum information standards are developed by communities of cross-disciplinary specialists focused on the issues of the specific method used in experimental biology.
Minimum Information for Biological and Biomedical Investigations (MIBBI) is a collection of the best-known standards.
FAIRsharing offers an excellent search service for finding standards.
Exercise 7: Minimal information standard example (2 min)
Look at the Minimum Information about a Neuroscience Investigation (MINI): Electrophysiology (Gibson, F. et al., Nat Prec, 2008), which contains recommendations for reporting the use of electrophysiology in a neuroscience study.
(Neuroscience (or neurobiology) is the scientific study of the nervous system.) Scroll to the Reporting requirement section and decide which of points 1-8 are:
- a) important for understanding and reuse of data
- b) important for technical replication
- c) could be applied to other experiments in neuroscience
Solution
Possible answers:
- a) 2, 3, 4, 5, 6, 8a-b
- b) 3, 7
- c) 2, 3, 4, 5, 6
What if there are no metadata standards defined for your data / field of research?
Think about the minimum information that someone else (from your lab or from any other lab in the world) would need to know to be able to work with your dataset without any further input from you.
Think as a consumer of your data, not the producer!
Exercise 8: What to include - discussion (2 minutes)
Think of the data you generate in your projects, and imagine you are going to share them.
What information would another researcher need to understand or reproduce your data (the structural metadata)?
For example, we believe that any dataset should have:
- a name/title
- its purpose or experimental hypothesis
Write down and compare your proposals. Can we find some common elements?
Solution
Some typical elements are:
- biological material, e.g. Species, Genotypes, Tissue type, Age, Health conditions
- biological context, e.g. specimen growth, entrainment, sample preparation
- experimental factors and conditions, e.g. drug treatments, stress factors
- primers, plasmid sequences, cell line information, plasmid construction
- specifics of data acquisition
- specifics of data processing and analysis
- definition of variables
- accompanying code, software used (including version number), parameters applied, statistical tests used, seed for randomisation
- LOT numbers
Metadata and FAIR guidelines
Metadata provides extremely valuable information for us and others to be able to interpret, process, reuse and reproduce the research data it accompanies.
Because metadata are data about data, all of the FAIR principles, i.e. Findable, Accessible, Interoperable and Reusable, apply to metadata.
Ideally, metadata should not only be machine-readable, but also interoperable so that they can interlink or be reasoned about by computer systems.
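As an illustration, machine-readable and interoperable metadata often reuse a shared vocabulary such as schema.org. The sketch below builds a minimal schema.org Dataset description for the microscopy example above; the licence and keyword values are assumed for illustration:

```python
# Sketch of machine-readable dataset metadata using the schema.org
# "Dataset" vocabulary; the licence and keywords are assumed values.
import json

metadata = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Alpha-synuclein aggregation time course in C. elegans",
    "creator": {
        "@type": "Person",
        "name": "Maria Eugenia Goya",
        "identifier": "https://orcid.org/0000-0002-5031-2470",
    },
    "citation": "https://doi.org/10.1016/j.celrep.2019.12.078",
    "license": "https://creativecommons.org/licenses/by/4.0/",   # assumed licence
    "keywords": ["protein aggregation", "C. elegans", "food switching"],
}

print(json.dumps(metadata, indent=2))
```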
Attribution
Content of this episode was adapted from:
Key Points
Metadata provides contextual information so that other people can understand the data.
Metadata is key for data reuse and complying with FAIR guidelines.
Metadata should be added incrementally throughout the project.
Public repositories
Overview
Teaching: 8 min
Exercises: 2 min
Questions
Where can I deposit datasets?
What are general data repositories?
How to find a repository?
Objectives
See the benefits of using research data repositories.
Differentiate between general and specific repositories.
Find a suitable repository.
What are research data repositories?
(13 min teaching) Research data repositories are online repositories that enable the preservation, curation and publication of research ‘products’. These repositories are mainly used to deposit research ‘data’. However, the scope of the repositories is broader as we can also deposit/publish ‘code’ or ‘protocols’ (as we saw with protocols.io).
There are general “data agnostic” repositories, for example:
Or domain specific, for example:
- UniProt protein data,
- GenBank sequence data,
- MetaboLights metabolomics data
- GitHub for code.
- Hydroshare for water data.
Research outputs should be submitted to discipline/domain-specific repositories whenever it is possible. When such a resource does not exist, data should be submitted to a ‘general’ repository. Research data repositories are a key resource to help in data FAIRification as they assure Findability and Accessibility.
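As a sketch of the Accessibility side, public metadata in a repository such as Hydroshare can usually be retrieved over a standard web protocol. The endpoint path, response fields and resource identifier below are assumptions; check the current HydroShare REST API documentation before relying on them:

```python
# Minimal sketch: fetch the public metadata of a HydroShare resource over
# its REST API.  The endpoint path and response fields are assumptions;
# consult the current HydroShare API docs before relying on them.
import requests

resource_id = "4ab2bf4b407f4d2aa66ead0c163307a7"   # hypothetical resource identifier
url = f"https://www.hydroshare.org/hsapi/resource/{resource_id}/sysmeta/"

response = requests.get(url, timeout=30)
response.raise_for_status()
print(response.json())   # title, creators, dates, sharing status, ...
```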
Exercise 9: Public general record (8 min)
Have a look at the following record for a dataset in the Hydroshare repository: Hydroshare. What elements make it FAIR?
Solution
The elements that make this deposit FAIR are:
Findable (persistent identifiers, easy to find data and metadata):
- F1. (Meta)data are assigned a globally unique and persistent identifier - YES
- F2. Data are described with rich metadata (defined by R1 below)- YES
- F3. Metadata clearly and explicitly include the identifier of the data they describe - YES
- F4. (Meta)data are registered or indexed in a searchable resource - YES
Accessible (the (meta)data are retrievable by their identifier using standard web protocols):
- A1. (Meta)data are retrievable by their identifier using a standardised communications protocol - YES
- A2. Metadata are accessible, even when the data are no longer available - YES
Interoperable (The format of the data should be open and interpretable for various tools):
- I1. (Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation. - YES
- I2. (Meta)data use vocabularies that follow FAIR principles - PARTIALLY
- I3. (Meta)data include qualified references to other (meta)data - YES
Reusable (data should be well-described so that they can be replicated and/or combined in different settings, with a clear licence stating the conditions for reuse):
- R1. (Meta)data are richly described with a plurality of accurate and relevant attributes - YES
- R1.1. (Meta)data are released with a clear and accessible data usage license - YES
- R1.2. (Meta)data are associated with detailed provenance - YES
- R1.3. (Meta)data meet domain-relevant community standards - YES/PARTIALLY
(4 min discussion)
Minimal data set
A minimal data set consists of the data required to replicate all study findings reported in the article, as well as related metadata and methods:
- The values behind the means, standard deviations and other measures reported;
- The values used to build graphs;
- The points extracted from images for analysis.
(There is no need to share raw data if the standard in the field is to share data that have been processed.)
How do we choose a research data repository?
(3 min teaching) As a general rule, your research data should be deposited in a discipline/data-specific repository. If no specific repository can be found, then you can use a generalist repository. Having said this, there are a huge number of data repositories to choose from, and choosing one can be time-consuming and challenging. So how do you go about finding a repository?
- Check the publisher’s / funder’s recommended list of repositories, some of which can be found below:
- Check FAIRsharing recommendations
- Alternatively, check the Registry of Research Data Repositories - re3data
Exercise 10: Finding a repository (5 min + 4 min discussion).
a) Find a repo for genomics data.
b) Find a repo for microscopy data.
Note to instructor: FAIRsharing gives a few options; people may give different answers, so follow up on why they selected particular ones.
Solution
a) GEO/SRA and ENA/ArrayExpress are good examples. Interestingly, these repositories do not issue DOIs.
b) IDR is a good example.
(6 min teaching)
A list of UoE BioRDM’s recommended data repositories can be found here.
What comes first? the repository or the metadata?
Finding a repository first may help in deciding what metadata to collect and how!
Extra features
It is also worth considering that some repositories offer extra features, such as running simulations or providing visualisation. For example, FAIRDOMhub can run model simulations and has project structures. Do not forget to take this into account when choosing your repository. Extra features might come in handy.
Can GitHub be cited?
To make your code repositories easier to reference in academic literature, you can create persistent identifiers for them. In particular, you can use the data archiving tool in Zenodo to archive a GitHub repository and issue a DOI for it.
Evaluating a research data repository
You can evaluate repositories using the following criteria:
- who is behind it, what is its funding
- quality of interaction: is the interaction for purposes of data deposit or reuse efficient, effective and satisfactory for you?
- take-up and impact: what can I put in it? Is anyone else using it? Will others be able to find stuff deposited in it? Is the repository linked to other data repositories so I don’t have to search there as well? Can anyone reuse the data? Can others cite the data, and will depositing boost citations to related papers?
- policy and process: does it help you meet community standards of good practice and comply with policies stipulating data deposit?
Resources
- An interesting take can be found in Peter Murray-Rust’s blog post Criteria for successful repositories.
Attribution
Content of this episode was adapted or inspired by:
Key Points
Repositories are the main means for sharing research data.
You should use a data-type-specific repository whenever possible.
Repositories are the key players in data reuse.
Exercises
Overview
Teaching: 0 min
Exercises: 10 min
Questions
What makes this dataset FAIR?
Objectives
Analyze a dataset to see if it is FAIR.
(10 min exercise)
Exercise 11: What aspects of this dataset are FAIR? (10 minutes)
At a bare minimum, any dataset can probably benefit from having the information below available:
- a name/title
- its purpose or experimental hypothesis
Analyze the below dataset from HydroShare. Hydroshare: Annual soil moisture predictions across conterminous United States using remote sensing and terrain analysis across 1 km grids (1991-2016)
Use the ARDC FAIR self assessment tool
Solution
Solutions will probably contain the following:
- Findable: mostly FAIR
- Accessible: mostly FAIR
- Interoperable: mostly FAIR
- Reusable: mostly FAIR
Exercise 12: What aspects of this dataset are FAIR? (10 minutes)
Analyze the below dataset from HydroShare. Long-term, gridded standardized precipitation index for Hawai‘i
Use the ARDC FAIR self assessment tool
Solution
Solutions will probably contain the following:
- Findable: mostly FAIR
- Accessible: mostly FAIR
- Interoperable: mostly FAIR
- Reusable: mostly FAIR
Exercise 13: What aspects of this dataset are FAIR? (10 minutes)
Analyze the below dataset from HydroShare. ‘Ike Wai: Groundwater Chemistry - Nutrient Data
Use the ARDC FAIR self assessment tool
Solution
Solutions will probably contain the following:
- Findable: mostly FAIR
- Accessible: mostly FAIR
- Interoperable: mostly FAIR
- Reusable: mostly FAIR
Attribution
Content of this episode was adapted from:
Key Points
A spectrum exists for FAIR data sharing.
Ethics
Overview
Teaching: 8 min
Exercises: 0 min
Questions
What ethical considerations are there when making data public?
Objectives
Understand how privacy, freedom, explainability, and fairness factor into managing data ethics.
Technology poses ethical challenges, and there are several areas of research in data ethics. This episode will serve as a primer for them; additional reading is included.
- Privacy
- Freedom
- Explainability
- Fairness
Privacy: what control do people have over the data collected about them? For example, if a person has a BRCA mutation that increases their risk of cancer, should their health insurer be able to raise the rate of their policy?
Freedom: do people have the freedom to share data about themselves without fear of consequences?
Explainability: can the underlying algorithmic processes be explained or tested?
Do you, or does anyone you know, believe their phone is listening to them?
- Companies that implement predictive analytics are drifting towards needing to prove a negative about privacy to consumers (that they really are following their privacy policies), because their predictive analytics are so good.
Fairness: machine learning algorithms often find new, unexpected connections. Such algorithms are useful because they can interpret data at a scale a human cannot; they can improve the fairness of decisions, or they can exacerbate existing biases. Ascertaining when and how machine learning systems introduce bias into decision-making processes presents a significant new challenge in developing these tools.
“While we don’t promise equal outcomes, we have strived to deliver equal opportunity.” –Barack Obama
- Humans do not entirely agree on what is fair.
The choice of algorithmic model impacts the output:
- Low-impact decision, high volume - e.g. Facebook advertising
- High-impact decision, low volume - e.g. medical testing
Anti-discrimination laws cover machine learning algorithms and, even if variables related to protected-class status are excluded, these algorithms can still produce disparate impacts. Such impacts can be measured if the variables that they use correlate with both the output variable and a variable for protected-class status. Even unintentional discrimination results in legal risk.
Both public and private entities should conduct disparate impact assessments. Software developers should perform disparate impact analyses before publishing or using their algorithms.
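One simple, commonly used heuristic for such an assessment is to compare selection rates between groups. The sketch below uses made-up numbers and the informal "four-fifths rule" threshold of 0.8, which is a rule of thumb rather than a legal standard:

```python
# Minimal sketch of a disparate impact check using selection-rate ratios.
# Numbers are made up; the 0.8 threshold follows the common "four-fifths
# rule" heuristic, not a legal standard.
def selection_rate(selected, total):
    return selected / total

group_a = selection_rate(selected=48, total=100)   # e.g. reference group
group_b = selection_rate(selected=30, total=100)   # e.g. protected group

ratio = group_b / group_a
print(f"selection-rate ratio: {ratio:.2f}")
if ratio < 0.8:
    print("Potential disparate impact - investigate the model and its inputs.")
```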
Resources
- https://www.brookings.edu/research/fairness-in-algorithmic-decision-making/
Key Points
Privacy, freedom, explainability, and fairness all go into managing data ethics.
Security
Overview
Teaching: 8 min
Exercises: 0 min
Questions
What measures will you take to secure your data?
Objectives
Discuss steps and changes in your habits you will take after learning about data security.
Today, we will approach two different aspects of data security.
- Securing data from hostile people and groups that would compromise the system.
- Securing data from general threats such as power outages, fires, floods, and hardware failure.
Securing data from hostile groups: Types of attacks
- Malware: Malicious software. Malware is activated when a user clicks on a malicious link or attachment, which leads to installing dangerous software. Cisco reports that malware, once activated, can:
  - Block access to key network components (ransomware)
  - Install additional harmful software
  - Covertly obtain information by transmitting data from the hard drive (spyware)
  - Disrupt individual parts, making the system inoperable
- Emotet: The Cybersecurity and Infrastructure Security Agency (CISA) describes Emotet as “an advanced, modular banking Trojan that primarily functions as a downloader or dropper of other banking Trojans. Emotet continues to be among the most costly and destructive malware.”
- Denial of Service: A denial of service (DoS) is a type of cyber attack that floods a computer or network so it can’t respond to requests. A distributed DoS (DDoS) does the same thing, but the attack originates from a computer network. Cyber attackers often use a flood attack to disrupt the “handshake” process and carry out a DoS. Several other techniques may be used, and some cyber attackers use the time that a network is disabled to launch other attacks. A botnet is a type of DDoS in which millions of systems can be infected with malware and controlled by a hacker, according to Jeff Melnick of Netwrix, an information technology security software company. Botnets, sometimes called zombie systems, target and overwhelm a target’s processing capabilities. Botnets are in different geographic locations and hard to trace.
- Man in the Middle: A man-in-the-middle (MITM) attack occurs when hackers insert themselves into a two-party transaction. After interrupting the traffic, they can filter and steal data, according to Cisco. MITM attacks often occur when a visitor uses an unsecured public Wi-Fi network. Attackers insert themselves between the visitor and the network, and then use malware to install software and use data maliciously.
- Phishing: Phishing attacks use fake communication, such as an email, to trick the receiver into opening it and carrying out the instructions inside, such as providing a credit card number. “The goal is to steal sensitive data like credit card and login information or to install malware on the victim’s machine,” Cisco reports.
- SQL Injection: A Structured Query Language (SQL) injection is a type of cyber attack that results from inserting malicious code into a server that uses SQL. When infected, the server releases information. Submitting the malicious code can be as simple as entering it into a vulnerable website search box (the standard defence, parameterised queries, is sketched after this list).
- Password Attacks: With the right password, a cyber attacker has access to a wealth of information. Social engineering is a type of password attack that Data Insider defines as “a strategy cyber attackers use that relies heavily on human interaction and often involves tricking people into breaking standard security practices.” Other types of password attacks include accessing a password database or outright guessing.
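As referenced above, the standard defence against SQL injection is to keep the query text and user-supplied values separate with parameterised queries. A minimal sketch using Python's built-in sqlite3 module (the table and data are made up):

```python
# Minimal sketch: avoid SQL injection with parameterised queries.
# Table and data are made up for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (id INTEGER, name TEXT)")
conn.execute("INSERT INTO samples VALUES (1, 'control')")

user_input = "control' OR '1'='1"   # a typical injection attempt

# Vulnerable pattern: pasting user input directly into the SQL string.
# query = f"SELECT * FROM samples WHERE name = '{user_input}'"

# Safe pattern: the driver passes the value separately from the query text.
rows = conn.execute("SELECT * FROM samples WHERE name = ?", (user_input,)).fetchall()
print(rows)   # [] - the injection attempt is treated as a literal string
```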
An Arms Race
Cyber security practices continue to evolve as the internet and digitally dependent operations develop and change. Data storage on devices such as laptops and cellphones makes it easier for cyber attackers to find an entry point into a network through a personal device. For example, in the May 2019 book Exploding Data: Reclaiming Our Cyber Security in the Digital Age, former U.S. Secretary of Homeland Security Michael Chertoff warns of a pervasive exposure of individuals’ personal information, which has become increasingly vulnerable to cyber attacks. Increases in both data use and data sharing are contributing to cybersecurity threats, as volume and connectedness both create new areas to exploit.
Strategies to Prevent Hostile Groups from Stealing Data:
- Complex Passwords
- Setting up two-factor authentication (2FA)
- Choosing a reputable data repository
Securing Data from Other Natural Events
Data loss can happen for many reasons, such as fires, floods, and hardware failure. The main strategy for preventing this is data redundancy, where data is stored in multiple locations.
Large datasets will often span multiple hard disks as data sizes grow, while magnetic hard drives approach a storage density of about 1 terabit per square inch (thought to be the superparamagnetic limit). Storage arrays are often made redundant; redundancy is typically created using various types of RAID configurations.
- RAID 1 is an exact copy (or mirror) of a set of data on two or more disks.
- RAID 5 consists of block-level striping with distributed parity and requires that all drives but one be present to operate. Upon failure of a single drive, subsequent reads can be calculated from the distributed parity such that no data is lost. RAID 5 requires at least three disks.
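The trade-off between redundancy and usable capacity is easy to see with a quick calculation (the disk count and size below are illustrative):

```python
# Quick illustration of usable capacity vs. redundancy (sizes in TB).
disk_size_tb = 4
n_disks = 4

raid1_usable = disk_size_tb                    # every disk holds the same mirrored copy
raid5_usable = (n_disks - 1) * disk_size_tb    # one disk's worth of space goes to parity

print(f"RAID 1 with {n_disks} disks: {raid1_usable} TB usable, {n_disks - 1} redundant copies")
print(f"RAID 5 with {n_disks} disks: {raid5_usable} TB usable, tolerates 1 disk failure")
```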
Small datasets and files are often stored on laptops and computer hard drives. While storage on these devices is more reliable than ever, the data could be lost in the event of damage to the computer. Data storage is affordable; recreating data is not! Important files should never be kept in only one place. Cloud storage can be useful for this (a sketch for verifying copies with checksums follows the list below). Some cloud storage services:
- Dropbox
- GSuite
- OneDrive
- iCloud
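Whichever copy strategy you use, it helps to verify that copies are intact. A minimal sketch comparing SHA-256 checksums of an original file and its backup (the file paths are hypothetical):

```python
# Minimal sketch: check that a backup copy matches the original file by
# comparing SHA-256 checksums.  File paths are hypothetical.
import hashlib
from pathlib import Path

def sha256(path):
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()

original = Path("data/results.csv")
backup = Path("backup/results.csv")

if sha256(original) == sha256(backup):
    print("Backup verified: checksums match.")
else:
    print("Checksums differ - the backup may be incomplete or corrupted.")
```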
Resources
- RAID
Attribution
The BioRDM team has a lot of information about the course material taught here on their BioRDM wiki.
Key Points
There are simple steps to help make your data more secure
Implement these steps to avoid data loss