biodiversity1: The semantic linkage: a new tool for data liberation

Building on existing mechanisms, and thanks to the BiCIKL project, a new tool has been developed by PLAZI, a non-profit organisation that supports and promotes the development of persistent and openly accessible digital taxonomic literature. As a partner to BiCIKL, PLAZI has led the development of new tool that uses the process of semantic tagging and enrichment to extract data from Global Biodiversity Information Facility (GBIF) as the big global aggregator of biodiversity data, and link them to publications uploaded in Zenodo. The use of this trusted repository built and operated by CERN and OpenAIRE ensures that all scientific outcomes are easily and immediately available openly, since every upload is assigned a Digital Object Identifier (DOI), to make them citable and trackable.

The integration between PLAZI and GBIF involves making taxonomic names and associated information in PLAZI articles discoverable and linkable to the larger biodiversity dataset available on GBIF.

In this module of the course, we will guide you through the process of setting up through several stages:

1. Preparing data extraction

Understanding the linking process
Getting to know the computer program used
Using GGI for individual extraction

2. Performing data extraction

To guide for realizing an Individual extraction
To place queries, produce statistics, and analyze Data reuse
To benefit from the Matching Service

3. Improving the tool’s results

By using Quality Control checks
And enhancing annotations

1. Preparing data extraction

To go through the process of becoming prepared to extract data from publications, several steps need to be undertaken and fully acknowledged by any user:

understanding the linking process
getting familiar the software

Understanding the linking process

Here’s how the linking process typically works:

Semantic tagging:

PLAZI semantically tags taxonomic names, bibliographic references, and other relevant information in taxonomic articles. This involves using unique identifiers for taxonomic entities, ensuring that each name is machine-readable and globally identifiable.

Use of Persistent identifiers:

Persistent identifiers, such as Digital Object Identifiers (DOIs), are assigned to taxonomic treatments published on PLAZI. These identifiers serve as unique and permanent links to the content.

Integration with GBIF Taxonomic Backbone:

PLAZI links taxonomic names to the GBIF Backbone Taxonomy. The GBIF Backbone Taxonomy is a standardized, global taxonomy that provides a hierarchical structure for organizing biodiversity data. Linking to the backbone taxonomy allows for consistency and interoperability across different biodiversity datasets.

Cross-Referencing with GBIF Datasets:

Taxonomic names in PLAZI are cross-referenced with the extensive biodiversity datasets available on GBIF. This includes information on species occurrences, distribution maps, and other relevant data.

Enrichment of Taxonomic Information:

PLAZI enriches taxonomic information by associating it with additional data available on GBIF. This may include distribution maps, occurrence records, and other ecological or taxonomic details.

Provision of Links in PLAZI Articles:

In PLAZI articles, links to GBIF and other relevant biodiversity databases are provided alongside taxonomic names. These links allow users to access additional information on the respective species or taxa from the broader biodiversity context.

Data Accessibility:

By linking to GBIF, PLAZI contributes to the accessibility and discoverability of taxonomic information. Researchers, policymakers, and the public can easily navigate from taxonomic treatments in PLAZI to more extensive biodiversity datasets on GBIF.

Collaboration and Data Exchange:

PLAZI actively collaborates with GBIF and other biodiversity initiatives to facilitate the exchange of data. This collaboration ensures that taxonomic information is integrated into the broader context of global biodiversity research.

Overall, the linking process between PLAZI and GBIF is a key step in promoting data interoperability and accessibility in the field of biodiversity-related science. It allows for the seamless integration of taxonomic information published in PLAZI with the extensive biodiversity datasets maintained by GBIF, creating a more comprehensive and interconnected knowledge base.

Getting familiar the software

In order to successfully use the software, you will first need to gain some basic knowledge on how the GoldenGate Imagine – GGI program functions. This is the computer program on which the data liberation process is built, to better make your queries to the system and use it meaningfully for your research work.

To install and learn how to successfully perform basic functions in GGI, click on this link below for the tutorial made by the Plazi team:

Tutorial: Getting used to GGI and its tools and functions

2. Performing data extraction

Extracting data can be performed at different levels, by carrying out individual extraction from a single paper using the tool GGI, by placing different queries to the system and produce analytical outcomes and reports, and by using the matching service to link material citations to specimens.

Individual extraction

Glossary GGI

Once the software is installed, you are ready to use it. At this point, there are several steps to follow to extract annotations and attributes from a paper using GGI.

The tutorial below will guide you on how to extract annotations and attributes from a paper using GGI. Firstly, it shows the good practices you may take into consideration before starting an individual extraction. After that, you might be ready to start extracting information!

Tutorial: Steps to extract annotations and attributes from a paper using GGI

In short, follow these quick steps:

Detect Document Structure
Add Document Metadata
Parse Bibliography
Mark Bibliographic Citations
Mark Taxon Names
Mark Taxon Keys
Mark Treatments
Treatment Structure
Mark Treatment Citations
Mark Materials Citations
Parse Materials Citations, and finally
Select Zenodo License

Queries and Data Reuse

At this stage, it is important to realize how to query the Treatment Bank and analyze or reuse your extracted data. For that you may need to access a quite diverse array of repositories and platforms, where the different biodiversity-relevant (types of) data might be deposited, including:

Treatment Bank

Zenodo

GBIF

SynoSpecies

Ocellus

You may understand the process better by reading this file. Here is a practical exercise designed to give participants hands-on experience with the PLAZI workflow, specifically focusing on the extraction and annotation of taxonomic treatments using GoldenGATE Imagine.

Additionally, you may explore some ways to produce and analyze aggregated figures and data sets by accessing some major https://plazi.org/treatmentbank/data-and-statistics/analysis tools such as those provided by Plazi:

Article stats
Treatment stats
Treatment Details Linking statistics -TDL stats

The Matching Service

This section describes the key functionalities of the eBioDiv Matching Service that is used for linking material citations to specimens.

To facilitate the task, a semi-automatic matching approach is used. Users are presented with lists of possible matches. Once a sufficient number of matching decisions have been recorded by the users of the eBioDiv Matching Service, it will be possible to start exploring ways of gradually replacing the current semi-automatic matching approach with a fully automatic matching algorithm for data pairs with a sufficient number of data points of score high enough to warrant automatic acceptance of a match.

Furthermore, it might become possible to automatically sort out data entries that are clearly deficient and therefore unfit as an input for matching decisions. These data entries could then be relegated back to the maintainers of the data along with a request to improve the data.

As a result, fewer matching suggestions will have to be curated by humans. It will be interesting to observe how the intellectual work carried out by the users of the eBioDiv Matching Service will evolve as a result of this shift in the balance between the tasks relegated to computer algorithms and the tasks remaining with human contributors. Here’s how the service might be used in a real-world scenario:

You may access the link above to the eBioDiv Matching Service, to obtain a more detailed introduction to the matching process and to get access to detailed documentation related to the use of the Matching Service.

For new users it is highly recommended to have a close look at the following documentation to get an introduction to the matching process:

For advanced users the key steps of the process are described in the following documentation:

3. Improving the tool’s results

Quality Control

The Quality Control (QC) system is established as a machine detection mechanism to prevent wrong information from reaching the repositories and thus, avoid major errors and rolling mistakes based on misleading information. The level of errors spans from blockers to minor which indicates the impact spread on the wrong use of the data.

The following tutorial on this Plazi tool presents:

how the QC service functions, and
how to run a QC check

Building on top of the results of this service, several checklists have been produced to reflect what should not be done to avoid committing mistakes in the data processing process and the best way to correct the errors, should those appear.

To be on the safest side possible:

Document Metadata
Text Streams
Check illustrations.

Still, those mistakes occur very normally though there is always a solution to tackle those common errors that refer to:

Text (see the following screenshot as an example):

Treatments
Illustrations
Material citations
Bibliographic references
Taxon names
Taxonomic names

Enhancing annotations

To improve results and ensure the best quality possible of the annotations, you may get further in the process and obtain more details for the extracted data. At any stage of the process you may come back to the annotations and improve them in a simplified process that comprises several levels of enhancement:

Edit Annotations Attributes
Copy Annotation Attributes
Parse Materials Citations
Parse References
Parse Taxon names
Assign captions for figures and tables
Find or List Words
List annotations
Revise Block Paragraphs

This is also a service provided by PLAZI that has been built in collaboration with other infrastructures and developed under the umbrella of several projects, COST MOBILISE and BiCIKL among others

Last modified: Thursday, 9 November 2023, 11:52 AM