dataflow:general_dataflow [2025/03/24 12:48] – birgit | dataflow:general_dataflow [2025/04/15 16:43] (current) – birgit |
---|
===== Data pipeline of research data and corresponding metadata using LIB in-house-management systems (DWB, fylr, BioCase) ===== | ===== Data pipeline of research data and corresponding metadata using LIB in-house-management systems (DWB, fylr, BioCase) ===== |
| |
The [[https://www.gfbio.org/data-centers/LIB|LIB Biodiversity Data Center]] is one of the seven [[https://www.gfbio.org/data-centers|GFBio Collection Data Centers]] that are part of and form the backbone of the GFBio Submission, Repository and Archiving Infrastructure. The data archiving and publication at LIB includes the management systems [[https://diversityworkbench.net/Portal/Diversity_Workbench|Diversity Workbench]] as well as the digital asset management system [[https://fylr.io/|fylr]] and [[https://data.bolgermany.de/gbol1/metabarcoding|asv registry]] a tool to manage asv- and otu-tables . Management tools and archiving processes as done at the Datacenter are described under [[https://gfbio.biowikifarm.net/wiki/Technical_Documentations|Technical Documentations]]. This includes services for documentation, processing and archiving of the provided original data and metadata sets (source data; SIP). Data producers are welcome to use spreadsheet templates as provided under [[https://gfbio.biowikifarm.net/wiki/Forms_and_Assessments|Templates for data submission]]. | The [[https://www.gfbio.org/data-centers/LIB|LIB Biodiversity Data Center]] is one of the seven [[https://www.gfbio.org/data-centers|GFBio Collection Data Centers]] that are part of and form the backbone of the GFBio Submission, Repository and Archiving Infrastructure. The data archiving and publication at LIB includes the management systems [[https://diversityworkbench.net/Portal/Diversity_Workbench|Diversity Workbench]] as well as the digital asset management system [[https://fylr.io/|fylr]] and [[https://data.bolgermany.de/gbol1/metabarcoding|ASV-Registry]], a tool to manage asv/otu tables. Management tools and archiving processes as done at the Datacenter are described under [[https://gfbio.biowikifarm.net/wiki/Technical_Documentations|Technical Documentations]]. This includes services for documentation, processing and archiving of the provided original data and metadata sets (source data; SIP). 
Data producers are welcome to use spreadsheet templates as provided under [[https://gfbio.biowikifarm.net/wiki/Forms_and_Assessments|Templates for data submission]]. |
The workflow for submission, archiving and publication of data follows the standard for a __O__pen __A__rchival __I__nformation __S__ystem ([[https://www.iso.org/standard/57284.html|OAIS - Open archival information system]] and [[https://public.ccsds.org/pubs/650x0m2.pdf|Reference Model for an Open Archival Information System (pdf)]]). This ISO standard distinguishes between different information packages for submission (SIP), archiving (AIP), and dissemination (DIP). For an overview of ISO standards for digital archives see [[https://gfbio.biowikifarm.net/wiki/ISO_Standards_for_Digital_Archives|ISO Standards for Digital Archives]]. | The workflow for submission, archiving and publication of data follows the standard for a __O__pen __A__rchival __I__nformation __S__ystem ([[https://www.iso.org/standard/87471.html|OAIS - Open archival information system]] and [[https://ccsds.org/wp-content/uploads/gravity_forms/5-448e85c647331d9cbaf66c096458bdd5/2025/01//650x0m3.pdf|Reference Model for an Open Archival Information System (pdf)]]). This ISO standard distinguishes between different information packages for submission (SIP), archiving (AIP), and dissemination (DIP). For an overview of ISO standards for digital archives see [[https://gfbio.biowikifarm.net/wiki/ISO_Standards_for_Digital_Archives|ISO Standards for Digital Archives]]. |
| |
The different modules from Diversity Workbench for specimen occurrence data, literature, taxonomies, and others are used at LIB for data and metadata import, metadata enrichment and data quality control (see [[https://www.gfbio.org/data/tools|Tools & Workbenches for Data Management at GFBio]]). | The different modules from Diversity Workbench for specimen occurrence data, literature, taxonomies, and others are used at LIB for data and metadata import, metadata enrichment and data quality control (see [[https://www.gfbio.org/data/tools|Tools & Workbenches for Data Management at GFBio]]). |
For multimedia data, [[https://fylr.io/|fylr]] is used. All available metadata are stored for each record. | For multimedia data, [[https://fylr.io/|fylr]] is used. All available metadata are stored for each record. |
| |
For metabarcoding data and the associated ASV- or OTU-tables, [[https://data.bolgermany.de/gbol1/metabarcoding|asv-registry]] is used. | For metabarcoding data and the associated asv/otu tables, [[https://data.bolgermany.de/gbol1/metabarcoding|ASV-Registry]] is used. |
| |
Each SIP is imported into the management systems and prepared for dissemination by transforming the original research data and corresponding metadata to meet domain-specific requirements as well as requirements for data exchange, such as standards like [[https://abcd.tdwg.org/|ABCD]]. | Each SIP is imported into the management systems and prepared for dissemination by transforming the original research data and corresponding metadata to meet domain-specific requirements as well as requirements for data exchange, such as standards like [[https://abcd.tdwg.org/|ABCD]]. |
| |
; Multimedia : The Digital Asset Management System [[https://fylr.io/|fylr]] allows for uploading, curating and publishing all sorts of multimedia data, e.g. images, sound files, and documents. Entries can be cross-linked to other entries in fylr and linked to corresponding data entries in DiversityCollection. | ; Multimedia : The Digital Asset Management System [[https://fylr.io/|fylr]] allows for uploading, curating and publishing all sorts of multimedia data, e.g. images, sound files, and documents. Entries can be cross-linked to other entries in fylr and linked to corresponding data entries in DiversityCollection. |
| ; Metabarcoding data : The online web-repository [[https://data.bolgermany.de/gbol1/metabarcoding|ASV-Registry]] is used to store, manage and blast asv- or otu-tables. Entries can be linked to corresponding data entries in DiversityCollection. |
; Metabarcoding data : The online web-repository [[https://data.bolgermany.de/gbol1/metabarcoding|asv-registry]] is used to store, manage and blast asv- or otu-tables. Entries can be linked to corresponding data entries in DiversityCollection. | |
| |
; Metadata : Metadata describing data and associated multimedia are either stored together with the data entries (unit level) or handled in different management modules of DiversityWorkbench, such as DiversityProjects or DiversityAgents. The latter provide information about a set of entries, i.e. the dataset, or metadata. | ; Metadata : Metadata describing data and associated multimedia are either stored together with the data entries (unit level) or handled in different management modules of DiversityWorkbench, such as DiversityProjects or DiversityAgents. The latter provide information about a set of entries, i.e. the dataset, or metadata. |
| |
**Sensitive data**: Each of the specialized systems listed above allows data to be withheld or blurred for publication. This can be the complete entry or part of an entry, e.g. information about the exact sampling location of a specimen. All sensitive data are handled according to our [[:datapolicy|Data Policy: Data provision for upload]]. For personal data the GDPR as described in the [[:privacypolicy|LIB Privacy Policy]] applies. | **Sensitive data**: Each of the specialized systems listed above allows data to be withheld or blurred for publication. This can be the complete entry or part of an entry, e.g. information about the exact sampling location of a specimen. All sensitive data are handled according to our [[:datapolicy|Data Policy: Data provision for upload]]. For personal data the GDPR as described in the [[:privacypolicy|LIB Privacy Policy]] applies. |
| |
| |
| |
| |
| |
== Provision of versioned Datasets == | **Provision of versioned Datasets** |
| |
Datasets containing occurrence data are published by creating a snapshot from the data and metadata in DiversityWorkbench for one dataset. This is done with the external helper tool, available from: [[ | Datasets containing occurrence data are published by creating a snapshot from the data and metadata in DiversityWorkbench for one dataset. This is done with the external helper tool, available from: [[https://gitlab.leibniz-lib.de/BioCASe/vcat-transfer|LIB GitLab: VCAT-Transfer]]. All data are mapped using the [[https://wiki.bgbm.org/bps|BioCASe Provider Software]] to the [[https://archive.bgbm.org/TDWG/CODATA/Schema/ABCD_2.1/ABCD_2.1.html|ABCD 2.1 Standard]]. A Dissemination Information Package (DIP according to OAIS) is created and stored as a zip archive in the digital asset management system [[https://media.leibniz-lib.de/biocase-archives|fylr at LIB]]. Each DIP is versioned, and the version is identified by a date suffix and its version number, consisting of a major version and a minor version (e.g. 2.1). Major changes, such as the addition of further data, increment the major version. Minor changes, e.g. the correction of typing errors or changes in the metadata, increment the minor version. |
https://datacenter.LIB.de/gitlab/BioCASe/biocase_media/releases|LIB GitLab: VCAT-Transfer]]. All data are mapped using the [[https://wiki.bgbm.org/bps|BioCASe Provider Software]] to the [[https://archive.bgbm.org/TDWG/CODATA/Schema/ABCD_2.1/ABCD_2.1.html|ABCD 2.1 Standard]]. A Dissemination Information Package (DIP according to OAIS) is created and stored as a zip archive in the digital asset management system [[https://media.leibniz-lib.de/biocase-archives|fylr at LIB]]. Each DIP is versioned, and the version is identified by a date suffix and its version number, consisting of a major version and a minor version (e.g. 2.1). Major changes, such as the addition of further data, increment the major version. Minor changes, e.g. the correction of typing errors or changes in the metadata, increment the minor version. | |
| |
Datasets stored and curated in [[https://media.leibniz-lib.de|fylr]] are published from within the software. | Datasets stored and curated in [[https://media.leibniz-lib.de|fylr]] are published from within the software. |
| |
ASV-tables managed in [[https://data.bolgermany.de/gbol1/metabarcoding|asv-registry]] are published from within the application. | ASV-tables managed in [[https://data.bolgermany.de/gbol1/metabarcoding|ASV-Registry]] are published from within the application. |
| |
| |
== DOI assignment == | **DOI assignment** |
| |
For each published major version of an occurrence dataset a DOI is assigned. Datasets in fylr receive a DOI on demand. Each asv table receives a DOI when it is published. | For each published major version of an occurrence dataset a DOI is assigned. Datasets in fylr receive a DOI on demand. Each asv-table receives a DOI when it is published. |
| |
The LIB is registered at [[https://www.zbmed.de/|ZB MED]] and can therefore create a DOI at [[https://doi.datacite.org/|DataCite DOI Fabrica]]. The DOI is added to the corresponding version of the information package and is also part of the citation of the data set (see below). | The LIB is registered at [[https://www.zbmed.de/|ZB MED]] and can therefore create a DOI at [[https://doi.datacite.org/|DataCite DOI Fabrica]]. The DOI is added to the corresponding version of the information package and is also part of the citation of the data set (see below). |
| |
| |
== Citation == | **Citation** |
| |
Published datasets are citable using direct URLs to the DIP or via the DOIs. Based on the data provider's input, the LIB data curator prepares the citation of the dataset, adjusting the input (submission metadata) to conform to the GFBio citation pattern. The citation is finalized in close collaboration with the data provider. For details see the General part of [[https://gfbio.biowikifarm.net/wiki/Data_Publishing/General_part:_GFBio_publication_of_type_1_data_via_BioCASe_data_pipelines|GFBio publication of type 1 data via BioCASe data pipelines]]. | Published datasets are citable using direct URLs to the DIP or via the DOIs. Based on the data provider's input, the LIB data curator prepares the citation of the dataset, adjusting the input (submission metadata) to conform to the GFBio citation pattern. The citation is finalized in close collaboration with the data provider. For details see the General part of [[https://gfbio.biowikifarm.net/wiki/Data_Publishing/General_part:_GFBio_publication_of_type_1_data_via_BioCASe_data_pipelines|GFBio publication of type 1 data via BioCASe data pipelines]]. |
| |
Example: ''ZFMK Coleoptera Working Group (2023). ZFMK Coleoptera Oberthuer collection. [Dataset]. Version: 2.0. Data Publisher: LIB Biodiversity Datacenter. https://doi.org/10.20363/ZFMK-Coll.Oberthuer-2023-02'' | Example: ''ZFMK Ichthyology Working Group. (2024). ZFMK Ichthyology collection (Version 5) [Data set]. LIB Biodiversity Datacenter. https://doi.org/10.20363/zfmk-coll.ichthyology-2024-06'' |
| |
| |
; LIB Intranet Filesystem : Backups stored in specific folders on the LIB intranet file system are transferred to tapes in the internal tape library on a regular basis. | ; LIB Intranet Filesystem : Backups stored in specific folders on the LIB intranet file system are transferred to tapes in the internal tape library on a regular basis. |
; fylr : Multimedia files and versioned ABCD packages are stored in fylr, which has its own backup in the LIB tape library. | ; fylr : Multimedia files and versioned ABCD packages are stored in fylr, which has its own backup in the LIB tape library. |
| ; ASV-Registry : The data in ASV-Registry is regularly backed up. This backup is available as a redundant copy separate from the running production system. |
; LIB Tape Library : The generated AIPs are archived in the LIB tape library. These tapes are stored with two identical copies at two different locations in the LIB. | ; LIB Tape Library : The generated AIPs are archived in the LIB tape library. These tapes are stored with two identical copies at two different locations in the LIB. |
; Morph·D·Base : The data in MDB is regularly backed up. This backup is available as a redundant copy separate from the running production system. The backup is copied to a file server located in the LIB IT department, whereas the running system is housed within the data center of the University of Bonn. | |
| |
For detailed information about backups and recovery see [[:digital_preservation_plan|Preservation Plan]]. | For detailed information about backups and recovery see [[:digital_preservation_plan|Preservation Plan]]. |
==== Access to data via different portals ==== | ==== Access to data via different portals ==== |
| |
Indexed and faceted data are available in public portals such as GBIF, Europeana and GFBio, which are operated by national or international consortia. Specialized web portals for access to the data are developed and provided by the LIB Data Center. These include the [[https://collections.leibniz-lib.de|LIB digital collection catalogue]], the portal of the [[https://bolgermany.de|German Barcode of Life project (GBOL)]], or interfaces to the data, which also provide APIs for machine-readable formats and access to the data using CETAF stable identifiers ([[https://id.zfmk.de|id.zfmk.de]], or [[https://id.zmh-coll.de|id.zmh-coll.de]]). | Indexed and faceted data are available in public portals such as GBIF, Europeana and GFBio, which are operated by national or international consortia. Specialized web portals for access to the data are developed and provided by the LIB Data Center. These include the [[https://collections.leibniz-lib.de|LIB digital collection catalogue]], the portal of the [[https://bolgermany.de|German Barcode of Life project (GBOL)]], or interfaces to the data, which also provide APIs for machine-readable formats and access to the data using CETAF stable identifiers ([[https://id.zfmk.de|id.zfmk.de]], [[https://id.zmh-coll.de|id.zmh-coll.de]]) or [[https://id.zfmk.de/collection_GFBIO/|id.zfmk.de/collection_GFBIO]]. |
| |
The published data are provided with a recommended citation, license and DOI (see above). | The published data are provided with a recommended citation, license and DOI (see above). |
=== Access to published data (unit level) === | === Access to published data (unit level) === |
| |
; GFBio, VAT, and LAND : GFBio has developed a web portal that provides search functionalities for biodiversity related datasets and data. All uploaded data are annotated by GFBio's Terminology server, thus providing a richer search experience. A Visualization and Annotation Tool (VAT) allows for analysis and modelling of geo-referenced data. See General part of [[https://gfbio.biowikifarm.net/wiki/Data_Publishing/General_part:_GFBio_publication_of_type_1_data_via_BioCASe_data_pipelines|GFBio publication of type 1 data via BioCASe data pipelines]]. The "Lebendiger Atlas - Natur Deutschland (LAND)" provides an overview of Biodiversity data from Germany: [[https://land.gbif.de/|land.gbif.de]]. Data from Germany that is made available for GBIF can be found here. | ; GFBio, VAT, GBIF and LAND : GFBio has developed a web portal that provides search functionalities for biodiversity related datasets and data. All uploaded data are annotated by GFBio's Terminology server, thus providing a richer search experience. A Visualization and Annotation Tool (VAT) allows for analysis and modelling of geo-referenced data. See General part of [[https://gfbio.biowikifarm.net/wiki/Data_Publishing/General_part:_GFBio_publication_of_type_1_data_via_BioCASe_data_pipelines|GFBio publication of type 1 data via BioCASe data pipelines]]. GBIF is an international network and data infrastructure aimed at providing open access to data about all types of life on Earth [[https://www.gbif.org/|GBIF]]. The "Lebendiger Atlas - Natur Deutschland (LAND)" provides an overview of Biodiversity data from Germany: [[https://land.gbif.de/|land.gbif.de]]. Data from Germany that is made available for GBIF can be found here. |
| |
; Europeana : The multimedia data are accessible via [[https://www.europeana.eu/|Europeana]]. | ; Europeana : The multimedia data are accessible via [[https://www.europeana.eu/|Europeana]]. |
; Digital Collection Catalogue : All data based on physical vouchers within the natural history collections of LIB are accessible via the [[https://collections.leibniz-lib.de/|LIB Digital Collection Catalogue]] | ; Digital Collection Catalogue : All data based on physical vouchers within the natural history collections of LIB are accessible via the [[https://collections.leibniz-lib.de/|LIB Digital Collection Catalogue]] |
| |
; asv-registry : The online web-repository for metabarcoding data provides public access to original asv- or otu-tables and the results from blasting all data to various databases. All data are directly accessible in [[https://data.bolgermany.de/gbol1/metabarcoding|asv-registry]]. | ; ASV-Registry : The online web-repository for metabarcoding data provides public access to original asv/otu tables and the results from blasting all data to various databases. All data are directly accessible in [[https://data.bolgermany.de/gbol1/metabarcoding|ASV-Registry]]. |
| |
; fylr : The Digital Asset Management System at LIB provides access to the digital assets (i.e. multimedia, documents, zip archives) stored in fylr. They are published from within the software via [[https://media.leibniz-lib.de/|media.leibniz-lib.de]]. An API to fylr is available under: https://media.LIB.de/eaurls/ | ; fylr : The Digital Asset Management System at LIB provides access to the digital assets (i.e. multimedia, documents, zip archives) stored in fylr. They are published from within the software via [[https://media.leibniz-lib.de/|media.leibniz-lib.de]]. An API to fylr is available under: https://media.leibniz-lib.de/eaurls/ |
| |
; id.LIB.de : All occurrence data are accessible to humans and machines in HTML, JSON, or RDF format via the API at [[https://id.zfmk.de/collection_zfmk/|id.zfmk.de/collection_zfmk/]] or [[https://id.zmh-coll.de|id.zmh-coll.de/collection_zmh]]. | ; id.leibniz-lib.de : All occurrence data are accessible to humans and machines in HTML, JSON, or RDF format via the API at [[https://id.zfmk.de/collection_zfmk/|id.zfmk.de/collection_zfmk/]] or [[https://id.zmh-coll.de|id.zmh-coll.de/collection_zmh]]. |
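The occurrence data entry above names three output formats but does not say how a client selects one. The following is a minimal sketch that assumes the format is chosen through HTTP content negotiation (an ''Accept'' header), which is common for stable identifier services but is not confirmed here. The helper ''build_request'' and the media type mapping are hypothetical, and the request is only constructed, not sent.

```python
from urllib.request import Request

# Assumption: the service honours standard media types in the Accept header.
# The page only states that html, json and rdf are available.
MEDIA_TYPES = {
    "html": "text/html",
    "json": "application/json",
    "rdf": "application/rdf+xml",
}


def build_request(url: str, fmt: str) -> Request:
    """Build (but do not send) a request asking for one of the three formats."""
    if fmt not in MEDIA_TYPES:
        raise ValueError(f"unsupported format: {fmt!r}")
    return Request(url, headers={"Accept": MEDIA_TYPES[fmt]})


request = build_request("https://id.zfmk.de/collection_zfmk/", "json")
print(request.full_url, request.get_header("Accept"))
```

Passing the resulting object to ''urllib.request.urlopen'' would perform the actual lookup. If the service selects formats by another mechanism, such as a URL suffix or query parameter, only ''build_request'' needs to change.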
| |
| |
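The versioning rule described under "Provision of versioned Datasets" (major changes increment the major version, minor changes increment the minor version) can be sketched as follows. This is an illustration only, not the logic of the VCAT-Transfer tool: the class ''DipVersion'' is hypothetical, and resetting the minor version to 0 on a major increment is an assumption that the page does not state explicitly. The date suffix is left out because its format is not specified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DipVersion:
    """Major.minor version number of a DIP (hypothetical helper)."""

    major: int
    minor: int

    @classmethod
    def parse(cls, text: str) -> "DipVersion":
        # Expects the "major.minor" form used on this page, e.g. "2.1".
        major, minor = text.split(".")
        return cls(int(major), int(minor))

    def bump_major(self) -> "DipVersion":
        # Major change, e.g. further data were added.
        # Assumption: the minor version restarts at 0.
        return DipVersion(self.major + 1, 0)

    def bump_minor(self) -> "DipVersion":
        # Minor change, e.g. corrected typing errors or changed metadata.
        return DipVersion(self.major, self.minor + 1)

    def __str__(self) -> str:
        return f"{self.major}.{self.minor}"


current = DipVersion.parse("2.1")
print(current.bump_minor())  # metadata correction
print(current.bump_major())  # further data added
```

Because a DOI is assigned for each published major version of an occurrence dataset, a step like ''bump_major'' is the point at which a new DOI would be requested.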