FloraNER: A new dataset for species and morphological terms named entity recognition in French botanical text - Unité de modélisation mathématique et informatique des systèmes complexes
Journal article, Data in Brief, Year: 2024

Abstract

FloraNER is a distantly supervised named entity recognition (NER) dataset. It is built from French botanical literature extracted from the OCR-preprocessed flora of New Caledonia, provided by the National Museum of Natural History in France (MNHN), and distantly annotated with a French botanical corpus created by merging botanical lexicons available online. FloraNER comprises separate sub-datasets for recognizing plant species names, coarse-grained botanical morphological terms, and fine-grained botanical morphological terms. The resulting datasets are in CSV format and contain the textual data, the identified named entities, and their annotations. They cover one named entity type, “Species” (Espèce in French), for species name identification; two named entity types, “Organ” and “Descriptor”, for coarse-grained morphological term identification; and eight named entity types for fine-grained morphological term identification: Organ, Descriptor, Form, Color, Development, Structure, Surface, Position, Disposition, and Measure. The dataset can be used to train and evaluate named entity recognition models that extract information from French botanical literature.
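To make the idea of distant annotation with a lexicon more concrete, here is a minimal sketch in Python. It is not the authors' pipeline: the toy lexicon entries, the regex tokenizer, the greedy longest-match strategy, and the BIO tag scheme are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import re

# Hypothetical toy lexicon (term tokens -> entity type). The real FloraNER
# lexicons are merged from online botanical resources and are not reproduced here.
LEXICON = {
    ("feuilles",): "Organ",
    ("fleurs",): "Organ",
    ("nervure", "médiane"): "Organ",
    ("ovales",): "Descriptor",
    ("blanches",): "Descriptor",
}
MAX_TERM_LEN = max(len(term) for term in LEXICON)


def tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def distant_annotate(tokens):
    """Tag tokens with BIO labels using greedy longest-match lexicon lookup."""
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, then shrink it.
        for n in range(min(MAX_TERM_LEN, len(tokens) - i), 0, -1):
            key = tuple(tok.lower() for tok in tokens[i:i + n])
            label = LEXICON.get(key)
            if label is not None:
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return tags


if __name__ == "__main__":
    sentence = "Feuilles ovales, fleurs blanches."
    toks = tokenize(sentence)
    for tok, tag in zip(toks, distant_annotate(toks)):
        print(tok, tag)
```

Labels produced this way are only as good as the lexicon: terms missing from it stay untagged and ambiguous words are tagged regardless of context, which is the usual trade-off of distant supervision.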
Origin: Publisher files allowed on an open archive
Dates and versions

hal-04692857, version 1 (10-09-2024)

Cite

Ayoub Nainia, Régine Vignes-Lebbe, Eric Chenin, Maya Sahraoui, Hajar Mousannif, et al. FloraNER: A new dataset for species and morphological terms named entity recognition in French botanical text. Data in Brief, 2024, 56, pp.110824. ⟨10.1016/j.dib.2024.110824⟩. ⟨hal-04692857⟩