From text to data inside bibliographic records. Entity recognition and entity linking of contributors and their roles from statements of responsibility

Date

2022-09-12

Publisher

International Federation of Library Associations and Institutions (IFLA)

Abstract

Sudoc is the French higher education union catalogue, run by Abes. Like any large database (15 million records), Sudoc has quality issues that can degrade the user experience or complicate database maintenance, e.g. the move towards an LRM-compliant catalogue. These issues are diverse: data can be inaccurate, ambiguous, miscategorized, redundant, inconsistent or missing. Sometimes data are not really missing but hidden in text inside the bibliographic record itself. For instance, contributor names and roles are transcribed from the document into MARC descriptive fields (the statement of responsibility). Most of them have a corresponding access point that contains the normalized name and a relator code (expressing the role) and, optionally, the identifier of an authority record. But in Sudoc, many records have contributor mentions in descriptive fields that are not identified in access points. Moreover, many access points lack a relator code. This paper describes our efforts to extract structured information about contributors and their roles from statements of responsibility, in order to automatically generate the following data in access points: last name, first name, relator code and, optionally, an identifier linking to www.idref.fr, the French higher education authority file. The first step is a named entity recognition task implemented through a machine learning (ML) approach. For the recognition of names, a pre-existing generic model (from the spaCy library) is retrained with ad hoc data annotated by librarians using a dedicated annotation tool (Prodigy). For roles, a model is trained from scratch. The second step is an entity linking task. The linking of contributor names is achieved with Qualinka, a logical rule-based artificial intelligence framework (LE PROVOST, 2017 IFLA conference). The linking of roles is still being debated, with a preference for either an entity linking model or a classification model over a rule-based approach. For Abes, this pipeline is a first experience in adopting machine learning and building a generic approach with the librarian in the loop.
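
As a rough illustration of the target output described in the abstract, the sketch below takes entity spans of the kind a named entity recognition model might return for a statement of responsibility and turns them into access-point-like records (last name, first name, relator code, optional authority identifier). Everything in it is hypothetical and not taken from the paper: the `Span` and `AccessPoint` types, the `PER` and `ROLE` labels, the pairing rule (each name receives the most recent preceding role), the naive name splitting, and the small `ROLE_TO_RELATOR` lookup table. That table is only a stand-in for the role linking step, which the abstract says is still undecided and for which a rule-based approach is not the preferred option. The codes assume UNIMARC-style relator values (070 for author, 730 for translator).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Span:
    """One recognized entity (hypothetical shape of an NER result)."""
    text: str
    label: str  # "PER" for a contributor name, "ROLE" for a role phrase


@dataclass
class AccessPoint:
    """Structured data to generate for one contributor."""
    last_name: str
    first_name: str
    relator_code: Optional[str]
    authority_id: Optional[str] = None  # would be set by the name linking step


# Hypothetical lookup standing in for role linking.
# Assumes UNIMARC-style relator codes; not taken from the paper.
ROLE_TO_RELATOR = {
    "by": "070",             # author
    "translated by": "730",  # translator
}


def split_name(name: str) -> Tuple[str, str]:
    """Naive split: the last token is treated as the last name."""
    parts = name.split()
    return parts[-1], " ".join(parts[:-1])


def to_access_points(spans: List[Span]) -> List[AccessPoint]:
    """Attach to each name the relator code of the latest preceding role."""
    points: List[AccessPoint] = []
    current_code: Optional[str] = None
    for span in spans:
        if span.label == "ROLE":
            # Unknown role phrases yield None, i.e. a missing relator code.
            current_code = ROLE_TO_RELATOR.get(span.text.lower())
        elif span.label == "PER":
            last, first = split_name(span.text)
            points.append(AccessPoint(last, first, current_code))
    return points


# Spans as they might be recognized in "by Jane Doe ; translated by John Smith"
example_spans = [
    Span("by", "ROLE"),
    Span("Jane Doe", "PER"),
    Span("translated by", "ROLE"),
    Span("John Smith", "PER"),
]
example_points = to_access_points(example_spans)
```

In this sketch `authority_id` stays empty, since linking names to an authority record is a separate step (handled by Qualinka according to the abstract).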

Keywords

Cataloguing, Access, Classification and indexing, Artificial intelligence
