Title: From text to data inside bibliographic records. Entity recognition and entity linking of contributors and their roles from statements of responsibility
Authors: Zaragoza, Thomas
Nicolas, Yann
Le Provost, Aline
Keywords: Subject::Cataloguing
Subject::Classification and indexing
Subject::Artificial intelligence
Issue Date: 12-Sep-2022
Publisher: International Federation of Library Associations and Institutions (IFLA)
Series/Report no.: 87th IFLA World Library and Information Congress (WLIC);Satellite Meeting: Information Technology: New Horizons in Artificial Intelligence in Libraries
Abstract: Sudoc is the French higher education union catalogue, run by Abes. Like any large database (15 million records), Sudoc has quality issues that can degrade the user experience or complicate database maintenance efforts, e.g. the process towards an LRM-compliant catalogue. Quality issues are diverse: data can be inaccurate, ambiguous, miscategorized, redundant, inconsistent or missing. Sometimes data are not really missing but hidden, lost in some text inside the bibliographic record itself. For instance, contributor names and roles are transcribed from the document to MARC descriptive fields (statement of responsibility). Most of them have a corresponding access point that contains the normalized name and a relator code (to express the role) and, optionally, the identifier of an authority record. But in Sudoc, many records have contributor mentions in descriptive fields that are not identified in access points. Moreover, many access points lack a relator code. This paper describes our efforts to extract structured information about contributors and their roles from the statements of responsibility in order to automatically generate the following data in access points: last name, first name, relator code and, optionally, an identifier linking to the French higher education authority file. The first step is a named entity recognition task implemented through a machine learning (ML) approach. For the recognition of names, a pre-existing generic model (from the spaCy library) is employed and retrained with ad hoc data, annotated by librarians through a dedicated annotation tool (Prodigy). For roles, a model is generated from scratch. The second step is an entity linking task. The linking of contributor names is achieved with Qualinka, a logical rule-based artificial intelligence framework (LE PROVOST, 2017 IFLA conference). The linking of roles is still being debated, with a preference for either an entity linking model or a classification model over a rule-based approach. For Abes, this pipeline is a first experience in adopting machine learning and building a generic approach with the librarian in the loop.
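The target output described in the abstract (last name, first name, relator code and an optional authority identifier per contributor) can be illustrated with a minimal sketch. This is not the Abes implementation: the recognition step is stubbed with hand-placed spans instead of a trained spaCy model, the label names `PERSON` and `ROLE`, the `ROLE_TO_RELATOR` table, the naive name split and the sample statement are all assumptions made for illustration, and the relator codes shown are illustrative values only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    """An entity span as an NER step might return it (character offsets)."""
    start: int
    end: int
    label: str  # "PERSON" or "ROLE" (hypothetical label names)


@dataclass
class AccessPoint:
    """The structured data the abstract aims to generate for each contributor."""
    last_name: str
    first_name: str
    relator_code: Optional[str] = None
    authority_id: Optional[str] = None  # would be filled by the entity linking step


# Hypothetical lookup standing in for the role linking step, which the
# abstract describes as still under discussion. Codes are illustrative.
ROLE_TO_RELATOR = {"edited by": "340", "translated by": "730"}


def span_of(statement: str, text: str, label: str) -> Span:
    """Build a span by locating `text` in `statement` (stand-in for model output)."""
    start = statement.find(text)
    if start < 0:
        raise ValueError(f"{text!r} not found in statement")
    return Span(start, start + len(text), label)


def build_access_points(statement: str, spans: List[Span]) -> List[AccessPoint]:
    """Attach to each PERSON span the most recent ROLE span that precedes it."""
    points: List[AccessPoint] = []
    current_code: Optional[str] = None
    for span in sorted(spans, key=lambda s: s.start):
        text = statement[span.start:span.end]
        if span.label == "ROLE":
            current_code = ROLE_TO_RELATOR.get(text.lower())
        elif span.label == "PERSON":
            # Naive assumption: the last token is the last name.
            *first, last = text.split()
            points.append(AccessPoint(last, " ".join(first), current_code))
    return points


# Invented sample statement of responsibility.
statement = "edited by Jane Doe ; translated by John Smith"
spans = [
    span_of(statement, "edited by", "ROLE"),
    span_of(statement, "Jane Doe", "PERSON"),
    span_of(statement, "translated by", "ROLE"),
    span_of(statement, "John Smith", "PERSON"),
]
points = build_access_points(statement, spans)
for point in points:
    print(point)
```

In this sketch `authority_id` stays empty: filling it corresponds to the second step of the pipeline, the linking of contributor names to authority records.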
Appears in Collections: World Library and Information Congress (WLIC) Materials

Files in This Item:
File: s08-2022-zaragoza-en.pdf
Size: 709.83 kB
Format: Adobe PDF

This item is licensed under a Creative Commons License.