From text to data inside bibliographic records. Entity recognition and entity linking of contributors and their roles from statements of responsibility

Zaragoza, Thomas; Nicolas, Yann; Le Provost, Aline

Please use this identifier to cite or link to this item: https://repository.ifla.org/handle/123456789/2075

Title:	From text to data inside bibliographic records. Entity recognition and entity linking of contributors and their roles from statements of responsibility
Authors:	Zaragoza, Thomas; Nicolas, Yann; Le Provost, Aline
Keywords:	Subject::Cataloguing; Subject::Access; Subject::Classification and indexing; Subject::Artificial intelligence
Issue Date:	12-Sep-2022
Publisher:	International Federation of Library Associations and Institutions (IFLA)
Series/Report no.:	87th IFLA World Library and Information Congress (WLIC); Satellite Meeting: Information Technology: New Horizons in Artificial Intelligence in Libraries
Abstract:	Sudoc is the French higher education union catalogue, run by Abes. Like any large database (15 million records), Sudoc has quality issues that can negatively impact the user experience or database maintenance efforts, e.g. the process towards an LRM-compliant catalogue. Quality issues are diverse: data can be inaccurate, ambiguous, miscategorized, redundant, inconsistent or missing. Sometimes data are not really missing but hidden, lost in some text inside the bibliographic record itself. For instance, contributor names and roles are transcribed from the document to MARC descriptive fields (statement of responsibility). Most of them have a corresponding access point that contains the normalized name, a relator code (to express the role) and, optionally, the identifier of an authority record. But in Sudoc, many records have contributor mentions in descriptive fields that are not identified in access points. Moreover, many access points lack a relator code. This paper describes our efforts to extract structured information about contributors and their roles from statements of responsibility in order to automatically generate the following data in access points: last name, first name, relator code and, optionally, an identifier linking to www.idref.fr, the French higher education authority file. The first step is a named entity recognition task implemented through a machine learning (ML) approach. For the recognition of names, a pre-existing generic model (from the spaCy library) is retrained with ad hoc data annotated by librarians through a dedicated annotation tool (Prodigy). For roles, a model is generated from scratch. The second step is an entity linking task. The linking of contributor names is achieved with Qualinka, a logical rule-based artificial intelligence framework (LE PROVOST, 2017 IFLA conference). The linking of roles is still being debated, with a preference for either an entity linking model or a classification model over a rule-based approach. For Abes, this pipeline is a first experience in adopting machine learning and building a generic approach with the librarian in the loop.
URI:	https://2022.ifla.org/; https://repository.ifla.org/handle/123456789/2075
Appears in Collections:	World Library and Information Congress (WLIC) Materials
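
To make the target of the pipeline described in the abstract concrete, the sketch below turns one statement of responsibility into access-point data (last name, first name, relator code, optional authority identifier). It is not the authors' method: the paper uses trained spaCy models, whereas this is a toy rule-based stand-in that only illustrates the expected input and output. The `AccessPoint` class, the `ROLE_CUES` table, the `extract_contributors` function and the sample names are all hypothetical, and the relator codes are assumed to follow the UNIMARC relator code list (070 author, 730 translator, 440 illustrator).

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AccessPoint:
    """Structured contributor data to be generated from descriptive text."""
    last_name: str
    first_name: str
    relator_code: str
    idref_id: Optional[str] = None  # filled in later by the entity linking step


# Hypothetical cue phrases mapped to assumed UNIMARC relator codes.
ROLE_CUES = [
    (re.compile(r"^traduit\b.*?\bpar\s+", re.IGNORECASE), "730"),
    (re.compile(r"^illustrations?\s+(?:de|par)\s+", re.IGNORECASE), "440"),
]
DEFAULT_RELATOR = "070"  # assumption: no cue phrase means "author"


def extract_contributors(statement: str) -> List[AccessPoint]:
    """Split a statement of responsibility on ';' and read one name per segment."""
    points = []
    for segment in statement.split(";"):
        segment = segment.strip()
        if not segment:
            continue
        relator = DEFAULT_RELATOR
        for pattern, code in ROLE_CUES:
            match = pattern.match(segment)
            if match:
                relator = code
                segment = segment[match.end():]  # keep only the name part
                break
        # Naive name split: first token is the first name, the rest the last name.
        first, _, last = segment.strip().partition(" ")
        points.append(AccessPoint(last_name=last, first_name=first, relator_code=relator))
    return points
```

Calling `extract_contributors("Jean Dupont ; traduit de l'anglais par Marie Martin ; illustrations de Paul Durand")` yields three access points, each with `idref_id` left empty for the later linking step. Fixed cue phrases like these break on the variety of real statements of responsibility, which is consistent with the abstract's stated preference for learned models over rules.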

Files in This Item:

File	Size	Format
s08-2022-zaragoza-en.pdf	709.83 kB	Adobe PDF

This item is licensed under a Creative Commons License