From text to data inside bibliographic records. Entity recognition and entity linking of contributors and their roles from statements of responsibility

Zaragoza, Thomas; Nicolas, Yann; Le Provost, Aline

Please use this identifier to cite or link to this item: https://repository.ifla.org/handle/123456789/2075

Full metadata record

DC Field	Value	Language
dc.rights.license	CC BY 4.0	en_US
dc.contributor.author	Zaragoza, Thomas	-
dc.contributor.author	Nicolas, Yann	-
dc.contributor.author	Le Provost, Aline	-
dc.coverage.spatial	Location::France	en_US
dc.date.accessioned	2022-09-12T14:06:44Z	-
dc.date.available	2022-09-12	-
dc.date.available	2022-09-12T14:06:44Z	-
dc.date.issued	2022-09-12	-
dc.identifier.uri	https://2022.ifla.org/	-
dc.identifier.uri	https://repository.ifla.org/handle/123456789/2075	-
dc.description.abstract	Sudoc is the French higher education union catalogue, run by Abes. Like any large database (15 million records), Sudoc has quality issues that can negatively affect the user experience or database maintenance efforts, e.g. the process towards an LRM-compliant catalogue. Quality issues are diverse: data can be inaccurate, ambiguous, miscategorized, redundant, inconsistent or missing. Sometimes data are not really missing but hidden, lost in text inside the bibliographic record itself. For instance, contributor names and roles are transcribed from the document into MARC descriptive fields (statement of responsibility). Most of them have a corresponding access point that contains the normalized name and a relator code (to express the role), and optionally the identifier of an authority record. But in Sudoc, many records have contributor mentions in descriptive fields that are not identified in access points. Moreover, many access points lack a relator code. This paper will describe our efforts to extract structured information about contributors and their roles from statements of responsibility in order to automatically generate the following data in access points: last name, first name, relator code and, optionally, an identifier linking to www.idref.fr, the French higher education authority file. The first step is a named entity recognition task implemented through a machine learning (ML) approach. For the recognition of names, a pre-existing generic model (from the spaCy library) is retrained with ad hoc data, annotated by librarians through a dedicated annotation tool (Prodigy). For roles, a model is generated from scratch. The second step is an entity linking task.
The linking of contributor names is achieved with Qualinka, a logical rule-based artificial intelligence framework (LE PROVOST, 2017 IFLA conference). The linking of roles is still being debated, with a preference for either an entity linking model or a classification model over a rule-based approach. For Abes, this pipeline is a first experience in adopting machine learning and building a generic approach with the librarian in the loop.	en_US
dc.language.iso	en	en_US
dc.publisher	International Federation of Library Associations and Institutions (IFLA)	en_US
dc.relation.ispartofseries	87th IFLA World Library and Information Congress (WLIC);Satellite Meeting: Information Technology: New Horizons in Artificial Intelligence in Libraries	-
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/	en_US
dc.subject	Subject::Cataloguing	en_US
dc.subject	Subject::Access	en_US
dc.subject	Subject::Classification and indexing	en_US
dc.subject	Subject::Artificial intelligence	en_US
dc.title	From text to data inside bibliographic records. Entity recognition and entity linking of contributors and their roles from statements of responsibility	en_US
dc.type	Articles	en_US
dc.type	Events Materials	en_US
dc.rights.holder	Thomas Zaragoza, Yann Nicolas and Aline Le Provost	en_US
dc.audience	Audience::Information Technology Section	en_US
ifla.oPubId	0	en_US
ifla.Unit	Units::Section::Information Technology Section	en_US
Appears in Collections:	World Library and Information Congress (WLIC) Materials

Files in This Item:

File	Size	Format
s08-2022-zaragoza-en.pdf	709.83 kB	Adobe PDF

This item is licensed under a Creative Commons License