Please use this identifier to cite or link to this item: https://repository.ifla.org/handle/123456789/2075
Full metadata record
| DC Field | Value | Language |
| --- | --- | --- |
| dc.rights.license | CC BY 4.0 | en_US |
| dc.contributor.author | Zaragoza, Thomas | - |
| dc.contributor.author | Nicolas, Yann | - |
| dc.contributor.author | Le Provost, Aline | - |
| dc.coverage.spatial | Location::France | en_US |
| dc.date.accessioned | 2022-09-12T14:06:44Z | - |
| dc.date.available | 2022-09-12 | - |
| dc.date.available | 2022-09-12T14:06:44Z | - |
| dc.date.issued | 2022-09-12 | - |
| dc.identifier.uri | https://2022.ifla.org/ | - |
| dc.identifier.uri | https://repository.ifla.org/handle/123456789/2075 | - |
| dc.description.abstract | Sudoc is the French higher education union catalogue. It is run by Abes. Like any large database (15 million records), Sudoc has some quality issues that can negatively impact the user experience or the database maintenance efforts, e.g. the process towards an LRM-compliant catalogue. Quality issues are diverse: data can be inaccurate, ambiguous, miscategorized, redundant, inconsistent or missing. Sometimes data are not really missing but hidden, lost in some text inside the bibliographic record itself. For instance, contributor names and roles are transcribed from the document to MARC descriptive fields (statement of responsibility). Most of them have a corresponding access point that contains the normalized name, a relator code (to express the role) and, optionally, the identifier of an authority record. But in Sudoc, many records have contributor mentions in descriptive fields that are not identified in access points. Moreover, many access points lack a relator code. This paper describes our efforts to extract structured information about contributors and their roles from the statements of responsibility in order to automatically generate the following data in access points: last name, first name, relator code and, optionally, an identifier linking to www.idref.fr, the French higher education authority file. The first step is a named entity recognition task implemented through a machine learning (ML) approach. For the recognition of names, a pre-existing generic model (from the spaCy library) is employed and retrained with ad hoc data, annotated by librarians through a dedicated annotation tool (Prodigy). For roles, a model is generated from scratch. The second step is an entity linking task. The linking of contributor names is achieved with Qualinka, a logical rule-based artificial intelligence framework (LE PROVOST, 2017 IFLA conference). The linking of roles is still being debated, with a preference for either an entity linking model or a classification model over a rule-based approach. For Abes, this pipeline is a first experience in adopting machine learning and building a generic approach with the librarian in the loop. | en_US |
| dc.language.iso | en | en_US |
| dc.publisher | International Federation of Library Associations and Institutions (IFLA) | en_US |
| dc.relation.ispartofseries | 87th IFLA World Library and Information Congress (WLIC); Satellite Meeting: Information Technology: New Horizons in Artificial Intelligence in Libraries | - |
| dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ | en_US |
| dc.subject | Subject::Cataloguing | en_US |
| dc.subject | Subject::Access | en_US |
| dc.subject | Subject::Classification and indexing | en_US |
| dc.subject | Subject::Artificial intelligence | en_US |
| dc.title | From text to data inside bibliographic records. Entity recognition and entity linking of contributors and their roles from statements of responsibility | en_US |
| dc.type | Articles | en_US |
| dc.type | Events Materials | en_US |
| dc.rights.holder | Thomas Zaragoza, Yann Nicolas and Aline Le Provost | en_US |
| dc.audience | Audience::Information Technology Section | en_US |
| ifla.oPubId | 0 | en_US |
| ifla.Unit | Units::Section::Information Technology Section | en_US |
Appears in Collections: World Library and Information Congress (WLIC) Materials

Files in This Item:
| File | Size | Format |
| --- | --- | --- |
| s08-2022-zaragoza-en.pdf | 709.83 kB | Adobe PDF |


This item is licensed under a Creative Commons License.