Between two worlds: harmonizing automated and manual term labelling

Sfakakis, Michalis; Zoutsou, Kyriaki; Papachristopoulos, Leonidas; Tsakonas, Giannis; Papatheodorou, Christos

Files

s02-2019-sfakakis-en.pdf (255.2 KB)

Date

2019

Authors

Sfakakis, Michalis

Zoutsou, Kyriaki

Papachristopoulos, Leonidas

Tsakonas, Giannis

Papatheodorou, Christos

Abstract

In an era of enormous information production, human capabilities have reached their limits. Automatic information processing that approaches human sophistication has become imperative. Information scientists have focused on developing techniques and processes that assist human contribution while improving, or at least guaranteeing, information quality. Automatic indexing techniques may rely on various approaches, which yield different results in information retrieval. In this paper, we introduce an automated methodology for subject analysis that covers both determining the aboutness of documents and translating the related concepts into system terms. Working with a corpus of articles from the Digital Library Evaluation domain, we use topic modeling algorithms to determine the aboutness of the documents, and we use the context of the words in topics, as captured by word embeddings, to translate the extracted topics into EuroVoc concepts.
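The translation step described in the abstract can be pictured with a minimal sketch. It is not the authors' implementation: it assumes that a topic is represented by the mean vector of its top words and that candidate labels are ranked by cosine similarity to that vector. The toy vectors, the candidate labels, and the names `EMBEDDINGS`, `mean_vector`, `cosine`, and `rank_labels` are all hypothetical; real input would come from a trained embedding model and the EuroVoc thesaurus.

```python
import math

# Toy 3-dimensional word vectors. The values are invented for illustration;
# a real system would load vectors from a trained word-embedding model.
EMBEDDINGS = {
    "library": [0.9, 0.1, 0.0],
    "digital": [0.8, 0.2, 0.1],
    "evaluation": [0.1, 0.9, 0.1],
    "user": [0.2, 0.8, 0.2],
    "archive": [0.9, 0.0, 0.1],
    "assessment": [0.1, 0.9, 0.0],
    "agriculture": [0.0, 0.1, 0.9],
}


def mean_vector(words):
    """Average the vectors of the words that have an embedding."""
    vectors = [EMBEDDINGS[w] for w in words if w in EMBEDDINGS]
    if not vectors:
        raise ValueError("none of the words has an embedding")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_labels(topic_words, candidate_labels):
    """Rank candidate labels by similarity to the topic's mean vector."""
    topic_vec = mean_vector(topic_words)
    scored = [
        (label, cosine(topic_vec, mean_vector(label.split())))
        for label in candidate_labels
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical topic (top words from a topic model) and candidate labels.
ranked = rank_labels(["evaluation", "user"], ["archive", "assessment", "agriculture"])
for label, score in ranked:
    print(f"{label}: {score:.3f}")
```

With these toy vectors the label closest to the topic is "assessment". Averaging word vectors is only one simple way to compare a topic with a label; the paper itself should be consulted for the method actually used.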

URI

https://repository.ifla.org/handle/20.500.14598/6718

Collections

World Library and Information Congress (WLIC) Papers and Presentations
