Between two worlds: harmonizing automated and manual term labelling
Abstract
In an era of enormous information production, human capabilities have reached their limits. Automatic information processing that is commensurate with human sophistication has become imperative. Information scientists have focused on developing techniques and processes that assist human contribution while improving, or at least guaranteeing, information quality. Automatic indexing techniques may rely on various approaches, which offer different results in information retrieval. In this paper, we introduce an automated methodology for subject analysis that covers both the determination of the aboutness of the documents and the translation of the related concepts to system terms. Focusing on a corpus of articles related to the Digital Library Evaluation domain, we use topic modeling algorithms to determine the aboutness of the documents, while the context of the words in each topic, as captured by word embeddings, is used to translate the extracted topics to EuroVoc concepts.
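The translation step described above can be illustrated with a minimal sketch. The abstract does not say how topics and concepts are compared, so the approach here (averaging the embeddings of a topic's top words and picking the concept label with the highest cosine similarity) is an assumption, not the authors' method. The vectors, topic words, and concept labels are hypothetical toy values; a real pipeline would take the topic words from a topic model and the vectors from pretrained word embeddings.

```python
import math

# Hypothetical toy embeddings; a real pipeline would load pretrained vectors.
EMBEDDINGS = {
    "evaluation": [0.9, 0.1, 0.0],
    "usability": [0.8, 0.2, 0.1],
    "user": [0.7, 0.3, 0.0],
    "metadata": [0.1, 0.9, 0.1],
    "catalogue": [0.0, 0.8, 0.2],
    "record": [0.2, 0.9, 0.0],
}

# Hypothetical stand-ins for thesaurus concepts (not actual EuroVoc entries).
CONCEPTS = {
    "assessment": [1.0, 0.0, 0.0],
    "documentation": [0.0, 1.0, 0.0],
}


def mean_vector(words, embeddings):
    """Average the vectors of the words that have an embedding."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        raise ValueError("none of the topic words has an embedding")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def label_topic(topic_words, embeddings, concepts):
    """Return the concept label closest to the centroid of the topic words."""
    centroid = mean_vector(topic_words, embeddings)
    return max(concepts, key=lambda label: cosine(centroid, concepts[label]))


# Top words of two topics, hard-coded here in place of topic model output.
topic_a = ["evaluation", "usability", "user"]
topic_b = ["metadata", "catalogue", "record"]

print(label_topic(topic_a, EMBEDDINGS, CONCEPTS))
print(label_topic(topic_b, EMBEDDINGS, CONCEPTS))
```

Averaging the word vectors treats each topic word as equally important; weighting by the topic-word probabilities would be a natural variation, again as an assumption rather than something the abstract specifies.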