An innovative approach to scalable semantic embedding

Koopman, Rob; Wang, Shenghui
2025-09-24; 2025-09-24; 2019
https://repository.ifla.org/handle/20.500.14598/6710

Embedding words, entities and documents in compact, semantically meaningful vector spaces makes semantic similarity and relatedness computable. This could make search more intelligent and benefit other tasks conducted in libraries, such as entity disambiguation, de-duplication, clustering, recommendation and subject prediction. Deep learning models are powerful but require high computing power and careful hyperparameter tuning for optimal performance. In our quest for practical solutions to support libraries in this field, we revisit global co-occurrence-based embedding methods and propose a conceptually simple and computationally lightweight approach. Our experiments show results that are highly competitive with a few state-of-the-art embedding methods on different tasks, including the standard STS benchmark and a subject prediction task, at a fraction of the computational cost. We will show the potential of this scalable semantic embedding method for other applications such as entity disambiguation, citation recommendation, clustering and collection exploration.

eng
Attribution 4.0 International
https://creativecommons.org/licenses/by/4.0/
Article
https://2019.ifla.org/conference-programme/satellite-meetings/
open access
Semantic Embedding; Random Projection; Subject Prediction
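
The general idea named in the abstract and keywords (global co-occurrence counts compressed by a random projection, then compared by cosine similarity) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy corpus, the function names `build_cooccurrence`, `random_projection` and `cosine`, the document-level co-occurrence window, the ±1 projection matrix and the target dimension are all assumptions chosen for illustration.

```python
import numpy as np


def build_cooccurrence(docs):
    """Count, for every word pair, the number of documents containing both."""
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for d in docs:
        ids = sorted({index[w] for w in d.split()})
        for i in ids:
            for j in ids:
                if i != j:
                    counts[i, j] += 1.0
    return vocab, index, counts


def random_projection(counts, dim, seed=0):
    """Project each row to `dim` dimensions with a random +1/-1 matrix (assumed scheme)."""
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=(counts.shape[1], dim))
    return counts @ signs / np.sqrt(dim)


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0


# Hypothetical toy corpus; a real collection would be far larger.
docs = [
    "library catalog search",
    "library catalog metadata",
    "neural network training",
    "neural network embedding",
]
vocab, index, counts = build_cooccurrence(docs)
emb = random_projection(counts, dim=4)

print(cosine(emb[index["catalog"]], emb[index["metadata"]]))
print(cosine(emb[index["catalog"]], emb[index["training"]]))
```

With a vocabulary this small the projection does not actually reduce dimensionality much; the point of the sketch is only that the projection is a single matrix product with no training step, which is where the low computational cost of this family of methods comes from.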