First crawling of the Slovenian National web domain *.si: pitfalls, obstacles and challenges

Kragelj, Matjaž; Kovačič, Mitja

dc.audience	Audience::Preservation and Conservation Section
dc.conference.sessionType	Preservation and Conservation with Information Technology
dc.conference.venue	Cape Town International Convention Centre
dc.congressWLIC	IFLA WLIC 2015 - Cape Town, South Africa
dc.contributor.author	Kragelj, Matjaž
dc.contributor.author	Kovačič, Mitja
dc.date.accessioned	2025-09-24T08:22:29Z
dc.date.available	2025-09-24T08:22:29Z
dc.date.issued	2015
dc.description.abstract	The National and University Library (NUK) has been archiving the web for almost fifteen years. Over the last six years, we have worked at several levels of harvesting. For most of that time, we have harvested selected web sites that might be significant for future generations. The harvesting process runs smoothly, apart from some technical difficulties caused by scripting and dynamic web technologies (for instance Ajax, Flash, JavaScript, asynchronous transmissions, real-time streaming protocols, etc.). The number of archived web pages keeps growing very fast. We have also been very successful in harvesting social media web sites with tools developed at NUK. Aware that the web is far more extensive than the portion we had harvested, we decided to start harvesting the Slovenian domain (.si). The first domain harvest was successful; however, we realized that much deeper and broader levels should be harvested using heuristic methods. Our experience showed that the most informative web content is hidden beneath the .si domain data provided by ARNES (Academic Research Network of Slovenia) and is therefore not accessible. The paper presents the results of the first harvesting iteration of the Slovenian web. Further, on a sample of the first hundred domains, the results of the first and second harvesting iterations are compared and analysed. Finally, the relevance of the data acquired from the harvested web pages as a complementary data source for a digital library is presented.	en
dc.identifier.citation	-
dc.identifier.relatedurl	http://conference.ifla.org/ifla81
dc.identifier.uri	https://repository.ifla.org/handle/20.500.14598/5548
dc.language.iso	eng
dc.rights	Attribution 3.0 Unported
dc.rights.accessRights	open access
dc.rights.uri	https://creativecommons.org/licenses/by/3.0/
dc.subject.keyword	Web archiving
dc.subject.keyword	harvesting
dc.subject.keyword	national domain
dc.subject.keyword	social networks harvesting
dc.subject.keyword	digital library
dc.title	First crawling of the Slovenian National web domain *.si: pitfalls, obstacles and challenges	en
dc.type	Article
ifla.Unit	Section:Preservation and Conservation Section
ifla.oPubId	https://library.ifla.org/id/eprint/1191/

Files

Original bundle

Name: 090-kragelj-en.pdf
Size: 975.23 KB
Format: Adobe Portable Document Format

Collections

World Library and Information Congress (WLIC) Papers and Presentations