First crawling of the Slovenian National web domain *.si: pitfalls, obstacles and challenges

dc.audience: Audience::Preservation and Conservation Section
dc.conference.sessionType: Preservation and Conservation with Information Technology
dc.conference.venue: Cape Town International Convention Centre
dc.congressWLIC: IFLA WLIC 2015 - Cape Town, South Africa
dc.contributor.author: Kragelj, Matjaž
dc.contributor.author: Kovačič, Mitja
dc.date.accessioned: 2025-09-24T08:22:29Z
dc.date.available: 2025-09-24T08:22:29Z
dc.date.issued: 2015
dc.description.abstract: The National and University Library (NUK) has been archiving the web for almost fifteen years. Over the last six years, we have worked at several levels of harvesting. For most of that time, we have harvested selected web sites that might be significant for future generations. The harvesting process runs smoothly, apart from some technical difficulties caused by scripting and dynamic technologies (for instance Ajax, Flash, JavaScript, asynchronous transmissions, real-time streaming protocols, etc.). The number of archived web pages keeps growing very fast. We have also been very successful in harvesting social media web sites with tools developed at NUK. Aware that the Slovenian web is far more extensive than the part we had harvested, we decided to start harvesting the Slovenian domain (*.si). The first domain harvest was successful; however, we realized that much deeper and broader levels should be harvested using heuristic methods. Our experience showed that the most informative web content is hidden beneath the *.si domain data provided by ARNES (Academic and Research Network of Slovenia) and is therefore not accessible. The paper presents the results of the first harvesting iteration of the Slovenian web. Further, the results of the first and second harvesting iterations are compared and analysed on a sample of the first hundred domains. Finally, the relevance of the data acquired from the harvested web pages as a complementary data source for a digital library is presented. (en)
dc.identifier.citation: -
dc.identifier.relatedurl: http://conference.ifla.org/ifla81
dc.identifier.uri: https://repository.ifla.org/handle/20.500.14598/5548
dc.language.iso: eng
dc.rights: Attribution 3.0 Unported
dc.rights.accessRights: open access
dc.rights.uri: https://creativecommons.org/licenses/by/3.0/
dc.subject.keyword: Web archiving
dc.subject.keyword: harvesting
dc.subject.keyword: national domain
dc.subject.keyword: social networks harvesting
dc.subject.keyword: digital library
dc.title: First crawling of the Slovenian National web domain *.si: pitfalls, obstacles and challenges (en)
dc.type: Article
ifla.Unit: Section:Preservation and Conservation Section
ifla.oPubId: https://library.ifla.org/id/eprint/1191/

Files

Original bundle

Name: 090-kragelj-en.pdf
Size: 975.23 KB
Format: Adobe Portable Document Format