IRLS675 Attt's blog: Unit 7 Problem searching digital collections

Each of the digital repository platforms we have read about and/or installed recently, including Omeka, DSpace, and Drupal, is designed, among other things, to make digital collections more accessible and/or more sustainable. I'm a little concerned that the proliferation of digital repository software might be creating a profusion of digital information silos that subvert interoperability and introduce a new layer of complexity for scholarly communications. Put another way, are all of these repositories sufficiently standardized to enable interoperability and effective search and retrieval across collections? I am prompted to write this because of my experience trying to find a digital repository that is similar to my Saratoga Springs, NY local history collection and that also uses DSpace. It sounded like a simple assignment, but I found it really isn't. The DSpace.org website has a linked listing of all institutions using DSpace, but it doesn't have a search engine that searches and retrieves the deep content from those same DSpace collections. Nor, apparently, is there a comprehensive DSpace repository directory, similar to those constructed by Google or Yahoo, that permits finding content by browsing and drilling down from Broader Term to Narrower Term or Related Terms. Even if there were such a directory, it would for the most part only facilitate search at the collection level rather than at the object level.

The "Digital Library" article from Wikipedia makes this point when it notes: "Most digital libraries provide a search interface which allows resources to be found. These resources are typically deep web (or invisible web) resources since they frequently cannot be located by search engine crawlers. Some digital libraries create special pages or sitemaps to allow search engines to find all their resources."

The same point is made in our reading this week, "Digital Repositories" by Nancy John, who writes: "One thing is certain, despite the protocols and technologies, original unscientific, non-standard Web-based exhibitions published by the world’s libraries remain far more accessible to the average user than the content-rich digital repositories they are creating. This is because they are part of the indexed, spidered, crawled open Web that is accessed via search engines. Digital repositories, for the time being, belong to the vast hidden Web, found through published citations, links and personal recommendations."

Bottom line: at the individual digital object level, items or nodes are largely invisible to popular search engines. True, OAI-PMH is supposed to gather sufficient metadata to allow popular search engines to provide some degree of federated searching across numerous digital repositories, but that depends entirely on the richness and depth of the metadata created for each object. At best, OAI-PMH gathers only the metadata associated with each digital item and makes that searchable; directly searching the full text of each individual item in the digital collection is beyond the capabilities of popular search engines. Since we are far away from anything like the semantic web, images, audio, and video files remain totally dependent on metadata for discovery.
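To make the metadata dependence concrete, here is a rough Python sketch of what a harvester might do with an OAI-PMH ListRecords response in unqualified Dublin Core (oai_dc). The sample XML, the example.org identifiers, and the helper names list_records_url and build_index are all made up for illustration; a real repository's endpoint and records will differ, and this is not how any particular search engine works.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from urllib.parse import urlencode

# Standard namespaces for OAI-PMH 2.0 and unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url):
    """Build a ListRecords request URL (base_url is a hypothetical endpoint)."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": "oai_dc"})
    return base_url + "?" + query


# Made-up response with two records: one richly described, one barely described.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:123/1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Broadway street scene</dc:title>
          <dc:subject>Saratoga Springs</dc:subject>
          <dc:description>Photograph of a parade.</dc:description>
        </oai_dc:dc>
      </metadata>
    </record>
    <record>
      <header><identifier>oai:example.org:123/2</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Untitled image</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def build_index(xml_text):
    """Map each lowercased metadata word to the identifiers of records using it."""
    root = ET.fromstring(xml_text)
    index = defaultdict(set)
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        # Only Dublin Core fields are visible here; the object itself is not.
        for field in record.iterfind(".//oai_dc:dc/*", NS):
            for word in (field.text or "").lower().split():
                index[word.strip(".,;:")].add(identifier)
    return index


index = build_index(SAMPLE_RESPONSE)
print(list_records_url("http://example.org/oai/request"))
print(sorted(index.get("saratoga", set())))
```

The second record can only ever be found by the words "untitled" or "image", because that is all its metadata says. Nothing inside the photograph or document itself reaches the index.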

It seems to me that repositories using DSpace and other digital repository software are most effectively searched one at a time at the web site level, sometimes using the browse function or the sitemap feature. This approach is not very scalable and can hardly be said to be the best solution for improving scholarly communications. Some libraries expose their IR content to the campus through the OPAC, but usually that solution is limited to that campus's IR, and it searches only the metadata, not the full text.

This problem may grow in complexity and become even more intractable as the number of digital collections, and of digital objects in each collection, continues to increase. Another problem with using digital repositories as tools of interdisciplinary communication is that the content in many digital repositories is open only to authenticated users. We recently explored permissions and authentication at the OS and repository administration level, and there are of course valid reasons for this security, not the least of which is ensuring that digital content remains authentic and unaltered. But securing content with user IDs and passwords does prevent open access to those digital collections. The use of CAPTCHA technology by digital collection curators is another instance of placing speed bumps in the way of automatic or robotic harvesting of metadata and content by popular search engines, which are the tool of choice for content discovery by many students and even scholars.

For now, one way to search the many thousands of Drupal, DSpace, and Omeka digital collections is to first search what is already thoroughly and professionally indexed, namely the scholarly, peer-reviewed journals and studies found in traditional proprietary databases. Then, by perseverance and serendipity, one can locate among the persistent URIs embedded in the footnotes and references of each relevant article those links to individual digital objects that are otherwise secured behind the high walls of specific digital collections. Of course, this approach turns the digital collections movement on its head, as the digital repository is in part a response to the unacceptably high prices charged by publishers for access to journals. Using references embedded in the high-priced journal article to find the "free" digital object in the IR is paradoxical at best.


Sunday, October 10, 2010

