Mashing Up the Library – the Library Catalogue in Google Desktop

submitted by Art Rhyno & Ross Singer

This entry has its roots in an informal project to decouple the Library Catalogue from the Integrated Library System (ILS). The goal of the project is to create a sort of “mirror” of the library catalogue, a sustainable replica of the contents of the catalogue, while providing a conduit to dynamically pull in “state” information, such as the circulation status of an item. The plumbing for this activity involves three subtasks:

1) Exposing the Library Catalogue to desktop indexers

2) Providing a folder-based view of the catalogue via WebDAV based on the classification scheme used for the collection

3) Setting up a distributed indexing system using Lucene

The first two subprojects probably come the closest to meeting the notion of a “mashup”, and this contest entry is completely based on the work done with desktop indexers, in part because it avoids some of the tricky ILS-specific and cataloguing policy detours of the WebDAV work. The third project, which focuses almost exclusively on Lucene, is well removed from interfaces of any sort, and is designed to put the catalogue in Lucene's very flexible index format, as well as to make it malleable for use with Solr and other Lucene-based tools.

In terms of scale, desktop searching is not that far removed from the challenges found with indexing library catalogues. Desktops are now the repositories of many thousands of files, and Windows, Linux, and OS X have developed strong tools for retrieving the content that is created and stored in these environments. We have focused on Google Desktop (GDS) for this contest entry, but the approach is fairly generic and could be re-purposed for other indexing scenarios.

The Importance of Throughput

Desktop indexers are designed to work with files and folders, the building blocks for storing content in most desktop operating systems. The indexer traverses the folder structure of the desktop's hard drive(s) and indexes the files found in the folders at varying speeds and levels of specificity.

Google Desktop is a relatively recent entry to this space, and is currently only available for Windows. There are other options for desktop indexing that should be noted. For example, Copernic Desktop Search for Windows has features not found in GDS, and applications like Spotlight for OS X and Beagle for Linux are well-established options for these platforms.

Despite this, GDS is of considerable interest for those running Windows desktops. It is available as a free download, and it is easy to install. Within minutes, it is possible to have Google Desktop working away at indexing local content.


In fact, if anything, Google Desktop is too eager to index local content, and the recently introduced GDS Gadgets, which are unavoidable with the install, can quickly pass the threshold from causing casual interest to deep-seated irritation. A recommended approach is to “Pause indexing” immediately by right-clicking on the GDS icon in the taskbar as soon as it is displayed, usually seconds after the image above appears. The gadgets can be configured and removed quite easily, and apparently can be made less obtrusive.

A more intriguing and useful option than the gadgets is Google Desktop's support for Web Folders, Microsoft's in-built and oft-overlooked WebDAV option. It is possible to deliver the library catalogue as a series of Web Folders, as shown below:


Google Desktop will dutifully crawl through such a folder structure, and with some layers of optimization, this could form the basis of keeping library content up to date in GDS. In its current form, however, the Web Folder option is unworkable for all but the tiniest of library collections. The process would take weeks to complete for a large catalogue and GDS also limits how many files can be retrieved through this option.

From the beginning, Google Desktop has supported the creation of “plugins” using its own API. A plugin can be used to bypass many of the limitations that are encountered with GDS. One plugin, kongulo, is of special interest, because it was designed for adding the contents of a web site to GDS. This is valuable on its own, and if the library catalogue were exposed as a set of linked documents on the web, kongulo could obtain and index the catalogue with no modifications.

Unfortunately, the overhead of making a separate web request for each record would be prohibitive in terms of time, just as it is with Web Folders. kongulo also runs from the command line, so obtaining library records in this manner is obtrusive enough that it would require diligent monitoring for weeks. As well, kongulo is not date-aware: it reindexes whatever it is pointed at, which is more than reasonable for crawling a typical web site's content but problematic for the number of objects in a typical library catalogue.

We have constructed a system that uses a modified form of kongulo on the desktop to get around these limitations for catalogue records. The plugin works directly with an Apache Cocoon application to minimize the overhead of moving library catalogue content into Google Desktop. This has been set up so that any library that can transfer MARC records out of its existing library system can work through an implementation.

The Desktop Configuration

Google Desktop is set up to start indexing immediately upon installation, and it is much easier to track its progress with library content if you have used the option to “Pause Indexing” as soon as the GDS icon appears on the desktop. This will keep the counters set at zero until you are ready to begin indexing.

The catalogue records will be treated as “Web history” objects in our configuration, and it is possible to set the GDS “Preferences” pane to tell the indexer that this is the only content to be dealt with for now, as shown.


The modified kongulo plugin will be used to define the Web history entries that correspond to catalogue records, and requires Python 2.4 for Windows. After installing Python, kongulo needs the Win32 extensions and py2exe. When these are in place, the kongulo distribution from SourceForge can be unzipped to its own directory. Replace the kongulo.py file with this one and run the setup program:

python.exe setup.py py2exe

This will create a dist subdirectory from which kongulo can be run. Note that the original kongulo program is broken for the current version of GDS (4.2x). It is well worth exploring on its own, but some modifications are necessary for it to work properly.

The Cocoon Configuration

As a web crawler, kongulo requires a web server to talk to, and in this case, Apache Cocoon is used to parcel up MARC records and hand them over. Cocoon is a servlet and comes with an embedded version of Jetty, a lightweight servlet container. As long as the Java SDK is available on the machine, and the JAVA_HOME environment is set (as described in Cocoon's INSTALL.txt file), Cocoon's installation is relatively straightforward.

Before running Cocoon, unpack this file (indexcat.zip). It contains a folder called indexcat, which should be placed in the build/webapp directory found in the Cocoon root. There is also a subdirectory called marcXML, which contains an application for breaking apart MARC records. The source code is available, but the easiest way to put this in place is to copy the marcXML.class file from build/classes to the WEB-INF/classes directory. MarcXML relies on Bas Peters' great marc4j toolkit, and the marc4j.jar file must be put in the WEB-INF/lib directory in order for the MARC handling to fall into place.

The heart of any Cocoon application is the sitemap.xmap file, a sort of XML-based switchboard for incoming Web requests. The indexcat subdirectory contains the sitemap.xmap for serving the MARC records to kongulo. There is a section at the top of this file called “global variables”. The values described there must be changed before kongulo begins. It is also necessary to “break up” the MARC records from the catalogue for kongulo. This is done by putting the MARC file into the content subdirectory in indexcat and using Cocoon's cron block (based on Quartz), which can be accessed by starting Cocoon (usually cocoon.bat) and going to the following URL in a browser:

http://localhost:8888/indexcat/cron/

(where localhost is the location of the machine where Cocoon is installed).

By default, this URL will be set to invoke the process that will break apart the MARC file into multiple XML files. The process is ignited by the “Fire Now” button. Depending on the size of the file and the horsepower of the machine, this may take hours. The cron setup includes a “Refresh” link to check if the job has completed or you can monitor the console as well.
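To make the “breaking apart” step less abstract, here is a minimal Python sketch of how a file of binary MARC records can be walked one record at a time. It is not the marcXML application from the distribution (which is Java and uses marc4j, and which goes on to write XML). It only relies on two facts about the MARC exchange format: the first five characters of each record give its total length, and each record ends with the record terminator byte 0x1D. The function name is made up for illustration.

```python
import io

RECORD_TERMINATOR = b"\x1d"  # MARC record terminator byte
LEADER_LENGTH = 24           # every MARC record starts with a 24-byte leader


def iter_marc_records(stream):
    """Yield each raw MARC record (as bytes) from a binary stream."""
    while True:
        length_field = stream.read(5)
        if not length_field:
            return  # clean end of file
        if len(length_field) < 5 or not length_field.isdigit():
            raise ValueError("bad record length field: %r" % length_field)
        length = int(length_field)
        if length < LEADER_LENGTH:
            raise ValueError("record length %d is shorter than a leader" % length)
        # The length counts the five bytes already read.
        record = length_field + stream.read(length - 5)
        if len(record) != length or not record.endswith(RECORD_TERMINATOR):
            raise ValueError("truncated or malformed record")
        yield record
```

A caller would open the exported file in binary mode and hand each yielded record to whatever converts it to XML. Reading by declared length rather than splitting on the terminator byte keeps memory use flat for a large catalogue export.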

Starting the Indexing

When the XML files are in place, kongulo can do its work. This is as simple as giving it a URL pointing to the machine where Cocoon has been set up:

kongulo http://library.somewhere.org/indexcat/gds/index-0

The first time kongulo runs, it tells you that it is going to install itself:


At this point, kongulo presents the most unexciting interface imaginable. By default, we have set kongulo to receive 1000 records at a time. These are placed in one zipped file by Cocoon and then uncompressed by kongulo before being handed to Google Desktop for indexing.

The URLs follow a sequential numbering scheme, and each compressed set of records contains the information for constructing the URL for the next set of records:
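The chained-batch idea can be sketched in a few lines of Python. This is an illustration only, not the code in the modified kongulo.py: the entry name next.txt used to carry the pointer to the next batch is an assumption, as are the function names, and the fetch and handle_record callables are stand-ins for the HTTP request and the hand-off to Google Desktop.

```python
import io
import zipfile

# Hypothetical entry name; how the real batches carry the next URL is not shown here.
NEXT_ENTRY = "next.txt"


def read_batch(zip_bytes):
    """Return (records, next_url) from one zipped batch held in memory."""
    records = []
    next_url = None
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            data = archive.read(name)
            if name == NEXT_ENTRY:
                next_url = data.decode("utf-8").strip() or None
            else:
                records.append((name, data))
    return records, next_url


def crawl(start_url, fetch, handle_record):
    """Follow the chain of batches from start_url.

    fetch(url) must return the zipped batch as bytes; handle_record(name, data)
    receives each record. Returns the last URL that was fully processed.
    """
    url = start_url
    last_url = None
    while url:
        print("fetching", url)  # the last URL printed is the restart point
        records, next_url = read_batch(fetch(url))
        for name, data in records:
            handle_record(name, data)
        last_url = url
        url = next_url
    return last_url
```

One request per thousand records, rather than one per record, is what removes most of the per-request overhead, and because each batch names its successor the client needs no knowledge of how many batches exist.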


The Google Desktop indexing queue can be overrun by a process like this, and if the queue reports a problem, we have set up kongulo to wait 5 minutes before starting to load records again. This should normally handle even a million-plus records over several hours, but if the timeouts don't succeed, or if the network connection is broken, kongulo can be handed the last URL displayed on the screen for picking up where it left off. You can track kongulo's efforts by using the index status option within Google Desktop:
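The wait-and-retry behaviour can be sketched as follows. This is a rough Python outline under stated assumptions, not the logic in the modified kongulo.py: QueueBusy is a hypothetical exception standing in for however the indexer signals a full queue, submit is a stand-in for the call that hands a record to Google Desktop, and the cap on attempts is our own addition.

```python
import time

WAIT_SECONDS = 5 * 60  # the five-minute pause described above


class QueueBusy(Exception):
    """Hypothetical signal that the indexer's queue refused a record."""


def submit_with_backoff(submit, record, max_attempts=12, sleep=time.sleep):
    """Call submit(record), pausing WAIT_SECONDS after each QueueBusy.

    Returns the number of attempts used. If every attempt fails, the last
    QueueBusy is re-raised so the caller can stop and note the restart URL.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            submit(record)
            return attempt
        except QueueBusy:
            if attempt == max_attempts:
                raise
            sleep(WAIT_SECONDS)
```

Passing sleep in as a parameter makes it possible to exercise the retry logic without actually waiting five minutes.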


The index queue is typically quite large, but it still won't take long to fill it up. Google Desktop will also be more responsive to kongulo if set to “index now” rather than waiting for idle time. This is an ideal activity to set up before leaving work in the evening. If all goes well, a screen like this will be waiting for you in the morning as soon as you check the index:


Arbitrary Library Injections into Google Space

When content is indexed by Google Desktop, it is often given a top spot in regular Google searches, as shown:


This might have potential for public library stations which run Windows. Google Desktop can be set up to cease adding new entries (which would be essential for a public station) and locked down in the same manner as other software offered to the public. The reason to go to this much effort is that GDS could be used to promote library content in general Google searching regardless of the browser used by a patron (duly noting that Firefox can be configured to perform this kind of magic automatically without Google Desktop).

If you choose to search “Desktop” from Google, you will be using Google's interface and indexing algorithms on the library catalogue. In the distribution, we have set up a “redirect” mechanism to the web interface of the ILS when viewing a record. For ILS systems that have poorly implemented session strategies, a redirect might be an extremely bad idea but is included here since it is the most generic approach.

A better strategy is to use something like the Talis platform or the API or SQL options available with the ILS to pull up a display directly. We use a simple XSLT transformation in Cocoon to produce the HTML which is handed to kongulo, and it inserts the following as an iframe:

http://library.wherever.org/indexcat/gds/status?theID=879151

Google Desktop ignores iframe tags but this is one simple option for pulling together state information dynamically at the time of display. GDS can index the content as defined by the XSLT, and information like the circulation status of an item can be retrieved separately as shown:


There are, of course, lots of other browser tricks to pull in content dynamically, but iframes do not rely on JavaScript or CSS, and are generally well supported by web browsers.
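For readers who want to see the shape of the markup, here is a small Python sketch that builds an iframe pointing at a status URL of the form shown above. In the distribution this is done by the XSLT in Cocoon, not by Python. The function name and the width and height values are invented for illustration; only the theID parameter comes from the example URL.

```python
from html import escape
from urllib.parse import urlencode


def status_iframe(base_url, record_id, width=400, height=60):
    """Return an iframe element whose src asks the status service for one record."""
    query = urlencode({"theID": record_id})
    src = "%s?%s" % (base_url, query)
    # Escape the URL so it is safe inside a double-quoted HTML attribute.
    return '<iframe src="%s" width="%d" height="%d"></iframe>' % (
        escape(src, quote=True), width, height)
```

Calling status_iframe("http://library.wherever.org/indexcat/gds/status", 879151) produces an iframe whose src is the status URL given earlier, so the indexed page stays static while the circulation status is fetched fresh each time the record is displayed.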

With a Google Desktop plugin, the content handed to GDS can also differ greatly from what is displayed. This is of special interest because GDS does not always show library records in general searches, presumably because the relevancy weighting algorithms favor the full text resources on the Internet. However, it would be possible to overload terms coming from subject headings and elsewhere in order to bump up this weighting. There may be other ways to ensure library content is put at a premium, and tools like the wonderful TweakGDS exist to move GDS indexes to a sharable network drive in addition to numerous hacks that have cropped up since Google Desktop was first introduced.
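As a concrete picture of what “overloading” might look like, the Python sketch below builds the text handed to the indexer by repeating each subject heading a fixed number of times. This is purely speculative: we have not tested whether repetition changes how GDS weights a record, and the function name and the boost parameter are invented for illustration.

```python
def indexable_text(title, subjects, boost=3):
    """Build indexer-only text in which each subject heading appears boost times.

    The displayed record is unaffected; only the text sent for indexing changes.
    """
    if boost < 1:
        raise ValueError("boost must be at least 1")
    parts = [title]
    for heading in subjects:
        parts.extend([heading] * boost)
    return "\n".join(parts)
```

Because the plugin controls what is indexed separately from what is displayed, an experiment like this costs nothing in terms of the record the patron eventually sees.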

All of this argues for experimentation before deciding how library records in GDS could fit into the library's services but, if nothing else, this could be a low barrier answer to “why doesn't the library put its records into Google” or “I wish the library catalogue could be searched with a Google interface”.

Limitations

Caching

Somewhere around 850K entries, Google Desktop seems to stop adding entries to the cache. The records are still indexed, and the title still appears in searches, but the highlighting is not displayed, as shown.


Google Desktop does seem to start recording full entries again after a day or two. It might take several variations of feeding the index to figure out how to work around this.

Substring Support

Google Desktop does not support substring searching. For example, “Detroit River*” will not return any of the records above.

Further Explorations

When the Library has Full Text

It would be trivial to tell kongulo to pass content pointed to by 856 (Electronic Information) fields to the indexer. Google Desktop handles MIME types outside of HTML, and there are tools to increase the range of types that it will index. Plus, kongulo is, by definition, a crawler, and could go several screens into a resource identified by the 856.
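As a rough illustration of the first step, the Python sketch below pulls the URLs out of the 856 fields of a single record. It assumes the record is in MARCXML using the Library of Congress “slim” namespace, which may not match the XML that the indexcat application actually writes, and the function name is our own. The URL itself lives in subfield u of the 856 field.

```python
import xml.etree.ElementTree as ET

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}
DATAFIELD = "{http://www.loc.gov/MARC21/slim}datafield"


def electronic_locations(marcxml):
    """Return the URLs found in 856 $u subfields of a MARCXML record string."""
    root = ET.fromstring(marcxml)
    urls = []
    for field in root.iter(DATAFIELD):
        if field.get("tag") != "856":
            continue
        for sub in field.findall("marc:subfield[@code='u']", MARC_NS):
            if sub.text and sub.text.strip():
                urls.append(sub.text.strip())
    return urls
```

Each returned URL could then be queued for the crawler alongside the catalogue record it came from.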

What do you mean you were browsing the Web in 1980?

We have set up the kongulo plugin to tell Google Desktop the date of a resource based on the publication date in the MARC 008 Date1 value. GDS seems to work dates into its weighting, and it seemed appropriate to timestamp content based on its publication date. When a year is handed to GDS for a date, it defaults to the last day of the last month for that year. At the same time, the Google Desktop “Browse Timeline” feature only starts in 1980. This leads to timeline displays like the following:


There are probably hacks to expand the range of the calendar, and the date logic could be tweaked to avoid having every publication date fall on the last day of the year, but it's an interesting perspective on the library collection, and needs some pondering about how date logic should fit into a web-orientated indexing tool.
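One way to take control of the date logic is to build a full date before anything reaches the indexer. The Python sketch below is an illustration rather than the code in the modified kongulo.py. It reads Date1 from character positions 07-10 of the 008 field, substitutes a digit for any “u” (unknown) characters, and returns an explicit date. The choice of January 1 as the default day, and of 0 as the stand-in for unknown digits, are arbitrary assumptions.

```python
import datetime
import re


def date_from_008(field_008, month=1, day=1, unknown_digit="0"):
    """Return a date built from Date1 (positions 07-10) of a MARC 008 field.

    Returns None when Date1 is blank or otherwise not a usable year.
    """
    date1 = field_008[7:11]
    year_text = date1.replace("u", unknown_digit)
    if not re.fullmatch(r"[0-9]{4}", year_text):
        return None
    year = int(year_text)
    if year < 1:
        return None  # datetime.date cannot represent year 0
    return datetime.date(year, month, day)
```

Handing the indexer a complete date means the result no longer depends on how a bare year is interpreted, although which day best represents “published in 1980” remains a policy question rather than a technical one.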

Moving and Maintaining Indexes

Several hacks seem to be available for indexing content in Google Desktop on one machine, and then handing the index to another machine. We have put in a datestamp mechanism for kongulo such that a scheduled task could retrieve updates to an index without re-indexing the entire set. This is an essential aspect of mirroring library catalogues. After an initial index is built, the increments tend to be much more modest (at least viewed from certain acquisitions budgets in Georgia and Ontario).

If there is a seamless way to get the initial index on a remote machine, maintaining it is not a huge issue in terms of data transfer at least. Cocoon also has a good scheduling layer via the cron block, so there might be an option to periodically harvest updates from the catalogue to make them available to machines with Google Desktop. As well, this depends on the ease with which updates can be identified and exported from the ILS.
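The selection step behind an incremental update can be sketched in Python as below. This is not the datestamp mechanism in the distribution, whose details are not described here. It assumes each record carries a MARC 005 field (date and time of latest transaction, in the form yyyymmddhhmmss.f) and that the time of the previous harvest has been saved somewhere. The function names are invented for illustration.

```python
import datetime


def parse_005(value):
    """Parse a MARC 005 value such as '20060817093000.0' (fraction ignored)."""
    return datetime.datetime.strptime(value[:14], "%Y%m%d%H%M%S")


def changed_since(records, last_run):
    """Return the ids of records modified after last_run.

    records is an iterable of (record_id, value_of_005) pairs. With
    last_run set to None, every record is selected (the initial load).
    """
    changed = []
    for record_id, stamp in records:
        if last_run is None or parse_005(stamp) > last_run:
            changed.append(record_id)
    return changed


def newest_stamp(records, default=None):
    """Return the latest 005 timestamp seen, to store as the next last_run."""
    stamps = [parse_005(stamp) for _, stamp in records]
    return max(stamps) if stamps else default
```

A scheduled task would call changed_since with the stored timestamp, index only the selected records, and then save the value from newest_stamp for the next run.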

Privacy

Obviously, a library has to be extremely careful with what is indexed and displayed anywhere in environments where personal information and content are created. Google Desktop also brings to light how much browsing activity is stored in general browser use by default. There are ways to lock down what Google Desktop does. For a public computing environment, this needs to be looked at in detail.

The role of Cocoon

Cocoon is an extremely useful tool in creating mashups, particularly because of something we do not use for this project, and that is its SVG (Scalable Vector Graphics) support. For example, you can create a graphic in SVG, use a very simple stylesheet to tweak it for display for functions like possibly placing a red dot on a shelf location based on some input criteria, and then render it in an image format (or directly in SVG for browsers that have appropriate support).

SVG is also extremely valuable for placing custom maps into Google Maps. The typical approach to this has been to use a utility to create the gazillion tile files required for a Google Map display. But SVG can be used in Cocoon to dynamically carve out and scale the tiles needed for display of custom maps. For example, here is a map that shows a link to a scanned historical map (courtesy of OurOntario).

Cocoon also shines at talking to databases, and supports caching, connection pooling and other constructs for minimizing overhead. As well, Cocoon features a server side implementation of XForms and has provided a declarative implementation of Ajax as well as a very easy to use option for continuations. Combined with Lucene, Cocoon could be used to build and maintain a pretty nifty interface.

OSS Licenses

The work done by the authors of this document is released under the GPL. Cocoon, marc4j, and Google Desktop all have differing licenses and need to be used under the terms specified.


About this document

This document was created in OpenOffice and is served directly on the web using Cocoon's nifty Zip support and the elegant and sensible XML syntax of OpenDocument. The XSLT to make this happen has been modified from the original work by Svante Schubert and has been shared with the Lenya project.

With OpenOffice and Cocoon, OpenDocument content can live directly on the web, a prospect much more audacious in concept than execution. Add in WebDAV support, and the barriers between the desktop and the Web start to blur, and the options for repurposing content achieve megaton levels. Regardless of one's feeling about office productivity tools, libraries have a vested interest in supporting OpenDocument for long-term preservation and access.

The Files (in case you missed them in the links above)

kongulo.py
indexcat.zip

The Guilty

Art Rhyno, Systems Librarian, University of Windsor
Ross Singer, Library Application Developer, Georgia Tech Library

Last Modified: Aug. 17, 2006

Update: Aug. 22, 2006

I had a suggestion that I add a parameter to break up indexing into two sets, for example, stopping at 500K for 1M records, and then picking up again a few hours later. Apparently this gets around the cache issues? I am not sure why it would make a difference but I will try this out. Also, I fixed the graphic showing the “highlight” problem since it was pointed out that it was confusing (highlighting works on the title, but if a record is in the list because of text other than the title, then it doesn't display). Many eyes, shallow bugs, and all that...

Update: Aug. 24, 2006

The cache issues are not related to breaking up the input record sets; it took me loading 1M records twice to figure this out. It seems to be a function of data size: somewhere between 3 and 4 gigs of consumption for the GDS index, Google Desktop starts getting cranky about particular domains. The trick is to minimize the HTML content that GDS receives (since selecting a Google Desktop result goes to whatever URL is specified). This is not hard to do, and this stylesheet (gds.xsl) can replace the existing one in the indexcat.zip distribution in order to come up with a format that feeds the same amount of content to the index but allows at least 1M additions. As I noted above, GDS ignores the iframe info anyway, so there was lots of scope for reducing the size of the file. The good news is that without the cache issues, GDS seems to faithfully display local results on every search. In the words of the great Homer Simpson: “whoo-hoo”!