What does Harvesting mean?

Harvesting is the process of gathering decentralized metadata from partner collections into the central BEN Portal. In other words, harvesting simply means getting metadata records from partner collections and storing them in the BEN Portal so that these records can be searched and browsed from the BEN Portal.

Isovera “harvests” data from partners on a quarterly basis. See the Harvesting Process for details on what exactly happens during a harvesting cycle.

What are the components of the BEN harvester?

The Harvester is actually a combination of two components. One of these resides on the BEN Portal and is responsible for requesting data from partner collections, then translating the acquired data and storing it in the BEN Portal database. We call this component the Thresher.

The second component is the Thresher's counterpart. It resides on the partner collection and, upon the Thresher's request, sends BEN LOM structured metadata to the Thresher in a format that the Thresher understands (OAI-PMH).

The Harvester component for partner collections is unique for each collection as it has to work with different databases and database structures. The main challenge in Harvesting Metadata into the BEN Portal from the collection of a new partner is to set up this Harvesting component.

We created a generalized version of the collection-side harvesting component that each new partner can install in their collection, but it requires a lot of configuration and tweaking before it becomes fully functional. Nevertheless, it is a good starting point for a new partner collection trying to set up their Harvester component. We called this component the Reaper and it can be downloaded from here.

How does the harvester work?

Technical Overview

  • BEN Harvester/Thresher issues “Identify” request.
  • Collaborator Harvester responds with identifying information.
  • BEN Harvester/Thresher issues “ListIdentifiers” request.
  • Collaborator Harvester responds with a list of identifiers for all resources created or modified since the last harvest. Usually, an identifier is just a number automatically generated by the database as a primary key.
  • BEN Harvester/Thresher issues a “GetRecord” request for each resource identifier returned by the “ListIdentifiers” response.
  • Collaborator Harvester/Reaper responds with a BEN-LOM XML document for the requested resource.
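
The request sequence above can be sketched in Python as a rough illustration of the Thresher side. This is not the actual BEN code: the base URL, the metadata prefix "ben_lom", and the function names are assumptions made for the example, and the sketch assumes OAI-PMH 2.0 responses without resumption tokens. The fetch argument is any function that takes a URL and returns the response XML.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace used by OAI-PMH 2.0 responses.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
# Assumption: the real metadataPrefix used by BEN partners may differ.
METADATA_PREFIX = "ben_lom"


def build_request(base_url, verb, params=None):
    """Build an OAI-PMH request URL: the verb first, then sorted arguments."""
    query = [("verb", verb)] + sorted((params or {}).items())
    return f"{base_url}?{urlencode(query)}"


def harvest(fetch, base_url, last_harvest):
    """Run Identify, ListIdentifiers, then one GetRecord per identifier.

    Returns a dict mapping each identifier to its raw GetRecord response.
    """
    # Identify: ask the collection to describe itself.
    fetch(build_request(base_url, "Identify"))

    # ListIdentifiers: only resources created or modified since last harvest.
    listing = fetch(build_request(
        base_url, "ListIdentifiers",
        {"from": last_harvest, "metadataPrefix": METADATA_PREFIX},
    ))
    identifiers = [
        element.text
        for element in ET.fromstring(listing).iter(OAI_NS + "identifier")
    ]

    # GetRecord: fetch the metadata document for each listed resource.
    return {
        identifier: fetch(build_request(
            base_url, "GetRecord",
            {"identifier": identifier, "metadataPrefix": METADATA_PREFIX},
        ))
        for identifier in identifiers
    }
```

In real use, fetch could be a small wrapper around urllib.request.urlopen; passing it in as an argument keeps the sequence easy to try out against canned responses.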

A more detailed overview of how the Thresher works with the Reaper:

  • Collaborator web server executes the Harvester/Reaper CGI program (a Perl script).
  • Harvester/Reaper parses the HTTP request.
  • Harvester/Reaper requests the record from the XML-DBMS library.
  • XML-DBMS library reads and parses the file that maps the database structure to an XML document.
  • XML-DBMS issues SQL queries and transforms query results into raw XML.
  • Harvester/Reaper transforms XML into well-formed BEN-LOM.
  • Harvester/Reaper wraps BEN-LOM document in OAI-PMH envelope.
  • Harvester/Thresher opens OAI-PMH envelope, reads BEN-LOM document, and inserts metadata into BEN Portal database.
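
The final step, where the Thresher opens the envelope and stores the metadata, might look roughly like the Python sketch below. It is only an illustration under stated assumptions: the real BEN Portal database schema is not described here, so an SQLite table named records stands in for it, the function names are hypothetical, and the BEN-LOM payload is stored as raw XML rather than mapped to individual fields.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 responses.
OAI = "{http://www.openarchives.org/OAI/2.0/}"


def unwrap(envelope_xml):
    """Extract the identifier and metadata payload from a GetRecord response."""
    root = ET.fromstring(envelope_xml)
    record = root.find(f"{OAI}GetRecord/{OAI}record")
    if record is None:
        raise ValueError("no record in OAI-PMH response")
    identifier = record.findtext(f"{OAI}header/{OAI}identifier")
    metadata = record.find(f"{OAI}metadata")
    if metadata is None or len(metadata) == 0:
        raise ValueError("record has no metadata payload")
    # The single child of <metadata> is the wrapped document (here, BEN-LOM).
    return identifier, ET.tostring(metadata[0], encoding="unicode")


def store(conn, identifier, lom_xml):
    """Insert or update one record in a stand-in table (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records "
        "(identifier TEXT PRIMARY KEY, lom_xml TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO records VALUES (?, ?)",
        (identifier, lom_xml),
    )
    conn.commit()
```

Keying the table on the identifier means a resource that is harvested again after being modified replaces its earlier copy instead of creating a duplicate.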

Sample BEN LOM XML