University of Virginia
University of Virginia Library
Digital Initiatives: Research and Development

The Fedora™ Andrew W. Mellon Foundation Grant

Phase 1 of the Open-Systems Fedora™ Repository Development Project (funded December, 2001)

Introduction
Project Description
Phase 1
Phase 2
Phase 3
Implementation Plan
Budget
Appendix A
Working Project Site

Introduction

The University of Virginia Library has been building digital collections since 1992. We have amassed a large collection that includes a variety of SGML encoded etexts, digital still images, video and audio files, and social science and geographic data sets that are being served to the public from a collection of independent web sites that have very little cross-integration.

We began searching in 1998 for a digital library management system that could effectively meet both our current and future digital content needs. Like many other libraries, we initially sought a vertical vendor solution that provided a complete, self-contained package for delivering and managing all digital content needs. We investigated a number of commercial solutions, including IBM's Digital Library Software system (later renamed Content Manager) and SIRSI's Hyperion digital media archive system. We started our investigation with the requirement for a digital content repository with a wide variety of features, including scalability to handle hundreds of millions of digital resources, flexibility to handle the ever expanding list of digital media formats, and extensibility to facilitate the building of customizable tools and services that can interoperate with the repository. Our view is that such repository functionality is the core of a digital library system providing a means of uniquely identifying each piece of digital content as well as identifying groups of related content or collections. The remaining services and functionality of a digital library system would then be built on top of this core.

Our investigations revealed a number of shortcomings in commercial digital library products:

  • Most products are narrowly focused on specific media formats: they offer good solutions for managing and delivering video or images but lack adequate tools and support for structured (i.e., XML or SGML) electronic texts, or the ability to intermingle media types.
  • Many products perform well at document management but offer no features for dealing with video or images.
  • None of the products we examined adequately addressed the need to track and manage the array of ancillary programs and scripts that play an essential role in the delivery of that digital content.
  • Many products fail to deal effectively with the complex interrelationships among digital content. As an example, consider an electronic text in the form of a five-hundred-page book. The book consists of a single file containing all five hundred pages of text marked up using XML. In addition to the XML file, there are also five hundred images that represent the scanned pages from the original hardcopy edition of the book. There are also twenty-five audio files that provide a recording of the book's content read aloud. To the librarian, all of these digital media comprise the intellectual object known as the "book" and all are closely related to one another.
  • Finally, we found that few of the products attend to the critical issue of interoperability, failing to provide an open interface to allow sharing services and content with systems from other vendors at other libraries.

Based on these investigations we decided to embark on an in-house development effort. Modularity and the use of open-system standards are fundamental to our design strategy. Such modularity is essential for future evolution through component replacement. We are convinced that an object-oriented design is most appropriate, allowing us maximum flexibility, scalability and, eventually, interoperability with other repositories. We are also convinced that the Library should be providing tools to our users to give them sophisticated access to our collections and to help them manage their own collections.

In the summer of 1999, early in our design process, we discovered a paper about the Flexible Extensible Digital Object Repository Architecture (Fedora™) written by Carl Lagoze and Sandra Payette at Cornell's Digital Library Research Group, describing the architecture that they had designed. Fedora is a modular architecture built on the principle that interoperability and extensibility are best achieved by the clean separation of data, interfaces, and mechanisms (i.e., executable programs). A Fedora Repository provides a general-purpose management layer for digital objects. In their simplest form, digital objects are containers that aggregate mime-typed streams of data (e.g., digital images, XML files, metadata), known as datastreams. Datastreams can also be references to external data: either disseminations of other Fedora digital objects, or service requests to remote data sources. This capability allows Fedora digital objects to serve as aggregators and value-added surrogates for existing on-line digital content.
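The container model described above can be illustrated with a minimal sketch. It is not the Fedora implementation: the class names, field names, identifiers, and URL are all hypothetical, chosen only to show a digital object aggregating MIME-typed datastreams, some held locally and some stored as references to external data.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Datastream:
    """A MIME-typed stream of data (hypothetical shape, for illustration only)."""
    ds_id: str
    mime_type: str
    content: Optional[bytes] = None   # bytes held by the repository, if local
    location: Optional[str] = None    # reference to external data, if any

    def is_external(self) -> bool:
        # A datastream with a location is a reference rather than stored bytes.
        return self.location is not None


@dataclass
class DigitalObject:
    """A container that aggregates datastreams under one identifier."""
    pid: str
    datastreams: Dict[str, Datastream] = field(default_factory=dict)

    def add(self, ds: Datastream) -> None:
        self.datastreams[ds.ds_id] = ds

    def list_datastreams(self) -> List[str]:
        return sorted(self.datastreams)


# Hypothetical example: a book with local XML text and an externally held page image.
book = DigitalObject(pid="demo:book-1")
book.add(Datastream("TEXT", "text/xml", content=b"<book/>"))
book.add(Datastream("PAGE-1", "image/tiff", location="http://example.org/page1.tif"))
```

Because the external page image is only a reference, the object acts as a surrogate for content that lives elsewhere, which is the aggregation role the paragraph describes.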

In addition to behaving in a generic manner, digital objects must be able to mirror real-world entities by providing access methods that make an object behave in a content-specific manner. For example, a natural behavior for a book would be "Get Table of Contents." Fedora allows the association of rich and extensible behaviors with digital objects by "plugging in" generic components known as disseminators. Each disseminator aggregates references to: (1) a formally defined behavior interface that defines a set of methods for a particular kind of digital library resource (e.g. a Book interface), (2) an executable mechanism that runs these methods, and (3) the datastreams that the execution mechanism should use to fulfill specific method requests. These interfaces and mechanisms can, themselves, be stored as digital objects, laying the foundation for unlimited extensibility of the architecture. A major strength of the Fedora extensibility model is that clients can use the generic methods (of the default API) to discover and invoke content-specific methods defined on disseminators. The digital object facilitates the invocation of these extended methods, returning customized disseminations of content to the client.
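A disseminator's three references can be sketched as follows. This is an illustrative model only, assuming a behavior interface is a list of method names and a mechanism is a mapping from method name to a Python function; the names `Disseminator`, `invoke`, and `get_table_of_contents` are hypothetical and do not come from the Fedora software.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Disseminator:
    """Binds a behavior interface to a mechanism and the datastreams it uses."""
    interface: List[str]                                    # (1) method names
    mechanism: Dict[str, Callable[[Dict[str, str]], str]]   # (2) executable code
    datastreams: Dict[str, str]                             # (3) datastream id -> content

    def methods(self) -> List[str]:
        # Generic discovery: a client asks which content-specific methods exist.
        return list(self.interface)

    def invoke(self, method: str) -> str:
        # Generic invocation: run the mechanism against the bound datastreams.
        if method not in self.interface:
            raise KeyError(f"{method!r} is not defined by this interface")
        return self.mechanism[method](self.datastreams)


def get_table_of_contents(streams: Dict[str, str]) -> str:
    # Toy mechanism: treat each line of the TOC datastream as one entry.
    return " | ".join(streams["TOC"].splitlines())


# Hypothetical Book interface with a single method.
book_disseminator = Disseminator(
    interface=["get_table_of_contents"],
    mechanism={"get_table_of_contents": get_table_of_contents},
    datastreams={"TOC": "Chapter 1\nChapter 2"},
)
```

A client needs only the two generic calls, `methods()` and `invoke()`, to discover and run the book-specific behavior, which mirrors the extensibility argument made in the text.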

With the Cornell group's help, we installed their research reference implementation of Fedora and began experimenting with some of our digital collections. We quickly found that the reference implementation, elegant piece of software that it is, was not what we needed for a large-scale digital library. But we were convinced that Fedora was exactly the conceptual framework that we were looking for. So with the authors' help, we reinterpreted the architecture and implemented it using an SQL database as the backend.

Since that time we have built a testbed that includes 500,000 data objects including digital images and a wide variety of XML objects. We have developed a variety of disseminators that provide a rich set of functionality for electronic finding aids, TEI-encoded etexts of letters and books, and for XML-encoded structured collections of art, architecture and archeology images. We have also implemented three different object models for images, one for multiple files for the various resolutions of a single scan, one for single-file wavelet-encoded images and one for page images that uses a single compressed TIFF file. In all three cases, the user sees the images from one abstract point of view and is spared the requirement of knowing their format.

Most recently, we have begun stress testing our implementation using software that simulates simultaneous users making a realistic mixture of requests. We have been quite pleased with the results. On a Sun Ultra80 two-processor workstation, simulating 20 simultaneous users making requests with an average delay of 300 milliseconds, response times average approximately one half second per request. Note that for most of the XML object transactions this includes a server-side rendering of the XML into HTML, a relatively processor-intensive action. We are in the process of moving the repository to a four-processor, dedicated server, where we will continue our testing. We plan to start scaling our testbed up by duplicating the existing objects repeatedly, running the user tests at 1,000,000 and 10,000,000 objects. We believe that a repository that provides fast access to 10,000,000 objects is a very good starting point for a practical digital library.
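The figures above imply a rough sustained request rate, which can be estimated with a short calculation. This sketch assumes that the 300 millisecond "average delay" is the pause each simulated user takes between receiving a response and issuing the next request; the source does not say so explicitly, so the result is only indicative. The function name is hypothetical.

```python
def closed_loop_throughput(users: int, response_s: float, think_s: float) -> float:
    """Requests per second for a fixed population of users.

    Each user cycles through one response time plus one pause, so the
    population completes users / (response_s + think_s) requests per second.
    """
    return users / (response_s + think_s)


# 20 users, about 0.5 s per response, assumed 0.3 s pause between requests.
rate = closed_loop_throughput(users=20, response_s=0.5, think_s=0.3)
print(f"{rate:.1f} requests per second")
```

Under that assumption the workstation would be serving on the order of 25 requests per second during the test.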


Project Description

We believe that it is time both to start developing a practical implementation of the work that we have been prototyping and to explore and prototype some of the more complex issues related to a more complete implementation. We propose to do that with input from other members of our community (see Phase 1), so that we develop a good general solution as quickly as possible. We also would like to get the repository software that we produce into the hands of other people who are ready to use and evaluate it.

We are convinced that we are on the right track with our implementation of Fedora. It gives us the basic approach that we need to manage all of the digital resources that we are accumulating, while delivering a very high level of service to our users. And we believe that the extensibility of the architecture will allow us to adapt to the rapid technological changes and new content forms that are inevitable.

We request funding for this project from the Andrew W. Mellon Foundation based on our belief that the project is closely aligned with the mission of the Foundation of promoting the broad dissemination of scholarly content. In the digital age, such broad dissemination is dependent on core technical developments, the roots of which lie in the research community. The original research and development related to Fedora was undertaken under the auspices of DARPA and NSF funded research at Cornell University. Per the general understanding in such research projects, the funding was available for initial concept development, prototype demonstration, and reportage in the form of conference and journal papers. On the other hand, such government-sponsored research funding is not available for the subsequent stages necessary for moving from research to deployable implementation and other aspects of technology transfer including packaging and support. The Foundation's possible funding of Fedora work by Virginia and Cornell would consequently leverage several years of successful government funded basic research and facilitate the availability of the fruits of that research to the broadest community. Such funding would also benefit from the fact that the NSF funding continues at Cornell and would dovetail into the project as it matures. We are confident that such pairing of funding mechanisms is the best possible model for fostering state-of-the-art advances in digital libraries and scholarly communication.

This project also would build from and directly support Mellon-funded projects already underway at Virginia. The Supporting Digital Scholarship (SDS) project, which concentrates on collecting the digital scholarly projects that are being created by humanities scholars in the Institute for Advanced Technology in the Humanities at Virginia, is built around our prototype Fedora implementation. That project has already informed the design that underlies the project described in this proposal and will continue to do so. The version of the repository that results from phase 1 of this project (described below) should become available right at the time that the SDS project delivers policy and technical guidelines for collecting digital projects. This should allow us to implement those policies immediately and begin formally collecting scholarly projects into the digital library.

In the same manner, the basic working repository created with this grant will deliver a full suite of management utilities to other Mellon-funded projects underway or envisioned at Virginia. Our work will dovetail with the digital imprint project approved for the University Press of Virginia, and will be immensely useful for the American Studies Information Community project (itself bringing in the Mellon-funded Early American Fiction collection) that is one of the Open Archives Initiative projects recently funded by the Foundation. The startup phases of these two projects coincide with the detailed design and implementation phases of the repository project, providing an opportunity for influencing the details of the initial product by providing different content collection and delivery issues to resolve. Then both projects would be able to move directly to concentrating on using the repository to meet the specific needs of publishers and American Studies scholars, respectively.

We believe this project is best undertaken in collaboration with our colleagues at Cornell. We find the missions of our two groups to be synergistic, spanning a continuum from basic research, through prototyping, to eventual deployment of reference implementations. The Cornell group works mainly in the basic research and prototype mode. Fedora was originally developed within this research framework, and NSF DLI2 funding currently supports the basic work on policy enforcement and context sensitive behaviors, which we will leverage as described later in this proposal. The Virginia group sees itself functioning as a bridge between the computer scientists doing digital library research and the libraries that are building large digital collections. We believe that the collaborative activities of this project will effectively demonstrate how digital library research can be more immediately deployed in the libraries that it is intended to serve.

We propose that we will form a research and development team composed of people from Virginia and Cornell, with 1.0 FTE added at Cornell to Lagoze's group, and 2.5 FTEs added at Virginia. The principal investigators will be Thornton Staples, the director of the Digital Library Research and Development group in the Library at Virginia, and Carl Lagoze, co-director of the Cornell Digital Library Research Group. Also, people from the Institute for Advanced Technology in the Humanities and from the Advanced Technology Group in the Information Technology and Communications Department, both at Virginia, will continue their work with Fedora as members of this team. The team will pursue a three-phase project, as detailed below, with the goal of producing an open-source reference implementation, which will be available to other libraries and practitioners as they construct digital library systems. The first phase involves taking a strong proof of concept (already done) and producing a package that can be distributed and used in a variety of settings. The later phases propose extending the results of ongoing research in order to fill out the system with important functions that a sophisticated digital library system needs.


Phase 1

This phase will involve finalizing the specifications of the basic Fedora system, implementing that system, and testing it in a variety of deployment scenarios. The resulting product will be an efficient, scalable reference implementation that can be the basis for many different development efforts, one that libraries with a reasonably sophisticated technical staff can use to begin to build their digital library systems. It will include a set of generic modular tools that provide a full set of basic repository management functions. The time period for this phase is assumed to be one year, probably from the time that the programming effort begins.

We will continue to build our testbed at Virginia and we anticipate having at least 1 million digital objects of a variety of types ready to test the system that we will deploy in phase 1. An essential part of this phase will be the participation of a select deployment group (distinct from the development group) that will deploy testbeds of their own materials at the same time. In each case, the participants listed below either heard our presentations at the Association for Computing in the Humanities and Digital Library Federation conferences or read the article in D-Lib Magazine where we described our work, and came to us to find out more.

Two of the participating institutions are also digital library groups which will be evaluating the system from that point of view. The rest of the participants are project-oriented humanities groups who will be testing the repository system as a basis for supporting projects rather than for building a general digital library. As a group we will be evaluating the system specifications and planning the system evaluation as the programming is being carried out. At the end of this phase we expect to have at least six implementations of working digital object repositories to evaluate. We also expect that many if not all of these repositories will continue to give us a rich testbed for later phases of the project.

The success of phase one will be determined by the success of the deployment group (consortium participants) in deploying separate testbeds in each of their institutions. Feedback from the consortium members and other users after the public release of the software will be used to evaluate version 1.0 of the software and will guide future enhancements to the repository software.

We would like to keep the number of participants to ten or fewer, to make the process more manageable. Currently, the participants include:

  • Jon Dunn from the Digital Library Group at Indiana University
  • Lorna Hughes from the Humanities Computing group at New York University
  • David Kahle and Greg Colati from the Digital Collections and Archives Department at Tufts University
  • Harold Short from the Humanities Computing group at King's College London
  • Marilyn Deegan from the Refugee Studies Center at Oxford University

Following initial drafting of the specifications by the development group and dissemination to the deployment group in the summer of 2001, the work in this phase will include:

  1. Participating institutions will send a technical representative to a two-day meeting in Charlottesville in the fall of 2001, where we will:
    • demonstrate and discuss the Virginia implementation to make sure that all of the participants fully understand the basic concepts, see how the repository is being used by us, and understand the possibilities for other installations;
    • agree on a final version of the architectural specifications that will be version 1 of the repository system;
    • discuss the features defined for later phases, in order to sketch out the next steps so that we can attempt to account for them in the basic object architecture;
    • agree upon the specifications for the testbed that each participant will develop in step 3.


  2. The development team will create the system software. The functionality of the first version will include:
    • implementation of a complete basic repository architecture that is based on the original Fedora concepts;
    • a management console that a repository manager can use to perform all of the basic management functions;
    • a metadata searching and indexing service that is compliant with the Open Archives Initiative.


  3. Participants will each deploy version 1.0 of the system and build a testbed of their own digital objects, as agreed upon in step 1.

  4. Appropriate fixes and small changes will be incorporated based on the testbed experience and version 1.0 will be made available from a publicly accessible open-source site.


Phase 2

The second phase of this project will concentrate on adding the functionality needed for a repository that supports large-scale digital content creation, storage and delivery efforts. This will involve enhancing and extending the management utilities developed in phase 1, in addition to concentrating on the development areas listed below. We expect that some of the participants from phase 1 will be interested in the problems associated with large scale production and will be interested in developing a new testbed definition as this phase develops and deploying the new version when it is complete. We will solicit new participants who are well situated to evaluate the work as it progresses. We will also be interested in continuing to work with groups that are interested in deploying smaller repositories to evaluate how these additional functionalities can be used effectively in those settings.

Security and Policy Enforcement – We assume that each digital object in the repository should be able to have a variety of policies associated with it. First among these policies must be those associated with access control. But many other policies are possible, for example preservation policies that describe the events and actions necessary to maintain objects over time.

In the area of access control, we recognize the need to specify policies that are both general-purpose and object-specific. Some policies may be defined at the repository level and may address high-level operations such as who can create or delete objects. Other policies may be tailored to the nature of individual objects in the repository. Initially, we will focus investigation in two areas:

  • Flexible Policy Specification: Cornell's NSF-funded work is investigating new policy definition languages that are both easy-to-use and expressive. We plan to exercise that research in this project so that we may specify access control policies that are customized to fit the nature of different kinds of objects and usage scenarios. We plan for policies to be expressed in both human and machine-readable formats.

  • Extensible Policy Enforcement Mechanisms: To be effective, policies must not only be expressed, they must be enforced. Thus, we will also examine several mechanisms for enforcing machine-readable policies. We also recognize that digital objects change and evolve over time and that our enforcement mechanism must be extensible. We will investigate mechanisms that are easily adaptable to changes in objects and their usage context.
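The two levels of access control policy described above can be sketched as follows. This is a hedged illustration, not a proposed policy language: it assumes a policy is simply a function from (user, action, object identifier) to allow or deny, and every name, user, and identifier in it is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Assumed shape of a machine-readable policy: (user, action, pid) -> permitted?
Policy = Callable[[str, str, str], bool]


@dataclass
class PolicyEnforcer:
    """Applies repository-level policies plus any policies attached to an object."""
    repository_policies: List[Policy] = field(default_factory=list)
    object_policies: Dict[str, List[Policy]] = field(default_factory=dict)

    def is_permitted(self, user: str, action: str, pid: str) -> bool:
        # A request succeeds only if every applicable policy allows it.
        rules = self.repository_policies + self.object_policies.get(pid, [])
        return all(rule(user, action, pid) for rule in rules)


MANAGERS = {"alice"}            # hypothetical repository managers
CAMPUS_USERS = {"alice", "bob"}  # hypothetical on-campus users


def only_managers_modify(user: str, action: str, pid: str) -> bool:
    # Repository-level policy: only managers may create or delete objects.
    return action not in {"create", "delete"} or user in MANAGERS


def campus_only(user: str, action: str, pid: str) -> bool:
    # Object-specific policy: any access is limited to campus users.
    return user in CAMPUS_USERS


enforcer = PolicyEnforcer(
    repository_policies=[only_managers_modify],
    object_policies={"demo:restricted-1": [campus_only]},
)
```

Keeping policies as data attached to the repository or to individual objects means new rules can be added without changing the enforcement code, which is one simple reading of "extensible" enforcement.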

Collection Objects – We believe that item-level granularity is not appropriate for all the functionality that we want to build into our system. Indeed, there are a variety of repository functions that should be implemented at an aggregated level, within what we call collection objects. These objects would represent a group of related digital objects and provide a place to describe and document a collection as a whole, as well as to attach computer programs to be used for manipulating and analyzing it. The relationship of collection objects with related items will be either rule-based (for which criteria associated with the collection object are used to locate the objects which are members of the collection) or explicit (in which objects that are members of the collection are enumerated). Collection objects might be used to generalize a specific function across a class of digital objects; for example, a collection object might be used to implement a function such as metadata searching and indexing across its set of constituent objects by accessing a specific datastream in those objects. We will also develop collection objects that can act as templates for large classes of objects, providing a way to streamline the process of updating large classes of objects.
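The distinction between rule-based and explicit membership can be made concrete with a small sketch. It assumes, purely for illustration, that the repository is a mapping from object identifier to a flat metadata record; the class name, field names, and sample records are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set

Metadata = Dict[str, str]


@dataclass
class CollectionObject:
    """Represents a group of related objects, by enumeration or by rule."""
    title: str
    members: Set[str] = field(default_factory=set)       # explicit membership
    rule: Optional[Callable[[Metadata], bool]] = None    # rule-based membership

    def resolve(self, repository: Dict[str, Metadata]) -> List[str]:
        # Explicit members are kept only if they exist in the repository.
        pids = {pid for pid in self.members if pid in repository}
        if self.rule is not None:
            # Rule-based members are located by testing each object's metadata.
            pids |= {pid for pid, meta in repository.items() if self.rule(meta)}
        return sorted(pids)


# Hypothetical repository contents.
repository = {
    "demo:1": {"type": "image", "subject": "architecture"},
    "demo:2": {"type": "text", "subject": "letters"},
    "demo:3": {"type": "image", "subject": "art"},
}

all_images = CollectionObject("All images", rule=lambda meta: meta["type"] == "image")
curated = CollectionObject("Curated selection", members={"demo:2", "demo:3"})
```

A rule-based collection picks up newly added objects automatically, while an explicit collection changes only when someone edits its member list; a function such as indexing could then be applied to whatever `resolve` returns.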

Storage Management – We will develop a storage management system that would allow the repository to control access to one or more file systems that house local datastreams. The processes that create or update a local datastream would, in addition to updating the repository, be responsible for accepting the contents of the datastream from the user and passing it to the file server.

The goal of phase two is to use the results of evaluations of version 1.0 of the software (conducted in phase one) to add new functionality to make the software usable in large-scale production environments. The features outlined here are a first impression of what those additional features may need to include. We expect many of the repositories deployed by consortium participants in phase one will provide valuable testbeds to conduct additional testing of the new enhancements. We also expect these testbeds to provide valuable feedback for both evaluating the new features and for suggesting additional enhancements for the future.


Phase 3

The third phase of this project will concentrate on extending the facilities in the repository that provide more sophisticated delivery of end-user experiences in a large scale digital library. This will include extending the functionality of disseminators, adding services that are important for collecting scholarly projects and publications, as well as overall optimization of the system. As with phase 2, we hope that the deployers from earlier phases of the project will be interested in continuing with this phase. We will also solicit new participants who are well situated to evaluate the work as it progresses.

Editions and Versioning of an Object – The repository must make it possible to retain and provide on demand every version of an object if desired. We propose to offer a standard way to make a new edition of an object available as a separate object in the repository, as well as to make it possible to track every change to an object within the object itself. A new edition of an object will be a completely new object. It will have a new PID, its own metadata, etc. There will be a field in the system metadata that contains the PID of the object from which it was derived.

Versions of the components of an object will be kept in the object. The create date for each version of each component of the object will indicate the date and time that the version became current. The version of the whole object on a given date and time could be disseminated by giving the date and time as an extension of the PID. The version of each component in the object with a create date and time most nearly previous to the given date and time would be used in the dissemination.
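The selection rule in the preceding paragraph can be sketched for a single component. The sketch assumes that each version is recorded as a (create date, label) pair and that a version created exactly at the requested moment counts as current; both the representation and the function name are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime
from typing import List, Optional, Tuple


def version_at(versions: List[Tuple[datetime, str]], when: datetime) -> Optional[str]:
    """Return the version whose create date is the latest one not after `when`.

    Returns None if the component did not yet exist at `when`.
    """
    ordered = sorted(versions)
    create_dates = [created for created, _ in ordered]
    # Count how many versions were created at or before the requested moment.
    index = bisect_right(create_dates, when)
    return ordered[index - 1][1] if index else None


# Hypothetical version history for one component of an object.
history = [
    (datetime(2001, 1, 10), "v1"),
    (datetime(2001, 6, 1), "v2"),
    (datetime(2002, 3, 15), "v3"),
]

current_at_year_end = version_at(history, datetime(2001, 12, 31))
```

Disseminating a whole object as of a given date would then amount to applying the same lookup to every component and assembling the results.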

Dynamic, Context Sensitive Behaviors – We envision scenarios where the predefined disseminations on an object will not be appropriate to a given usage context. In certain cases, our collaborators may wish to reuse each other's objects in new ways. One option is for repository managers to create new disseminators on objects to meet such needs as they arise. Another interesting approach is to provide a mechanism for exposing a special kind of structural metadata about an object that enables a 3rd party to: (1) learn about the nature of the object's raw content, and (2) access relevant parts of that content in a format that facilitates reuse. In a way, we can think of this as enabling "just-in-time" disseminators for an object.

We envision implementing this scheme by introducing a new service into the repository architecture: a context broker service. We anticipate a time when all of our collaborators are running repositories using the repository software we develop. Each site can also run a context broker service whose purpose is to contextualize the experience of objects in other collaborators' repositories.

Efficiency and Scale Optimization – Though we will be attempting to optimize each module of the system as we develop it, we believe that we need to devote part of the last phase of the project to optimizing the integrated system. We need to ascertain that the repository can support hundreds of millions of objects with 50 simultaneous users, in a realistic combination of user requests and repository management processes. If the proposed scale proves to be impossible, we will investigate other strategies, such as coordinated, multi-repository installations.

The goal of phase three is to continue to evaluate and enhance the software, building upon input received from consortium participants and others in the open source community who are actively using the repository software. We expect the version of software that emerges at the end of phase three to be capable of supporting large-scale digital content and delivery efforts. We also expect the software to be capable of providing the necessary services that are important for collecting scholarly projects and publications and provide tools for end-users to discover and manipulate content in the repository. We anticipate the success of phase three and the project as a whole will be judged by the experiences of consortium members and others in implementing the software to manage large scale digital collections. We also envision that the various implementations of the software will offer rich testbeds for future projects.


Implementation Plan

The project will extend over a three-year period, with approximately a year anticipated for each of the three phases outlined above. Evaluative input from the consortium participants about software features and performance may necessitate changes to the planned Phase Two and Three activities. Delays in hiring or unplanned technical issues may also require adjustments, but an approximate timeline of events includes the following:

  1. Year 1 (Phase 1)
    • Conduct first meeting of participants and orient consortium members to project activities timeline
    • Finalize version 1 design and programmer implementation specifications
    • Implement alpha version of software, leveraging existing code from prototype
    • Orient consortium participants on deployment of alpha testbeds
    • Deploy alpha testbeds by each participating consortium member
    • Obtain initial evaluation results from alpha testbeds
    • Address concerns/problems from initial evaluations of alpha testbeds
    • Issue open source public release of version 1.0 of repository software
    • Issue annual report summarizing progress

  2. Year 2 (Phase 2)
    • Evaluate version 1.0 of software
    • Conduct meeting of consortium members to assess feature priorities and evaluate Year 1 activities
    • Address security and policy enforcement
    • Address collection objects and collection object management
    • Address storage management system
    • Release version 1.0.x of repository software with enhancements
    • Issue annual report summarizing progress

  3. Year 3 (Phase 3)
    • Evaluate version 1.0.x of software
    • Conduct meeting of consortium members to assess feature priorities and evaluate Year 2 activities
    • Address editions and versioning of objects
    • Address dynamic context sensitive behaviors
    • Optimize for efficiency and scale
    • Release version 1.1.x of repository software with enhancements
    • Conduct meeting of consortium members to evaluate project
    • Issue final report summarizing project


Budget

The costs associated with this project will predominantly be the costs of personnel. We are requesting $1,000,128 from the Mellon Foundation to provide 3.5 full-time equivalent staff, including a Technical Coordinator and staff to work on the design and programming of the proposed system, plus funding for equipment for those people and for travel expenses for 4 meetings of the development team per year for each of three years.

The Technical Coordinator position will be in the Digital Library Research and Development (DLRD) group and will report to Thornton Staples. We expect this person to participate in design and implementation discussions with the research and development team and to be the primary point of contact for the deployment group as they begin to deploy the software. He or she will test the software as it evolves using the Virginia testbed. This person will not necessarily be a high-level programmer but will need to be very technically sophisticated, as well as a good communicator and organizer. We see this position as key to organizing the project and keeping all of the participants in sync. This person will coordinate all of the activities associated with the project, including organizing meetings, conference calls and other communications, as well as overseeing any administrative needs.

The 2.5 FTEs of programming time will be divided between Virginia and Cornell. These will be high-level programming positions that we believe are necessary to design and develop the software required for the proposed system.

1.5 FTE – programmers/Virginia: The 1.5 FTE at Virginia will be divided between the DLRD (1.0 FTE to be supervised by Ross Wayland) and the Advanced Technology Group (ATG) (0.5 FTE to be supervised by Tim Sigmon, director of the group). We believe that by placing these positions in these three software development groups we strengthen the connections among them and continue the collaboration that has produced the prototype. Note that the 25% of Staples' time and 50% of Wayland's time that have been committed to developing the prototype will continue to be devoted to this project. Also, the ATG will match the 0.5 FTE from the grant, and Tim Sigmon will continue to dedicate 10% of his time to the project.

1.0 FTE – programmers/Cornell: Carl Lagoze will directly supervise the 1.0 FTE to be added to the Cornell Digital Library Research Group (CDLRG). This will include moving 50% of Sandy Payette's time from another grant project and adding another 0.5 FTE. Funding will be provided to Lagoze through a subcontracting arrangement (see attached letter). Lagoze will contribute 5% of his time to the project.

The participants in the deployment group have committed to covering the expenses of their own participation. We will use the Technical Coordinator position to make that commitment as easy as possible to meet. Also, we have a verbal commitment from Daniel Greenstein, the director of the Digital Library Federation, to cover the expenses of the fall meeting.


Appendix A.

  1. Architectural Specifications

    1. General Object Model
      The core of the specification centers around the model used to define an object. Understanding the meaning of each of these components and how they interact is critical to the successful design and implementation of the repository system. To establish a common vocabulary we have provided definitions for key components of the object model and the repository.

      1. Object – From an architectural perspective, a digital object consists of a number of components that include a Unique Identifier (UID), an Object Map, System Metadata, one or more Disseminators, and one or more Datastreams.

        • Unique Identifier (UID) – A UID is the unique persistent identifier for the object, maintained by the repository software. Note that the UID is defined for internal use in the repository and may or may not be exposed to the outside world.

        • Object Map – An Object Map describes the internal structure of an object. It identifies each component in the object, and defines relationships among components. An important function of the object map is defining the roles that datastreams play in the context of specific disseminators (i.e., which datastreams are used by which disseminators). Each object must contain a single Object Map.

        • System Metadata – System metadata consists of a stream of bytes that contains ASCII text marked up with XML tags that conform to a specific XML Schema/DTD. It records the minimal amount of information about the object and its components that is necessary for basic internal repository management and indexing.

        • Datastream – A Datastream is a component that consists of a typed stream of bytes that adds content to the object (e.g., a digital image, an electronic text, a program, metadata, a database, a mapping or relational structure, etc.). An object can have one or more datastreams. Each datastream must contain the following:
          • Name – identifier for the datastream
          • Description – textual description of the datastream
          • Content Type – MIME type of the datastream
          • Control Type – There are two possible control types:
            • Internal – a datastream for content under the direct control of the repository owner.
            • External – a datastream for content that is outside the direct control of the repository.
          • Contents – a pointer to a MIME-typed stream of bytes (e.g., a local file system address or an HTTP address).


        • Disseminator – A Disseminator is a component that associates a set of behaviors with an object so that clients can obtain different views of the object's content (datastreams). A disseminator may transform content, process content, or prepare custom presentations of content. Each disseminator contains a mapping to a Signature and a Servlet, both described below. A Signature is a specification for a set of behaviors that an object is able to perform. In object-oriented terms, a signature is an abstract interface definition that can be implemented by a program. A Servlet is a module that implements the behaviors (methods) described by a signature in a specific setting; in object-oriented terms, a servlet implements the interface defined by its corresponding signature. A servlet comprises one or more methods, each of which implements a behavior as an "action" upon the object's content. Within any given signature and servlet pairing, there is a one-to-one correspondence between each method in the servlet and a method in the associated signature. A one-to-many relationship exists between signatures and servlets, since the same signature can be implemented in different ways by multiple servlets.

          • Signature – A signature contains one or more methods, each consisting of:
            • MethodName – identifier for the method
            • SignatureMethodDescription – an end-user oriented description of the method/behavior
            • MethodReturn Type – MIME type returned by the method
            • MethodParameter(s)
              • ParameterName – identifier for the parameter
              • ParameterDescription – a description of the parameter and how it is used
              • ParameterDefault Value – default value for the parameter (optional)

          • Servlet – A servlet contains one or more methods, each consisting of:
            • MethodName – identifier for the method (identical to method name in the corresponding signature)
            • ServletMethodDescription – a description of how this specific behavior is implemented; this is different from the end-user oriented description of the behavior defined in the signature
            • MethodReturn Type – MIME type returned by the method (identical to the method return type in the corresponding signature)
            • MethodParameter(s) – (identical to the parameters in the corresponding signature)
              • ParameterName – identifier for the parameter
              • ParameterDescription – a description of the parameter and how it is used
              • ParameterDefault Value – default value for the parameter (optional)
            • Action – a machine interpreted instruction on how to invoke a piece of executable code that performs the specific behavior
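The component definitions above can be read as a small set of data types. The following is a minimal sketch, assuming Java records as the representation; the type names, field names, and the `implementsSignature` check are illustrative assumptions and are not part of the specification. It shows how the one-to-one correspondence between servlet methods and signature methods might be verified when a disseminator is assembled.

```java
import java.util.List;
import java.util.Optional;

// Illustrative sketch: names follow the definitions above, but the Java
// shapes are assumptions, not part of the specification.
final class ObjectModelSketch {

    enum ControlType { INTERNAL, EXTERNAL }

    // A typed stream of bytes; "contents" is a pointer such as a local
    // file system address or an HTTP address.
    record Datastream(String name, String description, String contentType,
                      ControlType controlType, String contents) {}

    record Parameter(String name, String description,
                     Optional<String> defaultValue) {}

    // One behavior as declared by a signature (the abstract interface).
    record SignatureMethod(String name, String description,
                           String returnType, List<Parameter> parameters) {}

    record Signature(String name, List<SignatureMethod> methods) {}

    // One behavior as implemented by a servlet; "action" tells the
    // repository how to invoke the executable code.
    record ServletMethod(String name, String description, String returnType,
                         List<Parameter> parameters, String action) {}

    record Servlet(String name, List<ServletMethod> methods) {}

    // Checks the one-to-one correspondence between servlet methods and
    // signature methods (assumes method names are unique within each).
    static boolean implementsSignature(Signature signature, Servlet servlet) {
        if (signature.methods().size() != servlet.methods().size()) {
            return false;
        }
        for (SignatureMethod declared : signature.methods()) {
            boolean matched = servlet.methods().stream().anyMatch(impl ->
                    impl.name().equals(declared.name())
                    && impl.returnType().equals(declared.returnType())
                    && impl.parameters().equals(declared.parameters()));
            if (!matched) {
                return false;
            }
        }
        return true;
    }

    // A disseminator pairs a signature with one servlet that implements it.
    record Disseminator(Signature signature, Servlet servlet) {
        Disseminator {
            if (!implementsSignature(signature, servlet)) {
                throw new IllegalArgumentException(
                        "Servlet " + servlet.name()
                        + " does not implement signature " + signature.name());
            }
        }
    }
}
```

Because the check compares only names, return types, and parameters, several different servlets can satisfy the same signature, which matches the one-to-many relationship described above.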

      2. Dissemination – A dissemination is the result of executing a specific behavior defined by a signature that returns a MIME-typed stream of bytes. In order to perform a dissemination, a repository must receive a request that contains four pieces of information:
        • Name – an identifier for the object
        • Signature name – the identifier of a signature to which the object subscribes. Again, a signature describes a set of possible behaviors for the object.
        • Method name – the name of a behavior of the given signature
        • Parameter name/value pairs – zero or more parameter name/value pairs that may be required by the given method name

        The ability to perform disseminations is a fundamental requirement of the repository architecture. When a client issues a dissemination request, the repository software must be able to (1) locate the corresponding servlet for the request, (2) execute the appropriate servlet action, and (3) deliver the result to the requester of the dissemination.

        From an access perspective, a client does not need to know the architectural details of an object. The only thing an accessing client must do is issue a valid dissemination request (with the above four arguments).
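The three steps of a dissemination (locate, execute, deliver) can be sketched as follows. This is a minimal illustration under stated assumptions: the class `DisseminationSketch`, its `Action` interface, and the in-memory lookup table are hypothetical stand-ins for the repository's servlet resolution, and the action is an ordinary Java callback rather than a machine-interpreted instruction.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of dissemination handling; not the repository's API.
final class DisseminationSketch {

    // The four pieces of information carried by a dissemination request.
    record Request(String objectName, String signatureName,
                   String methodName, Map<String, String> parameters) {}

    // A dissemination is a MIME-typed stream of bytes.
    record Result(String mimeType, byte[] bytes) {}

    // Stand-in for a servlet method's executable action.
    interface Action {
        Result invoke(Map<String, String> parameters);
    }

    private final Map<String, Action> actions = new HashMap<>();

    private static String key(String objectName, String signatureName,
                              String methodName) {
        // NUL separators avoid collisions between concatenated names.
        return objectName + '\u0000' + signatureName + '\u0000' + methodName;
    }

    void register(String objectName, String signatureName,
                  String methodName, Action action) {
        actions.put(key(objectName, signatureName, methodName), action);
    }

    Result disseminate(Request request) {
        // (1) Locate the servlet action for the request.
        Action action = actions.get(key(request.objectName(),
                request.signatureName(), request.methodName()));
        if (action == null) {
            throw new IllegalArgumentException("No behavior "
                    + request.signatureName() + "/" + request.methodName()
                    + " for object " + request.objectName());
        }
        // (2) Execute the action; (3) return the result to the requester.
        return action.invoke(request.parameters());
    }
}
```

The client supplies only the four request fields and never sees datastreams or servlets directly, which is the access property described above.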


    2. Open Protocols

      • Open Repository Dissemination Protocol (ORDP)
        The repository will define an open protocol for obtaining disseminations from objects. This will promote interoperable access to digital object content among distributed repositories. The protocol will be implemented over HTTP.
      • Open Archives Initiative (OAI) Protocol
        The repository should be compliant with the OAI protocol so that metadata can be harvested by other services. To this end, we will build an OAI component in the repository. This component will respond to all valid OAI requests and will harvest metadata from all objects that implement Metadata Disseminators. The OAI-compliant repository can expose metadata both to companion service components (see Indexing and Search Service below) and to third-party external services.
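As an illustration of what a harvester would send to the OAI component, the sketch below builds a `ListRecords` request URL. It assumes the request form of the OAI protocol as later standardized (an HTTP request with a `verb` argument and a `metadataPrefix` such as `oai_dc`); the base URL and the class name `OaiRequestSketch` are hypothetical.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper; only the verb and metadataPrefix arguments come
// from the OAI protocol, everything else is an assumption.
final class OaiRequestSketch {

    static URI listRecords(String baseUrl, String metadataPrefix) {
        // LinkedHashMap keeps the arguments in insertion order.
        Map<String, String> arguments = new LinkedHashMap<>();
        arguments.put("verb", "ListRecords");
        arguments.put("metadataPrefix", metadataPrefix);

        String query = arguments.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "="
                        + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
        return URI.create(baseUrl + "?" + query);
    }
}
```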


    3. Mechanisms for Mediating Communication
      The repository must have mechanisms that enable access to external services. From the perspective of accessing an object, there are two distinct scenarios where communication must be mediated:

      1. Accessing datastreams – datastreams are stored as pointers within the repository and require a supporting mechanism to enable the object to retrieve the contents of that remote referenced datastream.

      2. Executing disseminations – the dissemination of a specific behavior of an object requires executing the action specified in the appropriate servlet module. The execution of the servlet module will require a supporting mechanism that may be different than that used for accessing a datastream.

      Currently, the UVa implementation supports an HTTP-mediating mechanism for accessing datastreams and executing disseminations. It is highly desirable to consider offering mechanisms that mediate other protocols (e.g., Z39.50) so the repository can utilize remote services not directly accessible through HTTP.
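One way to leave room for additional protocols is to select a mediator by the scheme of the datastream pointer. The following is a minimal sketch of that idea, not the UVa implementation: `MediationSketch` and its `Mediator` interface are hypothetical names, and the mediators are expected to be supplied by the caller.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical registry that dispatches retrieval by URI scheme.
final class MediationSketch {

    // A mediator knows how to retrieve bytes over one protocol.
    interface Mediator {
        byte[] retrieve(URI location);
    }

    private final Map<String, Mediator> mediators = new HashMap<>();

    void register(String scheme, Mediator mediator) {
        // URI schemes are case-insensitive, so normalize to lower case.
        mediators.put(scheme.toLowerCase(Locale.ROOT), mediator);
    }

    byte[] retrieve(URI location) {
        String scheme = location.getScheme();
        Mediator mediator = scheme == null
                ? null
                : mediators.get(scheme.toLowerCase(Locale.ROOT));
        if (mediator == null) {
            throw new IllegalArgumentException(
                    "No mediator registered for scheme: " + scheme);
        }
        return mediator.retrieve(location);
    }
}
```

With this shape, supporting a protocol such as Z39.50 would mean registering one more mediator, leaving the object model and the dissemination request unchanged.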

    4. External Application Management Services
      The execution of disseminations by the repository requires running a specific piece of code that is referenced through the disseminator's associated servlet module. Running this code may require the host server to provide one or more application services. For example, if the executable code is a CGI script written in Perl, then the host on which the servlet is implemented must also provide a Perl interpreter with which to run the script. In this example, the external application service is the Perl interpreter. Developers writing servlet modules will know which application services their servlets require. The repository must provide a means of installing servlet modules into the repository environment, including storing metadata about the external application services used by one or more servlet modules. To ensure the integrity of the execution environment, the repository will periodically run tests to detect any undesirable changes in the external application services. System managers and developers will be able to query the repository for information about which services are currently supported by the repository, and about dependencies between servlet modules and external services.
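The bookkeeping described here can be sketched as a small registry. The example below is an assumption-laden illustration: `ServiceRegistrySketch`, the use of a version string as the recorded metadata, and the `probe` callback that reports the currently observed version are all hypothetical choices, not requirements of the proposal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.function.Function;

// Hypothetical registry of external application services.
final class ServiceRegistrySketch {

    // Service name -> version recorded when a servlet module was installed.
    private final Map<String, String> expectedVersions = new TreeMap<>();
    // Servlet module name -> names of the services it depends on.
    private final Map<String, Set<String>> dependencies = new TreeMap<>();

    void installServlet(String module, Map<String, String> requiredServices) {
        dependencies.put(module, new TreeSet<>(requiredServices.keySet()));
        expectedVersions.putAll(requiredServices);
    }

    // Answers "which servlet modules depend on this service?"
    Set<String> servletsDependingOn(String service) {
        Set<String> modules = new TreeSet<>();
        dependencies.forEach((module, services) -> {
            if (services.contains(service)) {
                modules.add(module);
            }
        });
        return modules;
    }

    // Periodic integrity test: the probe reports the version currently
    // observed for a service, or empty if the service is unavailable.
    List<String> check(Function<String, Optional<String>> probe) {
        List<String> problems = new ArrayList<>();
        expectedVersions.forEach((service, expected) -> {
            Optional<String> observed = probe.apply(service);
            if (observed.isEmpty()) {
                problems.add(service + ": missing");
            } else if (!observed.get().equals(expected)) {
                problems.add(service + ": expected " + expected
                        + ", found " + observed.get());
            }
        });
        return problems;
    }
}
```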

    5. Indexing and Search Service
      This companion service would provide a way to index and search XML-encoded metadata for all objects in the repository that contain Metadata Disseminators. The service will harvest metadata from the Repository using the OAI protocol. Harvested metadata would be made available to both internal and external XML indexing and search engines.

    6. B
      Because all actions of or on an object are disseminations of that object, it seems natural to track the use of objects by creating a log record for each dissemination. All of the parameters of the dissemination call must be kept, including the UID of the object, signature name, method name, and parameter name/value pairs. In addition, as much information about the requester as possible should be kept, including IP address, communications protocol, etc. The logging mechanism should be configurable in a manner similar to the logging of the Apache web server.
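A log record of this kind, with a configurable line format, might look like the sketch below. The `{name}` placeholder syntax is an invention for illustration (it is not Apache's format-string syntax), and `DisseminationLogSketch` and its field names are hypothetical.

```java
import java.time.Instant;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical log record and formatter for disseminations.
final class DisseminationLogSketch {

    record Entry(Instant time, String uid, String signatureName,
                 String methodName, Map<String, String> parameters,
                 String requesterIp, String protocol) {}

    // Expands {time}, {uid}, {signature}, {method}, {params}, {ip} and
    // {protocol} in the template. Values are substituted as-is; a
    // production logger would also need to escape them.
    static String format(String template, Entry entry) {
        // Sort parameters by name so the output is deterministic.
        String params = new TreeMap<>(entry.parameters()).entrySet().stream()
                .map(p -> p.getKey() + "=" + p.getValue())
                .collect(Collectors.joining("&"));
        return template
                .replace("{time}", entry.time().toString())
                .replace("{uid}", entry.uid())
                .replace("{signature}", entry.signatureName())
                .replace("{method}", entry.methodName())
                .replace("{params}", params)
                .replace("{ip}", entry.requesterIp())
                .replace("{protocol}", entry.protocol());
    }
}
```

A repository manager could then choose which fields to record by editing the template, for example `"{ip} {protocol} {uid} {signature}/{method}?{params}"`.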

  2. Repository Management Utilities

    Given the Object Model, the function of the repository is to store, retrieve, index (possibly multiple indices), and maintain all the objects in the repository. To use the repository effectively, a suite of repository management tools must also exist that enable system administrators and library personnel to interact with objects stored in the repository.

    Phase 1 of the project would include the development of a "control panel" for the repository that allows the repository manager to carry out all of these functions. However, all of the utilities listed below should also be available as modules for use in a variety of other processes.

    1. Create an Object
      • Create a UID
      • Create Object Map
      • Create System Metadata datastream

    2. Modify an Object
      • Modify UID
      • Modify Object Map
      • Add a datastream
      • Modify a datastream
      • Delete a datastream
      • Add a disseminator
      • Modify a disseminator
      • Delete a disseminator
      • Add access policy
      • Modify access policy
      • Delete access policy

    3. Delete an Object
      • Remove an object "appropriately"
        Questions:
        • What does it mean to completely remove an object?
        • Is the UID to be deleted referenced anywhere else in the repository?
        • Do you retain the UID after deleting an object?
        • How to handle remote referenced datastreams in a deleted object?
        • What to do if this object is the last one referring to a servlet and/or signature (i.e., should removing an object also mean removing any associated signatures and servlets)?

    4. Access an Object
      • List map of an Object
        • For internal datastreams, provide a listing of the header information only.
        • For remote referenced datastreams, provide a listing of the header information and the reference to the resource.
      • List the methods of a given Signature or Servlet (and all their components)
      • List access policies
      • List contents of a datastream
      • Disseminate an object (i.e., execute a specific behavior)
      • Disseminate a list of objects

    5. Search for Objects
      • Search for objects based on repository component information (e.g., UID, signature name, servlet name, datastream name)
      • Search for objects based on metadata information (e.g., creator, title, subject, etc.)

      All repository management utilities should provide a batch processing interface that enables batch loading, modifying, and removing of complete objects or any of their components.
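The batch processing interface could be sketched as a list of operations applied in order, with one outcome reported per operation. The example below is only an illustration: the in-memory store (UID mapped to datastream names), the `BatchSketch` class, and the choice to continue after a failed operation rather than abort the batch are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical batch runner over an in-memory store.
final class BatchSketch {

    // The store maps each object's UID to the names of its datastreams.
    interface Operation {
        void apply(Map<String, Set<String>> store);
    }

    record Outcome(int index, boolean ok, String message) {}

    static Operation createObject(String uid) {
        return store -> {
            if (store.putIfAbsent(uid, new TreeSet<>()) != null) {
                throw new IllegalStateException("UID already exists: " + uid);
            }
        };
    }

    static Operation addDatastream(String uid, String name) {
        return store -> {
            Set<String> datastreams = store.get(uid);
            if (datastreams == null) {
                throw new IllegalStateException("No such object: " + uid);
            }
            if (!datastreams.add(name)) {
                throw new IllegalStateException("Datastream exists: " + name);
            }
        };
    }

    static Operation deleteDatastream(String uid, String name) {
        return store -> {
            Set<String> datastreams = store.get(uid);
            if (datastreams == null || !datastreams.remove(name)) {
                throw new IllegalStateException(
                        "No datastream " + name + " in object " + uid);
            }
        };
    }

    // Applies every operation in order; a failure is recorded and the
    // batch continues with the next operation.
    static List<Outcome> run(Map<String, Set<String>> store,
                             List<Operation> batch) {
        List<Outcome> outcomes = new ArrayList<>();
        for (int i = 0; i < batch.size(); i++) {
            try {
                batch.get(i).apply(store);
                outcomes.add(new Outcome(i, true, "ok"));
            } catch (IllegalStateException e) {
                outcomes.add(new Outcome(i, false, e.getMessage()));
            }
        }
        return outcomes;
    }
}
```

Object deletion is deliberately left out of the sketch, since the questions listed under "Delete an Object" are still open.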

  3. Implementation Details
    • In the first phase, we will develop a system that can handle at least 1 million objects while sustaining average response times of less than 2 seconds per transaction for 20 simultaneous users, running on a 4-processor Sun Enterprise 420R server and using a freely available SQL database package. These conditions will be simulated using JMeter or an equivalent user simulation system. The transactions will be a realistic mix of user requests and management processes.
    • For the repository backend database, we will provide default bindings for one freeware SQL database package (e.g., MySQL) and one commercial package (e.g., Oracle).

    • For the repository indexing and search services, we will provide default bindings for one freeware XML package (e.g., SGREP) and one commercial package (e.g., XYZFind).

    • All repository software development will be done using Java.

    • The resulting software will all be made freely available under a GPL open-source license.
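The performance target in the first item above (an average of less than 2 seconds per transaction with 20 simultaneous users) is the kind of measurement a tool such as JMeter would make. The sketch below only illustrates what is being measured; `LoadSketch` is a hypothetical name, the transaction is supplied by the caller, and the sketch does not model think time or a realistic transaction mix.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical harness: N simulated users each run the same transaction
// repeatedly, and the mean response time is reported in milliseconds.
final class LoadSketch {

    static double averageMillis(int users, int requestsPerUser,
                                Runnable transaction) {
        ExecutorService pool = Executors.newFixedThreadPool(users);
        LongAdder totalNanos = new LongAdder();
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (int u = 0; u < users; u++) {
                futures.add(pool.submit(() -> {
                    for (int r = 0; r < requestsPerUser; r++) {
                        long start = System.nanoTime();
                        transaction.run();
                        totalNanos.add(System.nanoTime() - start);
                    }
                }));
            }
            for (Future<?> future : futures) {
                future.get(); // wait for every simulated user to finish
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Load run interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Transaction failed", e.getCause());
        } finally {
            pool.shutdown();
        }
        return totalNanos.sum() / 1_000_000.0 / (users * requestsPerUser);
    }
}
```

A run such as `LoadSketch.averageMillis(20, 100, transaction)` would meet the stated target if it returned a value below 2000.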


Digital Initiatives
University of Virginia
PO Box 400112
Charlottesville, VA 22904-4112

Maintained by: dl@virginia.edu
Last Modified: Monday, June 02, 2008
© The Rector and Visitors of the University of Virginia