NCSTRL Documentation

Defining your own document database structure and document formats

The automatic configuration tool simplifies the configuration of Dienst and the initial creation of a document database. It also defines a set of default document formats. You may, however, wish to use another directory structure as your document database or add other document formats to your database. This requires that you hand-craft the Perl code that defines Dienst's configuration in Config/custom.pl. This document describes this process. Warning: if you modify code in Config/custom.pl, running auto_config.pl again will overwrite these modifications. If you have any further questions, contact help@ncstrl.org.

You should first run auto_config.pl to generate the template configuration files. When the tool asks whether you wish to create a document database, respond no. You can then edit the configuration files described below.

The Config directory contains all of the configuration files for a Dienst server. These files are:

Modifying document database mappings in custom.pl

Recall that every document has a unique, location-independent identifier called a handle, which consists of a naming authority and a document name. The document name may include additional information, such as a year or series, that is of local significance. Dienst internally uses docids for document naming. The mapping from a handle to a docid is simple: replace the first / in the handle with a :. For example, ncstrl.cornellcs/TR94-1418 becomes ncstrl.cornellcs:TR94-1418.
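The handle-to-docid rule can be sketched in a few lines of Perl. The helper name handle_to_docid is hypothetical and is not part of Dienst; it only illustrates the substitution described above.

```perl
use strict;
use warnings;

# Hypothetical helper (not a Dienst routine): convert a handle to a docid
# by replacing the first "/" with ":".
sub handle_to_docid {
    my ($handle) = @_;
    (my $docid = $handle) =~ s{/}{:};    # no /g flag: only the first slash changes
    return $docid;
}

print handle_to_docid("ncstrl.cornellcs/TR94-1418"), "\n";
# prints "ncstrl.cornellcs:TR94-1418"
```

Because the substitution is not global, any later / in the document name is left untouched.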

The Perl code in custom.pl is used by Dienst to map this docid to the locations in your file system of the various representations of a document.

The subroutines in this file that you must modify are defined below. Note that when a fatal condition occurs, the respective subroutine should issue a Perl die command with an appropriate error message.

directory_to_docid
Description: returns the docid stored in the directory supplied as the input argument.
Arguments:
Returns: a string that is the docid found or the null string if the mapping fails. The docid must conform to the format naming_authority:document_name.
Fatal Conditions: if the mapping fails and the argument $die is true.
docid_to_directory
Description: maps a docid to the directory pathname where the document is stored. It is the inverse of directory_to_docid.
Arguments:
Returns: a string that is the full path of the directory where the docid is stored.
Fatal Conditions: if the mapping fails.
Traverse_TRs
Description: walks the entire document database; for each directory that maps to a docid (using directory_to_docid, documented above), invokes the subroutine supplied as the argument. The arguments to this subroutine call should be the docid found and the corresponding full directory path.
Arguments:
Returns: none
Fatal Conditions: if the call to $subroutine returns an error code ($@ is non-null). Also fatal if the directory traversal fails (e.g., unable to open the directory with the opendir function).
Get_year
Description: parses a docid that contains the year of the document and returns the year. If the document year cannot be inferred from your docids, you must remove or comment out this routine. This routine is used for browsing the collection by year and is meant to be as fast as possible.
Arguments:
Returns: a string that is the year (19XX) or zero.
Fatal Conditions: although the default is zero, the routine should not be defined if there is no way to infer the year from a docid.
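The four routines described above might look roughly like the following for one simple layout. This is a minimal sketch, not the code that auto_config.pl generates. It assumes a database stored as $DOCROOT/naming_authority/document_name, document names of the form TR94-1418, and the argument orders shown in the comments. The variable $DOCROOT, its value, and the argument orders are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Assumed database root; the layout is $DOCROOT/<authority>/<document_name>.
our $DOCROOT = "/usr/local/dienst/docs";

# Assumed arguments: ($dir, $die). Returns "authority:name" or "".
sub directory_to_docid {
    my ($dir, $die) = @_;
    return "$1:$2" if $dir =~ m{^\Q$DOCROOT\E/([^/:]+)/([^/]+)/?$};
    die "directory_to_docid: cannot map $dir\n" if $die;
    return "";
}

# Assumed argument: ($docid). Inverse of directory_to_docid; dies on failure.
sub docid_to_directory {
    my ($docid) = @_;
    my ($authority, $name) = $docid =~ m{^([^/:]+):([^/]+)$}
        or die "docid_to_directory: malformed docid $docid\n";
    return "$DOCROOT/$authority/$name";
}

# Return the names of the visible subdirectories of $dir; dies if unreadable.
sub subdirectories {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Traverse_TRs: cannot open $dir: $!\n";
    my @names = sort grep { !/^\./ && -d "$dir/$_" } readdir($dh);
    closedir($dh);
    return @names;
}

# Assumed argument: ($subroutine), called as $subroutine->($docid, $dir).
sub Traverse_TRs {
    my ($subroutine) = @_;
    for my $authority (subdirectories($DOCROOT)) {
        for my $name (subdirectories("$DOCROOT/$authority")) {
            my $dir   = "$DOCROOT/$authority/$name";
            my $docid = directory_to_docid($dir, 0) or next;
            # A failing callback is fatal, as the description requires.
            eval { $subroutine->($docid, $dir); 1 }
                or die "Traverse_TRs: callback failed for $docid: $@";
        }
    }
    return;
}

# Assumed argument: ($docid). Assumes names like TR94-1418, all from 19XX.
sub Get_year {
    my ($docid) = @_;
    return $docid =~ /:[A-Za-z]+(\d\d)-/ ? "19$1" : 0;
}
```

With this layout, Traverse_TRs(sub { print "$_[0] is in $_[1]\n" }) would print one line per document directory. A database with a different directory structure needs different patterns in all four routines, and Get_year should be omitted if the year cannot be read from the docid.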

Declaring storage formats in custom.pl

This section of the Dienst installation instructions discusses adding document format types to, or modifying them in, your document database and Dienst server. Document formats are defined in custom.pl via calls to the subroutine def_format. The information in these calls is used by Dienst to map from a specific format (e.g. postscript) to the file(s) in the respective format in your document database. Note that Dienst requires that all formats for a given document be stored (or linked) in or under a common sub-directory. You do not need to re-index your database after adding new formats. Your server will automatically recognize the files that correspond to the formats.

Document Format Types: Paged and Non-Paged

Dienst supports a multi-formatted document collection. A multi-formatted document is one that consists of many files, each one being a different representation of the document. Plain ASCII, PostScript and TIFF are examples of possible formats for a document. Document formats are either non-paged or paged.
Non-paged formats
Non-paged document formats are those where there is a single file that represents the entire document. Two common non-paged formats are PostScript and plain ASCII text. Users will not be able to view specific pages of a document in a non-paged format, but they will be able to download a single file that represents the entire document. Non-paged formats may be stored either in your document database, or they may be accessible to your Dienst server through an anonymous FTP repository.

Paged formats
Paged document formats are those where the document is represented by a set of files, each one corresponding to a single page of the document. Two common paged formats are bitmapped images (TIFFs or GIFs) and the output of an OCR process. The Dienst user interface allows users to view specific pages of paged formats of documents. Downloading the individual pages, however, requires multiple Dienst requests. Paged formats must be in your document database for access by the Dienst server.

A Note on compressed file usage in Dienst

The files that represent your document formats may be stored in compressed or uncompressed form. Dienst assumes that compressed files are identified by a suffix unique for each compression scheme. You define compression suffixes and the corresponding command for decompressing files identified by that suffix using the compression_schemes configuration setting. The pattern argument in your def_format storage declarations should always identify the file name without its compression suffix.
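As an illustration of how suffixes relate to file names, the sketch below lists the names a storage declaration could match for a given uncompressed file name. It assumes that compression_schemes is a hash keyed by suffix (without the dot) whose values are decompression commands. The command strings and the helper candidate_files are assumptions for illustration, and how Dienst itself performs the match is not shown here.

```perl
use strict;
use warnings;

# Assumed shape: suffix (without the dot) => decompression command.
# The command strings are illustrative and not taken from Dienst.
my %compression_schemes = (
    'Z'  => 'uncompress -c',
    'gz' => 'gunzip -c',
);

# Hypothetical helper: the uncompressed name first, then one candidate
# per compression suffix.
sub candidate_files {
    my ($base, $schemes) = @_;
    return ($base, map { "$base.$_" } sort keys %$schemes);
}

print "$_\n" for candidate_files("TR94-1418.ps", \%compression_schemes);
# prints TR94-1418.ps, TR94-1418.ps.Z and TR94-1418.ps.gz, one per line
```

This is why the pattern in a storage declaration names only the uncompressed file: the compressed variants follow from the suffix table.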

How to make or modify calls to def_format

The arguments to def_format are defined below. Here is an example: The auto_config.pl script produces the following def_format call for PostScript:
&def_format ("postscript", "application/postscript", 0,
	"PostScript", "",
	"%s.ps", "name", "");
where name is the name of the document (the part of the handle after the naming authority). Note that this format is not specific to any naming authority.

So, the PostScript file for document ncstrl.cornellcs/TR94-1418 is stored in the top level directory for the document and is called TR94-1418.ps. Note also that if $compression_schemes{'Z'} is defined, for example, this storage format declaration will also match the file TR94-1418.ps.Z.

Another example: The auto_config.pl script produces the following def_format call for inline images:

&def_format ("inline", "image/gif", 0,
	     "inline gif image", "",
	     "inline/%04d.gif", "page");
where page is one of the predefined variables that you may use in your def_format calls.

So, all inline page image files for a document are stored in a directory named inline in the top level directory for the document. Each file in inline is named with a four-digit page number followed by the suffix gif (e.g., 0001.gif). The suffixes defined by %compression_schemes are also used for file name matching.
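The patterns in the two examples look like sprintf format strings. Assuming they are expanded that way (an assumption, since this document does not say so explicitly), the file names can be reproduced with the sketch below. The helper expand_pattern is hypothetical and is not a Dienst routine.

```perl
use strict;
use warnings;

# Hypothetical helper: fill a def_format-style pattern with the value of
# the named variable ("name" or "page"), the way sprintf would.
sub expand_pattern {
    my ($pattern, $variable, %values) = @_;
    die "expand_pattern: no value for '$variable'\n"
        unless defined $values{$variable};
    return sprintf($pattern, $values{$variable});
}

print expand_pattern("%s.ps", "name", name => "TR94-1418"), "\n";
# prints "TR94-1418.ps"
print expand_pattern("inline/%04d.gif", "page", page => 1), "\n";
# prints "inline/0001.gif"
```

The %04d conversion pads the page number with zeros to four digits, which matches the 0001.gif naming described above.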

Any comments or questions?
Contact us at help@ncstrl.org.