File: build-inverted-indexes.pl

Description: Utility to perform an offline build of the database files for the Dienst RFC-1357 db engine. The files generated are:

    abstractWords.ind - docids are double encoded*
    titleWords.ind
    authorWords.ind
    AuthorIndex.ind
    TRbyNumber.ind
    TRMap.ind - Map tech report name to encoding
    KTRMap.ind - Map encoded tech report number to tech report name
    Authority.ind - Map publisher to encoding
    KAuthority.ind - Map publisher encoding to publisher

* Docids are double encoded for the following reason: when we decode the list of docids, we normally do a publisher match on the unencoded list. Publisher names tend to be long and are growing in number. Therefore we do the publisher comparison with the encoded publisher, which reduces the publisher to 1-2 characters. This optimization benefits servers with several publishers, i.e., backup_server.

The first three files are inverted indexes for the respective fields in the RFC-1357 bib file. Each line of a file is a case-insensitive word from the respective bib file field, followed by a sorted list of the DocIDs that contain this word. These items are delimited by our favorite $RS delimiter.

AuthorIndex.ind is an inverted index of the full author name. Each line is an author name followed by a sorted list of DocIDs by this author (doing some authority control at this point might make sense).

Each line of TRbyNumber.ind is a DocID, Title, and Author.

Dienst loads the contents of these files at startup and uses the generated arrays as its database.

When run with no arguments, this utility does a complete regeneration of the index files. An incremental reload can be done in two ways: 1) with a list of docids as arguments, 2) with an argument that specifies a file containing docids (one per line).
The three invocation methods (complete regeneration, docids as arguments, and a file of docids) are shown below:

    build-inverted-indexes.pl
    build-inverted-indexes.pl CORNELLCS:TR94-1418 CORNELLCS:TR71-25
    build-inverted-indexes.pl -f newdocids

Batch mode: Normally, build-inverted-indexes.pl prompts the user when it detects a document that has already been indexed. If the batch (-b) flag is specified, the program will not prompt the user when a bibliographic entry already exists in the database. It will delete the old entry and index the new bibliographic record.

    build-inverted-indexes.pl -f newdocids -b

Skip documents that have already been indexed (exist in database): The -s flag causes build-inverted-indexes.pl to skip documents that already exist in the database. If you know that the database records are valid, there is no reason to remove the existing index entries and then add them again.
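The layout of the word index files described above (one lowercased word per line, followed by a sorted list of the DocIDs containing it) can be illustrated with a minimal Perl sketch. This is not the utility's actual code: the sub names, the tokenization on non-word characters, the sample titles, and the tab used as a stand-in for the $RS delimiter are all assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the engine's $RS delimiter; the real value
# is not stated in this description.
my $DELIM = "\t";

# Build an inverted index: lowercased word => sorted list of docids.
# $field_text_for maps docid => text of one bib field (e.g. the title).
sub build_inverted_index {
    my ($field_text_for) = @_;
    my %seen;    # word => { docid => 1 }, so each docid is recorded once
    for my $docid (keys %$field_text_for) {
        # Lowercase for case-insensitive matching, then split on
        # non-word characters (an assumed tokenization rule).
        my @words = grep { length } split /\W+/, lc $field_text_for->{$docid};
        $seen{$_}{$docid} = 1 for @words;
    }
    my %index;
    $index{$_} = [ sort keys %{ $seen{$_} } ] for keys %seen;
    return \%index;
}

# Render one line per word: the word, then its docids, joined by $DELIM.
sub index_lines {
    my ($index) = @_;
    return map { join($DELIM, $_, @{ $index->{$_} }) } sort keys %$index;
}

# Sample data: the docids come from the usage examples above; the
# titles are invented for this sketch.
my %titles = (
    'CORNELLCS:TR94-1418' => 'Distributed Digital Libraries',
    'CORNELLCS:TR71-25'   => 'Digital Circuits',
);
my $index = build_inverted_index(\%titles);
print "$_\n" for index_lines($index);
```

Collecting docids in a hash per word before sorting keeps each docid unique even when a word appears several times in the same field.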