NCSTRL Documentation

build-inverted-indexes.pl

 File: build-inverted-indexes.pl

 Description:
       Utility to perform offline build of database files for the Dienst 
       RFC-1357 db engine.  The files generated are:

           abstractWords.ind  - docids are double encoded*
           titleWords.ind
           authorWords.ind

       * Docids are double encoded for the following reason: when we decode
         the list of docids, we normally do a publisher match on the unencoded
         list. Publisher names tend to be long and are growing in number.
         Therefore we do the publisher comparison against the encoded
         publisher, which reduces the publisher to 1-2 characters. This
         optimization benefits servers with several publishers, i.e.,
         backup_server.

           AuthorIndex.ind
           TRbyNumber.ind

           TRMap.ind          - Map tech report name to encoding
           KTRMap.ind         - Map encoded tech report number to tech
                                report name
           Authority.ind      - Map publisher to encoding
           KAuthority.ind     - Map publisher encoding to publisher
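
       The publisher encoding behind Authority.ind and KAuthority.ind (and
       the comparison on encoded publishers described in the note above) can
       be illustrated with a small sketch. This is not the utility's Perl
       code: the base-36 code alphabet, the CodeMap class, and the publisher
       name OTHERPUB are all assumptions made for illustration, and the real
       Dienst encoding may differ.

```python
import string

# Assumption: a base-36 alphabet for short codes. The real Dienst encoding
# is not specified in this document.
ALPHABET = string.digits + string.ascii_lowercase


def short_code(n):
    """Return a base-36 string for the non-negative integer n."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, len(ALPHABET))
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


class CodeMap:
    """Two-way map, analogous to an Authority.ind / KAuthority.ind pair."""

    def __init__(self):
        self.to_code = {}  # publisher -> code (like Authority.ind)
        self.to_name = {}  # code -> publisher (like KAuthority.ind)

    def encode(self, name):
        # Assign the next unused short code the first time a name is seen.
        if name not in self.to_code:
            code = short_code(len(self.to_code))
            self.to_code[name] = code
            self.to_name[code] = name
        return self.to_code[name]

    def decode(self, code):
        return self.to_name[code]


def encode_docid(docid, publishers):
    """Encode only the publisher part of 'PUBLISHER:REPORT'."""
    publisher, report = docid.split(":", 1)
    return publishers.encode(publisher) + ":" + report


def by_publisher(encoded_docids, publisher, publishers):
    """Filter encoded docids by comparing short publisher codes only."""
    code = publishers.to_code.get(publisher)
    return [d for d in encoded_docids if d.split(":", 1)[0] == code]
```

       With this sketch, a publisher match compares a one- or two-character
       code per docid instead of the full publisher name.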

       The first three files are inverted indexes for the respective field
       in the RFC-1357 bib file.  Each line of a file is a case-insensitive
       word from the respective bib file field, followed by a sorted list of
       the DocIDs that contain this word.  These items are delimited by our
       favorite $RS delimiter.  AuthorIndex.ind is an inverted index of
       full author names.  Each line is an author name followed by a sorted
       list of DocIDs by this author (doing some authority control at this
       point might make sense).  Each line of TRbyNumber.ind is a DocID,
       Title, and Author.  Dienst loads the contents of these files at
       startup and uses the generated arrays as its database.
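
       The line layout of the word indexes described above can be sketched
       as follows. This is an illustration in Python, not the utility's Perl
       code. The value of the $RS delimiter is not given here, so RS below
       is a stand-in; the \w+ tokenization and the record layout (a dict of
       docid to field text) are likewise assumptions.

```python
import re
from collections import defaultdict

# Assumption: stand-in for the $RS delimiter, whose actual value is not
# given in this document.
RS = "\x1e"


def build_inverted_index(records, field):
    """Map each lowercased word in `field` to a sorted list of docids.

    `records` is a dict of docid -> dict of field name -> text.
    """
    index = defaultdict(set)
    for docid, fields in records.items():
        # Lowercasing makes the index case-insensitive.
        for word in re.findall(r"\w+", fields.get(field, "").lower()):
            index[word].add(docid)
    return {word: sorted(docids) for word, docids in index.items()}


def format_lines(index, rs=RS):
    """One line per word: the word, then its docids, joined by `rs`."""
    return [rs.join([word] + docids) for word, docids in sorted(index.items())]
```

       Calling build_inverted_index(records, "title") and passing the result
       to format_lines would give lines in the spirit of titleWords.ind; the
       same function applied to the abstract or author field covers the
       other two word indexes.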

       When run with no arguments, this utility does a complete regeneration
       of the index files.  An incremental reload can be done in two ways: 1)
       with a list of docids as arguments, 2) with an argument that specifies
       a file containing docids (one per line).  These three methods are shown
       below:
           build-inverted-indexes.pl   
           build-inverted-indexes.pl CORNELLCS:TR94-1418 CORNELLCS:TR71-25
           build-inverted-indexes.pl -f newdocids
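
       The choice between a complete regeneration and an incremental reload
       can be sketched as below. This is a Python illustration under stated
       assumptions, not the utility's Perl code: select_docids is a
       hypothetical helper, and treating an invocation with only other
       flags (and no docids or -f) as a complete regeneration is a guess.

```python
def select_docids(argv):
    """Return None for a complete regeneration, else the docids to reload.

    `argv` excludes the program name. Only -f is interpreted here; other
    flags are skipped.
    """
    docids = []
    incremental = False
    args = list(argv)
    i = 0
    while i < len(args):
        arg = args[i]
        if arg == "-f":
            if i + 1 >= len(args):
                raise ValueError("-f requires a file name")
            # The file holds one docid per line; blank lines are ignored.
            with open(args[i + 1]) as handle:
                docids.extend(line.strip() for line in handle if line.strip())
            incremental = True
            i += 2
        elif arg.startswith("-"):
            i += 1  # other flags are handled elsewhere
        else:
            docids.append(arg)
            incremental = True
            i += 1
    return docids if incremental else None
```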

       Batch mode: Normally, build-inverted-indexes.pl prompts the user
                   when it detects a document that has already been indexed.
                   If the batch (-b) flag is specified, the program will
                   not prompt the user when a bibliographic entry already
                   exists in the database. It will delete the old entry
                   and index the new bibliographic record.

            build-inverted-indexes.pl -f newdocids -b

       Skip documents that have already been indexed (exist in database):

                   The -s flag causes build-inverted-indexes.pl to skip
                   documents that already exist in the database. If you know
                   that the database records are valid, there is no reason to
                   remove the existing index entries and then add them again.
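
       The handling of documents that are already indexed can be summarized
       in a short sketch. This is a Python illustration, not the utility's
       Perl code. plan_action and its parameters are hypothetical names, and
       two behaviors are assumptions the text does not state: that -s wins
       when both -s and -b are given, and that declining the prompt leaves
       the existing entry untouched.

```python
def plan_action(already_indexed, batch=False, skip=False, confirm=None):
    """Return 'index', 'replace', or 'skip' for one document.

    `confirm` is a zero-argument callable returning True to replace; it
    stands in for the interactive prompt.
    """
    if not already_indexed:
        return "index"
    if skip:
        # -s: leave the existing index entries alone.
        return "skip"
    if batch:
        # -b: delete the old entry and index the new record, no prompt.
        return "replace"
    if confirm is None:
        def confirm():
            answer = input("Document already indexed. Replace? [y/n] ")
            return answer.strip().lower().startswith("y")
    # Assumption: a "no" answer keeps the existing entry.
    return "replace" if confirm() else "skip"
```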

Any comments or questions?
Contact us at help@ncstrl.org.