Programming Bulk-Incremental Dataflows

Dionysios Logothetis, Christopher Olston, Benjamin Reed, Kevin Webb and Kenneth Yocum
CS2009-0951
October 30, 2009

Government, medical, financial, and web-based services increasingly depend on the ability to rapidly sift through huge, evolving data sets. These data-intensive applications perform complex multi-step computations over successive generations of data inflows (e.g., weekly web crawls, nightly telescope dumps, or hourly surveillance videos). Because of the data volumes involved, applications must avoid reprocessing old data when new data arrives and instead process incrementally. Unlike in stream-based systems, incoming data does not have to be processed immediately, permitting work to be amortized via bulk processing. Such bulk-incremental processing represents an emerging class of applications whose needs are not fully met by current systems. This paper presents a generalized architecture for bulk-incremental processing systems (BIPS), simplifying the creation of such programs. In contrast with incremental view maintenance in data warehousing, BIPS provides flexible low-level primitives for managing incremental data and processing, upon which both relational and non-relational operations can be implemented. The paper describes the BIPS programming model along with several example applications and examines some key implementation choices. Experiments on a large testbed cluster show that these choices play a major role in overall system performance.


The authors of these documents have submitted their reports to this technical report series for the purpose of non-commercial dissemination of scientific work. The reports are copyrighted by the authors, and their existence in electronic format does not imply that the authors have relinquished any rights. You may copy a report for scholarly, non-commercial purposes, such as research or instruction, provided that you agree to respect the author's copyright. For information concerning the use of this document for other than research or instructional purposes, contact the authors. Other information concerning this technical report series can be obtained from the Computer Science and Engineering Department at the University of California at San Diego, techreports@cs.ucsd.edu.

