An Experimental Comparison of Complex Objects Implementations in Big Data Systems
Author
Sikdar, Sourav
Date
2017-06-07
Advisor
Jermaine, Christopher
Degree
Master of Science
Abstract
Many data management and analytics systems support complex objects. Dataflow
platforms such as Spark and Flink allow programmers to manipulate sets consisting
of objects from a host programming language, often Java. Document databases such
as MongoDB make use of hierarchical interchange formats--most popularly JSON--which
embody a data model where individual records can themselves contain sets of records.
Systems such as Dremel and AsterixDB allow complex nesting of data structures. The
desire to support such complex objects forces a system designer to ask: how should
complex objects be implemented in a modern data management system? In this
thesis, over a suite of representative data management tasks, I experimentally evaluate
the performance implications of a wide variety of complex object implementations.
The choice of object implementation can have a profound effect on performance. For
example, the same external sort to perform a duplicate removal can take anywhere
from half an hour to fourteen and a half hours depending upon the complex object
implementation. A corollary is that a bad object implementation can doom system
performance. In addition, I reaffirm the value of the classical database way of
storing complex objects, where there is no distinction between the in-memory and
over-the-wire data representation, within a modern big data system.
Keywords
Complex Objects Implementations; Experimental Evaluation; Big Data Systems