This is the first in a series of posts in which I will investigate Big Data concepts at a very fundamental level. If you have been reading my earlier posts, which were pretty high level, bear with me for this post and the following few, which will be more exploratory and reasonably technical. I’ll return to more business-relevant topics in the near future. I think these posts will be worth reading: I will take you on a journey that starts in the past and projects into the future, and that explains, at a fundamental level, the paradigm shift we are experiencing and will continue to experience in BI processing. Along the way I’ll explain why I believe unstructured data is the primary reason for Big Data technologies. Please bear with me, though, as these thoughts are still quite fragmented and partial.
I will start with the fundamental concept of a relation, which is the concept we refer to when we talk about relational databases. In his 1970 paper “A Relational Model of Data for Large Shared Data Banks”, Edgar Codd proposed a method for modelling data based on relation theory. Since many databases express relationships between entities through database constraints, practitioners often believe that relational databases are so called because of these constraints. This is not the case. The term “relational database” has its origins in the mathematical concept of a finitary relation.
A finitary relation is, loosely described, a set of tuples with a header. You can think of a tuple as a row of data; if you are unfamiliar with databases, think of a row in Excel. Codd used these terms in the mathematical sense, and for current purposes I would like to point out that each tuple must, by definition, be unique within the relation. This uniqueness has important implications. One is that we can treat a relation as a logical model and apply predicate logic to tuple attributes.
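To make this concrete, here is a minimal Python sketch of a relation as a header plus a set of tuples. The names `Relation`, `make_relation` and `select`, and the sample data, are my own illustrations and not anything from Codd’s paper. Because the body is a set, a duplicate row simply collapses into one tuple, and a predicate over named attributes picks out a sub-relation.

```python
from typing import Callable, Iterable, NamedTuple


class Relation(NamedTuple):
    header: tuple      # attribute names, in order
    body: frozenset    # set of tuples, so duplicates cannot exist


def make_relation(header: Iterable[str], rows: Iterable[tuple]) -> Relation:
    """Build a relation, checking that every row matches the header's arity."""
    header = tuple(header)
    rows = [tuple(r) for r in rows]
    for row in rows:
        if len(row) != len(header):
            raise ValueError(f"row {row!r} does not match header {header!r}")
    return Relation(header, frozenset(rows))


def select(rel: Relation, predicate: Callable[[dict], bool]) -> Relation:
    """Keep only the tuples for which the predicate over attributes holds."""
    kept = frozenset(
        row for row in rel.body if predicate(dict(zip(rel.header, row)))
    )
    return Relation(rel.header, kept)


# Hypothetical sample data; note the repeated ("Ada", "London") row.
people = make_relation(
    ("name", "city"),
    [("Ada", "London"), ("Ada", "London"), ("Kurt", "Vienna")],
)

londoners = select(people, lambda t: t["city"] == "London")

print(len(people.body))        # the duplicate row was absorbed by the set
print(sorted(londoners.body))
```

The choice of `frozenset` is what carries the uniqueness guarantee here; a list of rows, which is closer to what most database tables behave like, would happily hold the same row twice.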
A further implication, and one I would like to emphasise here, is that any given relation is a finite entity. Although mathematics must deal with infinity, finite relations are just that: finite. A relation can be defined simply as a subset of the cross product of the domains of its attributes. To keep relations free of redundancy while preserving tuple uniqueness, Codd proposed his famous rules for normal forms and described what we know today as database normalisation.
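The finiteness point can be sketched in a few lines of Python. This assumes the usual textbook formulation, in which a relation is a subset of the cross product of its attribute domains; the two tiny domains below are made up for illustration. With finite domains, both the set of possible tuples and the set of possible relations over them are finite and can be counted.

```python
from itertools import product

# Hypothetical, deliberately tiny attribute domains.
names = {"Ada", "Kurt"}
cities = {"London", "Vienna"}

# Every tuple that could ever appear in a relation over (name, city).
universe = set(product(names, cities))

# One particular relation: a subset of that universe.
relation = {("Ada", "London"), ("Kurt", "Vienna")}

# A relation may only contain tuples drawn from the cross product.
print(relation <= universe)

# Each possible tuple is either in a relation or not, so the number of
# distinct relations over these domains is 2 ** |universe|.
possible_relations = 2 ** len(universe)
print(len(universe), possible_relations)
```

Real domains are of course far larger than two values each, but the count stays finite as long as the domains do.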
As an aside, I would like to point out that almost no relational database engine implements Codd’s vision in the strict sense. Because of disk seek latency, engines enforce neither tuple uniqueness nor the rules for normalisation as Codd proposed them. In addition, many in-memory systems have been built to be backward compatible, so although they substitute pointer indirection and memory retrieval operations for disk seeks, they do not hold to Codd’s strict model either.
As we take advantage of in-memory technologies, we can move closer to Codd’s vision and to the advantages that strictly applied relation theory offers for answering our questions. We should realise, however, that these questions will themselves be quite finite in scope: any relational system logically presupposes the questions that can be asked of it, and so presupposes all possible contexts for those questions.
In the next post I will look at the generic Big Data model as implemented in Hadoop, and compare it with the relational model.

