In my last post I showed how traditional database technology is founded upon the concept of a relation, and discussed how this limits the range of questions we can expect to answer from any given database. In this post, I will explain some fundamental concepts that underpin the Hadoop technology stack, a Big Data platform that is gaining wide acceptance.
In their 2004 paper MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat described a model for writing computer programmes that can be executed automatically on many computers at the same time. The programmer supplies two functions: a map function that turns each input record into intermediate key/value pairs, and a reduce function that merges all the values that share a key. The framework takes care of distributing the work. The MapReduce framework, when combined with a distributed file system that stores files across many computers (HDFS, in Hadoop's case), has become the de facto way of writing programmes that process massive amounts of data. To process more data using MapReduce, you can simply add more computers, an approach known as "scale out", and still get results in roughly the same time.
By describing a general programming framework as the basis of data processing, Dean and Ghemawat also radically shifted the boundaries of the questions we are able to ask of data stores. Compared to relational databases, MapReduce programmes are particularly good at analysing unstructured data such as videos or text. This is because programmes written using this framework are not limited by the constraints originally envisaged for relational databases. In their original article, Dean and Ghemawat described text pattern matching, URL frequency counting, web link graphing and word counts as examples, but much more sophisticated analyses are possible.
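To make the model a little more concrete, here is a minimal sketch of the word count example in Python. It is not Hadoop code and it runs on a single machine: the names `map_words`, `reduce_counts` and `run_mapreduce`, and the two sample documents, are my own illustrative assumptions. A map function emits (word, 1) pairs, the pairs are grouped by word, and a reduce function adds up each group. In a real cluster the framework would run many copies of the map and reduce functions on different computers and move the grouped data between them.

```python
from collections import defaultdict


def map_words(doc_id, text):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in text.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce step: combine all the values emitted for one word."""
    return word, sum(counts)


def run_mapreduce(documents, mapper, reducer):
    """Single-machine stand-in for the framework (illustrative only)."""
    # Map phase, followed by grouping of the emitted values by key.
    # A real framework would do this grouping across many computers.
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in mapper(doc_id, text):
            grouped[key].append(value)

    # Reduce phase: one call per distinct key.
    return dict(reducer(key, values) for key, values in grouped.items())


# Hypothetical input: file names mapped to their text content.
documents = {
    "a.txt": "the quick brown fox",
    "b.txt": "the lazy dog and the fox",
}

word_counts = run_mapreduce(documents, map_words, reduce_counts)
print(word_counts)
```

Because each call to the map function looks at only one document, and each call to the reduce function looks at only one word, the calls are independent of one another. That independence is what allows the work to be spread over more computers as the data grows.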
While not everything is possible using MapReduce, the arsenal available to technicians is much larger than the set of techniques for querying relational databases. Combined with the ability to process very large amounts of unstructured data, this means that a much greater range of business questions can be answered using this technology.
In the next post, I will describe architectures for deploying MapReduce and relational systems alongside one another, and the changes that are happening as a result within the teams that generate insight from information.

