Big Data Architectures: Deep Dive III

Posted by Phillip Higgins

In my first and second posts, I discussed the fundamental principles behind data processing and showed how they enable or constrain the sorts of questions we can ask of information systems. In this post, I will show how the MapReduce processing framework shapes the people side of insight endeavours: the team skills it demands and the interactions required to achieve successful outcomes.

Overall, a widely accepted and promoted architecture for BI has a back-end cluster, perhaps in the cloud, supported by a front-end relational data warehouse. However, I think the relational paradigm and the MapReduce framework can be combined in many different architectures, depending on a range of factors. For example, for start-ups with highly specialised teams, MapReduce with higher-level software such as the “HBase” database may suffice in itself. In other cases it may make sense to process data in the MapReduce system and then move it to the data warehouse with an export tool such as “Sqoop”, while still other patterns will fit other circumstances. The layout of processing will depend on the context, in the wider sense.
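In practice, “higher level” here means something like the HBase client API rather than a hand-written MapReduce job for every read and write. Below is a minimal sketch of a put and a get against the HBase Java client; the table name “events”, the column family “d” and the row key are hypothetical, and the table is assumed to already exist on the cluster. (Moving data on to a warehouse with Sqoop would, by contrast, typically be a one-line command-line invocation.)

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
    public static void main(String[] args) throws IOException {
        // Picks up hbase-site.xml from the classpath for cluster addresses.
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("events"))) {

            // Write one cell: row key "row-001", column family "d", qualifier "payload".
            // (All names here are illustrative, not from the original post.)
            Put put = new Put(Bytes.toBytes("row-001"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"),
                          Bytes.toBytes("some semi-structured value"));
            table.put(put);

            // Read the same cell back by row key.
            Result result = table.get(new Get(Bytes.toBytes("row-001")));
            byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("payload"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```

The point of reaching for a client API like this is that routine reads and writes stay simple, while the MapReduce machinery underneath remains available for the heavy analytical work.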

Traditionally in BI, we have looked to generalist knowledge publishers. For example, Ralph Kimball’s book The Data Warehouse Toolkit is both a technical architecture manual for star schema design and a management consulting book with a strong emphasis on furthering business understanding of use cases in specific industries and segments. It represents the generalist paradigm in BI processing: cross-skilled teams delivering technical assets to business knowledge workers and decision makers.

MapReduce, with its hugely flexible model, will demand more specialist teams. The framework demands not only deep back-room skills, whether on premises (rack, switch and server configuration for cluster processing) or in the cloud (particularly sizing, and getting your data into the system, known as “ingress” or “ingestion”), but also very specific programming skills for answering the questions associated with semi-structured or unstructured data such as video, audio or text. These skills are quite varied, and particular problems such as correlation, data mining, image processing or text mining will demand more highly trained specialists than relational processing does. So we are moving from a team design of specialist-generalists to one of specialist-technicians. Hence the rise of the “data scientist”.
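To make the “hugely flexible model” concrete, the sketch below is the canonical word count written against the Hadoop Java MapReduce API: it counts word frequencies in raw text files, about the simplest possible piece of text processing. The class names and the input/output paths passed as arguments are illustrative only.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every token in a line of unstructured text.
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().toLowerCase().split("\\W+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: sum the counts the framework has grouped under each word.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // illustrative input path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // illustrative output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The notable design point is how little the framework fixes: only the map and reduce contracts are prescribed, while tokenisation, the grouping key and the aggregation are arbitrary code. That open-endedness is precisely what creates the demand for the specialist programming skills described above.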

Collaboration becomes decisive, and the ability to translate from a business understanding and perspective to a specialist-technical viewpoint is a game-changer in terms of knowledge-power. Teams will require much greater mutual understanding and very clear communication in order to produce information assets. So we are seeing a twofold transformation in team dynamics: a broadening of perspectives through collaboration and, at the same time, a deepening of technical expertise in the processing at hand. Together these allow us to answer much more difficult questions, and so maintain a competitive advantage or value proposition: questions that are, at a foundational level, made possible by a flexible framework for data processing.
