Peadar Coyle published an interview with me about data science on his blog; here it is for your reference: Interview with a data scientist
Agile and Architecture
Oftentimes, I receive queries about agile development and architecture. In fact, all development has an architecture that is either explicit or implicit; you should favour explicit architecture so you know what you are getting. Architecture can be agile: as the project progresses, models are refined and the work products are built over time. In particular, the architect is deeply involved at the prototyping stage, as it is a communication stage. At the beginning of the project an initial architecture is envisioned, and this vision is communicated to stakeholders and developers. The architect continues to work with developers to realise the architecture, and as the project progresses the work products – the architectural artefacts – are updated. Architecture becomes a collaborative affair: realising a digital strategy at digital pace requires great collaboration and communication in order to build well-designed, highly usable products.
Digital and Data
What do you mean by “Digital”? Are you referring to digital connections to your customers, digital process optimisation in your business, or whole new digitally enabled business models? Often each of these approaches is seen as a new system implementation, and the role of data is underplayed. Implementing a digital platform is about enabling open data access and gaining new business insights, both of which mean that data architecture becomes a central concern. In fact, the flows of data are fundamental to digital enablement, so a digital strategy relies heavily on data strategy. Indeed, a data strategy that encompasses data access is a digital strategy.
By focussing upon data as the core asset to be leveraged in inventing new business models, deepening customer engagement and interactions, and optimising business operations, a range of capabilities is enabled, of which here are a few:
- Deeper customer value
- Stronger brand loyalty
- Optimal pricing
- Efficiency of operations
- Optimal supply chain operations
- New information based products
- Improved forecasting
- Optimised management
By focussing upon data as an asset, and architecting the storage, access and interchange of data, companies can succeed in optimising digital capability.
Big Data Architectures: Deep Dive III
In my first and second posts, I discussed fundamental principles behind data processing and showed how they enable or constrain the sorts of questions we can ask information systems. In this post, I will show how the framework of MapReduce processing determines the people-value side of insight endeavours and also the interactions required to achieve successful outcomes.
Overall, a widely accepted and promoted architecture for BI has a back-end cluster, perhaps in the cloud, fronted by a relational data warehouse. However, I think combining the relational paradigm and the MapReduce framework can support many different architectures, depending on a range of factors. For example, for start-ups with highly specialised teams, MapReduce with higher-level software such as the “HBase” database may suffice in itself. It may make sense for data to be processed in the MapReduce system and then moved to the data warehouse using export tools such as “Sqoop”, but other patterns would make sense, depending again on circumstances. The layout of processing will depend on the context, in the wider sense.
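As a toy sketch of that export pattern (the names and data here are invented for illustration; a real pipeline would use Sqoop against HDFS and a production warehouse rather than in-process Python and SQLite): aggregate in the cluster-side layer, then land the summarised results in a relational store for BI querying.

```python
import sqlite3
from collections import Counter

# Stand-in for the output of a cluster-side MapReduce job: raw page-view events.
raw_events = ["/home", "/pricing", "/home", "/docs", "/home"]
aggregated = Counter(raw_events)  # the "cluster-side" aggregation step

# Stand-in for the relational data warehouse: load the summarised results.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE page_hits (page TEXT PRIMARY KEY, hits INTEGER)")
warehouse.executemany("INSERT INTO page_hits VALUES (?, ?)", aggregated.items())

# BI-style query against the warehouse layer.
top = warehouse.execute(
    "SELECT page, hits FROM page_hits ORDER BY hits DESC LIMIT 1"
).fetchone()
print(top)  # ('/home', 3)
```

The point is the division of labour: heavy aggregation happens in the scale-out layer, while the warehouse holds compact, query-friendly summaries.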
Traditionally in BI, we have sought generalist knowledge publishers. For example, Ralph Kimball’s book The Data Warehouse Toolkit is both a technical architecture manual for star schema design and a management consulting book with a strong bias towards furthering business understanding of use cases in specific industries and segments. It represents the generalist paradigm in BI processing associated with cross-skilled teams delivering technical assets to business knowledge workers and decision makers.
MapReduce, with its hugely flexible model, will demand more specialist teams. The framework demands not only deep skills in back-room configuration (rack and switch configurations for cluster processing, server configurations) or in the cloud (particularly sizing and getting your data into the system, known as “ingress” or “ingestion”) but also very specific programming skills for answering the questions associated with semi-structured or unstructured data such as video, audio or text. These skills will be quite varied and, for particular problems such as correlation, data mining, image processing or text mining, will demand more highly trained specialists than relational processing does. So we are moving from a team design of specialist-generalists to specialist-technicians. Hence the rise of the “data scientist”.
Collaboration becomes decisive and the ability to translate from a business understanding and perspective to a specialist-technical viewpoint is a game-changer in terms of knowledge-power. Teams will require much greater mutual understanding and very clear communication in order to produce information assets. So we are seeing a twofold transformation in team dynamics: a broadening of perspectives associated with collaboration and, at the same time, a deepening of technical expertise associated with the processing in question – allowing us to answer much more difficult questions and so maintain competitive advantage or a value proposition; questions that are, at a foundational level, the result of a flexible framework for data processing.
Big Data Architectures: Deep Dive II
In my last post I showed how traditional database technology is founded upon the concept of a relation and discussed how this had implications in terms of limiting the number of questions we could expect to answer from any given database. In this post, I will explain some fundamental concepts that underpin the Hadoop technology stack – a Big Data platform that is gaining wide acceptance.
In their 2004 paper MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat described a model for writing computer programmes that allows them to be automatically executed on any number of computers at the same time. The MapReduce framework, when combined with the ability to store files across any number of computers (in Hadoop, the HDFS file system), has become the de facto way to write programmes that process massive amounts of data. To process more data using MapReduce, you can simply add more computers – known as “scale out” – and achieve results in pretty much the same time. By describing a general programming framework as the basis of data processing, Dean and Ghemawat also radically shifted the boundaries of the questions we are able to ask of data stores. Compared to relational databases, MapReduce programmes are particularly good at analysing unstructured data such as video or text, because programmes written in this framework are not limited by the constraints originally envisaged for relational databases. In their original article, Dean and Ghemawat described text pattern matching, URL frequency counting, web link graphing and word counts as examples, but much more sophisticated analyses are possible.
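To make the framework concrete, here is a minimal in-process sketch in Python of their word-count example (the function names and the single-machine driver are mine; a real deployment would run the map and reduce phases across a cluster, for example via Hadoop Streaming):

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split.
    for word in document.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Shuffle: group intermediate values by key, as the framework
    # does automatically between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, counts):
    # Reduce: sum the occurrence counts for each word.
    return word, sum(counts)

docs = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())
print(counts["the"])  # 3
```

Because the map and reduce functions see only one record or one key group at a time, the framework is free to run thousands of copies of them in parallel – which is exactly what makes “scale out” possible.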
While not everything is possible using MapReduce, the arsenal available to technicians is much greater than that available for querying relational databases, and the ability to process very large amounts of unstructured data means that a much greater range of business questions can be answered using this technology.
In the next post, I will describe architectures for deploying MapReduce and Relational systems alongside one another and the changes that are happening within teams that generate information insight as a result.
Big Data Architectures: Deep Dive I
This is the first of a series of posts in which I will investigate Big Data concepts at a very fundamental level. If you have been reading my earlier posts, which are pretty high level, bear with me for this post and the following few, which will be more exploratory and reasonably technical. I'll continue to post on more business-relevant topics in the near future. I think this and the following few posts will be worth reading, as I will take you on a journey starting in the past and projecting into the future, and I will do so in a way that explains, at a fundamental level, the paradigm shift we are experiencing in BI processing. Along the way I'll explain why I believe unstructured data is the primary driver behind Big Data technologies. Please bear with me, though, as these thoughts are quite fragmented and partial.
I will start with the fundamental concept of a relation, which is the concept we refer to when we talk about relational databases. In his paper A Relational Model of Data for Large Shared Data Banks, Edgar Codd proposed a method for modelling data based on the mathematical theory of relations. Since many databases define relationships between entities using database constraints, practitioners often believe that relational databases are called such because of these constraints. But this is not the case. The term “relational database” has its origins in the mathematical concept of a finitary relation.
A finitary relation is, loosely described, a set of tuples with a header. You can think of a tuple as a row of data; if you are unfamiliar with databases, think of a row in Excel. Codd used these terms in the mathematical sense, and for current purposes I would like to point out that each tuple must be, by definition, unique within the relation. This characteristic of uniqueness has important implications. One implication is that we can treat the relation as a logical model and employ predicate logic against tuple attributes.
A further implication, and one I would like to emphasise here, is that any given relation is a finite entity. Although mathematics must deal with infinity, finite relations are just that: finite. A relation can be defined as a subset of the Cartesian product of its attribute domains. To ensure tuple uniqueness, Codd proposed his famous rules for normal forms and described what we know today as database normalisation.
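These ideas can be sketched in a few lines of Python (the attribute names and data are invented for illustration): modelling a relation as a set of tuples gives uniqueness for free, and a predicate over tuple attributes selects a subset of tuples, which is itself a relation.

```python
# Header: the attribute names, in order, for every tuple in the relation.
header = ("emp_id", "name", "dept")

# A relation modelled as a set of tuples: duplicates collapse automatically,
# which is exactly the uniqueness property Codd's model demands.
relation = {
    (1, "Ada", "Engineering"),
    (2, "Grace", "Engineering"),
    (3, "Edgar", "Research"),
    (1, "Ada", "Engineering"),  # duplicate tuple, absorbed by the set
}

def select(rel, predicate):
    # Restriction: the subset of tuples satisfying a predicate
    # over the attributes is itself a relation.
    return {t for t in rel if predicate(dict(zip(header, t)))}

engineers = select(relation, lambda row: row["dept"] == "Engineering")
print(len(relation))   # 3: tuple uniqueness holds by construction
print(len(engineers))  # 2
```

Note how the predicate works entirely within the finite set of tuples: the relation fixes, in advance, the universe of answers any such query can return.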
As an aside, I would like to point out that almost all relational database engines do not implement Codd’s vision in the strict sense. Because of disk seek latency, they do not enforce tuple uniqueness or the rules for normalisation as Codd proposed them. In addition, many in-memory systems have been built to be backward compatible, and so although they substitute pointer indirection and memory retrieval operations for disk seeks, they too do not adhere to Codd’s strict model.
As we take advantage of in-memory technologies, we can move closer to Codd’s vision and the advantages strictly applied relational theory has for answering our questions. We should, however, realise that these questions, too, will be quite finite in scope: any relational system logically presupposes the questions that can be asked of it, and so presupposes all possible contexts for those questions.
In the next post I will look at the generic Big Data model as implemented in Hadoop, and compare this with the relational model.
Bridging the Insight Gap: Leveraging Machine Learning in Your Organisation
Today, many institutions have core machine learning use cases well met by advanced statistical packages in areas such as risk and fraud detection, recommendation and cross-sell, segmentation, optimisation, and computational advertising. Many other organisations, though, have little capability in this area and feel a need to develop predictive analytics and machine learning for competitive advantage. The double-edged sword of increasing volumes and velocities of generated data and what is termed the insight gap – the difference between your organisation’s current ability and the maximum possible return from its information assets – makes existing toolsets and skillsets inadequate, as requirements for information move from traditional reporting and what-if scenario analysis to predictive, machine-learning-based paradigms.
Fortunately, a number of new tools have emerged that can really impact the adoption of predictive methods – including machine-learning-based methods – that you should know about. Two open source projects stand out in this regard: the R environment for statistical computing and the machine-learning library for Hadoop known as Mahout. Both have industrial-strength applications and are widely distributed.
R was originally developed at the University of Auckland and offers simple download and installation for PC, Mac and Linux. It offers both processing and plotting and visualisation capabilities via a command-line-driven interface with graphical output. R offers a powerful, functional language for statistical computation with a huge breadth of application. Since R is so widely distributed and is well supported by both the open source and academic communities, it is a safe bet if you are after a desktop tool that can connect to a wide variety of sources.
There are a number of tools that make R easier to use: RStudio, R Commander and Rattle are some well-known ones. Microsoft has also enabled R from its Visual Studio IDE.
Mahout is a Java library from the Apache Software Foundation, also available for download, that is designed to run against Hadoop clusters and so utilises the MapReduce paradigm in its implementation. Mahout offers machine learning scalable to “reasonably large datasets” in major areas such as classification, recommenders and clustering. Mahout is currently at version 0.8 but has a number of well-known applications and is well positioned as the machine learning library for distributed computing and Big Data.
(First published July 24, 2013)
Innovation, Architecture and Technical Debt
What do you think about IT architecture? To stakeholders, architecture may seem like a costly exercise that yields blueprints and little else. And yet architecture is essential if you wish to maximise the value of your IT investments; I’ll explain a little about why this is the case in this post. In fact, I can understand the queries stakeholders make about IT architecture: they want results fast, wish to maximise ROI, and their concerns lie in the tangible IT assets that provide value. Questions as to the value of architecture arise from its invisible nature. The converse of architecture is known as technical debt, which is also unseen. To make this distinction clear, we can look at their visible counterparts: features and bugs. Everyone is familiar with programme features and bugs: features add value and bugs detract from it. So it is with architecture and technical debt. Architecture is enabling, it adds value, whereas technical debt is crippling, it detracts from value – only these aspects are not immediately visible, often hidden in code, configurations, system interdependencies, misaligned designs, or designs that do not promote business outcomes.

Having established the value of architecture to IT endeavours, its relevance to innovation becomes the next question. After all, doesn’t additional lead time hamper overall timeframes, and doesn’t formal pre-planning preclude creativity? In fact, architecture can either aid or hamper innovation efforts. Since it defines the possibilities and constraints of endeavours, establishing innovation as an agenda item for architects becomes key if innovation is a requirement. For formal Business Intelligence, innovation is closely linked to insight, and potentially value; so as not to reinvent the wheel – or incur technical debt – a flexible architecture is required.
