Data, data everywhere
There is an explosion in the amount of data available to almost everyone at every level. From logistics companies to astronomers, retail giants to genomics research, the amount of data being generated is growing at a far greater rate than we can deal with. IDC (the International Data Corporation – 10 points for imaginative names) published a whitepaper in 2007 estimating that the size of the digital universe was 281 billion gigabytes (281 000 000 000 gigabytes). The 2011 version of the report put the digital universe at 1.8 zettabytes, or 1 800 000 000 000 gigabytes. The digital universe represents the amount of data being created and replicated in that year: not the total data in storage, nor the running total of data ever created, simply the additional data generated in a twelve-month period.
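To put those two figures side by side, here is a quick back-of-the-envelope calculation of the growth rate they imply. This is a minimal Python sketch; the constant-growth assumption is mine, not IDC's.

```python
import math

# Back-of-the-envelope check on the IDC figures quoted above.
# Assumption (mine, not IDC's): growth was at a constant exponential rate
# between the two reports, and 281 billion gigabytes ~= 0.281 zettabytes.
size_2007_zb = 0.281   # digital universe in 2007, in zettabytes
size_2011_zb = 1.8     # digital universe in 2011, in zettabytes
years = 2011 - 2007

# Implied compound annual growth rate over the four-year gap
growth_rate = (size_2011_zb / size_2007_zb) ** (1 / years) - 1

# How long the yearly figure takes to double at that rate
doubling_time = math.log(2) / math.log(1 + growth_rate)

print(f"Implied annual growth: {growth_rate:.0%}")          # ~59% per year
print(f"Implied doubling time: {doubling_time:.1f} years")  # ~1.5 years
```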
This is a problem.
It means that the amount of data being created is growing at an exponential rate; on those figures, the digital universe is roughly doubling every eighteen months. On the one hand, this is good news for the IT industry – there is going to be increasing demand for database managers and people with big data experience, and, moreover, software developers dealing with these massive data sets are *really* going to need to understand where performance comes from, where bottlenecks arise, and how they can be avoided. SSDs might help reduce latency on database lookups, as should improved parallel architectures and distributed computing.
But there are more fundamental problems than this. We need to understand how to deal with this data. If we’re lucky, the methodologies being used will scale with the data, inasmuch as they will still operate in an acceptable time-frame (although even this isn’t guaranteed). However, “traditional” data analysis approaches become less effective and less relevant as you approach the data-set sizes being described, because meaningful statistical variation gets lost in the overwhelming noise of a hundred million datapoints. There is a need to move to new, better ways of storing and representing data: to translate the data into scalable, reflexive models rather than vast lookup tables, and to build structures which automatically re-arrange to reduce redundancy and dynamically optimize the underlying architecture, rather than relying on static structures which fail to organise themselves. But most importantly, there is a need to determine new ways to look at these humongous datasets. It’s a twofold problem, but the two halves are very much interlinked.
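To make the “structures which automatically re-arrange” idea slightly more concrete, here is a minimal Python sketch of a very old example of that flavour: a move-to-front list, which reorders itself around its access pattern. It is a toy illustration of self-adjusting storage, not a proposal for how the problem above should actually be solved.

```python
class MoveToFrontList:
    """Toy self-adjusting store: each successful lookup moves the item to
    the front, so frequently accessed items drift towards the head of the
    list and become progressively cheaper to find."""

    def __init__(self):
        self._items = []  # list of (key, value) pairs, hottest first

    def insert(self, key, value):
        self._items.insert(0, (key, value))

    def lookup(self, key):
        for i, (k, value) in enumerate(self._items):
            if k == key:
                # Re-arrange on access: promote the hit to the front.
                self._items.insert(0, self._items.pop(i))
                return value
        raise KeyError(key)


store = MoveToFrontList()
store.insert("hot-item", "accessed constantly")
store.insert("rarely-used", "accessed once in a blue moon")

# "hot-item" starts behind "rarely-used", but after the first lookup it
# sits at the front: the structure has optimized itself for this workload.
for _ in range(1000):
    store.lookup("hot-item")
```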
There have been attempts to tackle the performance issues associated with data storage through relational and non-relational databases, but these typically represent implementation decisions rather than underlying architectural design. What is needed more, I believe, is a combination of machine learning approaches that defines a new way of thinking about information storage and retrieval. It’s hard to think in anything other than traditional database terms, but I suspect that is what will be necessary to really deal with the issue.
Whatever happens, there needs to be a change in the way data is stored and utilized, because the approaches being used are simply not sustainable for the coming decade.