

Object storage for big unstructured data

By Tom Leyden.

‘Big data’ causes a lot of confusion. The term is used for anything related to storage these days, so people no longer know what exactly it means. Is it Hadoop? Is it analytics? It doesn’t need to be that complicated, though: there are two kinds of big data, big data for analytics and big unstructured data.

Big data for analytics is a paradigm that became popular in the previous decade, with much of the early innovation driven by research projects. New technology enabled researchers in many different domains to capture data in ways they had never been able to before. In agriculture, for example, ploughs were fitted with sensors that sent small pieces of information to a central system over satellite. Every couple of feet these sensors would measure what is in the ground (minerals, for example), how humid the soil is, and so on. Based on that data, large agriculture companies could then make better decisions about where to grow which crop.

The problem was that the traditional systems for storing such massive amounts of small data records (relational databases) were no longer adequate. Systems like MapReduce and Hadoop were created as an alternative, storing these massive volumes of small files concatenated into big files. Big data was born: big data for semi-structured data.

Today we are seeing a similar trend with unstructured data. Studies show that data storage requirements will increase thirty-fold over the next decade, and 80 percent of that data will be large files: office documents, movies, music, pictures. Just as relational databases were in the previous decade, traditional storage (file systems) is no longer the best way to store this data. File systems will not scale sufficiently and will effectively become obsolete as applications take over the role of the file system.

A nice example is what Google Picasa does: in the old days we would store pictures nicely organized in a file system (hopefully with some backups), with one folder per year, one per month, and one per holiday or party. Today, we just dump all the pictures in one folder and Picasa sorts them for us based on date, location, face recognition or other metadata. With an intelligent query, we can display the right pictures very fast, much faster than by browsing the file system. We don’t even have to worry about backups, as copies can be stored in the cloud automatically.

The new paradigm that will help us store these massive amounts of unstructured data is ‘object storage’. Object storage systems are uniformly scalable pools of storage that are accessible through a REST interface. Files (objects) are dumped into the pool and an identifier is kept to locate each object when it is needed. Applications designed to run on top of object storage use these identifiers through the REST interface. A good analogy is valet parking versus self-parking: when you self-park you have to remember the lot, the floor, the aisle and so on (the file system); with valet parking you get a receipt when you hand over your keys, and you later use that receipt to get your car back.
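To make that identifier-based access pattern concrete, here is a minimal sketch in Python of storing and retrieving an object over HTTP. The endpoint, bucket and key names are hypothetical placeholders rather than any particular vendor's API; the point is simply the PUT-then-GET-by-identifier pattern that REST-style object stores expose.

# Minimal sketch of identifier-based object access over REST.
# The endpoint, bucket and key below are hypothetical placeholders.
import requests

ENDPOINT = "https://objectstore.example.com"   # hypothetical object storage endpoint
BUCKET = "photos"                              # a storage pool ('bucket')
KEY = "2012/party/img_0001.jpg"                # the identifier: the 'valet receipt'

# Store the object: PUT the raw bytes under a chosen identifier.
with open("img_0001.jpg", "rb") as f:
    response = requests.put(f"{ENDPOINT}/{BUCKET}/{KEY}", data=f)
    response.raise_for_status()

# Retrieve it later: GET by the same identifier; no directory browsing needed.
response = requests.get(f"{ENDPOINT}/{BUCKET}/{KEY}")
response.raise_for_status()
image_bytes = response.content

The application never deals with folders or mount points: as long as it keeps the identifier, the storage pool can grow or reorganise itself underneath without the application noticing.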

So what is needed to build an object storage system? Basically just lots of disks, a REST API and a way to provide durability. Durability could be provided with traditional schemes like RAID, but the problem is that RAID requires a huge amount of overhead to deliver acceptable availability. The more data we store, the more painful it is to need 200 percent overhead, as some systems do. A better way to provide durability for object storage is erasure coding.

Erasure coding stores objects as equations that are spread over the entire storage pool: data objects are split into sub-blocks, from which equations are calculated. According to the availability policy, a number of additional equations is generated, and the equations are spread over as many disks as possible (how they are spread is also policy-defined). As a result, when a disk breaks, the system still has sufficient equations to restore the original data block, and it can recalculate the missing equations as a background task to bring the number of available equations back to a healthy level. Some erasure coding systems use low-power Atom processors to reduce power costs: because the entire system, all storage nodes, can recalculate missing equations in the background, it is no longer necessary to use the high-end nodes that RAID systems need to speed up restores and avoid performance losses.
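As a simplified illustration of the principle, the Python sketch below splits a data block into k sub-blocks and adds a single XOR ‘equation’ (parity block), which is enough to rebuild any one lost sub-block. Real erasure codes are more general (Reed-Solomon style), generating several independent equations, so that a policy of, say, 10 data blocks plus 6 equations survives 6 simultaneous disk failures at 60 percent overhead, compared with 200 percent overhead for keeping three full copies.

# Simplified illustration of erasure coding: k data sub-blocks plus one
# XOR parity 'equation'. Losing any single sub-block is recoverable.
# Real systems use Reed-Solomon style codes with several parity blocks.
from functools import reduce

def split_blocks(data: bytes, k: int) -> list:
    """Split data into k equally sized sub-blocks (zero-padded)."""
    size = -(-len(data) // k)                      # ceiling division
    padded = data.ljust(size * k, b"\x00")
    return [padded[i * size:(i + 1) * size] for i in range(k)]

def xor_blocks(blocks: list) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

k = 4
data = b"big unstructured data, stored as equations"
sub_blocks = split_blocks(data, k)
parity = xor_blocks(sub_blocks)                    # the single 'equation'

# Simulate a broken disk (losing sub-block 2), then rebuild it from the rest.
lost = 2
survivors = [b for i, b in enumerate(sub_blocks) if i != lost]
rebuilt = xor_blocks(survivors + [parity])
assert rebuilt == sub_blocks[lost]

With only one parity block this scheme tolerates a single failure; the benefit of real erasure codes is that the number of equations, and therefore the number of disk failures the pool can absorb, is simply a policy setting.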

Apart from providing a more efficient and more scalable way to store data, erasure-coding-based object storage can save up to 70 percent on the overall TCO, thanks to reduced raw storage needs and reduced power needs (less hardware plus low-power devices saves on power and cooling). Also, uniformly scalable storage systems with an automated healing mechanism drastically reduce management effort and cost.

So what are the use cases for object storage? As data needs grow, object storage will become the storage paradigm of choice in more and more environments, but already today we see the need in a number of situations:

Building live archives
Object storage enables companies to re-activate their data. Currently, most companies see data more as a burden than anything else: the data will never be used again but needs to be archived for a whole range of reasons. Yet this data actually has a lot of value. With live archives, employees have faster access to older data and can put those valuable resources to use. With traditional storage it would never be feasible to build disk-based archives for this purpose, as the overhead would make it too costly.

Online applications
Most of the data-intensive online (cloud) applications are built on public clouds such as Amazon S3, which are early implementations of object storage. The benefits for the application providers are plentiful: a simple programming interface, low cost and fast time to market. As their data sets grow, those companies might move to private object storage implementations to reduce costs even further.
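As an indication of how simple that programming interface is, the sketch below stores and retrieves an object on Amazon S3 using the AWS SDK for Python (boto3). The bucket name is a hypothetical placeholder and valid AWS credentials are assumed; it is illustrative only.

# Storing and fetching an object on Amazon S3 with boto3.
# Assumes AWS credentials are configured; the bucket name is hypothetical.
import boto3

s3 = boto3.client("s3")

# Upload: the object is addressed by bucket + key, nothing else.
with open("2012-01.pdf", "rb") as f:
    s3.put_object(Bucket="example-app-data", Key="reports/2012-01.pdf", Body=f)

# Download: retrieve the same object by its key.
obj = s3.get_object(Bucket="example-app-data", Key="reports/2012-01.pdf")
report_bytes = obj["Body"].read()

There are no volumes to provision or file systems to mount; the provider can start small and keep using the same interface as the data set grows.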

Media and entertainment
Traditionally, the M&E industry has been very much file-oriented, but we are seeing growing interest in object storage, both to optimise efficiency and reduce costs and because this industry is already hitting the limits of its file systems.

These are just a few examples of object storage implementations for big unstructured data. Object storage was not built to replace any of the current storage architectures. Much as NAS filers were designed in the 1990s because block storage (SAN was designed when databases were king) was not optimised for unstructured data, object storage will find its place next to those two architectures for big unstructured data.

Author: Tom Leyden is director of alliances and marketing at Amplidata.

Date: 26th January 2012 • Region: World • Type: Article • Topic: ICT continuity
