Object storage for big unstructured data

By Tom Leyden

'Big data' causes a lot of confusion. The term is used for almost anything related to storage these days, so people no longer know what it actually means. Is it Hadoop? Is it analytics? It doesn't need to be that complicated: there are two kinds of big data, big data for analytics and big unstructured data.

Big data for analytics is a paradigm that became popular in the previous decade, with much of the early innovation driven by research projects. New technology enabled researchers in many different domains to capture data in ways they had never been able to before. In agriculture, for example, ploughs were fitted with sensors that sent small pieces of information to a central system over a satellite link. Every couple of feet these sensors would measure what is in the ground (minerals, for example), how humid the soil is, and so on. Based on that, large agriculture companies could make better decisions about where to grow which crop. The problem was that the traditional systems for storing such massive amounts of small data, relational databases, were no longer adequate. Systems like MapReduce and Hadoop were created as an alternative, storing these massive volumes of small files as concatenated big files. Big data was born: big data for semi-structured data.

Today we are seeing a similar trend with unstructured data. Studies show that data storage requirements will increase 30-fold over the next decade, and 80 percent of that data will be large files: office documents, movies, music, pictures. Just as relational databases struggled in the previous decade, traditional storage (file systems) is not the best way to store this data. File systems will not scale sufficiently and will effectively become obsolete as applications take over the role of the file system. A nice example is what Google Picasa does: in the old days we would store pictures neatly organised in a file system (hopefully with some backups), one folder per year, one per month, one per holiday or party. Today we just dump all the pictures in one folder and Picasa sorts them for us based on date, location, face recognition or other metadata. With an intelligent query we can display the right pictures very fast, much faster than by browsing the file system. We don't even have to worry about backups, as copies can be stored in the cloud automatically.

The new paradigm that will help us store these massive amounts of unstructured data is 'object storage'. Object storage systems are uniformly scalable pools of storage that are accessible through a REST interface. Files (objects) are dumped into the pool, and an identifier is kept to locate each object when it is needed. Applications that are designed to run on top of object storage use these identifiers through the REST interface. A good analogy is valet parking versus self-parking: when you self-park you have to remember the lot, the floor, the aisle and so on (the file system); with valet parking you get a receipt when you hand over your keys, and you later use that receipt to get your car back.
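To make that concrete, here is a minimal Python sketch of how an application might store and retrieve an object through such a REST interface. The endpoint, pool name and identifier are hypothetical placeholders rather than any particular vendor's API; real object stores (Amazon S3, OpenStack Swift and others) add their own authentication and headers, but the pattern is the same: PUT the object, keep the identifier, GET it back later.

import requests

# Hypothetical object store endpoint and storage pool name.
ENDPOINT = "https://objectstore.example.com"
POOL = "photos"

def put_object(object_id: str, data: bytes) -> None:
    """Store an object under an identifier; the store decides where it physically lives."""
    resp = requests.put(f"{ENDPOINT}/{POOL}/{object_id}", data=data)
    resp.raise_for_status()

def get_object(object_id: str) -> bytes:
    """Retrieve the object using only its identifier (the 'valet receipt')."""
    resp = requests.get(f"{ENDPOINT}/{POOL}/{object_id}")
    resp.raise_for_status()
    return resp.content

# The application keeps only the identifier, not a path, folder or disk location.
put_object("vacation-2011/IMG_0001.jpg", b"<jpeg bytes>")
photo = get_object("vacation-2011/IMG_0001.jpg")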
So what is needed to build an object storage system? Basically just lots of disks, a REST API and a way to provide durability. Durability could be provided with traditional techniques like RAID, but RAID requires a huge amount of overhead to deliver acceptable availability, and the more data we store, the more painful it becomes to need the 200 percent overhead that some systems do.

The better way to provide durability for object storage is erasure coding. Erasure coding stores objects as equations that are spread over the entire storage pool: data objects are split into sub-blocks, from which equations are calculated. According to the availability policy, a surplus of equations is calculated, and the equations are spread over as many disks as possible; how widely they are spread is also policy-defined. As a result, when a disk breaks, the system always has enough equations left to restore the original data block, and it can recalculate the missing equations as a background task to bring the number of available equations back to a healthy level.
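The principle can be illustrated with a deliberately simplified sketch: a single XOR parity 'equation' over a handful of sub-blocks, rather than the more sophisticated erasure codes production systems use. The block and disk counts below are illustrative assumptions; real systems generate many independent equations so that the availability policy can tolerate several simultaneous disk failures.

from functools import reduce

K = 4  # number of data sub-blocks per object (policy-defined in a real system)

def split(data: bytes, k: int = K) -> list[bytes]:
    """Split an object into k equally sized sub-blocks (zero-padded at the end)."""
    size = -(-len(data) // k)  # ceiling division
    data = data.ljust(size * k, b"\0")
    return [data[i * size:(i + 1) * size] for i in range(k)]

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data: bytes) -> list[bytes]:
    """Return k data blocks plus one parity 'equation', each destined for a different disk."""
    blocks = split(data)
    return blocks + [reduce(xor, blocks)]

def rebuild(blocks: list[bytes | None]) -> list[bytes]:
    """Recompute a single missing block from the survivors (the background repair task)."""
    missing = blocks.index(None)
    blocks[missing] = reduce(xor, [b for b in blocks if b is not None])
    return blocks

disks = encode(b"big unstructured data object")
disks[2] = None                 # simulate a broken disk
disks = rebuild(disks)          # self-healing restores the lost equation
assert b"".join(disks[:K]).rstrip(b"\0") == b"big unstructured data object"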
Because the entire system (all storage nodes) can recalculate missing equations as a background task, it is no longer necessary to use the high-end nodes that RAID systems need to speed up rebuilds and avoid performance loss; some erasure coding systems use low-power Atom processors to reduce power costs. Apart from providing a more efficient and more scalable way to store data, erasure coding based object storage can save up to 70 percent on overall TCO, thanks to reduced raw storage needs and reduced power needs (less hardware plus low-power devices saves on power and cooling). Uniformly scalable storage systems with an automated healing mechanism also drastically reduce the management effort and cost.

So what are the use cases for object storage? As data needs grow, object storage will become the storage paradigm of choice in more and more environments, but already today we see the need in a number of situations:

• Building live archives
• Online applications
• Media and entertainment

These are just a few examples of object storage implementations for big unstructured data. Object storage was not built to replace any of the current storage architectures. Much as NAS filers were designed in the 1990s because block storage (SAN was designed when databases were king) was not optimised for unstructured data, object storage will find its place next to those two for big unstructured data.

Author: Tom Leyden is director of alliances and marketing at Amplidata.

Date: 26th January 2012