We live in the data age. An age where data is growing at such an enormous rate that it may surpass the available storage in the near future. According to The Economist, 'Despite the abundance of tools to capture, process and share all this information—sensors, computers, mobile phones and the like—it already exceeds the available storage space'. But then again, it is not as if digital data was never stored on a storage device before, and the underlying technology for packing more data into a device has also improved significantly, from storing 1.44 MB on a floppy disk to storing terabytes on a hard disk drive. Yet the volume of data generated every day has outpaced the rate at which we are increasing our storage space.
So why is it that we are suddenly worried about storing such high volumes of data, and what is the reason for the information explosion? There are many reasons. The most obvious one is technology. As the capabilities of digital devices soar and prices plummet, sensors and gadgets are digitizing lots of information that was previously unavailable. And many more people have access to far more powerful tools. For example, there are 4.6 billion mobile-phone subscriptions worldwide (though many people have more than one, so the world's 6.8 billion people are not quite as well supplied as these figures suggest), and 1-2 billion people use the internet. Moreover, there are now many more people who interact with information. Between 1990 and 2005 more than 1 billion people worldwide entered the middle class. As they get richer they become more literate, which fuels information growth. The amount of digital information increases tenfold every five years. A vast amount of that information is shared. By 2013 the amount of traffic flowing over the internet annually reached 667 exabytes, according to Cisco, a maker of communications gear.
The problem is simple: while the storage capacities of hard drives have increased massively over the years, access speeds (the rate at which data can be read from drives) have not kept up. One typical drive from 1990 could store 1,370 MB of data and had a transfer speed of 4.4 MB/s, so you could read all the data from a full drive in around five minutes. Over 20 years later, one-terabyte drives are the norm, but the transfer speed is around 100 MB/s, so it takes more than two and a half hours to read all the data off the disk.
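To see where those read times come from, here is a minimal back-of-the-envelope sketch in Python. The capacities and transfer speeds are simply the figures quoted above, and the helper function is just for illustration:

def full_read_time_seconds(capacity_mb, transfer_mb_per_s):
    # Time to read an entire drive sequentially, in seconds.
    return capacity_mb / transfer_mb_per_s

# 1990-era drive: 1,370 MB capacity at 4.4 MB/s
print(full_read_time_seconds(1370, 4.4) / 60)         # roughly 5.2 minutes
# Present-day 1 TB drive (1,000,000 MB) at about 100 MB/s
print(full_read_time_seconds(1_000_000, 100) / 3600)  # roughly 2.8 hours

In other words, capacity grew roughly 700-fold over those two decades while transfer speed grew only about 20-fold, which is why scanning a full drive went from minutes to hours.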
“We are at a different period because of so much information,” says James Cortada of IBM, who has written a couple of dozen books on the history of information in society. Joe Hellerstein, a computer scientist at the University of California, Berkeley, calls it “the industrial revolution of data”. The effect is being felt everywhere, from business to science, from government to the arts. Scientists and computer engineers have coined a new term for the phenomenon: “big data”.
According to Wikipedia, big data is a blanket term for any collection of data sets so large and complex that it becomes difficult to process them using on-hand data management tools or traditional data processing applications. More informally, it is a buzzword, or catch-phrase, used to describe a massive volume of both structured and unstructured data that is so large it is difficult to process using traditional database and software techniques. In most enterprise scenarios the data is too big, or it moves too fast, or it exceeds current processing capacity. Big data has the potential to help companies improve operations and make faster, more intelligent decisions.
Big data spans three dimensions: Volume, Velocity and Variety.
Volume: from terabytes to petabytes.
Velocity: from batch to real-time analysis.
Variety: from structured to semi-structured and unstructured.