Thursday 12 December 2013

A History of our Understanding of Data Computation



Statistics is a fairly young branch of mathematics, tracing its roots to the mid-eighteenth century, when Thomas Bayes devised his famous theorem and revolutionized the way we view and perceive the world around us. It was a simpler time, when numbers were merely mathematical objects used for counting; to people of that era, the idea of using machines running software such as Hadoop for the computation and analysis of big-data clusters would have been inconceivable.

The development of the science of statistics over the years has allowed us to better understand and interpret data; now, when someone tells us the temperature outside is 30 degrees, we unconsciously know that we will feel hot when we go outdoors. While this example might seem trivial today, it does have some merit. Thirty degrees is no longer merely a number used to describe the weather; when we are told it is 30 degrees, we can make a number of inferences from that data: that it will feel "hot" outside, that most people will be wearing half-sleeved or sleeveless shirts, that we will need air conditioning, that our electricity bills will be higher, and so on.

While the compilation and interpretation of data began as early as 1790, with the first population census conducted in the United States, it was not until a century later, with the 1890 census, that simple tabulating machines were first used to process census data. Later, the first programmable digital electronic computer, the Colossus, was used to break the Lorenz cipher and changed our understanding and use of data forever. What gave the Colossus its edge over its predecessors was the sheer volume of data it could process. Since the advent of computers, the amount of data we compute has grown ever larger, which, while increasing our understanding of the world and the precision of our predictions, has brought its own set of problems.

Continual advances in computer technology meant that computers could handle larger and larger calculations, rendering previous generations obsolete. However, the data to be processed always seemed to grow faster still; the ability of computers to process data was not scaling as well as the size of our data, and it soon became clear that a different approach to the problem was needed.

Hadoop, named after a toy elephant belonging to the son of its creator, Doug Cutting, is an open-source software framework for the storage and processing of large-scale data that emerged in 2005 as the answer to many of the tedious and time-consuming problems faced by computer scientists today. With its ingenious architecture, consisting of the Hadoop Distributed File System (HDFS), Hadoop YARN and Hadoop MapReduce, Hadoop works under the assumption that hardware failures are common and should therefore be handled in software by the framework itself. In layman's terms, this means Hadoop assumes that the data is too large for one computer, or even a set of computers, to process at once without the system crashing. Working from this fundamental assumption, Hadoop's designers created a system that lets its users work around the problem, empowering them to process, in principle, arbitrarily large amounts of data in parallel. An online Hadoop tutorial, such as Intellipaat's Hadoop online training, is the easiest way to understand all of this deeply and clearly.
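A minimal sketch may help make the MapReduce part of that architecture concrete. The class below follows the classic word-count pattern written against the standard Hadoop Java API: the mapper runs independently on each block of input stored in HDFS and emits (word, 1) pairs, and the framework shuffles those pairs to reducers that sum the counts for each word, so no single machine ever has to hold the whole data set at once. The Mapper, Reducer, Job, Text and IntWritable types come from the org.apache.hadoop packages; the class name WordCount and the tokenizing logic are only illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: runs on each block of the input and emits (word, 1) for every word it sees.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: receives all counts for one word (after the shuffle) and sums them.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, a job like this would typically be submitted with something along the lines of "hadoop jar wordcount.jar WordCount <input dir> <output dir>"; Hadoop then schedules the map and reduce tasks across the cluster and, in keeping with its assumption that hardware fails, simply re-runs any task whose machine goes down.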
