Statistics is a fairly new branch of mathematics, tracing its roots to the mid-eighteenth century, when Thomas Bayes devised his famous theorem and revolutionized the way we view and perceive the world around us. It was a simpler time, when numbers were merely mathematical objects used to count; to people of that era, the idea of using machines running software such as Hadoop to compute and analyze clusters of big data would have been inconceivable.
The science of statistics, as it has developed over the years, has allowed us to better understand and interpret data: now, when someone tells us that the temperature outside is 30 degrees, we unconsciously know that we will feel hot when we go outdoors. While this example might seem trivial today, it has some merit. Thirty degrees is no longer merely a number used to measure the weather; when we are told it is 30 degrees, we can make a number of assumptions based on that data, such as that it will feel “hot” outside, that most people will be wearing half-sleeved or sleeveless t-shirts, that we will need air conditioning, and that our electricity bills will be higher.
While the compilation and interpretation of data began as early as 1790, with the first population census conducted in the United States, it was not until a century later that simple machines were first used to tabulate a census. The first programmable electronic digital computer, Colossus, was used to break the Lorenz cipher and changed our understanding and use of data forever. The one thing that gave Colossus an edge over its predecessors was the sheer volume of data it could process. Since the advent of computers, the amount of data we compute has grown ever larger, which, while increasing our understanding of the world and the precision of our predictions, has brought its own set of problems.
Continual advances in computer technology meant that machines could handle larger and larger calculations, rendering previous generations obsolete. However, the data to be processed always seemed to be bigger still; the ability of computers to process data was not scaling as well as the size of the data itself, and it soon became clear that a different approach to the problem was needed.
Hadoop, named after a toy elephant belonging to the son of its creator, Doug Cutting, is an open-source software framework for the storage and processing of large-scale data. It emerged in 2005 as an answer to tedious and time-consuming problems faced by computer scientists working with ever-growing datasets. With an architecture built around the Hadoop Distributed File System (HDFS), Hadoop YARN and Hadoop MapReduce, it works under the assumption that hardware failures are common and should therefore be handled in software by the framework. In layman's terms, Hadoop assumes that the data is too large for any single computer, or fixed set of computers, to process at once without failing. Working from this fundamental assumption, its designers created a system that lets users work around the problem, enabling them to process what is, in theory, an unbounded amount of data in parallel.