It seems these days you cannot talk about Big Data without visions of a yellow elephant somehow entering the picture.
A common misconception is that Big Data and Hadoop are synonymous; the terms have unfortunately become somewhat clichéd and abused.
In the last 15 years there has been an explosion in the amount of data, driven both by the World Wide Web and by growing requirements from the industrial sector. Search engines paved the way for large-scale automation: projects at Google sped up web search by distributing data and computation across many machines so that multiple tasks could run simultaneously. Hadoop grew out of that work, and hence the yellow elephant (it was named after a yellow toy elephant that belonged to the son of one of its creators).
What is Hadoop?
Hadoop is essentially a way of storing huge amounts of data across distributed clusters of servers. It is not a relational database; at its core it is a distributed file system, and additional tools are needed to actually process and analyze the data it stores. Hadoop's usefulness is often misunderstood: it does not replace an Enterprise Data Warehouse but rather assists in consolidating the data. It was originally built for large batch processing, not interactive queries, and it was never intended for delivering reports to end-users or for real-time stream processing.
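Hadoop's original batch model is MapReduce. As a rough illustration only, here is the idea in plain Python (standing in for the framework) with the classic word-count job: a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. The function names and sample documents are our own, not Hadoop APIs.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Group intermediate pairs by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Aggregate each group: here, sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

documents = ["big data is not hadoop", "hadoop stores big data"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
# counts["hadoop"] == 2, counts["stores"] == 1
```

In a real cluster the map and reduce functions run in parallel on many machines close to where the data blocks live, which is what makes the model suitable for batch jobs over huge datasets rather than interactive queries.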
What is Big Data?
Big Data refers to a collection of large datasets that cannot be processed using traditional computing techniques. Big Data is not only data; it has become a broad subject that involves various tools, techniques and frameworks, only one of which is Hadoop.
Taking advantage of this ability to store large amounts of data within the "day-to-day" functioning of a business is what Big Data is about.
There are four specific attributes that define Big Data: volume, variety, velocity, and veracity.
Volume: the sheer volume of data available out there is what gives "big" data its name.
Variety: more and more information is digitized every day, making the variety of information available to us one of the most interesting developments in technology.
Velocity: the rate at which incoming data arrives and needs to be processed is an important aspect of dealing with Big Data.
Veracity: can we trust the data? We should always be aware that there are inherent discrepancies in all the data we collect.
Our company has been dealing with Big Data for years, long before Hadoop was even available. We had to find innovative ways to use the storage space available to us, all the while treating every bit of data as extremely valuable. Over the years we developed advanced algorithms for compressing data objects into compact binary representations, treasuring every bit of information.
Big Data overturned the idea that data should be carefully selected, with only vital information stored for analysis. The new mantra became: "Just keep it all, you never know what you might need for future analysis."