By Sid Richardson, PMP, CSM
I have been in the data warehousing practice since 1994, when I implemented a successful Distributed Data Warehouse for a flagship banking product, followed by co-developing Oracle’s Data Warehouse Methodology. In August 1997, I was invited to speak at the Data Warehouse Institute Conference in Boston.
Over the years, I’ve researched and implemented what I would consider some small-scale Big Data systems. Big Data interests me, and I wanted to share what I’ve learned about Big Data and Hadoop as a high-level overview for the layperson or busy executive.
What is Big Data?
Big Data describes an IT approach to processing the enormous amounts of information now available from social media, email, log files, text, camera/video feeds, sensors, website clickstreams, Radio Frequency Identification (RFID) tags, audio, and other sources, in combination with existing computer files and database data.
In the 1990s, three major trends converged to create what we now call Big Data: “Big” Transaction Data, “Big” Interaction Data, and “Big” Data Processing.
In 2001, Doug Laney, a former Vice President and Distinguished Analyst with the Gartner Chief Data Officer (CDO) research and advisory team, defined Big Data by the “three Vs”:
- Velocity – Speed of incoming data feeds.
- Variety – Unstructured data, social media, documents, images.
- Volume – Large quantities of data.
IBM decided to add two more Vs:
- Veracity – Accuracy of the data.
- Value – The business value that can be derived from the data.
Why do we need Big Data?
In a nutshell: We need Big Data because there is a lot of data to process, for example:
- Google needs to index the entire web—daily.
- Yahoo! needs to perform similar data intensive processing.
- Facebook is home to roughly 845 million users and more than 40 billion photos.
- According to The Economist, in 2010, Walmart handled more than 1 million customer transactions every hour, feeding databases estimated at more than 2.5 petabytes—the equivalent of 167 times the books in America’s Library of Congress.
The Economist also noted that the abundance of data being captured, processed, and shared already exceeds the available storage space (and the number of eyes on the planet to review and analyze it all!).
According to Forbes’s 2018 article, “How Much Data Do We Create Every Day? The Mind-Blowing Stats Everyone Should Read,” 2.5 quintillion bytes of data are created each day, and 90 percent of the world’s data was generated in the last two years alone.
Clearly, the creation of data is expanding at an astonishing pace, from the amount of data being produced to the way it is restructured, analyzed, and used. This trend presents enormous challenges, but it also presents incredible opportunities.
You’re probably thinking: alright, I get the Big Data thing, but why couldn’t data warehouses perform this role? Well, data warehouses are large, complex, and expensive projects that typically run 12 to 18 months and have high failure rates across all industries; Gartner once estimated that as many as 50 percent of data warehouse projects would achieve only limited acceptance or fail entirely.
A new approach to handle Big Data was born: Hadoop.
What is Hadoop?
In a nutshell, Hadoop is a Java-based framework governed by the Apache Software Foundation (ASF) that initially addressed the ‘Volume’ and ‘Variety’ aspects of Big Data and provided a distributed, fault-tolerant, batch data processing environment (one record at a time, but designed to scale to petabyte-sized files).
Hadoop was created to substantially reduce the cost of storing massive volumes of data for analysis. It does so by providing a distributed parallel processing environment built from many inexpensive commodity servers and their storage networked together, rather than from dedicated high-end hardware and storage solutions.
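To make the batch, record-at-a-time processing model concrete, here is the classic MapReduce word-count example written against Hadoop’s Java API. It is a minimal sketch rather than a production job; the class name and the input/output paths passed on the command line are illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: processes one input record (line) at a time and emits (word, 1).
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: receives all counts for a given word and sums them.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. an HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The mappers run in parallel wherever the input blocks happen to be stored, and the reducers aggregate the partial results; that division of labor is what lets the same small program scale from megabytes to petabytes of input.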
Why Hadoop?
- Eighty percent of the world’s data is unstructured, meaning it cannot fit neatly into a database. Video data is an example of this—most businesses don’t have the capability or resources to analyze this data.
- Before Hadoop, data storage was expensive. Hadoop, however, lets you store as much data as you want, in whatever form you need, simply by adding more servers to a Hadoop cluster (see the short HDFS sketch after this list).
- Hadoop’s approach makes data storage far cheaper than before; the ability to affordably keep massive volumes of data available for analysis is a large part of what made Hadoop so popular.
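As a rough illustration of the “just keep adding servers” point above, the sketch below uses Hadoop’s Java FileSystem API to write a file into HDFS, copy a local file in, and list the directory. It assumes a reachable cluster; the NameNode address, directory paths, and file names are hypothetical.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsStoreExample {
  public static void main(String[] args) throws Exception {
    // Hypothetical NameNode address; substitute your cluster's actual URI.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode.example.com:8020"), conf);

    // Write a small file into HDFS; the data is split into blocks and
    // replicated across whatever DataNodes the cluster currently has.
    Path target = new Path("/data/landing/clickstream-sample.txt");
    try (FSDataOutputStream out = fs.create(target, true)) {
      out.writeUTF("2024-01-01T00:00:00Z,user123,/products/42\n");
    }

    // Copy an existing local file into the same HDFS directory.
    fs.copyFromLocalFile(new Path("file:///tmp/clickstream.log"),
                         new Path("/data/landing/"));

    // List what is stored so far.
    for (FileStatus status : fs.listStatus(new Path("/data/landing/"))) {
      System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
    }

    fs.close();
  }
}
```

Because HDFS handles the block placement and replication itself, expanding capacity really is as simple as adding machines to the cluster; the application code above does not change.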
The Challenges with Hadoop
There is limited understanding of Hadoop across the IT industry. Hadoop has operational limitations and performance challenges; you need several add-on components from its ecosystem to make it work and to make it reliable. And Hadoop is becoming more fragmented, pulled in different directions by commercial players trying to leverage their own solutions.
In summary…
The Hadoop framework addresses a number of the challenges that previously faced the processing of Big Data for analysis. The explosion in the deployment of data-capture devices across all industries worldwide necessitated a more cost-effective way to store and access the massive volumes of data accumulating by the second!
I hope this blog post has provided you with a better understanding of some key Big Data and Hadoop concepts and technologies. Have you worked with Big Data and/or Hadoop? Let us know your thoughts and experiences in the comments!
P.S. If you have gotten this far and are curious where the name Hadoop comes from, here you go! The name ‘Hadoop’ was coined by one of the sons of Doug Cutting, a software designer, open-source advocate, and creator of open-source search technology. Mr. Cutting’s son had given the name ‘Hadoop’ to his toy elephant, and Mr. Cutting used it for his open-source project because it was easy to pronounce.
About the Author: Mr. Richardson’s passion is Data Warehousing, Business Intelligence, Master Data Management, and Data Architectures. He has helped Fortune 500 companies in the US, Europe, Canada, and Australia lead large-scale corporate system and data initiatives and teams to success. His 30 years of experience in Information Technology span data warehousing, business intelligence, information management, data migrations, converged infrastructures, and, most recently, Big Data. His industry experience includes finance and banking, government, utilities, insurance, retail, manufacturing, telecommunications, healthcare, large-scale engineering, and transportation.