According to a recent report by IBM, we’re now generating more than 2.5 quintillion bytes of data every single day. That’s a huge volume of data, and it’s directly impacting industries ranging from retail and manufacturing to healthcare and education.
If you’re just getting started in the world of big data, all the terminology can be a bit confusing. So, let’s take a minute to break down a few of the key ideas that you’re most likely to come across:
Hadoop is an open-source framework for distributed storage and processing that allows organizations to store and query large amounts of data. Companies like Yahoo and Facebook run Hadoop clusters that handle tens or even hundreds of petabytes of data. At its simplest, Hadoop consists of two components: the Hadoop Distributed File System (HDFS) and MapReduce. Newer Hadoop deployments commonly include YARN as well, which serves as a dedicated resource management tool.
MapReduce is a distributed data processing model that runs on large clusters of commodity machines. It breaks down operations into a series of Map (filtering and sorting) and/or Reduce (summarization) functions. Because MapReduce divides large jobs into tasks that can run in parallel, overall processing times are shorter.
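To make the Map and Reduce steps described above a little more concrete, here is a minimal single-machine sketch of the classic word-count job in plain Python. It is only an illustration of the model, not Hadoop's actual API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample documents are assumptions made up for this example, and a real cluster would run the map and reduce tasks on many machines at once.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group intermediate values by key (a framework does this between phases)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: summarize all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    # Each document could be mapped on a different machine; here we simply loop.
    intermediate = []
    for document in documents:
        intermediate.extend(map_phase(document))
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample input, invented for illustration.
counts = word_count(["big data is big", "data lakes store raw data"])
print(counts)
```

Because every call to `map_phase` depends only on its own document, and every call to `reduce_phase` only on its own key, the work can be split across machines without the tasks needing to talk to one another.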
YARN (Yet Another Resource Negotiator) is an open-source resource manager that is typically deployed on a Hadoop cluster. The primary task of YARN is to efficiently allocate resources and manage tasks across a Hadoop environment.
While Hadoop itself is an open-source technology, a variety of vendors offer commercial distributions with proprietary add-ons designed to help customize Hadoop environments. Common distributions include Cloudera, MapR, and Hortonworks.
NoSQL databases like Cassandra and MongoDB are non-relational databases that allow for fast, ad-hoc analysis of large volumes of disparate data. They are characterized by flexible, schema-less data models, as opposed to the fixed schemas of traditional relational data warehouses.
A data lake is a single, consolidated repository that stores raw enterprise data, regardless of type or size. The term data lake is often used interchangeably with Hadoop; however, the two are not one and the same.
Next, let’s talk a bit about data. In general, data can be broken down into three distinct categories: structured, unstructured, and semi-structured.
· Structured data has a defined data model and can be easily stored in a relational database.
· Unstructured data does not have a defined data model. It encompasses a wide range of content, including text and other media such as audio, video, and images.
· Semi-structured data is a cross between the two – it carries some organizing structure, such as the tags or keys in XML and JSON, but lacks a strict data model.
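For a rough sense of how these three categories look in practice, the short Python sketch below handles one tiny sample of each. All of the sample values (the orders, the customer records, the note) are invented for illustration, and CSV and JSON are simply common stand-ins for structured and semi-structured data.

```python
import csv
import io
import json

# Structured: fixed columns, so every row fits a relational table directly.
structured = io.StringIO("order_id,amount\n1001,25.50\n1002,13.00\n")
rows = list(csv.DictReader(structured))

# Semi-structured: self-describing JSON; records need not share the same fields.
semi_structured = '[{"id": 1, "name": "Ana"}, {"id": 2, "tags": ["vip"]}]'
records = json.loads(semi_structured)
fields_seen = sorted({key for record in records for key in record})

# Unstructured: free text with no data model; any structure must be inferred.
unstructured = "Customer called to say the delivery arrived late but intact."
word_total = len(unstructured.split())

print(rows[0]["amount"], fields_seen, word_total)
```

Notice that the structured rows can be queried by column name right away, the JSON records have to be inspected to find out which fields exist, and the free text offers nothing to query until some kind of analysis is applied to it.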
As data volumes continue to grow, organizations are coming under increasing pressure to come to terms with two big questions: first, what’s the best, most economical way to store huge volumes of data? And second, how do you turn all that data into a real-world competitive advantage? These two questions are the driving force behind the thriving and rapidly evolving big data ecosystem.
Increasingly, organizations have a good handle on the question of how to economically store their data. Many have turned to Hadoop and other big data stores to consolidate their disparate data sources into a single enterprise data lake.
On the other hand, many organizations are still looking for elegant, scalable ways to analyze all of this information. At the end of the day, keeping data around for the sake of it isn’t a winning business strategy. Only by making this data available to people throughout your organization can you turn your investments in big data from a cost center into a source of meaningful insight and lasting competitive advantage. The key to pulling that off? Analytics.
An enterprise-ready platform like MicroStrategy lets organizations take advantage of their existing investments in big data, while empowering people across the organization to access, explore, and analyze data from a wide range of sources. Without analytics, it’s impossible to operationalize big data.
To learn more about how to leverage big data with MicroStrategy, check out our webcast, Adaptive Analytics: Transitioning from legacy systems to a modern platform with MicroStrategy and Cloudera.