But before we dive in, let’s establish a definition for “big data” itself, the most important term of them all. Big data is essentially a large volume of structured and unstructured data that’s too large or complex for traditional data warehouses to handle. Wondering how to identify big data? Typically, there are “five V’s” used to identify it:

  1. Volume: there is a very large amount of data, often measured in terabytes or petabytes
  2. Variety: there are different types of data that you can analyze. Often these types fall into three categories: structured, semi-structured, and unstructured data.
  3. Velocity: the necessary data processing is time-sensitive
  4. Veracity: the data is trustworthy
  5. Value: the data can be used to provide a clear business value

Now, let’s dive into the terminology:


Hadoop

An open-source framework for distributed storage and distributed processing that allows organizations to store and query large amounts of data (larger than what you can store in a traditional database). Hadoop has two core components: Hadoop Distributed File System for storage and MapReduce for processing.

Hadoop Distributed File System (HDFS)

A file system used by Hadoop applications that runs on clusters of commodity machines. HDFS allows for the storage of large, imported files from applications outside of the Hadoop ecosystem. It also allows imported files to be processed by Hadoop applications.


MapReduce

A distributed data processing model and execution environment that runs on large clusters of commodity machines. It uses the MapReduce algorithm, which breaks down all operations into Map (performs filtering and sorting) and/or Reduce (performs summary operations) functions. The biggest advantage of MapReduce is the scalability and fault tolerance it gives a wide variety of applications: the execution engine is optimized once, and every job that runs on it benefits.
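
To make the Map and Reduce steps concrete, here is a minimal single-machine sketch of a word count in plain Python. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and the grouping step in the middle stands in for work that a real framework would do across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by key, between the Map and Reduce steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: summarize all values seen for one key (here, a sum)."""
    return key, sum(values)


def word_count(documents):
    # Run Map over every document, group by word, then Reduce each group.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data is big", "data is valuable"])
print(counts)
```

Because each call to `map_phase` looks at only one document and each call to `reduce_phase` looks at only one key, the calls are independent of each other, which is what lets this kind of job be spread across a cluster.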

Structured Data

Data whose structure is known. It resides in a fixed field within a file or record.

Unstructured Data

Information that does not have a defined data model or organization. It can be textual (such as the body of an email, instant messages, Word documents, PowerPoint presentations, or PDFs) or non-textual (audio, video, or image files).

Semi-Structured Data

A cross between structured and unstructured data. This data has some structure but no strict data model. Examples include event log data and strings of key-value pairs.
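
As a rough illustration of the difference, the sketch below parses two key-value log lines (semi-structured) and projects them onto a fixed set of columns (structured). The log format, field names, and the `parse_event` helper are all invented for this example.

```python
def parse_event(line):
    """Turn 'key=value key=value' text into a dict; keys may differ per line."""
    record = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that are not key=value pairs
            record[key] = value
    return record


# Semi-structured: each event carries its own field names, and they can vary.
events = [
    parse_event("ts=2024-01-01T12:00:00 user=alice action=login"),
    parse_event("ts=2024-01-01T12:00:05 user=bob action=purchase amount=19.99"),
]

# Structured: every row has the same fixed fields; missing values become None.
columns = ["ts", "user", "action", "amount"]
rows = [[event.get(col) for col in columns] for event in events]
print(rows)
```

The first event has no `amount` field at all, which is fine for semi-structured data but has to be filled with an explicit empty value once the data is forced into fixed columns.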

YARN (Yet Another Resource Negotiator)

A resource management platform that allocates cluster resources to the applications running on Hadoop and schedules their work. It also delivers consistent operations, security, and data governance tools across Hadoop clusters.


Hive

A distributed data warehouse built on top of HDFS to manage and organize large amounts of data. Hive provides a schematized data store for housing large amounts of raw data. It also provides a SQL-like environment to execute analysis and query tasks on raw data in HDFS. This SQL-like environment is the most popular way to query Hadoop. Leading Hadoop vendors and services, including Cloudera, Hortonworks, MapR, and Amazon EMR, offer Hive ODBC connectors.

NoSQL Databases

Non-relational databases (including MongoDB, HBase, and Apache Cassandra) that allow fast ad-hoc analysis of extremely high-volume and disparate data. Their characteristics include a flexible, schema-less data model, horizontal scalability, and distributed architectures. They’re ideal for storing and retrieving objects needed for web applications.
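
The toy class below sketches two of those characteristics in plain Python: documents with no shared schema, and keys spread across several shards by hash. `TinyDocumentStore` is a hypothetical in-memory stand-in, not the API of MongoDB, HBase, or Cassandra, and real systems handle partitioning, replication, and failures in far more sophisticated ways.

```python
import hashlib


class TinyDocumentStore:
    """Hypothetical in-memory stand-in for a sharded, schema-less store."""

    def __init__(self, num_shards=3):
        # Each dict plays the role of one machine in the cluster.
        self.shards = [{} for _ in range(num_shards)]

    def _shard_for(self, key):
        # A stable hash sends the same key to the same shard every time.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, key, document):
        self._shard_for(key)[key] = document

    def get(self, key):
        return self._shard_for(key).get(key)


store = TinyDocumentStore()
# Schema-less: the two documents do not need to share the same fields.
store.put("user:1", {"name": "Alice", "email": "alice@example.com"})
store.put("user:2", {"name": "Bob", "tags": ["admin"], "last_login": "2024-01-01"})
print(store.get("user:2"))
```

Adding capacity in this model means adding shards rather than buying a bigger machine, which is the idea behind horizontal scalability.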


HBase

A distributed, column-oriented database. It uses HDFS for its underlying storage and supports both batch-style computations using MapReduce and low-latency interactive queries (random reads and writes).