Introduction to Big Data
When working with big data, it is better to use a distributed system of several machines than to process the data on a single machine. To understand why, it helps to know the capabilities of the hardware involved: the CPU, memory, storage (SSD), and network. The CPU performs the main computation; memory holds data temporarily before the CPU operates on it; the SSD is where data is stored long term; and the network connects the system to the outside world. A 2.5 GHz CPU can perform roughly 2.5 billion operations per second.
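The arithmetic behind that figure can be sketched in a few lines. This is a back-of-the-envelope estimate only; the simplifying assumption that the CPU completes one operation per clock cycle is ours, not a hardware fact.

```python
# Back-of-the-envelope CPU throughput estimate, assuming (simplification)
# one operation completed per clock cycle on a 2.5 GHz CPU.
clock_hz = 2.5e9          # 2.5 GHz -> 2.5 billion cycles per second
ops_per_cycle = 1         # assumed: one operation per cycle

ops_per_second = clock_hz * ops_per_cycle
print(f"{ops_per_second:.1e} operations per second")  # 2.5e+09

# Time to apply one operation to every byte of a 1 GB dataset,
# ignoring memory and disk latency entirely:
dataset_bytes = 1e9
seconds = dataset_bytes / ops_per_second
print(f"{seconds:.2f} s to touch 1 GB")  # 0.40 s
```

The point of the second calculation is that raw compute is rarely the bottleneck; as the next paragraph explains, moving the data to the CPU is what dominates.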
For example, your CPU can perform a live character count while using just 0.01% of its capacity. Although the CPU can process data very quickly, moving data from memory to the CPU takes comparatively long and inflates overall processing time. Similarly, loading data from an SSD or magnetic disk takes far longer than loading it from memory. Memory is very expensive, and several companies simply bought machines with very large amounts of it to ease data processing. Google, however, made a breakthrough by distributing the data across several machines. A distributed system is a cluster of connected machines, where each machine is called a node.
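To make the memory-versus-storage gap concrete, here is a small illustrative calculation. The bandwidth figures are rough, assumed values chosen for illustration only; real hardware varies widely.

```python
# Illustrative load times for a 10 GB dataset from different storage
# tiers. Bandwidths below are ASSUMED ballpark figures, not measurements.
bandwidth_gb_per_s = {
    "memory (DRAM)": 20.0,   # assumed ~20 GB/s
    "SSD": 0.5,              # assumed ~500 MB/s
    "magnetic disk": 0.1,    # assumed ~100 MB/s
}

dataset_gb = 10
load_times = {tier: dataset_gb / bw for tier, bw in bandwidth_gb_per_s.items()}
for tier, seconds in load_times.items():
    print(f"{tier:>15}: {seconds:7.1f} s")
```

Even with generous assumptions, the slowest tier is orders of magnitude behind memory, which is why keeping data in memory (or spreading it across many machines) pays off.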
When you process a very large dataset on a single computer, the machine may hang and the program may terminate. This is not because the CPU is unable to process the information, but because it cannot load the data quickly enough from memory and storage. Each time the CPU processes a chunk of data, that data is moved from storage to memory and then to the CPU; the CPU processes it, writes the result back to memory, and the cycle continues. This back-and-forth shuttling of data between storage, memory, and CPU leads to a problem called thrashing. Distributing the data across several machines and having them process it in parallel reduces thrashing.
In parallel computing, several processors share the same memory, whereas in distributed computing each processor has its own memory. In distributed computing, each machine is connected to the others over a network. Hadoop is a distributed system comprising several software utilities that support massive storage and processing capability. The Hadoop framework has four main components:
- HDFS (Hadoop Distributed File System) is a data storage system that stores data across commodity machines
- MapReduce is a large-scale data processing system
- Hadoop YARN is a resource manager that schedules computational workloads
- Hadoop Common is the set of shared libraries and utilities that support the other modules
Apache Pig, developed at Yahoo, provides an SQL-like scripting language that enables workers to write complex data transformations on Apache Hadoop without knowing Java. Hive is a data warehouse system used to analyze large datasets stored in HDFS. Hive works only with structured data, whereas Pig handles both structured and unstructured data; both run their analyses as MapReduce jobs. In MapReduce, intermediate results are written back to disk, which slows the jobs down. This slowness encouraged Matei Zaharia to create Spark, which does not write intermediate results to disk and thereby reduces processing time.
MapReduce involves three main steps: map, shuffle, and reduce. In HDFS, the data is divided into smaller chunks (partitions) and stored in a distributed manner. In the map step, each record is mapped to key-value pairs (tuples). In the shuffle step, the pairs are redistributed so that all pairs sharing the same key are grouped together and end up on the same machine. In the reduce step, the shuffled intermediate results are reduced to the final result.
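The three steps above can be sketched on a single machine with the classic word-count example. This is a minimal simulation in plain Python, not a real MapReduce job; each partition stands in for a chunk of data that a separate node would process.

```python
# Minimal single-machine sketch of map -> shuffle -> reduce (word count).
from collections import defaultdict

partitions = [
    "big data big systems",
    "data systems data",
]

# Map: turn each record into (key, value) tuples.
mapped = [(word, 1) for chunk in partitions for word in chunk.split()]

# Shuffle: group pairs by key (in a real cluster, same-key pairs
# are routed to the same node).
shuffled = defaultdict(list)
for key, value in mapped:
    shuffled[key].append(value)

# Reduce: collapse each key's values into the final result.
counts = {key: sum(values) for key, values in shuffled.items()}
print(counts)  # {'big': 2, 'data': 3, 'systems': 2}
```

In a real cluster the map and reduce functions look much like these, but the framework handles partitioning, shuffling over the network, and fault tolerance.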
In distributed computing, the task to be performed is distributed to several machines (nodes) across a cluster. Each node works on its portion of the task, and the partial results are aggregated to compute the final result. It is very important to have a master that distributes the tasks across the nodes, and there are different cluster configurations to facilitate this. In local mode, the task runs entirely on the local machine using the Spark APIs, without the need for a cluster. A cluster manager is a process that manages resources across the cluster; the most common cluster managers are Standalone, YARN, and Mesos.
Spark is used in several ways. In data analytics, it performs ETL processing. In machine learning, it runs iterative algorithms such as logistic regression and PageRank; Spark is designed to keep iterative data in memory, reducing processing time, and it makes efficient use of memory. The Hadoop ecosystem is slightly older than the Spark ecosystem, and Hadoop MapReduce is generally slower than Spark because Hadoop writes data out to disk during intermediate steps. However, many big companies, such as Facebook and LinkedIn, started using big data early and built their infrastructure around the Hadoop ecosystem.