Spark

Abinaya Rajesh
2 min read · Jan 2, 2021

Spark is written in Scala, which is a functional programming language. Spark can also be accessed from Python, a procedural language, but you will still see the influence of functional programming throughout Spark. Functional programming is better suited to a distributed system: it reduces the possibility of error by using functions that produce consistent results for the same inputs, whereas procedural code can produce varied results in some cases. A function should also have no side effects on variables outside its scope, since such effects can lead to errors. A function that meets both conditions is called a pure function.
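
To make the distinction concrete, here is a minimal Python sketch contrasting an impure function with a pure one. The names (running_total, add_impure, add_pure) are purely illustrative:

```python
# Illustrative sketch: impure vs. pure functions.

running_total = 0

def add_impure(x):
    # Impure: mutates state outside its own scope, so repeated calls
    # with the same argument return different results.
    global running_total
    running_total += x
    return running_total

def add_pure(total, x):
    # Pure: the result depends only on the inputs and nothing outside
    # the function is touched, so it is safe to re-run on any machine.
    return total + x

print(add_impure(5), add_impure(5))  # 5, then 10 -- varied results
print(add_pure(0, 5), add_pure(0, 5))  # 5, then 5 -- consistent results
```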

It is necessary to keep all the functions and sub-functions pure, because any change to outside variables might corrupt the entire process. If Spark took a copy of the input at every function level, it would quickly run out of memory. Spark handles this problem with lazy evaluation: it builds a step-by-step graph of the functions and data it will need, called a Directed Acyclic Graph (DAG). Because of lazy evaluation, Spark defers computing intermediate results until the last possible moment, which gives it a chance to streamline the whole process before producing output.
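
As a rough PySpark illustration, the transformations below only add steps to the DAG; nothing actually runs until an action such as count() is called (this sketch assumes a local SparkSession):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1, 1001))

# Transformations: nothing is computed here. Spark only records
# these steps in the DAG.
doubled = numbers.map(lambda x: x * 2)
evens = doubled.filter(lambda x: x % 4 == 0)

# Action: only now does Spark walk the DAG and execute the plan.
print(evens.count())
```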

It is very important to know the format of the data before loading it into Spark for processing. There are different formats, such as CSV, where the data is comma-separated or separated by some other delimiter. JSON files are common for web server logs; the data is stored as attribute-value pairs, and the values are often simple, such as numbers or arrays. HTML and XML formats encode the data between tags, so it is very important to track the opening and closing tags to preserve the structure of the data.
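
A minimal PySpark sketch of loading these formats; the file names here are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("load-formats-demo").getOrCreate()

# CSV: tell Spark which delimiter is used and whether a header row exists.
csv_df = spark.read.csv("logs.csv", header=True, sep=",", inferSchema=True)

# JSON: each line is expected to be one JSON object (attribute-value
# pairs), which is the common layout for web server log files.
json_df = spark.read.json("logs.json")

csv_df.printSchema()
json_df.printSchema()
```
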
In Spark, data is stored in a distributed manner (a distributed data store). The data is divided into small, digestible chunks (64 MB or 128 MB) and spread across several machines in the cluster. The distributed system stores the data in a fault-tolerant way by replicating each chunk across different machines. In Hadoop, the data is stored in HDFS, though other data storage systems can be used as well. On AWS, the Simple Storage Service (S3) is used to store the raw data in buckets.
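
As a hedged sketch, the same read calls can point at HDFS or S3 just by changing the path. The host, port, and bucket names below are hypothetical, and reading from S3 assumes the hadoop-aws (s3a) connector is configured on the cluster:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-demo").getOrCreate()

# HDFS: the file is already split into blocks (e.g. 128 MB) across the
# cluster's data nodes. "namenode:9000" is a placeholder address.
hdfs_df = spark.read.csv("hdfs://namenode:9000/data/logs.csv", header=True)

# S3: the s3a connector lets Spark read raw data straight from a bucket.
# "my-bucket" is a placeholder bucket name.
s3_df = spark.read.json("s3a://my-bucket/raw/logs.json")
```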
