Common source file formats used with Apache Spark

Let's look at some of the file formats most commonly used with Apache Spark:

  1. Parquet - Parquet is a columnar format that is supported by many other data processing systems. Spark SQL supports both reading and writing Parquet files and automatically preserves the schema of the original data. When reading Parquet files, all columns are automatically converted to be nullable for compatibility reasons.

  2. ORC - Apache ORC is a columnar format with advanced features such as native zstd compression, Bloom filters, and columnar encryption.

  3. JSON - Spark SQL can automatically infer the schema of a JSON dataset and load it as a Dataset[Row]. This conversion can be done using SparkSession.read.json() on either a Dataset[String], or a JSON file.

Note that the file supplied as a JSON file is not a typical JSON file: each line must contain a separate, self-contained, valid JSON object (the JSON Lines format). For a regular multi-line JSON file, set the multiLine option to true.
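As a minimal sketch of the line-per-object layout described above, the following Python snippet writes a small JSON Lines file with the standard library and wraps the corresponding PySpark read calls in helper functions. The function names, the sample records, and the file name are hypothetical; the helpers assume PySpark is installed and that an existing SparkSession is passed in as `spark`.

```python
import json
import tempfile
from pathlib import Path


def write_json_lines(records, path):
    """Write one self-contained JSON object per line (JSON Lines)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_json_lines_with_spark(spark, path):
    """Load a JSON Lines file; `spark` is assumed to be a SparkSession."""
    return spark.read.json(str(path))


def read_multiline_json_with_spark(spark, path):
    """Load a regular (multi-line) JSON document instead of JSON Lines."""
    return spark.read.option("multiLine", True).json(str(path))


# Hypothetical sample data, used only for illustration.
records = [{"name": "Ann", "age": 31}, {"name": "Bob", "age": 27}]
json_path = Path(tempfile.mkdtemp()) / "people.json"
write_json_lines(records, json_path)

# Each line of the file parses on its own as a JSON object.
json_lines = json_path.read_text(encoding="utf-8").splitlines()
```

With a session created via `SparkSession.builder.getOrCreate()`, calling `read_json_lines_with_spark(spark, json_path)` should return a DataFrame whose schema is inferred from the objects in the file.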

  4. CSV - Spark SQL provides spark.read().csv("file_name") to read a file or directory of files in CSV format into a Spark DataFrame, and dataframe.write().csv("path") to write to a CSV file. The option() function can be used to customize the behavior of reading or writing, such as controlling the header, the delimiter character, the character set, and so on.
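The sketch below shows one way the CSV reader options might be combined in PySpark. It writes a small semicolon-delimited file with Python's csv module and wraps the Spark read in a helper. The function names, file name, and sample rows are hypothetical, and the helper assumes an existing SparkSession is passed in as `spark`.

```python
import csv
import tempfile
from pathlib import Path


def write_sample_csv(path):
    """Write a small semicolon-delimited CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter=";", lineterminator="\n")
        writer.writerow(["name", "age"])
        writer.writerow(["Ann", 31])
        writer.writerow(["Bob", 27])


def read_csv_with_spark(spark, path):
    """Read the file with a header, a custom separator, and schema inference."""
    return (
        spark.read
        .option("header", True)       # first line holds the column names
        .option("sep", ";")           # non-default field delimiter
        .option("inferSchema", True)  # let Spark guess column types
        .csv(str(path))
    )


csv_path = Path(tempfile.mkdtemp()) / "people.csv"
write_sample_csv(csv_path)
csv_lines = csv_path.read_text(encoding="utf-8").splitlines()
```

Without the inferSchema option, Spark reads every CSV column as a string, so enabling it trades an extra pass over the data for typed columns.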

  5. Text Files - Spark SQL provides spark.read().text("file_name") to read a file or directory of text files into a Spark DataFrame, and dataframe.write().text("path") to write to a text file. When reading a text file, each line becomes a row with a single string column named "value" by default. The option() function can be used to customize the behavior of reading or writing, such as the line separator (the lineSep option), compression, and so on.
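A small, hedged illustration of a custom line separator in PySpark follows. The helper names and the sample file are hypothetical, the helpers assume an existing SparkSession or DataFrame is passed in, and the pure-Python split at the end only approximates the rows one would expect Spark to produce.

```python
import tempfile
from pathlib import Path


def read_text_with_spark(spark, path, line_sep=None):
    """Read text into a DataFrame with a single string column named "value"."""
    reader = spark.read
    if line_sep is not None:
        reader = reader.option("lineSep", line_sep)  # override the default separator
    return reader.text(str(path))


def write_text_with_spark(df, path, line_sep="\n"):
    """Write a DataFrame with exactly one string column as gzip-compressed text."""
    (
        df.write
        .option("lineSep", line_sep)
        .option("compression", "gzip")
        .mode("overwrite")
        .text(str(path))
    )


# Hypothetical file that uses "," instead of a newline to separate records.
text_path = Path(tempfile.mkdtemp()) / "records.txt"
text_path.write_text("alpha,beta,gamma", encoding="utf-8")

# With lineSep="," each of these segments would be expected to become one row.
expected_rows = text_path.read_text(encoding="utf-8").split(",")
```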

  6. Avro - Apache Avro is an open-source, row-based data serialization and data exchange framework for Hadoop projects. The spark-avro library, originally developed by Databricks as an open-source library, supports reading and writing data in the Avro file format. Avro is widely used with Apache Spark, especially in Kafka-based data pipelines. When Avro data is stored in a file, its schema is stored with it, so the file can be processed later by any program. Since Spark 2.4, Spark supports reading and writing DataFrames in Avro format through the spark-avro module, which must be added to the application as a separate package.

  7. Binary Files - Since Spark 3.0, Spark supports a binary file data source, which reads binary files and converts each file into a single record that contains the raw content and metadata of the file. It produces a DataFrame with the following columns and possibly partition columns:

     - path: StringType
     - modificationTime: TimestampType
     - length: LongType
     - content: BinaryType
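To make the record layout above more concrete, here is a rough sketch in Python. It writes a tiny binary file, builds a plain dictionary that approximates the four columns using the standard library, and wraps the PySpark read in a helper. The helper name, the file name, and the payload are hypothetical, and the helper assumes an existing SparkSession is passed in as `spark`.

```python
import os
import tempfile
from pathlib import Path


def read_binary_with_spark(spark, directory, glob="*.bin"):
    """Read every file matching `glob` as one record per file."""
    return (
        spark.read.format("binaryFile")
        .option("pathGlobFilter", glob)  # only pick up files matching the pattern
        .load(str(directory))
    )


# Hypothetical payload, used only for illustration.
payload = bytes([0x89, 0x50, 0x4E, 0x47])
bin_dir = Path(tempfile.mkdtemp())
bin_path = bin_dir / "sample.bin"
bin_path.write_bytes(payload)

# Plain-Python approximation of the columns Spark produces for this file.
stat = os.stat(bin_path)
record = {
    "path": str(bin_path),
    "modificationTime": stat.st_mtime,
    "length": stat.st_size,
    "content": bin_path.read_bytes(),
}
```

Because each file becomes a single record holding its full content, this source is best suited to many small or medium-sized files rather than a few very large ones.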