Apache Spark supports many different data formats, such as the ubiquitous CSV format and the web-friendly JSON format. Formats used mainly for big data analysis include Apache Parquet and Apache Avro. In this post, we will look at the properties of these 4 formats (CSV, JSON, Parquet, and Avro) using Apache Spark.

CSV

CSV (comma-separated values) files are usually used to exchange tabular data between systems using plain text. CSV is a row-based file format, which means that each row of the file is a row in the table. A CSV file typically has a header row that contains the column names for the data; without one, the file is considered only partially structured. CSV files cannot natively represent hierarchical or relational data. Data connections are usually established using multiple CSV files. Foreign keys are stored in columns of one or more files, but the connections between these files are not expressed by the format itself. In addition, the CSV format is not fully standardized, and files may use separators other than commas, such as tabs or spaces.

Another property of CSV files is that they are only splittable when the file is raw and uncompressed, or when a splittable compression format such as bzip2 or lzo is used (note: lzo needs to be indexed to be splittable).

➕ CSV is human-readable and easy to edit manually
➕ CSV can be processed by almost all existing applications
➕ In CSV, the column headers are written only once, whereas in XML you open and close a tag for each column in each row

➖ Complex data structures have to be processed separately from the format
➖ No difference between text and numeric columns
➖ There is no standard way to present binary data
➖ Problems with CSV import (for example, no difference between NULL and quotes)

Despite these limitations and problems, CSV files are a popular choice for data exchange, as they are supported by a wide range of business, consumer, and scientific applications. Similarly, most batch and streaming frameworks (e.g. Spark and MapReduce) natively support serialization and deserialization of CSV files and offer ways to add a schema while reading.
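Some of the CSV properties discussed in this post (no column types, no way to tell NULL from an empty quoted value, and separators other than commas) can be seen in a small sketch. It uses Python's standard `csv` module rather than Spark so that it runs anywhere. The sample data and the `apply_schema` helper are hypothetical and only for illustration; the helper is loosely analogous to supplying a schema at read time, not a reproduction of any framework's API.

```python
import csv
import io

# Hypothetical sample data: row 2 has an unquoted empty field,
# row 3 has a quoted empty string and a trailing empty field.
raw = 'id,name,score\n1,alice,9.5\n2,,7\n3,"",\n'

rows = list(csv.DictReader(io.StringIO(raw)))

# Every parsed value is a str: the format itself carries no column types.
types = {type(value).__name__ for row in rows for value in row.values()}

# An unquoted empty field and a quoted empty string parse to the same value,
# so after parsing a NULL cannot be told apart from "".
unquoted_empty = rows[1]["name"]
quoted_empty = rows[2]["name"]


def apply_schema(row, schema):
    """Cast each field with a caller-supplied schema; treat empty strings as None."""
    return {key: (schema[key](value) if value != "" else None)
            for key, value in row.items()}


# Assumed schema: it lives in the reading code, not in the file.
schema = {"id": int, "name": str, "score": float}
typed = [apply_schema(row, schema) for row in rows]

# Files may use other separators, such as tabs; the reader has to be told.
tsv_rows = list(csv.reader(io.StringIO("a\tb\n1\t2\n"), delimiter="\t"))

print(types)
print(typed)
print(tsv_rows)
```

Treating every empty string as None is a design choice made by the reader, not something the file dictates; a different consumer could just as reasonably keep empty strings as text.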