Why you need a big data file format to store data
How do you get petabytes of data into Amazon S3 or your data warehouse for analytics? If you were to just load data in its original format, it wouldn’t be of much use. Storing data in its raw format would consume a lot of space and raw file formats cannot be accessed in a parallel manner. In data warehouses like Redshift and Snowflake, data is usually partitioned and compressed internally to make storage economical, make access fast and enable parallel processing. On Amazon S3, the file format you choose, compression mechanism and partitioning will make a huge difference in performance. In this blog, let us examine the 3 different formats Parquet, ORC and AVRO and look at when you use them.
About the three big data formats: Parquet, ORC and Avro
These three formats are typically used to store huge amounts of data in data repositories. They compress the data so you need less space to store your data, which can be an expensive exercise. Data stored in ORC, Avro and Parquet formats can be split across multiple nodes or disks which means they can be processed in parallel to speed up queries.
All three formats are self-describing which means they contain the data schema in their files. What does this mean? It means you can actually use an ORC, Parquet, or Avro file from one cluster and load it on a different system, and the system will recognize the data and be able to process it.
How do these file formats differ?
Parquet and ORC both store data in columns and are great for reading data, making queries easier and faster by compressing data and retrieving data from specified columns rather than the whole table. Parquet and ORC also offer higher compression than Avro.
Each data format has its uses. When you have really huge volumes of data like data from IoT sensors for e.g., columnar formats like ORC and Parquet make a lot of sense since you need lower storage costs and fast retrieval. But if you are considering schema evolution support or the capability of the file structure to change over time, the winner is Avro since it uses JSON in a unique manner to describe the data, while using binary format to reduce storage size. Avro has row-based data storage and excels at writing data.
A closer look at the three big data formats:
ORC or Optimized Row Columnar file format
ORC stands for Optimized Row Columnar (ORC) file format. This is a columnar file format and divided into header, body and footer.
File Header with ORC text
The header will always have the ORC text to let applications know what kind of files they are processing.
File Body contains data and indexes
The body will have the data and the indexes (which define the tables). The actual data is stored in Stripes which are simply rows of data. The Stripe size by default is 250 MB. Each stripe will further contain three sections – the index section, data section and a footer section. The index and data are both use columnar storage so you can access only columns where data is of interest. The index sections indicate max. and min. values for columns and row positions within each. ORC indexes help search Stripes based on the data needed and row groups. The footer of the Stripe indicates column encoding, directory of streams and their source.
File Footer stores metadata, information about Stripes, file versioning and compression type
The footer of the file has 3 sections- file metadata, file footer and postscript. The metadata will have statistical information about the Stripe while the footer will have details including the list of Stripes in the file, number of rows per Stripe, and the data type for each column. It also has aggregate counts for every column like min, max, and sum. The postscript section will have file details like the file footer and metadata length, file version, and the type of compression used along with the size of the compressed folder.
Amazon Athena performance with ORC
We have found that files in the ORC format with snappy compression help deliver fast performance with Amazon Athena queries.
A row-based binary storage format that stores data definitions in JSON
Avro is an open source object container file format. Unlike the other two formats, it features row-based storage. Avro stores data definition in JSON so data can be easily read and interpreted. It uses the JSON file format for defining the data types, protocols and serializes the data in a compact binary format, making for efficient, resource-sparing storage.
Data serialization capability and Schema evolution adds flexibility
Avro provides data serialization, so data can be serialized into files or messages. Avro is entirely schema-dependent with both data and schema stored in the same file or in a message. This allows the data to be sent to any destination and to be processed by any program. A great feature of Avro is Schema evolution that supports data schemas that undergo changes over a period. It deals with schema changes like missing fields, added fields and changed fields – this adds a lot of flexibility. New data can be read by old programs and old data can be read by new programs.
No coding generator required
Data stored with Avro can even be shared by programs using different languages. Sometimes a coding generator is needed by data exchange services to interpret data definition and to create code to access data. With Avro there is no need for this, making it a perfect candidate for scripting languages.
Parquet is ideal for big data
Parquet is an efficient columnar data storage format that supports complex nested data structures in a flat columnar format. Parquet is perfect for services like AWS Athena andAmazon Redshift Spectrum which are serverless, interactive technologies.
Parquet offers flexible compression options and efficient encoding schemes
Parquet is ideal for working with huge amounts of complex data and offers a variety of data compression and encoding options. This is especially useful while reading specific columns from giant tables because Parquet can then only read the required columns instead of the entire table leading to faster processing of data.
Columnar data storage means efficiency
Since it features columnar storage and compression, non-relevant data can be filtered out fast while querying. There are different codecs available for compressing data and different data files can have different compression types applied.
Parquet support for Amazon Redshift
Twice as fast and with six times more compression than text files
Apache Parquet is a incredibly versatile open source columnar storage format. It is 2x faster to unload and takes up 6x less storage in Amazon S3 as compared to text formats. It also allows you to save the Parquet files in Amazon S3 as an open format with all data transformation and enrichment carried out in Amazon Redshift.
Parquet is easy to load
You can use the COPY command to copy Apache Parquet files from Amazon S3 to your Redshift cluster.
Amazon Athena can be used for object metadata
Parquet is a self-describing format and the schema or structure is embedded in the data itself therefore it is not possible to track the data changes in the file. To track the changes, you can use Amazon Athena to track object metadata across Parquet files as it provides an API for metadata.
Automatic partitioning in Amazon S3
By specifying one or more partition columns you can ensure data that is loaded to S3 from your Redshift cluster is automatically partitioned into folders in your S3 bucket. This helps your queries run faster since they can skip partitions that are not relevant and benefit from partition pruning. This lowers cost and speeds up query performance. You can use Redshift Spectrum, Amazon EMR, AWS Athena or Amazon SageMaker to analyse data in S3.
Parquet support for Snowflake
Load structured and semi-structured data
Parquet files, including semi structured data can be easily loaded into Snowflake. Data from Snowflake can also be unloaded as Parquet files.
How Snowflake reads Parquet data
Parquet data is read by Snowflake into a single VARIANT column and the data can be queried in the VARIANT column, as you would with JSON data using similar commands and functions. Optionally you can select columns from a staged Parquet file and extract them into separate table columns by using a CREATE TABLE AS SELECT statement.
Supports efficient compression and encoding schemes
Parquet is built from ground up and is a good fit for Snowflake as it supports extremely efficient compression and encoding schemes. Compression schemes can be specified for every column and more encodings can be added in the future, making it future-proof.
ORC and Parquet for AWS
Query performances on Amazon Athena and Redshift Spectrum become cost-effective and very fast using open source columnar storage formats like Parquet and ORC. You also save on storage costs since data is compressed to a great degree.
How to convert data to columnar formats using an EMR cluster
With existing Amazon S3 data, you can create a cluster in Amazon EMR and convert it using Hive.
You can also get Amazon S3 inventory reports in Parquet or ORC format. Amazon S3 inventory gives you a flat file list of your objects and metadata. You can get the S3 inventory for CSV, ORC or Parquet formats.
Automate data conversion to Parquet or ORC with BryteFlow Ingest
The fastest and easiest way to convert your data is to do it automatically with BryteFlow Ingest. It automatically converts your data into Parquet or ORC formats when loading your data into Amazon S3, depending on the configuration settings.
It can load parquet files automatically to Redshift and Snowflake. With Snowflake, BryteFlow has the smarts to separate the data into separate columns rather than loading into a VARIANT column.
Automate data transformation of Parquet and ORC files with BryteFlow Blend
BryteFlow Ingest and Blend also automatically partitions the data on Amazon S3 and manages compression automatically using Snappy, gzip or bzip2.
BryteFlow Blend can automatically view Parquet or ORC formats without conversion. You can easily view the data and then use it for transformation,
Automate interfaces with Lake Formation and Glue Data Catalog
BryteFlow automatically interfaces with AWS Lake Formation, AWS Glue Data Catalog, using the API ( doesn’t wait for the crawlers), making your data available through the entire AWS eco-system.