Table Size Optimization

    Table Size Optimization for Small Datasets in Amazon Redshift

    1 mins read

    Amazon Redshift is a fast, fully managed, petabyte-scale Data Warehouse (DWH) solution. It is a columnar Database and facilitates massive parallel processing (MPP). In any DWH solution, we always have facts, Dimensions, Xref, etc. types of datasets which vary significantly in size. Hence, It is not wise to use default redshift setting for all sizes of datasets. 

    To design an efficient redshift based DWH solution one need to be aware of various Static and dynamic variables which control the size and performance of the environment.

    Type and numbers of Nodes with slices:  An Amazon Redshift data warehouse is a collection of computing resources called Nodes and Each Nodes has dedicated memory and disk storage (slices). The Master node manages the distribution of data and query processing tasks to the Slave Nodes. The disk storage for a Slave (Compute) node is divided into a number of slices. The number of slices per node depends on the node type. For example, each DS1.XL compute node has two slices, and each DS1.8XL compute node has 16 slices. Based on the distribution style data get stored into these slices as block store. Hence, the utilization of these nodes majorly depends on upon distribution style.

    Distribution style:  When a query is executed, the query optimizer redistributes the rows to the compute nodes as needed to perform any joins and aggregations. The goal in selecting a table distribution style is to minimize the impact of the redistribution step by physically locating the data where it needs to be before the query is executed.  There is three type of distribution style available in the Redshift.

    Compression and Vacuum strategy:   Compression and Vacuum strategy is very important for petabyte scale DWH system. To ensure the maximum performance and up to the mark space utilization, a periodic vacuum needs to be strategized.

    A data warehouse solution always consists of a variety of datasets and Redshift is designed for larger datasets hence smaller datasets need to be treated differently and this paper highlights that aspect.

    Click here to read the complete whitepaper and learn about the table size optimization in Amazon Redshift