Redshift table occupies far more disk space in production cluster
Production cluster details:
- node type: dc1.8xlarge
- nodes: 25
- storage: 2.56 TB SSD per node

Test cluster details:
- node type: ds2.xlarge
- nodes: 6
- storage: 2 TB HDD per node
When the same table (same DDL and encoding) is unloaded from the production cluster and copied to the test cluster, the disk footprint shrinks dramatically. This has been tested with multiple tables using different distribution styles and sort key patterns.

Example: table A (no sort key, DISTSTYLE EVEN) - size in production: 60 GB; size in test: 0.6 GB.
Table B (sort key, DISTSTYLE KEY) - size in production: 96 GB, 100% sorted; size in test: 1.4 GB, 100% sorted.

Any ideas what could cause this discrepancy? I have read through the Redshift forums but have not been able to find a reason for the issue. I am using the admin view v_space_used_per_tbl (provided by AWS) to calculate table sizes.
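For reference, a comparable per-table size figure can also be read straight from the built-in SVV_TABLE_INFO system view, without the admin views. A minimal sketch, with 'my_table' as a placeholder name:

    -- size is reported in 1 MB blocks
    SELECT "table", diststyle, size AS size_mb, tbl_rows, unsorted
    FROM   svv_table_info
    WHERE  "table" = 'my_table';

Running this on both clusters for the same table makes it easy to confirm that the discrepancy is in the allocated blocks rather than in the admin view's arithmetic.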
If the table's row count is small, the table size is defined by the minimum table size for the cluster.

Redshift allocates a minimum of 1 MB per column per slice in the cluster. A dc1.8xlarge node has 32 slices, so on a 25-node cluster that's ~800 MB per column. A 60 GB minimum-size table therefore has roughly 72 user columns plus 3 internal ones (fewer if some of the columns are large VARCHARs). A ds2.xlarge node has 2 slices, so on a 6-node cluster the minimum size of the same 72-column table is only ~900 MB.
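This per-slice allocation can be checked directly in the system tables. A rough diagnostic sketch (the table name is a placeholder): each column, including the internal ones, should show at least one 1 MB block on every slice that holds rows.

    -- count 1 MB blocks and populated slices per column for one table
    SELECT b.col,
           COUNT(DISTINCT b.slice) AS slices_used,
           COUNT(*)                AS mb_blocks
    FROM   stv_blocklist b
    JOIN   stv_tbl_perm  p
      ON   b.tbl = p.id
     AND   b.slice = p.slice
    WHERE  TRIM(p.name) = 'my_table'
    GROUP  BY b.col
    ORDER  BY b.col;

On the 25-node dc1.8xlarge cluster (800 slices) you would expect mb_blocks of roughly 800 per column once every slice holds rows; on the 6-node ds2.xlarge cluster (12 slices) the same table should drop to roughly a dozen blocks per column.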
For a large cluster in particular, you should consider using DISTSTYLE ALL if the table is a "dimension" table (a small-ish lookup table). That reduces the minimum size to 1 MB per column per node, as in the sketch below.
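A minimal DDL sketch, assuming a hypothetical small lookup table (recent Redshift versions can also switch an existing table in place with ALTER TABLE ... ALTER DISTSTYLE ALL):

    -- DISTSTYLE ALL keeps a full copy of the table on each node,
    -- so the minimum footprint is ~1 MB per column per node.
    CREATE TABLE dim_country (
        country_code CHAR(2)      NOT NULL,
        country_name VARCHAR(100) NOT NULL
    )
    DISTSTYLE ALL;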
For a detailed explanation, please refer to the AWS Knowledge Center article "Why does a table in an Amazon Redshift cluster consume more disk storage space than expected?"