What is Amazon Redshift?
Amazon Redshift is a data warehouse service in the cloud. It is considered to be the fastest-growing service of AWS. It is used for analytic workloads, and clients can connect to it easily using standard SQL as well as BI tools.
The need for Redshift emerged from the shortcomings of traditional data warehousing. A traditional warehouse is expensive in terms of the time it takes to implement. Besides, when the data grows, which is sure to happen in any growing organization, the warehouse needs investment in new hardware. Redshift overcomes these shortcomings: setting up a cluster takes only a few minutes.
It scales easily as requirements change. It uses a columnar storage mechanism. It provides fast input/output regardless of the size of the dataset. Fast query processing is achieved by distributing and parallelizing queries across multiple nodes. Other key features are automated configuration, backup, and security.
The columnar storage mechanism used in Redshift stores data column by column, rather than row by row as traditional storage does. Because a query reads only the columns it references, columnar storage reduces disk I/O, and it makes wide, denormalized tables practical, which can reduce the need for joins. This is very helpful in big data processing. With parallelization, a large number of processors perform computations simultaneously, much as in clustering. Any query issued to Redshift is split up and spread across the nodes of the cluster. This also helps in achieving linear scalability.
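The difference between row and column layouts can be illustrated with a small Python sketch. This is only a toy model: the table, its field names, and the `to_columns` helper are made up for illustration and say nothing about Redshift's actual storage format.

```python
# Hypothetical toy table; not Redshift's real on-disk format.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 80.0},
    {"order_id": 3, "region": "EU", "amount": 45.5},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every whole row is visited just to total one field.
row_total = sum(row["amount"] for row in rows)

# Column layout: only the "amount" column is visited.
column_total = sum(columns["amount"])
```

Both totals are the same, but the column layout lets an aggregate skip `order_id` and `region` entirely, which is the I/O saving that columnar storage is after.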
Fast query execution:
Redshift takes an interesting approach to reducing query run time. A query is compiled only once, and the platform distributes the compiled query throughout the cluster. Data is also distributed after unnecessary information has been removed. The query can then be executed whenever it is needed, and because it has already been compiled, execution is much faster.
Under heavy workloads the full capacity of the cluster is used, but when the workload is light the cluster is underutilized. For such times, Redshift lets you scale the cluster up or down by adding or removing nodes. This differs from resizing a traditional warehouse, which does not offer a way to add or remove nodes.
Encryption and security:
IAM (Identity and Access Management) accounts are used to manage privileges. The VPC (Virtual Private Cloud) feature supports cluster management, with SSL encryption for data in transit. Another point to note is that a cluster cannot be switched between encrypted and unencrypted.
Data analysis support:
Redshift provides facilities for ETL operations. That does not remove the need for data scientists to run their own algorithms for analysis. Redshift lets data scientists write their algorithms and integrate them later as needed.
Why would you use it?
Data loading in Redshift is extremely fast. In addition, the query handling and parallel architecture discussed earlier make it very fast, for simple as well as complex queries.
As discussed above, Redshift delivers high performance through massive parallelism, efficient data compression, query optimization, data distribution, and so on. Its massively parallel processing (MPP) design lets Redshift parallelize data loading, backup, and restore. In addition, the queries you run are distributed across multiple nodes.
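The idea of spreading one query across several nodes can be sketched in Python as a scatter-gather sum. This is a simplified illustration under stated assumptions: the function names and the round-robin split are hypothetical, and threads stand in for cluster nodes only to show the structure, not to reproduce Redshift's performance.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_slices(values, node_count):
    """Deal the values round-robin into node_count slices."""
    return [values[i::node_count] for i in range(node_count)]


def partial_sum(slice_values):
    """The work one node does on its own slice, independent of the others."""
    return sum(slice_values)


def parallel_sum(values, node_count=4):
    """Scatter the data, compute partial sums concurrently, then combine them."""
    slices = split_into_slices(values, node_count)
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(partial_sum, slices))
    # Final step: merge the per-node partial results into one answer.
    return sum(partials)


total = parallel_sum(list(range(1, 101)))
```

Each worker sees only its own slice, so adding workers shrinks the share of data each one has to scan; the final merge touches only one number per worker.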
Scalability is a critical concern for any data warehouse, and Redshift is no exception. It provides horizontal scalability: you can add nodes to increase storage. While nodes are being added, your current cluster remains available for operations such as reads, so running applications keep working.
Cost is another point in Redshift's favor. Compared with the alternatives available in the market, it is significantly less expensive. It offers two pricing models, one with reserved payment and the other pay-as-you-go. You can choose either model based on your needs.
Redshift allows clusters to be launched within Virtual Private Cloud (VPC) infrastructure. You can then configure VPC security groups to restrict inbound and outbound access to your clusters. Privilege grants, as discussed earlier, control user access at a particular level. You can also define groups with privileges at different levels.
Redshift has some limitations: it is not suited to serving as a live application database, it does not enforce uniqueness on data, it has limited support for parallel upload, and it requires an understanding of sort and distribution keys. These are outweighed, however, when all the benefits are considered.
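Since sort and distribution keys are named above as something users must understand, a rough Python sketch of the idea may help. It assumes a distribution key decides which slice a row lands on and a sort key decides the order of rows within each slice. The `slice_for` and `distribute` helpers, the sample rows, and the use of CRC32 as the hash are all illustrative assumptions, not Redshift's actual algorithm.

```python
import zlib


def slice_for(key, slice_count):
    """Map a distribution-key value to a slice using a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(str(key).encode("utf-8")) % slice_count


def distribute(rows, dist_key, sort_key, slice_count):
    """Place each row on a slice by its distribution key, then sort each slice by its sort key."""
    slices = [[] for _ in range(slice_count)]
    for row in rows:
        slices[slice_for(row[dist_key], slice_count)].append(row)
    for slice_rows in slices:
        slice_rows.sort(key=lambda row: row[sort_key])
    return slices


# Hypothetical sample data.
rows = [
    {"customer_id": 7, "order_date": "2024-03-02"},
    {"customer_id": 3, "order_date": "2024-01-15"},
    {"customer_id": 7, "order_date": "2024-01-09"},
    {"customer_id": 3, "order_date": "2024-02-20"},
]
slices = distribute(rows, dist_key="customer_id", sort_key="order_date", slice_count=4)
```

Rows that share a `customer_id` always end up on the same slice, so work grouped on that key needs no data movement between slices, and the sorted order within a slice lets a date-range filter skip rows outside the range. A poorly chosen distribution key would instead pile most rows onto a few slices.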