We were recently working with a customer who were starting a re-platform initiative, wherein they were looking to move a large chunk of their enterprise applications from being hosted on-premises to cloud services. The client was looking at various cloud provider options, one of which was AWS. We were tasked with looking at options for
their Data Warehouse and Analytics estate, which included a mix of Tableau for Desktop and Server/Dashboard analysis, and an Oracle database where their star schemas resided. They were keen to look at Redshift, in a bid to reduce costs and improve query performance. We decided to engage Sirocco Systems to design and build out a Proof of Concept infrastructure on which the client could carry out testing, to ensure they could move their existing star schemas as they were and continue to use Tableau for reporting. We’ll walk through the initial design and set up that Sirocco performed in this post, and cover some more detailed aspects of performance tuning later. The proposed infrastructure was to keep their Tableau Desktop environment on premises, and connect into AWS using a VPN:
The redshift cluster was made up of 8 dc1.large nodes, which provided an approximate equivalent to their existing Oracle RAC cluster in terms of CPU and Memory, as well as using SSD storage.
Migration to Redshift
In the interest of time, we decided to export one of the clients smaller star schemas in terms of number of dimensions, but one which contained a suitably large row set to give some meaningful performance results. The Star schema we used was based on site visit analysis, and contains 4 dimension tables, 1 bridge table, 1 lookup table and 1 fact table. There is also a view overlaying the fact table which provides some in-DB calculated fields. The model is displayed below.
To get the data into Redshift we generated a CSV export of each table, split these into 16 chunks each and then posted these to S3. We then used the Redshift copy command to load them into the a schema called tableau in the database. For the time being the tables in Redshift were creating using equivalent DDL as per their existing database, but without any performance enhancements, e.g. no compression, sort or distribution keys.
Configuring Tableau
The first thing we needed to do was to download the Redshift drivers. As the user base was large Apple Mac based we fetched the driver from http://docs.aws.amazon.com/redshift/latest/mgmt/install-odbc-driver-mac.html
Once installed, you can then Start up Tableau and create a connection to the database.
Once connected you can select the tableau schema from the drop down, and the list of available tables and views will display. We can then configure the Data Source by selecting the tables/views we want to perform analysis against, configure the join between them, and add any appropriate calculated fields and filters. We’ll start off by
building a straightforward analysis on the number of visits by property for 2014.
We’ve added two calculated fields, ‘Total Visits’ and ‘Visit Group’. Total Visits is simply a count against the table’s primary key. Visit Group is a way of classifying the popularity of each site, and uses some logic against the Total Visits field to identify if a site has a large number of visits. We define it as:
Finally, we add a filter against dim_date to restrict the output to 2014.
You can then test validity by clicking Update Now, which will return a rows set for the query.
Visualisation
Now that we have a data source defined, we can create a visualisation for it, using Tableau’s built in set of graphical options. If we click on Sheet1, we are presented with a blank worksheet, and the dimensions and measures from the data source we just defined:
We’ll start by creating a pie chart analysis show all the visits against sites classified as ‘Large’. For this we require at least 1 dimension and 1 measure (for this example we’ll use Property Name and Total Visits). We first drag the measures and dimensions into the Marks panel. We’ll display each property slice in a different colour, as well as show the property name next to the slice, so we move Property Name on the Color and Label sections. We would like the number of visits to be reflected by the size of the slice and also have this display next to the slice, so similarly we drag Total Visits onto the Size and Label sections. So our Marks panel now looks like this:
Next we drag the Visit Group onto the Filters Panel, which allows us to restrict by Large sites:
Finally, we want to have the year displayed above the Pie Chart, so we drag the Financial Year dimension into the Columns section, and we’ll add a title. The Pie Chart renders like this:
To ensure this is returning sensible results, we’ll run a query against the database to validate:
Looks good! So whilst this is only a basic analysis, it’s a good start to show that we can start running our existing reports against Redshift with little or no modification.