Cheap Data warehousing, without the warehouse

I was pretty quiet over the past couple of weeks. Part of it was a lot of travel, but another part was taking a deeper dive into MS-PowerPivot and MS-Analysis Services in Tabular mode. I got there because my latest engagements exposed me to clients with significant opportunities in reporting. Not the kind you print on paper, but rather strategic analysis of business data and turning it into useful information. Watching plenty of videos on Microsoft websites like http://channel9.msdn.com or http://www.microsoftvirtualacademy.com led me to the point where I wanted to check the claims I found there against some real-world examples.

Two clients of mine provided me with a live set of their data stored in SQL Server 2005 and 2008. One of the companies is a pork packer with approx. $250 million in sales, the other a poultry organization in the $500 million revenue ballpark (both are private organizations and the numbers are really ballpark). For each company I created two report models: one on sales data and one on serialized box data (one row per box), just for the sake of trying to break the reporting tools. For those who don't know, PowerPivot is an extension of MS-Excel, and a powerful one I may add. And not to forget: it is freely available for download. In case this entry inspires you to try it out, please use PowerPivot for SQL Server 2012, even if you run an older version of SQL Server. The name is a little misleading: PowerPivot for SQL Server 2012 is basically version 2 of the tool, and as usual with MS products, version 1 was next to useless.

In the first tests I built report models that mimicked one of the reporting tools they use as part of their ERP system, as a point of orientation. You cannot really compare the two reporting systems directly: they have a lot of overlapping features, but each also has capabilities the other lacks. They complement each other.

I took a sales history dataset of approx. 1 million rows with their respective dimensions. In the ERP system installed in the corporate data center, a full detailed evaluation of that sales history takes approx. 2 hours in the traditional reporting tool.

PowerPivot works differently. It loads all data into memory and reports on the data from memory, either the local memory installed on the computer (PowerPivot in MS-Excel) or the memory installed on a server (Analysis Services in Tabular mode). In this example, I opted for the MS-Excel solution. Both engines are almost identical, except for a few features that only the server engine has, such as row-level security and partitioning. To get the data into memory, it first needs to be extracted from the database. The extraction process in my local environment took approx. 15 minutes for the entire dataset. Once the data was extracted, any reporting I wanted to do on it was performed in less than 1 second, and I mean any: changing sorting, slicing, dicing, calculating sophisticated measures, you name it. The data model for the sales dataset in this example occupied approx. 100 MB in the Excel spreadsheet, and approx. 500 MB of RAM while the spreadsheet was loaded.
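To give a flavor of what I mean by "sophisticated measures", here is a rough sketch of the kind of DAX formulas involved. The table and column names (FactSales, DimDate, SalesAmount) are placeholders for illustration, not my clients' actual schema:

    -- total sales over whatever filter context the pivot table applies
    Total Sales := SUM ( FactSales[SalesAmount] )

    -- year-to-date running total, based on a related date table
    Sales YTD := TOTALYTD ( [Total Sales], DimDate[Date] )

    -- same period one year earlier, for year-over-year comparisons
    Sales Prior Year := CALCULATE ( [Total Sales], SAMEPERIODLASTYEAR ( DimDate[Date] ) )

Every one of these recalculates against whatever slicers and filters are active in the pivot table at that moment, which is exactly the kind of work that stays below the one-second mark.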

Next, I really wanted to see the limits of the tools. MS-Excel has a maximum size of 2 GB per spreadsheet. To kick it up a notch, I used the SSCC data in the corporate database, which was pretty much the largest dataset I could find. The larger of the two clients produces in the neighborhood of 50,000 boxes each day, which are eventually sold or further processed. Each box has a row in that table containing the current status of the box: whether it has been sold and to whom, lot date, production date, inventory location, etc. In this example, the fact table with one row per individual box contained 23 million rows. In my local environment (SQL Server installed on the same computer I used for reporting), the extraction of the entire set of detailed box data took approx. 1 hour. To do something with the data, I decided to calculate the average time between production date and pick date, plus of course total weight, number of boxes, etc. This spreadsheet turned out to come in at approx. 1 GB in size and occupied approx. 5 GB of RAM while slicing and dicing. Notably, there was no real difference in how fluid the actual slicing and dicing of the data felt.
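For reference, measures along those lines can be sketched in DAX roughly like this; the table and column names (FactBoxes, ProductionDate, PickDate, NetWeight) are again placeholders, not the actual SSCC schema:

    -- average number of days a box sits between production and picking,
    -- counting only boxes that actually have a pick date
    Avg Days To Pick := AVERAGEX (
        FILTER ( FactBoxes, NOT ISBLANK ( FactBoxes[PickDate] ) ),
        1.0 * ( FactBoxes[PickDate] - FactBoxes[ProductionDate] )
    )

    -- straightforward totals over the same filter context
    Total Weight := SUM ( FactBoxes[NetWeight] )
    No of Boxes := COUNTROWS ( FactBoxes )

The point is less the formulas themselves than the fact that they evaluate over 23 million in-memory rows without any noticeable delay.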

I am still looking for a larger dataset, but it seems that the size these reporting models can accommodate is large enough to handle all kinds of report-relevant data for most companies in the small to midsize space. In all reality, most companies could easily run on perhaps 10 of these reporting models, one for each major segment of their business. The very same reports built within the MS-Excel environment can easily be migrated into a server environment, making the report model even more scalable.

The BI Semantic Model (BISM), which is what Microsoft calls this technology stack, is in my opinion a serious alternative to traditional data warehousing, especially if you are a smaller organization dealing with result sets of fewer than 100 million rows. In an abstract way, the BISM contains a data extraction tool (the role ETL plays in the traditional data warehouse world), plus the management of all relationships, up to and including the definition of the measures and figures users will report on. Extraction can be managed via partitioning of the data, so that large datasets don't need to be refreshed in their entirety, only the segments that have changed.

At the end of the day, people build data warehouse solutions to let users report on corporate data while relieving the underlying transactional database of that query traffic, so that transactions still run fast while reporting is done on a different database. This is exactly what you can do with these BISMs, and I really don't need to build a data warehouse for that anymore. Data warehouses started developing in the 60s. I think the changes in available hardware over the past five decades may lead us to rethink the traditional way of reporting, and that we are seeing the beginning of the end for traditional data warehouse architectures.
