Analysing the Power BI Data Model to Improve Refresh Rates with DAX Studio


I’ll start with some references:

The Definitive Guide to DAX by Marco Russo and Alberto Ferrari (the definitive experts).
The Guy in a Cube video on slow model refresh.
ChatGPT (I don't know who that is!)
My own testing and experience, using a demo Power BI model built on data from the Contoso Retail data warehouse.

First things first: connecting DAX Studio to your data model in the Power BI service
1. Get the XMLA endpoint URL from Power BI by selecting Workspace settings > License info:
You will need to log in with your Power BI login to connect.
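The endpoint you paste into DAX Studio is a workspace connection string that typically takes the form below (the workspace name is a placeholder for your own):

    powerbi://api.powerbi.com/v1.0/myorg/Your Workspace Name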


2. Connect to your model in the service with DAX Studio as follows (you may need to download and install it first). You can open DAX Studio either independently or from the External Tools ribbon within Power BI Desktop.



File and Model Size
Typical models import data from a data warehouse, and potentially other sources such as Google Analytics, into physical storage. The Vertipaq engine then loads the model into virtual memory. My Contoso Retail Power BI file is 193 MB on disk (that's not very big); it is compressed to reduce storage requirements.

Model Size in DAX Studio (Advanced > View Metrics > Summary)
When I look at the model summary from Advanced > View Metrics in DAX Studio, the model loaded into virtual memory by the Vertipaq engine is 253 MB. Also of interest are the 37 tables and 366 columns: my model only has 8 tables and 1 calculated table, so it will be interesting to work out later what these other tables are.
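View Metrics is essentially a friendly presentation of the engine's dynamic management views (DMVs), so you can also query the storage statistics yourself. A minimal sketch to run as its own query in DAX Studio against the connected model:

    SELECT * FROM $SYSTEM.DISCOVER_STORAGE_TABLES

This lists the engine's internal storage objects as well as your own tables, which should also shed some light on where those extra tables come from.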



So why is this different?

In-memory Compression: Power BI compresses data in the .pbix file to reduce storage, so the file size on disk is often smaller than the in-memory model size. Once loaded, Power BI expands the data for efficient query performance. There is still compression in memory (see the next section); it's just not as much as in the stored file.

Metadata Overhead: The model size shown in DAX Studio includes metadata, indexes, dictionaries, and structures needed for DAX calculations and relationships that Power BI uses while executing queries, which aren’t directly reflected in the file’s storage size.

Cache and Temp Data: DAX Studio may include caches or other temporary data generated during analysis, which increases the apparent size of the model in memory.

Unused Columns or Tables: Any tables or columns loaded but not used may also contribute to the model’s size in DAX Studio, while Power BI might not fully load them in the saved file.

Note: According to the model settings, the model is not using caching (which is off by default), so I can rule out one of the four possibilities.

Power BI License requirements based on model size

The Power BI Pro license allows for models up to 1 GB in size, 10 GB of native storage, and 8 refreshes a day.
The Power BI Premium license allows for models up to 100 GB in size, 100 TB of native storage, and up to 48 refreshes a day.

Note: Beyond just loading the raw model, Power BI needs additional memory overhead for processing, caching, and temporary storage during data refreshes and calculations. This means that even if your model fits within the 1 GB limit, you may see slow performance or even errors during complex operations if you are near that cap.

Semantic Model Refresh Speed (Vertipaq engine: imported tables)

We can easily access the model refresh times in the Power BI workspace: select 'Refresh history' from the three-dot menu next to the semantic model. This gives us our total model refresh benchmarks, which we can average (removing outliers from known busy times as required).
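As an alternative to clicking through the portal, the Power BI REST API exposes the same refresh history; a sketch of the call, where the workspace and dataset IDs are placeholders you substitute with your own:

    GET https://api.powerbi.com/v1.0/myorg/groups/{groupId}/datasets/{datasetId}/refreshes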

So what are the stages of the model refresh and which ones can we troubleshoot and optimize?

1. Data import time: This one we can measure. Adam from the Guy in a Cube video gives a simple suggestion for benchmarking each table's load time: run a SELECT * on each table in the data warehouse and record the time taken (repeat the process at different times of the day, perhaps removing outliers at busy times, and average). There is also network latency to consider with large volumes of data, but let's keep it simple: total refresh time minus total SELECT * time.

Here is an example from my Contoso data warehouse tables:
1. DimCurrency: Time: 0, Rows: 28
2. DimCustomer: Time: 0, Rows: 18,869
3. DimDate: Time: 0, Rows: 648
4. DimGeography: Time: 0, Rows: 274
5. DimProduct: Time: 0, Rows: 2,467
6. DimProductCategory: Time: 0, Rows: 8
7. DimProductSubcategory: Time: 0, Rows: 48
8. DimStore: Time: 0, Rows: 306
9. FactOnlineSales: Time: 2 min 49 sec, Rows: 12.6M

As you can see, only the FactOnlineSales table took any real time to SELECT.
But then you might say: hey, why are you using tables in your model and not views? If a new column appears in the source table, the model refresh will fail.
Fair point, in which case you can run the same tests against your views.
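If you want to script that benchmark rather than time the queries by hand, here is a minimal T-SQL sketch, assuming the Contoso tables sit in the dbo schema of a SQL Server warehouse (adjust the schema, table, and view names to your own source):

    SET STATISTICS TIME ON;             -- elapsed time for each statement appears in the Messages output
    SELECT * FROM dbo.DimCustomer;      -- repeat for each dimension table or view
    SELECT * FROM dbo.FactOnlineSales;  -- the large fact table is the one worth watching
    SET STATISTICS TIME OFF;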

Now, I don't have a gateway set up to connect to my laptop today, so I would need to set something up in the cloud to run the full test. But suppose the model refreshed in 4 minutes and your SELECT * benchmarks totalled 2 minutes and 49 seconds; you can then say that for 1 minute and 11 seconds the Power BI service is doing something else.

So what is happening once the data has been imported? Well, that's when we need to get into the Vertipaq engine. Note that the engine is proprietary and so has secrets we will never learn, but what we do know about the steps it performs is as follows:

Vertipaq engine steps
1. Loads the tables into columns
2. Checks each column for the best sort order
3. Applies compression to create the smallest, fastest model
4. Loads the model into memory

Note: the Vertipaq engine only deals with imported data; DirectQuery tables are processed by the DirectQuery engine.

DAX formula engine
1. Creates calculated columns and calculated tables (in memory); these are not compressed as well as columns processed by Vertipaq
2. Creates relationships between columns

So we need to learn more about what these two engines do.

The Vertipaq Engine
1. Stores imported data into a columnar database (for faster data retrieval).

2. Sorts the columns into the best order for compression techniques (such as RLE)

3. Compresses the columns using various (proprietary) techniques, such as:

a) Value encoding: for example, deducting a constant value from each column entry so that the stored value needs fewer bits and hence less space (see the worked example after this list).

b) Hash encoding (dictionary encoding): creating an in-memory lookup table, similar to a normalization technique, where an ID and the value are stored (e.g. product sizes 0: small, 1: medium, 2: large). The IDs 0, 1, and 2 then replace the values small, medium, and large in the column, so fewer bits are stored.

c) Run Length Encoding (RLE): when values repeat in a column, say entries 1 to 2,000 are all Q1, this can be stored in a table as Q1: 2,000. If entries 2,001 to 3,500 are all Q2, that becomes Q2: 1,500, and so on.
Using this technique the Vertipaq engine can compute results very quickly. Sorting is therefore very important for RLE-style compression. The engine does its own sorting, but you can potentially help it by sorting the data by the columns with the lowest cardinality first.
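To make the value-encoding idea concrete, here is a quick worked example of my own (an illustration, not taken from the engine's documentation): suppose a column holds invoice numbers from 1,000,001 to 1,000,100. Stored as raw values, each number needs around 20 bits; stored as offsets from a base of 1,000,000, the values 1 to 100 fit in 7 bits each, and the engine simply adds the base back when it scans the column.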
Factors impacting compression are therefore as follows:
1. Cardinality: the more unique values, the less compression is possible, increasing the size of the column in memory.
2. Number of repeated values: the more repetition, the better the RLE compression possible.
3. The number of rows is important, but cardinality and repeated values matter more for compression.
4. The data type: for example, automatic date/time should be turned off, as it creates a larger entry than the date field alone. The date-only column is smaller, and if removing the time component results in much lower cardinality, the column can be compressed much further using either RLE or hash encoding (see the note after this list for where to find the setting).
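For reference, the auto date/time option mentioned in point 4 normally lives in Power BI Desktop under File > Options and settings > Options > Data Load > Time intelligence > Auto date/time; the exact wording varies a little between versions, so treat that path as a pointer rather than gospel.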

Primary keys, such as integer user IDs, will most likely use value encoding.
Numbers such as sales amounts that don't have massive outliers will probably also be value encoded; otherwise, dictionary encoding may be used.

To determine the best sort order, the engine takes a sample of the first x rows (I think it's a million), so it's important that those first rows are of good quality. You can therefore help the Vertipaq engine by ensuring your initial data is representative of the whole dataset; otherwise the engine may have to re-encode data, for example if it later finds outliers and its value encoding no longer works, which slows down the process.

We can look at the type of encoding used for each column in DAX Studio (Advanced > View Metrics).


First, you'll see an overview of the tables. The columns within a table can be encoded differently, so at the table level you'll often see the encoding type shown as 'Many'.
You need to drill down to get to the column level (or you could just select the Columns tab on the left-hand side).
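The same column-level detail is available straight from the DMVs if you prefer a query; a minimal sketch to run in DAX Studio:

    SELECT * FROM $SYSTEM.DISCOVER_STORAGE_TABLE_COLUMNS

In that DMV the COLUMN_ENCODING field reports the encoding chosen per column (1 generally means hash/dictionary encoding and 2 means value encoding, though it's worth cross-checking against what View Metrics shows).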

Now, let’s have a look at the FactOnlineSales table.

The view is sorted by default by column size, which is the size in bytes that each column consumes in memory.

The first column, OnlineSalesKey, has 12,627,608 rows and a cardinality of 12,627,608; as the name suggests it is a unique key, so its cardinality (number of unique values) equals the row count.
It is also the largest column, consuming roughly 84 MB of memory and 31.77% of the database.
It is encoded as the VALUE type:

Rows: 12,627,608
Cardinality: 12,627,608
Col Size: 84,156,128 bytes (Data + Dictionary + Hier Size)
Data: 33,645,560
Dictionary: 128
Hier Size: 50,510,440
% Table: 33.07%
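As a quick sanity check on those figures: 33,645,560 + 128 + 50,510,440 = 84,156,128 bytes, which matches the column size reported and is the roughly 84 MB quoted above.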

Unique keys are generally required fields in most tables, so they are not usually somewhere you can save space by optimizing (unless you can reduce the size of the key somehow).

By applying the same analysis, you can determine which columns and tables have the biggest impact on your model size and are therefore slowing down your model refresh.