
  • Testing Dual mode in Power BI and BigQuery



    In this test, we will try to measure the performance improvement of using dual mode on a Power BI dimension table, this time using a cloud database and a shared dataset, so we avoid the performance issues experienced in the first dual mode test on a laptop.

    Test 1
    In the first test, we will use a single dimension table and a single fact table set to DirectQuery mode.

    I will use a public dataset from Google BigQuery.
    Get Data > Google BigQuery. I enter the project ID and am prompted for my Google login.

    BigQuery connection from Power BI

    I can then navigate to the dataset I want to work with.

    Navigate to BigQuery dataset
    Table Selection

    We choose a table from the New York taxi trips dataset, which has over 16 million rows.

    SQL row count.
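
    For reference, the row count can be checked directly in BigQuery with a simple query. A minimal sketch (the table name assumes the tlc_green_trips_2016 table used later in this test):

    SELECT
      COUNT(*) AS row_count
    FROM
      `bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2016`;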

    This is configured with a DirectQuery connection.

    Selecting table import mode


    I will then add some simple measures for testing as follows:

    Measures

    Next, we’ll add the taxi_zone_geom table as a dimension, initially in DirectQuery mode. It has 263 rows.

    I find there are duplicates in my dimension table, which is a pain, so I need a new view in BigQuery to fix this.
    I will create a simple view as follows (replace customer-myprojectid with your own project ID):

    CREATE OR REPLACE VIEW `customer-myprojectid.pbitesting.DimZone` AS
    SELECT DISTINCT
      Zone_id,
      Zone_name
    FROM
      `bigquery-public-data.new_york_taxi_trips.taxi_zone_geom`;
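
    A quick way to confirm the duplicates in the source, and to sanity-check the new view, is a pair of queries like these (a sketch to run in the BigQuery console):

    -- zone ids that appear more than once in the public table
    SELECT
      Zone_id,
      COUNT(*) AS cnt
    FROM
      `bigquery-public-data.new_york_taxi_trips.taxi_zone_geom`
    GROUP BY Zone_id
    HAVING COUNT(*) > 1;

    -- row count of the de-duplicated view
    SELECT
      COUNT(*) AS row_count
    FROM
      `customer-myprojectid.pbitesting.DimZone`;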

    I can then connect Power BI to BigQuery using my own project ID and select my de-duped dimension table.

    Table Selection DimZone


    DimZone Distinct Values


    Next, I change the zone_id and pickup_location_id fields to whole numbers to save space.

    I will then connect my DimZone table to my tlc_green_trips_2016 table using zone_id and pickup_location_id.

    Model relationships
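
    In SQL terms, the relationship behaves like the join below. A sketch only: fare_amount is an assumed measure column for illustration, and the join columns must have matching types:

    SELECT
      d.Zone_name,
      SUM(f.fare_amount) AS total_fares  -- fare_amount assumed for illustration
    FROM
      `bigquery-public-data.new_york_taxi_trips.tlc_green_trips_2016` AS f
    JOIN
      `customer-myprojectid.pbitesting.DimZone` AS d
      ON f.pickup_location_id = d.Zone_id
    GROUP BY d.Zone_name;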

    I then change my DimZone dimension table to dual mode.

    Selecting Dual Mode



    Then I publish my dataset to the Power BI service and we can connect to it with a new report.

    Get Data > Power BI semantic models >

    Selecting the semantic dataset

    When I try to connect, I get the following BigQuery authentication error:

    DataSource.Error: ADBC: The service bigquery has thrown an exception. HttpStatusCode is Unauthorized. Request is missing the required authentication credentials. Expected OAuth 2 access token, login cookie, or other valid authentication credential

    So I need to add the credentials for accessing BigQuery in the dataset settings, and then sign in with my Google account.

    Configuring dataset credentials

    My settings then look like this:

    dataset credentials for BigQuery

    I can then return to my Power BI Desktop report and connect to my semantic model.
    I can now view the tables in my shared dataset.

    Dataset tables



    I can then build a simple matrix with my measures from the tlc_green_trips_2016 fact table and the Zone_name from my dimension table, and add a Zone_name slicer to the report.
    It takes about 40 seconds to load.

    Matrix for taxi data

    Initial testing results are interesting. Even though caching is switched off in the dataset settings, we still see caching when testing with Performance Analyzer. Apparently this is due to session-level caching, which can’t be switched off.

    Checking the report query settings in Power BI

    We didn’t see this session-level caching in the previous test, so we will have to make different selections each time for the dual mode vs import mode test. We also can’t do the full-table test, as this is affected by session-level caching too.

    In the first test, we will use one dimension table and a fact table in DirectQuery mode. We will test the dimension table in dual mode and then in import mode, and see the impact on performance.

    Dual Mode Test 1 Results: 1 Dimension Table and 1 Fact Table in Direct Query Mode

    Dual mode results test 1

    After excluding the outliers, which look more like anomalies caused by resource limitations than part of the norm, we can see that import mode trends faster than dual mode, and the t-test shows a significant difference between the 2 sets of observations.



    Test 2

    In the next test, we will test dual mode with 2 different fact tables: one in import mode and the other in DirectQuery mode. This is where we should see the performance improvement of using dual mode.
    We start by adding an additional fact table to the model, but this time in import mode.

    Adding another table

    This is the tlc_green_trips_2017 table, which has 12.7M rows.
    Then I rebuild the model: the 2016 table is in DirectQuery mode, the DimZone table is in dual mode, and the 2017 table is in import mode. I then add some additional measures for each table to use in the matrix test.

    Model 2 with dual mode


    My new test matrix now has some measures from both my fact tables: 2016 (DirectQuery) and 2017 (import).

    New Matrix design

    Test Results:
    The trend appears to favor import mode again, which was somewhat disappointing, as I wanted to see an improvement similar to the one I saw using SQL Server locally.

    Running a Test using Python in Power BI
    We can use the Python visual to create some Python graphs and a t-test.
    I use AI to generate the code for me, and I add it into the Python script editor.
    I then add the Zone, mode, and load time fields to the value wells of the Visualizations pane.

    The results are shown below, above the box plot, with a p-value of 0.0001, or a 1 in 10,000 chance of the difference arising by chance. It seems that import mode is faster again. Perhaps dual mode doesn’t work as well against BigQuery. Testing seems to be the only way of finding out: models may be faster in some circumstances than in others.


  • SQL Window Functions


    Summary


    1. Aggregate Window Functions
    These functions perform calculations across a set of table rows that are somehow related to the current row.

    SUM() OVER(…) – Running total or sum per partition
    AVG() OVER(…) – Average per partition
    COUNT() OVER(…) – Count rows in partition
    MIN() OVER(…) – Minimum value in partition
    MAX() OVER(…) – Maximum value in partition

    2. Ranking Window Functions
    These assign a rank or row number to each row within a partition.

    ROW_NUMBER() OVER(…) – Unique sequential number per row in partition
    RANK() OVER(…) – Rank with gaps for ties
    DENSE_RANK() OVER(…) – Rank without gaps for ties
    NTILE(n) OVER(…) – Divides partition into n buckets and assigns a bucket number

    3. Value Navigation (Analytic) Functions
    These functions return values from other rows in the window frame relative to the current row.

    LAG(expression, offset, default) OVER(…) – Value from a previous row
    LEAD(expression, offset, default) OVER(…) – Value from a following row
    FIRST_VALUE(expression) OVER(…) – First value in the window frame
    LAST_VALUE(expression) OVER(…) – Last value in the window frame
    NTH_VALUE(expression, n) OVER(…) – The nth value in the window frame


    Examples


    1. Aggregate Window Functions

    These perform calculations over a window of rows but return a value for each row.

    Name    Sales   Date
    Alice   100     2025-01-01
    Alice   150     2025-01-03
    Bob     200     2025-01-01
    Alice   50      2025-01-05
    Bob     300     2025-01-04


    SUM() OVER()

    SELECT
      Name,
      Sales,
      SUM(Sales) OVER (PARTITION BY Name ORDER BY Date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS RunningTotal
    FROM SalesData;


    The window function SUM(Sales) OVER (...) calculates a running total of sales for each Name.

    The window is partitioned by Name (so calculations are done separately for each person).

    The rows are ordered by Date within each partition.

    The frame ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW means the sum includes all rows from the first date up to the current row’s date.

    Output:

    Name    Sales   RunningTotal
    Alice   100     100
    Alice   150     250
    Alice   50      300
    Bob     200     200
    Bob     300     500


    AVG() OVER()

    customer_id   sales
    101           200
    102           150
    101           300
    103           400
    102           250
    101           100

    SELECT
      customer_id,
      sales,
      AVG(sales) OVER (PARTITION BY customer_id) AS AvgSales
    FROM orders;
    • The window function AVG(sales) OVER (PARTITION BY customer_id) calculates the average sales for each customer_id.
    • The average is computed over all rows with the same customer_id.
    • The result is shown on every row corresponding to that customer.
    customer_id   sales   AvgSales
    101           200     200
    101           300     200
    101           100     200
    102           150     200
    102           250     200
    103           400     400


    COUNT(*) OVER

    department   employee_id
    Sales        101
    Sales        102
    HR           201
    HR           202
    HR           203
    IT           301

    SELECT
      department,
      employee_id,
      COUNT(*) OVER (PARTITION BY department) AS DeptCount
    FROM employees;
    • The window function COUNT(*) OVER (PARTITION BY department) counts the total number of employees in each department.
    • This count is repeated on every row for that department.
    • No ordering is required here because the count is the same for all rows in the partition.

    department   employee_id   DeptCount
    Sales        101           2
    Sales        102           2
    HR           201           3
    HR           202           3
    HR           203           3
    IT           301           1
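
    MIN() and MAX() from the summary follow the same pattern as AVG() and COUNT(*); here is a minimal sketch against the same orders table used in the AVG() example:

    MIN() / MAX() OVER()

    SELECT
      customer_id,
      sales,
      MIN(sales) OVER (PARTITION BY customer_id) AS MinSales,  -- smallest sale per customer
      MAX(sales) OVER (PARTITION BY customer_id) AS MaxSales   -- largest sale per customer
    FROM orders;

    Each row shows the smallest and largest sale for its customer_id, repeated across the partition.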


    2. Ranking Functions

    Assign ranks or row numbers within partitions.

    ROW_NUMBER() — Assigns unique row numbers

    employee_id   department   salary
    101           Sales        5000
    102           Sales        7000
    103           Sales        6000
    201           HR           4500
    202           HR           4800
    301           IT           8000

    SELECT
      employee_id,
      salary,
      ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS RowNum
    FROM employees;

    employee_id   salary   RowNum
    102           7000     1
    103           6000     2
    101           5000     3
    202           4800     1
    201           4500     2
    301           8000     1

    The window function ROW_NUMBER() assigns a unique sequential number to each row within the partition defined by department.

    Rows are ordered by salary in descending order within each department.

    The highest salary in each department gets RowNum = 1, the next highest gets 2, and so on.

    RANK() — Assigns rank, with gaps for ties

    employee_id   department   salary
    101           Sales        7000
    102           Sales        7000
    103           Sales        6000
    201           HR           4800
    202           HR           4500
    301           IT           8000
    302           IT           8000
    303           IT           7500

    SELECT
      employee_id,
      salary,
      -- RANK is a reserved word in some dialects, so we use SalaryRank as the alias
      RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS SalaryRank
    FROM employees;

    The RANK() function assigns a rank to each employee within their department, ordering by salary descending.

    Employees with the same salary get the same rank.

    The next rank after a tie skips the appropriate number of positions (i.e., gaps in ranking).

    employee_id   salary   SalaryRank
    101           7000     1
    102           7000     1
    103           6000     3
    201           4800     1
    202           4500     2
    301           8000     1
    302           8000     1
    303           7500     3


    DENSE_RANK() — Like RANK but no gaps

    employee_id   department   salary
    101           Sales        7000
    102           Sales        7000
    103           Sales        6000
    201           HR           4800
    202           HR           4500
    301           IT           8000
    302           IT           8000
    303           IT           7500

    SELECT
      employee_id,
      salary,
      DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS DenseRank
    FROM employees;
    • DENSE_RANK() assigns ranks within each department, ordered by salary descending.
    • Employees with the same salary get the same rank.
    • Unlike RANK(), DENSE_RANK() does not skip ranks after ties. The next distinct value gets the next consecutive rank.
    employee_id   salary   DenseRank
    101           7000     1
    102           7000     1
    103           6000     2
    201           4800     1
    202           4500     2
    301           8000     1
    302           8000     1
    303           7500     2
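
    NTILE(n) from the summary has no example above; here is a minimal sketch against the same employees table, splitting each department’s salaries into two buckets:

    NTILE() — Divides rows into buckets

    SELECT
      employee_id,
      salary,
      -- 1 = top half of salaries in the department, 2 = bottom half
      NTILE(2) OVER (PARTITION BY department ORDER BY salary DESC) AS Bucket
    FROM employees;

    Rows in each department are split into two roughly equal buckets by descending salary; when the rows don’t divide evenly, the earlier buckets receive the extra row.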



    3. Value Navigation Functions

    Access values from other rows relative to the current row.

    LAG() — Value from previous row

    employee_id   department   salary
    101           Sales        5000
    102           Sales        6000
    103           Sales        7000
    201           HR           4500
    202           HR           4800
    301           IT           8000

    SELECT
      employee_id,
      salary,
      LAG(salary, 1) OVER (PARTITION BY department ORDER BY salary) AS PrevSalary
    FROM employees;

    LAG(salary, 1) returns the salary value from the previous row within the same department, based on the ascending order of salary.

    For the first row in each department, there is no previous salary, so the result is NULL.

    employee_id   salary   PrevSalary
    101           5000     NULL
    102           6000     5000
    103           7000     6000
    201           4500     NULL
    202           4800     4500
    301           8000     NULL



    LEAD() — Value from next row

    employee_id   department   salary
    101           Sales        5000
    102           Sales        6000
    103           Sales        7000
    201           HR           4500
    202           HR           4800
    301           IT           8000

    SELECT
      employee_id,
      salary,
      LEAD(salary, 1) OVER (PARTITION BY department ORDER BY salary) AS NextSalary
    FROM employees;


    LEAD(salary, 1) returns the salary value from the next row within the same department, based on the ascending order of salary.

    For the last row in each department, there is no next salary, so the result is NULL.


    employee_id   salary   NextSalary
    101           5000     6000
    102           6000     7000
    103           7000     NULL
    201           4500     4800
    202           4800     NULL
    301           8000     NULL


    LAST_VALUE() — Last value in the window frame

    employee_id   department   salary
    101           Sales        5000
    102           Sales        7000
    103           Sales        6000
    201           HR           4500
    202           HR           4800
    301           IT           8000

    SELECT
      employee_id,
      salary,
      LAST_VALUE(salary) OVER (
        PARTITION BY department
        ORDER BY salary
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
      ) AS MaxSalary
    FROM employees;

    • LAST_VALUE(salary) returns the last salary value in the ordered partition.
    • The window frame ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ensures the entire partition is considered (not just up to the current row).
    • When ordered by salary ascending, the last value is the highest salary in the department.
    employee_id   salary   MaxSalary
    101           5000     7000
    103           6000     7000
    102           7000     7000
    201           4500     4800
    202           4800     4800
    301           8000     8000
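
    FIRST_VALUE() and NTH_VALUE() from the summary behave similarly; here is a minimal sketch against the same employees table:

    FIRST_VALUE() / NTH_VALUE()

    SELECT
      employee_id,
      salary,
      -- first value in ascending salary order = lowest salary in the department
      FIRST_VALUE(salary) OVER (
        PARTITION BY department
        ORDER BY salary
      ) AS MinSalary,
      -- second value in ascending salary order; NULL if the department has only one row
      NTH_VALUE(salary, 2) OVER (
        PARTITION BY department
        ORDER BY salary
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
      ) AS SecondLowest
    FROM employees;

    As with LAST_VALUE(), NTH_VALUE() needs the full-partition frame to return a stable result on every row; with the default frame it would return NULL until the nth row is reached.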