Author: CB

  • Power BI Report Builder Report Page formatting basics

    While many people use Power BI, few ever try Report Builder, yet it is a handy tool for certain projects. If you have large volumes of detailed data that would take up too much space in your data model, or you find that DirectQuery is too slow, paginated reports can provide an alternative.

    In this example, we’ll grab some data from Kaggle. I want a big customer file that I can work with. In real life, you’re most likely to be importing this from a Data Warehouse, but the only difference here is the connection and import; all the methods of work will be the same once you’ve imported the data into your dataset.
    Here is the file I will use, entitled Bank Customer Segmentation 1M+ rows.
    https://www.kaggle.com/datasets/shivamb/bank-customer-segmentation

    The file we have is from a fictitious bank; the data has been randomly generated.
    On inspection in Excel, the file is 1,048,569 rows (just under Excel's display limit of 1,048,576 rows).

    report builder Transactions example


    You can download the report builder for free from here:

    https://www.microsoft.com/en-us/download/details.aspx?id=10594

    Open the report builder after you’ve installed it and select a blank report to get started:

    report builder creation

    Now, the first thing we want to do is get our data, so right-click on the Datasets folder and select 'Get Data'. You can log in to your Power BI account here; if you don't have one, there is a free trial. I'll let you figure that bit out.

    Once you’ve logged in, you’ll get the data source screen where you can select your data source. There are a whole host of data connectors to choose from, including data warehouses such as Google BigQuery, SQL Server, and Snowflake, as well as other tools such as Google Analytics. In this case, because I’m lazy, I will just use the CSV file that I downloaded, but the choice doesn’t matter for learning Report Builder.

    report builder data source

    On the next screen, you would normally enter your data connection information, but in this example there is the option of uploading a file at the top, below the connection settings.

    The main difference is that with a connection to a data warehouse, once you publish your report to the Power BI service, the report will refresh with the latest data when it is reloaded, whereas with an uploaded file I will need to upload the file again to get new data.

    report builder data source connection

    Once my file is uploaded, I get a preview of the data, and I can change a few things, like the delimiter, which is set to comma-delimited by default.

    importing data


    Next, I click on Transform Data, which opens a scaled-down version of Power Query compared to the one included with Power BI Desktop.

    report builder power query



    Here you can do things like add custom columns and format your data.
    Click Create to import your dataset from Power Query into the Report Builder design window.

    Your first dataset


    So let’s start by building a simple paginated report. I will use a simple table: Insert a table, and then I will drag my fields into the table.

    report builder adding a table

    Design and formatting the paginated report

    First, we need to turn on the properties window (View > Properties), where we can easily edit things like fonts and background colours.

    I got our mutual friend to create a logo for me, who also supplied the background colours in RGB format.

    For the logo, I will add it to the header of the report (Insert > Header). I add a rectangle and move the image into the rectangle, so I can control its properties. The design is now coming to life (Roar).

    Report builder adding a header

    I can preview the report to test how it works by selecting Home > Run:
    It takes a while to load as it loads the whole dataset, not page by page.

    Below is a screenshot of the report in preview mode. There are a million rows, so it breaks the report across pages by default: it paginates the report. The logo in the header displays on every page, but the column headers do not, so I need to fix that.

    Report builder preview report

    Fixing Row Headers in place with Power BI Report Builder

    To fix the row headers in place, we need to do the following:
    Select the table (tablix) and enable advanced mode in the bottom-right corner of the Grouping pane. The row and column groups will display at the bottom of the screen.

    manage rows in report builder


    In the row groups box (bottom left), I can control the positioning of the rows during pagination and scrolling.
    I set the following for the Title row and the headers row:

    FixedData = True. This will fix the row in position when I scroll.
    KeepWithGroup = After. This stops the row from breaking away from the row below it.
    RepeatOnNewPage = True. This repeats the row on each page.

    I set these on both the title row and the headers row (click on the static member for each, and the properties will show on the right). To turn on the Properties pane, click View > Properties on the main ribbon.

    Now, when I scroll, the header and title are fixed at the top of the page and also stay in place when I navigate pages.

    fixing the top row

    Now we have a basic report working, let’s publish it to the service. Remember, if we had a connection to a data warehouse table, the report would grab the latest data for us, but for now, we’re lazy and just using a static file that won’t change.

    To do so, select File > Publish and choose the workspace you want to publish to:

    Once I’ve published my report, I can view it in the service and give others access:
    The report takes a while to load the 1M-plus rows. Once it’s loaded, you can see some of the features available, such as export to Excel, which is a popular option. You can also set up a subscription so the report will run on a schedule and be emailed to you, which can be a useful option for some reports.

    exporting to excel
    subscribing to a report
  • SQL Window Functions

    Summary


    1. Aggregate Window Functions
    These functions perform calculations across a set of table rows that are somehow related to the current row.

      SUM() OVER(…) – Running total or sum per partition
      AVG() OVER(…) – Average per partition
      COUNT() OVER(…) – Count rows in partition
      MIN() OVER(…) – Minimum value in partition
      MAX() OVER(…) – Maximum value in partition

      2. Ranking Window Functions
      These assign a rank or row number to each row within a partition.

        ROW_NUMBER() OVER(…) – Unique sequential number per row in partition
        RANK() OVER(…) – Rank with gaps for ties
        DENSE_RANK() OVER(…) – Rank without gaps for ties
        NTILE(n) OVER(…) – Divides partition into n buckets and assigns a bucket number

        3. Value Navigation (Analytic) Functions
        These functions return values from other rows in the window frame relative to the current row.

          LAG(expression, offset, default) OVER(…) – Value from a previous row
          LEAD(expression, offset, default) OVER(…) – Value from a following row
          FIRST_VALUE(expression) OVER(…) – First value in the window frame
          LAST_VALUE(expression) OVER(…) – Last value in the window frame
          NTH_VALUE(expression, n) OVER(…) – The nth value in the window frame


          Examples


          1. Aggregate Window Functions

          These perform calculations over a window of rows but return a value for each row.

          Name    Sales  Date
          Alice   100    2025-01-01
          Alice   150    2025-01-03
          Bob     200    2025-01-01
          Alice   50     2025-01-05
          Bob     300    2025-01-04


          SUM() OVER()

          SELECT
            Name,
            Sales,
            SUM(Sales) OVER (PARTITION BY Name ORDER BY Date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS RunningTotal
          FROM SalesData;


          The window function SUM(Sales) OVER (...) calculates a running total of sales for each Name.

          The window is partitioned by Name (so calculations are done separately for each person).

          The rows are ordered by Date within each partition.

          The frame ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW means the sum includes all rows from the first date up to the current row’s date.

          Output:

          Name    Sales  RunningTotal
          Alice   100    100
          Alice   150    250
          Alice   50     300
          Bob     200    200
          Bob     300    500


          AVG() OVER()

          customer_id  sales
          101          200
          102          150
          101          300
          103          400
          102          250
          101          100
          SELECT 
          customer_id, 
          sales, 
          AVG(sales) OVER (PARTITION BY customer_id) AS AvgSales 
          FROM orders;
          • The window function AVG(sales) OVER (PARTITION BY customer_id) calculates the average sales for each customer_id.
          • The average is computed over all rows with the same customer_id.
          • The result is shown on every row corresponding to that customer.
          customer_id  sales  AvgSales
          101          200    200
          101          300    200
          101          100    200
          102          150    200
          102          250    200
          103          400    400


          COUNT(*) OVER

          department  employee_id
          Sales       101
          Sales       102
          HR          201
          HR          202
          HR          203
          IT          301
          SELECT
            department,
            employee_id,
            COUNT(*) OVER (PARTITION BY department) AS DeptCount
          FROM employees;
          • The window function COUNT(*) OVER (PARTITION BY department) counts the total number of employees in each department.
          • This count is repeated on every row for that department.
          • No ordering is required here because the count is the same for all rows in the partition.

          department  employee_id  DeptCount
          Sales       101          2
          Sales       102          2
          HR          201          3
          HR          202          3
          HR          203          3
          IT          301          1


          2. Ranking Functions

          Assign ranks or row numbers within partitions.

          ROW_NUMBER() — Assigns unique row numbers

          employee_id  department  salary
          101          Sales       5000
          102          Sales       7000
          103          Sales       6000
          201          HR          4500
          202          HR          4800
          301          IT          8000

          SELECT
            employee_id,
            salary,
            ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS RowNum
          FROM employees;

          employee_id  salary  RowNum
          102          7000    1
          103          6000    2
          101          5000    3
          202          4800    1
          201          4500    2
          301          8000    1

          The window function ROW_NUMBER() assigns a unique sequential number to each row within the partition defined by department.

          Rows are ordered by salary in descending order within each department.

          The highest salary in each department gets RowNum = 1, the next highest gets 2, and so on.

          RANK() — Assigns rank, with gaps for ties

          employee_id  department  salary
          101          Sales       7000
          102          Sales       7000
          103          Sales       6000
          201          HR          4800
          202          HR          4500
          301          IT          8000
          302          IT          8000
          303          IT          7500
          SELECT
            employee_id,
            salary,
            RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS Rank
          FROM employees;

          The RANK() function assigns a rank to each employee within their department, ordering by salary descending.

          Employees with the same salary get the same rank.

          The next rank after a tie skips the appropriate number of positions (i.e., gaps in ranking).

          employee_id  salary  Rank
          101          7000    1
          102          7000    1
          103          6000    3
          201          4800    1
          202          4500    2
          301          8000    1
          302          8000    1
          303          7500    3


          DENSE_RANK() — Like RANK but no gaps

          employee_id  department  salary
          101          Sales       7000
          102          Sales       7000
          103          Sales       6000
          201          HR          4800
          202          HR          4500
          301          IT          8000
          302          IT          8000
          303          IT          7500
          SELECT
            employee_id,
            salary,
            DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS DenseRank
          FROM employees;
          • DENSE_RANK() assigns ranks within each department, ordered by salary descending.
          • Employees with the same salary get the same rank.
          • Unlike RANK(), DENSE_RANK() does not skip ranks after ties; the next distinct value gets the next consecutive rank.
          employee_id  salary  DenseRank
          101          7000    1
          102          7000    1
          103          6000    2
          201          4800    1
          202          4500    2
          301          8000    1
          302          8000    1
          303          7500    2
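
          NTILE(n) – Divides partition into n buckets

          NTILE() from the summary is not shown above, so here is a minimal sketch using the same employees table, assuming we want to split each department into two salary buckets:

          SELECT
            employee_id,
            salary,
            NTILE(2) OVER (PARTITION BY department ORDER BY salary DESC) AS SalaryHalf
          FROM employees;

          Each department's rows are split into 2 buckets in salary order; when a partition does not divide evenly (e.g. the three HR rows), the earlier buckets get the extra rows, so HR would have two rows in bucket 1 and one in bucket 2.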



          3. Value Navigation Functions

          Access values from other rows relative to the current row.

          LAG() — Value from previous row

          employee_id  department  salary
          101          Sales       5000
          102          Sales       6000
          103          Sales       7000
          201          HR          4500
          202          HR          4800
          301          IT          8000
          SELECT
            employee_id,
            salary,
            LAG(salary, 1) OVER (PARTITION BY department ORDER BY salary) AS PrevSalary
          FROM employees;

          LAG(salary, 1) returns the salary value from the previous row within the same department, based on the ascending order of salary.

          For the first row in each department, there is no previous salary, so the result is NULL.

          employee_id  salary  PrevSalary
          101          5000    NULL
          102          6000    5000
          103          7000    6000
          201          4500    NULL
          202          4800    4500
          301          8000    NULL



          LEAD() — Value from next row

          employee_id  department  salary
          101          Sales       5000
          102          Sales       6000
          103          Sales       7000
          201          HR          4500
          202          HR          4800
          301          IT          8000

          SELECT
            employee_id,
            salary,
            LEAD(salary, 1) OVER (PARTITION BY department ORDER BY salary) AS NextSalary
          FROM employees;


          LEAD(salary, 1) returns the salary value from the next row within the same department, based on the ascending order of salary.

          For the last row in each department, there is no next salary, so the result is NULL.


          employee_id  salary  NextSalary
          101          5000    6000
          102          6000    7000
          103          7000    NULL
          201          4500    4800
          202          4800    NULL
          301          8000    NULL


          LAST_VALUE() — Last value in the window frame

          employee_id  department  salary
          101          Sales       5000
          102          Sales       7000
          103          Sales       6000
          201          HR          4500
          202          HR          4800
          301          IT          8000

          SELECT
            employee_id,
            salary,
            LAST_VALUE(salary) OVER (PARTITION BY department ORDER BY salary ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS MaxSalary
          FROM employees;
          
          • LAST_VALUE(salary) returns the last salary value in the ordered partition.
          • The window frame ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ensures the entire partition is considered (not just up to the current row).
          • When ordered by salary ascending, the last value is the highest salary in the department.
          employee_id  salary  MaxSalary
          101          5000    7000
          103          6000    7000
          102          7000    7000
          201          4500    4800
          202          4800    4800
          301          8000    8000
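
          FIRST_VALUE() – First value in the window frame

          FIRST_VALUE() from the summary is not shown above either; here is a minimal sketch on the same employees table. Ordered by salary ascending, the first value in each department's window is its lowest salary:

          SELECT
            employee_id,
            salary,
            FIRST_VALUE(salary) OVER (PARTITION BY department ORDER BY salary) AS MinSalary
          FROM employees;

          NTH_VALUE(expression, n) works in the same way for the nth row of the frame, although not every database supports it.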

        1. Testing Power BI DAX Measures Speed with DAX Studio

          The execution flow: DAX expression > DAX query > Vertipaq query.

          Test 1: Testing the Impact of USERELATIONSHIP() in a DAX Measure

          Does using USERELATIONSHIP() with an inactive relationship slow down a DAX measure? I’m not sure, so I thought I would test it out.

          To test this, I create a measure using the existing active relationship between my date table and my FactOnlineSales table. Then I create a copy of the date table and join it to FactOnlineSales in the same way, but make the relationship inactive.

          I use DAX Studio to create 2 measures for calculating total sales, and then I throw in some filters for year and continent as follows:

          DEFINE
              MEASURE '_Measures'[test1] =
                  CALCULATE (
                      SUMX (
                          'FactOnlineSales',[Sales Amount]
                      ),
                      
                      'DimDate'[CalendarYear] = 2009,
                      'DimGeography'[ContinentName] = "Europe"
                      ) 
          
              MEASURE '_Measures'[test2] =
                  CALCULATE (
                      SUMX (
                          'FactOnlineSales',[Sales Amount]
                      ),
                      'DimDate (2)'[CalendarYear] = 2009,
                    	'DimGeography'[ContinentName] = "Europe",
                      USERELATIONSHIP('DimDate (2)'[Datekey], FactOnlineSales[DateKey])
                  )
          EVALUATE
          ROW (
              "test", [test2] 
          )

          Below is a screenshot of my DAX Studio settings. As you can see, I have the ‘Clear cache on Run’ button selected, which helps ensure each test starts from the same point. I also have the ‘Server Timings’ button selected. Note: be careful using ‘Clear cache on Run’ on a connection to a production model, as it will clear the cache (and hence slow down reports).

          For the original sales measure, which has an active relationship to the date table, I get a total time of 2,197 ms, broken down into 624 ms for the FE (formula engine) and 1,573 ms for the SE (storage engine, i.e. the Vertipaq engine). The FE turns the DAX expression into a DAX query and creates a query plan that it sends to the SE to retrieve the results. The FE also caches query results (this is why I have turned on ‘Clear cache on Run’; otherwise subsequent queries would return results very quickly).

          Comparing the 2 measures
          Here we run the query on each measure 5 times to get an average, and as you can see there was very little difference in performance. Indeed, the first 3 runs of the measure with the active relationship took longer (but this may have been because my laptop was busy doing something else). My results are below; I think they are pretty inconclusive, but they are interesting nonetheless, and the method would be worth repeating on a larger model.


          In the Server Timings you can view the Vertipaq query that is executed by Analysis Services:

          Each step of the execution is below:
          The first step selects the data from FactOnlineSales with the 2 filters, returning 1.4M rows of data (133 KB).
          The second step does the sum of the sales amount and returns one row. The third step returns the start and end time as well as other information (if you mouse over the query).

          The Par. field stands for Parallelism, which is the number of parallel threads used by the storage engine to retrieve the result at each step.

        2. Analysing the Power BI Data Model to Improve Refresh Rates with DAX Studio


          I’ll start with some references:

          The Definitive Guide to DAX book by Marco Russo and Alberto Ferrari (the definitive experts).
          The Guy in the Cube video on slow model refresh:
          ChatGPT (I don’t know who that is!)
          My own testing, experience, and a demo Power BI model using data from the Contoso Retail data warehouse.

          First things first: connecting DAX Studio to your data model in the Power BI service
          1. Get the XMLA endpoint URL from Power BI by selecting Workspace settings > License info:
          You will need to log in with your Power BI login to connect.


          2. You can connect to your model in the service with DAX Studio as follows (you may need to download and install it first). You can open it either independently or from the External Tools ribbon within Power BI Desktop.



          File and Model Size
          Typical models use data imported into physical storage from a data warehouse and potentially other sources, such as Google Analytics. The Vertipaq engine stores the model in virtual memory. My Contoso Retail Power BI file is 193 MB (that’s not very big). It is compressed to reduce storage requirements.

          Model Size in DAX Studio (Advanced > View Metrics > Summary)
          When I look at the model summary from the advanced Metrics in DAX studio, the model loaded into virtual memory by the Vertipaq engine is 253 MB. Also of interest are the 37 tables and 366 columns. My model only has 8 tables and 1 calculated table, so it will be interesting later to understand what these other tables are.



          So why is this different?

          In-memory Compression: Power BI compresses data in the .pbix file to reduce storage, so the file size on disk is often smaller than its in-memory model size. However, once loaded, Power BI expands it for efficient query performance. There is still compression (next section), it’s just not as much as in the stored file.

          Metadata Overhead: The model size shown in DAX Studio includes metadata, indexes, dictionaries, and structures needed for DAX calculations and relationships that Power BI uses while executing queries, which aren’t directly reflected in the file’s storage size.

          Cache and Temp Data: DAX Studio may include caches or other temporary data generated during analysis, which increases the apparent size of the model in memory.

          Unused Columns or Tables: Any tables or columns loaded but not used may also contribute to the model’s size in DAX Studio, while Power BI might not fully load them in the saved file.

          Note: According to the model settings, the model is not using caching (which is off by default), so I can rule out one of the four possibilities.

          Power BI License requirements based on model size

          The Power BI Pro license allows for models up to 1GB in size, 10 GB of native storage, and 8 refreshes a day
          The Power BI Premium license allows for models up to 100GB in size and 100 TB of native storage and up to 48 refreshes a day.

          Note: Beyond just loading the raw model, Power BI needs additional memory overhead for processing, caching, and temporary storage during data refreshes and calculations. This means that even if your model fits within the 1 GB limit, you’re likely to experience slow performance or even errors during complex operations, especially if near that cap.

          Semantic Model Refresh Speed (Vertipaq engine: imported tables)

          We can easily access the model refresh times in the Power BI workspace (select ‘Refresh history’ from the three-dots menu next to the semantic model). This gives us our total model refresh benchmarks, which we can average, removing outliers from known busy times as required.

          So what are the stages of the model refresh and which ones can we troubleshoot and optimize?

          1. Data import time: this one we can measure. Adam from the Guy in the Cube video gives us a simple suggestion on how to benchmark our tables’ load times: simply run a SELECT * on each table in the data warehouse and record the time (repeat the process at different times of day, perhaps removing outliers at busy times, and average). There is also network latency to consider with large volumes of data, but let’s just do: total refresh time – total SELECT * time.

          Here is an example from my Contoso data warehouse tables:
          1. DimCurrency: time 0 s, rows: 28
          2. DimCustomer: time 0 s, rows: 18,869
          3. DimDate: time 0 s, rows: 648
          4. DimGeography: time 0 s, rows: 274
          5. DimProduct: time 0 s, rows: 2,467
          6. DimProductCategory: time 0 s, rows: 8
          7. DimProductSubcategory: time 0 s, rows: 48
          8. DimStore: time 0 s, rows: 306
          9. FactOnlineSales: 2 mins 49 secs, rows: 12.6M

          As you can see, only the FactOnlineSales table took any real time to SELECT.
          But then you might say: hey, why are you using tables in your model and not views? (If a new column appears in a table, the model refresh will fail.)
          Fair point, in which case you can run the same tests on your views; a minimal T-SQL sketch is below.
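
          As a rough sketch of how those SELECT * benchmarks could be captured on SQL Server (assuming the Contoso tables live in the dbo schema), you can have the engine report the elapsed time of each statement:

          USE [ContosoRetailDW];
          SET STATISTICS TIME ON;

          SELECT * FROM [dbo].[DimCustomer];
          SELECT * FROM [dbo].[FactOnlineSales];
          -- repeat for the remaining tables

          SET STATISTICS TIME OFF;

          The elapsed times appear in the Messages output; run the batch at a few different times of day and average the results, as suggested above.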

          Now, I don’t have a gateway set up to connect to my laptop today, so I would need to set something up in the cloud to perform the test. But suppose the model refreshed in 4 minutes and your SELECT * benchmarks totalled 2 minutes 49 seconds; then for 1 minute and 11 seconds the Power BI service is doing something else.

          So what happens once the data has been imported? Well, that’s when we need to get into the Vertipaq engine. Note that the engine is proprietary and so has secrets we will never learn, but what we do know about the steps it performs is as follows:

          Vertipaq engine steps
          1. Loads the tables into columns
          2. Checks each column for the best sort order
          3. Applies compression to create the smallest, fastest model
          4. Loads the model into memory

          Note: the Vertipaq engine only deals with imported data. Direct Queries are processed by the Direct Query engine.

          DAX formula engine
          1. Creates calculated columns and tables (in memory) – these are not compressed as well as columns that went through Vertipaq
          2. Creates relationships between columns

          So we need to learn more about what these two engines do.

          The Vertipaq Engine
          1. Stores imported data into a columnar database (for faster data retrieval).

          2. Sorts the columns into the best order for compression techniques (such as RLE)

          3. Compresses the columns using various (proprietary techniques), such as:

          a) Value encoding (e.g. deducting a constant value from each column entry to create a value that needs fewer bits and hence less space).

          b) Hash encoding (dictionary encoding): creating a table, similar to a normalization technique, where an ID and the value are stored in an in-memory dictionary (e.g. product sizes 0: small, 1: medium, 2: large). The values 0, 1, and 2 then replace small, medium, and large, resulting in fewer bits being stored in the table.

          c) Run Length Encoding (RLE): when values repeat in a column, e.g. entries 1–2000 are all Q1, this can be stored in a small table as Q1: 2000. If entries 2001 to 3500 are Q2, this can be stored as Q2: 1500, and so on.
          Using this technique the Vertipaq engine can compute results very quickly. Sorting is therefore very important for RLE-type compression. The engine does its own sorting, but you can potentially help it by sorting the data by the columns with the lowest cardinality first.
          Factors impacting compression are therefore as follows:
          1. Cardinality (the more unique values, the less compression can be done, hence increasing the size of the column in memory).
          2. Number of repetitive values – the more there are, the better the RLE compression possible.
          3. The number of rows matters, but cardinality and repetitive values are more important for compression.
          4. The data type – e.g. automatic date/time should be turned off, as it creates a larger entry than the date field alone. The column will be smaller, and if removing the time component results in much lower cardinality, the column can be compressed much further using either RLE or hash encoding.

          Primary Keys such as user ID integer types will most likely use value encoding.
          Numbers such as sales amounts that don’t have massive outliers will probably also be value encoded otherwise dictionary encoding may be used.

          To determine the best sort order, a sample of the first x rows (I think it’s a million) is taken, so it’s important that those first rows are of good quality. You can therefore help the Vertipaq engine by ensuring your initial data is representative of the whole dataset; otherwise the engine may have to re-encode data, e.g. if it finds outliers and its value encoding no longer works (hence slowing down the process).

          We can look at the type of compression used, by using DAX Studio (Advanced > View Metrics)


          Firstly, you’ll see an overview of the tables. Each column in a table is encoded differently, so you’ll see the encoding type as ‘Many’.
          You need to drill down to get to the column level (or you could just select the column tab on the left-hand side).

          Now, let’s have a look at the FactOnlineSales table.

          The table is sorted by default by the column size. This is the size in Bytes that the column is consuming in memory.

          The first column, OnlineSalesKey, has 12,627,608 rows and a cardinality of 12,627,608; as the name suggests, it is a unique key, hence it has the same cardinality (unique values) as rows.
          It is also the largest column, consuming 84 MB of memory and 31.77% of the database.
          It is encoded as VALUE type.

          Rows: 12,627,608
          Cardinality: 12,627,608
          Col Size: 84,156,128 (Data + Dictionary + Hier Size)
          Data: 33,645,560
          Dictionary: 128
          Hier Size: 50,510,440
          % Table: 33.07%

          Unique keys are generally required fields in most tables, so they are not usually somewhere you can save space by optimizing (unless you can reduce the size of the key somehow).

          By applying the same analysis you can determine which columns and tables are having the biggest impact on your model size and hence slowing your model refresh time.

        3. How to use the Window and OFFSET Functions in DAX Measures:


          The Window Function

          To use the Window function, we can start by creating a virtual table of customer sales by sales key using the CALCULATETABLE function and then creating a window on the table to return only the results we want by position. We can make use of DAX query to do this.

          
          DEFINE
          
          VAR CustomerSales =  CalculateTable(SUMMARIZECOLUMNS('V_FactOnlineSales'[CustomerKey], "Sales", SUMX('V_FactOnlineSales', V_FactOnlineSales[SalesAmount])), NOT(ISBLANK(V_FactOnlineSales[CustomerKey])))
          
          VAR CustomerSalesWindow = 
          
              WINDOW(
                  1, ABS, 3, ABS,
                  CustomerSales,
                  ORDERBY([Sales], DESC)
              )
              
          EVALUATE
          CustomerSalesWindow


          As you can see from the code above, we start off by creating a table in a variable called CustomerSales, which has a column for CustomerKey and a summarized sales column. The table is filtered, so that the customer is not blank.

          We then create a second variable called ‘CustomerSalesWindow’ to filter our CustomerSales table using the Window function.
          The initial customer sales table is as follows

          We then create the window using the following code:

          
          VAR CustomerSalesWindow = 
          
              WINDOW(
                  1, ABS, 3, ABS,
                  CustomerSales,
                  ORDERBY([Sales], DESC)
              )
          
          EVALUATE
          CustomerSalesWindow


          As you can see, the original table is sorted by sales in descending order (we could also use a partition, but not in this example).

          The syntax for the window function is as follows:
          WINDOW ( from[, from_type], to[, to_type][, <relation> or <axis>][, <orderBy>][, <blanks>][, <partitionBy>][, <matchBy>][, <reset>] )

          So you can see that we have:
          WINDOW(
          from: 1
          from_type: ABS
          to: 3
          to_type: ABS
          relation: CustomerSales
          ORDERBY([Sales], DESC))


          As you can see, ORDERBY and PARTITIONBY are very similar to their SQL counterparts; here we are sorting the Sales field in descending order.
          The positions from ABS 1 to ABS 3 give us the top 3 sales results.

          Of course, we could have used something like the TOPN() function to get the top 3 sales, but if we wanted a specific range of positions, e.g. the 3rd to 5th, then the WINDOW() function makes it much easier.

          The OFFSET() Function

          The OFFSET() function ‘Returns a single row that is positioned either before or after the current row within the same table, by a given offset’.

          In this example, we use OFFSET() to display the previous month’s sales in a table. We start off by creating a _SummarySales table and then add a second column, _SalesOffset, which uses -1 as the delta; this shows the previous month’s sales when the ORDERBY clause on CalendarMonth is set to ascending order (ASC). We can then add the final SalesMonthOnMonth column to the table to get the final summary.

          DEFINE
          	-- Syntax:
          	-- OFFSET ( <delta>[, <relation> or <axis>][, <orderBy>][, <blanks>][, <partitionBy>][, <matchBy>][, <reset>] )
          	
          	-- Create Sales by month table
          	
          	VAR _SummarySales = SUMMARIZECOLUMNS(
          		DimDate[CalendarMonth],
          		"Sales", SUMX(
          			RELATEDTABLE(FactOnlineSales),
          			FactOnlineSales[SalesAmount]
          		)
          	)
          
          -- Add the offset column by one month
          
          	VAR _SalesOffset =
          	ADDCOLUMNS(
          		_SummarySales,
          		"PreviousMonth", SELECTCOLUMNS(
          			OFFSET(
          				-1,
          				_SummarySales,
          				ORDERBY(
          					[CalendarMonth],
          					ASC
          				)
          			),
          			[Sales]
          		)
          	)
          
          -- Add a month on month difference column
          
          	VAR _SalesMonthOnMonth =
          	ADDCOLUMNS(
          		_SalesOffset,
          		"Month on Month", [Sales] - [PreviousMonth]
          	)
          
          EVALUATE
          	_SalesMonthOnMonth

          After adding the previous month column, we can then add a month-on-month column.
          This gives us a table as follows:



        4. How to filter DAX Measures In Power BI

          The application of various filter arguments is crucial to controlling the output of DAX measures when visual filters are present on a report page. When CALCULATE() is used in an expression, any filter argument it applies will override any existing filter on the same column.
          e.g. CALCULATE(SUM(V_FactOnlineSales[SalesAmount]), V_DimProduct[BrandName]="Litware")
          Any existing filter on V_DimProduct[BrandName] is removed and the new filter is applied, so a slicer on that column would stop working.

          In this example, we create a number of measures in a matrix, using different filter functions. There are 3 slicers that can be applied to the matrix: Year, BrandName, and ProductCategoryName. The original simple measure is Sales Amount, which is just a sum of SalesAmount from the FactOnlineSales table in the Contoso Retail data warehouse.



          Simple SUM Measure:
          The original measure, with no filter arguments, will just be affected by the current filter context applied by the slicers and the matrix or table rows and columns.
          Sales Amount =
          SUM(V_FactOnlineSales[SalesAmount])

          CALCULATE(): Using the powerful Calculate() function, we can apply filters to the original measure to change the way the visual filters affect the measure.

          Sales Amount (Filter Brand Litware) = CALCULATE(SUM(V_FactOnlineSales[SalesAmount]), V_DimProduct[BrandName]="Litware")

          In the next example, we apply a filter using calculate, but on the year, which is a filter in the matrix. As you can see, the filter on year is removed from the columns and the sum of sales for 2008 is repeated for each year from 2007 to 2009.

          Sales Amount (Filter Year 2008) = CALCULATE(SUM(V_FactOnlineSales[SalesAmount]), V_DimDate[CalendarYear]=2008)

          To fix this issue we can use KEEPFILTERS()

          KEEPFILTERS(): The KEEPFILTERS() function helps us keep the filters on Year in the matrix:
          Sales Amount (KeepFilters) =

          CALCULATE(SUM(V_FactOnlineSales[SalesAmount]), V_DimDate[CalendarYear]=2008, KEEPFILTERS(V_DimDate))

          REMOVEFILTERS(): By using the REMOVEFILTERS() function along with CALCULATE() we can remove all the filters applied to the report. As you can see in the table above, this removes the filters from the Year, Brand, and Product Category Name columns. Essentially giving us the total sales for the company for all time.

          Sales Amount (RemoveFilters: all) = CALCULATE(SUMX(V_FactOnlineSales, V_FactOnlineSales[SalesAmount]), REMOVEFILTERS())

          REMOVEFILTERS('TableName'): We can also use the REMOVEFILTERS() function to remove filters only from a specific table (not all tables). In this example, we remove any filters on the V_DimDate table; other filters will continue to filter the measure:

          Sales Amount (RemoveFilters: Table) = CALCULATE(SUMX(V_FactOnlineSales, V_FactOnlineSales[SalesAmount]), REMOVEFILTERS(V_DimDate))

          REMOVEFILTERS('TableName'[ColumnName]): A more granular method is to remove filters just on specific columns. In this example, we remove any filtering applied by 'V_DimProductCategory'[ProductCategoryName].

          Sales Amount (RemoveFilters: Column) = CALCULATE(SUMX(V_FactOnlineSales, V_FactOnlineSales[SalesAmount]), REMOVEFILTERS('V_DimProductCategory'[ProductCategoryName]))

          ALL('TableName'): Removes all filters from a specified table. Inside CALCULATE this behaves much like REMOVEFILTERS('TableName'), but unlike REMOVEFILTERS, ALL can also be used on its own as a table expression.

          Sales Amount (All: DimProductCategory Table) = CALCULATE(SUMX(V_FactOnlineSales, V_FactOnlineSales[SalesAmount]), ALL(V_DimProductCategory))

          ALL('TableName'[ColumnName]): Removes all filters from the specified column or columns; for example, ALL(V_DimDate[CalendarYear]) would remove only any filter on the year column.

          ALLEXCEPT(): Removes all filters except those on the specified column or columns. In this example, we remove all filters except the one on the 'V_DimProductCategory'[ProductCategoryName] column.

          Sales Amount (AllExcept: ProductCategoryName) = CALCULATE(SUMX(V_FactOnlineSales, V_FactOnlineSales[SalesAmount]), ALLEXCEPT(V_DimProductCategory, 'V_DimProductCategory'[ProductCategoryName]))

          ALLNOBLANKROW(): If your data has a blank row in it, then this may be useful to you; if it doesn't, you don't need it. You can apply it to a table or to columns:

          SalesTableRows (non blank) = COUNTROWS(ALLNOBLANKROW('V_FactOnlineSales'))
          SalesOrderLines (non blank) = COUNTROWS(ALLNOBLANKROW('V_FactOnlineSales'[SalesOrderLineNumber]))

          ALLSELECTED('TableName'): Removes filters coming from within the visual. So in this example, ALLSELECTED() removes the filters on year within the matrix (i.e. the columns), but a slicer on year will still work. You can also apply it to specific columns.

          Sales Amount (AllSelected DimDate) = CALCULATE(SUMX(V_FactOnlineSales, V_FactOnlineSales[SalesAmount]), ALLSELECTED(V_DimDate))

          Using multiple Tables and Columns in Filter Expressions


          The above use of filter functions affecting visuals can be further expanded to include multiple tables and multiple columns as in the examples below:

          CALCULATE(SUM(V_FactOnlineSales[SalesAmount]), V_DimProduct[BrandName]="Litware" || V_DimProduct[BrandName]="Northwind Traders")

          Sales Amount (Filter Multiple Year) =

          CALCULATE(SUM(V_FactOnlineSales[SalesAmount]), V_DimDate[CalendarYear] IN {2008, 2009})

          Sales Amount (RemoveFilters multiple) = CALCULATE(SUMX(V_FactOnlineSales, V_FactOnlineSales[SalesAmount]), REMOVEFILTERS('V_DimProductCategory'[ProductCategoryName], 'V_DimProductCategory'[ProductCategoryDescription]))

          Sales Amount (All: multiple) = CALCULATE(SUMX(V_FactOnlineSales, V_FactOnlineSales[SalesAmount]), ALL(V_DimProductCategory), ALL(V_DimDate))

          Recommended Book: The Definitive Guide to DAX by Marco Russo and Alberto Ferrari













        5. How to Create Multiple Measures With Power BI DAX Query Editor

          Creating multiple measures in DAX Query Editor

          It can be useful to create templates of commonly used DAX measures for building DAX models. Here we create a template of common sales measures, such as Sales YTD, Sales Last Year, Sales Year on Year, and Sales Year on Year %; we can then apply the same logic for QTD and MTD.

          The full code for creating the measures is at the bottom of this page.

          For this example, I am using just 2 tables from the Contoso Retail Data warehouse: DimDate and FactSales. They are joined on the DateKey.

          We start with the Year sales measures as shown below in the DAX query editor. To add them to the model, we just click ‘Update model: Add new measure’, but first we want to format the code, using the Format Query button.

          Here is the code, with the DAX formatted.

          We can then click the 4 ‘Update model: Add new measure’ links and it will add the 4 measures to the model.


          We can then create similar measures for QTD and MTD as follows:

          Here is the code for the Quarterly measures:

          The code for creating the Monthly measures is as follows:

          That gives me the 12 measures in record time!


          As promised, here is the full code that can be copied and pasted. Of course, you’ll need to change the table names as required. Note that I have created an empty ‘_Measures’ table to act as a container for the measures.

          // Learn more about DAX queries at https://aka.ms/dax-queries
          DEFINE
          //Year Measures
          	MEASURE '_Measures'[Sales_YTD] = CALCULATE(
          			SUM(FactSales[SalesAmount]),
          			DATESYTD('DimDate'[Datekey])
          		)
          	MEASURE '_Measures'[Sales_LY_YTD] = CALCULATE(
          			SUM(FactSales[SalesAmount]),
          			SAMEPERIODLASTYEAR(DATESYTD('DimDate'[Datekey]))
          		)
          	MEASURE '_Measures'[Sales_YOY] = '_Measures'[Sales_YTD] - '_Measures'[Sales_LY_YTD]
          	MEASURE '_Measures'[Sales_YOY%] = ('_Measures'[Sales_YTD] - '_Measures'[Sales_LY_YTD]) / '_Measures'[Sales_LY_YTD]
          	
          //QTD Measures
          	MEASURE '_Measures'[Sales_QTD] = CALCULATE(
          			SUM(FactSales[SalesAmount]),
          			DATESQTD('DimDate'[Datekey])
          		)
          	MEASURE '_Measures'[Sales_LY_QTD] = CALCULATE(
          			SUM(FactSales[SalesAmount]),
          			SAMEPERIODLASTYEAR(DATESQTD('DimDate'[Datekey]))
          		)
          	MEASURE '_Measures'[Sales_QTD_YOY] = '_Measures'[Sales_QTD] - '_Measures'[Sales_LY_QTD]
          	MEASURE '_Measures'[Sales_QTD_YOY%] = ('_Measures'[Sales_QTD] - '_Measures'[Sales_LY_QTD]) / '_Measures'[Sales_LY_QTD]
          	
          	//MTD Measures
          	MEASURE '_Measures'[Sales_MTD] = CALCULATE(
          			SUM(FactSales[SalesAmount]),
          			DATESMTD('DimDate'[Datekey])
          		)
          	MEASURE '_Measures'[Sales_LY_MTD] = CALCULATE(
          			SUM(FactSales[SalesAmount]),
          			SAMEPERIODLASTYEAR(DATESMTD('DimDate'[Datekey]))
          		)
          	MEASURE '_Measures'[Sales_MTD_YOY] = '_Measures'[Sales_MTD] - '_Measures'[Sales_LY_MTD]
          	MEASURE '_Measures'[Sales_MTD_YOY%] = ('_Measures'[Sales_MTD] - '_Measures'[Sales_LY_MTD]) / '_Measures'[Sales_LY_MTD]
        6. How to use the Index Function in DAX Measures

          Demonstrating the use of the INDEX function with the Contoso Retail data warehouse. We can start off by building a virtual table in the DAX query editor, which we can use to apply the INDEX function. The table created is a list of the first 10 customers by CustomerKey from the DimCustomer table.

          DEFINE
          	-- Syntax
          	-- INDEX(<position>[, <relation> or <axis>][, <orderBy>][, <blanks>][, <partitionBy>][, <matchBy>][, <reset>] )
          
          	VAR CustomerSalesSample =
          	CALCULATETABLE(
          		SELECTCOLUMNS(
          			DimCustomer,
          			"CustomerKey", DimCustomer[CustomerKey],
          			"FirstName", DimCustomer[FirstName],
          			"LastName", DimCustomer[LastName],
          			"Total Sales", SUMX(
          				RELATEDTABLE(FactOnlineSales),
          				FactOnlineSales[SalesAmount]
          			)
          		),
          		TOPN(
          			10,
          			DimCustomer,
          			DimCustomer[CustomerKey],
          			ASC
          		)
          	)
          
          VAR Ind = INDEX(
          		1,
          		CustomerSalesSample,
          		ORDERBY(
          			[Total Sales],
          			DESC
          		)
          	)
          EVALUATE
          	Ind

          If we evaluate the CustomerSalesSample table first we can see the table we are working with.

          The syntax for the INDEX function is as follows:

          — INDEX(<position>[, <relation> or <axis>][, <orderBy>][, <blanks>][, <partitionBy>][, <matchBy>][, <reset>] )
          In our example, we use the <position> of 1 to get the first sales amount, which is the highest sales amount as the Total Sales column is sorted in descending order (DESC)

          And then when we evaluate the Index expression (variable Ind), we get the following, which is the correct output we are looking for.

          To get the last position, we could either sort the data in ascending order (ASC) or use an index of -1, as in the following example:

          VAR Ind = INDEX(
          		-1,
          		CustomerSalesSample,
          		ORDERBY(
          			[Total Sales],
          			DESC
          		)
          	)

          When we use -1 as the index, we get the blank sales row returned, which isn’t what we wanted, so we need to modify the code.

          One way of filtering out the blanks is to add a filter on the table as an additional variable, as in the code below.

          DEFINE
          	-- Syntax
          	-- INDEX(<position>[, <relation> or <axis>][, <orderBy>][, <blanks>][, <partitionBy>][, <matchBy>][, <reset>] )
          
          	VAR CustomerSalesSample =
          	CALCULATETABLE(
          		SELECTCOLUMNS(
          			DimCustomer,
          			"CustomerKey", DimCustomer[CustomerKey],
          			"FirstName", DimCustomer[FirstName],
          			"LastName", DimCustomer[LastName],
          			"Total Sales", SUMX(
          				RELATEDTABLE(FactOnlineSales),
          				FactOnlineSales[SalesAmount]
          			)
          		),
          		TOPN(
          			10,
          			DimCustomer,
          			DimCustomer[CustomerKey],
          			ASC
          		) 
          
          	)
          	VAR CustomerSalesNotBlank = FILTER(CustomerSalesSample, NOT(ISBLANK([Total Sales])))	
          
          	VAR Ind = INDEX(
          		-1,
          		CustomerSalesNotBlank,
          		ORDERBY(
          			[Total Sales],
          			DESC
          		)
          	)
          EVALUATE
           Ind

          Evaluating this code now gives us the least total sales amount that is not blank:

        7. How to use the RankX Function in DAX Measures

          Using the RANK() Function in DAX

          To demonstrate the use of the DAX RANK function, we’ll start by creating a simple sales table with the sales amount and the year. Then we’ll add additional columns to create examples of using the RANK function.

          In the first example (SimpleSalesRank), we’ll just create a simple ranking. By default the function ranks the Sales column from the lowest amount to the highest, and you’ll notice the blank sales value is included as rank one. Note that we can order by more than one column, as well as partition by more than one column.

          SimpleSalesRank = RANK(ORDERBY(SalesRank[Sales]))
          

          The first thing we can do is move the blank value to the end of the ranking, using the LAST parameter.

          SimpleSalesRankBLANKLast = RANK(ORDERBY(SalesRank[Sales]), LAST)

          Then we can rank the sales values from the highest first, while still retaining blank in the last position.

          SimpleSalesRankDESC0 = RANK(ORDERBY(SalesRank[Sales], DESC), LAST)

          We can then partition the ranking by using the Year column.

          SimpleSalesRankPARTITION = RANK(DENSE, ORDERBY(SalesRank[Sales], DESC),  LAST, PARTITIONBY(SalesRank[Year]))

          Using RANKX() function in DAX

          In the next example, we create a similar Sales table with a Sales column and a year column. We can then use the RANK functions to create our rankings.

          First, we’ll create a simple ranking as before, but with RANKX(). Included here is the function syntax in the comments.
          As you can see from the table above, the RANKX function defaults to sorting the highest sales value first, whereas the RANK function sorts it last. The RANKX function also defaults to putting the blank last, whereas the RANK function ordered it first.

          RankXFunction = 
          --RANKX(<table>, <expression>[, <value>[, <order>[, <ties>]]])  
          RANKX(SalesRankX2, SalesRankX2[Sales])

          We can reverse the order with the following code, but you’ll notice we don’t use the ORDERBY clause in the code:

          RankXFunctionASC = RANKX(SalesRankX2, SalesRankX2[Sales],,  ASC)

          We can also apply the DENSE option for ties, as the default is SKIP. For example, there are 2 sales of value 94, which are both ranked 6; because ties are skipped by default, the next rank jumps to 8. With DENSE, the next rank does not jump.

          RankXFunctionDENSE = 
          --RANKX(<table>, <expression>[, <value>[, <order>[, <ties>]]])  
          --note: default ties is SKIP
          RANKX(SalesRankX2, SalesRankX2[Sales], ,  ASC, DENSE)

          In the next example, we use RANKX with RELATEDTABLE(). We start off by creating a sample of the DimCustomer table, joining it to the FactOnlineSales table, and then adding a ranking column to the new customer table for total sales. We then check the ranking by adding a total sales check column.

          Step 1: Create a sample customer table. In the code below, we create a simple selection of CustomerKey, FirstName, and LastName, and then filter the table to the first 10 customers by CustomerKey. This gives us a table of the first 10 customers in the DimCustomer table. The data comes from the Contoso Retail data warehouse.

          CustomerSimple = CALCULATETABLE(
              SELECTCOLUMNS(
                  DimCustomer,
                  "CustomerKey", DimCustomer[CustomerKey],
                  "FirstName", DimCustomer[FirstName],
                  "LastName", DimCustomer[LastName]
              ),
              TOPN( 10, DimCustomer, DimCustomer[CustomerKey], ASC )
          )

          The table below is created (first 3 columns); we then add the SalesRank and CustomerTotalSalesCheck columns to demonstrate the use of the RANKX function with the RELATEDTABLE function.

          The code for ranking the sales is below. We then add the CustomerTotalSalesCheck to confirm that our ranking is working correctly.
          As you can see, it is. The first rank of 1 is allocated to total sales of £3,932. The last value is blank, ranked 10. You will notice there is no rank 8, because the default behaviour is to SKIP a rank when there is a tie. We can change this by adding the DENSE option.

          SalesRank = RANKX(CustomerSimple, SUMX(RELATEDTABLE(FactOnlineSales), FactOnlineSales[SalesAmount]))

          Add the CustomerTotalSalesCheck column:

          CustomerTotalSalesCheck = SUMX(RELATEDTABLE(FactOnlineSales), FactOnlineSales[SalesAmount]) 

          Next, we add DENSE so the ranks don’t jump after ties (optional):

          SalesRankDENSE = RANKX(CustomerSimple, SUMX(RELATEDTABLE(FactOnlineSales), FactOnlineSales[SalesAmount]),,,Dense)

          Creating a Measure using RANKX()
          Now we have some good examples of using the RANK and RANKX functions, we can use them in a measure, which creates a temporary table in memory.

          To create the measure we can use the DAX query editor in Power BI, Tabular Editor, or DAX Studio. My preference is Tabular Editor, but the code will work in all three, allowing us to see the data as we build the measure. If you just try to build the measure directly, you can’t be sure what’s going on, unless you build the tables physically in Power BI, which is slower to work with.

          Here we borrow the code from the previous example creating a physical table, but ensure the table is set to DimCustomer in the Rank function.

          DEFINE
          VAR CustomerSample = 
          CALCULATETABLE(
              SELECTCOLUMNS(
                  DimCustomer,
                  "CustomerKey", DimCustomer[CustomerKey],
                  "FirstName", DimCustomer[FirstName],
                  "LastName", DimCustomer[LastName], 
          		"Rank", RANKX(DimCustomer, SUMX(RELATEDTABLE(FactOnlineSales), FactOnlineSales[SalesAmount]),,,Dense) ,
          	    "TotalSalesCheck",  SUMX(RELATEDTABLE(FactOnlineSales), FactOnlineSales[SalesAmount]) 
          		    ),
              TOPN( 10, DimCustomer, DimCustomer[CustomerKey], ASC )
          )
          
          EVALUATE
          CustomerSample
          

          Here is the output of the code in DAX Query View (inside Power BI, I haven’t used tabular editor this time).

          As you can see, the top rank goes to total sales of 3,932, as before.

          I can now filter the ranking table; in this example, I filter for ranks greater than 1 and less than 4.

          DEFINE
          VAR CustomerSample = 
          CALCULATETABLE(
              SELECTCOLUMNS(
                  DimCustomer,
                  "CustomerKey", DimCustomer[CustomerKey],
                  "FirstName", DimCustomer[FirstName],
                  "LastName", DimCustomer[LastName], 
          		"RankNumber", RANKX(DimCustomer, SUMX(RELATEDTABLE(FactOnlineSales), FactOnlineSales[SalesAmount]),,,Dense) ,
          	    "TotalSalesCheck",  SUMX(RELATEDTABLE(FactOnlineSales), FactOnlineSales[SalesAmount]) 
          		    ),
              TOPN( 10, DimCustomer, DimCustomer[CustomerKey], ASC )
          )
          
          VAR TopThreeRankSales = FILTER(CustomerSample, AND([RankNumber] > 1, [RankNumber]< 4))
          
          EVALUATE
          TopThreeRankSales

          This gives me a new table:

          Next, we want to get the sum of the 2nd and 3rd-ranked sales. We can do this by using the GROUPBY function, but without grouping on any columns.

          DEFINE
          VAR CustomerSample = 
          CALCULATETABLE(
              SELECTCOLUMNS(
                  DimCustomer,
                  "CustomerKey", DimCustomer[CustomerKey],
                  "FirstName", DimCustomer[FirstName],
                  "LastName", DimCustomer[LastName], 
          		"RankNumber", RANKX(DimCustomer, SUMX(RELATEDTABLE(FactOnlineSales), FactOnlineSales[SalesAmount]),,,Dense) ,
          	    "TotalSalesCheck",  SUMX(RELATEDTABLE(FactOnlineSales), FactOnlineSales[SalesAmount]) 
          		    ),
              TOPN( 10, DimCustomer, DimCustomer[CustomerKey], ASC )
          )
          
          VAR TopThreeRankSales = FILTER(CustomerSample, AND([RankNumber] > 1, [RankNumber]< 4))
          VAR SalesOfSecondAndThird = GROUPBY(TopThreeRankSales, "SalesOfSecondAndThird", SUMX(CURRENTGROUP(), [TotalSalesCheck]))
          
          EVALUATE
          SalesOfSecondAndThird

          This gives the following output:

          We should now be able to use the DAX in a measure, so we transfer the DAX code into a measure as follows.
          The code using DEFINE and EVALUATE is a DAX query, as used in the DAX query editor; when we create a measure, we are creating a DAX expression. DAX expressions are converted to DAX queries when they are evaluated. Here is the code for creating the measure:

          _SecondAndThirdTotalSales = 
          VAR CustomerSample = 
          CALCULATETABLE(
              SELECTCOLUMNS(
                  DimCustomer,
                  "CustomerKey", DimCustomer[CustomerKey],
                  "FirstName", DimCustomer[FirstName],
                  "LastName", DimCustomer[LastName], 
          		"RankNumber", RANKX(DimCustomer, SUMX(RELATEDTABLE(FactOnlineSales), FactOnlineSales[SalesAmount]),,,Dense) ,
          	    "TotalSalesCheck",  SUMX(RELATEDTABLE(FactOnlineSales), FactOnlineSales[SalesAmount]) 
          		    ),
              TOPN( 10, DimCustomer, DimCustomer[CustomerKey], ASC )
          )
          
          VAR TopThreeRankSales = FILTER(CustomerSample, AND([RankNumber] > 1, [RankNumber]< 4))
          VAR SalesOfSecondAndThird = GROUPBY(TopThreeRankSales, "SalesOfSecondAndThird", SUMX(CURRENTGROUP(), [TotalSalesCheck]))
          
          RETURN
          SalesOfSecondAndThird

          We can then use the measure in a visualization as follows:

        8. An Example of Creating a Sales Report with SQL

We can create a more interesting sales report using window functions and additional calculations in the query, including the gross margin %, total profit, and ranks for product price, sales quantity, and sales amount. The query uses a Common Table Expression (CTE) to group the data first (as we can't use GROUP BY directly with window functions).

          This is the kind of report decision-makers will be interested in and as more questions come up, additional fields can be added.

          USE [ContosoRetailDW];
-- Note: Qty * UnitPrice - Discount = TotalSalesAmount (small discrepancy)
          -- CTE
          
          WITH sub1 AS
          (
          SELECT p.ProductName, s.UnitCost, s.UnitPrice, SUM(SalesQuantity) as TotalSalesQuantity, 
          (s.UnitCost * SUM(SalesQuantity)) as TotalCost, SUM(DiscountAmount) as TotalDiscount,
          SUM(SalesAmount) as TotalSalesAmount
          FROM [dbo].[FactOnlineSales] as s
          LEFT OUTER JOIN [dbo].[DimProduct] as p ON s.[ProductKey] = p.[ProductKey]
          LEFT OUTER JOIN [dbo].[DimDate] d on s.DateKey = d.DateKey
          LEFT OUTER JOIN [dbo].[DimStore] st ON s.StoreKey = st.StoreKey
          WHERE 
          d.CalendarYear = 2007 AND d.CalendarMonthLabel = 'January'
          AND
          st.StoreName = 'Contoso Europe Online Store'
          GROUP BY  p.ProductName, s.UnitCost, s.UnitPrice
          )
          
--Main Query referencing the CTE
          
          SELECT ProductName, UnitCost, UnitPrice, (UnitPrice-UnitCost)/UnitCost as Margin, 
          TotalSalesQuantity, format(TotalCost,'$0,,.0M') as TotalCost, format(TotalSalesAmount, '$0,,.0M') as TotalSales, 
          format(TotalSalesAmount-TotalCost, '$0,,.0M')  as TotalProfit,
          TotalDiscount,
          RANK() OVER(ORDER BY UnitPrice DESC) as PriceRank,
          RANK() OVER(ORDER BY TotalSalesQuantity DESC) as QtyRank,
          RANK() OVER(ORDER BY TotalSalesAmount DESC) as SalesRank
          FROM sub1
          ORDER BY SalesRank
        9. Using Python for Simple Linear Regression

Creating a simple linear regression model and preparing for multiple linear regression.

          In this example, we use a sample of marketing spend data vs. sales and inspect the correlation between radio spend and total sales. The regression line is fitted using the ols function from statsmodels.formula.api

You can download the original file from Kaggle here, then just replace the file path in the df = pd.read_csv() line with the location where you saved it.

          
          import pandas as pd
          import seaborn as sns
          from statsmodels.formula.api import ols
          import statsmodels.api as sm
          import matplotlib.pyplot as plt
          pd.set_option('display.max_columns', None)
          df = pd.read_csv("marketing_sales_data.csv")
          
          # Drop rows with any null values
          df.dropna(inplace=True)
          
          # Check and handle duplicates if needed
          if df.duplicated().sum() > 0:
              df.drop_duplicates(inplace=True)
          
          #rename columns to snake 
          df.rename(columns = {'Social Media': 'Social_Media'}, inplace = True)
          
          # Simple order encoding
          tv_dict = {'Low': 1, 'Medium': 2, 'High': 3}
          df['TV'] = df['TV'].replace(tv_dict)
          
          # One-hot encoding for non-ordinal variable
          df = pd.get_dummies(df, columns=['Influencer'], dtype=int)
          
          
          # Define and fit the model
          ols_formula = "Sales ~ Radio"
          OLS = ols(formula=ols_formula, data=df)
          model = OLS.fit()
          summary = model.summary()
          print(summary)  #Prints off the statistical summary including R squared and the Beta coefficients.
          
# Calculate residuals and predicted (fitted) values
residuals = model.resid
y_pred = model.predict(df)  # the formula API looks up the 'Radio' column in the DataFrame
          

The results are returned by the model.summary() method of the fitted OLS (ordinary least squares) model from the statsmodels module. R-squared is calculated as 0.757, meaning roughly 76% of the variability in y (sales) is accounted for by radio spend. However, if we look at other media, we will see that other variables (TV) also have a strong correlation.

          The Beta coefficient for radio spend is 8.17, which means that for every $1 million in Radio spend, we get $8.17 million in sales.
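As a quick sanity check on that interpretation, you can feed a hypothetical spend value back into the fitted model from the code above (the value of 25 below is made up purely for illustration):

# Hypothetical example: predict sales for a radio spend of 25 using the fitted model above
new_data = pd.DataFrame({"Radio": [25]})   # made-up spend value
predicted_sales = model.predict(new_data)  # the formula API builds the design matrix from the DataFrame
print(predicted_sales)                     # roughly intercept + 8.17 * 25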

        10. How to check for model assumptions with Python Seaborn Graphics


We need to check model assumptions to give us confidence that our models have integrity and are not biased or overfitting the data.

          We check for three assumptions in this example, with sub-plotted seaborn graphics for a linear regression model.
The code for creating the linear regression model can be found in the previous section (Section 9).
You can run the code below once you have built the model, which models the relationship between radio advertising spend and sales.

          assumption-checking

          Graph 1: Checking the Linearity Assumption
The linearity assumption is as follows: each predictor variable (x) is linearly related to the outcome variable (y).
In the first graph, we plot radio advertising spend against sales and can see that there is a linear relationship, so we can conclude the linearity assumption is met.

Graph 2: Checking the Homoscedasticity Assumption with a Scatterplot

The homoscedasticity assumption (extra points if you can spell it correctly) is that the residuals have equal or almost equal variance across the range of predicted values.

In the second graph, we plot the residuals of the model (the differences between the actual values and the model's predictions) against the predicted values, where y_pred are the predicted y values from the regression line.

By plotting the error terms against the predicted values, we can check that there is no pattern in the error terms. Good homoscedasticity therefore looks like a balanced scatter of residuals above and below zero.

          Graph 3: Check for Normality Assumption
In the third graph, a histogram is used to plot the residuals of the regression (the actual y values minus the predicted y values). If the model is unbiased, the residuals should be normally distributed (and we see that).

          The fourth graph is a Q-Q plot which is also used to check the normality assumption.

          fig, ax = plt.subplots(2, 2, figsize=(18, 10))
          
          fig.suptitle('Assumption Checks')
          
          #Check for linearity
          ax[0, 0] = sns.regplot(
              ax=ax[0, 0],
              data = df,
              x = df['Radio'],
              y = df['Sales'], 
              );
          ax[0, 0].set_title('Radio Sales')
          ax[0, 0].set_xlabel('Radio Spend ($K)')
          ax[0, 0].set_ylabel('Sales ($)')
          #ax[0].set_xticks(range(0,10,10))
          #ax[0].set_xticks(rotation=90)
          
          
#Check for homoscedasticity
          # Plot residuals against the fitted values
          ax[0, 1] = sns.scatterplot( ax=ax[0, 1],x=y_pred, y=residuals)
          ax[0, 1].set_title("Residuals vs Fitted Values")
          ax[0, 1].set_xlabel("Fitted Values")
          ax[0, 1].set_ylabel("Residuals")
          ax[0, 1].axhline(0, linestyle='--', color='red')
          
          
          #Check for normality
          ax[1, 0] = sns.histplot(ax=ax[1, 0], x=residuals)
          ax[1, 0].set_xlabel("Residual Value")
          ax[1, 0].set_title("Histogram of Residuals")
          
#Check for normality: Q-Q plot
sm.qqplot(residuals, line='s', ax=ax[1, 1])
ax[1, 1].set_title("Q-Q Plot")
          
          
          
          #sm.qqplot(test, loc = 20, scale = 5 ,  line='45')
          
          plt.show()

        11. A Summary of the data process used in Classification Models

          Introduction:

Classification models are machine learning models that are used to predict binary outcomes, such as:

Spam / not spam
Fraudulent transaction / non-fraudulent transaction
Customer will churn / customer will not churn
Customer high value / customer low value
Loan approval / non-approval

          The Data Process

          Planning

          1. Understand the business requirements. Understand what the measure of success is and what needs to be measured.
e.g. for binary outcomes, the metric might be precision, recall, or F1 score. Identify the Type 1 and Type 2 errors and their business cost.
          2. Identify the key stakeholders and subject matter experts relevant to the project
3. Understand where the data is and how it can be accessed. For larger data projects, if the data comes from many different sources, can it be brought together in a data warehouse such as Google BigQuery?
          4. Understand the technology required for the project. Are extra resources required?
          5. Is there a data dictionary describing all the data field types and purposes?

          Exploratory Data Analysis (Python)

1. Explore the data, and list the number of columns, rows, and data types. If there are any questions, these may need to be referred back to the business.
2. Explore the data ranges (df.describe()). Are the data counts complete? Do the means and ranges make sense? Do the min and max statistics flag any potential errors or outliers in the data?
          3. Explore null values. If there are null values, either look to fill the data or drop the rows.
          4. Remove or adjust outliers.
          5. Summarize and graph the data:
          6. Use boxplots to look for outliers in columns.
          7. Use histograms to understand the distributions of data.
          8. Use a correlation matrix and pair plot to understand co-variance between columns.
9. Visualize the data with interactive tools such as Tableau or Power BI for the initial analysis of data for clients. (A minimal pandas/seaborn sketch of checks 2-8 follows this list.)
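A rough sketch of these checks in pandas and seaborn is below; the file name is a placeholder, so swap in your own dataset:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("your_data.csv")   # placeholder file name

print(df.info())            # columns, row counts and data types
print(df.describe())        # means, ranges, min/max of numeric columns
print(df.isnull().sum())    # null counts per column

numeric = df.select_dtypes("number")
numeric.hist(figsize=(10, 6))              # distributions per column
plt.show()
sns.boxplot(data=numeric)                  # outliers per column
plt.show()
sns.heatmap(numeric.corr(), annot=True)    # correlation matrix
plt.show()
sns.pairplot(df)                           # pairwise relationships / covariance
plt.show()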

          Model Selection (Classification)

          Classification Models: Naive Bayes, Logistic Regression, Decision Tree, Random Forests, XG Boost

          1. Identify the variable to be predicted: Is it a continuous variable or a categorical variable?
          2. Select the machine learning model relevant to the task and 1 or 2 additional models to compare results to.
          3. Confirm the assumptions required by the model and check the data to confirm they meet the requirements.
4. Feature selection: select the features (independent variables) and the dependent variable (columns) to be used in the model.
5. Feature transformation: transform categorical data into numeric data using:
  a) one-hot encoding for non-ordered categories, e.g. departments
  b) ordinal encoding for ordered data such as 'low', 'medium', 'high'
  Where the scale of numeric data varies widely across features (e.g. small numbers 0-1 and large numbers 100+), consider applying scaling to the data:
  a) Log normalization (taking a logarithm of the data)
  b) Standardization (which converts data to Z-scores on a standard normal distribution, with a mean of zero within each feature)
6. Feature extraction: create new features from existing features (for example, weekly hours derived from daily records).

7. Check for class imbalance in the data. In the case of a binary dependent variable (True/False), we would ideally like an even split, but as a minimum the smaller class should be around 10% of the data. Class imbalances can be addressed with either (see the sketch after this list):
a) Downsampling: randomly sampling the larger class down to the size of the smaller one.
b) Upsampling: randomly sampling (with replacement) the smaller class up to the size of the larger one.
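Here is a rough sketch of the encoding and class-imbalance steps; the file, column names, and target ("Churned") are hypothetical, and upsampling is shown with sklearn's resample utility:

import pandas as pd
from sklearn.utils import resample

df = pd.read_csv("customer_data.csv")   # hypothetical file

# Ordinal encoding for ordered categories (hypothetical column)
df["Spend_Level"] = df["Spend_Level"].map({"Low": 1, "Medium": 2, "High": 3})

# One-hot encoding for non-ordered categories (hypothetical column)
df = pd.get_dummies(df, columns=["Department"], dtype=int)

# Check the class balance of the (hypothetical) binary target
print(df["Churned"].value_counts(normalize=True))

# Upsample the minority class so the two classes are level
majority = df[df["Churned"] == 0]
minority = df[df["Churned"] == 1]
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=42)
df_balanced = pd.concat([majority, minority_upsampled])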

          Model Execution

1. Set up the X (feature) variables and the y (target) variable in separate data frames.
2. Decide whether to use a separate validation set or cross-validation (see Model Validation below).
3. Use train_test_split to create training and test sets, and potentially a third validation set. The test dataset should be around 25%, but can be larger if the dataset is large. Should the split be stratified?
4. Select the hyperparameter requirements of the model. GridSearchCV is a powerful sklearn function that takes a grid of hyperparameters, a scoring metric, and the training data, and runs multiple models to find the best one (a sketch follows this list).
5. Build and run the model.
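A minimal sketch of steps 3-5 with sklearn, assuming X and y have been prepared as above (the model choice and parameter grid are just examples):

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Hold out 25% of the data as a stratified test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Example hyperparameter grid; GridSearchCV cross-validates each combination
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, scoring="f1", cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)
best_model = grid.best_estimator_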

          Model Validation

In a simple scenario, the best-performing model is run against the hold-out sample (e.g. 25% of the data) that the model has not been trained on.

a) In cross-validation, the training data is repeatedly split into folds; the model is trained on some folds and validated on the remaining fold each time (e.g. 5 times). Cross-validation is slower, but makes good use of limited data.

b) Separate validation set: a random sample is held out from the training data and used to validate candidate models before the final test. This is advantageous when we don't want to keep re-sampling from the training data.

          Measurement
1. Metrics: Accuracy, Precision, Recall, F1, AUC Score, Classification Report (a short sklearn sketch follows below).

          2. Visualise: Confusion Matrix, ROC Curve
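A short sklearn sketch of these metrics and visuals, assuming the best_model, X_test, and y_test objects from the sketch above:

import matplotlib.pyplot as plt
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             roc_auc_score, classification_report,
                             ConfusionMatrixDisplay, RocCurveDisplay)

y_pred = best_model.predict(X_test)

print(accuracy_score(y_test, y_pred))
print(precision_score(y_test, y_pred))
print(recall_score(y_test, y_pred))
print(f1_score(y_test, y_pred))
print(roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1]))
print(classification_report(y_test, y_pred))

ConfusionMatrixDisplay.from_predictions(y_test, y_pred)     # confusion matrix
RocCurveDisplay.from_estimator(best_model, X_test, y_test)  # ROC curve
plt.show()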

          Check Assumptions:

Linearity: all independent and dependent variables should exhibit a degree of linearity (use a pairplot).
Independent observations: this is business-specific, so it requires understanding of how the variables are generated.
No multicollinearity: there should be no collinearity between independent variables, or this will reduce the reliability of the model.
No extreme outliers: extreme outliers should be excluded from the model.



        12. An Example of Using K-Means Cluster modelling

          Supermarket Example

Import libraries and read the CSV file into a data frame.
The data comes from Kaggle here (mall customer data).

          import pandas as pd
          import matplotlib.pyplot as plt
          import seaborn as sns
          from sklearn.cluster import KMeans
          
df = pd.read_csv("Supermarket Customers.csv")
df.info()


          First, we run some basic checks on the data to check for data integrity.
The data includes 200 rows by 5 columns.



We can change Gender to numeric with the following:

df['Gender'] = df['Gender'].replace({'Female': 0, 'Male': 1})
df.describe()





          Then we check for nulls as follows:

          
          df.isnull().sum() #check for nulls - gives a count of nulls values by column name.

          Get counts of unique values

          
          len(df['CustomerID'].unique()) #how many unique values are there


          CustomerID: 200
          Gender: 2
          Age: 51
          Annual Income (k$): 65
          Spending Score (1-100): 85

          Fixing Null Values
          Annual Income: We have 2 values missing from the Annual Income field. We can remove these rows from the data or use the averages to fill in the gaps.

          Drop rows where annual income (k$) is null

          df.dropna(subset=['Annual Income (k$)'], inplace=True)
          df.isnull().sum() #re-run check for missing data

          Then do the same for spending score:

          df.dropna(subset=['Spending Score (1-100)'], inplace=True)
          df.isnull().sum() #re-run check for missing data
          df.info() # the number of rows is reduced to 197.

We can run the seaborn pairplot to plot graphs of each combination of variables.

          pairplot = sns.pairplot(df)
          plt.show()


From here we can see there are some interesting, distinct-looking clusters around annual income and spending score. It also makes sense that these two would be related, so we can use them in the k-means model.

          For the K-means model, we need to determine the value of K which is the number of clusters we want to identify.
          We use the elbow method to do this as follows:

          # Extracting features for clustering
          X = df[['Annual Income (k$)', 'Spending Score (1-100)']]
          
# Using the elbow method to determine the optimal number of clusters
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
          # Plotting the elbow graph
          plt.plot(range(1, 11), wcss)
          plt.title('Elbow Method')
          plt.xlabel('Number of clusters')
          plt.ylabel('WCSS')
          plt.show()

The graph produced by the elbow method is below. For the value of K, we select the point where the rate of decrease in WCSS (the within-cluster sum of squares) drops off sharply (the elbow). In this case, we select 5 clusters.

          K-Means Model

          Now we have the number of clusters to be used in the model, we can run the K-Means model.

          ##Use 5 clusters based on elbow graph
          
# Fitting K-Means to the dataset with the optimal number of clusters (5, based on the elbow graph)
          kmeans = KMeans(n_clusters=5, init='k-means++', max_iter=300, n_init=10, random_state=42)
          y_kmeans = kmeans.fit_predict(X)
          
          # Visualizing the clusters
          plt.scatter(X.iloc[y_kmeans == 0, 0], X.iloc[y_kmeans == 0, 1], s=100, c='red', label='Cluster 1')
          plt.scatter(X.iloc[y_kmeans == 1, 0], X.iloc[y_kmeans == 1, 1], s=100, c='blue', label='Cluster 2')
          plt.scatter(X.iloc[y_kmeans == 2, 0], X.iloc[y_kmeans == 2, 1], s=100, c='green', label='Cluster 3')
          plt.scatter(X.iloc[y_kmeans == 3, 0], X.iloc[y_kmeans == 3, 1], s=100, c='orange', label='Cluster 4')
          plt.scatter(X.iloc[y_kmeans == 4, 0], X.iloc[y_kmeans == 4, 1], s=100, c='purple', label='Cluster 5')
          
          # Plotting the centroids of the clusters
          plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='yellow', label='Centroids')
          plt.title('Clusters of customers')
          plt.xlabel('Annual Income (k$)')
          plt.ylabel('Spending Score (1-100)')
          plt.legend()
          plt.show()

The graph is below, with the 5 identified clusters coloured accordingly. Once the clusters are identified, we can use the labels to segment our data, which can then be used to determine, for example, the best marketing campaign for each segment, using A/B testing and t-tests for significance.
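To make the segmentation usable, one option is to attach the cluster labels back onto the cleaned dataframe and profile each segment (a sketch using the df, X, and y_kmeans objects from the code above):

# Attach cluster labels to the cleaned data and profile each segment
df_clustered = df.loc[X.index].copy()
df_clustered["Cluster"] = y_kmeans

print(df_clustered["Cluster"].value_counts())   # size of each segment
print(df_clustered.groupby("Cluster")[["Annual Income (k$)", "Spending Score (1-100)"]].mean())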

13. How to Select the Correct Statistical Test (A Table)

          Significance Tests (normal distribution assumed)

Each test below is listed with the number of groups, requirements, number of tails, test statistic, distribution, degrees of freedom, and an example.

z-test
Groups: 1. Requirements: normal distribution, n > 30 and known variance (population mean and standard deviation are known, sample > 30).
Tails: 1 and 2. Statistic: z-score. Distribution: Normal. Degrees of freedom: NA.
Example: A sample of the population is tested for height. Do they match the known population?

One-sample t-test
Groups: 1. Requirements: n < 30 and unknown variance.
Tails: 1 and 2. Statistic: t-score. Distribution: Student's t. Degrees of freedom: n-1.
Example: A company wants to compare the average weekly sales of a new product to the historical average weekly sales of similar products, which is $5000.

Paired t-test
Groups: 2. Requirements: the same group measured twice (before and after); 2 dependent samples.
Tails: 1 and 2. Statistic: t-score. Distribution: Student's t. Degrees of freedom: n-1.
Example: A company implements a new training program and wants to determine its effectiveness by comparing employee performance before and after the training.

Independent t-test (equal variance, pooled)*
Groups: 2. Requirements: 2 independent groups, equal variance.
Tails: 1 or 2. Statistic: t-score. Distribution: Student's t. Degrees of freedom: n1 + n2 - 2.
Example: A retailer wants to compare the average sales of two different stores located in different regions.

Independent t-test (unequal variance, Welch's)*
Groups: 2. Requirements: 2 independent groups, unequal variance.
Tails: 1 and 2. Statistic: t-score. Distribution: Student's t. Degrees of freedom: n1 + n2 - 2.

One-way ANOVA
Groups: 3+. Requirements: 3+ groups, looking at one variable.
Tails: 1 only. Statistic: F-score. Distribution: F. Degrees of freedom: between groups, k-1 (where k is the number of groups).
Example: Three groups are given different drugs in a test to see the improvement in blood sugar.

Two-way ANOVA
Groups: 3+. Requirements: 3+ groups, looking at multiple variables.
Tails: 1 only. Statistic: F-score. Distribution: F.
Example: Effect of water and sun on the height of sunflowers, where there are 3+ combinations of water and sun.

Pearson's Chi-Square
Groups: 2 categories, compared to see if they are related.
Tails: neither, just difference. Statistic: Chi-squared. Distribution: Chi-squared. Degrees of freedom: (no. of rows - 1) * (no. of columns - 1).
Example: Individuals that received social media vs. those that received email, split into purchased and not purchased.

Correlation

Pearson's correlation coefficient (r)
Groups: 2 continuous variables.
Tails: 1 or 2. Statistic: t-statistic. Distribution: t-distribution table. Degrees of freedom: n - 2.
Example: Relationship between daily hours of sunlight and temperature. Best for continuous data with linear relationships.
Provides both the direction (positive or negative) and the strength of a linear relationship between two variables. Note that for simple linear regression, R squared is the Pearson r squared. It ranges between -1 and 1.

Spearman's rank correlation
Groups: 2 ranked (ordinal) variables.
Tails: 1 or 2. Statistic: t-statistic (?). Distribution: Spearman's rho (?). Degrees of freedom: n - 2.
Example: Relationship between the ranks of employees' performance and their years of experience. Best for ordinal data or when data does not meet Pearson's assumptions.

R squared (used in regression analysis)
Variables: one independent variable (simple linear regression) or more (multiple linear regression), plus 1 dependent variable.
Not used in significance testing directly. Degrees of freedom: NA.
Example: Explaining the variance in academic performance based on study hours, attendance, and other factors.
Used in the context of regression analysis to explain the proportion of variance in the dependent variable. It indicates the proportion of the variance in the dependent variable that can be explained by the independent variable(s), but does not provide information about the direction of the relationship. Ranges from 0 to 1.
For the overall significance of a regression: fit the regression model, compute R squared, and perform an F-test (F-statistic, F-distribution) to determine overall significance.

Non-Parametric Tests (distribution not assumed to be normal)

Mann-Whitney U
Groups: 2 independent groups (like the independent t-test); compares the distributions of the 2 groups.
Statistic: U statistic. Distribution: Mann-Whitney U distribution. Degrees of freedom: NA.

Wilcoxon signed rank
Groups: 2; compares the medians of the 2 groups.
Statistic: T statistic. Distribution: Wilcoxon signed rank table. Degrees of freedom: NA.
Example: When comparing two related samples, matched samples, or repeated measurements on a single sample.

* Note: Use Levene's test to determine equal/unequal variance.

Significance Testing Steps

1. Null hypothesis: There is no significant difference between the two groups.

            Examples:
            1. In a drug efficacy study, the null hypothesis might state that the drug does not affect patients compared to a placebo.
            2. The mean cholesterol levels of patients taking the new drug are equal to the mean cholesterol levels of patients taking a placebo.
            3. The mean sales after the advertising campaign are equal to the mean sales before the campaign.
            4. The mean sales of the product with the new packaging design are equal to the mean sales of the product with the old packaging design.

            Alternative hypothesis: There is a significant difference between the means of the groups.
2. The significance level (α) is the probability threshold for rejecting the null hypothesis. Commonly used significance levels include 0.05, 0.01, and 0.10. e.g. at 5%, there is only a 5% chance of observing a result this extreme if the null hypothesis is true.
3. Calculate the test statistic (t-value, z-value, etc.). The test statistic is used to look up the p-value from the relevant distribution tables.

4. Calculate the p-value using the test statistic. The p-value is used to determine if the test is significant: e.g. a p-value of less than 5% (0.05) is significant and we can reject the null hypothesis. If the p-value is higher than the significance level, the result is reported as not statistically significant. (A quick scipy.stats example follows below.)
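As a small illustration of steps 2-4, here is a two-sample t-test in scipy.stats on made-up data (the group values are randomly generated, so this is only a sketch of the workflow):

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=100, scale=10, size=50)   # made-up sample data
group_b = rng.normal(loc=105, scale=10, size=50)

# Independent two-sample t-test (two-tailed by default)
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(t_stat, p_value)

alpha = 0.05   # significance level
if p_value < alpha:
    print("Reject the null hypothesis: the difference in means is statistically significant")
else:
    print("Fail to reject the null hypothesis")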

          1 or 2 Tailed Tests
One-tailed test: used when testing for an effect in one direction only (greater than or less than), i.e. left- or right-sided.
Two-tailed test: used when testing for a difference in either direction.

          Type 1 and Type 2 Errors
Type 1 Error: we reject the null hypothesis even though it is true (alpha).
Type 2 Error: we fail to reject the null hypothesis even though it is false (beta).

          Two Methods of accepting or rejecting Null Hypothesis
          1. Critical Values (taken from normal or t-distribution)
          2. P-values (if p < or > than 0.05)


        14. Python Notes: Pandas from getting started Tutorials

          Notes from Getting Started Tutorial:
          https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html

          1. Create a Pandas Series and DataFrame
          2. Create Graphs with Matplotlib
          3. Adding new Columns from Existing Columns (inc. split)
          4. Summary Statistics including groupby
          5. Reshape layout of tables incl. Sorting and Pivot tables
          6. Sub-sets and filtering
7. Combining data from multiple tables (concatenate (union) and merge (join))
          8. Working with timeseries (pd.to_datetime(), dt, DatetimeIndex and Resample)
          9. Textual data (.str and .replace and split)


          1. Creating a Dataframe and Series

          import pandas as pd
          
          #Create a series
          books= pd.Series(["Papillion", "Great Expectations", "Magic"], name="Books") #Create a series
          
          #Create DataFrame with Key value pairs. In this case the values are lists of values.
          
          df = pd.DataFrame(
              {
                  "Name": [
                      "Humpfries, Mr. Bill",
                      "Derby, Mrs. Sheila",
                      "Winfrey, Miss. Elizabeth",
                  ],
                  "Age": [44, 66, 42],
                  "Gender": ["male", "female", "female"],
                  "City": ["London", "Manchester", "Leeds"],
                  "Occupation": ["Doctor", "Architect", "Analyst"]
              }
          )
          
          #Using the list
          df["Age"] #Output Age Column
          #df["Age"].max() #outputs max age

          2. Create Plots with Matplotlib

          import pandas as pd
          import matplotlib.pyplot as plt
          
          #Create Pandas dataframe from csv file
          air_quality = pd.read_csv("air_quality_no2.csv", index_col=0, parse_dates=True)
          air_quality["station_paris"].plot() #craate plot of subset from DataFrame
          plt.show() 
          
          #Plot with Matplot lib
          air_quality.plot.scatter(x="station_london", y="station_paris", alpha=0.5)
          plt.show() 
          
          #Create scatter from DataFrame
          air_quality.plot.box() #Create box plot
          plt.show() #Display boxplot
          
          #subplots
          fig, axs = air_quality.plot.area(figsize=(10, 4), subplots=True)
          plt.show()
          


          3. Add new columns from existing columns

import pandas as pd
          #index_col=0, sets first column to index, parse_dates converts dates to date type
          air_quality = pd.read_csv("air_quality_no2.csv", index_col=0, parse_dates=True)
          
          #Add new column based on existing column - this adds column to DataFrame
          air_quality["london_mg_per_cubic"] = air_quality["station_london"] * 1.882
#Another example, this time a ratio
air_quality["ratio_paris_antwerp"] = (air_quality["station_paris"] / air_quality["station_antwerp"])
          
          #renaming columns
          air_quality_renamed = air_quality.rename(columns={"station_antwerp": "BETR801","station_paris": "FR04014", "station_london": "London Westminster"})
          

          4. Summary Statistics


          import pandas as pd
          #read_csv function - also read_ excel, sql, json, parquet and more
          
          titanic = pd.read_csv("titanic.csv") #read_csv function reads from csv
          titanic["Age"].mean() #average of Age column
          titanic[["Age", "Fare"]].median() #median of Age and Fare Coluns
          titanic[["Age", "Fare"]].describe() #summary stats of 2 columns
          titanic.agg({"Age": ["min", "max", "median", "skew"],"Fare": ["min", "max", "median", "mean"]})
           #multiple stats on Age and Fare columns
           
           #groupby to create summary tables
           titanic[["Sex", "Age"]].groupby("Sex").mean()
          
           #Average age by sex
           titanic.groupby("Sex").mean(numeric_only=True) 
#average of all numeric columns by Sex
           titanic.groupby(["Sex", "Pclass"])["Fare"].mean() #table average fare by sex and PClass
           
           titanic["Pclass"].value_counts() #count passengers in Pclass
           titanic.groupby("Pclass")["Pclass"].count() #count longer method


          5. Re-shape layout of tables

          import pandas as pd
          titanic = pd.read_csv("titanic.csv") #read_csv function reads from csv
          air_quality = pd.read_csv("air_quality_long.csv", index_col="date.utc", parse_dates=True)
          
          #Sorting Tables
sort1 = titanic.sort_values(by="Age").head() #sort by Age

sort2 = titanic.sort_values(by=['Pclass', 'Age'], ascending=False) #sort by Pclass then Age, descending
          
          # filter for no2 data only
          no2 = air_quality[air_quality["parameter"] == "no2"]
          
# get the top 2 rows per location (sorted by index)
no2_subset = no2.sort_index().groupby(["location"]).head(2)
          no2_subset
          #pivot table wide format
          air_quality.pivot_table(values="value", index="location", columns="parameter", aggfunc="mean")
          air_quality.pivot_table(values="value",index="location",columns="parameter",aggfunc="mean",margins=True)
          
          #wide to long format
          no2_pivoted = no2.pivot(columns="location", values="value").reset_index()
          no2_pivoted.head(5)
#pandas.melt() method converts a pivot table back to long format
no_2 = no2_pivoted.melt(id_vars="date.utc")
no_2.head(5)


          6. Sub-sets and Filtering

          import pandas as pd
#read csv file to pandas DataFrame
titanic = pd.read_csv("titanic.csv")
air_quality = pd.read_csv("air_quality_no2.csv", index_col=0, parse_dates=True)

#save as excel, index=False removes row index numbers from export
titanic.to_excel("titanic.xlsx", sheet_name="passengers", index=False)


#attributes and methods of DataFrame
titanic.shape #row and column count
titanic.info() #summary of the DataFrame
titanic.columns #column names
titanic.dtypes #data types
titanic.head(5) #head of df
titanic.tail(5) #tail of df
titanic.describe() #statistics of file

#Create sub-set of DataFrame as a Series
ages = titanic["Age"] #create sub-set from dataframe: note a single bracket creates a pandas Series.
type(titanic["Age"]) #confirm type of Age object

#Create sub-set of DataFrame as a DataFrame
age_sex = titanic[["Age", "Sex"]] #inner brackets are a list, outer brackets return a DataFrame

#Filtering rows: get passenger list with Age > 35
above_35 = titanic[titanic["Age"] > 35] #return rows with age over 35
#Filter passenger list with Pclass = 2 or 3 (2 ways of doing this)
class_23 = titanic[titanic["Pclass"].isin([2, 3])] #isin returns a Boolean mask where Pclass is 2 or 3
class_23 = titanic[(titanic["Pclass"] == 2) | (titanic["Pclass"] == 3)] #same result with an OR condition

#Filter passenger list with Age not NA
age_no_na = titanic[titanic["Age"].notna()]
#Filter passenger list and return list of Names
adult_names = titanic.loc[titanic["Age"] > 35, "Name"]

#Filter specified rows and columns
titanic.iloc[9:25, 2:5]

#set the 4th column of the first 3 rows to "anonymous"
titanic.iloc[0:3, 3] = "anonymous"



          7. Combining Data from tables

          import pandas as pd
          import os

#check current working directory
cwd = os.getcwd()
cwd

#Import data into DataFrames
titanic = pd.read_csv("titanic.csv")
air_quality_no2 = pd.read_csv("air_quality_no2_long.csv", parse_dates=True)
air_quality_pm25 = pd.read_csv("air_quality_long.csv", parse_dates=True)
stations_coord = pd.read_csv("air_quality_stations.csv")
air_quality_parameters = pd.read_csv("air_quality_parameters.csv")

#Create sub-sets
air_quality_no2 = air_quality_no2[["date.utc", "location",
"parameter", "value"]]

air_quality_pm25 = air_quality_pm25[["date.utc", "location",
"parameter", "value"]]

#Concatenate sub-sets - creates a union between the 2 tables.
#the keys argument adds another hierarchical index to show which table the data comes from
air_quality = pd.concat([air_quality_pm25, air_quality_no2], axis=0, keys=["PM25", "NO2"])
air_quality_no2.head(5)
air_quality_pm25.head(5)
air_quality_no2.shape #2068 rows, 4 columns
air_quality_pm25.shape #5274 rows, 4 columns
air_quality.shape #7340 rows, 4 columns

air_quality = air_quality.sort_values("date.utc")
air_quality.head(25)
air_quality.to_csv("checkcombine.csv")

#note you can reset the index to a column
air_quality.reset_index(level=0)

stations_coord = pd.read_csv("air_quality_stations.csv")

stations_coord.head()

#merge the columns from the stations_coord dataframe to the air_quality df
#use location as the key to join the two dataframes
#how="left" is a left join with air_quality on the left
air_quality = pd.merge(air_quality, stations_coord, how="left", on="location")
air_quality.head(10)
air_quality.to_csv("checkqualitymerge.csv")

air_quality_parameters.head(5)
#merge left join using the parameter column on the left and id on the right
air_quality = pd.merge(air_quality, air_quality_parameters,
how='left', left_on='parameter', right_on='id')

          8. Timeseries

          import pandas as pd

          import matplotlib.pyplot as plt

air_quality = pd.read_csv("air_quality_no2_long.csv")

air_quality = air_quality.rename(columns={"date.utc": "datetime"})
air_quality.head(5)
air_quality.columns

#unique cities in air_quality df
air_quality.city.unique()

#convert the datetime column of the air_quality dataframe to a datetime object instead of text
air_quality["datetime"] = pd.to_datetime(air_quality["datetime"])

air_quality["datetime"]

#can also do this to read the dates into datetime format
#pd.read_csv("../data/air_quality_no2_long.csv", parse_dates=["datetime"])

#check start and end datetime
air_quality["datetime"].min(), air_quality["datetime"].max()

#datetime difference using pandas timestamp
air_quality["datetime"].max() - air_quality["datetime"].min()

#Add a month column using the dt accessor
air_quality["month"] = air_quality["datetime"].dt.month

#Average no2 concentration by day of week by location
air_quality.groupby([air_quality["datetime"].dt.weekday, "location"])["value"].mean()

#Plot bar graph using hour on the x axis and average no2 on the y
fig, axs = plt.subplots(figsize=(12, 4))

air_quality.groupby(air_quality["datetime"].dt.hour)["value"].mean().plot(
kind='bar', rot=0, ax=axs)
plt.xlabel("Hour of the day"); # custom x label using Matplotlib
plt.ylabel("$NO_2 (µg/m^3)$");

#create pivot table
no_2 = air_quality.pivot(index="datetime", columns="location", values="value")
no_2.head(5)

#use index.year and index.weekday of the datetime index
no_2.index.year, no_2.index.weekday
#plot using a time range
no_2["2019-05-20":"2019-05-21"].plot();

#the resample method of a datetime index works like groupby. It takes an
#aggregation function, and "M" is time-based grouping by Month (month end)
monthly_max = no_2.resample("M").max()
monthly_max

monthly_max.index.freq #tells you the frequency attribute - e.g. MonthEnd

#Create plot of average daily no2 levels for each station
no_2.resample("D").mean().plot(style="-o", figsize=(10, 5));

          9. Textual Data

          import pandas as pd

titanic = pd.read_csv("titanic.csv")

titanic["Name"].str.lower()

#Split the Name field on the comma separator
titanic["Name"].str.split(",")
#create a Surname series using the first element of the split Name field
titanic["Surname"] = titanic["Name"].str.split(",").str.get(0)
titanic["Surname"]

#Return Name series with True/False if it contains Countess
titanic["Name"].str.contains("Countess")

#return rows where Name contains Countess
titanic[titanic["Name"].str.contains("Countess")]

#string length
titanic["Name"].str.len()
#find the row index of the longest name
titanic["Name"].str.len().idxmax()
#Return longest name
titanic.loc[titanic["Name"].str.len().idxmax(), "Name"]

titanic["Sex_short"] = titanic["Sex"].replace({"male": "M", "female": "F"})

titanic["Sex_short"]

        15. Useful Power BI DAX Code Examples

           DAX Code Examples:
          1. Using Variables
          2. FORMAT()
          3. HASONEVALUE()
          4. AND, &&
          5. CALCULATETABLE() and SUMMARIZE()
          6. USERELATIONSHIP()
          7. SWITCH()
          8. ISFILTERED() and making visual transparent
          9. SELECTEDVALUE() and creating a dynamic Graph Title
          10. FILTER and ADDCOLUMNS
          11. RANK()

          VAR: Using Variables

          Running Total =
VAR MaxDateInFilterContext = MAX ( Dates[Date] ) //variable 1: max date
VAR MaxYear = YEAR ( MaxDateInFilterContext ) //variable 2: year of max date
VAR DatesLessThanMaxDate = //variable 3: filter dates <= variable 1 within the year of variable 2
          FILTER (
          ALL ( Dates[Date], Dates[Calendar Year Number] ),
          Dates[Date] <= MaxDateInFilterContext
          && Dates[Calendar Year Number] = MaxYear
          )
          VAR Result = //variable 4 total sales filtered by variable 3
          CALCULATE (
          [Total Sales],
          DatesLessThanMaxDate
          )
          RETURN
          Result //return variable 4

FORMAT: Formatting Numbers
Actual Formatted = IF(SUM([actual]) > 1000000, FORMAT(SUM([actual]), "#,,.0M"), IF(SUM([actual]) >= 1000, FORMAT(SUM([actual]), "#,.0K")))

FORMAT(MIN([column]), "0.0%")
FORMAT(MIN([column]), "Percent")

e.g. if the matrix is filtered:
IF(ISFILTERED([field]), SELECTEDVALUE([column]))

HASONEVALUE: Check if a column has only one value in an IF
ValueCheck = IF(HASONEVALUE([column]), VALUES([field]))


FILTER the table where the related SalesTerritoryCountry is not "United States", then SUMX SalesAmount_USD
= SUMX(FILTER( 'InternetSales_USD', RELATED('SalesTerritory'[SalesTerritoryCountry]) <> "United States" ), 'InternetSales_USD'[SalesAmount_USD])

          AND, can also use &&

          Demand =
              SUMX (
                  FILTER (
                      RELATEDTABLE ( Assignments ),
                      AND (
                          [AssignmentStartDate] <= [TimeByDay],
                          [TimeByDay] <= [AssignmentFinishDate]
                      )
                  ),
                  Assignments[Av Per Day]
              )


          CALCULATETABLE, SUMMARIZE

          Calculate Table with Summarize and Filter

Order Profile =
CALCULATETABLE (
SUMMARIZE (
'Sales Table',
'Sales Table'[Order_Num_Key],
Customer[Sector],
"Total Value", SUM ( 'Sales Table'[Net Invoice Value] ),
"Order Count", DISTINCTCOUNT ( 'Sales Table'[Order_Num_Key] )
),
YEAR ( DimDate[Datekey] ) = YEAR ( TODAY () )
)

          USERELATIONSHIP Uses inactive relationship between tables

CALCULATE (
[Sales Amount],
Customer[Gender] = "Male",
Products[Color] IN { "Green", "Yellow", "White" },
USERELATIONSHIP ( Dates[Date], Sales[Due Date] ),
FILTER ( ALL ( Dates ), Dates[Date] < MAX ( Dates[Date] ) )
)


          SWITCH

          SWITCH(<expression>, <value>, <result>[, <value>, <result>]…[, <else>])
= SWITCH([Month], 1, "January", 2, "February", 3, "March", 4, "April", 5, "May", 6, "June", 7, "July", 8, "August", 9, "September", 10, "October", 11, "November", 12, "December", BLANK() ) //place each pair on a separate line for readability

          SWITCH with Measure

= SWITCH(TRUE(), 
        [measure] = "Turnover", [Turnover],
        [measure] = "Profit", [Profit],
        BLANK()
)

          Visuals
          ISFILTERED()
          Check Filtered = ISFILTERED([column])

          Dynamic Visual 

MakeTransparent = 
IF([Check Filtered], "#FFFFFF00", // returns transparent: 8-digit hex where the final 00 is the alpha value
"White")

Message = IF([Check Filtered], "", "Please select a row")

Dynamic Graph Title
Graph year title = SELECTEDVALUE([columnname]) & " - My graph Title"

ADDCOLUMNS
New Table >
Creates a new table and adds columns to it (and in this case also filters it):

2013Sales = FILTER(ADDCOLUMNS(FactInternetSales, "Dates", FactInternetSales[OrderDate], "Sales2", SUM(FactInternetSales[SalesAmount])), FactInternetSales[OrderYear] = 2013)

          RANK by 2 Columns (calculated column)
Measure = RANKX( FILTER( ALL('Table'), 'Table'[Customer] = EARLIER('Table'[Customer]) ), 'Table'[Txn Number],, DESC, DENSE)

          Creates a rank for each customer based on the txn number for each customer

        16. Python Notes: Python Lists Examples


          Create a list
          Simple list
mylist = ["apple", "orange", "banana", "mango", "pineapple"]
mylist[2] # returns 'banana'
mylist[-1] # returns 'pineapple'
mylist[2:4] # returns ['banana', 'mango'] (the end index is not included)
if "apple" in mylist:
  print("Yes, 'apple' is in the fruits list")
mylist.insert(2, "watermelon") #insert at the position specified
mylist.append('grapes') #adds to end of list
mylist = ["apple", "banana", "cherry"]
tropical = ["mango", "pineapple", "papaya"]
mylist.extend(tropical) # adds the tropical list to mylist
mylist.remove("banana") #removes first occurrence
mylist.pop(1) #removes the element at the given index
del mylist[0] #also removes an element by index
del mylist # deletes the list
mylist.clear() # empties the list

for i in mylist: #loop through the list
  print(i)

i = 0
while i < len(mylist): #while loop
  print(mylist[i])
  i = i + 1

newlist = []
for x in mylist: #adds list items containing "a" to a new list
  if "a" in x:
    newlist.append(x)

mylist.sort() #sort alphabetically or numerically depending on data type
mylist.sort(reverse = True) #sort in reverse
mylist.sort(key = str.lower) #case-insensitive sort using lower-case keys
mylist.reverse() #reverse the order of the list
newlist = mylist.copy() #copy the list
newlist = list(mylist) #also makes a copy of the list
list3 = list1 + list2 #concatenate lists





A list of lists: 5 rows, 2 columns

b1 = [[2308, 6], [2408, 6.2], [2508, 5.8], [2608, 5.6], [2708, 5.9]] #create the list
print(b1) #print the list
print(len(b1)) #print length of list
print(type(b1)) #print data type of list
print(b1[:2]) #print the first 2 elements of the list (indices 0 and 1; the end index is excluded)
print(b1[3:]) #print from index 3 (the 4th element) onwards

          Reference Index in array

b1[3] # returns the fourth element (index 3)
b1[0:2] # returns the first and second elements of the list
b1[-1] # returns the last element

          Add and Delete, Update
          Delete
          del(b1[0]) # delete first element
          or
          b1.pop(1) #remove second element


          Insert
          b1.insert(1,[24082, 111]) #insert element at position 1
Update
b1[2] = [2808, 6.7] #update the value at index 2

          Append
          Additions = [[29008, 6.6], [3008, 6], [3108, 6.1]]
          b1.extend(Additions) #Adds the Additions list to b1
          b1.append([2808,6.6]) # add 1 element (only 1)

          Clear the list
          b1.clear() #empties the list

          Check if element in array

if [2308, 6] in b1: print("yes") #check whether the element is in the list
          Loop
          for x in b1: print(x)

def country_select(countrylist):
    count = 1
    for i in countrylist:
        print(f"{count}: {i}") #print the position and the item
        count = count + 1
    return countrylist



          Numpy

          baseball = [180, 215, 210, 210, 188, 176, 209, 200]
          import numpy as np
          np_baseball = np.array(baseball)
          print(type(np_baseball))
          mean = np_baseball.mean()
print("mean is: " + str(mean))
med = np.median(np_baseball)
print("median is: " + str(med))
std = np.std(np_baseball)
print("standard deviation is: " + str(std))

        17. Simple Linear Regression Example with Python

          Linear Regression using Salary and Years of Experience Data

          Data Source: Salary_dataset.csv Kaggle
          The salary data set includes 2 columns: Years Experience which will be our independent variable (X) and Salary (Y).

          Linear regression is a fundamental statistical method used to model the relationship between a dependent variable and one or more independent variables. The primary goal of linear regression is to predict the value of the dependent variable based on the values of the independent variables (Chat GPT)

          For this example: First, we want to see if there is a correlation between the 2 variables by building a regression line and calculating r squared. Then we want to assess the significance of the relationship using the p-value to test the null hypothesis that there is no relationship between X and Y (X does not predict Y).

          Simple Linear Regression
          Formula
          Formula: Y = b + aX + e

          1. Dependent Variable (Y): The outcome or the variable we aim to predict or explain.
          2. Independent Variable(s) (X): The variable(s) used to predict or explain changes in the dependent variable.
3. a: the slope; the change in Y for a one-unit change in X.
4. b: the intercept on the Y axis; this represents the value of Y when X is zero. (A quick worked example with made-up numbers follows below.)
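As a quick worked example of the formula with made-up numbers (not the actual coefficients from the salary dataset):

b = 25000      # hypothetical intercept: predicted salary at zero years of experience
a = 9500       # hypothetical slope: extra salary per additional year of experience
X = 5          # years of experience

Y = b + a * X  # 25000 + 9500 * 5 = 72500
print(Y)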

The following Python code (which I used ChatGPT to optimize) calculates the regression line and p-value, and evaluates the null hypothesis.

            The steps are as follows:
            1. Use Pandas to import the CSV file to a data frame and convert each series to an array.
            2. Fit a linear regression model using Sklearn that returns the slope and intercept of the regression line.
3. The statsmodels library is then used to calculate the R-squared value and the p-value of the slope.
4. The null hypothesis is then evaluated based on the p-value.
5. scipy.stats is used to fit the line again for comparison, and matplotlib.pyplot is used to plot the regression line on a graph.
          
          import pandas as pd
          import numpy as np
          import matplotlib.pyplot as plt
          from sklearn.linear_model import LinearRegression
          import statsmodels.api as sm
          from scipy import stats
          
          def load_data(csv_location):
              """Load CSV data into a pandas DataFrame."""
              df = pd.read_csv(csv_location)
              return df
          
          def prepare_data(df):
              """Prepare independent and dependent variables for regression."""
              X = df[['YearsExperience']].values  # Extract as 2D array
              y = df['Salary'].values  # Extract as 1D array
              return X, y
          
          def fit_sklearn_model(X, y):
              """Fit a linear regression model using sklearn."""
              model = LinearRegression()
              model.fit(X, y)
              return model
          
          def fit_statsmodels_ols(X, y):
              """Fit a linear regression model using statsmodels OLS."""
              X_with_const = sm.add_constant(X)  # Add an intercept to the model
              model = sm.OLS(y, X_with_const).fit()
              return model
          
          def plot_regression_line(df, intercept, slope):
              """Plot the regression line along with data points."""
              plt.scatter(df['YearsExperience'], df['Salary'], color='blue', label='Data points')
              plt.plot(df['YearsExperience'], intercept + slope * df['YearsExperience'], color='red', label='Regression line')
              plt.title("Salary by Years of Experience")
              plt.xlabel('Years of Experience')
              plt.ylabel('Salary')
              plt.legend()
              plt.show()
          
          def main():
              csv_location = "salary_dataset.csv"
              df = load_data(csv_location)
          
              # Display basic statistics
              #print(df.describe())
          
              X, y = prepare_data(df)
          
              # Fit the model using sklearn
              sklearn_model = fit_sklearn_model(X, y)
              intercept, slope = sklearn_model.intercept_, sklearn_model.coef_[0]
              
              print("Calculation of Regression Line:\n")
              print(f"Intercept is: {intercept}")
              print(f"Slope is: {slope}")
          
              # Fit the model using statsmodels to get p-values and R-squared
              statsmodels_model = fit_statsmodels_ols(X, y)
            #  print(statsmodels_model.summary())
          
              # Extract R-squared and p-values
              r_squared = statsmodels_model.rsquared
              p_values = statsmodels_model.pvalues
          
              print(f"R-squared: {r_squared}")
              #print(f"P-values: {p_values}")
          
              # Extracting specific p-values by index
              intercept_p_value = p_values[0]  # First p-value (intercept)
              slope_p_value = p_values[1]  # Second p-value (YearsExperience)
          
              #print(f"Intercept p-value: {intercept_p_value}")
              print(f"p-value (YearsExperience): {slope_p_value}")
          
              print("\nThe p-value is the probability of observing a t-statistic as extreme as, or more extreme than, the one calculated from your sample data, under the assumption that the null hypothesis is true.") 
              print("This is obtained from the t-distribution with n−2 degrees of freedom ")
              print("where n is the number of observations\n")
          
    if slope_p_value > 0.05:
        print("P-value is not significant and therefore we fail to reject the null hypothesis")
              if slope_p_value < 0.05:
                  print("P-value is less than 0.05 and therefore we reject the null hypothesis. This means there is strong evidence that the predictor 𝑋 has a statistically significant effect on the outcome 𝑌")
              # Plotting the regression line
              plot_regression_line(df, intercept, slope)
          
              # Fit a linear regression line using scipy.stats (for comparison)
              slope, intercept, r_value, p_value, std_err = stats.linregress(df['YearsExperience'], df['Salary'])
             # plt.text(df['YearsExperience'].min(), df['Salary'].max(), f'y = {slope:.2f}x + {intercept:.2f}', ha='left')
          
          if __name__ == "__main__":
              main()
          



        18. Python Notes: Pandas Code Examples

          Pandas Quick Guide

          Modules

          import pandas as pd #pandas module
          import scipy.stats as stats #stats sub-module
          from bs4 import BeautifulSoup #scrape module
          import requests #API module
          import whois as whois #whois module


          Files
df = pd.read_csv("CookieData.csv") #import CSV file to dataframe df

#define index column and convert dates to date format

air_quality = pd.read_csv("air_quality_long.csv", index_col="date.utc", parse_dates=True)

air_quality.to_csv("checkcombine.csv") #save dataframe to csv file
tests.to_excel("tests.xlsx", sheet_name="expt", index=False)



          Dataframe info

          df.describe() #output descriptive statistics of dataframe (numeric columns)

          df.info() #display dataframe columns datatypes No-Null
          df.dtypes #output datatypes of dataframe

          df.shape #outputs rows and columns
          df.sample(5) #output sample of 5 rows

          df.columns #ouput columns
          titanic.head(5) #print head of dataframe (top five rows)
          titanic.tail(5) #print tail of dataframe (bottom five rows)
          type(df) #output object type

          Convert
gg['YearString'] = gg['Year'].astype("string") #convert series to string
gg["YearInterval"] = gg["YearString"].str[:3] #LEFT-type function, get the first 3 characters of the string

gg["YearInterval"] = gg["YearInterval"].astype(int) #convert series to integer



          Delete Columns
del df2["Outlier_Flag"]

          Sub-set

df2 = df[["version", "retention_1"]]

          Combine Sub-sets

#Concatenate sub-sets - creates a union between the 2 tables. The keys argument adds another hierarchical index to show which table the data comes from

air_quality = pd.concat([air_quality_pm25, air_quality_no2], axis=0, keys=["PM25", "NO2"]) # creates union

Merge the columns from the stations_coord dataframe to the air_quality df. Use location as the key to join the two dataframes; how="left" is a left join with air_quality on the left:
air_quality = pd.merge(air_quality, stations_coord, how="left", on="location")

Merge left join using the parameter column on the left and id on the right:

air_quality = pd.merge(air_quality, air_quality_parameters,
how='left', left_on='parameter', right_on='id')



          Dataframe Aggregate Functions
          titanic[“Age”].mean()
          titanic[[“Age”, “Fare”]].median()


          Sort Dataframe
air_quality = air_quality.sort_values("date.utc")
sort1 = titanic.sort_values(by="Age").head()
sort2 = titanic.sort_values(by=['Pclass', 'Age'], ascending=False)
no2_subset = no2.sort_index().groupby(["location"]).head(2) #group and sort by index


          Filtering Dataframe

          df2 = df[df.Outlier_Flag == 0]

gate_30 = df[df.version == "gate_30"]
no2 = air_quality[air_quality["parameter"] == "no2"]
above_35 = titanic[titanic["Age"] > 35]
class_23 = titanic[(titanic["Pclass"] == 2) | (titanic["Pclass"] == 3)]

age_no_na = titanic[titanic["Age"].notna()] #return rows where Age is not NA

#Return Names of passengers over 35

adult_names = titanic.loc[titanic["Age"] > 35, "Name"]

titanic.iloc[9:25, 2:5] #Return specified rows and columns
titanic.iloc[0:3, 3] = "anonymous" #Change the 4th column of the first 3 rows to anonymous

          def city_select(country): #function with country as parameter
              df2 = df[df["country"] == country] #slice dataframe where country == the given country, e.g. United States
              df2 = df2.sort_values(by=["population"], ascending=False) #sort dataframe by population descending
              df3 = df2.iloc[0:5] #new dataframe with the top 5 rows by population
              df4 = df3["city"] #select the city series (top 5 cities by population)
              return df4

          def latlong(country, city): #define function
              selection = df[(df["country"] == country) & (df["city"] == city)] #filter dataframe by 2 criteria
              return selection #return filtered dataframe

          def get_lat(selection): #define function
              latitude = selection["lat"] #latitude = series with "lat" as the column
              lat = latitude.iloc[0] #get the first row of the latitude series
              return lat #return the first row
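
          For context, a quick sketch of how these helpers might be called; the source file name and its country/city/population/lat columns are assumptions carried over from the snippets above:

          df = pd.read_csv("worldcities.csv") #assumed file with country, city, population and lat columns
          print(city_select("United States")) #top 5 US cities by population
          selection = latlong("United States", "New York")
          print(get_lat(selection)) #latitude of the first matching row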



          Groupby

          df.groupby("version").count() #count of rows for each version
          df.groupby("version").sum() #sum of numeric columns for each version
          air_quality.groupby([air_quality["datetime"].dt.weekday, "location"])["value"].mean() #average value by weekday and location

          air_quality.groupby(air_quality["datetime"].dt.hour)["value"].mean().plot(kind="bar", rot=0, ax=axs) #average value by hour, plotted as a bar chart

          gg3 = gg.groupby(["YearInterval", "Month"], as_index=False) #group by YearInterval and Month (aggregate afterwards)
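
          A minimal end-to-end sketch of the split-apply-combine pattern, using a tiny invented dataframe:

          sales = pd.DataFrame({
              "version": ["gate_30", "gate_30", "gate_40"],
              "sum_gamerounds": [12, 7, 20],
          })
          print(sales.groupby("version")["sum_gamerounds"].mean()) #average rounds per version
          print(sales.groupby("version", as_index=False).agg(total=("sum_gamerounds", "sum"))) #named aggregation, version kept as a column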



          Create Pivot Table

          table = pd.pivot_table(gate_30, values="sum_gamerounds", index=["version"], columns=["retention_1"],
          aggfunc="mean", margins=True) #note margins=True adds row/column totals

          no_2 = air_quality.pivot(index="datetime", columns="location", values="value") #reshape long to wide, no aggregation
          air_quality.pivot_table(values="value", index="location", columns="parameter", aggfunc="mean") #wide format with aggregation
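
          A small sketch of the difference between pivot (pure reshape) and pivot_table (reshape plus aggregation); the readings dataframe is invented for illustration:

          readings = pd.DataFrame({
              "datetime": ["2019-05-20", "2019-05-20", "2019-05-21"],
              "location": ["London", "Paris", "London"],
              "value": [23.0, 41.0, 19.5],
          })
          wide = readings.pivot(index="datetime", columns="location", values="value") #one row per datetime, one column per location
          avg = readings.pivot_table(values="value", index="location", aggfunc="mean", margins=True) #mean per location plus an "All" total row
          print(wide)
          print(avg)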


          DateTime
          df["datetime"] = pd.to_datetime(df["datetime"]) #convert the dates in the dataframe to datetime format

          pd.read_csv("../data/df_long.csv", parse_dates=["datetime"]) #does the same at import time
          df["datetime"].min(), df["datetime"].max() #start and end datetime
          df["datetime"].max() - df["datetime"].min() #difference (Timedelta) between the 2 dates
          df["month"] = df["datetime"].dt.month #add a month column derived from the datetime column
          df.index.year, df.index.weekday #return the year and weekday of a datetime index
          df_pivoted = df.pivot(columns="location", values="value").reset_index() #reset pivot table index
          df = df_pivoted.melt(id_vars="date.utc") #unpivot back to long format

          gg["YearMonth"] = pd.to_datetime(dict(year=gg.Year, month=gg.Month, day=1)) #create a date from the Year and Month columns, with day=1 for the first of the month

          gg3 = gg.groupby(["YearInterval", "Month"], as_index=False) #group by YearInterval and Month
          gg4 = gg3.max() #aggregate each group using max
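
          A compact sketch of the datetime round trip above, with invented timestamps:

          events = pd.DataFrame({"datetime": ["2019-05-20 10:00", "2019-05-21 14:30"], "value": [1, 2]})
          events["datetime"] = pd.to_datetime(events["datetime"]) #string column to datetime64
          events["weekday"] = events["datetime"].dt.weekday #0 = Monday
          span = events["datetime"].max() - events["datetime"].min() #Timedelta of 1 days 04:30:00
          print(events, span, sep="\n")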


          Plot

          df["2019-05-20":"2019-05-21"].plot(); #plot this date range (requires a datetime index)
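
          Slicing rows with date strings like this only works when the dataframe has a DatetimeIndex; a minimal sketch with invented data:

          import matplotlib.pyplot as plt

          ts = pd.DataFrame({"value": [3, 5, 4, 6]}, index=pd.date_range("2019-05-19", periods=4, freq="D"))
          ts["2019-05-20":"2019-05-21"].plot() #plot only the rows in that date range
          plt.show()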



          Working with Columns

          air_quality.city.unique() #return the unique cities in the city column
          air_quality = air_quality.rename(columns={"date.utc": "datetime"}) #rename columns

          Split Columns in Dataframe
          titanic["Name"].str.split(",") #split the Name column on the comma
          titanic["Surname"] = titanic["Name"].str.split(",").str.get(0) #split the Name column on the comma and return the first value in the array

          df["Name"].str.lower() #change the Name column to lowercase
          df["Name"].str.contains("Countess") #return True where the column contains "Countess"

          df["Name"].str.len() #return the string length of each Name
          df["Name"].str.len().idxmax() #return the index of the row with the longest name
          df.loc[df["Name"].str.len().idxmax(), "Name"] #return the longest name itself using loc
          #rename the gender column and replace the genders with M and F in the dataframe
          df["gender_short"] = df["gender"].replace({"male": "M", "female": "F"}) #replace values in the gender column with M and F
          titanic["Pclass"].isin([2, 3]) #check whether each Pclass value is 2 or 3
          adult_names = titanic.loc[titanic["Age"] > 35, "Name"] #return the Names of passengers over 35
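
          A tiny worked example of the string accessor calls above, on an invented two-row frame:

          people = pd.DataFrame({"Name": ["Smith, Mr. John", "Jones, Countess Anne"], "gender": ["male", "female"]})
          people["Surname"] = people["Name"].str.split(",").str.get(0) #"Smith", "Jones"
          people["gender_short"] = people["gender"].replace({"male": "M", "female": "F"})
          print(people[people["Name"].str.contains("Countess")]) #rows whose Name contains "Countess"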


          Print

          print("\n1 Day Retention by Version") #\n for a new line
          print(f"You searched for: {domain_name}\n") #f-string lets you include placeholders in the printed text

          Loops

          for link in a: #a is assumed to be a list of <a> tags, e.g. from soup.find_all("a")
              print(link.get("href")) #print the href attribute of each link
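
          Putting the scraping pieces together, a minimal sketch (the URL is a placeholder):

          import requests
          from bs4 import BeautifulSoup

          response = requests.get("https://example.com") #placeholder URL
          soup = BeautifulSoup(response.text, "html.parser")
          a = soup.find_all("a") #every anchor tag on the page
          for link in a:
              print(link.get("href"))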

          Input
          domain = input("Enter domain to search for: ")
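
          Tying the whois import, the input prompt and the f-string print together, a short sketch (whois.whois() is the lookup call in the python-whois package; the domain_name field access is illustrative):

          import whois

          domain = input("Enter domain to search for: ")
          result = whois.whois(domain) #query the WHOIS record for the domain
          domain_name = result.domain_name #may be a string or a list depending on the registrar
          print(f"You searched for: {domain_name}\n")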