  • Comparison of Full Data Pipelines from Data Ingestion to Data Science

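    A stage-by-stage comparison of three end-to-end data pipeline stacks, from ingestion through data science: Microsoft / Fabric, Snowflake + dbt (cloud-agnostic), and Google Cloud (GCP).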

    Technology Data Flow
    Code Data Flow

    Technology Data Flow

    Each stage lists the components for three paths:
      Path 1: Microsoft / Fabric
      Path 2: Snowflake + dbt (Cloud-agnostic)
      Path 3: Google Cloud (GCP)

    Sources & Ingestion
      Microsoft / Fabric: Azure Data Factory (ADF); Fabric Dataflows Gen2; Event Hubs / IoT Hub (stream); ADF Copy Activity, REST, ODBC/JDBC
      Snowflake + dbt: Snowpipe (auto-ingest) + Stages; Fivetran / Stitch / Airbyte; Kafka / Kinesis via connectors; AWS Glue jobs (optional)
      Google Cloud (GCP): Cloud Data Fusion (GUI ETL); Pub/Sub (stream); Dataflow (Beam) ingestion; Storage Transfer / Transfer Service

    Raw Landing / Data Lake
      Microsoft / Fabric: Azure Data Lake Storage Gen2; OneLake (Fabric); Delta/Parquet zones: /raw /bronze
      Snowflake + dbt: External Stages on S3/Azure/GCS; Internal Stages (Snowflake-managed); Raw files (CSV/JSON/Parquet)
      Google Cloud (GCP): Google Cloud Storage (GCS); Raw buckets (landing); Formats: Avro/Parquet/JSON

    Orchestration
      Microsoft / Fabric: ADF Pipelines & Triggers; Fabric Pipelines; Azure Functions (events); Azure DevOps/GitHub Actions (runs)
      Snowflake + dbt: Airflow / Dagster / Prefect; Snowflake Tasks & Streams; dbt Cloud scheduler; CI via GitHub Actions
      Google Cloud (GCP): Cloud Composer (Airflow); Workflows / Cloud Scheduler; Dataform (dbt-like) scheduling

    Transform (ELT / ETL)
      Microsoft / Fabric: Fabric Data Engineering (Spark); Azure Databricks (Delta); T-SQL in Fabric Warehouse; Synapse SQL/Spark (legacy)
      Snowflake + dbt: dbt models (SQL + Jinja); Snowflake SQL (MERGE/Tasks); Snowpark (Python/Scala); Streams for CDC
      Google Cloud (GCP): BigQuery SQL (ELT); Dataflow (Beam) for heavy lift; Dataproc (Spark) when needed; Dataform/dbt for modeling

    Curated / Serving Warehouse
      Microsoft / Fabric: Fabric Warehouse / Lakehouse; Dedicated SQL Pools (Synapse); Delta tables (silver/gold)
      Snowflake + dbt: Snowflake (Databases/Schemas); Time Travel, Cloning; Materialized Views
      Google Cloud (GCP): BigQuery Datasets; Partitioned & clustered tables; Materialized Views

    Semantic Layer / Modeling
      Microsoft / Fabric: Power BI Datasets (Tabular); Calculation Groups (TE); Row-Level Security (RLS); Power BI Deployment Pipelines
      Snowflake + dbt: dbt semantic models & metrics; Headless BI (Cube/Virt.); RLS via Snowflake roles/policies; DirectQuery/Live connections
      Google Cloud (GCP): Looker (LookML semantic layer); Looker Explore/Views/Models; BigQuery Authorized Views; Row/column policy tags

    BI / Visualization & Analysis
      Microsoft / Fabric: Power BI (Desktop/Service); Paginated Reports (RDL); Excel over Power BI
      Snowflake + dbt: Power BI / Tableau / Looker Studio; Sigma / Mode (optional); Embedded analytics
      Google Cloud (GCP): Looker (first-class); Looker Studio (lightweight); Data Catalog-linked exploration

    Data Science / ML
      Microsoft / Fabric: Azure ML (AutoML, MLOps); Databricks ML + MLflow; SynapseML / ONNX
      Snowflake + dbt: Snowpark ML / UDFs; External: SageMaker / Databricks; Feature Store via Snowflake/Feast
      Google Cloud (GCP): Vertex AI (AutoML, pipelines); BigQuery ML (in-SQL models); Feature Store (Vertex)

    Data Quality / Governance
      Microsoft / Fabric: Microsoft Purview (Catalog/Lineage); Power BI lineage & sensitivity; Great Expectations (optional)
      Snowflake + dbt: Snowflake RBAC, Tags, Masking; dbt tests, Great Expectations; Monte Carlo/Bigeye (obs.)
      Google Cloud (GCP): Dataplex (governance); Data Catalog (metadata); DQ via Dataform tests / GE

    DevOps / CI-CD & Infra
      Microsoft / Fabric: Azure DevOps / GitHub Actions; Power BI Deployment Pipelines; IaC: Bicep / Terraform
      Snowflake + dbt: GitHub Actions + dbt CI; schemachange / SnowChange; IaC: Terraform / Pulumi
      Google Cloud (GCP): Cloud Build / Cloud Deploy; Dataform CI, dbt CI; IaC: Terraform

    Monitoring / Cost Control
      Microsoft / Fabric: Azure Monitor / Log Analytics; Fabric Workspace metrics; Cost Mgmt + Budgets
      Snowflake + dbt: Snowflake Resource Monitors; Query History, Access History; 3rd-party cost dashboards
      Google Cloud (GCP): Cloud Monitoring & Logging; BigQuery INFORMATION_SCHEMA; Budgets + Alerts
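
    The stage order in the comparison above can be read as a dependency graph, which is roughly what each path's orchestrator (ADF pipelines, Airflow, Cloud Composer) ends up encoding. Below is a minimal, tool-agnostic sketch in Python. The stage names and the edges between them are assumptions derived from the stage rows, not the API or configuration format of any listed product.

```python
from graphlib import TopologicalSorter

# Hypothetical stage names based on the comparison's stage rows.
# Each stage maps to the set of stages it depends on; the edges are an
# assumption for illustration, not something the source specifies.
STAGE_DEPENDENCIES = {
    "ingestion": set(),
    "raw_landing": {"ingestion"},
    "transform": {"raw_landing"},
    "curated_warehouse": {"transform"},
    "semantic_layer": {"curated_warehouse"},
    "bi_reporting": {"semantic_layer"},
    "data_science": {"curated_warehouse"},
}


def execution_order(dependencies):
    """Return one valid run order in which every stage follows its inputs."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for step, stage in enumerate(execution_order(STAGE_DEPENDENCIES), start=1):
        print(f"{step}. {stage}")
```

    Stages with no dependency between them, such as the semantic layer and data science branches here, may come out in either order, so a real orchestrator would be free to run them in parallel.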

    Code Data Flow

    Each stage lists the code artifacts for three paths: Microsoft / Fabric, Snowflake + dbt, and Google Cloud (GCP).

    Ingestion Code
      Microsoft / Fabric: Python ETL (requests, pyodbc); ADF / Fabric pipeline JSON; Dataflow Gen2 JSON
      Snowflake + dbt: CREATE PIPE / CREATE STAGE; Airbyte / Fivetran configs (YAML); COPY OPTIONS
      Google Cloud (GCP): Apache Beam (Py/Java); Cloud Data Fusion JSON; Pub/Sub schema JSON

    Raw Landing Config
      Microsoft / Fabric: ADLS / OneLake folder layout; Parquet / Delta write options; Access policies (JSON)
      Snowflake + dbt: Stages & File format DDL; CSV / JSON / Parquet; Grants & policies
      Google Cloud (GCP): GCS bucket layout; Lifecycle rules JSON; BQ external table DDL

    Orchestration Code
      Microsoft / Fabric: ADF pipeline JSON + triggers; Fabric Pipeline YAML; Azure Functions (Python)
      Snowflake + dbt: Airflow DAGs (Python); Prefect flows (Python); Snowflake TASKS SQL
      Google Cloud (GCP): Cloud Composer DAGs (Python); Cloud Scheduler jobs; Dataform schedules

    Transform / Modeling
      Microsoft / Fabric: Databricks notebooks (Py/Spark); Delta Live Tables pipelines; T-SQL stored procs
      Snowflake + dbt: dbt models (*.sql); dbt Jinja macros (*.sql); Snowpark (Python) UDFs
      Google Cloud (GCP): BigQuery SQL models (*.sql); Dataform/dbt *.sqlx + yaml; Dataproc Spark notebooks

    CDC / Merge to Curated
      Microsoft / Fabric: MERGE INTO (T-SQL); PySpark notebook jobs; Delta OPTIMIZE/VACUUM
      Snowflake + dbt: MERGE INTO curated.* SQL; Streams for CDC; Materialized Views
      Google Cloud (GCP): MERGE INTO USING staging; Partition / Cluster DDL; Stored procedures

    Semantic Layer
      Microsoft / Fabric: Tabular model (TMDL); Calc groups (TE script); RLS DAX expressions
      Snowflake + dbt: dbt semantic models (YAML); metrics.yaml / exposures; Masking policies (SQL)
      Google Cloud (GCP): LookML view/model files; Explores & joins; Policy tags

    BI / Report Code
      Microsoft / Fabric: Power BI PBIX / PBIT; Paginated RDL XML; PowerQuery M scripts
      Snowflake + dbt: Tableau / Power BI; BI SQL views; Sigma workbooks
      Google Cloud (GCP): Looker dashboards (lkml); Looker Studio reports; BQ UDFs (JS)

    Data Science Code
      Microsoft / Fabric: Azure ML notebooks (Python); MLflow tracking code; ONNX export
      Snowflake + dbt: Snowpark-ML notebooks (Py); UDF registration SQL; MLflow registry
      Google Cloud (GCP): Vertex AI notebooks (Python); BQML CREATE MODEL SQL; Vertex pipelines (YAML)

    Tests & Data Quality
      Microsoft / Fabric: Great Expectations suites; Power BI model tests (DAX); Custom pytest checks
      Snowflake + dbt: dbt tests (schema.yml); Great Expectations suites; SQL anomaly checks
      Google Cloud (GCP): Dataform tests (assertions); Great Expectations in Beam; INFORMATION_SCHEMA queries

    CI/CD Config
      Microsoft / Fabric: GitHub Actions YAML; Power BI Deployment Pipelines; Bicep steps
      Snowflake + dbt: dbt Cloud job YAML; GitHub Actions for dbt; Terraform scripts
      Google Cloud (GCP): Cloud Build YAML; BQ deploy scripts; Terraform modules

    Infra as Code
      Microsoft / Fabric: Bicep / Terraform templates; Azure DevOps variable groups
      Snowflake + dbt: Terraform (Snowflake provider); SnowChange / schemachange
      Google Cloud (GCP): Terraform (GCS, BQ, VPC); IAM/Secrets configs
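
    All three paths list a MERGE from staging into a curated table under CDC / Merge to Curated. The sketch below illustrates that upsert pattern locally with Python's built-in sqlite3 module. SQLite has no MERGE statement, so INSERT ... ON CONFLICT DO UPDATE stands in for it here; the table names, column names, and the "newer row wins" rule are hypothetical and chosen only for illustration.

```python
import sqlite3


def merge_staging_into_curated(conn):
    """Upsert every staging row into the curated table by primary key."""
    # Stand-in for "MERGE INTO curated USING staging": insert new keys,
    # and update existing keys only when the staging row is newer.
    conn.execute(
        """
        INSERT INTO curated_customers (customer_id, email, updated_at)
        SELECT customer_id, email, updated_at
        FROM staging_customers
        WHERE true  -- SQLite needs a WHERE here to parse ON CONFLICT after SELECT
        ON CONFLICT (customer_id) DO UPDATE SET
            email = excluded.email,
            updated_at = excluded.updated_at
        WHERE excluded.updated_at > curated_customers.updated_at
        """
    )
    conn.commit()


def build_demo_database():
    """Create an in-memory database with hypothetical demo rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE curated_customers (
            customer_id INTEGER PRIMARY KEY, email TEXT, updated_at TEXT);
        CREATE TABLE staging_customers (
            customer_id INTEGER PRIMARY KEY, email TEXT, updated_at TEXT);
        INSERT INTO curated_customers VALUES
            (1, 'old@example.com', '2024-01-01'),
            (2, 'keep@example.com', '2024-03-01');
        INSERT INTO staging_customers VALUES
            (1, 'new@example.com', '2024-02-01'),
            (2, 'stale@example.com', '2024-02-01'),
            (3, 'added@example.com', '2024-02-01');
        """
    )
    return conn


if __name__ == "__main__":
    conn = build_demo_database()
    merge_staging_into_curated(conn)
    for row in conn.execute(
        "SELECT * FROM curated_customers ORDER BY customer_id"
    ):
        print(row)
```

    In this demo, customer 1 is updated because the staging row is newer, customer 2 is left alone because the staging row is older, and customer 3 is inserted. In T-SQL, Snowflake, or BigQuery the same logic would normally be written as a single MERGE with WHEN MATCHED and WHEN NOT MATCHED branches.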