Data recruiting guide

Technical test for Azure Synapse Analytics: SQL pools, Spark, integration

Azure Synapse Analytics unifies analytical SQL and Spark in a single platform. In an interview, the goal is to assess whether the candidate can choose the right pool for the use case.

Data Builder · June 2025 · 6 min read · Data Engineer

Contents

Synapse architecture
Dedicated SQL Pool
Serverless SQL Pool
Spark Pool
Azure Data Lake integration
Synapse vs Databricks vs Fabric
Evaluation grid

1. Synapse architecture: the 3 engines

Discriminating question

What are the 3 compute engines in Azure Synapse? When do you use each one?

Dedicated SQL Pool — MPP (Massively Parallel Processing) data warehouse. Maximum performance for analytical SQL queries on large volumes. Compute is billed per hour for the provisioned capacity (DWUs), even when idle, until the pool is paused
Serverless SQL Pool — ad-hoc SQL queries on files in Azure Data Lake (Parquet, CSV, JSON, Delta). Billed per TB of data processed by each query. Always available, no infrastructure to provision
Spark Pool — managed Apache Spark cluster for data transformation, ML, distributed processing
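
As a rough illustration, the rules of thumb above can be written as a small decision function. This is a simplified sketch: the `Workload` fields and the order of the rules are assumptions made for this example, not an official Microsoft decision tree.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a workload (fields are illustrative)."""
    needs_python_or_ml: bool    # custom Python/Scala logic or ML training
    steady_bi_load: bool        # predictable, heavy BI queries on large volumes
    data_in_lake_files: bool    # data already sits in ADLS as Parquet/CSV/Delta


def choose_engine(w: Workload) -> str:
    """Return the Synapse engine suggested by the rules of thumb above."""
    if w.needs_python_or_ml:
        return "Spark Pool"
    if w.steady_bi_load:
        return "Dedicated SQL Pool"
    if w.data_in_lake_files:
        return "Serverless SQL Pool"
    # Nothing matched: start with the engine that has no standing cost.
    return "Serverless SQL Pool"


print(choose_engine(Workload(False, False, True)))  # Serverless SQL Pool
print(choose_engine(Workload(False, True, True)))   # Dedicated SQL Pool
print(choose_engine(Workload(True, False, True)))   # Spark Pool
```

A strong candidate will usually add nuances that this sketch ignores, such as combining engines in the same workspace.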

2. Dedicated SQL Pool: MPP for BI

Discriminating question

How do you optimize the performance of a Dedicated SQL Pool?

-- Table distribution: the main performance lever
-- HASH: distribute on a column that is frequently used in joins
CREATE TABLE fct_orders
WITH (
    DISTRIBUTION = HASH(customer_id),  -- avoids data shuffle on joins
    CLUSTERED COLUMNSTORE INDEX        -- optimal for analytical queries
)
AS SELECT * FROM source_table;

-- ROUND_ROBIN: for staging tables
CREATE TABLE stg_orders_raw
WITH (DISTRIBUTION = ROUND_ROBIN)
AS SELECT * FROM ...

-- Statistics: update after loading
UPDATE STATISTICS fct_orders;

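-- Pause (cost savings): there is no T-SQL statement for this and no built-in auto-pause.
-- Pause the pool from the Azure portal, PowerShell (Suspend-AzSynapseSqlPool),
-- the Azure CLI (az synapse sql pool pause), or a scheduled pipeline that calls the REST API.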

HASH Distribution — distribute large tables on the most frequent join column to avoid data movement
Columnstore Index — maximum compression and optimal performance for analytical queries
Pause/Resume — pause the Dedicated Pool at night and on weekends. Running it 50 hours out of the 168 in a week cuts compute costs by roughly 70%; storage is still billed while paused
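
To see why the choice of hash column matters, the sketch below simulates how rows spread over the 60 distributions of a Dedicated SQL Pool. Rows with the same key always land in the same distribution, which keeps joins on that key local, but a key dominated by one value overloads a single distribution. MD5 is used here only as a stand-in, since the pool's internal hash function is not part of this article; the row counts are invented for the example.

```python
import hashlib
from collections import Counter

DISTRIBUTIONS = 60  # a Dedicated SQL Pool spreads each table over 60 distributions


def distribution_of(key) -> int:
    """Map a key to a distribution. MD5 is a stand-in for the pool's internal hash."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % DISTRIBUTIONS


def skew_ratio(keys) -> float:
    """Size of the largest distribution divided by the ideal (even) size."""
    counts = Counter(distribution_of(k) for k in keys)
    ideal = len(keys) / DISTRIBUTIONS
    return max(counts.values()) / ideal


# High-cardinality key: 100,000 distinct customer ids, one order each.
even_keys = list(range(100_000))
# Skewed key: half of all orders belong to a single customer id.
skewed_keys = [0] * 50_000 + list(range(1, 50_001))

print(f"even key skew:   {skew_ratio(even_keys):.2f}")
print(f"skewed key skew: {skew_ratio(skewed_keys):.2f}")
```

A ratio close to 1 means the work is shared evenly; a large ratio means one distribution does most of the work and the query runs at the speed of that distribution.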

3. Serverless SQL Pool: querying the Data Lake

Discriminating question

How do you query Parquet files in ADLS with the Serverless SQL Pool?

-- Direct query on Parquet files in ADLS
SELECT
    year,
    region,
    SUM(amount) as revenue
FROM
    OPENROWSET(
        BULK 'https://monstorage.dfs.core.windows.net/datalake/orders/**',
        FORMAT = 'PARQUET'
    ) AS orders
WHERE year = 2024
GROUP BY year, region;

-- Create a view in a user database, not in master (avoids repeating OPENROWSET)
CREATE OR ALTER VIEW vw_orders AS
SELECT *
FROM OPENROWSET(
    BULK 'https://monstorage.dfs.core.windows.net/datalake/orders/**',
    FORMAT = 'PARQUET'
) AS r;

-- Query on Delta Lake
SELECT TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://monstorage.dfs.core.windows.net/datalake/delta/orders',
        FORMAT = 'DELTA'
    ) AS delta_orders;
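
Because the Serverless SQL Pool bills on the data each query processes, narrowing the BULK path to the partitions you need reduces the files read. The helper below is a minimal sketch that builds such a path and wraps it in a query string. It assumes a hypothetical `year=YYYY/month=MM` folder layout with zero-padded months; the function names are illustrative and the storage account and container names are reused from the queries above.

```python
from typing import Optional


def bulk_path(account: str, container: str, folder: str,
              year: Optional[int] = None, month: Optional[int] = None) -> str:
    """Build an OPENROWSET BULK path over a year=/month= partitioned folder.

    A missing year or month becomes a `*` wildcard, so the query reads
    every partition at that level.
    """
    year_part = f"year={year}" if year is not None else "year=*"
    month_part = f"month={month:02d}" if month is not None else "month=*"
    return (f"https://{account}.dfs.core.windows.net/{container}/{folder}/"
            f"{year_part}/{month_part}/*.parquet")


def openrowset_query(path: str) -> str:
    """Wrap the path in a minimal serverless SQL query."""
    return (
        "SELECT region, SUM(amount) AS revenue\n"
        f"FROM OPENROWSET(BULK '{path}', FORMAT = 'PARQUET') AS orders\n"
        "GROUP BY region;"
    )


# Only the 2024 partitions are listed in the path, whatever the month.
print(openrowset_query(bulk_path("monstorage", "datalake", "orders", year=2024)))
```

The generated text is only a string; it still has to be run against the serverless endpoint with a client of your choice.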

4. Spark Pool: transformations and ML

Discriminating question

In what cases do you use the Spark Pool rather than the SQL Pool in Synapse?

Complex transformations — Python/Scala logic not expressible in SQL, unstructured data manipulation
ML with MLflow — Synapse Spark natively integrates MLflow for experiment tracking
Delta Lake — read and write Delta tables in ADLS from Spark
Shared notebooks — Spark notebooks in Synapse Studio, collaborative with Git
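Auto-pause — the Spark Pool shuts down after a configurable number of idle minutes, so compute is billed only while a session is running, unlike a Dedicated Pool, which is billed until it is explicitly paused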
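
As an example of logic that is awkward to express in SQL, the sketch below parses unstructured log lines into structured records with plain Python. The log format is hypothetical. In a Spark Pool a function like this would typically be applied across the cluster, for instance by wrapping it with `pyspark.sql.functions.udf`; that wiring is left out here because it needs a Spark session.

```python
import re
from typing import Optional

# Hypothetical log format: "<ISO timestamp> <LEVEL> user=<id> action=<name>"
LOG_PATTERN = re.compile(
    r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+user=(?P<user>\d+)\s+action=(?P<action>\w+)$"
)


def parse_log_line(line: str) -> Optional[dict]:
    """Turn one raw log line into a structured record, or None if malformed."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["user"] = int(record["user"])  # cast the id so it can be joined later
    return record


raw_lines = [
    "2025-06-01T10:15:00Z INFO user=42 action=checkout",
    "corrupted line without structure",
]
parsed = [parse_log_line(line) for line in raw_lines]
print(parsed)
```

Returning `None` for malformed lines lets the caller count or quarantine bad records instead of failing the whole job.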

5. Azure Data Lake Storage Gen2 integration

Discriminating question

How do you organize your Data Lake in ADLS for use with Synapse?

Recommended zones — raw/ (raw data), curated/ (transformed data), enriched/ (ready for BI)
Parquet or Delta format — Parquet for static data, Delta Lake for tables that evolve
Linked Service — secure connection between Synapse and ADLS via Managed Identity (no account keys)
Access Control — Azure RBAC + ADLS ACLs to control access by directory and by team
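
One way to keep such a layout consistent is to generate folder paths from a single helper. The sketch below uses the three zone names listed above and assumes a hypothetical date-partitioned layout (`year=/month=/day=`); the function names are illustrative, and the account and container names are placeholders.

```python
from datetime import date

ZONES = ("raw", "curated", "enriched")  # zone names taken from the list above


def lake_path(zone: str, dataset: str, day: date) -> str:
    """Relative folder for one day of a dataset, partitioned by date."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    return f"{zone}/{dataset}/year={day:%Y}/month={day:%m}/day={day:%d}/"


def abfss_uri(account: str, container: str, path: str) -> str:
    """Full ADLS Gen2 URI in the abfss:// form used by Spark."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path}"


path = lake_path("raw", "orders", date(2025, 6, 14))
print(path)                                        # raw/orders/year=2025/month=06/day=14/
print(abfss_uri("monstorage", "datalake", path))
```

Rejecting unknown zones early avoids stray top-level folders that later need their own ACLs.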

6. Synapse vs Databricks vs Microsoft Fabric

Discriminating question

How do you position Synapse, Databricks, and Fabric?

Criterion	Azure Synapse	Databricks	Microsoft Fabric
Analytical SQL	Excellent (native MPP)	Good (SQL Warehouse)	Excellent (Warehouse on OneLake)
Spark	Good	Excellent (optimized)	Good
ML/AI	Partial	Excellent (native MLflow)	Partial
Microsoft integration	Native	Good	Native (Power BI)
2025 trend	Mature, being superseded by Fabric for new projects	Enterprise ML standard	New, Microsoft's strategic direction

-- Synapse Serverless SQL Pool: query Parquet in ADLS Gen2 directly
SELECT order_date, region, SUM(amount) as revenue
FROM OPENROWSET(
    BULK 'https://myadls.dfs.core.windows.net/lake/orders/year=*/month=*/*.parquet',
    FORMAT = 'PARQUET'
) WITH (order_date DATE, region VARCHAR(50), amount FLOAT) AS r
WHERE order_date >= '2025-01-01'
GROUP BY order_date, region;

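-- CETAS: write a query result to the lake as Parquet files and expose it as an external table
-- (the MyAdls data source and the ParquetFmt file format must already exist)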
CREATE EXTERNAL TABLE orders_ext
WITH (LOCATION = 'orders/', DATA_SOURCE = MyAdls, FILE_FORMAT = ParquetFmt)
AS SELECT * FROM orders_staging;

-- Synapse Link CosmosDB -> analytical queries without OLTP impact
SELECT TOP 100 * FROM OPENROWSET(
    PROVIDER = 'CosmosDB',
    CONNECTION = 'Account=myaccount;Database=ecommerce',
    OBJECT = 'orders',
    SERVER_CREDENTIAL = 'CosmosDBCredential'
) AS orders;

Serverless vs Dedicated Pool — Serverless: pay per TB processed, ideal for ad-hoc exploration. Dedicated Pool: reserved capacity with predictable performance, billed per hour while running, for prod BI
Synapse Link — zero-ETL replication from Azure Cosmos DB or Dataverse to Synapse. Analytics without impacting the OLTP source
Integrated Spark Pool — managed Spark in Synapse. Shares tables with the SQL Pool via Delta/Parquet tables in the lake
vs Databricks — Synapse: native Azure integration (ADF, Power BI), cheaper for small teams. Databricks: superior developer experience, native Delta Lake, MLflow
Security — Row-Level Security and Dynamic Data Masking in Dedicated Pool. Managed Private Endpoints for secure connections without exposing data on the internet
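
The Serverless vs Dedicated trade-off comes down to simple arithmetic, sketched below. Both prices are assumptions chosen for illustration (check the Azure pricing page for your region and pool size), and the 50-hour week is a hypothetical business-hours schedule.

```python
HOURS_PER_WEEK = 24 * 7

# Assumed prices, for illustration only.
SERVERLESS_USD_PER_TB = 5.0
DEDICATED_USD_PER_HOUR = 1.5  # hypothetical rate for a small pool


def serverless_weekly_cost(tb_processed_per_week: float) -> float:
    """Serverless cost grows with the volume of data processed."""
    return tb_processed_per_week * SERVERLESS_USD_PER_TB


def dedicated_weekly_cost(running_hours_per_week: float) -> float:
    """Dedicated compute cost grows with the hours the pool is running."""
    return running_hours_per_week * DEDICATED_USD_PER_HOUR


def pause_savings(running_hours_per_week: float) -> float:
    """Share of compute cost avoided compared with running 24/7."""
    return 1 - running_hours_per_week / HOURS_PER_WEEK


def break_even_tb(running_hours_per_week: float) -> float:
    """Weekly TB processed above which the dedicated pool becomes cheaper."""
    return dedicated_weekly_cost(running_hours_per_week) / SERVERLESS_USD_PER_TB


business_hours = 10 * 5  # 10 h/day, Monday to Friday
print(f"pause savings: {pause_savings(business_hours):.0%}")               # 70%
print(f"break-even:    {break_even_tb(business_hours):.1f} TB/week")       # 15.0 TB/week
```

With these assumed prices, a team processing less than 15 TB per week would pay less on Serverless; the sketch ignores storage costs and performance differences.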

7. Evaluation grid by level

Level	Mastery	GO Signal	NO-GO
Mid-level	Serverless SQL Pool, basic Dedicated Pool, ADLS	Has queried Parquet files with OPENROWSET, understands the 3 pools	Does not know the difference between Serverless and Dedicated
Senior	HASH Distribution, Spark Pool, Delta Lake, Synapse/Databricks comparison	Has optimized a Dedicated Pool (distribution, statistics), justifies Synapse vs Databricks	Does not know what HASH distribution is
