Trino (ex-Presto) est le moteur SQL de référence pour interroger plusieurs sources de données depuis une seule requête. En entretien, on évalue la capacité à l architecturer et l optimiser.
1Architecture Trino : coordinator et workers
Question discriminante
Comment est architecturé Trino ? En quoi diffère-t-il de Spark sur ce point ?
- Coordinator — parse et optimise les requêtes SQL (query planner), génère un plan d'exécution distribué, distribue les fragments aux workers
- Workers — exécutent les fragments de requête, communiquent directement entre eux (pas via le coordinator). Échange de données en mémoire via pipeline
- Pull model — les workers tirent les données des connecteurs à la demande, sans charger tout en mémoire d'abord
- Pas de stockage — Trino ne stocke rien. Tout est lu depuis les connecteurs à chaque requête. C'est un moteur de requête, pas un stockage
- vs Spark — Trino : latence faible (<1s possible), SQL interactif, query federation native. Spark : batch massif, ML, streaming, Python-first
# Déploiement Trino minimal (docker-compose)
version: '3'
services:
trino-coordinator:
image: trinodb/trino:latest
ports: ["8080:8080"]
volumes:
- ./etc/coordinator:/etc/trino
environment:
- TRINO_NODE_TYPE=coordinator
trino-worker:
image: trinodb/trino:latest
volumes:
- ./etc/worker:/etc/trino
depends_on: [trino-coordinator]
deploy:
replicas: 3
2Connecteurs : la richesse de Trino
Question discriminante
Quels connecteurs Trino supportez-vous ? Comment les configurez-vous ?
## etc/catalog/iceberg.properties
connector.name=iceberg
hive.metastore.uri=thrift://hive-metastore:9083
## etc/catalog/postgres.properties
connector.name=postgresql
connection-url=jdbc:postgresql://postgres:5432/analytics
connection-user=trino_user
connection-password=${ENV:POSTGRES_PASSWORD}
## etc/catalog/kafka.properties
connector.name=kafka
kafka.nodes=kafka1:9092,kafka2:9092
kafka.table-description-dir=/etc/trino/kafka
## etc/catalog/bigquery.properties
connector.name=bigquery
bigquery.project-id=mon-projet-gcp
-- Requête fédérée : join entre Iceberg (S3) et PostgreSQL (OLTP)
SELECT i.order_id, i.amount, p.customer_email, p.segment
FROM iceberg.analytics.fct_orders i
JOIN postgresql.crm.customers p ON i.customer_id = p.id
WHERE i.order_date >= DATE '2025-01-01'
AND p.country = 'FR';
- Connecteurs disponibles — Hive, Iceberg, Delta Lake, PostgreSQL, MySQL, Kafka, MongoDB, Elasticsearch, BigQuery, Redshift, S3 (Hive connector)
- Catalog = connecteur configuré — chaque catalog mappe vers une source. Notation : catalog.schema.table
- Predicate pushdown — Trino pousse les filtres aux connecteurs qui le supportent (PostgreSQL, MySQL). Réduit les données transférées
3Query Federation : les pièges de performance
Question discriminante
Quels sont les pièges de performance dans une requête Trino qui joint plusieurs sources ?
- Pas de pushdown cross-sources — un filtre sur la table PostgreSQL ne peut pas être poussé vers la table Iceberg. Trino doit rapatrier les données des deux côtés
- Broadcast join automatique — Trino broadcaste automatiquement la petite table. Si les deux sont grandes : shuffle massif et lent
- Matérialiser les jointures coûteuses — pré-joindre les sources inter-catalogs dans une table Iceberg. Requêter la table pré-jointe
- ANALYZE régulier — les statistiques de table (row count, NDV par colonne) guident le query planner. Sans statistiques, le planner choisit mal
- Limiter les colonnes transférées — ne SELECT que les colonnes nécessaires. Trino ne facture pas à l'octet mais le réseau est le goulot d'étranglement
4Optimisation des requêtes Trino
Question discriminante
Comment optimisez-vous une requête Trino lente ? Quels outils utilisez-vous ?
-- 1. EXPLAIN ANALYZE : voir le plan réel d'exécution avec statistiques
EXPLAIN ANALYZE
SELECT region, SUM(amount)
FROM iceberg.analytics.orders
WHERE order_date >= DATE '2025-01-01'
GROUP BY region;
-- Chercher : partition pruning, broadcast vs hash join, spill to disk
-- 2. Forcer le broadcast join (quand Trino choisit mal)
SELECT /*+ BROADCAST(dim) */ f.*, dim.category_name
FROM iceberg.analytics.fct_orders f
JOIN iceberg.analytics.dim_categories dim ON f.category_id = dim.id;
-- 3. Mettre à jour les statistiques
ANALYZE iceberg.analytics.orders;
-- 4. Éviter les fonctions sur les colonnes partitionnées
-- MAL : WHERE YEAR(order_date) = 2025 (pas de pruning)
-- BIEN : WHERE order_date >= DATE '2025-01-01' (pruning actif)
-- 5. Requêtes via l'interface Web (port 8080)
-- Query details, stage timing, operator stats
5Trino vs Spark SQL : quand utiliser quoi
Question discriminante
Quels sont les critères de décision entre Trino et Spark SQL ?
| Critère | Trino | Spark SQL |
| Latence | <1s possible | Plusieurs secondes minimum |
| Query federation | Natif (multi-sources) | Limité, complexe |
| Très gros volumes batch | Possible mais sous-optimal | Excellent |
| Streaming | Non | Oui (Structured Streaming) |
| ML / Python | Non | Oui (MLlib, PySpark) |
| Mémoire | En-mémoire uniquement (spill possible) | Spill to disk nativement |
| Cas typique | BI interactive, exploration, federation multi-sources | ETL batch, ML, Streaming |
6Cas d'usage Trino en production
Question discriminante
Dans quels contextes concrètement déployez-vous Trino ?
- Remplacement de Hive — Trino est 10-100x plus rapide que Hive pour les requêtes ad-hoc sur S3/HDFS. Migration souvent la première étape vers le cloud
- Query layer unifié — une seule interface SQL pour PostgreSQL, S3/Iceberg, MongoDB, Kafka. Les data analysts font leurs joins cross-sources en SQL standard
- Amazon Athena — basé sur Trino (ex-Presto). Serverless, payer par TB scanné. Idéal pour les organisations AWS sans envie de gérer Trino
- Starburst Galaxy — version managée cloud de Trino. Pour les équipes sans Ops data
- Data mesh layer — Trino comme couche de virtualisation qui expose des données de domaines différents sans les centraliser physiquement
- Amazon Athena - base sur Trino (ex-Presto). Serverless, payer par TB scanne. Ideal pour les organisations AWS sans envie de gerer un cluster Trino
- Starburst Galaxy - version managee cloud de Trino. Pour les equipes sans Ops data qui veulent la federation multi-sources
- Data mesh layer - Trino comme couche de virtualisation qui expose des donnees de domaines differents sans les centraliser physiquement
- ANALYZE pour les stats - les statistiques de table (row count, NDV par colonne) guident le query planner. Sans statistiques, le planner choisit mal les strategies de join
- Remplacement Hive - Trino est 10-100x plus rapide que Hive pour les requetes ad-hoc sur S3/HDFS. Migration souvent la premiere etape vers le cloud
7Grille par niveau
| Niveau | Maîtrise | Signal GO | NO-GO |
| Confirmé | Architecture coordinator/workers, connecteurs, requêtes inter-sources | A écrit des requêtes fédérées Trino, comprend le pull model et le catalog system | Confond Trino et Spark, ne sait pas ce qu'est un catalog Trino |
| Senior | Optimisation, statistics, broadcast hints, architecture Trino vs Spark | Utilise EXPLAIN ANALYZE, matérialise les jointures coûteuses, justifie Trino vs Spark selon le cas | Ne sait pas pourquoi une jointure cross-sources est lente dans Trino |
1Trino Architecture: coordinator and workers
Discriminating question
How is Trino architected? How does it differ from Spark on this point?
- Coordinator — parses and optimizes SQL queries (query planner), generates a distributed execution plan, distributes fragments to workers
- Workers — execute query fragments, communicate directly with each other (not via the coordinator). In-memory data exchange via pipeline
- Pull model — workers pull data from connectors on demand, without loading everything into memory first
- No storage — Trino stores nothing. Everything is read from connectors on each query. It is a query engine, not a storage engine
- vs Spark — Trino: low latency (<1s possible), interactive SQL, native query federation. Spark: massive batch, ML, streaming, Python-first
# Minimal Trino deployment (docker-compose)
version: '3'
services:
trino-coordinator:
image: trinodb/trino:latest
ports: ["8080:8080"]
volumes:
- ./etc/coordinator:/etc/trino
environment:
- TRINO_NODE_TYPE=coordinator
trino-worker:
image: trinodb/trino:latest
volumes:
- ./etc/worker:/etc/trino
depends_on: [trino-coordinator]
deploy:
replicas: 3
2Connectors: the richness of Trino
Discriminating question
Which Trino connectors do you support? How do you configure them?
## etc/catalog/iceberg.properties
connector.name=iceberg
hive.metastore.uri=thrift://hive-metastore:9083
## etc/catalog/postgres.properties
connector.name=postgresql
connection-url=jdbc:postgresql://postgres:5432/analytics
connection-user=trino_user
connection-password=${ENV:POSTGRES_PASSWORD}
## etc/catalog/kafka.properties
connector.name=kafka
kafka.nodes=kafka1:9092,kafka2:9092
kafka.table-description-dir=/etc/trino/kafka
## etc/catalog/bigquery.properties
connector.name=bigquery
bigquery.project-id=my-gcp-project
-- Federated query: join between Iceberg (S3) and PostgreSQL (OLTP)
SELECT i.order_id, i.amount, p.customer_email, p.segment
FROM iceberg.analytics.fct_orders i
JOIN postgresql.crm.customers p ON i.customer_id = p.id
WHERE i.order_date >= DATE '2025-01-01'
AND p.country = 'FR';
- Available connectors — Hive, Iceberg, Delta Lake, PostgreSQL, MySQL, Kafka, MongoDB, Elasticsearch, BigQuery, Redshift, S3 (Hive connector)
- Catalog = configured connector — each catalog maps to a source. Notation: catalog.schema.table
- Predicate pushdown — Trino pushes filters to connectors that support it (PostgreSQL, MySQL). Reduces data transferred
3Query Federation: performance pitfalls
Discriminating question
What are the performance pitfalls in a Trino query that joins multiple sources?
- No cross-source pushdown — a filter on the PostgreSQL table cannot be pushed to the Iceberg table. Trino must fetch data from both sides
- Automatic broadcast join — Trino automatically broadcasts the small table. If both are large: massive and slow shuffle
- Materialize expensive joins — pre-join cross-catalog sources into an Iceberg table. Query the pre-joined table
- Regular ANALYZE — table statistics (row count, NDV per column) guide the query planner. Without statistics, the planner makes poor choices
- Limit transferred columns — only SELECT necessary columns. Trino does not charge per byte but the network is the bottleneck
4Trino query optimization
Discriminating question
How do you optimize a slow Trino query? What tools do you use?
-- 1. EXPLAIN ANALYZE: see the actual execution plan with statistics
EXPLAIN ANALYZE
SELECT region, SUM(amount)
FROM iceberg.analytics.orders
WHERE order_date >= DATE '2025-01-01'
GROUP BY region;
-- Look for: partition pruning, broadcast vs hash join, spill to disk
-- 2. Force broadcast join (when Trino chooses poorly)
SELECT /*+ BROADCAST(dim) */ f.*, dim.category_name
FROM iceberg.analytics.fct_orders f
JOIN iceberg.analytics.dim_categories dim ON f.category_id = dim.id;
-- 3. Update statistics
ANALYZE iceberg.analytics.orders;
-- 4. Avoid functions on partitioned columns
-- BAD: WHERE YEAR(order_date) = 2025 (no pruning)
-- GOOD: WHERE order_date >= DATE '2025-01-01' (active pruning)
-- 5. Queries via Web interface (port 8080)
-- Query details, stage timing, operator stats
5Trino vs Spark SQL: when to use which
Discriminating question
What are the decision criteria between Trino and Spark SQL?
| Criterion | Trino | Spark SQL |
| Latency | <1s possible | Several seconds minimum |
| Query federation | Native (multi-source) | Limited, complex |
| Very large batch volumes | Possible but suboptimal | Excellent |
| Streaming | No | Yes (Structured Streaming) |
| ML / Python | No | Yes (MLlib, PySpark) |
| Memory | In-memory only (spill possible) | Spill to disk natively |
| Typical use case | Interactive BI, exploration, multi-source federation | Batch ETL, ML, Streaming |
6Trino production use cases
Discriminating question
In which concrete contexts do you deploy Trino?
- Hive replacement — Trino is 10-100x faster than Hive for ad-hoc queries on S3/HDFS. Migration often the first step toward the cloud
- Unified query layer — a single SQL interface for PostgreSQL, S3/Iceberg, MongoDB, Kafka. Data analysts do their cross-source joins in standard SQL
- Amazon Athena — based on Trino (ex-Presto). Serverless, pay per TB scanned. Ideal for AWS organizations that don't want to manage Trino
- Starburst Galaxy — managed cloud version of Trino. For teams without data Ops
- Data mesh layer — Trino as a virtualization layer that exposes data from different domains without physically centralizing them
- Amazon Athena - based on Trino (ex-Presto). Serverless, pay per TB scanned. Ideal for AWS organizations that don't want to manage a Trino cluster
- Starburst Galaxy - managed cloud version of Trino. For teams without data Ops who want multi-source federation
- Data mesh layer - Trino as a virtualization layer that exposes data from different domains without physically centralizing them
- ANALYZE for stats - table statistics (row count, NDV per column) guide the query planner. Without statistics, the planner makes poor join strategy choices
- Hive replacement - Trino is 10-100x faster than Hive for ad-hoc queries on S3/HDFS. Migration often the first step toward the cloud
7Level grid
| Level | Mastery | GO signal | NO-GO |
| Confirmed | Coordinator/workers architecture, connectors, cross-source queries | Has written federated Trino queries, understands the pull model and catalog system | Confuses Trino and Spark, does not know what a Trino catalog is |
| Senior | Optimization, statistics, broadcast hints, Trino vs Spark architecture | Uses EXPLAIN ANALYZE, materializes expensive joins, justifies Trino vs Spark based on the use case | Does not know why a cross-source join is slow in Trino |