AWS Glue ETL Tutorial: Serverless Spark for Data Pipelines
Build serverless ETL jobs with AWS Glue. Learn the Data Catalog, crawlers, Spark and Python shell jobs, partitioning, bookmarks, and how to avoid surprise DPU bills.
What you'll learn
- What Glue is, and how the Data Catalog ties everything together
- How crawlers infer schema and partitions from S3
- How to author a Spark ETL job that reads, transforms, and writes Parquet
- How job bookmarks make incremental loads safe
- Cost pitfalls and production patterns
Prerequisites
- Comfort with SQL and basic Python; familiarity with S3 and Athena helps
AWS Glue is the serverless data-integration backbone of AWS. It bundles a metadata catalog, schema-inference crawlers, and a managed Spark runtime into one service. You pay per DPU-hour while jobs run — no clusters to keep warm — and the Data Catalog acts as a Hive-compatible metastore for Athena, Redshift Spectrum, and EMR.
What and Why
Three pieces matter.
- Data Catalog: a central metastore of databases, tables, columns, and partitions. Most AWS analytics services read it.
- Crawlers: scheduled jobs that scan S3 (or JDBC sources), infer schema, and write tables into the Catalog.
- Jobs: serverless Spark or Python shell scripts that read sources, transform data, and write outputs.
Why pick Glue? You get Spark without operating it, integrated lineage and bookmarks for incremental processing, and Catalog-driven discovery so a downstream Athena query “just works” after a crawler runs. Why avoid it? Cold-start times of a minute or more, opinionated DynamicFrame APIs that differ from plain Spark, and DPU costs that surprise teams who treat it like Lambda.
Mental Model
Imagine a three-lane assembly line.
- Raw files land in S3 under s3://lake/raw/<source>/dt=YYYY-MM-DD/.
- A crawler runs once a day, registering new partitions in the Catalog.
- A Glue job reads those new partitions (using a bookmark), transforms them, and writes Parquet to s3://lake/curated/<table>/.
The bookmark is the secret sauce: it tracks which files have already been processed so you can rerun safely.
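The bookmark idea can be modelled in a few lines of plain Python. This is a conceptual sketch only, not how Glue stores bookmark state internally; the `Bookmark` class, `run_job`, and the file keys are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Bookmark:
    """Toy stand-in for a job bookmark: remembers which keys were committed."""
    processed: set = field(default_factory=set)

    def pending(self, available):
        # Keep only keys that no earlier committed run has seen, in stable order.
        return sorted(k for k in available if k not in self.processed)

    def commit(self, keys):
        # Called only after the output has been written successfully.
        self.processed.update(keys)


def run_job(bookmark, available, transform):
    """Process unseen keys, then advance the bookmark."""
    todo = bookmark.pending(available)
    results = [transform(k) for k in todo]
    bookmark.commit(todo)  # plays the role of job.commit() in a Glue script
    return results


bm = Bookmark()
day1 = ["raw/events/dt=2026-06-27/a.json"]
day2 = day1 + ["raw/events/dt=2026-06-28/b.json"]

first = run_job(bm, day1, str.upper)
second = run_job(bm, day2, str.upper)  # only the new file is processed
rerun = run_job(bm, day2, str.upper)   # nothing left to do
```

Because the commit happens after the transform, a run that fails partway leaves the bookmark untouched, and the next run picks up the same files again.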
Hands-on Example
A minimal PySpark Glue job that reads JSON, drops fields that are null in every record, and writes partitioned Parquet:
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import DropNullFields
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Glue passes JOB_NAME (and bookmark settings) as command-line arguments.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue = GlueContext(sc)
job = Job(glue)
job.init(args["JOB_NAME"], args)

# Read from the Catalog table; transformation_ctx is the bookmark key.
src = glue.create_dynamic_frame.from_catalog(
    database="lake_raw",
    table_name="events",
    transformation_ctx="src",
)

# Drop fields whose values are all null.
cleaned = DropNullFields.apply(frame=src)

# Write Parquet to the curated zone, partitioned by dt.
glue.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={
        "path": "s3://lake/curated/events/",
        "partitionKeys": ["dt"],
    },
    format="parquet",
    transformation_ctx="dst",
)

# Persist the bookmark state so the next run skips these files.
job.commit()
Create the job with --job-bookmark-option job-bookmark-enable and schedule it via a Glue trigger or EventBridge cron.
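One way to script that setup is with boto3's Glue client. The sketch below only builds the request dictionaries for `create_job` and `create_trigger`; the job name, role ARN, script path, and cron expression are hypothetical placeholders, and the worker settings are assumptions rather than recommendations.

```python
def build_job_request(name, role_arn, script_s3_path, workers=2):
    """Keyword arguments for boto3's glue.create_job (values are illustrative)."""
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",  # Spark ETL job type
            "ScriptLocation": script_s3_path,
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",
        "WorkerType": "G.1X",
        "NumberOfWorkers": workers,
        "DefaultArguments": {
            # Turns bookmarks on for every run of this job.
            "--job-bookmark-option": "job-bookmark-enable",
        },
    }


def build_trigger_request(job_name, cron="cron(0 2 * * ? *)"):
    """Keyword arguments for glue.create_trigger: run the job on a schedule."""
    return {
        "Name": f"{job_name}-daily",
        "Type": "SCHEDULED",
        "Schedule": cron,
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def deploy(job_request, trigger_request):
    # boto3 is imported lazily so the builders above work without AWS access.
    import boto3

    glue = boto3.client("glue")
    glue.create_job(**job_request)
    glue.create_trigger(**trigger_request)


job_request = build_job_request(
    "events-curate",                               # hypothetical job name
    "arn:aws:iam::123456789012:role/GlueJobRole",  # hypothetical role ARN
    "s3://lake/scripts/events_curate.py",          # hypothetical script path
)
trigger_request = build_trigger_request("events-curate")
```

Calling `deploy(job_request, trigger_request)` with valid credentials would create both resources; keeping the builders separate makes the configuration easy to review or unit-test first.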
S3 raw/events/dt=2026-06-28/*.json
        |
        v
Glue Crawler ---> Data Catalog (lake_raw.events)
                        |
                        v
                Glue ETL Job (Spark)
                  reads via bookmark
                  drops nulls, casts
                        |
                        v
S3 curated/events/dt=2026-06-28/*.parquet
        |
        v
Athena / Redshift Spectrum query directly
After the job runs and the curated table is registered in the Catalog (for example by a second crawler over s3://lake/curated/), point Athena at the Catalog and query SELECT count(*) FROM lake_curated.events WHERE dt='2026-06-28'. No infrastructure touched.
Common Pitfalls
- Crawlers running every hour on huge buckets. Crawler runtime scales with file count. Use exclusion patterns and partition projection in Athena instead of crawling for high-frequency partitions.
- Bookmarks disabled, jobs reprocessing everything. Without transformation_ctx set (the example sets it on both source and sink) and a final job.commit(), the bookmark does nothing.
- Tiny output files. Spark writes at least one file per task for each partition value that task holds, so a wide job can emit thousands of small files. Coalesce or repartition before the write to avoid Athena performance death by a thousand small files.
- Mixing DynamicFrames and DataFrames carelessly. Schema inference differs. Convert with toDF() and fromDF() explicitly when you need exact control.
- Default 10 DPUs. A small job runs fine on 2 DPUs. Ten DPUs cost five times as much per hour and may not be faster if the data is small.
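For the small-files pitfall, a rough sizing rule helps pick the repartition count. The helper below is a sketch: `target_file_count` is a hypothetical name, and the 128 MiB target is an assumed rule of thumb, not a Glue or Athena setting.

```python
import math


def target_file_count(total_bytes, target_file_bytes=128 * 1024 * 1024):
    """Number of output files that keeps each file near the target size."""
    if total_bytes <= 0:
        return 1
    return max(1, math.ceil(total_bytes / target_file_bytes))


# 3 GiB of output at roughly 128 MiB per file
n_files = target_file_count(3 * 1024**3)

# In the Glue job this would be applied before the write, for example
# (DynamicFrame comes from awsglue.dynamicframe):
#   df = cleaned.toDF().repartition(n_files)
#   cleaned = DynamicFrame.fromDF(df, glue, "cleaned")
```

When the write also uses partitionKeys, the count applies to the whole run, so a run that covers many dt values may still produce several files per partition directory.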
Production Tips
Pick the right worker type. G.1X for general ETL, G.2X for memory-hungry joins, Python shell for tiny orchestration scripts that just call APIs (it’s cheap and starts fast). For sub-minute starts, use Glue 4.0 with the streaming runtime or move to EMR Serverless.
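The cost effect of worker choice is simple arithmetic. This sketch assumes a rate of $0.44 per DPU-hour and a one-minute billing minimum; both are assumptions to check against current pricing for your region. A G.1X worker counts as 1 DPU and a G.2X worker as 2 DPUs. `estimate_run_cost` is a hypothetical helper.

```python
# DPUs per worker: G.1X maps to 1 DPU and G.2X to 2 DPUs.
DPU_PER_WORKER = {"G.1X": 1, "G.2X": 2}

PRICE_PER_DPU_HOUR = 0.44  # assumed price; check the current rate
MIN_BILLED_SECONDS = 60    # assumed one-minute minimum per run


def estimate_run_cost(worker_type, workers, runtime_seconds,
                      price=PRICE_PER_DPU_HOUR):
    """Estimated cost in dollars of one job run."""
    billed = max(runtime_seconds, MIN_BILLED_SECONDS)
    dpus = DPU_PER_WORKER[worker_type] * workers
    return dpus * (billed / 3600) * price


small = estimate_run_cost("G.1X", 2, 600)   # 2 DPUs for 10 minutes
large = estimate_run_cost("G.1X", 10, 600)  # 10 DPUs for 10 minutes
```

At equal runtime the 10-DPU run costs five times the 2-DPU run, so extra workers only pay off if they shorten the job by more than the factor they add.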
Always write Parquet with Snappy compression and a sane partition strategy (date or tenant). Add data-quality checks with Glue Data Quality rules so bad inputs fail the job rather than silently corrupting downstream tables.
Tag jobs with cost-center and set CloudWatch alarms on job duration; a job that suddenly doubles in runtime is the signal that someone uploaded a massive dump or a join key went skewed.
Wrap-up
Glue gives you Spark, a metastore, and bookmarks without managing clusters. Land raw data in partitioned S3, register it with a crawler, run a bookmarked Glue job to produce curated Parquet, and query with Athena. Mind your DPU choice and file sizes, and you’ll have a tidy serverless lakehouse pipeline.