A. Core Data Engineering Concepts
SQL (joins, window functions, performance tuning)
Data Modeling (star vs snowflake schema, normalization)
ETL/ELT pipelines (batch vs streaming, orchestration tools like Airflow)
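
As a small illustration of the window-function topic above, the sketch below picks each customer's largest order with ROW_NUMBER() OVER (PARTITION BY ...). It uses Python's built-in sqlite3 module only so that it runs without a warehouse; the orders table, its columns, and the sample rows are hypothetical, and it assumes the bundled SQLite is version 3.25 or newer (the first release with window functions).

```python
import sqlite3


def top_order_per_customer(rows):
    """Return each customer's largest order using ROW_NUMBER()."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (customer TEXT, order_id INTEGER, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    query = """
        SELECT customer, order_id, amount
        FROM (
            SELECT customer, order_id, amount,
                   ROW_NUMBER() OVER (
                       PARTITION BY customer          -- restart numbering per customer
                       ORDER BY amount DESC, order_id -- largest amount first, id breaks ties
                   ) AS rn
            FROM orders
        )
        WHERE rn = 1
        ORDER BY customer
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


# Hypothetical sample data: (customer, order_id, amount)
sample_orders = [
    ("alice", 1, 30.0),
    ("alice", 2, 75.0),
    ("bob", 3, 20.0),
    ("bob", 4, 10.0),
]
```

Calling top_order_per_customer(sample_orders) returns one row per customer. The same pattern (number the rows inside a partition, then filter on the row number) is a common way to deduplicate or take a "latest record per key" in most SQL engines.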
B. Apache Spark / PySpark
Catalyst Optimizer & Tungsten
Narrow vs Wide transformations
Joins (broadcast, sort-merge), Skew handling
AQE (Adaptive Query Execution)
Partitioning, Predicate Pushdown
Execution Plan (DAG → Stages → Tasks)
Spark UI and Job Debugging
SCD Type 2 Implementation in PySpark
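
The SCD Type 2 topic above can be sketched in plain Python to show the row-level logic: expire the current version of a changed record, append a new current version, and insert records for unseen keys. This is not PySpark code; in PySpark the same steps are usually expressed with a join on the business key followed by a union (or a merge, where the table format supports one). The column names start_date, end_date, is_current, the HIGH_DATE sentinel, and the function names are assumptions made for illustration.

```python
from datetime import date

HIGH_DATE = date(9999, 12, 31)  # assumed sentinel meaning "still current"


def _new_version(source_row, key, tracked, load_date):
    """Build an open-ended current row from a source record."""
    row = {col: source_row[col] for col in [key, *tracked]}
    row.update(start_date=load_date, end_date=HIGH_DATE, is_current=True)
    return row


def apply_scd2(dimension, incoming, key, tracked, load_date):
    """Return a new dimension list with SCD Type 2 changes applied.

    dimension: existing rows (dicts) including start_date, end_date, is_current
    incoming:  latest source rows (dicts), one per business key
    tracked:   columns whose change should create a new version
    """
    incoming_by_key = {row[key]: row for row in incoming}
    current_keys = set()
    result = []

    for row in dimension:
        source_row = incoming_by_key.get(row[key])
        if row["is_current"]:
            current_keys.add(row[key])
        changed = (
            row["is_current"]
            and source_row is not None
            and any(row[col] != source_row[col] for col in tracked)
        )
        if changed:
            # Close the old version, then open a new current version.
            result.append({**row, "end_date": load_date, "is_current": False})
            result.append(_new_version(source_row, key, tracked, load_date))
        else:
            # Unchanged or already-expired rows are carried over as they are.
            result.append(dict(row))

    # Keys with no current row in the dimension are inserted as new.
    for business_key, source_row in incoming_by_key.items():
        if business_key not in current_keys:
            result.append(_new_version(source_row, key, tracked, load_date))
    return result
```

Given one current row for customer 1 in one city and an incoming batch where customer 1 has moved and customer 2 is new, the result has three rows: the expired old version, the new current version for customer 1, and a first version for customer 2. History is never overwritten, which is the point of Type 2.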
C. AWS
S3, Glue, Athena, Lambda, EMR, Redshift
Event-driven design (S3 → EventBridge → Lambda)
Security: IAM roles, bucket policies, encryption
CI/CD in AWS (CodePipeline, CloudFormation)
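
For the event-driven design topic above (S3 → EventBridge → Lambda), a minimal handler sketch is shown below. It assumes the bucket sends its notifications to EventBridge, so the Lambda receives an event with source "aws.s3", detail-type "Object Created", and the bucket name and object key under "detail". The return format is a made-up convention for this example, and a real handler would go on to read the object (for example with Boto3) instead of only reporting it.

```python
def handler(event, context):
    """Lambda entry point: pick out the bucket and key of a newly created object."""
    is_s3_create = (
        event.get("source") == "aws.s3"
        and event.get("detail-type") == "Object Created"
    )
    if not is_s3_create:
        # Anything else routed to this function by mistake is skipped.
        return {"status": "ignored"}

    detail = event["detail"]
    return {
        "status": "accepted",
        "bucket": detail["bucket"]["name"],
        "key": detail["object"]["key"],
    }
```

Filtering on source and detail-type inside the function is a defensive second check; the usual first line of filtering is the EventBridge rule's event pattern, so that the function is only invoked for the events it cares about.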
D. Python
Writing modular, reusable code
Working with Pandas, Boto3 (for AWS interaction)
Exception handling, logging
Lambda (anonymous) functions and decorators
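
Several of the Python topics above (reusable code, exception handling, logging, decorators) fit into one short sketch: a retry decorator that logs each failure and re-raises after the last attempt. The names retry, flaky_fetch, and the "pipeline" logger are hypothetical, and the version below retries immediately without any backoff, which a production version would likely add.

```python
import functools
import logging

logger = logging.getLogger("pipeline")


def retry(times, exceptions=(Exception,)):
    """Decorator factory: call the function up to `times` times."""
    def decorator(func):
        @functools.wraps(func)  # keep the wrapped function's name and docstring
        def wrapper(*args, **kwargs):
            for attempt in range(1, times + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions as exc:
                    logger.warning(
                        "%s failed on attempt %d/%d: %s",
                        func.__name__, attempt, times, exc,
                    )
                    if attempt == times:
                        raise  # out of attempts: surface the last error
        return wrapper
    return decorator


calls = {"count": 0}  # counts invocations of the demo function below


@retry(times=3, exceptions=(ConnectionError,))
def flaky_fetch():
    """Demo function that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"
```

Calling flaky_fetch() logs two warnings and then returns "ok" on the third attempt. Exceptions that are not listed in `exceptions` are not caught, so programming errors still fail fast instead of being retried.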
E. Kafka / Streaming
Kafka topic partitioning, consumer groups
Offset management
Integration with Spark Structured Streaming
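
The partitioning, consumer group, and offset topics above can be modeled with a toy simulation in plain Python. It is not a Kafka client and does not reproduce Kafka's own algorithms: zlib.crc32 stands in for the producer's key hash, the round-robin assignment is only one possible strategy, and a Python list stands in for a partition log. All function names are hypothetical.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a record key to a partition; the same key always lands on the same one."""
    # crc32 is a stand-in hash chosen for this sketch, not Kafka's partitioner.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(partitions, consumers):
    """Give each partition to exactly one consumer in the group (round-robin)."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(sorted(partitions)):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


def consume(log, committed_offset, process):
    """Process records from the committed offset onward; return the new offset.

    The offset advances only after `process` returns, so a crash in the middle
    of a record means that record is read again on restart (at-least-once).
    """
    offset = committed_offset
    for record in log[committed_offset:]:
        process(record)
        offset += 1
    return offset
```

For example, assign_partitions([0, 1, 2, 3], ["c1", "c2"]) gives each consumer two partitions, and consume(["a", "b", "c"], 1, print) processes "b" and "c" and returns 3 as the next offset to commit. Ordering is only guaranteed within a partition, which is why records that must stay in order should share a key.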