- Design, build, and optimize ETL pipelines using AWS Glue 3.0+ and PySpark (see the Glue job sketch after this list).
- Implement scalable and secure data lakes using Amazon S3, following bronze/silver/gold zoning.
- Write performant SQL using Amazon Athena (Presto) with CTEs, window functions, and aggregations (see the Athena query sketch after this list).
- Take full ownership from ingestion → transformation → validation → metadata → documentation → dashboard-ready output.
- Build pipelines that are not just performant, but audit-ready and metadata-rich from the first version.
- Integrate classification tags and ownership metadata into all columns using AWS Glue Catalog tagging conventions (see the column-tagging sketch after this list).
- Ensure no pipeline moves to QA or the BI team without completed validation logs and field-level metadata.
- Develop job orchestration workflows using AWS Step Functions integrated with EventBridge or CloudWatch (see the Step Functions sketch after this list).
- Manage schemas and metadata using AWS Glue Data Catalog.
- Enforce data quality using Great Expectations, with checks for null percentages, value ranges, and referential rules (see the data-quality sketch after this list).
- Maintain data lineage with OpenMetadata or Amundsen and apply metadata classifications (e.g., PII, KPIs).
- Collaborate with data scientists on ML pipelines, handling JSON/Parquet I/O and feature engineering.
- Prepare flattened, filterable datasets for BI tools such as Sigma, Power BI, or Tableau (see the flattening sketch after this list).
- Interpret business metrics such as forecasted revenue, margin trends, occupancy/utilization, and volatility.
- Work with consultants, QA, and business teams to finalize KPIs and logic.
- This is not just a coding role. We expect the candidate to think like a data architect within their module, designing pipelines that scale, handle exceptions, and align with evolving KPIs.
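
The sketches below are illustrative only; bucket names, table names, columns, ARNs, and thresholds are placeholders, not references to our actual environment. First, a minimal AWS Glue 3.0+ PySpark job that reads raw JSON from the bronze zone, applies light cleansing, and writes partitioned Parquet to the silver zone.

```python
# Minimal AWS Glue 3.0+ PySpark job sketch: bronze (raw) -> silver (cleansed).
# Paths and column names are illustrative placeholders.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Bronze zone: raw, immutable landing data.
bronze_df = spark.read.json("s3://example-lake/bronze/bookings/")

# Silver zone: deduplicated, typed, query-ready data.
silver_df = (
    bronze_df
    .dropDuplicates(["booking_id"])
    .withColumn("booking_date", F.to_date("booking_date"))
    .filter(F.col("booking_id").isNotNull())
)

(
    silver_df.write
    .mode("overwrite")
    .partitionBy("booking_date")
    .parquet("s3://example-lake/silver/bookings/")
)

job.commit()
```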
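Next, a hypothetical Athena query submitted through boto3, combining a CTE with a window function to rank monthly revenue per property; the database, table, and output location are assumptions.

```python
# Hypothetical Athena query: CTE + window function ranking monthly revenue.
import boto3

QUERY = """
WITH monthly_revenue AS (
    SELECT
        property_id,
        date_trunc('month', booking_date) AS month,
        SUM(revenue) AS total_revenue
    FROM silver.bookings
    GROUP BY property_id, date_trunc('month', booking_date)
)
SELECT
    property_id,
    month,
    total_revenue,
    RANK() OVER (PARTITION BY month ORDER BY total_revenue DESC) AS revenue_rank
FROM monthly_revenue
"""

athena = boto3.client("athena")
response = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "silver"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print(response["QueryExecutionId"])
```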
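A sketch of attaching classification and ownership metadata to every column of a Glue Data Catalog table through column-level Parameters; the tag keys (`classification`, `owner`) are an assumed convention rather than a Glue-mandated schema.

```python
# Sketch: column-level classification/ownership tags in the Glue Data Catalog.
import boto3

glue = boto3.client("glue")

DATABASE = "silver"
TABLE = "bookings"

current = glue.get_table(DatabaseName=DATABASE, Name=TABLE)["Table"]

# TableInput accepts only a subset of the fields returned by get_table.
table_input = {
    key: current[key]
    for key in ("Name", "Description", "StorageDescriptor", "PartitionKeys",
                "TableType", "Parameters")
    if key in current
}

for column in table_input["StorageDescriptor"]["Columns"]:
    params = column.get("Parameters", {})
    # Assumed convention: mark guest-identifying fields as PII, everything else internal.
    params.setdefault("classification", "PII" if column["Name"] in ("guest_name",) else "internal")
    params.setdefault("owner", "data-engineering")
    column["Parameters"] = params

glue.update_table(DatabaseName=DATABASE, TableInput=table_input)
```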
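A minimal Step Functions state machine that runs the Glue job and is triggered on a daily EventBridge schedule; role ARNs, names, and the cron expression are placeholders.

```python
# Sketch: Step Functions state machine running a Glue job, scheduled via EventBridge.
import json

import boto3

DEFINITION = {
    "Comment": "Run the silver-zone Glue ETL job",
    "StartAt": "RunGlueJob",
    "States": {
        "RunGlueJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "bookings-silver-etl"},
            "End": True,
        }
    },
}

sfn = boto3.client("stepfunctions")
state_machine = sfn.create_state_machine(
    name="bookings-silver-pipeline",
    definition=json.dumps(DEFINITION),
    roleArn="arn:aws:iam::123456789012:role/example-sfn-role",
)

# Trigger the state machine daily via an EventBridge schedule rule.
events = boto3.client("events")
events.put_rule(Name="daily-bookings-etl", ScheduleExpression="cron(0 6 * * ? *)")
events.put_targets(
    Rule="daily-bookings-etl",
    Targets=[{
        "Id": "bookings-silver-pipeline",
        "Arn": state_machine["stateMachineArn"],
        "RoleArn": "arn:aws:iam::123456789012:role/example-events-role",
    }],
)
```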
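A sketch of the null-percentage, range, and referential checks described above, using the legacy Great Expectations `SparkDFDataset` interface; the exact API differs across GE versions, and it assumes the `silver_df` DataFrame from the Glue sketch plus a hypothetical `property_dim_df` dimension table.

```python
# Sketch of data-quality checks: null %, value range, and a referential rule.
# Uses the legacy Great Expectations SparkDFDataset API; newer GE versions
# expose these expectations through a different (fluent) interface.
from great_expectations.dataset import SparkDFDataset

ge_df = SparkDFDataset(silver_df)

# Null % check: at least 99% of booking_id values must be non-null.
null_check = ge_df.expect_column_values_to_not_be_null("booking_id", mostly=0.99)

# Range check: revenue must fall within a plausible business range.
range_check = ge_df.expect_column_values_to_be_between("revenue", min_value=0, max_value=1_000_000)

# Referential rule: every property_id must exist in the (small) property dimension.
valid_property_ids = [
    row.property_id
    for row in property_dim_df.select("property_id").distinct().collect()
]
ref_check = ge_df.expect_column_values_to_be_in_set("property_id", valid_property_ids)

results = [null_check, range_check, ref_check]
if not all(r.success for r in results):
    raise ValueError("Data quality validation failed; see expectation results for details.")
```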
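Finally, a sketch of flattening a nested source into a filterable, BI-ready Parquet table with a couple of simple engineered features; the schema and feature names are illustrative.

```python
# Sketch: flatten nested JSON into a BI-ready table with simple engineered features.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

raw_df = spark.read.json("s3://example-lake/bronze/reservations/")

flat_df = (
    raw_df
    # Promote nested struct fields to top-level columns so BI tools can filter on them.
    .select(
        F.col("reservation_id"),
        F.col("guest.name").alias("guest_name"),
        F.col("guest.country").alias("guest_country"),
        F.col("stay.check_in").cast("date").alias("check_in"),
        F.col("stay.check_out").cast("date").alias("check_out"),
        F.col("financials.gross_revenue").alias("gross_revenue"),
        F.col("financials.cost").alias("cost"),
    )
    # Simple engineered features used by downstream ML and KPI dashboards.
    .withColumn("nights", F.datediff("check_out", "check_in"))
    .withColumn("margin", F.col("gross_revenue") - F.col("cost"))
    .withColumn("margin_pct", F.col("margin") / F.col("gross_revenue"))
)

flat_df.write.mode("overwrite").parquet("s3://example-lake/gold/reservations_flat/")
```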