data-engineer
Build scalable data pipelines, modern data warehouses, and real-time streaming architectures. Implements Apache Spark, dbt, Airflow, and cloud-native data platforms. Use PROACTIVELY for data pipeline design, analytics infrastructure, or modern data stack implementation.
Category: Development Tools
Install
Download and extract this skill to your skills directory, or copy the command below and send it to OpenClaw for auto-install:
Download and install this skill https://openskills.cc/api/download?slug=sickn33-skills-data-engineer&locale=en&source=copy
Data Engineer Skills - Building Modern Data Pipelines and Data Platforms
Skills Overview
Expert data engineer for building scalable data pipelines, modern data warehouses, and real-time streaming architectures using Apache Spark, dbt, Airflow, and cloud-native platforms.
Applicable Scenarios
Data Pipeline Design and Implementation
Design batch and stream-processing data pipelines and build end-to-end data flow solutions: extract data from source systems, transform it, and load it into target systems, handling the reliable transfer and transformation of large-scale data.
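As a concrete illustration, here is a minimal batch ETL sketch using PySpark. The bucket paths, column names, and filter condition are hypothetical placeholders, not part of this skill.

```python
# Minimal batch ETL sketch with PySpark. All paths and column names below
# (orders, amount, status, order_date) are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders_etl").getOrCreate()

# Extract: read raw CSV files (hypothetical source location)
orders = spark.read.option("header", True).csv("s3a://raw-bucket/orders/")

# Transform: cast types, keep completed orders, aggregate daily revenue
daily_revenue = (
    orders
    .withColumn("amount", F.col("amount").cast("double"))
    .filter(F.col("status") == "completed")
    .groupBy("order_date")
    .agg(F.sum("amount").alias("revenue"))
)

# Load: write partitioned Parquet to the curated zone (hypothetical target)
daily_revenue.write.mode("overwrite").partitionBy("order_date").parquet(
    "s3a://curated-bucket/daily_revenue/"
)
```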
Data Warehouse and Data Lake Construction
Build modern data warehouse or lakehouse architectures for enterprise-grade data storage and query optimization. Support Snowflake, BigQuery, Redshift, and other cloud data warehouses, as well as data lakes built on S3, ADLS, and GCS.
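For example, a lakehouse table can be created with Delta Lake on top of object storage. The sketch below assumes the delta-spark Python package is installed; the local path stands in for an S3/ADLS/GCS location.

```python
# Minimal lakehouse sketch using Delta Lake; assumes the delta-spark package.
# The /tmp path stands in for object storage (s3a://, abfss://, gs://).
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("lakehouse_demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

events = spark.range(100).withColumnRenamed("id", "event_id")

# ACID append to a Delta table; repeated runs add new versions (time travel)
events.write.format("delta").mode("append").save("/tmp/lake/events")

# Query the table with SQL, the way a warehouse engine would
spark.read.format("delta").load("/tmp/lake/events").createOrReplaceTempView("events")
spark.sql("SELECT COUNT(*) AS n FROM events").show()
```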
Analytics Infrastructure and Data Platform
Construct complete data platform infrastructure, including workflow orchestration, data quality monitoring, data governance, and metadata management. Enable self-service data products and support data-driven business decisions.
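Data quality monitoring can start as a simple programmatic gate inside the pipeline. The sketch below is a hand-rolled example; production platforms would more likely use dbt tests or Great Expectations, and the column names order_id and amount are assumptions.

```python
# Hand-rolled data quality gate for a PySpark DataFrame. Column names
# (order_id, amount) are illustrative; real platforms would typically use
# dedicated tools such as dbt tests or Great Expectations.
from pyspark.sql import DataFrame, functions as F

def check_quality(df: DataFrame) -> None:
    total = df.count()
    null_keys = df.filter(F.col("order_id").isNull()).count()  # PK non-null
    distinct_keys = df.select("order_id").distinct().count()   # PK unique
    negative = df.filter(F.col("amount") < 0).count()          # amounts >= 0
    if null_keys or distinct_keys != total or negative:
        raise ValueError(
            f"Quality gate failed: {null_keys} null keys, "
            f"{total - distinct_keys} duplicate keys, {negative} negative amounts"
        )
```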
Core Capabilities
Modern Data Stack Architecture
Master the full modern data stack, including data integration (Fivetran/Airbyte), data transformation (dbt), data warehouses (Snowflake/BigQuery), and BI tool integration. Support data mesh architecture and domain-driven data ownership design.
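As one example of how the transformation layer fits into the stack, dbt can be invoked programmatically from Python. This sketch assumes dbt-core 1.5 or later (which introduced dbtRunner), an existing dbt project in the working directory, and a hypothetical model selector named staging.

```python
# Programmatic dbt invocation; assumes dbt-core >= 1.5 and an existing dbt
# project in the working directory. The "staging" selector is hypothetical.
from dbt.cli.main import dbtRunner, dbtRunnerResult

runner = dbtRunner()
result: dbtRunnerResult = runner.invoke(["run", "--select", "staging"])

if not result.success:
    raise RuntimeError(f"dbt run failed: {result.exception}")
```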
Batch and Streaming Processing
Proficient in Apache Spark 4.0 for large-scale batch processing and in Apache Kafka and Flink for building real-time streaming pipelines. Support cloud-native streaming services such as AWS Kinesis, Azure Event Hubs, and Google Pub/Sub.
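To make the streaming side concrete, here is a minimal Spark Structured Streaming sketch reading from Kafka. It assumes the spark-sql-kafka connector is on the classpath; the broker address and topic name are placeholders.

```python
# Minimal Spark Structured Streaming job reading from Kafka; assumes the
# spark-sql-kafka connector is available. Broker and topic are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("clickstream").getOrCreate()

clicks = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "clicks")
    .load()
    .selectExpr("CAST(value AS STRING) AS payload", "timestamp")
)

# Tumbling one-minute windows of event counts, printed to the console sink
counts = clicks.groupBy(F.window("timestamp", "1 minute")).count()

query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```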
Workflow Orchestration and Monitoring
Use Apache Airflow, Prefect, and Dagster for complex workflow orchestration, enabling dependency management, dynamic task generation, and fault recovery, complemented by comprehensive monitoring, alerting, and data lineage tracking.
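For orchestration, a pipeline like the ones above might be wired up as an Airflow DAG. The sketch below uses the TaskFlow API (assuming Airflow 2.4+ for the schedule parameter) with placeholder task bodies.

```python
# Minimal Airflow DAG sketch using the TaskFlow API; assumes Airflow 2.4+.
# Task bodies are placeholders for real extract/transform/load logic.
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def orders_pipeline():
    @task
    def extract() -> list:
        return [{"order_id": 1, "amount": 42.0}]  # placeholder source

    @task
    def transform(rows: list) -> float:
        return sum(r["amount"] for r in rows)

    @task
    def load(total: float) -> None:
        print(f"daily revenue: {total}")  # placeholder sink

    # TaskFlow infers the dependency graph from these calls
    load(transform(extract()))

orders_pipeline()
```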
Frequently Asked Questions
What is the difference between a data engineer and a data analyst?
Data analysts focus on exploratory analysis and visualization using existing data to answer business questions. Data engineers are responsible for building and managing data infrastructure, designing data pipelines, and ensuring reliable data flow, quality, and availability. Simply put, data engineers "pave the way" for data analysts.
How do I choose the right data warehouse technology?
Choosing a data warehouse requires weighing multiple factors: data volume, query performance requirements, budget, team technology stack, and cloud platform preference. Snowflake suits scenarios needing elastic scaling and a multi-cloud strategy; BigQuery has cost advantages within the GCP ecosystem; Redshift fits teams deeply invested in AWS; open-source options like ClickHouse and Apache Doris suit self-managed deployments.
When do you need real-time data pipelines?
Real-time pipelines are needed when business scenarios require low-latency responses, for example: real-time recommendation systems, fraud detection, IoT device monitoring, live dashboards, and user behavior analysis. If data can tolerate hour- or day-level latency, batch processing is usually simpler and less costly.
What are the limitations of this skill?
This skill focuses on data engineering–related tasks. If you only need exploratory data analysis (EDA), ML model development without pipeline work, or cannot access data sources and storage systems, consider using other specialized skills.
What core technologies should data engineers master?
Core technologies include: SQL (essential), Python/Scala programming, data modeling (dimensional modeling, Data Vault), at least one batch processing framework (Spark), workflow orchestration (Airflow), cloud platform data services, data quality tools, and basic knowledge of containerization and infrastructure as code.