data-engineering-data-pipeline
You are a data pipeline architecture expert specializing in scalable, reliable, and cost-effective data pipelines for batch and streaming data processing.
Category: Development Tools
Download and extract this skill to your skills directory, or copy the command below and send it to OpenClaw for auto-install:
Download and install this skill https://openskills.cc/api/download?slug=sickn33-skills-data-engineering-data-pipeline&locale=en&source=copy
Data Pipeline Architecture Expert Skill
Skill Overview
This is a data pipeline architecture expert skill focused on designing scalable, reliable, and cost-optimized batch and streaming data pipelines, covering the full data engineering lifecycle from architecture design to operations and monitoring.
Applicable Scenarios
1. Data Pipeline Architecture Design and Planning
When you need to assess data sources, data volumes, and latency requirements, and choose appropriate architecture patterns (ETL/ELT, Lambda, Kappa, or Lakehouse), this skill can provide guidance on architecture pattern selection, tech stack recommendations, and data flow design.
2. Data Pipeline Implementation and Development
When you are implementing batch or streaming data ingestion, using dbt or Spark for data transformations, or configuring Airflow or Prefect for workflow orchestration, this skill can provide code implementation guidance and best practice recommendations.
3. Data Pipeline Operations and Optimization
When you need to implement data quality monitoring, configure alerting strategies, optimize storage costs, or perform troubleshooting, this skill can provide monitoring configurations, cost optimization strategies, and runbooks.
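To make the implementation scenario above concrete, here is a minimal, illustrative batch-pipeline skeleton in pure Python. All names (`Pipeline`, `extract`, `transform`, `load`) and the sample records are hypothetical; in a real deployment each stage would be an Airflow/Prefect task and the transform would run in dbt or Spark.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class Pipeline:
    """Toy batch pipeline: ordered stages run over a payload."""
    stages: list = field(default_factory=list)

    def stage(self, fn: Callable) -> Callable:
        # Register a stage in declaration order.
        self.stages.append(fn)
        return fn

    def run(self, payload: Any = None) -> Any:
        for fn in self.stages:
            payload = fn(payload)
        return payload

pipeline = Pipeline()

@pipeline.stage
def extract(_):
    # Stand-in for reading from a source system (API, database, files).
    return [{"id": 1, "amount": "10.5"}, {"id": 2, "amount": "3.0"}]

@pipeline.stage
def transform(rows):
    # Cast types and derive fields; dbt or Spark would do this at scale.
    return [{**r, "amount": float(r["amount"])} for r in rows]

@pipeline.stage
def load(rows):
    # Stand-in for a warehouse write; here we just return the rows.
    return rows

result = pipeline.run()
```

The decorator registration mirrors how orchestrators declare task order; swapping the stage bodies for real connectors is the only change needed to move from sketch to pipeline.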
Core Capabilities
1. Multi-architecture Pattern Design Support
Supports design guidance for five mainstream architecture patterns: ETL (transform-then-load), ELT (load-then-transform), Lambda (batch + speed layer), Kappa (pure streaming), and Lakehouse (unified architecture), including architecture diagram creation, tech stack selection, and scalability analysis.
2. End-to-End Data Quality Framework
Integrates two data quality solutions—Great Expectations and dbt Tests—providing table-level and field-level validation rules, checkpoint configuration, data documentation generation, and failure notification mechanisms, with the goal of keeping the data quality check pass rate above 99%.
3. Comprehensive Monitoring and Cost Optimization
Provides CloudWatch/Prometheus/Grafana monitoring configurations, tracks core metrics such as records processed/failed, data size, and execution time, and can yield 30–50% infrastructure cost savings through partition optimization, lifecycle policies, and compute instance selection.
Frequently Asked Questions
What are the common patterns for data pipeline architectures?
There are five main data pipeline patterns: ETL (transform data before loading into the target), ELT (load first then transform, suitable for cloud data warehouses), Lambda (hybrid architecture combining batch and stream processing layers), Kappa (pure stream processing architecture), and Lakehouse (a newer unified data lake and data warehouse architecture). The choice depends on your data volume, latency requirements, and business needs.
What is the difference between ETL and ELT?
ETL (Extract-Transform-Load) transforms data before loading it into the target system, suitable for traditional data warehouse scenarios; ELT (Extract-Load-Transform) loads data into the target first and then transforms it, leveraging the compute capabilities of modern data warehouses, which is more flexible and often more cost-effective. ELT is the mainstream choice for cloud-native data engineering.
How can data pipelines be cost-optimized?
Cost optimization can be applied across multiple dimensions: on the storage layer use appropriate partitioning strategies (keep partition size > 1 GB), control file sizes to 512 MB–1 GB (Parquet format), and configure lifecycle policies (hot→warm→cold); on the compute layer use Spot instances for batch processing, on-demand instances for streaming, and serverless for ad-hoc queries; on the query layer improve performance with partition pruning, clustering, and predicate pushdown.
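The 512 MB–1 GB file-size target above can be turned into a simple sizing rule. The following sketch (hypothetical helper, not from any library) picks an output file count so that average file size stays inside that band:

```python
def target_file_count(total_bytes: int,
                      min_file_bytes: int = 512 * 1024**2,
                      max_file_bytes: int = 1024**3) -> int:
    """Choose a file count so each output file lands near 512 MB-1 GB."""
    if total_bytes <= max_file_bytes:
        # Small datasets: one file avoids small-file overhead entirely.
        return 1
    # Fewest files whose average size stays at or below the max...
    count = -(-total_bytes // max_file_bytes)  # ceiling division
    # ...but never so many that files shrink below the minimum size.
    max_count = max(1, total_bytes // min_file_bytes)
    return min(count, max_count)

# 10 GiB of data -> 10 files of ~1 GiB each
print(target_file_count(10 * 1024**3))  # → 10
```

In Spark this number would feed `repartition(n)` (or coalesce) before the Parquet write; the same arithmetic applies to any columnar sink.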
What is the difference between Lambda and Kappa architectures?
Lambda architecture consists of a batch layer (processing full historical data), a speed layer (processing real-time data), and a serving layer (merging results from both), which requires maintaining two codebases and increases complexity. Kappa architecture retains only the stream processing layer and handles full reprocessing by replaying message queue history, making the architecture simpler but reliant on strong stream processing capabilities.
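The key Kappa property described above—reprocessing by replaying the retained log through the same code—can be shown with a toy stream processor (a running sum per key; all names are illustrative):

```python
def process_stream(events, state=None):
    """Toy stream processor: maintain a running sum per key.

    Kappa-style reprocessing is simply replaying the retained log
    through this same function, instead of keeping a separate batch
    codebase as Lambda architecture would.
    """
    state = dict(state or {})
    for key, value in events:
        state[key] = state.get(key, 0) + value
    return state

log = [("a", 1), ("b", 2), ("a", 3)]   # retained message-queue history
live = process_stream(log)              # normal streaming consumption
replayed = process_stream(log)          # full reprocessing via replay
assert live == replayed == {"a": 4, "b": 2}
```

Because one code path produces both results, there is no batch/speed-layer divergence to reconcile—at the cost of requiring the queue (e.g. Kafka) to retain enough history to replay.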
How do you ensure data quality in a data pipeline?
You can use multi-layer quality assurance mechanisms: perform schema validation and dead-letter queue handling at the ingestion layer; use dbt’s built-in tests (unique, not_null, relationships) and custom tests at the transformation layer; use Great Expectations in a separate quality layer to configure table- and field-level validation rules; and set up data freshness checks and automated alerting.
Which metrics should data pipeline monitoring focus on?
Core monitoring metrics include: business metrics (records processed, failures, data volume), performance metrics (execution time, end-to-end latency), quality metrics (data quality score, freshness), system metrics (CPU/memory usage, error rates), and cost metrics (cost allocation by job/table/project). It is recommended to set up Grafana or CloudWatch dashboards for visual monitoring.
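Several of the metrics listed above are derived rather than raw: error rate comes from processed/failed counters, and freshness is lag against an SLA. A small sketch (hypothetical helper; in production these values would be emitted to Prometheus or CloudWatch rather than returned):

```python
from datetime import datetime, timedelta, timezone

def pipeline_metrics(processed: int, failed: int,
                     last_loaded_at: datetime,
                     freshness_sla: timedelta) -> dict:
    """Derive core alerting signals from raw run counters."""
    lag = datetime.now(timezone.utc) - last_loaded_at
    return {
        # Treat a run that processed nothing as fully failed.
        "error_rate": failed / processed if processed else 1.0,
        "freshness_lag_seconds": lag.total_seconds(),
        "freshness_breached": lag > freshness_sla,
    }

m = pipeline_metrics(
    processed=10_000, failed=25,
    last_loaded_at=datetime.now(timezone.utc) - timedelta(hours=2),
    freshness_sla=timedelta(hours=1),
)
assert m["error_rate"] == 0.0025
assert m["freshness_breached"] is True
```

Alert thresholds would then be set on these derived values (e.g. error rate above 1%, or any freshness breach) in Grafana or CloudWatch alarms.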