Planning, building, and deploying data pipelines for both batch and streaming workloads using Google Cloud services like Dataflow, Dataproc, and Pub/Sub.
This domain focuses on the complete lifecycle of data pipelines on Google Cloud. It begins with planning pipelines by defining data sources and sinks, transformation and orchestration logic, networking fundamentals, and data encryption. Building pipelines covers data cleansing, identifying appropriate services (Dataflow, Apache Beam, Dataproc, Cloud Data Fusion, BigQuery, Pub/Sub, Apache Spark, Hadoop ecosystem, Apache Kafka), and implementing transformations for batch processing, streaming with windowing and late-arriving data, processing logic, and AI-based data enrichment. The domain also covers data acquisition, import, and integration with new data sources. Deploying and operationalizing pipelines addresses job automation and orchestration using Cloud Composer and Workflows, as well as CI/CD practices for continuous integration and deployment of data pipelines. (~25% of exam)
5 minutes
5 Questions
Ingesting and processing data in Google Cloud involves collecting raw data from various sources and transforming it into meaningful, usable formats for analytics and decision-making.
**Data Ingestion** refers to bringing data into Google Cloud from diverse sources such as on-premises databases, streaming sources, SaaS applications, and IoT devices. Key tools include:
- **Cloud Pub/Sub**: A fully managed messaging service for real-time streaming ingestion. It decouples senders and receivers, enabling asynchronous event-driven architectures.
- **Storage Transfer Service**: Facilitates large-scale data transfers from on-premises systems, AWS S3, or other cloud sources into Cloud Storage.
- **Transfer Appliance**: A physical device for offline bulk data migration when network bandwidth is limited.
- **Cloud IoT Core** (retired in 2023): Formerly provided managed ingestion from IoT devices; Google now directs customers to partner solutions.
- **BigQuery Data Transfer Service**: Automates data movement into BigQuery from SaaS platforms like Google Ads and YouTube.
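To make the decoupling that Pub/Sub provides concrete, here is a minimal, standard-library-only sketch of the publish/fan-out pattern. It is a conceptual model, not the real client library (a production pipeline would use `google-cloud-pubsub`), and the `InMemoryTopic` class and subscription names are invented for illustration.

```python
# Conceptual sketch of Pub/Sub-style decoupling using only the standard
# library. A real pipeline would use google-cloud-pubsub; the class and
# subscription names here are illustrative, not real resources.
from queue import Queue

class InMemoryTopic:
    """Stands in for a Pub/Sub topic: publishers and subscribers never
    interact directly, which is the decoupling the service provides."""
    def __init__(self):
        self._subscriptions = []

    def subscribe(self) -> Queue:
        q = Queue()
        self._subscriptions.append(q)
        return q

    def publish(self, message: bytes) -> None:
        # Pub/Sub fans out each message to every attached subscription.
        for q in self._subscriptions:
            q.put(message)

topic = InMemoryTopic()
analytics_sub = topic.subscribe()
archive_sub = topic.subscribe()

topic.publish(b'{"event": "page_view"}')

# Both independent consumers receive the same event.
print(analytics_sub.get() == archive_sub.get())  # True
```

The key property to notice is that the publisher knows nothing about how many subscribers exist, so new consumers (an archive sink, an analytics job) can be added without touching the producer.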
**Data Processing** transforms ingested data through ETL/ELT pipelines:
- **Cloud Dataflow**: A fully managed service based on Apache Beam for both batch and stream processing. It supports windowing, watermarks, and triggers for complex event processing.
- **Cloud Dataproc**: A managed Hadoop and Spark service for running existing big data workloads with auto-scaling capabilities.
- **Cloud Composer**: An Apache Airflow-based orchestration service for scheduling and managing complex data pipelines.
- **Cloud Dataprep**: A visual data preparation tool for cleaning and transforming data without coding.
- **BigQuery**: Supports in-place processing using SQL, including scheduled queries and materialized views.
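Windowing and watermarks are the parts of the Dataflow/Beam model that trip people up most, so here is a short pure-Python sketch of tumbling-window assignment and late-data detection. The 60-second window size and event timestamps are illustrative assumptions; Beam implements these semantics for you via `FixedWindows` and allowed lateness.

```python
# Minimal sketch of tumbling-window semantics as applied by Dataflow.
# The window size and timestamps are illustrative; Apache Beam provides
# this behavior natively (FixedWindows, watermarks, allowed lateness).
def assign_window(event_ts: float, size_secs: int = 60) -> tuple:
    """Map an event timestamp to its fixed (tumbling) window."""
    start = int(event_ts // size_secs) * size_secs
    return (start, start + size_secs)

def is_late(event_ts: float, watermark: float, size_secs: int = 60) -> bool:
    """An event is late when the watermark has passed its window's end."""
    _, end = assign_window(event_ts, size_secs)
    return watermark >= end

# Events at t=30s and t=70s fall into different 60-second windows.
print(assign_window(30))   # (0, 60)
print(assign_window(70))   # (60, 120)

# With the watermark at t=65s, an event stamped t=30s is late-arriving:
print(is_late(30, watermark=65))  # True
print(is_late(70, watermark=65))  # False
```

In Beam, triggers then decide whether such late data is dropped or emitted as a pane update, which is exactly the design decision the exam expects you to reason about.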
**Key Considerations**:
- Choose between batch and streaming based on latency requirements.
- Design for fault tolerance and exactly-once processing semantics.
- Optimize for cost using autoscaling and appropriate storage tiers.
- Ensure data quality through validation, deduplication, and schema enforcement.
- Implement proper IAM controls and encryption for security during ingestion and processing.
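The data-quality and exactly-once points above can be sketched in a few lines: validate each record against a minimal schema, then deduplicate by a message ID, which is how a sink can absorb Pub/Sub's at-least-once redeliveries. The field names and sample records are made up for illustration; Dataflow and BigQuery offer built-in equivalents.

```python
# Sketch of pipeline-side data quality checks: schema validation and
# deduplication by message ID. Field names and records are illustrative;
# Dataflow/BigQuery provide native equivalents of both steps.
REQUIRED_FIELDS = {"id", "user", "amount"}

def validate(record: dict) -> bool:
    """Enforce a minimal schema: required fields present, amount numeric."""
    return (REQUIRED_FIELDS <= record.keys()
            and isinstance(record.get("amount"), (int, float)))

def dedupe(records):
    """Drop repeats of the same 'id', keeping the first occurrence,
    as an idempotent sink does with redelivered messages."""
    seen = set()
    for r in records:
        if r["id"] not in seen:
            seen.add(r["id"])
            yield r

raw = [
    {"id": 1, "user": "a", "amount": 9.5},
    {"id": 1, "user": "a", "amount": 9.5},   # redelivered duplicate
    {"id": 2, "user": "b"},                  # fails validation: no amount
    {"id": 3, "user": "c", "amount": 4},
]
clean = list(dedupe(r for r in raw if validate(r)))
print([r["id"] for r in clean])  # [1, 3]
```

Keeping validation and deduplication as separate, composable steps mirrors how you would structure them as distinct transforms in a Beam pipeline.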
Effective ingestion and processing pipelines form the foundation of any data engineering solution on Google Cloud.