Defining Data Sources and Sinks – GCP Professional Data Engineer Guide
Why Is Defining Data Sources and Sinks Important?
In any data engineering pipeline, the very first architectural decision involves identifying where data comes from (sources) and where data goes (sinks). Getting this wrong can lead to data loss, increased latency, cost overruns, and compliance violations. On the GCP Professional Data Engineer exam, questions about sources and sinks test your ability to choose the right ingestion mechanism and the right storage or output target for a given business scenario. Mastering this topic is foundational to nearly every other domain on the exam, including pipeline design, storage optimization, and security.
What Are Data Sources and Sinks?
A data source is any system, application, device, or service that produces or holds data you need to ingest. Examples include:
- Relational databases (Cloud SQL, on-premises MySQL/PostgreSQL, AlloyDB)
- NoSQL databases (Firestore, Bigtable, MongoDB)
- Streaming platforms (Pub/Sub, Apache Kafka)
- File systems and object stores (Cloud Storage, HDFS, local file systems)
- SaaS applications and APIs (Salesforce, Google Analytics, third-party REST APIs)
- IoT devices sending telemetry data
- Logs from applications, infrastructure, or services (Cloud Logging, Fluentd)
A data sink (also called a data destination or target) is the system where processed or raw data is written. Examples include:
- BigQuery (analytical data warehouse)
- Cloud Storage (data lake, archival)
- Bigtable (low-latency, high-throughput NoSQL)
- Cloud SQL or AlloyDB (transactional workloads)
- Pub/Sub (fan-out to downstream consumers)
- Spanner (globally distributed relational database)
- Looker Studio, dashboards, or downstream APIs
How It Works – The Process of Defining Sources and Sinks
1. Understand the Data Characteristics
Before selecting sources and sinks, assess:
- Volume: How much data is generated? GB per day? TB per hour?
- Velocity: Is data arriving in real time (streaming) or periodically (batch)?
- Variety: Is the data structured (rows/columns), semi-structured (JSON, Avro), or unstructured (images, video)?
- Veracity: How reliable and clean is the data?
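The assessment above can be sketched as a small helper. This is a study aid with illustrative names and thresholds, not a Google API; the cutoff of 1 TB/day for "high volume" is an assumption for the example only.

```python
# Illustrative sketch: profile a dataset's volume, velocity, and variety.
# Thresholds and category names are assumptions, not GCP definitions.

def assess_data(volume_gb_per_day: float, arrival: str, data_format: str) -> dict:
    """Return a rough profile used to guide ingestion and sink choices."""
    return {
        "volume": "high" if volume_gb_per_day >= 1000 else "moderate",
        "velocity": "streaming" if arrival == "continuous" else "batch",
        "variety": {
            "csv": "structured",
            "json": "semi-structured",
            "avro": "semi-structured",
            "jpeg": "unstructured",
        }.get(data_format, "unknown"),
    }

print(assess_data(2500, "continuous", "json"))
# {'volume': 'high', 'velocity': 'streaming', 'variety': 'semi-structured'}
```

A profile like this feeds directly into the next step: a "high volume, streaming, semi-structured" result points toward Pub/Sub and Dataflow rather than periodic batch loads.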
2. Choose the Right Ingestion Pattern
- Batch ingestion: Use Cloud Storage as a landing zone, then load into BigQuery or other sinks via Dataflow, Dataproc, or BigQuery load jobs. Storage Transfer Service helps move data from on-premises file systems or other clouds into Cloud Storage.
- Streaming ingestion: Use Pub/Sub as the primary streaming source. Dataflow (Apache Beam) reads from Pub/Sub and writes to sinks such as BigQuery (streaming inserts or Storage Write API), Bigtable, or Cloud Storage.
- Change Data Capture (CDC): Use Datastream to capture real-time changes from sources such as MySQL, PostgreSQL, Oracle, or SQL Server (including Cloud SQL instances) and replicate them into BigQuery or Cloud Storage.
- Database migration: Use Database Migration Service (DMS) to migrate relational databases to Cloud SQL, AlloyDB, or Spanner.
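The four patterns above amount to a decision tree, which can be made concrete as a small helper. This is an illustrative study sketch, not an official tool; the boolean flags are assumptions chosen to mirror the scenario cues exam questions give you.

```python
# Illustrative decision helper encoding the four ingestion patterns above.
# Flag names are assumptions for the example, not GCP terminology.

def choose_ingestion(realtime: bool, source_is_database: bool,
                     replicate_changes: bool, full_migration: bool) -> str:
    """Map scenario cues to the ingestion pattern described in this guide."""
    if source_is_database and full_migration:
        return "Database Migration Service -> Cloud SQL / AlloyDB / Spanner"
    if source_is_database and replicate_changes:
        return "Datastream (CDC) -> BigQuery or Cloud Storage"
    if realtime:
        return "Pub/Sub -> Dataflow -> BigQuery / Bigtable / Cloud Storage"
    return "Cloud Storage landing zone -> BigQuery load job or Dataflow"

print(choose_ingestion(realtime=True, source_is_database=False,
                       replicate_changes=False, full_migration=False))
# Pub/Sub -> Dataflow -> BigQuery / Bigtable / Cloud Storage
```

Note the ordering: database-specific patterns (migration, CDC) are checked before the generic real-time branch, because a database source with change replication calls for Datastream even though the data arrives continuously.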
3. Map Sources to Appropriate Sinks
The mapping depends on use cases:
- Analytics and reporting: Source → Pub/Sub or Cloud Storage → Dataflow → BigQuery (sink)
- Real-time serving: Source → Pub/Sub → Dataflow → Bigtable (sink)
- Data lake / archival: Source → Cloud Storage (sink) in Parquet or Avro format
- Transactional applications: Source → Cloud SQL or Spanner (sink)
- ML feature store: Source → Vertex AI Feature Store (sink)
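The mappings above can be captured as a lookup table, which is a useful way to drill source-to-sink pairings for the exam. The dictionary below is a study aid, not an API; the service chains are exactly the ones listed in this guide.

```python
# The use-case-to-pipeline mappings above, as a small lookup table.
# Keys and chain entries are taken from this guide; not an official API.

PIPELINES = {
    "analytics": ["Pub/Sub or Cloud Storage", "Dataflow", "BigQuery"],
    "real-time serving": ["Pub/Sub", "Dataflow", "Bigtable"],
    "data lake": ["Cloud Storage (Parquet/Avro)"],
    "transactional": ["Cloud SQL or Spanner"],
    "ml features": ["Vertex AI Feature Store"],
}

def sink_for(use_case: str) -> str:
    """Return the terminal sink for a given use case."""
    return PIPELINES[use_case][-1]

print(sink_for("real-time serving"))
# Bigtable
```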
4. Consider Connectivity and Networking
- For on-premises sources, consider VPN, Cloud Interconnect, or Transfer Appliance.
- For cross-cloud sources, use Storage Transfer Service or BigQuery Omni / BigLake.
- Private connectivity via VPC Service Controls and Private Google Access is crucial for security.
5. Handle Schema and Format Compatibility
- Ensure the source data format (CSV, JSON, Avro, Parquet, ORC) is compatible with the sink.
- Use schema evolution strategies (e.g., Avro with schema registry) to handle changing source schemas.
- BigQuery supports schema auto-detection, but production pipelines should use explicit schemas to avoid silent type drift.
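The schema-evolution point can be made concrete with a simplified, stdlib-only sketch of the backward-compatibility rule Avro enforces: a reader using the new schema can still read old records only if every newly added field carries a default. The field/schema shapes below are assumptions for illustration; real pipelines should rely on Avro's own schema resolution or a schema registry.

```python
# Simplified sketch of Avro-style backward compatibility (illustrative only).
# Schemas are modeled as {field name: {"type": ..., optional "default": ...}}.

def backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    """Can a reader with new_fields still decode records written with old_fields?"""
    for name, spec in new_fields.items():
        if name not in old_fields and "default" not in spec:
            return False  # a new required field breaks old records
        if name in old_fields and spec["type"] != old_fields[name]["type"]:
            return False  # type changes need explicit promotion rules
    return True

old = {"id": {"type": "long"}, "email": {"type": "string"}}
new = {"id": {"type": "long"}, "email": {"type": "string"},
       "country": {"type": "string", "default": "unknown"}}
print(backward_compatible(old, new))
# True
```

Adding `country` with a default is safe; adding it without one would make old records unreadable, which is exactly the failure mode a schema registry is there to catch before data reaches the sink.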
6. Apply Security and Compliance Controls
- Use IAM roles to restrict who can read from sources and write to sinks.
- Encrypt data in transit (TLS) and at rest (CMEK or Google-managed keys).
- Use the Cloud DLP API (now part of Sensitive Data Protection) to scan and redact sensitive data between source and sink.
- Ensure data residency requirements are met by choosing appropriate sink regions.
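To make the inspect-and-redact step concrete, here is a toy, regex-based redactor illustrating conceptually what DLP-style de-identification does between source and sink. This is a deliberately simplified sketch; production pipelines should use the managed Sensitive Data Protection service, not hand-rolled regexes.

```python
# Toy redactor (illustrative only): mask email addresses in a record before
# it is written to the sink. Real pipelines should use the managed DLP /
# Sensitive Data Protection service instead of hand-rolled patterns.
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def redact(record: str) -> str:
    """Replace email addresses with a redaction token."""
    return EMAIL.sub("[REDACTED_EMAIL]", record)

print(redact("contact: alice@example.com, status: active"))
# contact: [REDACTED_EMAIL], status: active
```

Placing this kind of transform in the pipeline (for example, inside a Dataflow step) means the sink never stores raw PII, which is usually what compliance-focused exam scenarios are probing for.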
Key GCP Services by Role
Ingestion / Source Connectors:
- Pub/Sub – scalable, serverless messaging for streaming sources
- Datastream – CDC from relational databases
- Storage Transfer Service – scheduled transfers from Amazon S3, Azure Blob Storage, HTTP endpoints, or other Cloud Storage buckets
- Transfer Appliance – physical device for large-scale offline data migration
- BigQuery Data Transfer Service – automated data loads from Google SaaS products (Google Ads, YouTube, etc.)
Processing / Transformation:
- Dataflow (Apache Beam) – unified batch and streaming ETL
- Dataproc (Spark/Hadoop) – big data processing for existing Spark workloads
- Cloud Data Fusion – code-free, visual ETL pipelines (CDAP-based)
- Dataform – SQL-based transformations inside BigQuery (ELT pattern)
Sinks / Destinations:
- BigQuery – petabyte-scale analytics warehouse
- Cloud Storage – object storage for data lake, staging, and archival
- Bigtable – low-latency, high-throughput NoSQL for time-series and IoT
- Spanner – globally consistent relational database
- Cloud SQL / AlloyDB – managed relational databases for OLTP
- Firestore – document database for mobile/web apps
Exam Tips: Answering Questions on Defining Data Sources and Sinks
Tip 1: Match the source/sink to the use case. The exam frequently presents a scenario and asks you to pick the best source-sink pairing. Ask yourself: Is this batch or streaming? Is it analytical or transactional? Does it need low latency or high throughput?
Tip 2: Know when to use Pub/Sub vs. direct ingestion. If the question mentions real-time, event-driven, or decoupled architecture, Pub/Sub is almost always the streaming source. If the question mentions scheduled or periodic loads, think batch ingestion via Cloud Storage.
Tip 3: Understand Datastream for CDC scenarios. If a question describes real-time replication from an operational database (MySQL, PostgreSQL, Oracle) to BigQuery or Cloud Storage, Datastream is the answer—not Dataflow alone.
Tip 4: Remember BigQuery Data Transfer Service for Google SaaS sources. If the data source is Google Ads, Campaign Manager, YouTube, or Google Play, BigQuery Data Transfer Service is purpose-built.
Tip 5: Watch for hybrid and multi-cloud keywords. If data is on-premises or in AWS S3, look for answers involving Storage Transfer Service, Transfer Appliance, Cloud Interconnect, or BigQuery Omni. The exam tests your knowledge of moving data across environments.
Tip 6: Consider cost and simplicity. The exam often has a "most cost-effective" or "least operational overhead" qualifier. Serverless options (Pub/Sub, Dataflow, BigQuery) are generally preferred over managing infrastructure (Dataproc clusters, self-hosted Kafka).
Tip 7: Pay attention to data format and schema evolution. If a question mentions evolving schemas or compatibility, think Avro (supports schema evolution natively). If it mentions columnar analytics, think Parquet or ORC.
Tip 8: Security-related source/sink questions. If the question emphasizes PII, compliance, or data governance, look for answers that include DLP API, CMEK, VPC Service Controls, or column-level security in BigQuery.
Tip 9: Eliminate wrong sinks by workload type. BigQuery is not a transactional database—eliminate it for OLTP. Bigtable does not support SQL joins—eliminate it for complex analytical queries. Cloud SQL has size limits—eliminate it for petabyte-scale analytics.
Tip 10: Read the full question carefully. Many questions include subtle requirements like "minimize latency," "exactly-once delivery," or "global availability" that narrow down the correct source-sink combination. Do not rush—the differentiator is often in a single phrase.
Summary
Defining data sources and sinks is the cornerstone of data pipeline architecture on GCP. By understanding the characteristics of your data, choosing the right ingestion pattern, mapping sources to appropriate sinks, and applying security controls, you build pipelines that are scalable, cost-effective, and reliable. For the exam, always reason from the business requirements to the technical solution, and let the use case guide your selection of GCP services.