Unstructured Data for Embeddings and RAG
Unstructured data refers to information that doesn't follow a predefined schema or organized format, such as text documents, images, audio files, videos, PDFs, and emails. In the context of Google Cloud's data engineering ecosystem, handling unstructured data effectively is critical for modern AI-driven analytics.

**Embeddings** are numerical vector representations of unstructured data that capture semantic meaning. For example, a sentence like 'machine learning is powerful' is converted into a dense vector (e.g., [0.23, -0.45, 0.78, ...]). Google Cloud services such as Vertex AI provide embedding APIs that transform text, images, and multimodal content into these vectors. The embeddings are stored in vector databases such as AlloyDB, Cloud SQL with pgvector, or Vertex AI Vector Search, enabling efficient similarity search.

**RAG (Retrieval-Augmented Generation)** is a pattern that enhances Large Language Models (LLMs) by grounding their responses in relevant, domain-specific unstructured data. The RAG workflow involves three key steps:
1. **Ingestion**: Unstructured documents are chunked, converted into embeddings, and stored in a vector database.
2. **Retrieval**: When a user query arrives, it's converted to an embedding, and semantically similar document chunks are retrieved from the vector store.
3. **Generation**: The retrieved context is passed alongside the query to an LLM (like Gemini on Vertex AI), which generates accurate, grounded responses.
On Google Cloud, RAG pipelines can be built using Vertex AI Search, Vertex AI Agent Builder, or custom solutions combining Cloud Storage (for raw documents), Dataflow (for processing pipelines), BigQuery (for metadata), and Vertex AI (for embeddings and LLM inference). As a data engineer, your key considerations include choosing appropriate chunking strategies, selecting optimal embedding models, managing vector index updates, ensuring data freshness, and optimizing retrieval performance. Understanding these concepts is essential for building scalable, production-grade AI applications that leverage enterprise unstructured data effectively.
Unstructured Data for Embeddings and RAG – GCP Professional Data Engineer Guide
Why Is This Topic Important?
Unstructured data—text documents, images, audio, video, PDFs, and more—makes up the vast majority of enterprise data. Traditional databases and analytics tools were built for structured, tabular data, leaving a massive gap in how organizations derive value from unstructured content. With the rise of large language models (LLMs) and generative AI, the ability to transform unstructured data into meaningful numerical representations (embeddings) and then use those embeddings in Retrieval-Augmented Generation (RAG) pipelines has become a critical skill. Google Cloud's Professional Data Engineer exam increasingly tests your understanding of how to prepare, store, and serve unstructured data for these modern AI workflows.
What Are Embeddings?
An embedding is a dense vector (a list of floating-point numbers) that captures the semantic meaning of a piece of data. Two pieces of content that are semantically similar will have embeddings that are close together in vector space.
Key points:
- Text embeddings represent the meaning of words, sentences, or entire documents.
- Multimodal embeddings can represent images, audio, or video alongside text in the same vector space.
- Google Cloud offers embedding models through Vertex AI (e.g., textembedding-gecko, multimodalembedding).
- Embeddings are typically 256–768 dimensions, though some models produce larger vectors.
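The "close together in vector space" idea above is usually measured with cosine similarity. A minimal sketch in plain Python; the 4-dimensional vectors here are made-up toy values, not real model output:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" (real models emit hundreds of dimensions).
cat = [0.9, 0.1, 0.05, 0.0]
kitten = [0.85, 0.15, 0.1, 0.0]
invoice = [0.0, 0.05, 0.9, 0.4]

print(cosine_similarity(cat, kitten))   # close to 1.0 — semantically similar
print(cosine_similarity(cat, invoice))  # much lower — unrelated concepts
```

The same function works at any dimensionality; a similarity search simply ranks stored vectors by this score against the query vector.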
What Is RAG (Retrieval-Augmented Generation)?
RAG is an architectural pattern that enhances LLM responses by grounding them in external, authoritative data rather than relying solely on the model's training data.
The RAG workflow:
1. Ingestion (Offline): Unstructured documents are chunked, converted to embeddings, and stored in a vector database or search index.
2. Retrieval (Online): When a user asks a question, the query is also converted to an embedding, and a similarity search retrieves the most relevant document chunks.
3. Generation (Online): The retrieved chunks are passed as context to an LLM, which generates a grounded, accurate answer.
RAG reduces hallucinations, enables the LLM to use up-to-date or proprietary data, and avoids the cost of fine-tuning the entire model.
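The three-step workflow above can be sketched end to end in a few lines. This is a deliberately simplified toy: the `embed` function is a stand-in bag-of-words counter rather than a real embedding model, and the final prompt is only printed, not sent to an LLM:

```python
import math
import re
from collections import Counter

def embed(text):
    """Stand-in embedder: word counts. A real pipeline calls an embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def similarity(a, b):
    """Cosine similarity over sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# 1. Ingestion (offline): chunk, embed, and index the corpus.
chunks = [
    "Cloud Storage is the landing zone for raw documents.",
    "Vertex AI Vector Search serves low-latency similarity queries.",
    "Dataflow runs scalable batch and streaming pipelines.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# 2. Retrieval (online): embed the query, find the most similar chunk.
query = "Where do I land raw documents?"
q_vec = embed(query)
best_chunk, _ = max(index, key=lambda pair: similarity(q_vec, pair[1]))

# 3. Generation (online): pass retrieved context plus the query to an LLM.
prompt = f"Context: {best_chunk}\n\nQuestion: {query}\nAnswer using only the context."
print(best_chunk)
```

Swapping the toy embedder for a real embedding API and the `print` for an LLM call turns this skeleton into the production pattern described in the steps below.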
How It Works on Google Cloud
Step 1 – Collecting and Storing Unstructured Data
- Store raw unstructured data in Cloud Storage (GCS buckets). This is the most common landing zone for documents, images, and media files.
- Use Document AI to extract text and structure from PDFs, scanned documents, and forms.
- Use Speech-to-Text or Video Intelligence API to convert audio/video to text transcripts.
Step 2 – Preprocessing and Chunking
- Raw text must be broken into manageable chunks (e.g., 500–1000 tokens) because embedding models have input-length limits and retrieval quality improves with appropriately sized chunks.
- Common chunking strategies: fixed-size with overlap, sentence-based, paragraph-based, or semantic chunking.
- Metadata (source document, page number, timestamp) should be preserved alongside each chunk for traceability.
- Tools: Dataflow (Apache Beam) for scalable batch/stream processing, Cloud Functions for event-driven processing, or Vertex AI Pipelines for orchestrated ML workflows.
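Fixed-size chunking with overlap is the simplest of the strategies above. A sketch that splits on characters (token-based splitting works the same way, counting tokens instead) and keeps source metadata with each chunk; the `gs://` path is a hypothetical example:

```python
def chunk_text(text, chunk_size=200, overlap=50, source=None):
    """Fixed-size chunking with overlap, preserving metadata per chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        # Keep metadata alongside every chunk for traceability at retrieval time.
        chunks.append({"text": piece, "source": source, "offset": start})
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "A" * 450  # stand-in for extracted document text
pieces = chunk_text(doc, chunk_size=200, overlap=50, source="gs://bucket/doc.pdf")
print(len(pieces))  # 3 chunks, starting at offsets 0, 150, 300
```

The 50-character overlap means the tail of each chunk reappears at the head of the next, so a sentence straddling a boundary is still retrievable in full from at least one chunk.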
Step 3 – Generating Embeddings
- Use Vertex AI Embeddings API (e.g., text-embedding-004) to convert each chunk into a vector.
- Batch embedding generation can be done via Vertex AI Batch Prediction or custom Dataflow jobs.
- For multimodal data, use Vertex AI Multimodal Embeddings to encode images and text into the same space.
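Embedding APIs typically cap how many texts fit in one request, so chunks are sent in batches. A sketch of that batching logic with the API call injected as a function — in a real pipeline `embed_fn` would wrap a Vertex AI embedding model call; here a dummy embedder stands in so the sketch runs without credentials:

```python
def embed_in_batches(texts, embed_fn, batch_size=5):
    """Call embed_fn on successive slices; embed_fn maps list[str] -> list[vector]."""
    vectors = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        vectors.extend(embed_fn(batch))
    return vectors

def dummy_embed(batch):
    """Fake embedder: (char length, word count) per text, for illustration only."""
    return [[len(t), len(t.split())] for t in batch]

chunks = [f"chunk number {i}" for i in range(12)]
vecs = embed_in_batches(chunks, dummy_embed, batch_size=5)
print(len(vecs))  # 12 vectors, produced in 3 batched calls (5 + 5 + 2)
```

Injecting the embedder also makes the batching logic easy to unit-test, and the same shape works whether the backend is an online prediction endpoint or a batch job.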
Step 4 – Storing Embeddings in a Vector Database
- Vertex AI Vector Search (formerly Matching Engine): Google's fully managed, high-scale approximate nearest neighbor (ANN) service. Best for production workloads requiring low latency at massive scale.
- AlloyDB for PostgreSQL with pgvector extension: Good when you need combined relational + vector queries.
- Cloud SQL for PostgreSQL with pgvector: Suitable for smaller-scale use cases.
- BigQuery with VECTOR_SEARCH function: Useful when embeddings are part of a larger analytical pipeline.
- Spanner: Supports KNN vector search for globally distributed applications.
- Firestore: Supports vector search for application-centric use cases.
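Whatever store you pick, the core operation is nearest-neighbor search over stored vectors. A brute-force exact KNN scan in plain Python shows the semantics that services like Vertex AI Vector Search approximate with ANN indexes at scale; the `doc-*` ids and vectors are made up:

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_search(store, query_vec, k=2):
    """Exact k-nearest-neighbor scan: O(n) per query. Fine for small corpora,
    which is why large indexes use approximate (ANN) structures instead."""
    ranked = sorted(store, key=lambda item: l2_distance(item["vec"], query_vec))
    return [item["id"] for item in ranked[:k]]

store = [
    {"id": "doc-a", "vec": [0.1, 0.9]},
    {"id": "doc-b", "vec": [0.8, 0.2]},
    {"id": "doc-c", "vec": [0.15, 0.85]},
]
print(knn_search(store, [0.2, 0.8], k=2))  # ['doc-c', 'doc-a']
```

The linear scan is exact but does not scale to millions of vectors; ANN indexes trade a small amount of recall to avoid touching every stored vector per query.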
Step 5 – Building the RAG Pipeline
- Vertex AI Search (part of Vertex AI Agent Builder): A managed, end-to-end RAG solution. You point it at your data sources (GCS, BigQuery, websites), and it handles chunking, embedding, indexing, retrieval, and grounded generation automatically.
- Vertex AI RAG Engine: Provides a managed RAG API within Vertex AI that handles corpus management, document ingestion, and retrieval.
- Custom RAG: Build your own pipeline using Vertex AI Embeddings API + Vector Search + Gemini/PaLM for generation. Use LangChain or LlamaIndex on Vertex AI for orchestration.
Step 6 – Serving and Monitoring
- Serve the RAG application via Cloud Run, GKE, or Vertex AI Endpoints.
- Monitor with Cloud Monitoring and Cloud Logging.
- Track embedding drift and retrieval quality over time.
Key Concepts for the Exam
1. Chunking Strategy Matters: Chunks that are too large lose specificity; chunks that are too small lose context. Overlapping chunks help preserve context at boundaries.
2. Embedding Model Selection: Choose the right model dimensionality and type (text vs. multimodal) based on your use case. Higher dimensions = more expressive but more storage and compute.
3. Vector Search Algorithms: Understand approximate nearest neighbor (ANN) vs. exact nearest neighbor (KNN). ANN (used by Vertex AI Vector Search) trades a small amount of accuracy for massive speed improvements at scale.
4. Distance Metrics: Common metrics include cosine similarity, dot product, and Euclidean distance. Cosine similarity is most common for text embeddings.
5. Grounding vs. Fine-tuning: RAG provides grounding (giving the model access to external data at inference time) and is preferred when data changes frequently or is domain-specific. Fine-tuning changes the model's weights and is better for adjusting style or learning new tasks.
6. Data Freshness: One major advantage of RAG is that the knowledge base can be updated independently of the model. New documents can be ingested and embedded without retraining.
7. Security and Access Control: Use IAM and VPC Service Controls to protect vector databases and embedding endpoints. Ensure sensitive data in chunks inherits the same access controls as the source documents.
8. Cost Optimization: Vertex AI Vector Search pricing is based on the number of nodes deployed. Right-size your index shards. Use BigQuery VECTOR_SEARCH for analytical workloads where real-time latency is not critical.
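Concept 4 above is worth seeing numerically: cosine similarity ignores vector magnitude while dot product does not, so the two metrics can rank the same candidates differently. A small sketch with made-up 2-dimensional vectors:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

q = [1.0, 1.0]
same_dir = [1.0, 1.0]   # same direction as q, small magnitude
long_off = [3.0, 1.0]   # different direction, large magnitude

print(cosine(q, same_dir), cosine(q, long_off))  # 1.0 vs ~0.894
print(dot(q, same_dir), dot(q, long_off))        # 2.0 vs 4.0 — ranking flips
```

Cosine ranks `same_dir` first; dot product ranks `long_off` first. For unit-normalized embeddings the two metrics agree, which is why many text embedding models emit normalized vectors.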
Exam Tips: Answering Questions on Unstructured Data for Embeddings and RAG
1. Look for keywords: Questions mentioning "grounding," "reducing hallucinations," "proprietary documents," "semantic search," or "knowledge base" almost always point to a RAG architecture.
2. Managed vs. Custom: If the question emphasizes minimal operational overhead or a quick solution, prefer Vertex AI Search or Vertex AI RAG Engine. If it emphasizes full control or custom logic, think custom pipeline with Vector Search + Embeddings API.
3. Choose the right vector store: Match the vector store to the workload profile:
- High-scale, low-latency production → Vertex AI Vector Search
- Combined relational + vector queries → AlloyDB with pgvector
- Analytics-heavy, batch retrieval → BigQuery VECTOR_SEARCH
- Globally distributed → Spanner
4. Document AI is the answer for PDFs and scanned docs: When a question involves extracting text from unstructured documents (invoices, contracts, forms), Document AI is the go-to preprocessing step before chunking and embedding.
5. Remember the full pipeline: Ingest → Preprocess/Chunk → Embed → Index/Store → Retrieve → Generate. Questions may test your understanding of any individual step or the overall architecture.
6. RAG vs. Fine-tuning: If the question describes frequently changing data or the need for citation/attribution, RAG is the answer. If the question describes adapting the model's behavior or style, fine-tuning is more appropriate.
7. Watch for data preparation nuances: Questions may test whether you know that embeddings must be regenerated when the embedding model changes, or that chunk overlap is a best practice, or that metadata filtering can improve retrieval precision.
8. Multimodal scenarios: If the question involves searching across both images and text, think multimodal embeddings from Vertex AI, which encode different modalities into a shared vector space.
9. Scalability signals: Large corpus sizes (millions of documents) strongly point toward Vertex AI Vector Search with its ANN indexing. Smaller corpora may work fine with pgvector in AlloyDB or Cloud SQL.
10. Elimination strategy: On the exam, eliminate answers that suggest storing embeddings in Cloud Storage (it is not a vector database), or that suggest feeding entire documents into an LLM without retrieval (context window limits and cost), or that suggest retraining the LLM instead of using RAG for dynamic knowledge bases.