Creating data sources and indexers is fundamental to implementing knowledge mining solutions in Azure Cognitive Search. A data source defines the connection to your content repository, while an indexer automates the process of extracting and indexing that content.
**Data Sources**
A data source in Azure Cognitive Search represents a connection to external data that you want to index. Supported data sources include Azure Blob Storage, Azure SQL Database, Azure Cosmos DB, Azure Table Storage, and Azure Data Lake Storage Gen2. When creating a data source, you must specify the connection string, container or table name, and credentials for authentication. You can use managed identities for secure, passwordless connections to Azure resources.
To create a data source, you can use the Azure portal, REST API, or Azure SDKs. The configuration includes specifying the data source type, name, connection details, and optionally a query to filter which data to extract.
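As a sketch of what that configuration looks like, the JSON payload below mirrors the documented REST shape for a blob data source. The names (`hotels-ds`, `hotel-docs`) and the connection string are placeholders, not values from any real deployment:

```python
import json

# Hypothetical data source definition for the Create Data Source REST call.
data_source = {
    "name": "hotels-ds",             # data source name (placeholder)
    "type": "azureblob",             # other types: azuresql, cosmosdb, azuretable
    "credentials": {
        "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account>;..."
    },
    "container": {
        "name": "hotel-docs",        # blob container to crawl (placeholder)
        "query": "reviews/",         # optional folder/prefix filter
    },
}

print(json.dumps(data_source, indent=2))
```

With a managed identity, the `connectionString` would instead reference the resource ID of the storage account rather than embedding account keys.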
**Indexers**
An indexer automates the data extraction process by connecting to your data source, reading content, serializing it into JSON documents, and populating your search index. Indexers can run on-demand or on a scheduled basis for incremental updates.
Key indexer configurations include field mappings that define how source fields map to index fields, output field mappings for skillset outputs, and change detection policies for efficient updates. Indexers support parameters like batch size and maximum items per execution.
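A minimal indexer definition combining these pieces might look like the following; the resource names are placeholders, and the schedule interval uses the ISO 8601 duration format the service expects:

```python
import json

# Hypothetical indexer definition showing field mappings, a schedule,
# and batching parameters.
indexer = {
    "name": "hotels-indexer",
    "dataSourceName": "hotels-ds",
    "targetIndexName": "hotels-index",
    "schedule": {"interval": "PT2H"},   # run every 2 hours (minimum is PT5M)
    "fieldMappings": [
        {   # map a blob metadata field onto a differently named index field
            "sourceFieldName": "metadata_storage_name",
            "targetFieldName": "fileName",
        }
    ],
    "parameters": {
        "batchSize": 100,        # items processed per batch
        "maxFailedItems": 10,    # total failures tolerated before stopping
    },
}

print(json.dumps(indexer, indent=2))
```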
**Skillsets Integration**
Indexers can optionally include skillsets that apply AI enrichment during indexing. This enables cognitive processing such as entity recognition, key phrase extraction, image analysis, and custom skills.
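When a skillset is attached, the indexer names it and uses output field mappings to route enriched values (addressed by paths in the enrichment tree) into index fields. The skillset and field names below are illustrative placeholders:

```python
# Hypothetical indexer with an attached skillset. Skill outputs live under
# the /document/ enrichment tree and are mapped into index fields.
indexer_with_skillset = {
    "name": "hotels-indexer",
    "dataSourceName": "hotels-ds",
    "targetIndexName": "hotels-index",
    "skillsetName": "hotels-skillset",          # skillset to apply (placeholder)
    "outputFieldMappings": [
        {
            "sourceFieldName": "/document/keyPhrases",  # skill output path
            "targetFieldName": "keyPhrases",            # index field
        }
    ],
}
```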
**Best Practices**
Implement change tracking to enable incremental indexing. Configure appropriate schedules based on data freshness requirements. Monitor indexer status and handle failures through the Azure portal or programmatic APIs. Use field mappings to transform data during ingestion.
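Change tracking is configured on the data source rather than the indexer. As a sketch, a high water mark policy on an Azure SQL source tells the indexer to re-read only rows whose tracked column has advanced since the last run; the table and column names here are placeholders:

```python
# Hypothetical SQL data source with a high water mark change detection policy.
data_source_with_tracking = {
    "name": "sql-ds",
    "type": "azuresql",
    "credentials": {"connectionString": "<sql-connection-string>"},
    "container": {"name": "Hotels"},            # table or view (placeholder)
    "dataChangeDetectionPolicy": {
        "@odata.type": "#Microsoft.Azure.Search.HighWaterMarkChangeDetectionPolicy",
        "highWaterMarkColumnName": "LastModified",  # column that only increases
    },
}
```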
**Creating Data Sources and Indexers in Azure Cognitive Search**
**Why It Is Important**
Creating data sources and indexers is fundamental to implementing knowledge mining solutions in Azure. These components form the backbone of the Azure Cognitive Search ingestion pipeline, enabling organizations to automatically extract, transform, and index data from various repositories. Understanding these concepts is essential for the AI-102 exam and for building real-world search solutions that can process large volumes of unstructured data.
**What Are Data Sources and Indexers?**
Data Sources are connection definitions that specify where your content resides. They contain the connection strings and credentials needed to access external data repositories such as:

- Azure Blob Storage
- Azure SQL Database
- Azure Cosmos DB
- Azure Table Storage
- SharePoint Online
Indexers are automated crawlers that read data from configured data sources, extract content and metadata, serialize documents, and pass them to the search engine for indexing. They act as the bridge between your raw data and the searchable index.
**How It Works**
The process follows these steps:
1. Create a Data Source: Define the connection to your data repository using the Azure portal, REST API, or SDK. You specify the type, connection string, and container or table name.
2. Configure the Indexer: Create an indexer that references the data source and target index. You can configure:
   - Field mappings to match source fields to index fields
   - A schedule for automatic runs (hourly, daily, custom)
   - Change detection policies for incremental indexing
   - Parameters for parsing specific formats (JSON, CSV, PDF)
3. Attach a Skillset (Optional): For AI enrichment, connect a skillset to extract additional insights through cognitive skills.
4. Run the Indexer: Execute manually or let the schedule trigger automatic runs. The indexer tracks which documents have been processed using high water mark or soft delete detection.
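Concretely, the steps above map onto PUT and POST calls against the service's REST endpoints. The sketch below only builds the URLs and headers involved; the service name, API key, and resource names are placeholders, and no request is sent:

```python
# Hypothetical REST endpoints for creating resources and running an indexer.
SERVICE = "my-search-service"        # placeholder service name
API_VERSION = "2023-11-01"           # a GA REST API version

def endpoint(resource: str, name: str) -> str:
    """PUT to this URL creates or updates the named resource."""
    return (f"https://{SERVICE}.search.windows.net/"
            f"{resource}/{name}?api-version={API_VERSION}")

headers = {"Content-Type": "application/json", "api-key": "<admin-api-key>"}

datasource_url = endpoint("datasources", "hotels-ds")   # step 1
indexer_url = endpoint("indexers", "hotels-indexer")    # step 2
# Step 4, on-demand run: POST (empty body) to the indexer's /run endpoint.
run_url = (f"https://{SERVICE}.search.windows.net/"
           f"indexers/hotels-indexer/run?api-version={API_VERSION}")
```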
**Key Configuration Options**
- `maxFailedItems`: Number of failures allowed before indexing stops
- `maxFailedItemsPerBatch`: Failures allowed per batch
- `batchSize`: Number of items processed per batch
- `parsingMode`: Options include `default`, `json`, `jsonArray`, `jsonLines`, and `delimitedText`
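These options sit inside the indexer's `parameters` object, with `parsingMode` nested one level deeper under `configuration`. An illustrative block for an indexer reading CSV blobs (the values are examples, not recommendations):

```python
# Hypothetical parameters block for an indexer over delimited-text blobs.
parameters = {
    "batchSize": 50,
    "maxFailedItems": 5,          # stop after 5 total failures (-1 = unlimited)
    "maxFailedItemsPerBatch": 2,  # abandon a batch after 2 failures
    "configuration": {
        "parsingMode": "delimitedText",     # CSV; also: default, json, jsonArray, jsonLines
        "firstLineContainsHeaders": True,   # treat row 1 as column names
    },
}
```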
**Exam Tips: Answering Questions on Creating Data Sources and Indexers**
1. Know the supported data sources: Be familiar with which Azure services can serve as data sources and their specific configuration requirements.
2. Understand field mappings: Questions often test your knowledge of mapping source fields to index fields, especially when names differ or transformations are needed.
3. Remember indexer schedules: Know that the minimum interval is 5 minutes and that you can run indexers on-demand via the portal or API.
4. Change tracking policies: Understand the difference between high water mark (for new/updated content) and soft delete policies (for removed content).
5. Parsing modes matter: When questions involve specific file formats like JSON arrays or CSV files, select the appropriate parsing mode configuration.
6. Connection string security: Remember that managed identities are the recommended approach for securing connections to Azure resources.
7. Error handling: Know how maxFailedItems and maxFailedItemsPerBatch parameters control indexer behavior during failures.
8. Incremental enrichment: Understand that enabling caching on indexers allows reuse of enrichment outputs, reducing processing costs.
9. Practice with REST API syntax: Be comfortable reading and understanding the JSON structure for creating data sources and indexers via REST API calls.
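As one example of tip 8, incremental enrichment is enabled through a `cache` property on the indexer. The storage connection string is a placeholder; with caching enabled, unchanged documents reuse previously computed skill outputs instead of re-running the skillset:

```python
# Hypothetical indexer with the enrichment cache enabled (incremental enrichment).
indexer_with_cache = {
    "name": "hotels-indexer",
    "dataSourceName": "hotels-ds",
    "targetIndexName": "hotels-index",
    "skillsetName": "hotels-skillset",
    "cache": {
        "storageConnectionString": "<storage-connection-string>",  # cache location
        "enableReprocessing": True,  # re-enrich affected docs when the skillset changes
    },
}
```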