Understanding the YAML file

The low-code framework involves editing a boilerplate YAML file. This section is a deep dive into the components of that YAML file.

Stream

Streams define the schema of the data to sync, as well as how to read it from the underlying API source. A stream generally corresponds to a resource within the API. They are analogous to tables for a relational database source.

By default, the schema of a stream's data is defined as a JSONSchema file in <source_connector_name>/schemas/<stream_name>.json.

Alternatively, the stream's data schema can be stored inline in the YAML file by including the optional schema_loader key. If the data schema is provided inline, any schema on disk for that stream will be ignored.
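For illustration, a stream carrying its schema inline might look like the following sketch; the stream name and record fields are hypothetical placeholders:

```yaml
# Hypothetical stream using an InlineSchemaLoader; names and fields are placeholders.
my_stream:
  type: DeclarativeStream
  retriever:
    "$ref": "#/definitions/retriever"
  schema_loader:
    type: InlineSchemaLoader
    schema:
      "$schema": "http://json-schema.org/draft-07/schema#"
      type: object
      properties:
        id:
          type: string
        created_at:
          type: string
```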

More information on how to define a stream's schema can be found here

The stream object is represented in the YAML file as:

  DeclarativeStream:
    description: A stream whose behavior is described by a set of declarative low code components
    type: object
    additionalProperties: true
    required:
      - type
      - retriever
    properties:
      type:
        type: string
        enum: [DeclarativeStream]
      retriever:
        "$ref": "#/definitions/Retriever"
      schema_loader:
        description: The schema loader used to retrieve the schema for the current stream
        anyOf:
          - "$ref": "#/definitions/InlineSchemaLoader"
          - "$ref": "#/definitions/JsonFileSchemaLoader"
      stream_cursor_field:
        description: The field of the records being read that will be used during checkpointing
        anyOf:
          - type: string
          - type: array
            items:
              - type: string
      transformations:
        description: A list of transformations to be applied to each output record in the stream
        type: array
        items:
          anyOf:
            - "$ref": "#/definitions/AddFields"
            - "$ref": "#/definitions/CustomTransformation"
            - "$ref": "#/definitions/RemoveFields"
      $parameters:
        type: object
        additionalProperties: true

More details on streams and sources can be found in the basic concepts section.

Configuring a stream for incremental syncs

If you want to allow your stream to be configured so that only data that has changed since the prior sync is replicated to a destination, you can specify a DatetimeBasedCursor on your stream's incremental_sync field.

Given a start time, an end time, and a step function, it will partition the interval [start, end] into small windows of the size described by the step.

More information on incremental_sync configurations and the DatetimeBasedCursor component can be found in the incremental syncs section.
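As a sketch, an incremental_sync block using DatetimeBasedCursor with one-day windows might look like the following; the cursor field name, datetime format, and interpolated config values are assumptions for illustration:

```yaml
incremental_sync:
  type: DatetimeBasedCursor
  cursor_field: updated_at                      # record field used for checkpointing (assumed name)
  datetime_format: "%Y-%m-%dT%H:%M:%S"
  start_datetime: "{{ config['start_date'] }}"  # start of the interval, taken from the connector config
  end_datetime: "{{ now_utc().strftime('%Y-%m-%dT%H:%M:%S') }}"
  step: P1D                                     # partition the [start, end] interval into 1-day windows
  cursor_granularity: PT1S
```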

Data retriever

The data retriever defines how to read the data for a Stream and acts as an orchestrator for the data retrieval flow.

It is described by:

  1. Requester: Describes how to submit requests to the API source
  2. Paginator: Describes how to navigate through the API's pages
  3. Record selector: Describes how to extract records from an HTTP response
  4. Partition router: Describes how to retrieve data across multiple resource locations

Each of these components (and their subcomponents) is defined by an explicit interface and one or many implementations. The developer can choose and configure the implementation they need depending on the specifics of the integration they are building against.

Since the Retriever is defined as part of the Stream configuration, different Streams for a given Source can use different Retriever definitions if needed.

The schema of a retriever object is:

  retriever:
    description: Retrieves records by synchronously sending requests to fetch records. The retriever acts as an orchestrator between the requester, the record selector, the paginator, and the partition router.
    type: object
    required:
      - requester
      - record_selector
    properties:
      "$parameters":
        "$ref": "#/definitions/$parameters"
      requester:
        "$ref": "#/definitions/Requester"
      record_selector:
        "$ref": "#/definitions/HttpSelector"
      paginator:
        "$ref": "#/definitions/Paginator"
      stream_slicer:
        "$ref": "#/definitions/StreamSlicer"
      primary_key:
        type: string
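Putting the pieces together, a concrete retriever instance could be sketched as follows; the base URL, path, field path, and pagination settings are placeholders rather than a real API:

```yaml
retriever:
  type: SimpleRetriever
  requester:
    type: HttpRequester
    url_base: "https://api.example.com"   # placeholder API host
    path: "/v1/items"                     # placeholder resource path
    http_method: GET
  record_selector:
    type: RecordSelector
    extractor:
      type: DpathExtractor
      field_path: ["items"]               # records assumed to live under the "items" key
  paginator:
    type: DefaultPaginator
    pagination_strategy:
      type: PageIncrement
      page_size: 100
    page_token_option:
      type: RequestOption
      inject_into: request_parameter      # send the page number as a query parameter
      field_name: page
```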

Routing to Data that is Partitioned in Multiple Locations

Some sources might require specifying additional parameters that are needed to retrieve data. Using the PartitionRouter component, you can specify a static or dynamic set of elements which will be iterated upon and made available for use when a connector dispatches requests to get data from a source.

More information on how to configure the partition_router field on a Retriever to retrieve data from multiple locations can be found in the iteration section.
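As a sketch, a static set of partitions could be declared with a ListPartitionRouter like the one below; the partition values and request parameter name are illustrative assumptions:

```yaml
partition_router:
  type: ListPartitionRouter
  values: ["A", "B", "C"]        # static set of partitions to iterate over
  cursor_field: section          # name under which the current partition value is exposed (assumed)
  request_option:
    type: RequestOption
    inject_into: request_parameter
    field_name: section          # sent as ?section=<value> on each request
```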

Combining Incremental Syncs and Iterable Locations

A stream can be configured to support incrementally syncing data that is spread across multiple partitions by defining incremental_sync on the Stream and partition_router on the Retriever.

During a sync where both are configured, the Cartesian product of these parameters will be calculated and the connector will repeat requests to the source using the different combinations of parameters to get all of the data.

For example, suppose we had a DatetimeBasedCursor requesting data over a 3-day range partitioned by day, and a ListPartitionRouter with the locations A, B, and C. This would result in the following combinations being used to request data:

Partition   Date Range
A           2022-01-01T00:00:00 - 2022-01-01T23:59:59
B           2022-01-01T00:00:00 - 2022-01-01T23:59:59
C           2022-01-01T00:00:00 - 2022-01-01T23:59:59
A           2022-01-02T00:00:00 - 2022-01-02T23:59:59
B           2022-01-02T00:00:00 - 2022-01-02T23:59:59
C           2022-01-02T00:00:00 - 2022-01-02T23:59:59
A           2022-01-03T00:00:00 - 2022-01-03T23:59:59
B           2022-01-03T00:00:00 - 2022-01-03T23:59:59
C           2022-01-03T00:00:00 - 2022-01-03T23:59:59
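A configuration producing combinations like those could be sketched as follows, with incremental_sync on the stream and partition_router on the retriever; all field values are illustrative:

```yaml
my_stream:
  type: DeclarativeStream
  incremental_sync:
    type: DatetimeBasedCursor
    cursor_field: updated_at              # assumed cursor field
    datetime_format: "%Y-%m-%dT%H:%M:%S"
    start_datetime: "2022-01-01T00:00:00"
    end_datetime: "2022-01-03T23:59:59"
    step: P1D                             # one-day windows
    cursor_granularity: PT1S
  retriever:
    type: SimpleRetriever
    partition_router:
      type: ListPartitionRouter
      values: ["A", "B", "C"]             # iterated against each date window
      cursor_field: section
```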

More readings