Overview

In this section, we will go over the metrics that Artie Transfer emits and the future roadmap.

Today

Artie Transfer currently supports only metrics, which it sends to Datadog. We are committed to being vendor neutral, but not at the cost of stability and reliability. As such, we will adopt OpenTelemetry once the library is stable.

We also plan to support application tracing so that Transfer can plug directly into your APM provider.

Metrics

You can specify additional tags and a namespace in the configuration file; they apply to every metric that Transfer emits. See Options for more details.
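To illustrate how a namespace and global tags might combine with a metric's own name and tags, here is a minimal sketch. It assumes the `*` in the metric names below stands for the configured namespace; the function `qualify_metric` and the sample values (`transfer.`, `env:production`, `topic:orders`) are hypothetical and are not taken from Transfer's code or configuration.

```python
def qualify_metric(name, tags, namespace="", global_tags=()):
    """Build the full metric name and tag list (hypothetical helper).

    Assumption: the namespace is a plain string prefix and the global
    tags are added to every metric's own tags.
    """
    full_name = f"{namespace}{name}"
    # Global tags first, then the tags specific to this metric.
    all_tags = list(global_tags) + list(tags)
    return full_name, all_tags


full_name, all_tags = qualify_metric(
    "row.lag",
    ["topic:orders", "partition:0"],
    namespace="transfer.",            # hypothetical namespace
    global_tags=["env:production"],   # hypothetical global tag
)
print(full_name, all_tags)
```

With these sample inputs, a dashboard query would look for `transfer.row.lag` and could filter on either the global tag or the per-metric tags.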

Name	Description	Unit	Tags
`*.row.lag`	Difference between the Kafka partition's high watermark and the current message offset.	-	`groupid` `topic` `partition` `table` `mode`
`*.ingestion.lag.95percentile`	p95 of the time lag between when a Kafka message was published and when it was received. Available since Transfer `1.4.6`.	`ms`	`groupid` `topic` `partition` `table` `mode`
`*.ingestion.lag.avg`	Average of the time lag between when a Kafka message was published and when it was received. Available since Transfer `1.4.6`. If you are self-hosting Transfer, this is a good metric to set a monitor on.	`ms`	`groupid` `topic` `partition` `table` `mode`
`*.ingestion.lag.max`	Maximum time lag between when a Kafka message was published and when it was received. Available since Transfer `1.4.6`.	`ms`	`groupid` `topic` `partition` `table` `mode`
`*.process.message.count`	Number of rows Transfer has processed.	Count	`database` `schema` `table` `groupid` `op` `skipped` `mode`
`*.process.message.95percentile`	p95 of how long it takes to process each row.	`ms`	`database` `schema` `table` `groupid` `op` `skipped` `mode`
`*.process.message.avg`	Average of how long it takes to process each row.	`ms`	`database` `schema` `table` `groupid` `op` `skipped` `mode`
`*.process.message.max`	Maximum of how long it takes to process each row.	`ms`	`database` `schema` `table` `groupid` `op` `skipped` `mode`
`*.process.message.median`	Median of how long it takes to process each row.	`ms`	`database` `schema` `table` `groupid` `skipped` `mode`
`*.flush.count`	Number of flush operations performed.	Count	`database` `schema` `table` `what` `reason`
`*.flush.95percentile`	p95 of how long each flush takes.	`ms`	`database` `schema` `table` `what` `reason`
`*.flush.avg`	Average of how long each flush takes.	`ms`	`database` `schema` `table` `what` `reason`
`*.flush.max`	Maximum of how long each flush takes.	`ms`	`database` `schema` `table` `what` `reason`
`*.flush.median`	Median of how long each flush takes.	`ms`	`database` `schema` `table` `what` `reason`
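The two lag metrics measure different things: `*.row.lag` is a count of offsets, while `*.ingestion.lag.*` is a duration in milliseconds. The sketch below shows the arithmetic behind each description. The function names and sample values are hypothetical, and the exact offset convention Transfer uses (for example, whether the high watermark points at the next offset to be written) is an assumption not confirmed here.

```python
from datetime import datetime, timezone


def row_lag(high_watermark: int, current_offset: int) -> int:
    """Offsets between the partition's high watermark and the current message."""
    return high_watermark - current_offset


def ingestion_lag_ms(published_at: datetime, received_at: datetime) -> float:
    """Milliseconds between a message being published and being received."""
    return (received_at - published_at).total_seconds() * 1000.0


# Hypothetical sample values for a single partition and message.
lag_rows = row_lag(high_watermark=1200, current_offset=1150)

published = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
received = datetime(2024, 1, 1, 12, 0, 1, 500000, tzinfo=timezone.utc)
lag_ms = ingestion_lag_ms(published, received)

print(lag_rows, lag_ms)
```

A growing row lag with a flat ingestion lag would suggest a backlog of recently published messages, whereas both growing together would suggest the consumer is falling steadily behind.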

The what tag explained

The `what` tag shows whether a flush attempt succeeded. If it did not, the tag also identifies which particular operation failed, rather than reporting a generic error state.

Transfer emits `what:success` if the attempt succeeded, and a different value depending on the error state if it failed. This makes monitors and failure responses more actionable, because we can jump straight to the offending code block.
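As a rough illustration of how the `what` tag could drive a monitor, the sketch below groups flush events by their `what` value and computes a failure rate. The event shape, the function `summarize_flushes`, and the failure value `merge_fail` are all hypothetical; only `success` comes from the description above.

```python
from collections import Counter


def summarize_flushes(events):
    """Return (failure_rate, failures_by_what) for a list of flush events.

    Each event is assumed to be a dict with a "what" key (hypothetical shape).
    """
    counts = Counter(event["what"] for event in events)
    total = sum(counts.values())
    # Anything other than "success" is treated as a failure category.
    failures = {what: n for what, n in counts.items() if what != "success"}
    failure_rate = sum(failures.values()) / total if total else 0.0
    return failure_rate, failures


events = [
    {"table": "orders", "what": "success"},
    {"table": "orders", "what": "success"},
    {"table": "orders", "what": "merge_fail"},  # hypothetical failure value
    {"table": "users", "what": "success"},
]

failure_rate, failures = summarize_flushes(events)
print(failure_rate, failures)
```

Because failures are keyed by their `what` value, an alert can name the failing operation directly instead of reporting only that a flush failed.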
