
[BUG] Extra samples added when Cloudwatch metric stops #1486

Closed
qcattez opened this issue Jul 30, 2024 · 6 comments

Labels
bug Something isn't working

Comments

qcattez commented Jul 30, 2024

Is there an existing issue for this?

  • I have searched the existing issues

YACE version

v0.61.2

Config file

apiVersion: v1alpha1
sts-region: us-west-2
discovery:
  exportedTagsOnMetrics:
    AWS/Firehose:
      - Environment
      - Platform
  jobs:
    - type: AWS/Firehose
      regions:
        - us-west-2
      addCloudwatchTimestamp: true
      delay: 600
      period: 300
      length: 1200
      nilToZero: true
      statistics:
        - Sum
      metrics:
        - name: DeliveryToRedshift.Records
        - name: DeliveryToRedshift.Bytes
        - name: DeliveryToRedshift.Success
        - name: DeliveryToS3.Records
        - name: DeliveryToS3.Bytes
        - name: DeliveryToS3.Success

Current Behavior

When configured with a period of 5m and a matching scraping-interval, the last sample of a metric is duplicated by YACE.
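For reference, these are the two settings being matched here, excerpted from the values.yaml under "Steps To Reproduce" below (both in seconds):

# exporter flag: how often YACE runs its CloudWatch scrape loop
extraArgs:
  scraping-interval: 300

# discovery job setting: the CloudWatch period requested per data point
period: 300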

Here is the CloudWatch graph:

[Screenshot: CloudWatch graph, 2024-07-29 14:12:54]

And here is the Grafana graph:

[Screenshot: Grafana graph, 2024-07-29 14:13:16]

Querying the metric with the AWS CLI shows that no extra samples are present (the duplicated value is 143170.0 on the Grafana graph):

{
    "MetricDataResults": [
        {
            "Id": "m1",
            "Label": "DeliveryToRedshift.Records",
            "Timestamps": [
                "2024-07-29T11:15:00+00:00",
                "2024-07-29T11:10:00+00:00",
                "2024-07-29T11:05:00+00:00",
                "2024-07-29T11:00:00+00:00",
                "2024-07-29T10:55:00+00:00",
                "2024-07-29T10:50:00+00:00",
                "2024-07-29T10:45:00+00:00",
                "2024-07-29T10:40:00+00:00",
                "2024-07-29T10:35:00+00:00",
                "2024-07-29T10:30:00+00:00",
                "2024-07-29T10:25:00+00:00",
                "2024-07-29T10:20:00+00:00",
                "2024-07-29T10:15:00+00:00"
            ],
            "Values": [
                143170.0,
                4615248.0,
                2366956.0,
                3730344.0,
                3370700.0,
                3448014.0,
                3160632.0,
                2367084.0,
                3018033.0,
                2441417.0,
                2579474.0,
                1866146.0,
                2005916.0
            ],
            "StatusCode": "Complete"
        }
    ],
    "Messages": []
}
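
For reference, that output comes from a get-metric-data call along these lines. This is only a sketch: the DeliveryStreamName value and the exact start/end times are assumptions, since they are not shown in the issue (input given in --cli-input-yaml form):

# query.yaml -- used as: aws cloudwatch get-metric-data --cli-input-yaml file://query.yaml
MetricDataQueries:
  - Id: m1
    MetricStat:
      Metric:
        Namespace: AWS/Firehose
        MetricName: DeliveryToRedshift.Records
        Dimensions:
          - Name: DeliveryStreamName
            Value: my-delivery-stream   # placeholder
      Period: 300
      Stat: Sum
StartTime: "2024-07-29T10:15:00Z"
EndTime: "2024-07-29T11:20:00Z"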

Note: 1 extra sample is added with a period of 5m. When the period is 1m, 4 extra samples are generated.

Expected Behavior

No extra samples are generated.

Steps To Reproduce

We deploy YACE with kustomize:

kustomization.yaml:

---
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: cloudwatch-exporter

helmCharts:
  - name: yet-another-cloudwatch-exporter
    repo: https://nerdswords.github.io/helm-charts
    version: 0.38.0
    releaseName: cloudwatch-exporter
    namespace: cloudwatch-exporter
    valuesFile: values.yaml
    apiVersions:
      - monitoring.coreos.com/v1

values.yaml:

fullnameOverride: cloudwatch-exporter

serviceAccount:
  create: false
  name: cloudwatch-exporter

securityContext:
  capabilities:
    drop:
      - ALL
  readOnlyRootFilesystem: true
  runAsNonRoot: true
  runAsUser: 1000

testConnection: false

resources:
  requests:
    cpu: 10m
    memory: 128Mi
  limits:
    memory: 128Mi

extraArgs:
  scraping-interval: 300

serviceMonitor:
  enabled: true
  interval: 300s

config: |-
  apiVersion: v1alpha1
  sts-region: us-west-2
  discovery:
    exportedTagsOnMetrics:
      AWS/Firehose:
        - Environment
        - Platform
    jobs:
      - type: AWS/Firehose
        regions:
          - us-west-2
        addCloudwatchTimestamp: true
        delay: 600
        period: 300
        length: 1200
        nilToZero: true
        statistics:
          - Sum
        metrics:
          - name: DeliveryToRedshift.Records
          - name: DeliveryToRedshift.Bytes
          - name: DeliveryToRedshift.Success
          - name: DeliveryToS3.Records
          - name: DeliveryToS3.Bytes
          - name: DeliveryToS3.Success

To generate the manifests: kustomize build . --enable-helm

Don't mind the serviceAccount.create: false; we create the service account elsewhere, with the IRSA annotation.

Anything else?

Hope someone can help 🙏

qcattez added the bug label on Jul 30, 2024
kgeckhart (Contributor) commented

👋 @qcattez, the behavior you are seeing is due to nilToZero: true in your configuration. Removing it or explicitly setting it to false (false is the default) should fix your issue.

What's happening is that CloudWatch continues to return resources from ListMetrics for a variable length of time even after they stop producing data points. When the exporter calls GetMetricData and no data point is found in the time window, nilToZero dictates whether a zero or a nil is produced. Producing zero gives you continuity in your graph panels, while nil leaves gaps, much like CloudWatch does.
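
As a minimal sketch of that suggestion, only the keys relevant to the change are shown here; everything else stays as in the config posted in this issue:

discovery:
  jobs:
    - type: AWS/Firehose
      regions:
        - us-west-2
      period: 300
      length: 1200
      statistics:
        - Sum
      # leave a gap (nil) when CloudWatch returns no data point for the window,
      # instead of exporting an extra 0-valued sample
      nilToZero: false
      metrics:
        - name: DeliveryToRedshift.Records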


qcattez commented Aug 12, 2024

Hi @kgeckhart 👋

Thanks for helping out ;)

I tried setting nilToZero: false, but it doesn't change the behavior I mentioned: there is still an extra sample with the same value (not zero).
Furthermore, as the screenshots and graphs show, even with nilToZero: true I didn't get continuity in my graph :/

Do you have another idea ?


qcattez commented Aug 30, 2024

Just commenting to get some help on this 🙏

kgeckhart (Contributor) commented

You say you want continuity in your graphs; for that to happen you need consistent sampling, or Prometheus will mark the value as stale. Nils can be connected through panel settings if you are using Grafana. Continuity cannot be achieved if the sample is dropped, so which is more important to you?
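
If you are using a Grafana time series panel, the option in question is "Connect null values"; as a sketch, in the dashboard JSON it lives under the following keys (shown in YAML form here for readability; the surrounding panel definition is omitted):

fieldConfig:
  defaults:
    custom:
      spanNulls: true   # connect points across nil/missing samples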


qcattez commented Sep 2, 2024

I think there's a misunderstanding: my problem is about a wrong extra sample, not about continuity.
I annotated the screenshots to make this more explicit.

Here are the samples from CloudWatch:

[Annotated screenshot: CloudWatch samples]

And here are the samples retrieved by YACE and presented in Grafana:

[Annotated screenshot: samples retrieved by YACE, as graphed in Grafana]

We can see that the data is wrong: there is a sample that shouldn't exist.
nilToZero has no impact on this behavior.

Can someone reproduce it ?
Is it expected looking at my configuration ?


qcattez commented Sep 13, 2024

For anyone stumbling upon the same issue: I finally got my CloudWatch metrics exported without errors.

Updated fields in my values.yaml:

extraArgs:
  scraping-interval: 60

serviceMonitor:
  enabled: true
  interval: 300s

config: |-
  apiVersion: v1alpha1
  sts-region: us-west-2
  discovery:
    exportedTagsOnMetrics:
      AWS/Firehose:
        - Environment
        - Platform
    jobs:
      - type: AWS/Firehose
        regions:
          - us-west-2
        delay: 600
        period: 300
        length: 300
        nilToZero: true
        statistics:
          - Sum
        metrics:
          - name: DeliveryToRedshift.Records
          - name: DeliveryToRedshift.Bytes
          - name: DeliveryToRedshift.Success
          - name: DeliveryToS3.Records
          - name: DeliveryToS3.Bytes
          - name: DeliveryToS3.Success

  • A scraping-interval of 60 instead of 300 removed the wrong extra samples at the end.
  • A length of 300 was enough.
  • addCloudwatchTimestamp: true was introducing wrong samples in the middle of the time series, so I removed it.
  • nilToZero: true now performs as expected.
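
For quick comparison, these are the only fields that changed relative to the original setup (values taken from the two configs posted in this issue):

# values.yaml, exporter flags
extraArgs:
  scraping-interval: 60            # was 300

# discovery job, inside the exporter config
length: 300                        # was 1200
# addCloudwatchTimestamp: true     # removed; it was adding wrong samples mid-series
nilToZero: true                    # kept; now behaves as expected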

qcattez closed this as completed on Sep 13, 2024