
k8s Container Attributes not Included when Exporting to Stackdriver in GKE #796

Open
seawolf42 opened this issue Oct 1, 2019 · 10 comments · May be fixed by #830
seawolf42 commented Oct 1, 2019

Describe your environment.

  • Google Cloud Platform
  • GKE cluster
  • Docker container
  • Python 3.7.4

Python dependencies (trimmed to just relevant):

google-api-core[grpc]==1.14.2 ; platform_python_implementation != 'PyPy'
google-api-python-client==1.7.11
google-auth-httplib2==0.0.3
google-auth==1.6.3
google-cloud-core==1.0.3
google-cloud-firestore==1.4.0
google-cloud-logging==1.12.1
google-cloud-monitoring==0.33.0
google-cloud-pubsub==1.0.0
google-cloud-storage==1.19.1
google-cloud-trace==0.22.1
google-resumable-media==0.4.1
googleapis-common-protos[grpc]==1.6.0
grpc-google-iam-v1==0.12.3
grpcio==1.23.0
opencensus-context==0.1.1
opencensus-ext-stackdriver==0.7.2
opencensus==0.7.3

Steps to reproduce.

Using the following code:

import logging

from opencensus.stats import stats
from opencensus.stats import aggregation
from opencensus.stats import measure
from opencensus.stats import view
from opencensus.tags import tag_key as tag_key_module
from opencensus.tags.tag_map import TagMap
from opencensus.tags.tag_value import TagValue

from opencensus.ext.stackdriver import stats_exporter

from . import config

log = logging.getLogger(__name__)

entity_count = measure.MeasureInt('entity_count', 'Count of entities', 'entities')

x_key = tag_key_module.TagKey('x')
y_key = tag_key_module.TagKey('y')

view_key = (x_key, y_key)

count_view = view.View(
    'entity_count',
    'Count of entities collected',
    view_key,
    entity_count,
    aggregation.CountAggregation(),
)

exporter = stats_exporter.new_stats_exporter(stats_exporter.Options(project_id='my_project_id'))

view_manager = stats.stats.view_manager
view_manager.register_exporter(exporter)
view_manager.register_view(count_view)

recorder = stats.stats.stats_recorder

def record_entity(x, y):
    log.debug('telemetry: %s/%s', x, y)
    tag_map = TagMap()
    tag_map.insert(x_key, TagValue(x))
    tag_map.insert(y_key, TagValue(y))
    mmap = recorder.new_measurement_map()
    mmap.measure_int_put(entity_count, 1)
    try:
        mmap.record(tag_map)
    except Exception as e:
        log.exception('error recording metric: %s', e)

... entering a Python shell locally and calling record_entity('a', 'b') works as expected, including seeing data appear in Stackdriver's UI. Running the same code in a container in GKE raises an exception (see details below).

What is the expected behavior?

I expect this to work the same in GKE as it does locally.

What is the actual behavior?

Running the same code in a container in GKE gives the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/opencensus/metrics/transport.py", line 59, in func
    return self.func(*aa, **kw)
  File "/usr/local/lib/python3.7/site-packages/opencensus/metrics/transport.py", line 113, in export_all
    export(itertools.chain(*all_gets))
  File "/usr/local/lib/python3.7/site-packages/opencensus/ext/stackdriver/stats_exporter/__init__.py", line 162, in export_metrics
    self.client.project_path(self.options.project_id), ts_batch)
  File "/usr/local/lib/python3.7/site-packages/google/cloud/monitoring_v3/gapic/metric_service_client.py", line 1024, in create_time_series
    request, retry=retry, timeout=timeout, metadata=metadata
  File "/usr/local/lib/python3.7/site-packages/google/api_core/gapic_v1/method.py", line 143, in __call__
    return wrapped_func(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/google/api_core/retry.py", line 273, in retry_wrapped_func
    on_error=on_error,
  File "/usr/local/lib/python3.7/site-packages/google/api_core/retry.py", line 182, in retry_target
    return target()
  File "/usr/local/lib/python3.7/site-packages/google/api_core/timeout.py", line 214, in func_with_timeout
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/google/api_core/grpc_helpers.py", line 59, in error_remapped_callable
    six.raise_from(exceptions.from_grpc_error(exc), exc)
  File "<string>", line 3, in raise_from
google.api_core.exceptions.InvalidArgument: 400 One or more TimeSeries could not be written: The set of resource labels is incomplete. Missing labels: (container_name namespace_name).: timeSeries[0-199]

Additional context.

This might be a regression of #647.

@seawolf42 seawolf42 added the bug label Oct 1, 2019
@seawolf42 seawolf42 changed the title Regression of 647 k8s Container Attributes not Included when Exporting to Stackdriver in GKE Oct 1, 2019

ymaki commented Oct 7, 2019

I ran into this issue as well.
It would be nice if the library could resolve these parameters automatically.

I'm not sure how to fix it properly, but I can share a workaround.

If we add two environment variables, NAMESPACE and CONTAINER_NAME, the resource labels are no longer missing from the gRPC request. The environment variables can be added to the pod spec like this:

(snip)
          env:
            - name: NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: CONTAINER_NAME
              value: "INSERT_CONTAINER_NAME_HERE"
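To make the workaround fail fast rather than silently exporting incomplete time series, a startup check along these lines can help. This is only a sketch: the variable names match the manifest snippet above, and check_k8s_env is a hypothetical helper, not part of opencensus.

```python
import os

# Env vars the workaround above injects for k8s resource labels
# (names match the manifest snippet; adjust if yours differ).
REQUIRED_VARS = ("NAMESPACE", "CONTAINER_NAME")

def check_k8s_env(environ=None):
    """Return the names of required env vars that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]

# Example: warn before wiring up the exporter.
missing = check_k8s_env()
if missing:
    print("missing k8s resource-label env vars: %s" % ", ".join(missing))
```

Running this before new_stats_exporter surfaces a misconfigured pod spec at startup instead of as a 400 from the Stackdriver backend.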

@seawolf42

@ymaki thank you so much, that resolved the issue.

@seawolf42

Re-opening for tracking; I intend to submit a PR updating the documentation with this. Would someone on the team be willing to assign this issue to me for resolution?

@seawolf42 seawolf42 reopened this Oct 7, 2019
@lzchen lzchen assigned lzchen and seawolf42 and unassigned lzchen Oct 7, 2019

lzchen commented Oct 7, 2019

@seawolf42
Done! Thanks for working on this!

@seawolf42

Thanks @lzchen, I should be able to get to this shortly after 10/15.


rphillipsz commented Oct 7, 2019

I'm having the same issue, but in opencensus-node. It works fine locally, but results in the same error when deployed to GKE.
Update: the workaround by @ymaki worked for me as well.


c24t commented Oct 9, 2019

@seawolf42, @ymaki, @rphillipsz did you just start seeing this issue recently? Or did you run into this while setting up monitoring on a new GKE project? If the code in question previously worked on GKE, when did it start failing?

We set the time series' resource labels from environment variables, and use different labels for different resource types. See e.g. get_k8s_metadata for kubernetes containers. Stackdriver also expects certain labels depending on the resource type.
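The env-var-to-label mapping described here can be sketched roughly as follows. This is a hypothetical simplification of what get_k8s_metadata does; the exact variable names and labels in opencensus may differ.

```python
import os

# Hypothetical mapping from Stackdriver k8s_container resource labels to the
# env vars they are populated from; a simplified sketch, not the real code.
K8S_LABEL_ENV_VARS = {
    "namespace_name": "NAMESPACE",
    "container_name": "CONTAINER_NAME",
    "pod_name": "HOSTNAME",
}

def k8s_resource_labels(environ=None):
    """Collect whichever k8s resource labels are present in the environment."""
    environ = os.environ if environ is None else environ
    return {
        label: environ[var]
        for label, var in K8S_LABEL_ENV_VARS.items()
        if environ.get(var)
    }
```

Under this model, if either NAMESPACE or CONTAINER_NAME is absent from the container's environment, the exported label set is incomplete and the backend rejects the write with the 400 shown above.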

As far as I can tell, if this is a regression, it's happening for one of two reasons: either GKE stopped populating the NAMESPACE and CONTAINER_NAME environment variables, or the stackdriver backend changed its handling of required resource labels.

Thanks for the detailed report @seawolf42, sorry to keep you waiting on the fix.

@seawolf42

@c24t this is my first time setting up OpenCensus on a new project (and on any project; I had never used OpenCensus before this one). I first installed OC around the second week of August, and it has been non-functional on my project in GKE the entire time, while working completely as expected locally over that same period.

I can get GCP details for you if it helps uncover a regression, just let me know what pieces of information would be helpful.

@rphillipsz

We've been using OpenCensus in Go (also in GKE) for a while and aren't having any problems with stats. This is the first time I've used the node library, and I ran into this. Since adding the NAMESPACE and CONTAINER_NAME env variables to the container fixed the problem, and if that's how they've always been populated, I'd guess GKE must have stopped populating them.


ymaki commented Oct 10, 2019

I'm not sure whether this is a regression, or when it would have happened, because I only recently started using OpenCensus for the first time and immediately hit this issue.
