polaris.checker instances periodically become unhealthy when deployed on k8s #1380

Open
huangpj0210 opened this issue Aug 27, 2024 · 8 comments
Labels
bug Something isn't working

huangpj0210 commented Aug 27, 2024

Describe the bug
[Screenshot]
k8s StatefulSet YAML file:

kind: StatefulSet
apiVersion: apps/v1
metadata:
  name: polaris
  namespace: basic
  labels:
    app: polaris
spec:
  replicas: 2
  selector:
    matchLabels:
      app: polaris
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: polaris
    spec:
      volumes:
        - name: polaris-server-config
          configMap:
            name: polaris-server-config
            defaultMode: 420
      containers:
        - name: polaris-server
          image: 'polarismesh/polaris-server:v1.18.1'
          resources:
            limits:
              cpu: '1'
              memory: 2Gi
            requests:
              cpu: 100m
              memory: 128Mi
          volumeMounts:
            - name: polaris-server-config
              mountPath: /root/conf/polaris-server.yaml
              subPath: polaris-server.yaml
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: Always
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      securityContext: {}
      schedulerName: default-scheduler
  serviceName: polaris
  podManagementPolicy: OrderedReady
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0
  revisionHistoryLimit: 10
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain
    whenScaled: Retain

polaris configuration:

# Server bootstrap configuration
bootstrap:
  # Global logging
  logger:
    config:
      rotateOutputPath: log/runtime/polaris-config.log
      errorRotateOutputPath: log/runtime/polaris-config-error.log
      rotationMaxSize: 100
      rotationMaxBackups: 10
      rotationMaxAge: 7
      outputLevel: info
      outputPaths:
      - stdout
      errorOutputPaths:
      - stderr
    auth:
      rotateOutputPath: log/runtime/polaris-auth.log
      errorRotateOutputPath: log/runtime/polaris-auth-error.log
      rotationMaxSize: 100
      rotationMaxBackups: 10
      rotationMaxAge: 7
      outputLevel: info
      outputPaths:
        - stdout
      errorOutputPaths:
        - stderr
    store:
      rotateOutputPath: log/runtime/polaris-store.log
      errorRotateOutputPath: log/runtime/polaris-store-error.log
      rotationMaxSize: 100
      rotationMaxBackups: 10
      rotationMaxAge: 7
      outputLevel: info
      outputPaths:
        - stdout
      errorOutputPaths:
        - stderr
    cache:
      rotateOutputPath: log/runtime/polaris-cache.log
      errorRotateOutputPath: log/runtime/polaris-cache-error.log
      rotationMaxSize: 100
      rotationMaxBackups: 10
      rotationMaxAge: 7
      outputLevel: info
      outputPaths:
        - stdout
      errorOutputPaths:
        - stderr
    naming:
      rotateOutputPath: log/runtime/polaris-naming.log
      errorRotateOutputPath: log/runtime/polaris-naming-error.log
      rotationMaxSize: 100
      rotationMaxBackups: 10
      rotationMaxAge: 7
      outputLevel: info
      outputPaths:
        - stdout
      errorOutputPaths:
        - stderr
    healthcheck:
      rotateOutputPath: log/runtime/polaris-healthcheck.log
      errorRotateOutputPath: log/runtime/polaris-healthcheck-error.log
      rotationMaxSize: 100
      rotationMaxBackups: 10
      rotationMaxAge: 7
      outputLevel: info
      outputPaths:
        - stdout
      errorOutputPaths:
        - stderr
    xdsv3:
      rotateOutputPath: log/runtime/polaris-xdsv3.log
      errorRotateOutputPath: log/runtime/polaris-xdsv3-error.log
      rotationMaxSize: 100
      rotationMaxBackups: 10
      rotationMaxAge: 7
      outputLevel: info
      outputPaths:
        - stdout
      errorOutputPaths:
        - stderr
    apiserver:
      rotateOutputPath: log/runtime/polaris-apiserver.log
      errorRotateOutputPath: log/runtime/polaris-apiserver-error.log
      rotationMaxSize: 100
      rotationMaxBackups: 10
      rotationMaxAge: 7
      outputLevel: info
      outputPaths:
        - stdout
      errorOutputPaths:
        - stderr
    discoverEventLocal:
      rotateOutputPath: log/event/polaris-discoverevent.log
      errorRotateOutputPath: log/event/polaris-discoverevent-error.log
      rotationMaxSize: 100
      rotationMaxBackups: 10
      rotationMaxAge: 7
      outputLevel: info
      outputPaths:
        - stdout
      errorOutputPaths:
        - stderr
    discoverstat:
      rotateOutputPath: log/statis/polaris-discoverstat.log
      errorRotateOutputPath: log/statis/polaris-discoverstat-error.log
      rotationMaxSize: 100
      rotationMaxBackups: 10
      rotationMaxAge: 7
      outputLevel: info
      outputPaths:
        - stdout
      errorOutputPaths:
        - stderr
    token-bucket:
      rotateOutputPath: log/runtime/polaris-ratelimit.log
      errorRotateOutputPath: log/runtime/polaris-ratelimit-error.log
      rotationMaxSize: 100
      rotationMaxBackups: 10
      rotationMaxAge: 7
      outputLevel: info
      outputPaths:
        - stdout
      errorOutputPaths:
        - stderr
    local:
      rotateOutputPath: log/statis/polaris-statis.log
      errorRotateOutputPath: log/statis/polaris-statis-error.log
      rotationMaxSize: 100
      rotationMaxBackups: 10
      rotationMaxAge: 7
      outputLevel: info
      outputPaths:
        - stdout
      errorOutputPaths:
        - stderr
    HistoryLogger:
      rotateOutputPath: log/operation/polaris-history.log
      errorRotateOutputPath: log/operation/polaris-history-error.log
      rotationMaxSize: 100
      rotationMaxBackups: 10
      rotationMaxAge: 7
      rotationMaxDurationForHour: 24
      outputLevel: info
      outputPaths:
        - stdout
      errorOutputPaths:
        - stderr
    default:
      rotateOutputPath: log/runtime/polaris-default.log
      errorRotateOutputPath: log/runtime/polaris-default-error.log
      rotationMaxSize: 100
      rotationMaxBackups: 10
      rotationMaxAge: 7
      outputLevel: info
      outputPaths:
        - stdout
      errorOutputPaths:
        - stderr
    nacos-apiserver:
      rotateOutputPath: log/runtime/nacos-apiserver.log
      errorRotateOutputPath: log/runtime/nacos-apiserver-error.log
      rotationMaxSize: 100
      rotationMaxBackups: 30
      rotationMaxAge: 7
      outputLevel: info
      compress: true
      outputPaths:
        - stdout
      errorOutputPaths:
        - stderr
  # Start servers in order
  startInOrder:
    open: true # whether to enable; disabled by default
    key: sz # global lock
  # Register as a Polaris service
  polaris_service:
    probe_address: basic.mysql:3306
    enable_register: true
    isolated: false
    services:
      - name: polaris.checker
        protocols:
          - service-grpc
# apiserver configuration
apiservers:
  - name: service-eureka
    option:
      listenIP: "0.0.0.0"
      listenPort: 8761
      namespace: default
      owner: polaris
      refreshInterval: 10
      deltaExpireInterval: 60
      unhealthyExpireInterval: 180
      connLimit:
        openConnLimit: false
        maxConnPerHost: 1024
        maxConnLimit: 10240
        whiteList: 127.0.0.1
        purgeCounterInterval: 10s
        purgeCounterExpired: 5s
  - name: api-http # protocol name, globally unique
    option:
      listenIP: "0.0.0.0"
      listenPort: 8090
      enablePprof: true # debug pprof
      enableSwagger: true # debug pprof
      connLimit:
        openConnLimit: false
        maxConnPerHost: 128
        maxConnLimit: 5120
        whiteList: 127.0.0.1
        purgeCounterInterval: 10s
        purgeCounterExpired: 5s
    api:
      admin:
        enable: true
      console:
        enable: true
        include: [default]
      client:
        enable: true
        include: [discover, register, healthcheck]
      config:
        enable: true
        include: [default]
  - name: service-grpc
    option:
      listenIP: "0.0.0.0"
      listenPort: 8091
      connLimit:
        openConnLimit: false
        maxConnPerHost: 128
        maxConnLimit: 5120
    api:
      client:
        enable: true
        include: [discover, register, healthcheck]
  - name: config-grpc
    option:
      listenIP: "0.0.0.0"
      listenPort: 8093
      connLimit:
        openConnLimit: false
        maxConnPerHost: 128
        maxConnLimit: 5120
    api:
      client:
        enable: true
  - name: xds-v3
    option:
      listenIP: "0.0.0.0"
      listenPort: 15010
      connLimit:
        openConnLimit: false
        maxConnPerHost: 128
        maxConnLimit: 10240
  - name: service-nacos
    option:
      listenIP: "0.0.0.0"
      listenPort: 8848
      # Polaris namespace that the default nacos namespace maps to
      defaultNamespace: default
      connLimit:
        openConnLimit: false
        maxConnPerHost: 128
        maxConnLimit: 10240
#  - name: service-l5
#    option:
#      listenIP: 0.0.0.0
#      listenPort: 7779
#      clusterName: cl5.discover
# Core logic configuration
auth:
  # auth's option has migrated to auth.user and auth.strategy
  # it's still available when filling auth.option, but you will receive warning log that auth.option has deprecated.
  user:
    name: defaultUser
    option:
      # Token encrypted SALT, you need to rely on this SALT to decrypt the information of the Token when analyzing the Token
      # The length of SALT needs to satisfy the following one:len(salt) in [16, 24, 32]
      salt: polarismesh2023
  strategy:
    name: defaultStrategy
    option:
      # Console auth switch, default true
      consoleOpen: true
      # Console Strict Model, default true
      consoleStrict: true
      # Customer auth switch, default false
      clientOpen: false
      # Customer Strict Model, default close
      clientStrict: false
namespace:
  autoCreate: true
naming:
  # Batch controller
  batch:
    register:
      open: true
      queueSize: 10240
      waitTime: 32ms
      maxBatchCount: 32
      concurrency: 64
    deregister:
      open: true
      queueSize: 10240
      waitTime: 32ms
      maxBatchCount: 32
      concurrency: 64
    clientRegister:
      open: true
      queueSize: 10240
      waitTime: 32ms
      maxBatchCount: 32
      concurrency: 64
    clientDeregister:
      open: true
      queueSize: 10240
      waitTime: 32ms
      maxBatchCount: 32
      concurrency: 64
# Config center module startup configuration
config:
  # Whether to enable the config module
  open: true
# Health check configuration
healthcheck:
  open: true
  service: polaris.checker
  slotNum: 30
  minCheckInterval: 1s
  maxCheckInterval: 30s
  batch:
    heartbeat:
      open: true
      queueSize: 10240
      waitTime: 32ms
      maxBatchCount: 32
      concurrency: 64
  checkers:
    - name: heartbeatLeader
      option:
        soltNum: 128
# Maintain configuration
maintain:
  jobs:
    # Clean up long term unhealthy instance
    - name: DeleteUnHealthyInstance
      enable: false
      option:
        # Valid time units are "ns", "us" (or "µs"), "ms", "s", "m", "h".
        instanceDeleteTimeout: 60m
    # Delete auto-created service without an instance
    - name: DeleteEmptyAutoCreatedService
      enable: false
      option:
        # Valid time units are "ns", "us" (or "µs"), "ms", "s", "m", "h".
        serviceDeleteTimeout: 30m
    # Clean soft deleted instances
    - name: CleanDeletedInstances
      enable: true
      option:
        # Valid time units are "ns", "us" (or "µs"), "ms", "s", "m", "h".
        # instanceCleanTimeout: 10m
    # Clean soft deleted clients
    - name: CleanDeletedClients
      enable: true
      option:
        # Valid time units are "ns", "us" (or "µs"), "ms", "s", "m", "h".
        # clientCleanTimeout: 10m
# Storage configuration
store:
  # Standalone file storage plugin
  # name: boltdbStore
  # option:
  #   path: ./polaris.bolt
  # Database storage plugin
  name: defaultStore
  option:
    master:
       dbType: mysql
       dbName: polaris_server
       dbUser: polaris
       dbPwd: #password#
       dbAddr: basic.mysql:3306
       maxOpenConns: -1
       maxIdleConns: -1
       connMaxLifetime: 300 # in seconds
       txIsolationLevel: 2 #LevelReadCommitted
# Plugin configuration
plugin:
  history:
    entries:
      - name: HistoryLogger
  discoverEvent:
    entries:
      - name: discoverEventLocal
  statis:
    entries:
      - name: local
        option:
          interval: 60
      - name: prometheus
  ratelimit:
    name: token-bucket
    option:
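      remote-conf: false # whether to use remote configuration
      ip-limit: # per-IP rate limiting, global
        open: false # whether per-IP rate limiting is enabled
        global:
          open: false
          bucket: 300 # peak burst size
          rate: 200 # average requests per second per IP
        resource-cache-amount: 1024 # maximum number of cached IPs
        white-list: [127.0.0.1]
      instance-limit:
        open: false
        global:
          bucket: 200
          rate: 100
        resource-cache-amount: 1024
      api-limit: # per-API rate limiting
        open: false # global switch for API rate limiting; limiting is active only when true. Disabled by default
        rules:
          - name: store-read
            limit:
              open: false # global setting for the API; if not configured under the api sub-item, the API is limited according to global
              bucket: 2000 # maximum size of the token bucket
              rate: 1000 # tokens generated per second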
          - name: store-write
            limit:
              open: false
              bucket: 1000
              rate: 500
        apis:
          - name: "POST:/v1/naming/services"
            rule: store-write
          - name: "PUT:/v1/naming/services"
            rule: store-write
          - name: "POST:/v1/naming/services/delete"
            rule: store-write
          - name: "GET:/v1/naming/services"
            rule: store-read
          - name: "GET:/v1/naming/services/count"
            rule: store-read
  crypto: # configuration encryption
    entries:
      - name: AES

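Heartbeat check logs:
polaris-healthcheck.log
polaris-healthcheck-error.log

To Reproduce
With StatefulSet replicas set to 2, one instance becomes unhealthy after running for a few days. With replicas set to 1, this problem has never occurred.

Expected behavior
The polaris instances stay healthy.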

Environment

  • Version: v1.18.1
  • OS: Alibaba Cloud Linux 3.2104 U9.1


@huangpj0210 huangpj0210 added the bug Something isn't working label Aug 27, 2024
@chuntaojun
Member

Doesn't it recover to a healthy state automatically?

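@huangpj0210
Author

[Screenshot]
No. The pod has to be deleted and recreated before it becomes healthy again, and after a while it turns unhealthy again on its own. I checked polaris-server.yaml and it has no health check configuration. However, my pod sometimes restarts automatically and then recovers. I checked the k8s events for the polaris pod and for the StatefulSet, and both event lists are empty.
[Screenshots]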

@chuntaojun
Member

Understood. I'll check this locally.

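@huangpj0210
Author

Hello, I found something: the unhealthy instance's memory usage is extremely high. My pod's memory limit is 2G, so the automatic restarts were probably triggered by the OOM killer once the unhealthy instance reached that limit.
[Screenshots]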
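
One way to confirm the OOM-kill suspicion above, even when the event lists are empty, is to look at the last termination reason that Kubernetes records in the pod status (`status.containerStatuses[].lastState.terminated.reason`). The following is a minimal sketch, not something posted in the thread: the helper name `oom_killed_containers` and the sample pod status are made up for illustration.

```python
import json


def oom_killed_containers(pod: dict) -> list:
    """Return (container name, restart count) pairs for containers
    whose most recent termination reason was OOMKilled."""
    hits = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        # lastState is empty if the container has never been restarted.
        terminated = status.get("lastState", {}).get("terminated") or {}
        if terminated.get("reason") == "OOMKilled":
            hits.append((status["name"], status.get("restartCount", 0)))
    return hits


# Hypothetical, trimmed pod status for illustration only (not data from this issue).
sample_pod = json.loads("""
{
  "status": {
    "containerStatuses": [
      {
        "name": "polaris-server",
        "restartCount": 3,
        "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}
      }
    ]
  }
}
""")

print(oom_killed_containers(sample_pod))
```

In practice the input would come from `kubectl -n basic get pod <pod-name> -o json`, parsed with `json.load`. An empty result means the last restart, if any, had a different cause.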

@chuntaojun
Member

You can access 8090/debug/pprof to capture a memory profile, then post it in this issue.
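
For readers who want to script the capture suggested above, here is a minimal sketch. It assumes that the api-http server (port 8090, with `enablePprof: true` as in the posted configuration) serves Go's standard `net/http/pprof` handlers, so that the heap profile is at `/debug/pprof/heap`. The thread only mentions `8090/debug/pprof`, so the `/heap` sub-path is an assumption, and the function names are hypothetical.

```python
import urllib.request


def heap_profile_url(host: str, port: int = 8090) -> str:
    """Build the heap profile URL.

    Assumption: the server exposes the standard Go pprof path /debug/pprof/heap.
    """
    return f"http://{host}:{port}/debug/pprof/heap"


def save_heap_profile(host: str, out_path: str = "heap.pprof",
                      port: int = 8090, timeout: float = 30.0) -> int:
    """Download the heap profile to out_path and return the number of bytes written."""
    with urllib.request.urlopen(heap_profile_url(host, port), timeout=timeout) as resp:
        data = resp.read()
    with open(out_path, "wb") as f:
        f.write(data)
    return len(data)
```

Calling `save_heap_profile("<pod-ip>")` from a machine that can reach the pod would write `heap.pprof` to the current directory. The file can then be inspected with `go tool pprof heap.pprof` or attached to the issue.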

@huangpj0210
Author

Hello, this is the heap.pprof file I just exported from the unhealthy polaris instance 10.233.111.79. Please take a look.
heap.pprof.zip

@flashbai

Hello, is there any conclusion on this issue yet? I have run into the same problem and would like to ask about it.

@chuntaojun
Member

There are indeed several fairly subtle cases that can lead to an out-of-memory (OOM) condition.
