SysOps Administrator Associate

5/2018 - 9/2018


Monitoring And Metrics

Virtualization Types

Linux Amazon Machine Images use one of two types of virtualization:

| AMI | Type | Effect |
|---|---|---|
| PV | Paravirtual | Historically better performance than HVM, but no longer the case |
| HVM | Hardware virtual machine | More modern, same or better performance than PV |

EC2 Instance Types

General Purpose: Balance of compute, memory and networking

  • M5 (2017)
    • Requires HVM AMIs
    • Storage via EBS or NVMe SSD (physically connected to the host server)
  • M4 (2015)
    • Allows enhanced networking
    • EBS-optimized
  • M3 (2012)
    • SSD (instance) store
  • T3 (2018)
    • 30% better price performance than T2
  • T2 (2014)
    • Intended for workloads that do not use the full CPU constantly (e.g. web server)
    • Allows burstable performance
    • Burst credits allow the instance to 'burst' past the baseline performance up to 100%
    • 1 credit = 100% load per core per minute
    • Credits are earned per hour, expire after 24h
    • EBS storage only

Compute optimized: Lowest price for compute performance

  • C5 (2016)
    • Intel Skylake
    • Uses Nitro, Amazon’s lightweight hardware-accelerated hypervisor
    • Better performance and pricing than C4
  • C4 (2015)
    • Intel Haswell
    • Optimized for EC2
    • Allows enhanced networking and clustering
    • EBS-optimized
  • C3 (2013)
    • SSD (instance) store
    • Allows enhanced networking and clustering

Memory optimized: Lowest price for memory performance

  • Z1d (2018)
    • Offers both high compute capacity and a high memory footprint
    • Ideal for workloads with high per-core licensing costs
  • X1 (2016)
    • One of the lowest prices per GiB of RAM
    • SSD storage and EBS-optimized by default
    • X1e has even more RAM
  • R5 (2018)
    • Uses Nitro, Amazon’s lightweight hardware-accelerated hypervisor
  • R4 (2016)
    • Improved networking and EBS performance
  • R3 (2014)
    • SSD (instance) store
    • High memory capacity
    • Allows enhanced networking

GPU optimized

  • P3 (2017)
    • Faster than P2
  • P2 (2016)
    • Intended for general-purpose GPU compute applications
  • G3 (2017)
    • Optimized for graphics-intensive applications
    • Faster than G2
  • G2 (2013)
    • High frequency processors
    • High-performance NVIDIA GPUs

Storage optimized: Very fast SSD-backed instance storage optimized for high random I/O and high IOPS

  • H1 (2017)
    • HDD-based local storage
    • Delivers high disk throughput
    • Balance of compute and memory
  • I3 (2016)
    • (NVMe) SSD-backed instance storage optimized for low latency
    • Very high random I/O performance
  • D2 (2015)
    • Lowest price per disk throughput performance
  • I2 (2013)
    • SSD (instance) store
    • Allows enhanced networking
    • Supports TRIM (more efficient SSD operations)

RDS instance types: Optimized to fit different relational database use cases

  • db.*
    • General purpose, memory optimized, burstable performance
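The T2 credit rule noted above (1 credit = one core at 100% for one minute) can be turned into simple arithmetic. The following is a minimal sketch, not an AWS API; the function names are made up for illustration, and the earn rate of 6 credits per hour and the balance of 30 credits used in the example are assumed values.

```python
def credits_consumed(utilization_pct, minutes, vcpus=1):
    """CPU credits spent: 1 credit = one vCPU at 100% load for one minute."""
    return utilization_pct / 100.0 * minutes * vcpus


def minutes_of_full_burst(credit_balance, earn_per_hour, vcpus=1):
    """Minutes all vCPUs can run at 100% before the credit balance is used up.

    Credits keep being earned while bursting, so the balance drains at the
    net rate (spend per minute minus earn per minute).
    """
    spend_per_minute = float(vcpus)        # 1 credit per vCPU per minute at 100%
    earn_per_minute = earn_per_hour / 60.0
    net_drain = spend_per_minute - earn_per_minute
    if net_drain <= 0:
        return float("inf")                # earning at least as fast as spending
    return credit_balance / net_drain


# Assumed example values: 30 banked credits, 6 credits earned per hour.
print(credits_consumed(50, 10))            # 10 minutes at 50% on one vCPU
print(minutes_of_full_burst(30, 6))
```

A web server idling at low load banks credits and can then absorb short spikes; sustained load above the baseline drains the balance.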

EC2 Monitoring

EC2 Status Checks

  • AWS performs automated checks on every running EC2 instance
  • Performed every minute
  • Each returns a pass or a fail status

System Status Check

  • Loss of network connectivity
  • Loss of system power
  • Hardware/software issues on physical host
  • Solution
    • Stop and start instance
    • Terminate and re-launch instance
    • Contact AWS
  • Can configure for auto-recovery
    • Instance is recovered on healthy hardware and retains its instance ID, private and Elastic IP addresses, EBS volumes, etc.

Instance Status Check

  • Failed system status check
  • Network/startup configuration issues
  • Memory/disk problems
  • Kernel compatibility issues
  • Solution
    • Fix problem
    • Stop and start instance
    • Terminate and re-launch instance, potentially with more memory/network/disk/...

EBS Monitoring

EBS Status Checks

  • Run every 5 minutes

    • insufficient-data while checks are still running
    • ok if all checks pass
    • warning typically indicates performance degradation on Provisioned IOPS volumes
    • impaired if a check fails, e.g. the volume is stalled or not available
  • If Amazon EBS finds that data on a volume might be inconsistent, it disables I/O to that volume.

    • Changes status to impaired
    • This behaviour can be disabled

EBS Performance Essentials

IOPS (Input/Output Operations Per Second) is a common performance measurement used to benchmark computer storage devices like hard disk drives (HDD), solid state drives (SSD), and storage area networks (SAN).

  • I/O size is capped at 256 KiB for SSD volumes and 1,024 KiB for HDD volumes because SSD volumes handle small or random I/O much more efficiently than HDD volumes.
  • SSDs deliver constant performance for both random and sequential I/O
  • HDDs have optimal performance for large and sequential I/O
  • HDDs can deliver more throughput but drastically fewer IOPS
| | gp2 | io1 | st1 | sc1 |
|---|---|---|---|---|
| Volume type | General purpose SSD | Provisioned IOPS SSD | Throughput optimized HDD | Cold HDD |
| Purpose | Balances price and performance | For mission-critical low-latency or high-throughput workloads | Low cost HDD volume designed for frequently accessed, throughput-intensive workloads | Lowest cost HDD volume designed for less frequently accessed workloads |
| Volume size | 1 GiB - 16 TiB | 4 GiB - 16 TiB | 500 GiB - 16 TiB | 500 GiB - 16 TiB |
| Max. IOPS(1)/volume | 10,000 | 32,000 | 500 | 250 |
| Max. throughput/volume | 160 MiB/s | 500 MiB/s | 500 MiB/s | 250 MiB/s |

IOPS behaviour:

  • gp2
    • 3 IOPS per GiB (larger volume means more IOPS)
    • Minimum 100 IOPS, maximum 10,000 IOPS
    • Can burst to 3,000 IOPS if volume size is < 1 TiB
    • Bursting requires credits, which accrue at the baseline rate of 3 credits per GiB per second
    • Max 5.4 million credits (also the initial value), enough for 3,000 IOPS for 30 min
    • Running out of credits reverts the volume back to baseline performance
  • io1
    • Up to 50 IOPS per GiB (larger volume means more IOPS), up to 32,000
    • Does not burst, delivers a consistent IOPS rate instead

(1) gp2/io1 based on 16 KiB I/O size, st1/sc1 based on 1 MiB I/O size
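The gp2 numbers above can be checked with a few lines of arithmetic. This is a minimal sketch under the figures stated in these notes (3 IOPS per GiB, 100 to 10,000 IOPS baseline, 3,000 IOPS burst, 5.4 million credits); the function names are illustrative. Unlike the rounded "30 min" figure, it accounts for credits that keep accruing at the baseline rate while the volume bursts.

```python
GP2_IOPS_PER_GIB = 3
GP2_MIN_IOPS = 100
GP2_MAX_IOPS = 10_000
GP2_BURST_IOPS = 3_000
GP2_MAX_CREDITS = 5_400_000


def gp2_baseline_iops(size_gib):
    """Baseline IOPS: 3 per GiB, clamped to the 100..10,000 range."""
    return max(GP2_MIN_IOPS, min(GP2_IOPS_PER_GIB * size_gib, GP2_MAX_IOPS))


def gp2_burst_seconds(size_gib, credits=GP2_MAX_CREDITS):
    """Seconds a volume can sustain 3,000 IOPS starting from `credits`.

    Credits drain at (burst rate - baseline rate) because the volume keeps
    earning credits at its baseline rate while bursting.
    """
    baseline = gp2_baseline_iops(size_gib)
    if baseline >= GP2_BURST_IOPS:
        return 0.0  # baseline already at or above the burst level
    return credits / (GP2_BURST_IOPS - baseline)


print(gp2_baseline_iops(100))   # 300
print(gp2_burst_seconds(100))   # 2000.0
```

Ignoring the refill, a full bucket lasts 5,400,000 / 3,000 = 1,800 seconds, which is where the 30 minutes come from.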

  • Using EBS optimized instances guarantees optimal networking between EBS and EC2
  • Pre-warming/initialization
    • No longer needed for new EBS volumes
    • Storage blocks on volumes restored from snapshots do need to be initialized (read from)

EFS Monitoring

  • Two throughput modes to choose from for your file system
    • Bursting Throughput - throughput on Amazon EFS scales as your file system grows
    • Provisioned Throughput - you can instantly provision the throughput of your file system (in MiB/s) independent of the amount of data stored.

Performance comparison

| | Amazon EFS | Amazon EBS Provisioned IOPS (io1) |
|---|---|---|
| Per-operation latency | Low, consistent latency. | Lowest, consistent latency. |
| Throughput scale | 10+ GB per second. | Up to 2 GB per second. |

Storage Characteristics Comparison

| | Amazon EFS | Amazon EBS Provisioned IOPS |
|---|---|---|
| Availability and durability | Data is stored redundantly across multiple AZs. | Data is stored redundantly in a single AZ. |
| Access | Up to thousands of Amazon EC2 instances, from multiple AZs, can connect concurrently to a file system. | A single Amazon EC2 instance in a single AZ can connect to a file system. |
| Use cases | Big data and analytics, media processing workflows, content management, web serving, and home directories. | Boot volumes, transactional and NoSQL databases, data warehousing, and ETL. |

S3 vs EFS vs EBS Comparison

| Amazon S3 | Amazon EBS | Amazon EFS |
|---|---|---|
| Can be publicly accessible | Accessible only via the given EC2 machine | Accessible via several EC2 machines and AWS services |
| Web interface | File system interface | Web and file system interface |
| Object storage | Block storage | File storage |
| Scalable | Hardly scalable | Scalable |
| Slower than EBS and EFS | Faster than S3 and EFS | Faster than S3, slower than EBS |
| Good for storing backups | Is meant to be an EC2 drive | Good for shareable applications and workloads |

CloudWatch

Monitoring service that plugs into many other services

  • Metrics
    • Based on currently used service
    • Not everything is available out of the box, e.g. no data on memory usage of EC2 instances
  • Alarms
    • Based on thresholds defined on metrics
    • Can be added to dashboard
    • Invoke Lambda, SNS, email, ...
    • Takes place once, at a specific point in time
      • Disable with mon-disable-alarm-actions via the legacy CloudWatch CLI (aws cloudwatch disable-alarm-actions in the current AWS CLI)
  • Logs
    • Log into log groups
  • Events
    • Define actions on things that happened
    • Define cron-based events
    • Events are recorded constantly over time

Key metrics for EC2

  • EC2 metrics are based on what is exposed to the hypervisor.
  • Basic Monitoring (default) submits values every 5 minutes, Detailed Monitoring every minute
  • Can install CloudWatch agent (new)
    • Provides access to more metrics
  • Can use CloudWatch monitoring scripts (old) to provide more metrics
    • Perl-scripts provided by AWS, need to manually install on instance
    • Use cron to automate sending data to CloudWatch
| Metric | Effect |
|---|---|
| CPUUtilization | The total CPU resources utilized within an instance at a given time. |
| DiskReadOps, DiskWriteOps | The number of read (write) operations performed on all instance store volumes. This metric is applicable for instance store-backed AMI instances. |
| DiskReadBytes, DiskWriteBytes | The number of bytes read (written) on all instance store volumes. This metric is applicable for instance store-backed AMI instances. |
| NetworkIn, NetworkOut | The number of bytes received (sent) on all network interfaces by the instance. |
| NetworkPacketsIn, NetworkPacketsOut | The number of packets received (sent) on all network interfaces by the instance. |
| StatusCheckFailed, StatusCheckFailed_Instance, StatusCheckFailed_System | Reports whether the instance has passed both/instance/system status check in the last minute. |

  • Cannot monitor memory usage, available disk space or swap usage out of the box
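Memory usage has to be measured inside the instance and pushed as a custom metric, which is what the agent and the monitoring scripts do. The following is a minimal sketch of the measuring half only, assuming a Linux /proc/meminfo layout; the sample text and function names are illustrative, and publishing the value to CloudWatch is deliberately left out.

```python
# Sample in the format of Linux /proc/meminfo (values are made up).
SAMPLE_MEMINFO = """\
MemTotal:        2048000 kB
MemFree:          512000 kB
MemAvailable:    1024000 kB
"""


def parse_meminfo(text):
    """Parse 'Key:   value kB' lines into a dict of integers (in kB)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            values[key.strip()] = int(rest.split()[0])
    return values


def memory_used_percent(meminfo):
    """Used memory as a percentage, based on MemTotal and MemAvailable."""
    total = meminfo["MemTotal"]
    available = meminfo["MemAvailable"]
    return 100.0 * (total - available) / total


print(memory_used_percent(parse_meminfo(SAMPLE_MEMINFO)))  # 50.0
```

On a real instance the text would come from reading /proc/meminfo, and a cron job would send the result to CloudWatch at the desired interval.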

Key metrics for EBS

| Metric | Effect |
|---|---|
| VolumeReadBytes, VolumeWriteBytes | Sum reports total bytes transferred, average also useful |
| VolumeReadOps, VolumeWriteOps | Total number of I/O operations |
| VolumeQueueLength | Number of read/write operation requests waiting to finish |
| VolumeTotalReadTime, VolumeTotalWriteTime | Total number of seconds spent by all operations in a given time |
| VolumeThroughputPercentage | Percentage of IOPS that was achieved out of total provisioned IOPS |
| VolumeConsumedReadWriteOps | Total amount of r/w operations consumed within a specific time period |

  • Cannot monitor disk usage percentage

Key metrics for EFS

| Metric | Effect |
|---|---|
| BurstCreditBalance | The number of burst credits that a file system has. |
| ClientConnections | The number of client connections to a file system. |
| DataReadIOBytes, DataWriteIOBytes | The number of bytes for each file system read (write) operation. |
| MetadataIOBytes | The number of bytes for each metadata operation. |
| PercentIOLimit | Shows how close a file system is to reaching the I/O limit of the General Purpose performance mode. |
| PermittedThroughput | The maximum amount of throughput a file system is allowed. |
| TotalIOBytes | The number of bytes for each file system operation, including data read, data write, and metadata operations. |

Key metrics for ELB (classic load balancer)

| Metric | Effect |
|---|---|
| Latency | Time it takes to receive a response. Measure max and average |
| BackendConnectionErrors | Number of connections to registered instances that were not successfully established. Measure sum and look at difference between min and max |
| SurgeQueueLength | Total number of requests waiting to get routed. Look at max and average |
| SpilloverCount | Requests dropped because the surge queue was full. Look at sum |
| HTTPCode_ELB_3XX_Count, HTTPCode_ELB_4XX_Count, HTTPCode_ELB_5XX_Count | The number of HTTP 3XX, 4XX or 5XX response codes that originate from the load balancer. This count does not include any response codes generated by the targets. |
| RequestCount | Number of completed requests |
| HealthyHostCount, UnhealthyHostCount | Self-explanatory |

  • In case of sudden and very large increases in traffic it's possible to contact AWS and have them 'pre-warm' the ELB.

Spillover and surge queue metrics indicate that the ELB is overloaded

  • Typically this means that the backend system cannot process requests as fast as they are coming in
    • Ideally load balance into an autoscaling group.

Key metrics for ALB (application load balancer)

| Metric | Effect |
|---|---|
| RequestCount | Number of completed requests |
| HealthyHostCount, UnhealthyHostCount | Self-explanatory |
| TargetResponseTime | The time elapsed after the request leaves the load balancer until a response from the target is received. |
| HTTPCode_ELB_3XX_Count, HTTPCode_ELB_4XX_Count, HTTPCode_ELB_5XX_Count | The number of HTTP 3XX, 4XX or 5XX response codes that originate from the load balancer. This count does not include any response codes generated by the targets. |

Key metrics for NLB (network load balancer)

| Metric | Effect |
|---|---|
| ProcessedBytes | The total number of bytes processed by the load balancer, including TCP/IP headers. |
| TCP_Client_Reset_Count | The total number of reset (RST) packets sent from a client to a target. |
| TCP_ELB_Reset_Count | The total number of reset (RST) packets generated by the load balancer. |
| TCP_Target_Reset_Count | The total number of reset (RST) packets sent from a target to a client. |

Key metrics for ElastiCache

Supports memcached and redis

| Metric | memcached | redis |
|---|---|---|
| | Designed for simplicity | Supports a much richer set of features. Can be backed up if in cluster mode |
| CPU utilization | Multithreaded. Stay under 90%. Otherwise increase size of node or add more nodes | Single-threaded. Stay under 90% / number of cores. Otherwise increase number of read replicas or use a larger cache instance |
| Evictions | Increase node size or add nodes to cluster | Increase node size |
| Concurrent connections | Check application logic | Check application logic |
| Swap usage | Avoid swapping. Increase memcached_connections_overhead | Avoid swapping. Increase node size, or increase memory reserved for connection overhead (will decrease memory available for cache) |

Key metrics for RDS

| Metric | Effect |
|---|---|
| CPUUtilization | Percentage of CPU utilization |
| DatabaseConnections | Number of connections that we have at a given point in time |
| DiskQueueDepth | Number of read/write requests waiting to access the disk |
| FreeableMemory | Amount of available RAM |
| FreeStorageSpace | Amount of available storage space |
| SwapUsage | Amount of memory contents that is stored on disk. An increase usually has to do with running out of available RAM |
| ReadIOPS/WriteIOPS | The number of I/O operations completed per second. If we don’t have enough IOPS, performance will slow down |
| ReadLatency/WriteLatency | Average amount of time taken per disk I/O operation. High latency can be solved with more IOPS |
| ReadThroughput/WriteThroughput | Average number of bytes read from or written to disk per second |

  • Also look at RDS Events

Costs

Consolidated Billing

Set up a billing account to pay for multiple linked accounts at the same time.

  • Allows for consolidated billing. Does not give IAM visibility into linked accounts.
  • Enables volume discounts across linked accounts.
  • If one account uses reserved instances, other accounts running on similar on demand instances will be billed under the reserved instance price. Similar for RDS instances.
  • All credits earned while linked will be applied to consolidated bill.

Limits:

  • Up to 20 linked accounts

Billing Metrics & Alarms

  • Only shows metrics of services that have been used.
  • Set up billing alarms based on billing metrics.
    • Overall billing alarm, or service-specific alarms
    • Can still be account-specific, even with consolidated billing

Costs Optimization

  • Purchase EC2 Reserved Instances
    • Commit for 1-3 years and get a discount
  • Minimize the number of running instances
    • Set up CloudWatch alarms to spin down underutilized instances
    • Find balance between acceptable downtime & costs to eliminate this downtime
  • Remove unused Load Balancers
  • Look for idle (unattached) EBS volumes
    • Delete unused volumes
      • Take a snapshot to keep the data
    • Downsize volumes that aren't near full capacity
    • Look for over-provisioned IOPS
  • Look for unassociated Elastic IP addresses
  • Look for idle RDS instances
    • Check for 0 connections

Cost Explorer

  • Costs per time frame per service, various grouping and filtering options
  • Provides forecasts
  • Pricing API allows downloading pricing information for specific services

High Availability

Scalability & Elasticity Fundamentals

  • Pay only for what you need when you need it
    • Define minimum capacity
    • Define what needs to stretch out
| | Elasticity | Scalability |
|---|---|---|
| | Scaling up/down on demand | Scaling for growth in order to meet long-term requirements; typically does not focus on shrinking back |
| DynamoDB | Can provision more or less throughput | Stores as much data as we like, scales transparently |
| EC2 | Use autoscaling | More instances or bigger instance types |
| RDS | n/a | Bigger instances, more read replicas |

Reserved Instances

  • Reserve instances for a specific period of time
    • Standard reserved instances (fixed instance type)
    • Convertible reserved instances (can be exchanged against another convertible instance type)
    • Scheduled reserved instances (purchased by the hour on a set schedule with a set instance type)
  • Up to 50% cheaper than a fully utilized on-demand instance (because we commit upfront to a certain usage)
  • Reserves capacity, which avoids 'insufficient instance capacity' errors when AWS is unable to provision additional on-demand instances in that AZ
  • Can resell reserved capacity on Reserved Instance Marketplace
  • Available for:
    • EC2
    • RDS (reserved instances)
    • DynamoDB (reserved capacity)
    • ElastiCache (reserved nodes)
    • CloudFront (reserved capacity)
    • Elastic MapReduce (reserved EC2 instances)
    • ECS (reserved EC2 instances)
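Whether a reservation pays off depends on how many hours the instance actually runs, since the reserved price is owed for every hour of the term. The following is a minimal sketch of that break-even reasoning; the hourly prices are hypothetical and the function names are illustrative.

```python
def break_even_utilization(on_demand_hourly, reserved_effective_hourly):
    """Fraction of the term an instance must run for the reservation to pay off.

    The reservation is billed for every hour of the term, on-demand only
    for the hours actually used.
    """
    return reserved_effective_hourly / on_demand_hourly


def cheaper_option(on_demand_hourly, reserved_effective_hourly, utilization):
    """Return 'reserved' or 'on-demand' for a utilization between 0 and 1."""
    threshold = break_even_utilization(on_demand_hourly, reserved_effective_hourly)
    return "reserved" if utilization > threshold else "on-demand"


# Hypothetical prices: $0.10/h on demand, $0.05/h effective reserved rate.
print(cheaper_option(0.10, 0.05, 0.8))   # reserved
print(cheaper_option(0.10, 0.05, 0.3))   # on-demand
```

With a 50% discount the reservation only wins if the instance runs more than half of the time, which is why the discount is usually quoted against a fully utilized on-demand instance.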

Autoscaling vs Resizing

  • Auto Scaling distributes load across multiple instances

    • Scheduled scaling allows scaling out or in on a schedule
    • Relatively complex to set up
    • Applications need to be designed to benefit from multiple instances
    • Components
      • Launch Configuration
      • Autoscaling Group
      • Scaling Policy
      • Cloudwatch Alarms
  • Changing instance size increases/decreases available resources to the running application

    • EBS backed instances need to be stopped before resizing
    • Data on instance store volumes needs to be migrated across
    • Not as flexible as auto scaling. Not elastic
    • Within an autoscaling group the to-be-resized instance might be treated as unhealthy

Load Balancers

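| | ALB | NLB | ELB |
|---|---|---|---|
| Name | Application Load Balancer | Network Load Balancer | Classic Load Balancer |
| Layer | 7 (application layer) | 4 (transport layer) | EC2-Classic network (deprecated) |
| Protocol | HTTP, HTTPS | TCP | TCP, SSL, HTTP, HTTPS |
| Health checks | yes | yes | yes |
| CloudWatch metrics | yes | yes | yes |
| Logging | yes | yes | yes |
| Zone failover | yes | yes | yes |
| Connection draining | yes | yes | yes |
| Load balancing to different ports on the same instance | yes | yes | no |
| WebSockets | yes | yes | no |
| IP addresses as targets | yes | yes | no |
| Load balancer deletion protection | yes | yes | no |
| Path-based routing | yes | no | no |
| Host-based routing | yes | no | no |
| Native HTTP/2 | yes | no | no |
| Configurable idle connection timeout | yes | no | yes |
| Cross-zone load balancing | yes | yes | yes |
| SSL offloading | yes | no | yes |
| Server name indication (SNI) | yes | no | no |
| Sticky sessions | yes | no | yes |
| Backend server encryption | yes | no | yes |
| Static IP | no | yes | no |
| Elastic IP | no | yes | no |
| Preserve source IP address | no | yes | no |
| Resource-based IAM permissions | yes | yes | yes |
| Tag-based IAM permissions | yes | yes | no |
| Slow start | yes | no | no |
| User authentication | yes | no | no |
| Redirects | yes | no | no |
| Fixed responses | yes | no | no |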

Elastic Load Balancer ('Classic LB')

Overview

  • External load balancer
    • Public facing
    • Often used to distribute load between web servers
    • Provides public DNS host name
  • Internal load balancer
    • Often used to distribute load between backend servers
    • Provides internal DNS host name
  • Configure (in AWS console)
    • Internal and external load balancer
    • Subnets for each AZ that traffic should be routed to
      • Can route into private subnets
    • Cross-zone load balancing
    • Connection draining (maximum time for the load balancer to keep connections alive before reporting the instance as de-registered)

Sticky Sessions

  • Need to make sure that session is maintained between instances
    • Load Balancer generated stickiness (duration based session stickiness)
    • Application generated stickiness (application based session stickiness)
    • For HA, use ElastiCache to persist and share session state, so that maintaining stickiness doesn't matter any more

RDS HA

  • Create subnets in different AZs
  • Create subnet group in RDS dashboard
    • Collection of subnets (typically private) in a VPC that is designated for DB instances
    • Should have subnets in at least two Availability Zones in a given region
  • Configure RDS for multi-AZ-deployments and turn replication on
    • Keeps a synchronous standby replica in a different AZ
      • Recommendation is use of Provisioned IOPS
    • Automatic failover in case of planned or unplanned outage of the first AZ
      • Most likely still has downtime
      • Can force failover by rebooting
    • Other benefits
      • Patching
      • Backups
    • Aurora can replicate across 3 AZs
  • Failover process is automated
    • AWS detects an issue and starts the failover process
    • DNS records are modified to point to the standby instance
    • Application re-establishes existing DB connections

HA for IP-based Applications

  • If the application requires specific IPs (that are hardcoded somewhere), autoscaling cannot be used
  • Use Elastic IP and standby instances in different AZs instead
    • Cannot use Elastic IP across different regions though
    • Scale by increasing instance size (vertical scaling)

HA/Fault Tolerance for Bastion Hosts

  • Assign Elastic IP to bastion host in AZ 1
    • This IP can also be whitelisted to comply with corporate regulations
  • Have another instance on standby in different AZ
  • Could be in ASG (min/max 1), so that it gets immediately replaced
  • Place 2 instances behind ELB and enable SSH Keep Alive
  • Place 1 instance behind ELB, configure auto recovery

Analysis

Optimize the environment to ensure maximum performance

Offloading database workload

  • Using read replicas

    • Read queries are routed to read replicas, reducing load on primary db instance (source instance)
      • Table indexes can be created on read replicas directly (and not on the master)
      • Some use cases (e.g. data analytics) can be performed exclusively against read replicas
    • To create read replicas, AWS initially creates a snapshot of the source instance
      • Multi-AZ failover instance (if enabled) is used for snapshotting
      • After that, all changes on the source instance are asynchronously replicated to the read replica
      • Implies data latency, which typically is acceptable.
        • ReplicaLag can be monitored and Cloudwatch alarms can be configured
    • Read replicas are not the same as multi-AZ failover instances which
      • are synchronously updated
      • are designed to handle failover
      • don't receive any load unless failover actually happens
    • Often it is beneficial to have both read replicas and multi-AZ failover instances
      • Read replicas themselves can not use the Multi-AZ feature
    • A single master can have up to 5 read replicas
      • Can be in different regions
  • Setting up a read replica

    • Configure from master instance or other read replica
      • Requires 'automated backups' to be enabled on source instance
    • Choice of db engine matters, because internal engine features are being used
    • Usually pick same database instance type as source instance uses
    • AWS provisions a different endpoint for the read replica
    • Configure use of endpoint on application level
  • Read replicas can be promoted to normal instances

    • E.g. use read replica to implement bigger changes on db level, after these have been finished promote to master instance
    • Useful for database sharding, could create replicas for each shard

Looking at EBS volumes

  • EBS pre-warming
    • Used to be required for maximum performance
    • Performance is reduced the very first time each block is accessed
    • Has been renamed to initialization and is no longer required if new EBS volumes are used
    • Still required for volumes that are restored from snapshots
      • Storage blocks must be initialized (pulled down from Amazon S3 and written to the volume)
      • Use dd or fio to read from every block
      • Only required if performance matters, obviously
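Initialization boils down to reading every block of the volume once. dd and fio are the usual tools; the following is only a minimal Python sketch of the same idea. The function name is illustrative, and on a real instance the path would be the block device of the restored volume (reading it requires root privileges).

```python
def read_all_blocks(path, block_size=1024 * 1024):
    """Read a file or block device sequentially and return the bytes read.

    The data itself is discarded; the point is that every block has been
    touched once, so later reads no longer pay the first-access penalty.
    """
    total = 0
    with open(path, "rb", buffering=0) as device:
        while True:
            chunk = device.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    return total
```

fio can read blocks in parallel, which matters for large volumes; this sketch is single-threaded like a plain dd run.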

Prewarming ELBs

  • ELB is designed to increase its resource capacity gradually
  • Pre-warming prevents HTTP 503 errors (ELB cannot handle any more requests) during sudden traffic spikes
  • Can contact AWS to pre-warm ELB
    • This should not really be required. Maybe if TV ads are running or so.
    • Use load testing tools to get a rough estimate of what the current ELB can handle
      • Increase at a rate no more than 50% per 5min.
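The "no more than 50% per 5 min" rule translates into a simple ramp-up plan for a load test. The following is a minimal sketch that generates such a plan; the function name and the request rates in the example are illustrative.

```python
def ramp_schedule(start_rps, target_rps, step_factor=1.5, step_minutes=5):
    """Return (minute, requests_per_second) steps from start to target.

    Each step raises the rate by at most `step_factor` (1.5 = +50%) and
    holds it for `step_minutes` before the next increase.
    """
    schedule = []
    minute, rate = 0, float(start_rps)
    while rate < target_rps:
        schedule.append((minute, rate))
        rate *= step_factor
        minute += step_minutes
    schedule.append((minute, float(target_rps)))  # final step lands on the target
    return schedule


for minute, rps in ramp_schedule(100, 500):
    print(minute, rps)
```

Going from 100 to 500 requests per second this way takes 20 minutes, which gives the load balancer time to scale along with the test.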

Identify Performance Bottlenecks and Implement Remedies

Resizing or changing EBS root volumes

  • If the EBS volume is at its performance limit
    • Either upgrade volume size to increase the amount of IOPS available
    • Or switch to Provisioned IOPS volumes (io1)
  • Resizing
    • Create snapshot of EBS volume first
      • Incrementally stored on S3
      • Can continue to use EBS volume while the snapshot is taking place
    • Create new volume from snapshot
    • Stop instance
    • Attach new volume

Setting up certificates for Elastic Load Balancers

  • Offloading overhead from the instances behind the ELB
    • Create ELB and configure https
    • Certificate from
      • ACM (AWS managed)
      • IAM (for external certificates)
      • Upload directly

Network bottlenecks

  • Primary network bottlenecks
    • EC2 instances
      • Instances in different AZs or regions
      • Different instance types get different bandwidth capacities
        • No absolute numbers communicated by AWS though
      • Not using enhanced network capabilities (not supported by some instance types)
      • Check for performance issues with iperf3 (github)
        • Measures performance for ip-based networks
      • Use VPC Peering to create a reliable connection
        • No single point of failure
    • Connection to on-prem networks
      • Use Direct Connect

Identify Potential Issues on a Given Application Deployment

EBS Root Devices on Terminated Instances - Ensuring Data Durability

  • EBS root volumes will be deleted on instance termination by default
    • Could create snapshot before termination to back up data
    • Could change the default (DeleteOnTermination attribute)
  • Data on instance store root volumes is always lost on instance termination

Troubleshooting Auto Scaling Issues

  • Attempting to use wrong subnet
  • AZ no longer available or supported (outage)
  • Security group does not exist
  • Associated keypair does not exist
  • Auto scaling configuration is not working correctly
  • Instance type specification does not exist in that AZ
  • Auto scaling is not enabled on that subnet
  • Invalid EBS device mapping
  • Attempt to attach EBS block device to instance-store AMI
  • AMI issues
  • Attempt to use placement groups with instance types that don't support that
  • AWS running out of capacity in that AZ
  • If an instance is stopped, e.g. for updating it, autoscaling will consider it unhealthy, terminate it and launch a replacement. Need to suspend autoscaling first.

OpsWorks

Overview and components

  • Declarative desired state engine
    • Automate, monitor and maintain deployments
  • Cookbooks define recipes
  • AWS' implementation of Chef
    • Original Chef
    • AWS-bespoke orchestration components
  • Components
    • Stack
      • Set of resources that is managed as a group
      • Whole service stack
    • Layer
      • Represent and configure components of a stack
      • E.g. loadbalancer layer, app layer, db layer
      • Share common configuration elements
    • Instance
      • Units of compute within the platform
      • Must be associated with at least one layer
      • Can run
        • 24/7
        • Load-based
        • Time-based
    • Application
      • Applications that are deployed on one or more instances
      • Deployed through source code repo or S3
  • Recipes
    • Created in ruby, used to customize different layers
    • Run at stack lifecycle events
      • setup
        • Instance has finished booting
      • configure
        • Instance enters or leaves the online state
        • Elastic IP is associated or disassociated
        • Load balancer is attached or detached
        • Event is executed on all instances, not only the impacted one
      • deploy
        • Deploy command is run on an instance
      • undeploy
        • Undeploy command is run on an instance
        • App is deleted
      • shutdown
        • When instance is shutdown, before termination
        • Allows cleanup
  • Under the hood
    • OpsWorks agent
      • Configuration of machines
    • OpsWorks automation engine
      • Create, update & delete of various AWS components
      • Handles loadbalancing, autoscaling and autohealing
      • Supports lifecycle events

BerkShelf

  • Addresses a shortcoming of older OpsWorks versions: only one repository for recipes
  • Was added with Chef 11.10 support in OpsWorks and allows installing cookbooks from many repositories

TODO: Quickstart OpsWorks

CloudFormation

Overview

  • Allows to create and provision resources in a reusable template fashion
    • A CloudFormation template is a JSON or YAML formatted text file
  • Related resources are managed in a single unit called a stack
    • Controls lifecycle of managed resources
    • All the resources in a stack are defined by the stack's CloudFormation template
    • Stack has name & id
  • Two ways to update a stack
    • Direct update
      • Directly applies changes (if any)
    • Change set
      • Summary of proposed changes, can be applied or rejected
  • Will rollback stack if it fails to create (can be disabled via API/console)
  • A stack policy is an IAM-style policy document that governs which update actions are allowed on which stack resources

Templates

  • AWSTemplateFormatVersion
  • Description
  • Metadata
    • Details about the template
  • Parameters
    • Values to pass in right before template creation
      • Type
        • String, Number, List<Number>, CommaDelimitedList
        • AWS-specific types like AWS::EC2::KeyPair::KeyName
      • Description
      • Default Value
      • Allowed Values
      • Allowed Pattern
        • Validation per regular expression
      • MinLength/MaxLength
      • MinValue/MaxValue
    • Problem:
      • Usage of parameters might make it hard to instantiate stacks without human interaction
      • CloudFormation is able to auto-generate many resources attributes, e.g. name
  • Mappings
    • Maps keys to values (eg different values for different regions)
  • Conditions
    • Check values before deciding what to do
  • Resources
    • Creates resources. Only mandatory section in a template.
    • Can have Condition element to toggle creation
  • Outputs
    • Values to be exposed from the console or from API calls.
    • Can be used in a different stack (cross stack references)
    • Can be:
      • Constructed value
      • Parameter reference
      • Pseudo parameter
      • Output from a function like Fn::GetAtt or Ref
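The sections listed above fit together as follows. This is a minimal sketch that builds a template as a Python dictionary and serializes it to JSON; the logical names (WebServer, RegionMap), the parameter choices and the AMI IDs are placeholders made up for illustration, not values from a real account.

```python
import json

template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Single EC2 instance (illustrative sketch)",
    "Parameters": {
        "KeyName": {
            "Type": "AWS::EC2::KeyPair::KeyName",
            "Description": "Existing key pair for SSH access",
        },
        "InstanceType": {
            "Type": "String",
            "Default": "t2.micro",
            "AllowedValues": ["t2.micro", "m4.large"],
        },
    },
    "Mappings": {
        # Placeholder AMI IDs; real IDs differ per region.
        "RegionMap": {
            "us-east-1": {"AMI": "ami-00000000"},
            "eu-west-1": {"AMI": "ami-11111111"},
        }
    },
    "Resources": {
        "WebServer": {
            "Type": "AWS::EC2::Instance",
            "Properties": {
                "ImageId": {"Fn::FindInMap": ["RegionMap", {"Ref": "AWS::Region"}, "AMI"]},
                "InstanceType": {"Ref": "InstanceType"},
                "KeyName": {"Ref": "KeyName"},
            },
        }
    },
    "Outputs": {
        "InstanceId": {"Value": {"Ref": "WebServer"}},
        "PublicIp": {"Value": {"Fn::GetAtt": ["WebServer", "PublicIp"]}},
    },
}

template_json = json.dumps(template, indent=2)
print(template_json)
```

Only Resources is mandatory; the other sections exist to make the same template reusable across regions, environments and accounts.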

Intrinsic Functions

  • Used to pass in values that are not available until runtime
  • Usable in resource properties, metadata attributes, and update policy attributes (auto-scaling)
  • Ref
    • Returns the value of the specified parameter, or the physical ID of the specified resource (e.g. the instance ID)
  • Fn::GetAtt
    • Returns the value of an attribute from an object, either the default or the specified attribute
    • Object is either from the same or a nested template
  • Fn::Join
    • Joins a set of values into a single value separated by the specified delimiter
  • Fn::Sub
    • Substitutes variables in an input string with values that you specify
  • Fn::FindInMap
    • Returns the value corresponding to keys in a two-level map that is declared in the Mappings section
  • Fn::Select
    • Returns a single object from a list of objects by index
  • Fn::Base64
    • Provides encoding, converts from plain text into base64
  • Fn::GetAZs
    • Returns an array that lists Availability Zones for a specified region
    • If region is omitted return AZs from the region the template is applied in
  • Fn::ImportValue
    • Returns the value of an Output exported by another stack
  • Fn::Split
    • Split a string into a list of string values so that you can select an element from the resulting string list
  • Fn::If
    • Takes a list of arguments (condition name, value1, value2)
    • Returns value1 if the condition evaluates to true, value2 otherwise
  • Fn::And, Fn::Equals, Fn::Or, Fn::Not
    • Good for condition element
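
To make the semantics of a few of these functions concrete, here is a toy resolver in Python. It only imitates the documented behaviour of `Ref` (for parameters), `Fn::Join`, `Fn::Split`, `Fn::Select`, `Fn::FindInMap` and `Fn::If` on plain data; it is not how CloudFormation is implemented, and the sample parameters and mapping are made up.

```python
def resolve(node, params, mappings, conditions):
    """Evaluate a small subset of intrinsic functions on plain Python data."""
    if isinstance(node, list):
        return [resolve(item, params, mappings, conditions) for item in node]
    if not isinstance(node, dict) or len(node) != 1:
        return node
    (name, arg), = node.items()
    if name == "Fn::If":
        # The first argument is a condition *name*, not a value to resolve.
        cond_name, if_true, if_false = arg
        chosen = if_true if conditions[cond_name] else if_false
        return resolve(chosen, params, mappings, conditions)
    arg = resolve(arg, params, mappings, conditions)
    if name == "Ref":
        return params[arg]
    if name == "Fn::Join":
        delimiter, values = arg
        return delimiter.join(values)
    if name == "Fn::Split":
        delimiter, text = arg
        return text.split(delimiter)
    if name == "Fn::Select":
        index, values = arg
        return values[int(index)]
    if name == "Fn::FindInMap":
        map_name, top_key, second_key = arg
        return mappings[map_name][top_key][second_key]
    return node


# Hypothetical inputs for illustration.
params = {"Env": "prod", "AWS::Region": "eu-west-1"}
mappings = {"RegionMap": {"eu-west-1": {"AMI": "ami-22222222"}}}
conditions = {"IsProd": True}

name = resolve({"Fn::Join": ["-", ["app", {"Ref": "Env"}]]},
               params, mappings, conditions)
second = resolve({"Fn::Select": ["1", {"Fn::Split": [",", "a,b,c"]}]},
                 params, mappings, conditions)
ami = resolve({"Fn::FindInMap": ["RegionMap", {"Ref": "AWS::Region"}, "AMI"]},
              params, mappings, conditions)
size = resolve({"Fn::If": ["IsProd", "m5.large", "t2.micro"]},
               params, mappings, conditions)
```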

Backups & Recovery

AWS Services with automated backups

  • RDS

    • Backups

      • Transactional storage engine recommended as DB engine
      • Degrades performance if Multi-AZ is not enabled (backups are taken from the standby if enabled)
      • Deleting an instance deletes all automated backups
      • Backups are stored internally on S3
      • Point-in-time recovery (PITR), typically up to the last 5 minutes
    • Restoring

      • When restoring, only default parameters and security groups are associated with instance
      • Can change to different storage engine if closely related and enough space available
  • ElastiCache

    • Backups
      • Available to Redis cluster only
      • Taking snapshots can degrade performance, should be performed on a read replica
      • Backups are stored internally on S3
  • Redshift

    • Backups
      • Provides free storage equal to the storage capacity of the cluster
      • Snapshots can be automated or manual and are incremental
      • Backups are stored internally on S3
    • Restoring
      • Creates a new cluster and imports the data
  • EC2

    • Backups
      • No built-in automated backup solution
      • Snapshots of EBS volumes are incremental; taking a snapshot can degrade performance
      • Every snapshot will restore all data, even if older snapshots are deleted
      • Backups are stored internally on S3
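
Why an incremental snapshot can still restore all data after older snapshots are deleted is easiest to see in a toy model. The sketch below is only an illustration of the idea (snapshots reference shared blocks, and deleting a snapshot frees only blocks nobody else references); the class and its methods are hypothetical and say nothing about how EBS is implemented internally.

```python
class SnapshotStore:
    """Toy model: snapshots reference shared blocks instead of copying them."""

    def __init__(self):
        self.blocks = {}      # block id -> data, stored once
        self.snapshots = {}   # snapshot name -> {volume offset: block id}
        self._next_id = 0

    def take(self, name, volume, parent=None):
        """Snapshot `volume` (offset -> data); return how many blocks were copied."""
        previous = self.snapshots.get(parent, {})
        table, copied = {}, 0
        for offset, data in volume.items():
            old_id = previous.get(offset)
            if old_id is not None and self.blocks[old_id] == data:
                table[offset] = old_id          # unchanged: reuse existing block
            else:
                self.blocks[self._next_id] = data
                table[offset] = self._next_id   # changed: store a new block
                self._next_id += 1
                copied += 1
        self.snapshots[name] = table
        return copied

    def delete(self, name):
        """Delete a snapshot and free only blocks no other snapshot references."""
        del self.snapshots[name]
        in_use = {bid for table in self.snapshots.values() for bid in table.values()}
        for bid in list(self.blocks):
            if bid not in in_use:
                del self.blocks[bid]

    def restore(self, name):
        """Rebuild the full volume contents from a single snapshot."""
        return {offset: self.blocks[bid]
                for offset, bid in self.snapshots[name].items()}


store = SnapshotStore()
volume = {0: "a", 1: "b", 2: "c"}
first_copied = store.take("snap-1", volume)                    # full copy
volume[1] = "B"
second_copied = store.take("snap-2", volume, parent="snap-1")  # only the change
store.delete("snap-1")
restored = store.restore("snap-2")
```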

Disaster Recovery Scenarios

DR of on-prem infra

  • Use AWS as backup solution by storing VMs, snapshots and other data
  • 'Pilot light' - have bare minimum infra always ready and scale up as required
  • 'Hot standby' (aka 'multi site') - has everything ready to go

DR of cloud infra

  • Duplicate the environment from one region to another

DR of RDS data

  • Protection from multiple AZs being down
  • Reduce latency for global audience
  • Replica lag will most likely go up
    • Data transfer across regions is charged
    • May potentially run into bandwidth issues
  • Create read replica from existing DB instance, pick different region
    • Trigger setup process that will take some time

Storing log files and backups

  • Implement centralized logging

    • From there
      • Send to 3rd party tool for analysis
      • Backup to S3
        • 11 nines (99.999999999%) durability
        • Versioning
        • Lifecycle policies
  • Other logging options

    • S3 access logs
    • CloudTrail
    • CloudWatch

Security

Implement and Manage Security Policies

IAM

IAM is a global service that helps to securely control access to AWS resources.

  • Users hold credentials
  • Groups hold users, typically only provide permission to assume a role
  • Roles hold policies.
    • Can have trust relationships with trusted entities that can assume this role
  • Policies can be attached to users, groups or roles (preferred)
  • An instance profile is a container for an IAM role that you can use to pass role information to an EC2 instance when the instance starts.
  • Users and/or services assume roles

Policies

  • Any actions on resources that are not explicitly allowed are denied by default
  • Structure
    • E - effect (allow/deny)
      • What the effect will be when the user requests the specific action
    • P - principal (ARN)
      • The account or user who is allowed access to the actions and resources in the statement
      • IAM policies do not have a principal (because they are attached to users, groups or roles)
    • A - action or notaction
      • Describes the specific action or actions that will be allowed or denied
    • R - resource or notresource
      • Specifies the object or objects that the statement covers
    • C - condition
      • Specifies conditions for when a policy is in effect
  • Can use policy variables
    • aws:currentTime, aws:userid, ...
	{
		"Version": "2012-10-17",
		"Statement": [
			{
				"Effect": "Allow",
				"Action": "s3:ListAllMyBuckets",
				"Resource": "arn:aws:s3:::*"
			},
			{
				"Effect": "Allow",
				"Action": [
						"s3:ListBucket",
						"s3:GetBucketLocation"
				],
				"Resource": "arn:aws:s3:::productionapp"
			},
			{
				"Effect": "Allow",
				"Action": [
					"s3:GetObject",
					"s3:PutObject",
					"s3:DeleteObject"
				],
				"Resource": "arn:aws:s3:::productionapp/*"
			}
		]
	}
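
The "denied by default" rule and the E/A/R structure can be illustrated with a deliberately simplified evaluator. This sketch assumes only `Effect`, `Action` and `Resource` matter and ignores `Principal`, `Condition`, `NotAction` and `NotResource`; it also uses `fnmatchcase` for wildcards, which treats `[` specially where IAM does not. The function names are hypothetical.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    return value if isinstance(value, list) else [value]


def _matches(patterns, value):
    return any(fnmatchcase(value, pattern) for pattern in _as_list(patterns))


def is_allowed(policy, action, resource):
    """Simplified evaluation: explicit Deny > explicit Allow > implicit deny."""
    allowed = False
    for statement in _as_list(policy["Statement"]):
        if not (_matches(statement["Action"], action)
                and _matches(statement["Resource"], resource)):
            continue
        if statement["Effect"] == "Deny":
            return False          # an explicit deny always wins
        allowed = True
    return allowed                # stays False if nothing matched


policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:ListAllMyBuckets",
         "Resource": "arn:aws:s3:::*"},
        {"Effect": "Allow", "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
         "Resource": "arn:aws:s3:::productionapp"},
        {"Effect": "Allow",
         "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
         "Resource": "arn:aws:s3:::productionapp/*"},
    ],
}

# Same policy plus one explicit deny, to show that deny overrides allow.
policy_with_deny = {
    "Version": "2012-10-17",
    "Statement": policy["Statement"] + [
        {"Effect": "Deny", "Action": "s3:DeleteObject",
         "Resource": "arn:aws:s3:::productionapp/*"},
    ],
}
```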

IAM Policies

  • Managed policies (the new way)
    • Can be attached to multiple users, groups and roles
    • AWS managed policies
      • Updated by AWS when new APIs are released
    • Customer managed policies
  • Inline policies (the old way)

IAM roles and EC2

  • Create an IAM role.

    • Define which accounts or AWS services can assume the role.
      • EC2 here, could be other services
    • Define which API actions and resources the application can use after assuming the role.
    • Specify the role when you launch your instance, or attach the role to a running or stopped instance.
    • Have the application retrieve a set of temporary credentials and use them.
  • Only one role can be assigned to an EC2 instance, and all applications share the same role and permissions
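
The first two steps (who may assume the role, and what the role may do) correspond to two separate policy documents. A minimal sketch as Python dicts follows; the bucket name `example-app-bucket` is a placeholder, and attaching these documents to a role is not shown.

```python
import json

# Trust policy: which principal may assume the role (here the EC2 service).
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "ec2.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# Permissions policy: what the application may do after assuming the role.
# The bucket name is a hypothetical placeholder.
permissions_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-app-bucket/*",
        }
    ],
}

trust_policy_json = json.dumps(trust_policy, indent=2)
permissions_policy_json = json.dumps(permissions_policy, indent=2)
```

Note that only the trust policy has a `Principal`; the permissions policy is attached to the role and therefore needs none.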

S3 IAM and bucket policy concepts

Defaults

  • Bucket is owned by the AWS account that created it
    • Bucket ownership is not transferable
  • Bucket owner gets full permission (ACL)
  • The person paying the bills always has full control.
  • A person uploading an object into a bucket owns it by default.

Bucket policies (resource level)

  • Specify what actions are allowed or denied for which principals on the bucket that the policy is attached to
  • Attached only to S3 buckets. Can however affect objects in buckets.
  • Contains principal element (unnecessary for IAM policies)
  • Use if you’re more interested in “Who can access this S3 bucket?”
  • Easiest way to grant cross-account permissions for all s3:* permission. (Cannot do this with ACLs.)
  • Explicit deny in bucket policy overrides explicit allow in IAM policy
  • Defined as JSON
{
"Version":"2012-10-17",
"Statement":
  [
    {
      "Sid":"PutObjectAcl",
      "Effect":"Allow",
      "Principal":
      {
        "AWS":
          [
           "arn:aws:iam::111122223333:user/tom", "arn:aws:iam::444455556666:user/chris"
          ]
      },
      "Action":
        [
          "s3:PutObject",
          "s3:PutObjectAcl"
        ],
        "Resource":
        [
          "arn:aws:s3:::examplebucket/*"
        ]
    }
  ]
}

ACLs

  • Defined as XML. Legacy, not recommended any more.
  • Can
    • be attached to individual objects (bucket policies only bucket level)
    • control access to object uploaded into a bucket from a different account.
  • Cannot
    • have conditions
    • explicitly deny actions
    • grant permission to bucket sub-resources (eg. lifecycle or static website configurations)
  • Other than object ACLs there are bucket ACLs as well - only for writing access log objects to a bucket.
<?xml version="1.0" encoding="UTF-8"?>
<AccessControlPolicy xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Owner>
    <ID>*** Owner-Canonical-User-ID ***</ID>
    <DisplayName>owner-display-name</DisplayName>
  </Owner>
  <AccessControlList>
    <Grant>
      <Grantee xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
               xsi:type="CanonicalUser">
        <ID>*** Owner-Canonical-User-ID ***</ID>
        <DisplayName>display-name</DisplayName>
      </Grantee>
      <Permission>FULL_CONTROL</Permission>
    </Grant>
  </AccessControlList>
</AccessControlPolicy> 

IAM policies (user level)

  • IAM policies (in general) specify what actions are allowed or denied on what AWS resources
  • Attached to IAM users, groups, or roles (so they cannot grant access to anonymous users)
  • Use if you’re more interested in “What can this user do in AWS?”
| ARN | Matches |
| --- | --- |
| arn:partition:service:region:namespace:relative-id | General format, e.g. arn:aws:s3:::mybucket |
| arn:aws:s3:::* | All buckets and objects in account |
| arn:aws:s3:::mybucket | mybucket |
| arn:aws:s3:::mybucket/* | All objects in mybucket |
| arn:aws:s3:::mybucket/mykey | mykey in mybucket |
| arn:aws:s3:::mybucket/developers/${aws:username}/ | Folder matching the accessing user's name |
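
The ARN layout can be checked with a few lines of Python. This is a minimal sketch: `parse_arn` and `expand_username` are hypothetical helper names, and the substitution only mimics what the `${aws:username}` policy variable does at evaluation time.

```python
from typing import NamedTuple


class Arn(NamedTuple):
    partition: str
    service: str
    region: str
    namespace: str      # account ID; empty for S3 bucket ARNs
    relative_id: str


def parse_arn(arn: str) -> Arn:
    """Split an ARN into the five fields that follow the literal 'arn' prefix."""
    parts = arn.split(":", 5)   # the relative-id may itself contain ':'
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not a valid ARN: {arn!r}")
    return Arn(*parts[1:])


def expand_username(resource: str, username: str) -> str:
    """Substitute the ${aws:username} policy variable (illustration only)."""
    return resource.replace("${aws:username}", username)


parsed = parse_arn("arn:aws:s3:::mybucket/mykey")
home = expand_username("arn:aws:s3:::mybucket/developers/${aws:username}/", "alice")
```

For S3, region and namespace are empty, which is why bucket ARNs contain three consecutive colons.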

CloudFront

  • Can use CloudFront Origin Access Identity to restrict access to S3 objects

Ensure Data Integrity and Access Controls when Using the AWS Platform

MFA

  • Should be turned on for all console access

  • Can be enabled for API access as well

    • The administrator configures an AWS MFA device for each user who needs to make API requests that require MFA authentication. This process is described at Enabling MFA Devices.
    • The administrator creates policies for the users that include a Condition element that checks whether the user authenticated with an AWS MFA device.
    • The user calls one of the AWS STS API operations that support the MFA parameters (AssumeRole or GetSessionToken, depending on the scenario for MFA protection). As part of the call, the user includes the device identifier for the device that's associated with the user. The user also includes the time-based one-time password (TOTP) that the device generates. In either case, the user gets back temporary security credentials that the user can then use to make additional requests to AWS.
    • This is not supported by all services (supported by SQS, SNS, S3)
  • MFA Delete can be enabled by the root account (bucket owner) to require MFA before permanently deleting an object

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"AWS": ["ALICE", "BOB"]},
    "Action": [ "s3:PutObject", "s3:DeleteObject" ],
    "Resource": ["arn:aws:s3:::Alice-Bucket/*"],
    "Condition": {"Bool": {"aws:MultiFactorAuthPresent": "true"}}
  }]
}
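
The time-based one-time password mentioned above is a standardized algorithm (RFC 6238). The following is a minimal sketch of the HMAC-SHA1 variant using only the Python standard library; it shows how such a code is derived from a shared secret and the current time, and it makes no claim about how a particular AWS MFA device is provisioned. The function name `totp` is chosen for this example.

```python
import hashlib
import hmac
import struct


def totp(secret: bytes, unix_time: int, digits: int = 6, step: int = 30) -> str:
    """Time-based one-time password per RFC 6238 (HMAC-SHA1 variant)."""
    # The moving factor is the number of `step`-second intervals since epoch.
    counter = struct.pack(">Q", unix_time // step)
    digest = hmac.new(secret, counter, hashlib.sha1).digest()
    # Dynamic truncation as defined in RFC 4226.
    offset = digest[-1] & 0x0F
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


# Test secret from RFC 6238, Appendix B (not a real credential).
rfc_secret = b"12345678901234567890"
code_at_59 = totp(rfc_secret, 59, digits=8)
```

Because the code changes every 30 seconds, the value passed to STS is only valid for a short window.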

Secure Token Service (STS)

  • Grants temporary access to authenticated users
    • IAM users
    • Web-based identity providers (google, facebook, ...)
    • Organization's existing identity system
  • Returns temporary credentials that expire after some time:
    • Access key (access key ID and secret access key)
    • Session token

Terms

  • Federation
    • Trust relationship between identity provider and AWS
  • Identity broker
    • Broker in charge of mapping user to the right set of credentials
  • Identity store
    • E.g. Google or Facebook
  • Identities
    • Users

Scenarios

  • Temporary credentials with EC2
    • Assign IAM role to instance
    • Get temp credentials from instance metadata
  • Temporary credentials with SDK
    • Call AssumeRole, extract temp credentials
  • Options for temporary credentials with API calls
    • Sign request with temp credentials
    • Add the session token to the request (header or query string)
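
Since temporary credentials expire, an application that fetches them itself has to refresh them in time. The sketch below checks the expiry of a credentials document; the field names (`AccessKeyId`, `SecretAccessKey`, `Token`, `Expiration`) are assumed to follow the shape of the instance metadata response, all values are placeholders, and the five-minute margin is an arbitrary choice.

```python
import json
from datetime import datetime, timedelta, timezone


def needs_refresh(credentials_json: str, now: datetime,
                  margin: timedelta = timedelta(minutes=5)) -> bool:
    """Return True if the temporary credentials expire within `margin`."""
    credentials = json.loads(credentials_json)
    expiration = datetime.strptime(
        credentials["Expiration"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return now >= expiration - margin


# Placeholder document; none of these values are real credentials.
sample = json.dumps({
    "AccessKeyId": "ASIAEXAMPLE",
    "SecretAccessKey": "example-secret",
    "Token": "example-session-token",
    "Expiration": "2018-09-01T12:00:00Z",
})

early = needs_refresh(sample, datetime(2018, 9, 1, 11, 0, tzinfo=timezone.utc))
late = needs_refresh(sample, datetime(2018, 9, 1, 11, 58, tzinfo=timezone.utc))
```

The AWS SDKs normally handle this refresh automatically when a role is attached to the instance.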

Shared responsibility model

  • Shared responsibility environment
  • AWS is responsible for:
    • Server/Host level and below
    • Physical environment security
    • Hardware decommissioning
    • Traffic security (Networks, ACLs, SSL, DDOS-protection)
    • EC2 hypervisor isolation
  • User is responsible for:
    • IAM
    • MFA
    • Password/key-rotation
    • Access advisor (shows used permissions)
    • Trusted advisor (validates best practices)
    • Security groups
    • ACL (resource based policy)
    • VPC

AWS and IT Audits

  • AWS performs self audits of changes to key services to monitor quality, maintain high standards, and facilitate continuous improvement of the change management process
  • For audits, AWS provides:
    • Security of the cloud
    • Information regarding their global infrastructure
    • From the host operating system and virtualization layer down to the physical security of facilities
    • Annual certifications and reports (like the Service Organization Control (SOC) reports, ISO 27001 certification, PCI assessments)
  • For audits, the customer provides:
    • Security in the cloud
    • Anything their organization puts on (or connects to) their AWS assets. Examples: guest operating system, apps on virtual machine instances, objects in S3, databases like RDS, etc.

Networking

Route53 Routing Policies

  • Simple
  • Weighted
  • Latency
  • Failover
  • Geolocation

DNS Failover

  • Can set up health checks for endpoints or domains from within Route 53
    • Route 53 has health checkers in locations around the world. When you create a health check that monitors an endpoint, health checkers start to send requests to the endpoint that you specify to determine whether the endpoint is healthy.
    • Evaluate target health
  • DNS entries are then associated with health checks and can be configured to fail over as well (1 primary and n secondary record sets)

Weighted

  • Control distribution of traffic with DNS entries
    • This can be based on a certain percentage
    • Set routing policy to weighted (instead of failover)
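
The percentage each record receives follows from its weight divided by the sum of all weights. A minimal sketch (the record names `blue` and `green` are placeholders, and the special case where every weight is zero is not modelled):

```python
def traffic_share(weights):
    """Expected fraction of DNS responses per record: weight / sum of weights."""
    total = sum(weights.values())
    if total == 0:
        # The all-zero case is intentionally not modelled in this sketch.
        raise ValueError("at least one record needs a non-zero weight")
    return {name: weight / total for name, weight in weights.items()}


shares = traffic_share({"blue": 3, "green": 1})
```

With weights 3 and 1, three quarters of the responses point at `blue`, which is a common setup for gradually shifting traffic to a new deployment.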

Latency-based

  • Control distribution of traffic based on latency.

VPC Essentials

  • Provisions a logically isolated section of the AWS cloud
  • Spans over all AZs in a region
  • Allows creating layered architectures
  • Shared or dedicated tenancy (exclusive hardware or not)
  • Security groups and subnet network ACLs
  • Ability to extend on-premise network to cloud

Default VPC (Amazon specific)

  • Gives easy access to a VPC without having to configure it from scratch
  • Has a default subnet in each AZ and an internet gateway attached to the VPC
  • Each instance launched automatically receives a public IP (very different to non-default VPC)
  • Cannot be restored if deleted (a new default VPC can be created instead)

Non-default VPC (regular VPC)

  • Only has private IP addresses
  • Resources only accessible through Elastic IP, VPN or internet gateways
  • Does not have an internet gateway attached by default

VPC Peering

  • Connect VPCs through direct network routing
  • Can occur between VPCs in the same or in different accounts; inter-region peering is supported as well (originally same region only)
  • Allows instances to communicate with each other as if they were in the same network
  • CIDRs must not overlap
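
Whether two VPCs can be peered with respect to their address ranges can be checked with the standard `ipaddress` module. A minimal sketch; `can_peer` is a hypothetical helper that only tests the CIDR rule, not account or region constraints.

```python
import ipaddress


def can_peer(cidr_a: str, cidr_b: str) -> bool:
    """VPC peering requires that the two CIDR blocks do not overlap."""
    return not ipaddress.ip_network(cidr_a).overlaps(ipaddress.ip_network(cidr_b))


distinct = can_peer("10.0.0.0/16", "10.1.0.0/16")
nested = can_peer("10.0.0.0/16", "10.0.128.0/17")
```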

VPC Scenarios

  • VPC with private subnet only -> single tier apps
  • VPC with public and private subnets -> layered apps
  • VPC with public, private subnets and hardware connected VPN -> extending apps to on-premise
  • VPC with private subnets and hardware connected VPN -> extended VPN

Components

  • Subnet
    • In exactly one AZ
    • If a subnet doesn't have a route to the Internet gateway, it's known as a private subnet
    • Instances receive
      • Private IP address
      • Internal DNS hostname
    • If traffic is routed to an Internet gateway, the subnet is known as a public subnet
    • Instances receive
      • Public IP address
      • External DNS hostname
    • EC2 instances are launched into subnets
    • Use ssh-agent forwarding to connect from public to private instances
    • Sometimes grouped into Subnet Groups, e.g. for caching or DB. Typically across AZs
  • Route Table
    • Contains a set of rules, called routes, that determine where network traffic is directed
    • Each VPC automatically comes with a main route table that can be configured
    • Each subnet in a VPC must be associated with a route table; the table controls the routing for the subnet. A subnet can only be associated with one route table at a time, but multiple subnets can be associated with the same route table
    • Each route in a table specifies a destination CIDR and a target
    • Every route table contains a local route for communication within the VPC
    • Can have a default route 0.0.0.0/0 to route everything that doesn't have a specific rule
  • Elastic IP
    • Static IPv4 address mapped to an instance or network interface
    • If attached to network interface it's decoupled from the instance's lifecycle
    • Routes to private IP address of instance
    • Can be remapped in case of failure.
    • For use in a specific region only
    • Can only map to instances in public subnets
  • Gateways
    • Internet Gateway
      • Horizontally scaled, redundant, and highly available VPC component that allows communication between instances in a VPC and the internet
      • Provides a target in VPC route tables for internet-routable traffic
      • Performs network address translation (NAT) for instances that have been assigned public IPv4 addresses
    • Virtual Private Gateway
      • Has VPN connection to customer gateway attached
      • Serves as VPN concentrator on the Amazon side of the VPN connection
    • Customer Gateway
      • A physical device or software application on your side of the VPN connection
  • NAT
    • NAT Instances
      • Manually configured instance from a NAT AMI
    • NAT Gateway
      • AWS-managed service
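
How a route table with a local route and a default route picks a target can be sketched as a longest-prefix match: the most specific route that contains the destination wins. The table contents and the target name `igw-example` are placeholders.

```python
import ipaddress


def select_route(route_table, destination_ip):
    """Return the target of the most specific route that matches the address."""
    address = ipaddress.ip_address(destination_ip)
    matches = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in route_table
        if address in ipaddress.ip_network(cidr)
    ]
    if not matches:
        return None   # no route: the traffic cannot leave the subnet
    return max(matches, key=lambda match: match[0].prefixlen)[1]


# Public subnet: local route plus default route to an internet gateway.
public_route_table = [("10.0.0.0/16", "local"), ("0.0.0.0/0", "igw-example")]
# Private subnet: local route only.
private_route_table = [("10.0.0.0/16", "local")]

inside = select_route(public_route_table, "10.0.1.25")
outside = select_route(public_route_table, "93.184.216.34")
blocked = select_route(private_route_table, "93.184.216.34")
```

The only difference between the public and the private subnet in this sketch is the presence of the `0.0.0.0/0` route to the internet gateway.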

Security

Network ACL

  • Subnet level, acting as firewall
  • Rules for inbound and outbound traffic
  • Rules have numbers and are evaluated from low to high, first matching rule wins, others are not evaluated
  • Stateless
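
The "lowest number first, first match wins" evaluation can be sketched as follows. Rules are modelled as hypothetical `(number, cidr, (low_port, high_port), action)` tuples for one direction only; because network ACLs are stateless, return traffic would need its own rule in the opposite direction, which this sketch does not model.

```python
import ipaddress


def evaluate_nacl(rules, source_ip, port):
    """Return 'allow' or 'deny' from the lowest-numbered matching rule."""
    address = ipaddress.ip_address(source_ip)
    for number, cidr, port_range, action in sorted(rules):
        low, high = port_range
        if address in ipaddress.ip_network(cidr) and low <= port <= high:
            return action   # first match wins, later rules are not evaluated
    return "deny"           # nothing matched: traffic is denied


inbound_rules = [
    (200, "0.0.0.0/0", (443, 443), "allow"),
    (100, "203.0.113.0/24", (0, 65535), "deny"),
]

blocked_range = evaluate_nacl(inbound_rules, "203.0.113.7", 443)
https_ok = evaluate_nacl(inbound_rules, "198.51.100.7", 443)
ssh_denied = evaluate_nacl(inbound_rules, "198.51.100.7", 22)
```

Rule 100 is evaluated before rule 200, so the blocked range is denied even though a later rule would allow HTTPS from anywhere.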

Security Groups

  • Acts as a virtual firewall to control inbound and outbound traffic to instances
  • Acts on instance level, not subnet level
  • Rules for inbound and outbound traffic
  • Stateful - will always allow response to (allowed) outbound traffic
  • Can refer to other security group, e.g. allow traffic from there

Structure & packet flow

  • VPC (has CIDR)

    • Gateway (Internet or VPN)
    • Routes (one per subnet, can be shared)
    • Network ACL (one per subnet, can be shared)
    • Subnets (CIDRs match VPC's CIDR)
    • Security Group (on VPC level)
    • Instance (needs public IP for internet communication, either ELB or Elastic IP)
  • Flow from internet

    • Internet Gateway
    • VPC Router (routes into desired subnet)
    • Route Table (of that subnet)
    • NACL
    • Security Group
    • Instance
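
That subnet CIDRs have to fit inside the VPC's CIDR (and must not overlap each other) can be validated before anything is created. A minimal sketch with a hypothetical helper; it requires Python 3.7 or newer for `subnet_of`.

```python
import ipaddress


def validate_subnets(vpc_cidr, subnet_cidrs):
    """Return a list of problems: subnets outside the VPC range or overlapping."""
    vpc = ipaddress.ip_network(vpc_cidr)
    subnets = [ipaddress.ip_network(cidr) for cidr in subnet_cidrs]
    problems = []
    for index, subnet in enumerate(subnets):
        if not subnet.subnet_of(vpc):
            problems.append(f"{subnet} is outside {vpc}")
        for other in subnets[index + 1:]:
            if subnet.overlaps(other):
                problems.append(f"{subnet} overlaps {other}")
    return problems


good = validate_subnets("10.0.0.0/16", ["10.0.1.0/24", "10.0.2.0/24"])
bad = validate_subnets(
    "10.0.0.0/16", ["10.0.1.0/24", "10.0.1.128/25", "192.168.0.0/24"]
)
```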

Connection To On-prem Network/Direct Connect

  • VPC
    • (has attached) Virtual Private Gateway
    • (has attached) VPN Connection
    • (has attached) Customer Gateway

TODO: VPN vs direct connect. Can I use VPN instead of DC?

Limits:

| Resource | Limit |
| --- | --- |
| VPCs per region | 5 |
| Subnets per VPC | 200 |
| Customer gateways per region | 50 |
| Virtual private gateways per region | 5 |
| Virtual private gateways per VPC | 1 |
| Internet gateways per region | 5 |
| Elastic IPs per account per region | 5 |
| VPN connections per region | 50 |
| Route tables per region | 200 |
| Security groups per region | 500 |

Accessing the OS

  • Services that allow access to the underlying OS
    • EC2
    • ECS
    • EB (Elastic Beanstalk)
    • EMR (Elastic MapReduce)
    • OpsWorks
  • Services that hide the OS away (managed services)
    • DynamoDB
    • RDS

SQS

  • Default message retention period: 4 days (max 14 days)
  • DelaySeconds will delay a message appearing in the queue
  • Setting WaitTimeSeconds will enable long polling (can be more cost efficient)

DynamoDB

  • Prefix partition key with hash to enforce even distribution of IO across many partitions
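
One way to read this tip is "write sharding": a low-cardinality partition value (e.g. a date) gets a hash-derived prefix so that writes spread over several partitions. The sketch below is an interpretation under that assumption; the helper names, the `#` separator, the shard count and the sample values are all made up for illustration.

```python
import hashlib


def sharded_partition_key(partition_value: str, item_id: str,
                          shard_count: int = 10) -> str:
    """Prefix the partition value with a shard number derived from the item ID."""
    digest = hashlib.sha256(item_id.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:4], "big") % shard_count
    return f"{shard}#{partition_value}"


def all_shard_keys(partition_value: str, shard_count: int = 10) -> list:
    """Every key that has to be queried to read one logical partition."""
    return [f"{shard}#{partition_value}" for shard in range(shard_count)]


key = sharded_partition_key("2018-09-01", "order-4711")
read_keys = all_shard_keys("2018-09-01", 3)
```

The trade-off is on the read side: reading everything for one date means querying each shard key and merging the results.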