TCP socket poisoned by agent restart

# Problem
When using EMF in an agent-based environment with a TCP endpoint, all metrics are silently dropped if the agent is restarted after the first metric publish.

# Reproduction

## Minimal Code
```
import java.util.concurrent.ExecutionException;
import software.amazon.cloudwatchlogs.emf.config.Configuration;
import software.amazon.cloudwatchlogs.emf.config.EnvironmentConfigurationProvider;
import software.amazon.cloudwatchlogs.emf.environment.Environment;
import software.amazon.cloudwatchlogs.emf.environment.EnvironmentProvider;
import software.amazon.cloudwatchlogs.emf.environment.Environments;
import software.amazon.cloudwatchlogs.emf.logger.MetricsLogger;

public class Program {
  
  public static void main(String[] args) throws InterruptedException, ExecutionException {
    final Configuration config = EnvironmentConfigurationProvider.getConfig();
    config.setEnvironmentOverride(Environments.Agent);

    final Environment environment = new EnvironmentProvider()
        .resolveEnvironment()
        .get();

    final MetricsLogger metricsLogger = new MetricsLogger(environment);
    metricsLogger.putMetric("Test", 1.0d);
    metricsLogger.flush();

    System.out.println("Sleeping, restart agent socket now...");
    Thread.sleep(10_000);

    final MetricsLogger metricsLogger2 = new MetricsLogger(environment);
    metricsLogger2.putMetric("Test2", 1.0d);
    metricsLogger2.flush();


    environment.getSink()
        .shutdown()
        .join();
  }

}
```

## Steps
1. Start a fake agent listener with netcat - `nc -l 127.0.0.1 25888 -k`
2. Execute the provided minimal code as a command line program in a separate terminal
3. When `Sleeping, restart agent socket now...` is printed, Ctrl-C the netcat command and run it again - `nc -l 127.0.0.1 25888 -k`
4. Wait for the provided code to finish

## Expected behavior
Two metric entries are printed in the netcat terminal:
```
$ nc -l 127.0.0.1 25888 -k
{"_aws":{"Timestamp":1650461190933,"CloudWatchMetrics":[{"Namespace":"aws-embedded-metrics","Metrics":[{"Name":"Test","Unit":"None"}],"Dimensions":[["LogGroup","ServiceName","ServiceType"]]}],"LogGroupName":"Unknown-metrics"},"LogGroup":"Unknown-metrics","ServiceName":"Unknown","Test":1.0,"ServiceType":"Unknown"}
^C
$ nc -l 127.0.0.1 25888 -k
{"_aws":{"Timestamp":1650461201045,"CloudWatchMetrics":[{"Namespace":"aws-embedded-metrics","Metrics":[{"Name":"Test2","Unit":"None"}],"Dimensions":[["LogGroup","ServiceName","ServiceType"]]}],"LogGroupName":"Unknown-metrics"},"LogGroup":"Unknown-metrics","ServiceName":"Unknown","ServiceType":"Unknown","Test2":1.0}
```

## Actual behavior
The second metric entry is never received by netcat:
```
$ nc -l 127.0.0.1 25888 -k
{"_aws":{"Timestamp":1650460771166,"CloudWatchMetrics":[{"Namespace":"aws-embedded-metrics","Metrics":[{"Name":"Test","Unit":"None"}],"Dimensions":[["LogGroup","ServiceName","ServiceType"]]}],"LogGroupName":"Unknown-metrics"},"LogGroup":"Unknown-metrics","ServiceName":"Unknown","Test":1.0,"ServiceType":"Unknown"}
^C
$ nc -l 127.0.0.1 25888 -k
```

# Details
This appears to be caused by how the TCP communication is handled, specifically there is no mechanism that actually verifies that a given log event is actually delivered to the CloudWatch agent.  TCP sockets (and by extension Java TCP sockets) do not make any guarantees about when the the bytes written to it are sent.

If the receiving side closes the socket in a disorderly manner, our Socket won't recognize that it is closed until an operation on it goes through all of the TCP retries, [which takes upwards of 15 minutes by default](https://pracucci.com/linux-tcp-rto-min-max-and-tcp-retries2.html).

In general, the recommended way to deal with disorderly disconnects is to have the application protocol acknowledge each message sent.  If the reading the expected acknowledgement times out (via SO_TIMEOUT), the connection is closed on the client side and handled by the application.

# Workaround
To work around this, you can switch to the UDP endpoint by setting AgentEndpoint to `udp://127.0.0.1:25888`.  This still doesn't confirm that events are written; however, it does not get poisoned by severed connections.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

TCP socket poisoned by agent restart #99

Problem

Reproduction

Minimal Code

Steps

Expected behavior

Actual behavior

Details

Workaround

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

TCP socket poisoned by agent restart #99

Description

Problem

Reproduction

Minimal Code

Steps

Expected behavior

Actual behavior

Details

Workaround

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions