Build environment such that we have a limited run length and diverse initial states #8

@WardLT

Description

Going from a dimer up to a 20-molecule cluster is a really large stretch. As @kmherman is finding, our agent tends to learn policies that add water molecules quickly so that it reaches the end of the episode (where it receives the reward) as fast as possible. One route to avoiding this "rush to the finish" behavior is to introduce rewards during the episode (#5). Another we could try is going in smaller steps.

A few different approaches could work for this:

  1. Pick a random cluster at the beginning of an episode, and add a fixed number of bonds or waters during the episode.
    • Challenge 1: How do we reward the end of the episode?
      • Just use the same reward function?
    • Challenge 2: Where do we get the structures to start with?
      • Option 1: Keep a priority queue of the best structures for each cluster size. Pick one at random at the beginning of each episode
        • Best structures could be those with low energies, or those with high rewards based on the policy's value function
      • Option 2: Generate random graphs and alter from there.
      • Option 3: Use the known ground states from existing studies, do not alter during the RL learning
        • Problem is that there is no track to go from ground states at small sizes to larger ones without bond removal (our RL
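Option 1 above could be sketched roughly as follows. This is a minimal illustration, not code from this repository: the `StructureLibrary` class, its scoring convention (higher is better, e.g. negative energy or a value-function estimate), and the placeholder string structures are all hypothetical.

```python
import heapq
import random


class StructureLibrary:
    """Keep the top-k structures seen for each cluster size.

    Assumes higher score is better (e.g., negative energy, or the
    value function's estimate of a structure's worth).
    """

    def __init__(self, k=8):
        self.k = k
        self._heaps = {}   # cluster size -> min-heap of (score, tiebreak, structure)
        self._count = 0    # tie-breaker so structures themselves are never compared

    def add(self, size, score, structure):
        heap = self._heaps.setdefault(size, [])
        entry = (score, self._count, structure)
        self._count += 1
        if len(heap) < self.k:
            heapq.heappush(heap, entry)
        else:
            # Push the new entry and drop whichever is now the worst
            heapq.heappushpop(heap, entry)

    def sample(self, size, rng=random):
        """Pick a random stored structure of the requested size
        to start an episode from."""
        _, _, structure = rng.choice(self._heaps[size])
        return structure


# Episode setup: start from a random stored cluster of some size and
# run a short, fixed-length episode (e.g., add 2 waters) instead of
# building from dimer all the way to 20.
library = StructureLibrary(k=4)
for i in range(6):
    # Placeholder structures; scores decrease so later ones are worse
    library.add(3, score=-10.0 - i, structure=f"trimer-{i}")

start = library.sample(3)
steps_per_episode = 2  # fixed number of additions, then reward at the end
```

The priority queue bounds memory per size while letting the library improve as training discovers better structures; sampling at random (rather than always taking the single best) keeps the initial states diverse.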
