GC support #138

Open · amsukdu wants to merge 3 commits into main

Conversation

amsukdu commented Jun 16, 2019

  1. Added Gc class to sync.py
  2. Mounted to ExpirableDictStorage and PersistentDictStorage
  3. dict interface changed

Solves issue #115.

  • Although the Python collection blocks while GC-ing, I decided to go with multithreading. Thread creation is mutexed, so only one thread will be created at a time. I tested this carefully and it works fine.
  • The DEBUG variable in the Gc class is for debug logging. If you know a better way, please let me know.
  • I left the documentation update to a more suitable person 😄
  • Quick question: is this strategy commonly used in CS? I am not a GC expert and only know a couple of strategies such as mark & sweep. I was wondering because, based on your strategy, "live" entries can be removed instead of the "expired" ones, which seems quite unfair to me. I could build a priority queue to remove expired entries first; build time would be O(n) to O(n·log n), so it wouldn't be that harmful (see the sketch below).
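
A rough sketch of that priority-queue idea (assuming the backend is a plain dict mapping key -> (expired_time, value); the names here are illustrative, not the PR's actual code):

import heapq

def trim_expired_first(backend, target_size):
    """Shrink `backend` to `target_size`, evicting entries in expiration order."""
    if len(backend) <= target_size:
        return
    # O(n) heap build keyed by expiration time: already-expired entries
    # surface before live ones, so they are removed first.
    heap = [(expired_time, key) for key, (expired_time, _) in backend.items()]
    heapq.heapify(heap)
    while len(backend) > target_size and heap:
        _, key = heapq.heappop(heap)  # O(log n) per eviction
        backend.pop(key, None)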

1. Added Gc class to sync.py
2. Mounted to ExpirableDictStorage and PersistentDictStorage
3. dict interface changed
@@ -463,6 +463,7 @@ def dict(
             storage_class = fsync.PersistentDictStorage
         else:
             storage_class = fsync.ExpirableDictStorage
+        storage_class.maxsize = maxsize
Owner:
This code looks like it changes the maxsize value of the storage class itself. It will cause side effects for other calls that use a different value.

Author:
I assumed that
if storage_class is None:
means we're in a 'default' mode, so we're in a somewhat private zone... what about _maxsize?

Owner:
The point is not about the naming rule. The following code will cause a problem.

# maxsize of f is 512 here because fsync.PersistentDictStorage.maxsize is 512
@ring.dict(maxsize=512)
def f():
    ...

# maxsize of both f and g is 128 now because fsync.PersistentDictStorage.maxsize is 128
@ring.dict(maxsize=128)
def g():
    ...

We shouldn't set any kind of class variable for decorator parameters.
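
One direction that avoids the shared class attribute (a sketch only; the constructor signature and super() call here are assumptions, not ring's actual code) is to keep maxsize per storage instance:

class ExpirableDictStorage(fbase.CommonMixinStorage, fbase.StorageMixin):
    def __init__(self, rope, backend, maxsize=128):
        super(ExpirableDictStorage, self).__init__(rope, backend)
        # Per-instance limit: @ring.dict(maxsize=512) and @ring.dict(maxsize=128)
        # each keep their own value instead of overwriting a shared class variable.
        self.maxsize = maxsize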

Author:
I totally missed that. What a shame. I'll figure out a better way.

def __init__(self, backend, target_size, expire_f=None):
    assert round(target_size) > 0, 'target_size has to be at least 1'
    assert expire_f is None or callable(expire_f), 'expire_f has to be function or None'
    assert isinstance(backend, type({})), 'backend has to be dict-like'
Owner:

Author:
Got it!


self._mutex.acquire()
if (len(self._backend) > self._target_size):
    WorkThread(self).start()
Owner:
Do we gain any advantage by using threads? I think gc for a dict is a CPU-bound job and there is no benefit from threading. Tell me if I missed something.

Author:
Yes, you're right. The cache itself is in-memory, so there won't be any heavy I/O. However, if the dict has enough entries, say 1,000,000, set_value (or some other method) can block for quite a while. At least we can context-switch if we go with multithreading.
This is just my opinion and I have no issue with removing the threading.

Owner (@youknowone, Jun 17, 2019):
Let's remove it. Even if the size matters, I think there are 2 reasons to regard threading as overkill.

  1. If the backend required I/O, it would make sense. But for a dict, even 1,000,000 iterations is a matter of microseconds, and in most cases with reasonable functions it will be far fewer than 1,000,000.
  2. If we add locking correctly, gc blocks any kind of adding to and removing from the dict - which means the caller still has to wait for the gc job. A synchronous sketch follows below.
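
For illustration, a synchronous version of that job (the _backend/_target_size names mirror the snippet above; the trimming policy shown is only a sketch, not ring's code):

import threading
import time

class Trimmer(object):
    def __init__(self, backend, target_size):
        self._backend = backend          # dict of key -> (expired_time, value)
        self._target_size = target_size
        self._mutex = threading.Lock()

    def trim(self):
        # Runs in the caller's thread. A worker thread would need this same
        # lock, so threading would not let callers skip the wait anyway.
        with self._mutex:
            if len(self._backend) <= self._target_size:
                return
            now = time.time()
            for key in [k for k, (t, _) in self._backend.items() if t <= now]:
                del self._backend[key]        # expired entries go first
            while len(self._backend) > self._target_size:
                self._backend.pop(next(iter(self._backend)))  # then arbitrary ones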

Author:
Got it!


class ExpirableDictStorage(fbase.CommonMixinStorage, fbase.StorageMixin):

    maxsize = 128
Owner:
This value should not be bound to the class

Author:
Again, I can't 100% follow your thoughts on maxsize.
My guess is you're worried that this value can be easily accessed and modified. My understanding is that ExpirableDictStorage & PersistentDictStorage are our implementation for the default call path, so we can control this value.

@@ -229,6 +301,11 @@ def set_value(self, key, value, expire):
         expired_time = _now + expire
         self.backend[key] = expired_time, value
 
+        if self.maxsize < len(self.backend):
+            if self._gc == None:
Owner:

We can revisit this part after settling the threading design of the gc.

youknowone (Owner) commented Jun 16, 2019

Thanks for the contribution!
I left a few comments, and here are responses to your questions.

  • I think the lock is a good idea, but I'm not sure about threading for the gc run.
  • DEBUG is OK for now. Maybe we need to add a logging module.
  • This is not a common GC in CS, because GC is normally about dynamically allocated memory, not about caches. We probably need to rename this to a more proper name. The strategy itself is a variation of a common one (usually random removal + linear scanning is used; a bare-bones sketch follows below). The requirements here are a little tricky because we want to expose the raw dict to the user. This is not the best strategy, but it is statistically good enough. When users need a better cache strategy, they will use @ring.lru rather than dict, so transparency is more important for the dict implementation.
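
A bare-bones sketch of that "random removing" style, purely for comparison (illustrative only, not ring's implementation):

import random

def evict_random(backend, target_size, sample_size=20):
    """Evict randomly sampled keys until the dict is back under target_size."""
    while len(backend) > target_size:
        for key in random.sample(list(backend), min(sample_size, len(backend))):
            if len(backend) <= target_size:
                break
            del backend[key]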

The CI failure is about formatting. You can check the result from https://travis-ci.org/youknowone/ring/builds/546264007?utm_source=github_status&utm_medium=notification

amsukdu (Author) commented Jun 17, 2019

Thanks again for your kind & careful advice.
I totally agree with your response. It is more like a trimmer, not a GC.
It would be great if we documented it correctly, like:

expired entries are prioritized, but ANY dict entry can be removed once the dict reaches maxsize.

The CI failure was from my own test. I'll correct it in the next update.

youknowone (Owner):
Trimmer could be the name too. I don't know the proper name for this kind of job, so if you have any suggestion or find a correct name for it, please apply it to the code. GC was just the expression used in the issue to make sure I wouldn't get confused later.

amsukdu (Author) commented Jun 23, 2019

@youknowone
I removed the threading and addressed all of the comments you raised. Please take a look!

amsukdu (Author) commented Jun 25, 2019

I'm working on the test_redis_hash failure.

@@ -638,7 +639,7 @@ class RingRope(RopeCore):
     def __init__(self, *args, **kwargs):
         super(RingRope, self).__init__(*args, **kwargs)
         self.user_interface = self.user_interface_class(self)
-        self.storage = self.storage_class(self, storage_backend)
+        self.storage = self.storage_class(self, storage_backend, maxsize)
Owner:
Is the redis_hash failure related to this change?

Author:
Yes, __init__ is overridden in RedisHashStorage. I'm looking at it right now, and it seems there are some workarounds to get it back to work, but I'm not sure which one is best.
The problem is that BaseStorage has an __init__ that ExpirableDictStorage & PersistentDictStorage adhere to, but RedisHashStorage overrides it (roughly as sketched below).
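
A simplified picture of the mismatch (class layout and signatures here are assumptions for illustration, not ring's exact code):

class BaseStorage(object):
    def __init__(self, rope, backend, maxsize=None):    # extended to carry maxsize
        self.rope = rope
        self.backend = backend
        self.maxsize = maxsize

class RedisHashStorage(BaseStorage):
    def __init__(self, rope, backend):                   # still the old two-argument form
        super(RedisHashStorage, self).__init__(rope, backend)

# RingRope now calls storage_class(self, storage_backend, maxsize); that works
# for storages using the base __init__, but the override raises TypeError here.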

Can you tell me your opinion?

amsukdu (Author) commented Jul 4, 2019

@youknowone
I fixed the test failure on my own. I think this approach is quite reasonable. Please take a look.

youknowone (Owner):
Sorry, I was busy last week. I will check your recent changes soon.
