INTPYTHON-676: Adding security and optimization to cache collections #343
base: main
Changes from all commits: d750611, ae3986a, c874952, 0d4cf8b, 29080ba, d158b13
@@ -1,33 +1,63 @@
import pickle
from datetime import datetime, timezone
from hashlib import blake2b
from hmac import compare_digest
from typing import Any, Optional, Tuple

from django.core.cache.backends.base import DEFAULT_TIMEOUT, BaseCache
from django.core.cache.backends.db import Options
from django.core.exceptions import SuspiciousOperation
from django.db import connections, router
from django.utils.functional import cached_property
from django.utils.crypto import pbkdf2
from pymongo import ASCENDING, DESCENDING, IndexModel, ReturnDocument
from pymongo.errors import DuplicateKeyError, OperationFailure
from django.conf import settings


class MongoSerializer:
    def __init__(self, protocol=None):
    def __init__(self, protocol=None, signer=None):
        self.protocol = pickle.HIGHEST_PROTOCOL if protocol is None else protocol
        self.signer = signer

    def dumps(self, obj):
        # For better incr() and decr() atomicity, don't pickle integers.
        # Using type() rather than isinstance() matches only integers and not
        # subclasses like bool.
        if type(obj) is int:  # noqa: E721
            return obj
        return pickle.dumps(obj, self.protocol)
    def _get_signature(self, data) -> Optional[bytes]:
        if self.signer is None:
            return None
        s = self.signer.copy()
        s.update(data)
        return s.digest()

    def loads(self, data):
        try:
            return int(data)
        except (ValueError, TypeError):
            return pickle.loads(data)  # noqa: S301
    def _get_pickled(self, obj: Any) -> bytes:
        return pickle.dumps(obj, protocol=self.protocol)  # noqa: S301

    def dumps(self, obj) -> Tuple[Any, bool, Optional[str]]:
        # Serialize the object to a format suitable for MongoDB storage.
        # The return value is a tuple of (data, pickled, signature).
        match obj:
            case int() | str() | bytes():
                return (obj, False, None)
Comment on lines +37 to +38

I think this is what you refer to as "optimization". I think any optimization should be done separately from security hardening to make that change clearer. However, I wonder if the benefit of not serializing str/bytes is mostly for the signing case. While it helps with CPU, it adds the extra "pickled" attribute for all keys in the cache. It might be better to limit this optimization to a new …

In my testing, unpickled … This is one of the optimizations that I made; however, it is the least impactful overall. The majority of the speed increases come from switching to directly using Python's … The pickled field adds 10 bytes to every cache request. I personally do not believe it to be a significant overhead for the performance difference, but if we change the field name to simply … While I would normally agree that copying existing, tested code is better, I believe that the redis code is severely limited by a traditional relational-database thought process. I do not believe it is taking advantage of the inherent performance boosts we can gain by using MongoDB's document model. While there are other performance limitations we are forced into due to the Django API (colocation of cached headers and page data), this is one that is easy to remediate because all code paths are within our own code.
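For context on the CPU point above, here is a minimal micro-benchmark sketch (not part of the PR; the payload size and loop count are arbitrary) contrasting a pickle round trip with the pass-through path the new dumps() takes for str/bytes values:

```python
# Illustrative micro-benchmark (not from the PR): a pickle round trip versus
# returning a bytes payload untouched, which is what the int/str/bytes fast
# path above amounts to.
import pickle
import timeit

payload = b"x" * 4096  # hypothetical cached value

pickled = timeit.timeit(
    lambda: pickle.loads(pickle.dumps(payload, pickle.HIGHEST_PROTOCOL)),
    number=100_000,
)
passthrough = timeit.timeit(lambda: payload, number=100_000)

print(f"pickle round trip: {pickled:.3f}s  pass-through: {passthrough:.3f}s")
```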
            case _:
                pickled_data = self._get_pickled(obj)
                return (pickled_data, True, self._get_signature(pickled_data) if self.signer else None)

    def loads(self, data: Any, is_pickled: bool, signature=None) -> Any:
        if is_pickled:
            try:
                if self.signer is not None:
                    # constant time compare is not required due to how data is retrieved
I'm familiar with the usage of …

The threat model for signing data is based on the assumption that a malicious actor has gained write access to the cache collection, but no access (or very low privilege) to the server Django is running on. If you feel a side-channel attack is a potential issue for this threat model, I am willing to reintroduce the constant-time comparison.

I would suggest (if only to avoid having to reason about these subtleties in the future) that you always use the constant-time comparison when comparing cryptographic hashes. It is theoretically possible that a non-constant-time equality check might allow someone to more easily craft a valid HMAC for a maliciously crafted payload.

Originally, I opted not to use … That said, I did also benchmark …

On a related note, Django's …

I implemented constant time compare using …

No, my point was that Django's implementation is just an alias …
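For reference, a small sketch of the two comparison styles under discussion, assuming a blake2b keyed digest like the one _get_signature() produces; the key and data literals are placeholders, and constant_time_compare is Django's thin wrapper over the standard-library primitive mentioned in the last reply:

```python
# Sketch (setup is assumed, not taken from the PR): verifying a blake2b keyed
# digest with a timing-safe comparison versus a plain equality check.
from hashlib import blake2b
from hmac import compare_digest

from django.utils.crypto import constant_time_compare  # wraps the stdlib compare

key = b"0" * 32          # stand-in for the derived signing key
data = b"pickled-bytes"  # stand-in for pickled cache data

signature = blake2b(data, key=key).digest()

# Timing-safe: runtime does not depend on where the first mismatching byte is.
assert compare_digest(signature, blake2b(data, key=key).digest())
assert constant_time_compare(signature, blake2b(data, key=key).digest())

# Plain `==` can short-circuit on the first mismatch, which is the side channel
# discussed above.
assert signature == blake2b(data, key=key).digest()
```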
                    if signature and compare_digest(signature, self._get_signature(data)):
                        return pickle.loads(data)  # noqa: S301
                    else:
                        raise SuspiciousOperation(f"Pickled cache data is missing signature")
                else:
                    return pickle.loads(data)
            except (ValueError, TypeError):
                # ValueError: Invalid signature
                # TypeError: Data wasn't a byte string
                raise SuspiciousOperation(f'Invalid pickle signature: {{"signature": {signature}, "data":{data}}}')
        else:
            return data


class MongoDBCache(BaseCache):
    pickle_protocol = pickle.HIGHEST_PROTOCOL
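To make the new serializer contract concrete, here is a short usage sketch of the (value, pickled, signature) tuple as implemented in this hunk; the import path and the literal key are assumptions, not taken from the PR's tests:

```python
# Usage sketch of the serializer API introduced above (names as in the diff).
from hashlib import blake2b

from django_mongodb_backend.cache import MongoSerializer  # assumed module path

signer = blake2b(key=b"k" * 32)  # illustrative key; the backend derives one with pbkdf2
serializer = MongoSerializer(signer=signer)

# int/str/bytes are stored as-is and carry no signature.
assert serializer.dumps("plain string") == ("plain string", False, None)

# Anything else is pickled, and signed when a signer is configured.
value, pickled, signature = serializer.dumps({"hits": 3})
assert pickled is True and signature is not None

# loads() verifies the signature before unpickling.
assert serializer.loads(value, pickled, signature) == {"hits": 3}
```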
@@ -39,6 +69,12 @@ class CacheEntry:
            _meta = Options(collection_name)

        self.cache_model_class = CacheEntry
        self._sign_cache = params.get("ENABLE_SIGNING", True)

        key = params.get("KEY", settings.SECRET_KEY)
        if len(key) == 0:
            key = settings.SECRET_KEY
        self._key = pbkdf2(key.encode(), self._collection_name.encode(), 100_000, digest=blake2b)

    def create_indexes(self):
        expires_index = IndexModel("expires_at", expireAfterSeconds=0)
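A standalone sketch of the per-collection key derivation added in this hunk, mirroring the pbkdf2(..., digest=blake2b) call above; the secret and collection name are placeholders:

```python
# Sketch of the signing-key derivation used above (inputs are placeholders; the
# call mirrors the diff).
from hashlib import blake2b

from django.utils.crypto import pbkdf2

secret = "example-secret-key"         # stands in for params["KEY"] or settings.SECRET_KEY
collection_name = "cache_collection"  # stands in for self._collection_name

# 100,000 PBKDF2 iterations, salted with the collection name, producing a
# blake2b-sized key that can be fed to blake2b(key=...).
derived_key = pbkdf2(secret.encode(), collection_name.encode(), 100_000, digest=blake2b)
signer = blake2b(key=derived_key)
print(len(derived_key))  # blake2b's default digest size (64 bytes)
```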
@@ -47,7 +83,10 @@ def create_indexes(self):

    @cached_property
    def serializer(self):
        return MongoSerializer(self.pickle_protocol)
        signer = None
        if self._sign_cache:
            signer = blake2b(key=self._key)
        return MongoSerializer(self.pickle_protocol, signer)

    @property
    def collection_for_read(self):
@@ -84,19 +123,30 @@ def get_many(self, keys, version=None):
        with self.collection_for_read.find(
            {"key": {"$in": tuple(keys_map)}, **self._filter_expired(expired=False)}
        ) as cursor:
            return {keys_map[row["key"]]: self.serializer.loads(row["value"]) for row in cursor}
            results = {}
            for row in cursor:
                try:
                    results[keys_map[row["key"]]] = self.serializer.loads(row["value"], row["pickled"], row["signature"])
                except SuspiciousOperation as e:
                    self.delete(row["key"])
                    e.add_note(f"Cache entry with key '{row['key']}' was deleted due to suspicious data")
                    raise e
            return results

    def set(self, key, value, timeout=DEFAULT_TIMEOUT, version=None):
        key = self.make_and_validate_key(key, version=version)
        num = self.collection_for_write.count_documents({}, hint="_id_")
        if num >= self._max_entries:
            self._cull(num)
        value, pickled, signature = self.serializer.dumps(value)
        self.collection_for_write.update_one(
            {"key": key},
            {
                "$set": {
                    "key": key,
                    "value": self.serializer.dumps(value),
                    "value": value,
                    "pickled": pickled,
                    "signature": signature,
Comment on lines +148 to +149

I think it's not good to add "pickled" and "signature" keys to all cache data when signing is disabled.

The …
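For the schema question above, this is roughly what stored documents would look like under the new field layout; the field names follow the diff, and the values are invented:

```python
# Illustrative cache documents under the new schema (values are made up).
import pickle
from datetime import datetime, timezone

plain_doc = {
    "key": ":1:greeting",
    "value": "hello",  # str/bytes/int stored as-is
    "pickled": False,
    "signature": None,
    "expires_at": datetime(2031, 1, 1, tzinfo=timezone.utc),
}

pickled_doc = {
    "key": ":1:user_42",
    "value": pickle.dumps({"id": 42}),  # raw pickle bytes
    "pickled": True,
    "signature": b"\x00" * 64,  # a real entry holds a 64-byte blake2b digest
    "expires_at": datetime(2031, 1, 1, tzinfo=timezone.utc),
}
```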
                    "expires_at": self.get_backend_timeout(timeout),
                }
            },
@@ -109,12 +159,15 @@ def add(self, key, value, timeout=DEFAULT_TIMEOUT, version=None):
        if num >= self._max_entries:
            self._cull(num)
        try:
            value, pickled, signature = self.serializer.dumps(value)
            self.collection_for_write.update_one(
                {"key": key, **self._filter_expired(expired=True)},
                {
                    "$set": {
                        "key": key,
                        "value": self.serializer.dumps(value),
                        "value": value,
                        "pickled": pickled,
                        "signature": signature,
                        "expires_at": self.get_backend_timeout(timeout),
                    }
                },
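Putting the new knobs together, a hypothetical CACHES entry; treating ENABLE_SIGNING and KEY as top-level keys of the cache configuration is an assumption inferred from the params.get(...) calls in this diff, not documented behavior:

```python
# Hypothetical settings.py entry exercising the new options (backend path and
# option placement are assumptions based on this diff).
CACHES = {
    "default": {
        "BACKEND": "django_mongodb_backend.cache.MongoDBCache",
        "LOCATION": "my_cache_collection",
        "ENABLE_SIGNING": True,  # default per the diff
        "KEY": "",               # empty string falls back to settings.SECRET_KEY
        "OPTIONS": {
            "MAX_ENTRIES": 10_000,
        },
    },
}
```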
We currently don't have any type hints in this project (Django hasn't adopted them), so it's a bit out of place to add them in this PR.

I understand that; Django was primarily written before type hinting was standardized. I personally feel they are warranted here since I am breaking from the standard APIs for some functions, and it felt more correct to add them everywhere than to add them only to certain functions. I can remove them, but I am worried that will reduce readability and maintainability.