Question: Collaboration on better comment ranking algorithms? #4924
Comments
Sure, seems interesting. I'm all for improving existing sorts or adding new ones.
Great! The biggest challenge we're currently facing is evaluation with real-life scenarios. We already have simulations which confirm that our algorithms do what we expect them to do, but that's no guarantee that they'll work well in the wild. I see two options to evaluate right now:
The exact data the algorithm needs is a vote-stream which contains: the (pseudonymized) voter, the post or comment the vote applies to, the vote's score, and its timestamp.
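As an editorial illustration, one event in such a vote-stream could be modeled like this (field names are illustrative, chosen to match the columns the queries in this thread export):

```python
from typing import NamedTuple

class Vote(NamedTuple):
    """One event in the vote-stream (illustrative field names)."""
    voter: str       # pseudonymized voter id, e.g. a salted hash
    item_id: int     # the post or comment the vote applies to
    score: int       # +1 for an upvote, -1 for a downvote
    published: str   # timestamp of the vote

# Example event:
v = Vote(voter="a3f9", item_id=42, score=1, published="2024-01-01T12:00:00Z")
```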
Any idea what's the most practical way forward?
If it doesn't give the highest rank to recent stuff only, then it should be added as a new sort type, maybe called "best" or "fair".
Right. Our algorithm doesn't sort by date. The most descriptive name we've come up with so far is "convincing". A bit more in depth: it empirically measures "convincingness" in voting patterns and bubbles up the most convincing comments for every parent. We define a convincing comment as one which measurably changes voting behavior on its parent. The idea is to focus more attention on convincing comments and their convincing replies, recursively. Since misinformation and debunking information is usually convincing, it should (in theory) help to debunk misinformation faster.
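As an editorial toy sketch (not the group's actual algorithm): given hypothetical per-comment "convincingness" scores, the recursive bubbling-up described above could rank siblings by their own score plus the best score found among their descendants, so a convincing reply lifts its parent:

```python
# Rank each comment by its own convincingness plus the best convincingness
# found anywhere below it, so convincing replies bubble attention upward.
def effective_score(cid, children, conv):
    """children: comment id -> list of reply ids; conv: id -> score."""
    best_child = max(
        (effective_score(c, children, conv) for c in children.get(cid, [])),
        default=0.0,
    )
    return conv[cid] + best_child

children = {1: [2, 3], 2: [4], 3: [], 4: []}   # toy comment tree under comment 1
conv = {1: 0.1, 2: 0.0, 3: 0.2, 4: 0.9}       # hypothetical scores
ranked = sorted(children[1],
                key=lambda c: effective_score(c, children, conv),
                reverse=True)
# comment 2 outranks 3 because its reply 4 is highly convincing
```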
I looked a bit into the Lemmy database schema. I think the easiest first step is to analyze some existing voting data from comment trees. That's much easier than implementing the full algorithm in this code base. I prepared some queries which return exactly the data we need for a first analysis:

-- all posts
select id, name, url, body, published from post;

-- all comments with their parent_id
SELECT
SELECT
id,
post_id,
content,
published,
CASE
WHEN nlevel(path) = 2 THEN NULL
ELSE ltree2text(subpath(path, nlevel(path) - 2, 1))
END AS parent_id
FROM
comment;

-- anonymized post likes, CHANGE THE SALT!
-- salt must be the same in all queries.
select sha224(('random salt 123' || person_id)::bytea) as person_id, post_id, score, published from post_like;

-- anonymized comment likes, CHANGE THE SALT!
-- salt must be the same in all queries.
select sha224(('random salt 123' || person_id)::bytea) as person_id, comment_id, score, published from comment_like;

Would it be ok to run those on a popular instance and publish the data (or hand it over in private)? I would then analyse it with our group and share insights here.
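For illustration, the salted-hash pseudonymization in the queries above can be mirrored in a few lines of Python. The salt value here is the same placeholder as in the queries and would of course need changing:

```python
import hashlib

SALT = "random salt 123"  # placeholder, as in the queries: CHANGE THE SALT!

def pseudonymize(person_id: int) -> str:
    # Mirrors the SQL sha224((salt || person_id)::bytea): with the same salt,
    # one user always maps to the same pseudonym, so their votes stay linkable
    # across the post_like and comment_like exports without exposing the id.
    return hashlib.sha224((SALT + str(person_id)).encode()).hexdigest()

print(pseudonymize(7) == pseudonymize(7))  # True: stable across exports
print(pseudonymize(7) == pseudonymize(8))  # False: users stay distinct
```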
Technically all this data is publicly available through the REST API of any instance. So you can retrieve it that way too, but I would suggest asking permission from an admin.
I just double checked the API docs again. As far as I understand, the API only offers vote aggregates, not individual votes per person. Am I overlooking something?
The API does indeed only return individual votes to admins or community moderators, but you can see the number of upvotes and the number of downvotes. Note that typically (some exceptions apply) all posts and comments automatically receive an upvote from their creator.

Does it matter which user a vote is from for the result, or is that just an implementation detail without affecting the outcome? If it doesn't affect the outcome, it could probably just be substituted with a random id for testing.

It should also be noted that the SQL queries above do not provide anonymization. There are also some other Fediverse applications out there that publicly display votes, which could further be used to almost fully de-pseudonymize the entire list of votes (remote instances typically don't have 100% coverage of local content). Considering that Lemmy does not make individual votes public, such a dataset, if provided, should likely not be publicly posted.
Very important point! That would indeed allow deanonymizing users.
Individual votes matter for the calculation. Imagine a comment A and a replying comment B. We statistically measure whether users who upvoted B voted differently on A. That can't be calculated from the counts alone. The author's vote, however, doesn't matter here; it can simply be left out of the data. The calculations are per discussion tree, so we should also include the post_id in the hash. With those changes applied to the queries, would there still be room for deanonymization?
-- get all posts
select post.id, post.name, post.url, post.body, post.published
from post
join community on post.community_id = community.id
where post.deleted = false
and post.removed = false
and community.removed = false
and community.deleted = false
;
-- get all comments
select
comment.id,
comment.post_id,
comment.content,
comment.published,
case
when nlevel(comment.path) = 2 then null
else ltree2text(subpath(comment.path, nlevel(comment.path) - 2, 1))
end as parent_id
from comment
join post on post.id = comment.post_id
join community on post.community_id = community.id
where comment.deleted = false
and comment.removed = false
and post.deleted = false
and post.removed = false
and community.removed = false
and community.deleted = false
;
-- pseudonymized post likes, CHANGE THE SALT!
-- salt must be the same in all queries.
select
sha224(('random salt 123' || person_id)::bytea) as person_id,
post_like.post_id,
post_like.score,
post_like.published
from post_like
join post on post.id = post_like.post_id
join community on post.community_id = community.id
where post_like.person_id != post.creator_id
and post_like.score != 0
and post.deleted = false
and post.removed = false
and community.removed = false
and community.deleted = false
;
-- pseudonymized comment likes, CHANGE THE SALT!
-- salt must be the same in all queries.
select
sha224(('random salt 123' || person_id)::bytea) as person_id,
comment_like.comment_id,
comment_like.score,
comment_like.published
from comment_like
join comment on comment.id = comment_like.comment_id
join post on post.id = comment.post_id
join community on post.community_id = community.id
where comment_like.person_id != comment.creator_id
and comment_like.score != 0
and comment.deleted = false
and comment.removed = false
and post.deleted = false
and post.removed = false
and community.removed = false
and community.deleted = false
;

these would probably be better queries then.

in the end, realistically, vote data (at least in non-local-only communities) should be considered more or less public, even though it's not always easily accessible to everyone without some additional work. as i mentioned before, some other fediverse software makes votes publicly visible to everyone. even if that wasn't the case, all you'd need to do would be running a fediverse instance of your own to passively listen for other instances to send this information to you over time.

if you can find content that only has 2 upvotes (1 if you exclude the creator), and on another fediverse instance with public votes you can see which person this is, then you can use this dataset to match all other votes that this person cast. i don't know whether any software currently provides an overview of all votes by a user.

there isn't really a clear line here imo, so instance admins will have to consider what or if they can share this data, but with a bit of time investment all this information can already be automatically and mostly passively collected (for new content). it would probably still reduce the risk of abuse for e.g. harassment if this was not publicly published as a full dataset for everyone to just download.
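The linkage risk described above can be illustrated with a toy example (editorial sketch; data and names are made up): if any public source reveals that a known user is the sole voter on one item, the pseudonym on that item unmasks the user's entire voting history in the dump.

```python
# (pseudonym, item_id) rows from a hypothetical published dataset
dump = [
    ("p1", "X"), ("p1", "Y"), ("p1", "Z"),
    ("p2", "Y"), ("p2", "Q"),
]
# External observation (e.g. another instance showing votes publicly):
# a known user, "alice", is the only voter on item "X".
voters_on_x = {p for p, item in dump if item == "X"}
assert len(voters_on_x) == 1          # sole voter -> unambiguous link
alice_pseudonym = voters_on_x.pop()
alice_votes = [item for p, item in dump if p == alice_pseudonym]
# alice's full voting history is now linked to her real identity
```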
Thank you for revising the queries! They look much better now. Any recommendations for which instance admins I should approach?
@dessalines could provide some information about lemmy.ml, which is probably the oldest instance around, although not necessarily the most complete one. due to bugs, limited federation, age or lack of community subscriptions there won't be any single instance that has all data. lemmy.world as the largest instance is probably worth reaching out to as well.

https://lemmyverse.net/?order=posts and https://lemmyverse.net/?order=comments might be useful. you can ignore lemmit.online, that's a reddit repost instance. there are a few more that may be mostly or only mirroring reddit content as well.

also keep in mind that post/comment/user ids are local to an instance, so you can't merge data from multiple instances this way.
Do I understand correctly that the ap_id is a globally unique identifier? With that understanding, I revised the queries another time:
@Nothing4You if you approve, I'll start approaching instance admins. Thank you for your help!

-- get all posts
select ap_id, post.name, post.url, post.body, post.published
from post
join community on post.community_id = community.id
where post.deleted = false
and post.removed = false
and community.removed = false
and community.deleted = false
;
-- get all comments
select
comment.ap_id,
post.ap_id as post_ap_id,
comment.content,
comment.published,
parent_comment.ap_id as parent_ap_id
from comment
join post on post.id = comment.post_id
join community on post.community_id = community.id
-- left join so top-level comments (which have no parent) are kept
left join comment as parent_comment on parent_comment.id = (
case
when nlevel(comment.path) = 2 then null
else ltree2text(subpath(comment.path, nlevel(comment.path) - 2, 1))
end)::int
where comment.deleted = false
and comment.removed = false
and post.deleted = false
and post.removed = false
and community.removed = false
and community.deleted = false
;
-- pseudonymized post likes, CHANGE THE SALT!
-- salt must be the same in all queries.
select
sha224(('random salt 123' || person.actor_id)::bytea) as actor_id,
post.ap_id as post_ap_id,
post_like.score,
post_like.published
from post_like
join post on post.id = post_like.post_id
join community on post.community_id = community.id
join person on person.id = post_like.person_id
where post_like.person_id != post.creator_id
and post_like.score != 0
and post.deleted = false
and post.removed = false
and community.removed = false
and community.deleted = false
;
-- pseudonymized comment likes, CHANGE THE SALT!
-- salt must be the same in all queries.
select
sha224(('random salt 123' || person.actor_id)::bytea) as actor_id,
comment.ap_id as comment_ap_id,
comment_like.score,
comment_like.published
from comment_like
join comment on comment.id = comment_like.comment_id
join post on post.id = comment.post_id
join community on post.community_id = community.id
join person on person.id = comment_like.person_id
where comment_like.person_id != comment.creator_id
and comment_like.score != 0
and comment.deleted = false
and comment.removed = false
and post.deleted = false
and post.removed = false
and community.removed = false
and community.deleted = false
;

On a local instance, the queries produce the expected results.
this is correct. those ids will typically be globally unique identifiers. there can be exceptions to this if an instance is torn down and recreated without adjusting the auto increment counters, which can otherwise lead to reuse of identifiers. similarly, user accounts could have been purged at some point and registered again afterwards. the combination of the
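Since the ap_id is (typically) globally unique, datasets exported from multiple instances can be combined by deduplicating on it, which the instance-local integer ids would not allow. A minimal editorial sketch (URLs are made up):

```python
# Merge exports from several instances, keeping the first row seen
# for each ap_id; local integer ids would collide across instances.
def merge_posts(*exports):
    """Each export: list of dicts with an 'ap_id' key."""
    seen, merged = set(), []
    for export in exports:
        for row in export:
            if row["ap_id"] not in seen:
                seen.add(row["ap_id"])
                merged.append(row)
    return merged

a = [{"ap_id": "https://a.example/post/1"}, {"ap_id": "https://b.example/post/9"}]
b = [{"ap_id": "https://b.example/post/9"}, {"ap_id": "https://b.example/post/10"}]
combined = merge_posts(a, b)  # 3 unique posts; the duplicate is dropped
```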
Closing this as the question has been answered. |
Question
Hey everyone,
We're a small open research group currently working on new comment ranking algorithms for discussion trees. Our goal is to make identifying and debunking misinformation in discussions more effective. Technically, we're analyzing voting patterns using Bayesian statistics.
Is there interest in the lemmy community to collaborate on that goal?
Everything we do is open-source. An earlier project we worked on was a new ranking metric for the Hacker News frontpage: https://github.com/social-protocols/news#readme
Happy to answer any questions.