Match opinions based on pincite #4323

albertisfu · 2024-08-19T21:11:12Z

This is a follow-up to #4211, where we discussed potential improvements for matching the correct opinion when resolving citations to create OpinionCited instances.

Currently, within es_reverse_match, if more than one of the Opinions found belongs to the same cluster, the first one matched is the one used to create the OpinionCited instance.

This can be improved in one of two ways:

Now that we have opinion ordering, it makes sense to choose the first opinion by order.

But that could be improved because sometimes a citation will have what's called a pincite, which is a citation to a particular page. If we know the pincite, we should figure out which of the sub-opinions it refers to based on the page numbers for each citation, and use that sub-opinion. (But this is hard.)

Currently, ordering keys in Opinions are empty, so this improvement might have to wait until the ordering is populated.


mlissner · 2024-12-16T18:27:03Z

This is a bit of a complicated bug, so here's the TLDR:

Eyecite finds a citation.
We look for clusters with that citation.
We find one, but it has three sub-opinions.
Currently we just match the first one to be indexed.
Instead, we should match the first one by ordering_key, or if there's a pincite, we should match the opinion where that pincite occurs.

grossir · 2024-12-19T22:13:36Z

I think the scenarios proposed look like this:

We are resolving a pincite
- Does the opinion "content" have page numbers available?
  - Yes: match the opinion in the cluster that has the proper page number
  - No: does the cluster have an ordering key?
    - Yes: use the first ordering key
    - No: use the first indexed (we do this currently)
We are resolving a general citation
- Does the cluster have an ordering key?
  - Yes: use the first ordering key
  - No: use the first indexed (we do this currently)
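The scenarios above can be sketched as a small selection function. This is only an illustration: `SubOpinion`, its `pages` attribute, and `pick_opinion` are hypothetical names, not part of the CourtListener codebase, and the sketch assumes the page numbers for each sub-opinion have already been collected somehow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SubOpinion:
    # Hypothetical stand-in for an Opinion row; not the real Django model.
    id: int
    ordering_key: Optional[int] = None
    # Page numbers found in the opinion's markup; empty if none are known.
    pages: frozenset = frozenset()


def pick_opinion(opinions, pincite_page=None):
    """Pick one sub-opinion from a cluster.

    `opinions` is assumed to be listed in the order they were indexed.
    """
    # Pincite scenario: prefer a sub-opinion that contains the cited page.
    if pincite_page is not None:
        matches = [op for op in opinions if pincite_page in op.pages]
        if matches:
            return matches[0]
    # Otherwise (or if the page was not found): lowest ordering key wins.
    ordered = [op for op in opinions if op.ordering_key is not None]
    if ordered:
        return min(ordered, key=lambda op: op.ordering_key)
    # Fallback: the first indexed opinion, which is the current behavior.
    return opinions[0]
```

Note that when several sub-opinions contain the cited page, this sketch simply takes the first of them; it does not try to disambiguate.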

For the scenarios where we can't resolve the pincite I think we should match the "combined" opinion if it exists, instead of the first in order.

Looking at a particular cluster with 4 opinions: the first 3 have ordering keys and the types lead, concurrence, and dissent; the last and oldest is a "combined" opinion with a null ordering key. So, if the pincite is not resolvable, or in the case of a general citation, shouldn't we point to the combined opinion, which is the "whole" decision, instead of a fragment?

Even if we decide against this, there would be corrections to do. In that same cluster, 3297 opinions are citing the combined opinion; and only 1 is citing the "lead" opinion.

Jumping into the task itself:

Case when we can identify page numbers in the opinion

if there's a pincite, we should match the opinion where that pincite occurs.

This is possible for opinions that come from an HTML / XML source:

html_lawbox: <span class=\"star-pagination\">*115</span>. Example
xml_harvard: <page-number citation-index=\"1\" label=\"116\">*116</page-number>. Example
html_columbia: <span class=\"star-pagination\">*Page 319</span>. Example
html: This should be possible for some courts that mark up their page numbers, but we would need to check on a case-by-case basis, and it would take more time to implement.
plain_text: I think it wouldn't be possible to do cleanly: if the page numbers are not clearly marked up, they could collide with random numbers in the opinion's text.

Once we have the pincite page number, we test for its presence in any of the HTML fields, and it's a match if it exists.
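As a rough sketch of that test, the page markers shown above could be pulled out with a regular expression. The pattern and the `marked_pages` helper are assumptions for illustration, based only on the three markup samples in this comment; they assume the stored markup uses plain (unescaped) double quotes, and real opinions may contain variants this does not cover.

```python
import re

# Matches the star-pagination spans (html_lawbox, html_columbia) and the
# page-number elements (xml_harvard) shown in the samples above.
PAGE_MARKERS = re.compile(
    r'<span class="star-pagination">\s*\*(?:Page\s+)?(\d+)\s*</span>'
    r'|<page-number\b[^>]*\blabel="(\d+)"[^>]*>'
)


def marked_pages(markup: str) -> set[int]:
    """Return the set of page numbers that are explicitly marked up."""
    pages = set()
    for match in PAGE_MARKERS.finditer(markup):
        # Exactly one of the two groups is set, depending on the branch.
        pages.add(int(match.group(1) or match.group(2)))
    return pages


def contains_page(markup: str, pincite_page: int) -> bool:
    """True if the pincite page is one of the marked-up pages."""
    return pincite_page in marked_pages(markup)
```

Matching only on explicit markers, rather than on any number in the text, is what avoids the collision problem described for plain_text.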

We have 363 895 clusters where a pincite citing into them would be resolvable, around 3.7% of the clusters in the DB

It seems eyecite already identifies pincites...

Matching on ordering key

Currently we just match the first one to be indexed.
Instead, we should match the first one by ordering_key

This can be done easily, but I am unsure if it's the correct choice. If we do it, should we back-correct the OpinionsCited table for cases like the example above?

As of this writing, 430 596 clusters have at least one opinion with a non-null ordering_key. That is 4.38% of all clusters.

Queries for the stats:

-- clusters with more than 1 opinion, and 1 rich structured field per opinion
courtlistener=> select count(*) from (select cluster_id from search_opinion group by cluster_id having count(*) > 1 and bool_and(xml_harvard <> '' or html_lawbox <> '' or html_columbia <> '')) a;
 count  
--------
 363895


-- clusters with at least 1 ordering key
courtlistener=> select count(distinct(cluster_id)) from search_opinion where ordering_key is not null;
 count  
--------
 430596
(1 row)

courtlistener=> select count(distinct(cluster_id)) from search_opinion;
  count  
---------
 9823322
(1 row)

-- clusters that do not have a combined opinion
courtlistener=> select count(*) from (select cluster_id from search_opinion group by cluster_id having bool_and(ordering_key is not null)) a;
 count  
--------
 219875
(1 row)

Some extra thoughts

Assuming we can resolve the pincites, we should update the html_with_citations of the citing opinion to hyperlink (probably with an HTML fragment) the page number; so that when followed, the reader is autoscrolled into the proper page.

Also, resolving pincites suggests some model changes. We could add a field to OpinionsCited with the actual page number.
This would require deleting the "depth" field. However, the information in that field would not be lost, since it could be recomputed by aggregating the same table over citing_opinion and cited_opinion. Having the proper pincite as a DB field would allow finer-grained analysis without losing the "depth" information.

class OpinionsCited(models.Model):
    citing_opinion = models.ForeignKey(
        Opinion, related_name="cited_opinions", on_delete=models.CASCADE
    )
    cited_opinion = models.ForeignKey(
        Opinion, related_name="citing_opinions", on_delete=models.CASCADE
    )
    pincite = models.IntegerField(
        help_text="The page cited"
    )

--- computing depth
SELECT citing_opinion_id, cited_opinion_id, count(*) as depth
FROM search_opinions_cited
GROUP BY citing_opinion_id, cited_opinion_id

What's more, we could even leave the "depth" on the model, as a "depth" of pincites, which I imagine happens if the same opinion part is cited multiple times

--- computing depth
SELECT citing_opinion_id, cited_opinion_id, sum(depth) as depth
FROM search_opinions_cited
GROUP BY citing_opinion_id, cited_opinion_id
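The same aggregation can be sketched in plain Python to show that "depth" is recoverable from per-pincite rows. The row layout and the `depth_by_pair` helper are hypothetical, and the sketch assumes a general citation is stored as a row with no pincite.

```python
from collections import Counter

# Hypothetical rows: (citing_opinion_id, cited_opinion_id, pincite).
# None stands for a general citation without a pincite (an assumption).
rows = [
    (1, 2, 610),
    (1, 2, 612),
    (1, 3, None),
    (4, 2, 610),
]


def depth_by_pair(rows):
    """Count how many times each citing opinion cites each cited opinion."""
    return Counter((citing, cited) for citing, cited, _pincite in rows)


depths = depth_by_pair(rows)
```

Here opinion 1 cites opinion 2 twice, at two different pages, so the recomputed depth for that pair is 2.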

mlissner · 2024-12-20T01:11:29Z

For the scenarios where we can't resolve the pincite I think we should match the "combined" opinion if it exists, instead of the first in order.

Hm, @flooie might have an opinion here, but I think if we have sub-opinions (plural) as well as a combined opinion, we should just match to the first one when we can't resolve the pincite. I think it's generally the most important decision in the cluster and the one that's assumed.

Even if we decide against this, there would be corrections to do. In that same cluster, 3297 opinions are citing the combined opinion; and only 1 is citing the "lead" opinion.

When we re-run the citation finder, it'll nuke existing citations and replace them with better ones. It's designed that way.

Once we have the pincite page number, we test for its presence in any of the HTML fields, and it's a match if it exists.

We wouldn't want to be looking in the HTML to do matches, BUT if we're going to do pincites, we should fix #4843 first. I think it'd give us an efficient way to do this.

We should update the html_with_citations of the citing opinion to hyperlink (probably with an HTML fragment) the page number; so that when followed, the reader is autoscrolled into the proper page.

Yes!

This would require deleting the "depth" field.

Hm, that doesn't seem worth it, but could we just not store the pincite in the DB? Our destination is:

A link from opinion A to the pin-cited opinion B (this goes in the DB)
An anchor fragment (e.g. #page-22). Maybe we just put that in the HTML and that's good enough?

I noted on the pincite sub-issue that it would be hard to do. Up to Bill if it's worth it now or something we should do later. It's pretty tough.

flooie · 2025-01-02T18:22:31Z

Perhaps this is obvious, but I would just point out that pin-cites alone are not sufficient to identify which sub-opinion is being cited.

If someone searched 58 U.S. 596 at 600 our system would fail unless we had either the author or the text to disambiguate which opinion. If you look at the below image you can see that page 600 contains the end of the majority, the dissent, and the concurrence.
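To make the ambiguity concrete, here is a minimal sketch that returns every sub-opinion whose page range contains the cited page. The page ranges are made up for illustration (shaped like the example, where several opinions share page 600), and `candidates_for_page` is a hypothetical helper.

```python
def candidates_for_page(page_ranges, page):
    """Return the sub-opinions whose (first, last) page range contains `page`."""
    return [
        name
        for name, (first, last) in page_ranges.items()
        if first <= page <= last
    ]


# Made-up ranges, not the real pagination of any decision.
ranges = {
    "majority": (596, 600),
    "concurrence": (600, 600),
    "dissent": (600, 602),
}

shared = candidates_for_page(ranges, 600)  # three candidates: ambiguous
unique = candidates_for_page(ranges, 598)  # a single candidate: resolvable
```

A pincite is resolvable from page numbers alone only when exactly one candidate comes back; otherwise something else, such as the author or the quoted text, is needed.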

Hm, @flooie might have an opinion here, but I think if we have sub-opinions (plural) as well as a combined opinion, we should just match to the first one when we can't resolve the pincite. I think it's generally the most important decision in the cluster and the one that's assumed.

I agree

We wouldn't want to be looking in the HTML to do matches, BUT if we're going to do pincites, we should fix #4843 first. I think it'd give us an efficient way to do this.

Fixing #4843 only allows us to pincite to the cluster in a safer way; it doesn't help us pincite to sub-opinions, as highlighted above.

An anchor fragment (e.g. #page-22). Maybe we just put that in the HTML and that's good enough?

We should already be generating #p22 anchor tags. The javascript standardizes most (hopefully all) citations so that each is linkable in the new design.

I'm not sure anyone mentioned the fact that parallel citations are also going to make things trickier.

grossir · 2025-01-02T20:14:21Z

Some thoughts after talking with Bill and looking for examples

The "page" anchor tags are already generated for marked-up opinions (example); we would need to generate the proper anchor on the citing opinion.
We aren't considering "paragraph" pincites, like "Ward at ¶ 30". I think these hold no ambiguity, since they are not page numbers, but I am not sure how common they are.
- Example of an ohioctapp opinion citing other opinions using the paragraphs.
- Example of an az opinion written with marked up paragraphs (that we do not enrich in our HTML)
I found an html_with_citations issue that I did not see mapped on the parent issue: id. citations sometimes put too much text inside the <a> tag #4882
I agree with Bill that some page-number pincites would be ambiguous when multiple opinions are on the same page; but some are not (see 2 examples below), given that the pincite points to a page that belongs to a single opinion

Type	Pincite	Comment	Citing op	Cited op
Pincite to non-majority opinion	See S. Bell Tel. & Tel. Co. v. Pub. Serv. Comm'n, 270 S.C. 590, 610, 244 S.E.2d 278, 288 (1978) (Ness, J., concurring in part and dissenting in part)	The in-part opinion begins on page 605, so this pincite would actually be resolvable. Also, note that there are 2 pincites: the hyperlinked one does not correspond to the numbering we actually have on display	https://www.courtlistener.com/opinion/5065237/in-re-application-of-blue-granite/?q=%22Roe+at%22+dissenting&type=o&order_by=dateFiled+desc&stat_Published=on	https://www.courtlistener.com/opinion/1338206/sou-bell-tel-tel-co-v-pub-ser-comm/#610
Pincite to page	Aros v. Beneficial Ariz., Inc., 194 Ariz. 62, 66 (1999).	We display the opinion in the parallel citation format, not how it was cited, thus the fragment wouldn’t work	https://www.courtlistener.com/opinion/9491968/planned-parenthood-v-kristin-mayeshazelrigg/?q=%22Roe+at%22+dissenting&type=o&order_by=dateFiled+desc&stat_Published=on	https://www.courtlistener.com/opinion/1187886/leonard-h-v-beneficial-arizona-inc/
Pincite to paragraph	Medina, 2011-Ohio-3990, at ¶ 13 (8th Dist.)		https://www.courtlistener.com/opinion/10014124/camacho-v-rose-mary-johanna-graselli-rehab-inc/?q=%22Roe+at%22&type=o&order_by=dateFiled+desc&stat_Published=on	https://www.courtlistener.com/opinion/2704393/medina-v-medina-gen-hosp/
Pincite to non-majority opinion	See, e.g., In re Allstate Cty. Mut. Ins., 85 S.W.3d at 198	Would be resolvable, dissent begins at 197	https://www.courtlistener.com/opinion/4635540/barbara-technologies-corporation-v-state-farm-lloyds/?q=Barbara+Techs.+Corp.+v.+State+Farm+Lloyds	https://www.courtlistener.com/opinion/1588427/in-re-allstate-county-mut-ins-co/#198

flooie · 2025-01-02T21:13:18Z

I think we should table this - and just link to the first ordered opinion.

I think we first need to improve eyecite further, and to think about changes to the citation and/or other models.

mlissner · 2025-01-03T03:47:32Z

Sounds good, thanks Bill and everybody else for the analysis! We'll get to this at some later point.

mlissner changed the title ~~Enhance sub-Opinions matching logic within es_reverse_match~~ Match opinions based on pincite Dec 16, 2024

github-project-automation bot added this to Case Law Sprint and Citator Dec 16, 2024

flooie moved this to To Do in Case Law Sprint Dec 17, 2024

flooie assigned grossir Dec 17, 2024

grossir moved this from To Do to In progress in Case Law Sprint Dec 18, 2024

grossir moved this from In progress to Blocked in Case Law Sprint Dec 20, 2024

flooie moved this from Blocked to General Backlog in Case Law Sprint Jan 2, 2025
