8 changes: 5 additions & 3 deletions src/constants.ts
@@ -187,9 +187,11 @@ export const EXACT_SELECTORS = [
// '[href*="/tag/"]',
// '[href*="/tags/"]',
// '[href*="/topics"]', // see issue #131
- '[href*="/author/"]',
- '[href*="/author?"]',
- '[href$="/author"]',
+ // Author links can be legitimate article content (see issue #252).
+ // Author metadata/widgets are handled by class selectors and content-pattern removals.
+ // '[href*="/author/"]',
+ // '[href*="/author?"]',
+ // '[href$="/author"]',
'a[href*="copyright.com"]',
'a[href*="google.com/preferences"]',
'[href="#top"]',
24 changes: 24 additions & 0 deletions src/removals/content-patterns.ts
@@ -452,6 +452,30 @@ export function removeByContentPattern(mainContent: Element, debug: boolean, url
break;
}

// Remove compact author byline lists near the top of content. The broad
// href-based selector removal is intentionally disabled so body links to
// author pages are preserved; pre-content author lists are metadata.
for (const list of mainContent.querySelectorAll('ul, ol')) {
if (!list.parentNode) continue;
if (!isPreContent(list)) continue;
if (countWords(list.textContent || '') > 10) continue;
if (list.querySelector(CONTENT_ELEMENT_SELECTOR)) continue;

const links = Array.from(list.querySelectorAll('a[href]'));
if (links.length === 0) continue;
const allAuthorLinks = links.every(link => {
const href = link.getAttribute('href') || '';
return href.includes('/author/') || href.includes('/author?') || /\/author\/?$/.test(href);
});
if (!allAuthorLinks) continue;

const target = walkUpToWrapper(list, list.textContent?.trim() || '', mainContent);
if (debug && debugRemovals) {
debugRemovals.push({ step: 'removeByContentPattern', reason: 'author byline list', text: textPreview(target) });
}
target.remove();
}

const candidates = Array.from(mainContent.querySelectorAll('p, span, div, time'));

// Single pass over candidates for all metadata-removal checks.
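
For reviewers: the decision made by the new loop in `removeByContentPattern` can be modelled without a DOM. The sketch below is not part of the change. `ListSummary`, `MAX_BYLINE_WORDS`, `isAuthorHref`, and `looksLikeAuthorByline` are hypothetical names, and `countWords` here is an assumed whitespace-based stand-in for the project's helper, whose implementation is not shown in this diff. The `isPreContent` and `CONTENT_ELEMENT_SELECTOR` guards are deliberately left out, so this covers only the word-count and href checks.

```typescript
// Hypothetical, DOM-free model of the byline check added in content-patterns.ts.
interface ListSummary {
  text: string;    // combined text content of the list
  hrefs: string[]; // href attribute of every link inside the list
}

// Mirrors the `> 10` word threshold used in the diff.
const MAX_BYLINE_WORDS = 10;

// Same three href tests as the `links.every(...)` callback in the diff.
function isAuthorHref(href: string): boolean {
  return href.includes('/author/') || href.includes('/author?') || /\/author\/?$/.test(href);
}

// Assumed word counter: splits on runs of whitespace.
function countWords(text: string): number {
  const trimmed = text.trim();
  return trimmed === '' ? 0 : trimmed.split(/\s+/).length;
}

// A list is treated as a byline only if it is short, has links,
// and every one of those links points at an author page.
function looksLikeAuthorByline(list: ListSummary): boolean {
  if (countWords(list.text) > MAX_BYLINE_WORDS) return false;
  if (list.hrefs.length === 0) return false;
  return list.hrefs.every(isAuthorHref);
}

// A compact list that links only to author pages.
const byline: ListSummary = {
  text: 'Jane Smith John Doe',
  hrefs: ['/author/jane-smith', '/author/john-doe'],
};

// The "People mentioned" list from the issue #252 fixture.
const peopleMentioned: ListSummary = {
  text: 'Alan Perlis Grady Booch Edsger Dijkstra Erik Meijer',
  hrefs: [
    'https://en.wikipedia.org/wiki/Alan_Perlis',
    'https://example.com/author/grady-booch',
    'https://en.wikipedia.org/wiki/Edsger_W._Dijkstra',
    'https://example.com/author/erik-meijer',
  ],
};

console.log(looksLikeAuthorByline(byline));
console.log(looksLikeAuthorByline(peopleMentioned));
```

In this model the fixture's list is kept because two of its four links point to Wikipedia, so the `every` check fails even though the list is under the word limit. Note also that the regex accepts a trailing slash (`/author/`), which the disabled `[href$="/author"]` selector did not.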
21 changes: 21 additions & 0 deletions tests/expected/issues--252-author-links-preserved.md
@@ -0,0 +1,21 @@
```json
{
"title": "Simple Made Clear",
"author": "Jane Smith",
"site": "Example Talks",
"published": ""
}
```

Software systems become easier to understand when each part has one reason to change. This talk explains how teams can separate concerns without splitting code into arbitrary fragments.

The main example follows a reporting service as it grows from a single script into a small set of modules. Each step keeps the public behavior the same while making dependencies visible and easier to test.

## People mentioned

- [Alan Perlis](https://en.wikipedia.org/wiki/Alan_Perlis)
- [Grady Booch](https://example.com/author/grady-booch)
- [Edsger Dijkstra](https://en.wikipedia.org/wiki/Edsger_W._Dijkstra)
- [Erik Meijer](https://example.com/author/erik-meijer)

The point is not that every program needs more layers. The point is that names, data flow, and boundaries should make the important choices visible to the next person reading the code.
28 changes: 28 additions & 0 deletions tests/fixtures/issues--252-author-links-preserved.html
@@ -0,0 +1,28 @@
<!-- {"url": "https://example.com/presentations/simple-made-clear/"} -->
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Simple Made Clear</title>
<meta property="og:title" content="Simple Made Clear">
<meta property="og:site_name" content="Example Talks">
<meta name="author" content="Jane Smith">
</head>
<body>
<article>
<h1>Simple Made Clear</h1>
<p>Software systems become easier to understand when each part has one reason to change. This talk explains how teams can separate concerns without splitting code into arbitrary fragments.</p>
<p>The main example follows a reporting service as it grows from a single script into a small set of modules. Each step keeps the public behavior the same while making dependencies visible and easier to test.</p>

<h2>People mentioned</h2>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Alan_Perlis">Alan Perlis</a></li>
<li><a href="https://example.com/author/grady-booch">Grady Booch</a></li>
<li><a href="https://en.wikipedia.org/wiki/Edsger_W._Dijkstra">Edsger Dijkstra</a></li>
<li><a href="https://example.com/author/erik-meijer">Erik Meijer</a></li>
</ul>

<p>The point is not that every program needs more layers. The point is that names, data flow, and boundaries should make the important choices visible to the next person reading the code.</p>
</article>
</body>
</html>