Front Matter support (#1)
* Front matter extractor class

* Added support for front matter from command line

* Updated docs

* Updated version numbers
mikegoatly authored Aug 9, 2020
1 parent 700f034 commit a84cd54
Showing 13 changed files with 537 additions and 14 deletions.
81 changes: 75 additions & 6 deletions README.md
@@ -2,7 +2,10 @@

![Build and test](https://github.com/mikegoatly/html2md/workflows/Build%20and%20test/badge.svg)

Convert an HTML page to markdown, including re-linking and downloading of images.
Reverse engineer markdown from an HTML page, including:

- Re-linking and downloading of images
- Front Matter metadata generation

## Usage as a dotnet tool

@@ -27,13 +30,28 @@ If unspecified the entire body tag will be processed, otherwise only text contai
Allows for specific tags to be ignored.
--image-path-prefix|--ipp <IMAGE PATH PREFIX>
The prefix to apply to all rendered image URLs - helpful when you're going to be serving images from a different location, relative or absolute.
The prefix to apply to all rendered image URLs - helpful when you're going to be serving images from a
different location, relative or absolute.
--default-code-language <LANGUAGE>
The default language to use on code blocks converted from pre tags - defaults to csharp
--code-language-class-map <CLASSNAME:LANGUAGE,CLASSNAME:LANGUAGE,...>
Map between a pre tag's class names and languages. E.g. you might map the class name "sh_csharp" to "csharp" and "sh_powershell" to "powershell".
Map between a pre tag's class names and languages. E.g. you might map the class name "sh_csharp" to "csharp"
and "sh_powershell" to "powershell".
--front-matter-data <PROPERTY:[XPATH|{{MACRO}}|{{'CONSTANT'}}]>
Allows for configuration of information to be extracted to a Front Matter property. This can be an XPath to an element
or attribute in the HTML page, a string constant or a supported macro.
Supported macros:
RelativeUriPath: The relative path of the page being converted. e.g. for https://example.com/pages/page-1 the macro would
return /pages/page-1
--front-matter-data-list <PROPERTY:XPATH>
Allows for configuration of list-based information to be extracted to a Front Matter property.
--front-matter-delimiter <DELIMITER>
The delimiter to write out for the Front Matter section of the converted document. The default is ---
```

## Usage as a nuget package
@@ -61,7 +79,50 @@ ConversionResult converted = await converter.ConvertAsync(

```

`ConvertedDocument` exposes:
You can also extract Front Matter metadata:

``` csharp

var options = new ConversionOptions
{
    FrontMatter =
    {
        Enabled = true,
        SingleValueProperties =
        {
            { "Title", "//h1" },
            { "Author", "{{'Mike Goatly'}}" },
            { "RedirectFrom", @"{{RelativeUriPath}}" }
        },
        ArrayValueProperties =
        {
            { "Tags", @"//p[@class='tags']/a" }
        }
    }
};

var converter = new MarkdownConverter(options);

ConversionResult converted = await converter.ConvertAsync("https://goatly.net/some-article");

```

Where the resulting markdown would be:

```
---
Title: Article Title
Author: Mike Goatly
RedirectFrom: /some-article
Tags:
- Help
- Coding
---
```
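
As a rough illustration of how that output is assembled, the sketch below renders values that have already been extracted. It is a simplified stand-in, not the library's code: `FrontMatterRenderSketch` and `Render` are hypothetical names, and the one-space indent before each list item is an assumption based on the `" - "` literal in `FrontMatterExtractor`.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration only; not part of the html2md API.
public static class FrontMatterRenderSketch
{
    public static string Render(
        string delimiter,
        IEnumerable<KeyValuePair<string, string>> singleValues,
        IEnumerable<KeyValuePair<string, IReadOnlyList<string>>> listValues)
    {
        var builder = new StringBuilder();

        // Opening delimiter, e.g. "---".
        builder.Append(delimiter).Append('\n');

        // Single-value properties are written as "Key: Value".
        foreach (var pair in singleValues)
        {
            builder.Append(pair.Key).Append(": ").Append(pair.Value).Append('\n');
        }

        // List properties are written as "Key:" followed by one " - item" line per entry.
        foreach (var pair in listValues)
        {
            builder.Append(pair.Key).Append(":\n");
            foreach (var item in pair.Value)
            {
                builder.Append(" - ").Append(item).Append('\n');
            }
        }

        // Closing delimiter.
        builder.Append(delimiter).Append('\n');
        return builder.ToString();
    }

    public static void Main()
    {
        var single = new List<KeyValuePair<string, string>>
        {
            new KeyValuePair<string, string>("Title", "Article Title"),
            new KeyValuePair<string, string>("Author", "Mike Goatly"),
            new KeyValuePair<string, string>("RedirectFrom", "/some-article"),
        };

        var lists = new List<KeyValuePair<string, IReadOnlyList<string>>>
        {
            new KeyValuePair<string, IReadOnlyList<string>>("Tags", new[] { "Help", "Coding" }),
        };

        Console.Write(Render("---", single, lists));
    }
}
```

The sketch uses `'\n'` rather than `AppendLine` so that the result is the same on every platform; the extractor in this commit uses `AppendLine`, which follows the platform's line ending.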

### `ConvertedDocument`

`ConvertedDocument` is the result of a conversion process, containing:

- `Documents`: The markdown representations of all the converted pages.
- `Images`: A collection of images referenced in the documents. Each image includes the downloaded raw data as a byte array.
@@ -78,14 +139,22 @@ The default is `csharp`.
- `ExcludeTags`: The set of tags to exclude from the conversion process. You can use this if there are certain parts of
a document you don't want translating to markdown, e.g. aside, nav, etc.
- `CodeLanguageClassMap`: A dictionary mapping between class names that can appear on `pre` tags and the language they map to. E.g. you might map the class name "sh_csharp" to "csharp" and "sh_powershell" to "powershell".
- `FrontMatter`: Configuration for how Front Matter metadata should be emitted into a converted document.
  - `Enabled`: Whether Front Matter metadata should be emitted. Defaults to `false`.
  - `Delimiter`: The delimiter to write to the Front Matter section. Defaults to `---`.
  - `SingleValueProperties`: Configuration of information to be extracted to a Front Matter property. This can be an XPath to an element
    or attribute in the HTML page, a string constant or a supported macro. Supported macros:
    - RelativeUriPath: The relative path of the page being converted. e.g. for https://example.com/pages/page-1 the macro would
      return /pages/page-1
  - `ArrayValueProperties`: Configuration of list-based information to be extracted to a Front Matter property.
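
The three kinds of single-value configuration (string constant, macro, XPath) can be told apart from the text alone. The sketch below is a minimal, self-contained approximation of that decision; `FrontMatterValueSketch` and `Describe` are hypothetical names, and it only labels the value rather than querying an HTML document.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of the html2md API.
public static class FrontMatterValueSketch
{
    public static string Describe(string value, Uri pageUri)
    {
        // {{'...'}} is a string constant: drop the leading {{' and trailing '}}.
        if (Regex.IsMatch(value, @"^\{\{'[^']*'\}\}$"))
        {
            return "constant: " + value.Substring(3, value.Length - 6);
        }

        // The only macro described in this README is RelativeUriPath.
        if (value == "{{RelativeUriPath}}")
        {
            return "macro: " + pageUri.LocalPath;
        }

        // Anything else wrapped in {{ }} is treated as an unknown macro.
        if (value.StartsWith("{{", StringComparison.Ordinal))
        {
            throw new ArgumentException("Unknown macro " + value);
        }

        // Everything else is assumed to be an XPath expression.
        return "xpath: " + value;
    }

    public static void Main()
    {
        var page = new Uri("https://example.com/pages/page-1");

        Console.WriteLine(Describe("{{'Mike Goatly'}}", page));
        Console.WriteLine(Describe("{{RelativeUriPath}}", page));
        Console.WriteLine(Describe("//h1", page));
    }
}
```

Note that a constant cannot contain a single quote under this pattern, because `[^']*` stops at the first one.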

## Converted content

### `<em>`
### `<em>` and `<i>`

`<em>italic</em>` becomes `*italic*`

### `<strong>`
### `<strong>` and `<b>`

`<strong>bold</strong>` becomes `**bold**`

3 changes: 3 additions & 0 deletions src/Html2md.Core/ConversionOptions.cs
@@ -19,5 +19,8 @@ public class ConversionOptions : IConversionOptions

/// <inheritdoc />
public IDictionary<string, string> CodeLanguageClassMap { get; set; } = new Dictionary<string, string>();

/// <inheritdoc />
public FrontMatterOptions FrontMatter { get; set; } = new FrontMatterOptions();
}
}
75 changes: 75 additions & 0 deletions src/Html2md.Core/FrontMatterExtractor.cs
@@ -0,0 +1,75 @@
using HtmlAgilityPack;
using System;
using System.Text;
using System.Text.RegularExpressions;

namespace Html2md
{
    public class FrontMatterExtractor
    {
        public string? Extract(FrontMatterOptions options, HtmlDocument document, Uri pageUri)
        {
            if (!options.Enabled)
            {
                return null;
            }

            var builder = new StringBuilder();
            builder.AppendLine(options.Delimiter);

            foreach (var singleValue in options.SingleValueProperties)
            {
                builder
                    .Append(singleValue.Key)
                    .Append(": ")
                    .AppendLine(ExtractValue(singleValue.Value, document, pageUri));
            }

            foreach (var arrayValue in options.ArrayValueProperties)
            {
                builder
                    .Append(arrayValue.Key)
                    .AppendLine(":");

                // SelectNodes returns null rather than an empty collection when nothing matches.
                var matches = document.DocumentNode.SelectNodes(arrayValue.Value);
                if (matches == null)
                {
                    continue;
                }

                foreach (var match in matches)
                {
                    builder
                        .Append(" - ")
                        .AppendLine(match.GetDirectInnerText().Trim());
                }
            }

            builder.AppendLine(options.Delimiter);

            return builder.ToString();
        }

        private static string ExtractValue(string xpathOrMacro, HtmlDocument document, Uri pageUri)
        {
            if (xpathOrMacro.StartsWith("{{"))
            {
                // {{'...'}} is a string constant: strip the {{' prefix and '}} suffix.
                if (Regex.IsMatch(xpathOrMacro, @"^{{'[^']*'}}$"))
                {
                    return xpathOrMacro.Substring(3, xpathOrMacro.Length - 6);
                }

                return xpathOrMacro switch
                {
                    "{{RelativeUriPath}}" => pageUri.LocalPath,
                    _ => throw new Exception("Unknown macro " + xpathOrMacro),
                };
            }
            else
            {
                var node = document.DocumentNode.SelectSingleNode(xpathOrMacro);
                if (node == null)
                {
                    // No match for the XPath: write an empty value instead of throwing.
                    return string.Empty;
                }

                // An XPath ending in /@name refers to an attribute of the matched element.
                var attributeName = Regex.Match(xpathOrMacro, @"/@(\w+)$");
                if (attributeName.Success)
                {
                    return node.GetAttributeValue(attributeName.Groups[1].Value, string.Empty).Trim();
                }

                return node.GetDirectInnerText().Trim();
            }
        }
    }
}
32 changes: 32 additions & 0 deletions src/Html2md.Core/FrontMatterOptions.cs
@@ -0,0 +1,32 @@
using System.Collections.Generic;

namespace Html2md
{
    /// <summary>
    /// Configuration for writing Front Matter sections to converted documents.
    /// </summary>
    public class FrontMatterOptions
    {
        /// <summary>
        /// Gets or sets a value indicating whether Front Matter should be written to converted documents.
        /// </summary>
        public bool Enabled { get; set; }

        /// <summary>
        /// Gets or sets the delimiter that should be written to the Front Matter section. Default is ---.
        /// </summary>
        public string Delimiter { get; set; } = "---";

        /// <summary>
        /// Gets or sets the XPath or macro properties that should be written to the Front Matter section.
        /// If an XPath is provided and more than one element matches, then the first is used.
        /// </summary>
        public Dictionary<string, string> SingleValueProperties { get; set; } = new Dictionary<string, string>();

        /// <summary>
        /// Gets or sets the XPath properties that should be written to the Front Matter section as a list. Each matching
        /// value will be written as an entry in the list.
        /// </summary>
        public Dictionary<string, string> ArrayValueProperties { get; set; } = new Dictionary<string, string>();
    }
}
5 changes: 3 additions & 2 deletions src/Html2md.Core/Html2md.Core.csproj
@@ -1,7 +1,7 @@
<Project Sdk="Microsoft.NET.Sdk">

<PropertyGroup>
<TargetFramework>netstandard2.0</TargetFramework>
<TargetFramework>netstandard2.1</TargetFramework>
<LangVersion>8.0</LangVersion>
<Nullable>enable</Nullable>
<RootNamespace>Html2md</RootNamespace>
@@ -13,8 +13,9 @@
<PackageLicenseFile>LICENSE</PackageLicenseFile>
<PackageProjectUrl>https://github.com/mikegoatly/html2md</PackageProjectUrl>
<PackageTags>convert-html convert-markdown html markdown conversion</PackageTags>
<Version>1.0.1</Version>
<Version>1.1.0</Version>
<RepositoryUrl>https://github.com/mikegoatly/html2md</RepositoryUrl>
<PackageReleaseNotes>Added support for extracting Front Matter metadata</PackageReleaseNotes>
</PropertyGroup>

<ItemGroup>
8 changes: 6 additions & 2 deletions src/Html2md.Core/IConversionOptions.cs
@@ -1,5 +1,4 @@
using System;
using System.Collections.Generic;
using System.Collections.Generic;

namespace Html2md
{
@@ -35,5 +34,10 @@ public interface IConversionOptions
/// a document you don't want translating to markdown, e.g. aside, nav, etc.
/// </summary>
ISet<string> ExcludeTags { get; }

/// <summary>
/// Gets the FrontMatter configuration to apply to the conversion process.
/// </summary>
FrontMatterOptions FrontMatter { get; }
}
}
7 changes: 7 additions & 0 deletions src/Html2md.Core/MarkdownConverter.cs
@@ -20,6 +20,7 @@ public class MarkdownConverter
private readonly IConversionOptions options;
private readonly ILogger logger;
private readonly HttpClient httpClient;
private readonly FrontMatterExtractor frontMatterExtractor = new FrontMatterExtractor();

public MarkdownConverter(IConversionOptions options, ILogger? logger = null)
: this(options, null, logger)
@@ -70,6 +71,12 @@ private async Task<ConvertedDocument> ConvertAsync(Uri pageUri, StringBuilder bu
var doc = new HtmlDocument();
doc.LoadHtml(content);

var frontMatter = this.frontMatterExtractor.Extract(this.options.FrontMatter, doc, pageUri);
if (frontMatter != null)
{
builder.Append(frontMatter);
}

this.logger.LogDebug("Processing page content");
this.ProcessNode(pageUri, doc.DocumentNode, builder, imageCollector, false);

41 changes: 38 additions & 3 deletions src/Html2md/CommandLineArgs.cs
@@ -52,6 +52,20 @@ public CommandLineArgs(string[] args)
SaveArg(args, ref i, ref this.imageOutputLocation);
break;

case "--front-matter-delimiter":
this.FrontMatter.Delimiter = GetArgParameter(args, ref i) ?? this.FrontMatter.Delimiter;
break;

case "--front-matter-data":
AddArg(args, ref i, this.FrontMatter.SingleValueProperties);
this.FrontMatter.Enabled = true;
break;

case "--front-matter-data-list":
AddArg(args, ref i, this.FrontMatter.ArrayValueProperties);
this.FrontMatter.Enabled = true;
break;

case "--image-path-prefix":
case "--ipp":
SaveArg(args, ref i, ref this.imagePathPrefix!);
@@ -65,13 +79,13 @@ public CommandLineArgs(string[] args)
case "--include-tags":
case "--it":
case "-t":
SaveArg(args, ref i, ref this.includeTags);
AddArg(args, ref i, ref this.includeTags);
break;

case "--exclude-tags":
case "--et":
case "-e":
SaveArg(args, ref i, ref this.excludeTags);
AddArg(args, ref i, ref this.excludeTags);
break;

case "--code-language-class-map":
@@ -123,7 +137,7 @@ private void SaveArg(string[] args, ref int i, ref string? arg)
arg = GetArgParameter(args, ref i);
}

private void SaveArg(string[] args, ref int i, ref HashSet<string> arg)
private void AddArg(string[] args, ref int i, ref HashSet<string> arg)
{
var argValue = GetArgParameter(args, ref i);
if (argValue != null)
@@ -132,6 +146,25 @@ private void SaveArg(string[] args, ref int i, ref HashSet<string> arg)
}
}

private void AddArg(string[] args, ref int i, Dictionary<string, string> arg)
{
    var argIndex = i;
    var argValue = GetArgParameter(args, ref i);
    if (argValue != null)
    {
        var pair = argValue.Split(":");

        if (pair.Length != 2)
        {
            this.Error = "Malformed argument value for " + args[argIndex];
        }
        else
        {
            arg[pair[0]] = pair[1];
        }
    }
}

private void SaveArg(string[] args, ref int i, ref Dictionary<string, string> arg)
{
var argIndex = i;
@@ -185,5 +218,7 @@ public LogLevel LogLevel
public ISet<string> ExcludeTags => this.excludeTags;

public IDictionary<string, string> CodeLanguageClassMap => this.codeLanguageClassMap;

public FrontMatterOptions FrontMatter { get; } = new FrontMatterOptions();
}
}
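
One thing to be aware of in the `AddArg(string[], ref int, Dictionary<string, string>)` overload above: it splits the value on every colon and accepts only exactly two parts, so a `PROPERTY:XPATH` value whose XPath itself contains a colon (for example `Title://meta[@property='og:title']/@content`) would be reported as malformed. The sketch below shows one possible alternative that splits on the first colon only. It is not what this commit does; `PropertyArgSketch` and `TryParse` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of html2md.
public static class PropertyArgSketch
{
    // Splits "PROPERTY:VALUE" on the first colon so that VALUE may contain further colons.
    public static bool TryParse(string argValue, out KeyValuePair<string, string> pair)
    {
        pair = default;

        if (string.IsNullOrEmpty(argValue))
        {
            return false;
        }

        var separator = argValue.IndexOf(':');

        // Reject a missing colon, an empty property name, or an empty value.
        if (separator <= 0 || separator == argValue.Length - 1)
        {
            return false;
        }

        pair = new KeyValuePair<string, string>(
            argValue.Substring(0, separator),
            argValue.Substring(separator + 1));
        return true;
    }

    public static void Main()
    {
        if (TryParse("Title://meta[@property='og:title']/@content", out var parsed))
        {
            Console.WriteLine(parsed.Key + " => " + parsed.Value);
        }
    }
}
```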
3 changes: 2 additions & 1 deletion src/Html2md/Html2md.csproj
@@ -15,9 +15,10 @@
<Copyright>Copyright Mike Goatly</Copyright>
<PackageLicenseFile>LICENSE</PackageLicenseFile>
<PackageProjectUrl>https://github.com/mikegoatly/html2md</PackageProjectUrl>
<Version>1.0.2</Version>
<Version>1.1.0</Version>
<PackageTags>convert-html convert-markdown html markdown conversion</PackageTags>
<RepositoryUrl>https://github.com/mikegoatly/html2md</RepositoryUrl>
<PackageReleaseNotes>Added support for extracting Front Matter metadata</PackageReleaseNotes>
</PropertyGroup>

<ItemGroup>