Added Parse functionality and some unit tests #3

Open
wants to merge 9 commits into
base: master

lduchosal

This might be of interest

Owner

@drewnoakes drewnoakes left a comment

Thanks for following up! I've added some comments.

HexDump/HexDump_Parse.cs (outdated)
Comment on lines 24 to 25
/// - includeOffset = true
/// - includeAscii = true
Owner

These documented restrictions don't appear to match the regular expression pattern.

Author

Implemented in Parse, so that both methods now behave the same way.

/// </summary>
/// <param name="dump"></param>
/// <returns></returns>
public static byte[] Parse(string dump )
Owner

Just an FYI: This method allocates a lot of temporary memory on the heap. It is possible to write this without allocating anything other than the List<byte>, though the parsing has to be more manual. I can provide more details if you're interested.

Author

Glad to see any improvement, if you feel it's worth it.

Owner

Whether it's worth it depends upon the consumer's situation. If they're only parsing one short blob of text then it's probably fine as is. If they're running many operations concurrently or in an application that's sensitive to GC pauses, then the current implementation may be problematic. This is general library code, so I try to be as well behaved as possible as we cannot make many assumptions about the user's requirements.

At a high level, all of these can be avoided:

  • A string per line of the input
  • A Replace string per line
  • A string per hex character pair
  • Enumerators (via Linq) per line
  • StringReader per call
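
To illustrate the list above, here is a minimal sketch of an index-based scan, not the code from this PR. The names `HexPairScanner` and `HexValue` are hypothetical, and the sketch assumes the input contains only hex pairs and whitespace (no offset or ASCII columns). On the success path the only heap allocation is the `List<byte>` and its internal buffer.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only.
public static class HexPairScanner
{
    // Returns the value of a hex digit, or -1 if the character is not one.
    private static int HexValue(char c)
    {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    // Walks the string by index: no Split, Substring, Replace, LINQ or StringReader.
    public static List<byte> Parse(string text)
    {
        var result = new List<byte>();
        int i = 0;
        while (i < text.Length)
        {
            if (char.IsWhiteSpace(text[i])) { i++; continue; }

            if (i + 1 >= text.Length)
                throw new FormatException("Dangling hex digit at index " + i);

            int hi = HexValue(text[i]);
            int lo = HexValue(text[i + 1]);
            if (hi < 0 || lo < 0)
                throw new FormatException("Invalid hex pair at index " + i);

            result.Add((byte)((hi << 4) | lo));
            i += 2;
        }
        return result;
    }
}
```

Calling `HexPairScanner.Parse("01 0a\r\nFF")` yields the three bytes 0x01, 0x0A and 0xFF. Line breaks are treated as ordinary whitespace, so no per-line strings are needed.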

Author

The ParseLookup implementation seems the most promising one.
I will look at a state machine to handle the special cases.

//00000000 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................

var result = new List<byte>();
var lines = dump.Split(Environment.NewLine.ToCharArray());
Owner

Using Environment.NewLine means you can get different behaviours on different machines for the same input. This kind of thing regularly breaks CI, for example. I'd rather see split explicitly on '\r' and '\n' and remove empty entries (though I'd rather not use String.Split at all as it allocates an array and temporary strings, both of which can be avoided).
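
A minimal sketch of the explicit split described above, for illustration only. `LineSplitter` and `SplitLines` are hypothetical names. As the comment notes, `String.Split` still allocates an array and one string per line, so this only removes the platform dependence.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class LineSplitter
{
    // Fixed separators: the result no longer depends on Environment.NewLine.
    private static readonly char[] LineBreaks = { '\r', '\n' };

    public static string[] SplitLines(string dump)
    {
        // RemoveEmptyEntries drops the empty string between '\r' and '\n',
        // as well as blank lines and any trailing line break.
        return dump.Split(LineBreaks, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

With this, `"a\r\nb\nc\r"` splits into the same three lines on Windows, Linux and macOS.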

Author

Changed it to use StringReader and removed the ToArray() call in the LINQ query.


int hexaWidth = (columnWidth * 3 * columnCount) + columnCount - 1 - 1;
var _re = new Regex($"^{rio}(?<hexa>[0-9a-f\\s]{{{hexaWidth}}}){ria}$",
RegexOptions.Compiled | RegexOptions.IgnoreCase);
Owner

I would avoid compiling the regex here, given it's only used once.

Author

Totally right. Compiling the regex is pointless here and takes too much time.

/// <param name="includeOffset"></param>
/// <param name="includeAscii"></param>
/// <returns></returns>
public static byte[] Parse(string dump, int columnWidth = 8, int columnCount = 2, bool includeOffset = true, bool includeAscii = true)
Owner

Do you think it's possible to make parsing work without so many options?

00000000   01 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................

What if the parser just looked for valid pairs of hex characters separated by whitespace?

  • The offset would have to have more than two characters to be automatically excluded
  • The ASCII section could contain e.g. 00 01 02

The second point is the hardest one to deal with.
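
A minimal sketch of that heuristic, assuming whitespace-separated tokens and hypothetical names (`HexTokenParser`, `HexValue`). It is not the implementation in this PR. It keeps only tokens that are exactly two hex digits, which shows both points: a longer offset is skipped automatically, but hex-looking text in the ASCII column is wrongly accepted.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only.
public static class HexTokenParser
{
    // Returns the value of a hex digit, or -1 if the character is not one.
    private static int HexValue(char c)
    {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    public static List<byte> Parse(string text)
    {
        var result = new List<byte>();
        int i = 0;
        while (i < text.Length)
        {
            // Skip whitespace, then measure the next token.
            while (i < text.Length && char.IsWhiteSpace(text[i])) i++;
            int start = i;
            while (i < text.Length && !char.IsWhiteSpace(text[i])) i++;

            // Only a token of exactly two hex digits counts as a byte.
            if (i - start == 2)
            {
                int hi = HexValue(text[start]);
                int lo = HexValue(text[start + 1]);
                if (hi >= 0 && lo >= 0) result.Add((byte)((hi << 4) | lo));
            }
        }
        return result;
    }
}
```

For the line `00000000   41 42   AB`, the eight-character offset is ignored, but the ASCII rendering `AB` of the two bytes is itself a valid hex pair, so the result is 0x41, 0x42, 0xAB: one byte too many.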

Author

A basic state machine will do the trick.

Owner

Consider the case:

00000000   30 31 20 30 32 30 30 33 20 30 34 20 30 35 20 30   01 02 03 04 05 0

Author

ParseStateMachine takes into account special cases like this one.
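
The PR's ParseStateMachine is not shown in this thread. The following is only a sketch of how a two-state machine could handle the line above, under one loud assumption: the ASCII column is separated from the hex section by at least three blanks, as in the sample. All names (`HexDumpStateMachine`, `State`, `HexValue`) are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch, not the PR's ParseStateMachine.
public static class HexDumpStateMachine
{
    private enum State { Hex, Ascii }

    // Returns the value of a hex digit, or -1 if the character is not one.
    private static int HexValue(char c)
    {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    public static List<byte> Parse(string dump)
    {
        var result = new List<byte>();
        var state = State.Hex;
        int tokenLength = 0;   // characters seen in the current token
        int hi = -1, lo = -1;  // values of the token's first two characters
        int gap = 0;           // consecutive blanks since the last token
        int bytesOnLine = 0;   // bytes emitted from the current line

        for (int i = 0; i <= dump.Length; i++)
        {
            // A virtual '\n' after the last character flushes the final token.
            char c = i < dump.Length ? dump[i] : '\n';
            bool newline = c == '\r' || c == '\n';

            if (state == State.Ascii)
            {
                // Ignore everything in the ASCII column until the line ends.
                if (newline) { state = State.Hex; bytesOnLine = 0; gap = 0; }
                continue;
            }

            if (c == ' ' || c == '\t' || newline)
            {
                // End of a token: keep it only if it is exactly one hex pair.
                // Longer tokens (the offset) are dropped.
                if (tokenLength == 2 && hi >= 0 && lo >= 0)
                {
                    result.Add((byte)((hi << 4) | lo));
                    bytesOnLine++;
                }
                tokenLength = 0;

                if (newline) { bytesOnLine = 0; gap = 0; }
                // Assumption: three or more blanks after hex data start the ASCII column.
                else if (++gap >= 3 && bytesOnLine > 0) state = State.Ascii;
            }
            else
            {
                if (tokenLength == 0) hi = HexValue(c);
                else if (tokenLength == 1) lo = HexValue(c);
                tokenLength++;
                gap = 0;
            }
        }
        return result;
    }
}
```

On the sample line, the offset is dropped, the 16 hex pairs are emitted, and the `01 02 03 04 05 0` text is skipped because it follows a three-blank gap. The three-blank rule is a formatting assumption; a dump that separates its ASCII column differently would need another way to detect the boundary, such as a fixed byte count per line.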

@drewnoakes
Owner

I'd be interested to see benchmark results.

@lduchosal
Author

The only feature-complete implementations are ParseRegex and ParseStateMachine.
ParseLookup*, ParseConvert, ParseDic and ParseSimd do not take into account hex pairs in the ASCII column or a missing offset column, so they are not a fair comparison.

ParseStateMachine is the best so far.

// * Summary *

BenchmarkDotNet=v0.12.0, OS=macOS 10.15.4 (19E266) [Darwin 19.4.0]
Intel Core i7-8750H CPU 2.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=3.1.200
[Host] : .NET Core 3.1.2 (CoreCLR 4.700.20.6602, CoreFX 4.700.20.6702), X64 RyuJIT
Job-ZWSPBT : .NET Core 3.1.2 (CoreCLR 4.700.20.6602, CoreFX 4.700.20.6702), X64 RyuJIT
ShortRun : .NET Core 3.1.2 (CoreCLR 4.700.20.6602, CoreFX 4.700.20.6702), X64 RyuJIT

Runtime=.NET Core 3.1

| Method | Job | _len | Mean | Error | StdDev | Median | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| ParseSimd | Default | 10 | 398.0 ns | 7.87 ns | 7.73 ns | 398.5 ns | 0.1407 | - | - | 664 B |
| ParseLookup1 | Default | 10 | 207.6 ns | 4.43 ns | 4.35 ns | 206.5 ns | 0.0372 | - | - | 176 B |
| ParseLookup2 | Default | 10 | 375.3 ns | 2.33 ns | 2.18 ns | 374.8 ns | 0.0372 | - | - | 176 B |
| ParseStasteMachine | Default | 10 | 809.7 ns | 10.26 ns | 9.60 ns | 810.1 ns | 0.1984 | - | - | 936 B |
| ParseConvert | Default | 10 | 581.8 ns | 8.00 ns | 7.48 ns | 581.4 ns | 0.1049 | - | - | 496 B |
| ParseDic | Default | 10 | 377.0 ns | 2.90 ns | 2.71 ns | 377.2 ns | 0.0372 | - | - | 176 B |
| ParseRegex | Default | 10 | 4,164,130.3 ns | 56,327.77 ns | 52,689.03 ns | 4,182,981.4 ns | - | - | - | 28434 B |
| ParseSimd | Default | 100 | 1,928.9 ns | 37.68 ns | 38.70 ns | 1,925.4 ns | 0.7973 | - | - | 3768 B |
| ParseLookup1 | Default | 100 | 1,005.9 ns | 11.36 ns | 10.07 ns | 1,006.6 ns | 0.1183 | - | - | 560 B |
| ParseLookup2 | Default | 100 | 2,465.1 ns | 23.86 ns | 18.63 ns | 2,469.7 ns | 0.1183 | - | - | 560 B |
| ParseStasteMachine | Default | 100 | 1,315.6 ns | 12.08 ns | 11.30 ns | 1,313.4 ns | 0.2842 | - | - | 1344 B |
| ParseConvert | Default | 100 | 4,255.7 ns | 42.84 ns | 40.07 ns | 4,247.6 ns | 0.7935 | - | - | 3760 B |
| ParseDic | Default | 100 | 2,581.3 ns | 7.66 ns | 7.16 ns | 2,579.4 ns | 0.1183 | - | - | 560 B |
| ParseRegex | Default | 100 | 4,021,113.2 ns | 44,650.63 ns | 41,766.23 ns | 4,006,587.6 ns | 7.8125 | 7.8125 | - | 40095 B |
| ParseSimd | Default | 1000 | 17,823.8 ns | 352.53 ns | 824.03 ns | 17,744.9 ns | 6.3629 | 0.1068 | - | 29944 B |
| ParseLookup1 | Default | 1000 | 8,833.3 ns | 175.97 ns | 490.53 ns | 8,923.9 ns | 0.7019 | - | - | 3320 B |
| ParseLookup2 | Default | 1000 | 22,266.0 ns | 442.74 ns | 1,158.56 ns | 21,734.3 ns | 0.7019 | - | - | 3320 B |
| ParseStasteMachine | Default | 1000 | 6,128.0 ns | 108.26 ns | 128.88 ns | 6,147.1 ns | 0.8698 | - | - | 4104 B |
| ParseConvert | Default | 1000 | 45,591.6 ns | 908.72 ns | 1,414.77 ns | 45,348.1 ns | 7.4463 | - | - | 35320 B |
| ParseDic | Default | 1000 | 23,885.4 ns | 110.17 ns | 92.00 ns | 23,873.6 ns | 0.7019 | - | - | 3320 B |
| ParseRegex | Default | 1000 | 5,510,959.0 ns | 33,224.21 ns | 29,452.40 ns | 5,517,129.5 ns | 31.2500 | 15.6250 | - | 148313 B |
| ParseSimd | Default | 10000 | 188,767.6 ns | 3,709.28 ns | 4,951.78 ns | 186,187.5 ns | 83.2520 | - | - | 392896 B |
| ParseLookup1 | Default | 10000 | 75,740.7 ns | 1,509.18 ns | 1,853.41 ns | 75,747.4 ns | 9.1553 | 0.3662 | - | 43136 B |
| ParseLookup2 | Default | 10000 | 217,959.2 ns | 5,740.22 ns | 11,059.45 ns | 213,729.0 ns | 9.0332 | 0.2441 | - | 43136 B |
| ParseStasteMachine | Default | 10000 | 54,167.0 ns | 1,071.17 ns | 1,466.23 ns | 54,317.8 ns | 9.3384 | 0.3662 | - | 43920 B |
| ParseConvert | Default | 10000 | 455,312.7 ns | 9,078.46 ns | 9,322.91 ns | 454,262.1 ns | 76.6602 | 4.3945 | - | 363136 B |
| ParseDic | Default | 10000 | 236,517.2 ns | 2,347.11 ns | 2,080.65 ns | 236,509.0 ns | 9.0332 | 0.2441 | - | 43136 B |
| ParseRegex | Default | 10000 | 7,760,028.0 ns | 39,873.13 ns | 37,297.35 ns | 7,754,200.2 ns | 257.8125 | 7.8125 | - | 1244956 B |
| ParseSimd | Default | 100000 | 2,252,414.6 ns | 26,816.03 ns | 25,083.73 ns | 2,253,281.6 ns | 878.9063 | 835.9375 | 824.2188 | 3398327 B |
| ParseLookup1 | Default | 100000 | 844,646.0 ns | 7,211.42 ns | 6,745.56 ns | 845,149.3 ns | 95.7031 | 69.3359 | 68.3594 | 362854 B |
| ParseLookup2 | Default | 100000 | 2,238,950.8 ns | 42,998.95 ns | 51,187.20 ns | 2,240,066.0 ns | 93.7500 | 66.4063 | 66.4063 | 362743 B |
| ParseStasteMachine | Default | 100000 | 593,549.1 ns | 11,356.83 ns | 13,519.50 ns | 592,267.7 ns | 83.0078 | 40.0391 | 40.0391 | 345683 B |
| ParseConvert | Default | 100000 | 4,995,479.4 ns | 86,894.57 ns | 81,281.23 ns | 4,978,550.9 ns | 765.6250 | 140.6250 | 62.5000 | 3562641 B |
| ParseDic | Default | 100000 | 2,556,374.2 ns | 49,970.68 ns | 88,822.83 ns | 2,563,986.6 ns | 93.7500 | 66.4063 | 66.4063 | 362797 B |
| ParseRegex | Default | 100000 | 29,308,015.6 ns | 314,524.24 ns | 278,817.54 ns | 29,341,885.7 ns | 2593.7500 | 250.0000 | 62.5000 | 12156608 B |
| ParseSimd | Default | 1000000 | 31,504,537.5 ns | 624,777.69 ns | 990,963.63 ns | 31,804,790.1 ns | 1062.5000 | 1062.5000 | 1062.5000 | 29778298 B |
| ParseLookup1 | Default | 1000000 | 9,860,476.9 ns | 89,285.30 ns | 83,517.53 ns | 9,851,358.5 ns | 140.6250 | 125.0000 | 125.0000 | 3097681 B |
| ParseLookup2 | Default | 1000000 | 25,764,063.9 ns | 89,020.32 ns | 83,269.66 ns | 25,727,866.8 ns | 125.0000 | 125.0000 | 125.0000 | 3097672 B |
| ParseStasteMachine | Default | 1000000 | 5,491,525.6 ns | 60,442.08 ns | 56,537.56 ns | 5,488,614.8 ns | 125.0000 | 101.5625 | 101.5625 | 1571364 B |
| ParseConvert | Default | 1000000 | 49,334,426.7 ns | 350,510.63 ns | 327,867.86 ns | 49,409,953.5 ns | 6818.1818 | - | - | 35098524 B |
| ParseDic | Default | 1000000 | 28,297,394.5 ns | 279,271.09 ns | 261,230.35 ns | 28,289,751.3 ns | 125.0000 | 125.0000 | 125.0000 | 3097670 B |
| ParseRegex | Default | 1000000 | 239,392,171.3 ns | 4,331,922.98 ns | 4,052,083.45 ns | 239,547,237.3 ns | 25000.0000 | - | - | 121093053 B |
| ParseSimd | Default | 10000000 | 324,154,914.9 ns | 6,073,125.69 ns | 6,498,176.44 ns | 324,871,152.0 ns | 1000.0000 | 1000.0000 | 1000.0000 | 398441368 B |
| ParseLookup1 | Default | 10000000 | 95,645,622.3 ns | 1,894,442.35 ns | 2,835,512.11 ns | 96,720,511.8 ns | - | - | - | 43555056 B |
| ParseLookup2 | Default | 10000000 | 265,436,727.6 ns | 5,279,863.17 ns | 14,274,417.11 ns | 267,626,024.5 ns | - | - | - | 43555088 B |
| ParseStasteMachine | Default | 10000000 | 55,162,209.4 ns | 1,018,040.99 ns | 952,276.18 ns | 55,269,276.5 ns | - | - | - | 22057717 B |
| ParseConvert | Default | 10000000 | 485,470,815.9 ns | 9,360,966.90 ns | 10,016,129.71 ns | 485,518,774.0 ns | 68000.0000 | - | - | 363555816 B |
| ParseDic | Default | 10000000 | 279,954,373.4 ns | 5,275,073.50 ns | 4,934,307.02 ns | 279,560,809.5 ns | - | - | - | 43555296 B |
| ParseRegex | Default | 10000000 | 2,229,431,891.0 ns | 47,299,004.43 ns | 58,087,425.16 ns | 2,212,341,288.5 ns | 250000.0000 | - | - | 1223549040 B |

// * Legends *
_len : Value of the '_len' parameter
Mean : Arithmetic mean of all measurements
Error : Half of 99.9% confidence interval
StdDev : Standard deviation of all measurements
Median : Value separating the higher half of all measurements (50th percentile)
Gen 0 : GC Generation 0 collects per 1000 operations
Gen 1 : GC Generation 1 collects per 1000 operations
Gen 2 : GC Generation 2 collects per 1000 operations
Allocated : Allocated memory per single operation (managed only, inclusive, 1KB = 1024B)
1 ns : 1 Nanosecond (0.000000001 sec)

// * Diagnostic Output - MemoryDiagnoser *

// ***** BenchmarkRunner: End *****

@lduchosal
Author

Benchmark of the feature-complete implementations only.

// * Summary *

BenchmarkDotNet=v0.12.0, OS=macOS 10.15.4 (19E266) [Darwin 19.4.0]
Intel Core i7-8750H CPU 2.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=3.1.200
[Host] : .NET Core 3.1.2 (CoreCLR 4.700.20.6602, CoreFX 4.700.20.6702), X64 RyuJIT
Job-PKYYLI : .NET Core 3.1.2 (CoreCLR 4.700.20.6602, CoreFX 4.700.20.6702), X64 RyuJIT

Runtime=.NET Core 3.1

| Method | _len | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| ParseStasteMachine | 10 | 786.1 ns | 16.70 ns | 15.62 ns | 0.1984 | - | - | 936 B |
| ParseRegex | 10 | 4,059,551.3 ns | 21,944.37 ns | 19,453.11 ns | - | - | - | 28434 B |
| ParseStasteMachine | 100 | 1,348.8 ns | 5.96 ns | 5.57 ns | 0.2842 | - | - | 1344 B |
| ParseRegex | 100 | 4,100,269.9 ns | 54,937.82 ns | 51,388.87 ns | 7.8125 | 7.8125 | - | 40091 B |
| ParseStasteMachine | 1000 | 5,361.1 ns | 29.78 ns | 27.86 ns | 0.8698 | - | - | 4104 B |
| ParseRegex | 1000 | 4,265,290.3 ns | 33,367.12 ns | 31,211.63 ns | 31.2500 | 15.6250 | - | 148313 B |
| ParseStasteMachine | 10000 | 45,081.0 ns | 202.43 ns | 189.35 ns | 9.3384 | 0.4272 | - | 43920 B |
| ParseRegex | 10000 | 6,037,076.9 ns | 39,285.25 ns | 34,825.35 ns | 257.8125 | 23.4375 | - | 1244974 B |
| ParseStasteMachine | 100000 | 469,396.6 ns | 6,485.16 ns | 6,066.22 ns | 83.4961 | 41.0156 | 41.0156 | 345683 B |
| ParseRegex | 100000 | 24,066,503.8 ns | 164,110.50 ns | 145,479.68 ns | 2593.7500 | 250.0000 | 62.5000 | 12156609 B |
| ParseStasteMachine | 1000000 | 4,322,987.1 ns | 58,760.00 ns | 52,089.21 ns | 132.8125 | 109.3750 | 109.3750 | 1571404 B |
| ParseRegex | 1000000 | 203,545,505.2 ns | 607,934.18 ns | 507,652.57 ns | 25000.0000 | - | - | 121091632 B |
| ParseStasteMachine | 10000000 | 49,810,917.9 ns | 773,578.13 ns | 723,605.47 ns | 90.9091 | 90.9091 | 90.9091 | 22057721 B |
| ParseRegex | 10000000 | 2,002,033,220.9 ns | 6,521,205.88 ns | 6,099,940.02 ns | 250000.0000 | - | - | 1223553024 B |

// * Hints *
Outliers
FormatBenchmark.ParseRegex: Runtime=.NET Core 3.1 -> 1 outlier was removed (4.17 ms)
FormatBenchmark.ParseRegex: Runtime=.NET Core 3.1 -> 1 outlier was removed (6.16 ms)
FormatBenchmark.ParseRegex: Runtime=.NET Core 3.1 -> 1 outlier was removed (24.51 ms)
FormatBenchmark.ParseStasteMachine: Runtime=.NET Core 3.1 -> 1 outlier was removed (4.45 ms)
FormatBenchmark.ParseRegex: Runtime=.NET Core 3.1 -> 2 outliers were removed (205.34 ms, 208.47 ms)

// * Legends *
_len : Value of the '_len' parameter
Mean : Arithmetic mean of all measurements
Error : Half of 99.9% confidence interval
StdDev : Standard deviation of all measurements
Gen 0 : GC Generation 0 collects per 1000 operations
Gen 1 : GC Generation 1 collects per 1000 operations
Gen 2 : GC Generation 2 collects per 1000 operations
Allocated : Allocated memory per single operation (managed only, inclusive, 1KB = 1024B)
1 ns : 1 Nanosecond (0.000000001 sec)

// * Diagnostic Output - MemoryDiagnoser *
