Added Parse functionality and some unit tests #3

lduchosal · 2020-03-29T21:14:01Z

This might be of interest

drewnoakes

Thanks for following up! I've added some comments.

drewnoakes · 2020-03-29T21:54:23Z

HexDump/HexDump_Parse.cs

+        /// - includeOffset = true
+        /// - includeAscii = true


These documented restrictions don't appear to match the regular expression pattern.

Implemented in Parse so we have the same behaviour in both methods.

drewnoakes · 2020-03-29T21:57:27Z

HexDump/HexDump_Parse.cs

+        /// </summary>
+        /// <param name="dump"></param>
+        /// <returns></returns>
+        public static byte[] Parse(string dump )


Just an FYI: This method allocates a lot of temporary memory on the heap. It is possible to write this without allocating anything other than the List<byte>, though the parsing has to be more manual. I can provide more details if you're interested.

Glad to see any improvement if you feel it's worth it.

Whether it's worth it depends upon the consumer's situation. If they're only parsing one short blob of text then it's probably fine as is. If they're running many operations concurrently or in an application that's sensitive to GC pauses, then the current implementation may be problematic. This is general library code, so I try to be as well behaved as possible as we cannot make many assumptions about the user's requirements.

At a high level, all of these can be avoided:

- A string per line of the input
- A Replace string per line
- A string per hex character pair
- Enumerators (via Linq) per line
- StringReader per call
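
As a rough illustration of the points above, here is a minimal sketch of decoding hex pairs by walking the input string by index, so that the only allocation on the success path is the `List<byte>`. The names `HexPairs`, `HexValue` and `Decode` are hypothetical and not part of this PR, and the sketch assumes the input contains only hex pairs separated by whitespace (no offset or ASCII column).

```csharp
using System;
using System.Collections.Generic;

public static class HexPairs
{
    // Returns 0..15 for a hex digit, or -1 for any other character.
    private static int HexValue(char c)
    {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    // Hypothetical helper: decodes whitespace-separated hex pairs such as "01 ff 2A".
    // It indexes into the string directly, so no substrings, arrays,
    // enumerators or readers are created while parsing.
    public static List<byte> Decode(string text)
    {
        var result = new List<byte>(text.Length / 3 + 1);
        int i = 0;
        while (i < text.Length)
        {
            // Spaces, tabs and line breaks are all treated as separators.
            if (char.IsWhiteSpace(text[i])) { i++; continue; }

            int hi = HexValue(text[i]);
            int lo = i + 1 < text.Length ? HexValue(text[i + 1]) : -1;
            if (hi < 0 || lo < 0)
                throw new FormatException($"Expected a hex pair at index {i}.");

            result.Add((byte)((hi << 4) | lo));
            i += 2;
        }
        return result;
    }
}
```

The trade-off is the one mentioned earlier: the parsing is more manual, and anything beyond plain hex pairs (offsets, an ASCII column) needs extra state.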

The ParseLookup implementation seems the most promising one.
I will have a look at a state machine to handle the special cases.

drewnoakes · 2020-03-29T22:00:03Z

HexDump/HexDump_Parse.cs

+            //00000000   01 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
+
+            var result = new List<byte>();
+            var lines = dump.Split(Environment.NewLine.ToCharArray());


Using Environment.NewLine means you can get different behaviours on different machines for the same input. This kind of thing regularly breaks CI, for example. I'd rather see split explicitly on '\r' and '\n' and remove empty entries (though I'd rather not use String.Split at all as it allocates an array and temporary strings, both of which can be avoided).
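
For reference, a minimal sketch of the explicit split described above. `LineSplitting` and `SplitLines` are hypothetical names used only for illustration. It still uses `String.Split`, so it fixes the platform dependence but not the allocations.

```csharp
using System;

public static class LineSplitting
{
    private static readonly char[] LineBreaks = { '\r', '\n' };

    // Splits on both line-break characters regardless of the host platform,
    // dropping the empty entries produced by "\r\n" pairs and blank lines.
    public static string[] SplitLines(string dump) =>
        dump.Split(LineBreaks, StringSplitOptions.RemoveEmptyEntries);
}
```

With `Environment.NewLine.ToCharArray()`, the separator set is `'\r'` and `'\n'` on Windows but only `'\n'` on Unix-like systems, so a dump with Windows line endings would keep a trailing `'\r'` on each line when parsed on Linux or macOS.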

Changed it to use StringReader and removed the ToArray() in the LINQ query.

drewnoakes · 2020-03-30T19:03:27Z

HexDump/HexDump_Parse.cs

+
+            int hexaWidth = (columnWidth * 3 * columnCount) + columnCount - 1 - 1;
+            var _re = new Regex($"^{rio}(?<hexa>[0-9a-f\\s]{{{hexaWidth}}}){ria}$",
+                RegexOptions.Compiled | RegexOptions.IgnoreCase);


I would avoid compiling the regex here, given it's only used once.

Totally right, compiling the regex is pointless here and takes too much time.
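
A minimal sketch of the same kind of pattern built without `RegexOptions.Compiled`. The helper name `LooksLikeHexRun` is hypothetical, and the pattern is reduced to the hex part only (the `rio` and `ria` fragments from the diff are left out because their contents are not shown here).

```csharp
using System.Text.RegularExpressions;

public static class HexLinePattern
{
    // Hypothetical helper: checks that a line consists of exactly
    // hexaWidth hex digits or whitespace characters.
    // The regex is interpreted rather than compiled, which avoids the
    // one-off compilation cost when the pattern is used only once.
    public static bool LooksLikeHexRun(string line, int hexaWidth)
    {
        var re = new Regex($"^[0-9a-f\\s]{{{hexaWidth}}}$", RegexOptions.IgnoreCase);
        return re.IsMatch(line);
    }
}
```

`RegexOptions.Compiled` tends to pay off only when the same `Regex` instance is reused many times, for example when held in a static field.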

drewnoakes · 2020-03-30T19:06:39Z

HexDump/HexDump_Parse.cs

+        /// <param name="includeOffset"></param>
+        /// <param name="includeAscii"></param>
+        /// <returns></returns>
+        public static byte[] Parse(string dump, int columnWidth = 8, int columnCount = 2, bool includeOffset = true, bool includeAscii = true)


Do you think it's possible to make parsing work without so many options?

00000000 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................

What if the parser just looked for valid pairs of hex characters separated by whitespace?

The offset would have to have more than two characters to be automatically excluded

The ASCII section could contain e.g. 00 01 02

The second point is the hardest one to deal with.

A basic state machine will do the trick.

Consider the case:

00000000 30 31 20 30 32 20 30 33 20 30 34 20 30 35 20 30 01 02 03 04 05 0

ParseStateMachine takes into account special cases like this one.
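
To make the idea concrete, here is a minimal sketch of such a state machine. It is not the `ParseStateMachine` from this PR, and the names `HexDumpSketch`, `State` and `bytesPerLine` are hypothetical. It rests on three assumptions: a leading token longer than two characters is an offset, a line carries at most `bytesPerLine` hex pairs, and a gap of three or more whitespace characters after the first pair starts the ASCII column.

```csharp
using System;
using System.Collections.Generic;

public static class HexDumpSketch
{
    private enum State { LineStart, Hex, SkipLine }

    // Returns 0..15 for a hex digit, or -1 for any other character.
    private static int HexValue(char c)
    {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    public static byte[] Parse(string dump, int bytesPerLine = 16)
    {
        var result = new List<byte>();
        var state = State.LineStart;
        int onLine = 0; // bytes decoded on the current line
        int gap = 0;    // whitespace run length since the last hex pair
        int i = 0;

        while (i < dump.Length)
        {
            char c = dump[i];

            // Any line break resets the machine for the next line.
            if (c == '\r' || c == '\n')
            {
                state = State.LineStart; onLine = 0; gap = 0; i++;
                continue;
            }

            switch (state)
            {
                case State.LineStart:
                {
                    if (char.IsWhiteSpace(c)) { i++; break; }
                    // Measure the first token; longer than a pair means offset.
                    int end = i;
                    while (end < dump.Length && !char.IsWhiteSpace(dump[end])) end++;
                    if (end - i > 2) i = end; // skip the offset column
                    state = State.Hex;
                    break;
                }
                case State.Hex:
                {
                    if (char.IsWhiteSpace(c))
                    {
                        i++;
                        // A wide gap after at least one pair: ASCII column follows.
                        if (++gap >= 3 && onLine > 0) state = State.SkipLine;
                        break;
                    }
                    int hi = HexValue(c);
                    int lo = i + 1 < dump.Length ? HexValue(dump[i + 1]) : -1;
                    bool ends = i + 2 >= dump.Length || char.IsWhiteSpace(dump[i + 2]);
                    // Anything that is not an isolated hex pair ends the hex section.
                    if (hi < 0 || lo < 0 || !ends) { state = State.SkipLine; break; }
                    result.Add((byte)((hi << 4) | lo));
                    onLine++; gap = 0; i += 2;
                    // Once the line is full, the rest can only be ASCII.
                    if (onLine == bytesPerLine) state = State.SkipLine;
                    break;
                }
                case State.SkipLine:
                    i++; // ignore everything up to the next line break
                    break;
            }
        }
        return result.ToArray();
    }
}
```

The per-line byte limit is what protects against an ASCII column that itself looks like hex pairs on a full line, while the gap rule covers a short final line. A dump whose layout breaks either assumption would need different rules.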

HexDump.Benchmark/FormatBenchmark.cs

drewnoakes · 2020-04-06T00:13:32Z

I'd be interested to see benchmark results.

lduchosal · 2020-04-06T07:55:12Z

The only feature-complete implementations are ParseRegex and ParseStateMachine.
ParseLookup*, ParseConvert, ParseDic and ParseSimd do not take into account hex in the ASCII column or a missing offset column, so they are not fair to compare with.

ParseStateMachine is the best so far.

// * Summary *

BenchmarkDotNet=v0.12.0, OS=macOS 10.15.4 (19E266) [Darwin 19.4.0]
Intel Core i7-8750H CPU 2.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=3.1.200
[Host] : .NET Core 3.1.2 (CoreCLR 4.700.20.6602, CoreFX 4.700.20.6702), X64 RyuJIT
Job-ZWSPBT : .NET Core 3.1.2 (CoreCLR 4.700.20.6602, CoreFX 4.700.20.6702), X64 RyuJIT
ShortRun : .NET Core 3.1.2 (CoreCLR 4.700.20.6602, CoreFX 4.700.20.6702), X64 RyuJIT

Runtime=.NET Core 3.1

Method	Job	_len	Mean	Error	StdDev	Median	Gen 0	Gen 1	Gen 2	Allocated
ParseSimd	Default	10	398.0 ns	7.87 ns	7.73 ns	398.5 ns	0.1407	-	-	664 B
ParseLookup1	Default	10	207.6 ns	4.43 ns	4.35 ns	206.5 ns	0.0372	-	-	176 B
ParseLookup2	Default	10	375.3 ns	2.33 ns	2.18 ns	374.8 ns	0.0372	-	-	176 B
ParseStasteMachine	Default	10	809.7 ns	10.26 ns	9.60 ns	810.1 ns	0.1984	-	-	936 B
ParseConvert	Default	10	581.8 ns	8.00 ns	7.48 ns	581.4 ns	0.1049	-	-	496 B
ParseDic	Default	10	377.0 ns	2.90 ns	2.71 ns	377.2 ns	0.0372	-	-	176 B
ParseRegex	Default	10	4,164,130.3 ns	56,327.77 ns	52,689.03 ns	4,182,981.4 ns	-	-	-	28434 B
ParseSimd	Default	100	1,928.9 ns	37.68 ns	38.70 ns	1,925.4 ns	0.7973	-	-	3768 B
ParseLookup1	Default	100	1,005.9 ns	11.36 ns	10.07 ns	1,006.6 ns	0.1183	-	-	560 B
ParseLookup2	Default	100	2,465.1 ns	23.86 ns	18.63 ns	2,469.7 ns	0.1183	-	-	560 B
ParseStasteMachine	Default	100	1,315.6 ns	12.08 ns	11.30 ns	1,313.4 ns	0.2842	-	-	1344 B
ParseConvert	Default	100	4,255.7 ns	42.84 ns	40.07 ns	4,247.6 ns	0.7935	-	-	3760 B
ParseDic	Default	100	2,581.3 ns	7.66 ns	7.16 ns	2,579.4 ns	0.1183	-	-	560 B
ParseRegex	Default	100	4,021,113.2 ns	44,650.63 ns	41,766.23 ns	4,006,587.6 ns	7.8125	7.8125	-	40095 B
ParseSimd	Default	1000	17,823.8 ns	352.53 ns	824.03 ns	17,744.9 ns	6.3629	0.1068	-	29944 B
ParseLookup1	Default	1000	8,833.3 ns	175.97 ns	490.53 ns	8,923.9 ns	0.7019	-	-	3320 B
ParseLookup2	Default	1000	22,266.0 ns	442.74 ns	1,158.56 ns	21,734.3 ns	0.7019	-	-	3320 B
ParseStasteMachine	Default	1000	6,128.0 ns	108.26 ns	128.88 ns	6,147.1 ns	0.8698	-	-	4104 B
ParseConvert	Default	1000	45,591.6 ns	908.72 ns	1,414.77 ns	45,348.1 ns	7.4463	-	-	35320 B
ParseDic	Default	1000	23,885.4 ns	110.17 ns	92.00 ns	23,873.6 ns	0.7019	-	-	3320 B
ParseRegex	Default	1000	5,510,959.0 ns	33,224.21 ns	29,452.40 ns	5,517,129.5 ns	31.2500	15.6250	-	148313 B
ParseSimd	Default	10000	188,767.6 ns	3,709.28 ns	4,951.78 ns	186,187.5 ns	83.2520	-	-	392896 B
ParseLookup1	Default	10000	75,740.7 ns	1,509.18 ns	1,853.41 ns	75,747.4 ns	9.1553	0.3662	-	43136 B
ParseLookup2	Default	10000	217,959.2 ns	5,740.22 ns	11,059.45 ns	213,729.0 ns	9.0332	0.2441	-	43136 B
ParseStasteMachine	Default	10000	54,167.0 ns	1,071.17 ns	1,466.23 ns	54,317.8 ns	9.3384	0.3662	-	43920 B
ParseConvert	Default	10000	455,312.7 ns	9,078.46 ns	9,322.91 ns	454,262.1 ns	76.6602	4.3945	-	363136 B
ParseDic	Default	10000	236,517.2 ns	2,347.11 ns	2,080.65 ns	236,509.0 ns	9.0332	0.2441	-	43136 B
ParseRegex	Default	10000	7,760,028.0 ns	39,873.13 ns	37,297.35 ns	7,754,200.2 ns	257.8125	7.8125	-	1244956 B
ParseSimd	Default	100000	2,252,414.6 ns	26,816.03 ns	25,083.73 ns	2,253,281.6 ns	878.9063	835.9375	824.2188	3398327 B
ParseLookup1	Default	100000	844,646.0 ns	7,211.42 ns	6,745.56 ns	845,149.3 ns	95.7031	69.3359	68.3594	362854 B
ParseLookup2	Default	100000	2,238,950.8 ns	42,998.95 ns	51,187.20 ns	2,240,066.0 ns	93.7500	66.4063	66.4063	362743 B
ParseStasteMachine	Default	100000	593,549.1 ns	11,356.83 ns	13,519.50 ns	592,267.7 ns	83.0078	40.0391	40.0391	345683 B
ParseConvert	Default	100000	4,995,479.4 ns	86,894.57 ns	81,281.23 ns	4,978,550.9 ns	765.6250	140.6250	62.5000	3562641 B
ParseDic	Default	100000	2,556,374.2 ns	49,970.68 ns	88,822.83 ns	2,563,986.6 ns	93.7500	66.4063	66.4063	362797 B
ParseRegex	Default	100000	29,308,015.6 ns	314,524.24 ns	278,817.54 ns	29,341,885.7 ns	2593.7500	250.0000	62.5000	12156608 B
ParseSimd	Default	1000000	31,504,537.5 ns	624,777.69 ns	990,963.63 ns	31,804,790.1 ns	1062.5000	1062.5000	1062.5000	29778298 B
ParseLookup1	Default	1000000	9,860,476.9 ns	89,285.30 ns	83,517.53 ns	9,851,358.5 ns	140.6250	125.0000	125.0000	3097681 B
ParseLookup2	Default	1000000	25,764,063.9 ns	89,020.32 ns	83,269.66 ns	25,727,866.8 ns	125.0000	125.0000	125.0000	3097672 B
ParseStasteMachine	Default	1000000	5,491,525.6 ns	60,442.08 ns	56,537.56 ns	5,488,614.8 ns	125.0000	101.5625	101.5625	1571364 B
ParseConvert	Default	1000000	49,334,426.7 ns	350,510.63 ns	327,867.86 ns	49,409,953.5 ns	6818.1818	-	-	35098524 B
ParseDic	Default	1000000	28,297,394.5 ns	279,271.09 ns	261,230.35 ns	28,289,751.3 ns	125.0000	125.0000	125.0000	3097670 B
ParseRegex	Default	1000000	239,392,171.3 ns	4,331,922.98 ns	4,052,083.45 ns	239,547,237.3 ns	25000.0000	-	-	121093053 B
ParseSimd	Default	10000000	324,154,914.9 ns	6,073,125.69 ns	6,498,176.44 ns	324,871,152.0 ns	1000.0000	1000.0000	1000.0000	398441368 B
ParseLookup1	Default	10000000	95,645,622.3 ns	1,894,442.35 ns	2,835,512.11 ns	96,720,511.8 ns	-	-	-	43555056 B
ParseLookup2	Default	10000000	265,436,727.6 ns	5,279,863.17 ns	14,274,417.11 ns	267,626,024.5 ns	-	-	-	43555088 B
ParseStasteMachine	Default	10000000	55,162,209.4 ns	1,018,040.99 ns	952,276.18 ns	55,269,276.5 ns	-	-	-	22057717 B
ParseConvert	Default	10000000	485,470,815.9 ns	9,360,966.90 ns	10,016,129.71 ns	485,518,774.0 ns	68000.0000	-	-	363555816 B
ParseDic	Default	10000000	279,954,373.4 ns	5,275,073.50 ns	4,934,307.02 ns	279,560,809.5 ns	-	-	-	43555296 B
ParseRegex	Default	10000000	2,229,431,891.0 ns	47,299,004.43 ns	58,087,425.16 ns	2,212,341,288.5 ns	250000.0000	-	-	1223549040 B

// * Legends *
_len : Value of the '_len' parameter
Mean : Arithmetic mean of all measurements
Error : Half of 99.9% confidence interval
StdDev : Standard deviation of all measurements
Median : Value separating the higher half of all measurements (50th percentile)
Gen 0 : GC Generation 0 collects per 1000 operations
Gen 1 : GC Generation 1 collects per 1000 operations
Gen 2 : GC Generation 2 collects per 1000 operations
Allocated : Allocated memory per single operation (managed only, inclusive, 1KB = 1024B)
1 ns : 1 Nanosecond (0.000000001 sec)

// * Diagnostic Output - MemoryDiagnoser *

// ***** BenchmarkRunner: End *****

lduchosal · 2020-04-06T09:55:21Z

Benchmark on feature-complete implementations only.

// * Summary *

BenchmarkDotNet=v0.12.0, OS=macOS 10.15.4 (19E266) [Darwin 19.4.0]
Intel Core i7-8750H CPU 2.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=3.1.200
[Host] : .NET Core 3.1.2 (CoreCLR 4.700.20.6602, CoreFX 4.700.20.6702), X64 RyuJIT
Job-PKYYLI : .NET Core 3.1.2 (CoreCLR 4.700.20.6602, CoreFX 4.700.20.6702), X64 RyuJIT

Runtime=.NET Core 3.1

Method	_len	Mean	Error	StdDev	Gen 0	Gen 1	Gen 2	Allocated
ParseStasteMachine	10	786.1 ns	16.70 ns	15.62 ns	0.1984	-	-	936 B
ParseRegex	10	4,059,551.3 ns	21,944.37 ns	19,453.11 ns	-	-	-	28434 B
ParseStasteMachine	100	1,348.8 ns	5.96 ns	5.57 ns	0.2842	-	-	1344 B
ParseRegex	100	4,100,269.9 ns	54,937.82 ns	51,388.87 ns	7.8125	7.8125	-	40091 B
ParseStasteMachine	1000	5,361.1 ns	29.78 ns	27.86 ns	0.8698	-	-	4104 B
ParseRegex	1000	4,265,290.3 ns	33,367.12 ns	31,211.63 ns	31.2500	15.6250	-	148313 B
ParseStasteMachine	10000	45,081.0 ns	202.43 ns	189.35 ns	9.3384	0.4272	-	43920 B
ParseRegex	10000	6,037,076.9 ns	39,285.25 ns	34,825.35 ns	257.8125	23.4375	-	1244974 B
ParseStasteMachine	100000	469,396.6 ns	6,485.16 ns	6,066.22 ns	83.4961	41.0156	41.0156	345683 B
ParseRegex	100000	24,066,503.8 ns	164,110.50 ns	145,479.68 ns	2593.7500	250.0000	62.5000	12156609 B
ParseStasteMachine	1000000	4,322,987.1 ns	58,760.00 ns	52,089.21 ns	132.8125	109.3750	109.3750	1571404 B
ParseRegex	1000000	203,545,505.2 ns	607,934.18 ns	507,652.57 ns	25000.0000	-	-	121091632 B
ParseStasteMachine	10000000	49,810,917.9 ns	773,578.13 ns	723,605.47 ns	90.9091	90.9091	90.9091	22057721 B
ParseRegex	10000000	2,002,033,220.9 ns	6,521,205.88 ns	6,099,940.02 ns	250000.0000	-	-	1223553024 B

// * Hints *
Outliers
FormatBenchmark.ParseRegex: Runtime=.NET Core 3.1 -> 1 outlier was removed (4.17 ms)
FormatBenchmark.ParseRegex: Runtime=.NET Core 3.1 -> 1 outlier was removed (6.16 ms)
FormatBenchmark.ParseRegex: Runtime=.NET Core 3.1 -> 1 outlier was removed (24.51 ms)
FormatBenchmark.ParseStasteMachine: Runtime=.NET Core 3.1 -> 1 outlier was removed (4.45 ms)
FormatBenchmark.ParseRegex: Runtime=.NET Core 3.1 -> 2 outliers were removed (205.34 ms, 208.47 ms)

// * Legends *
_len : Value of the '_len' parameter
Mean : Arithmetic mean of all measurements
Error : Half of 99.9% confidence interval
StdDev : Standard deviation of all measurements
Gen 0 : GC Generation 0 collects per 1000 operations
Gen 1 : GC Generation 1 collects per 1000 operations
Gen 2 : GC Generation 2 collects per 1000 operations
Allocated : Allocated memory per single operation (managed only, inclusive, 1KB = 1024B)
1 ns : 1 Nanosecond (0.000000001 sec)

// * Diagnostic Output - MemoryDiagnoser *

Added Parse functionnality and some unit tests

f7e63d1

drewnoakes reviewed Mar 29, 2020

added argument to parse & tests

0962039

lduchosal mentioned this pull request Mar 30, 2020

Decode method : public byte[] Decode(string hexdump) #1

Closed

drewnoakes reviewed Mar 30, 2020

lduchosal added 4 commits April 1, 2020 14:22

Benchmark & optimizations

e193de4

SIMD & becnhmarks

320b9ff

Fix & rename

8ea809b

Fix & renames

57bdcc5

drewnoakes reviewed Apr 2, 2020

HexDump.Benchmark/FormatBenchmark.cs

lduchosal added 3 commits April 3, 2020 00:32

With StateMachine, it should parse strange ascii combinations

20273a3

More tests & fixes

2b24b84

More tests & refactors

f771020
