byte counter is word counter . awk #3

@mogando668

hi,

just a very minor comment -

summing up length($0) gives a byte count only if the input is guaranteed to be ASCII-only, since in a multibyte locale gawk's length( ) counts characters, not bytes.

i accidentally discovered that, even in gawk's unicode mode, to get an exact byte count for UTF-8 inputs, or even for purely binary files like a .gz or a .mp4, a simple

match($0, /$/) - 1

does the trick. the minus 1 is needed since it matches the first available position, which is immediately after the input itself.

Conversely, if one definitely knows RT has a fixed width of 1 byte (e.g. only \n ),
then a byte count is even simpler -

at each row, add up

byte_cnt += match($0, /$/)

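then in the END { } section, byte_cnt will be accurate, because match( ) returns the row's byte length plus 1, and that extra 1 accounts for the \n. In byte/POSIX/C mode, match( ) doesn't offer any speed-up (in the gawk -b timings below it is far slower than length( )), so for those, use length( ) instead.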
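
For the byte/POSIX/C mode case, a minimal sketch of the length( ) variant could look like the following. The function name count_bytes is hypothetical. It assumes every row ends in exactly one \n, and an awk whose length( ) either follows the locale (gawk) or always counts bytes (mawk); forcing LC_ALL=C is one way to get byte semantics without relying on gawk's -b flag.

```shell
# Hypothetical helper: byte count of stdin via length(), with byte semantics
# forced through the C locale. Assumes each row is terminated by one "\n",
# so 1 is added per row for the terminator.
count_bytes() {
    LC_ALL=C awk '{ n += length($0) + 1 } END { print n + 0 }'
}

# The two numbers printed here should agree for newline-terminated input.
printf 'héllo\nwörld\n' | count_bytes
printf 'héllo\nwörld\n' | wc -c
```

The + 1 per row plays the same role as the extra 1 that match($0, /$/) returns, so the total is only exact when the last row also ends in \n.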


% time ( pvE0 < "${m3r}" | gawk -e 'BEGIN { FS=RS="^$" } END { print match($0,/$/) - 1 }' | ecp); echo
 
      in0:  408MiB 0:00:00 [1011MiB/s] [1011MiB/s] [===================================>] 100%            
428814321
 
( pvE 0.1 in0 < "${m3r}" | gawk -e  | mawk ; )  13.25s user 0.71s system 100% cpu 13.865 total

% time ( pvE0 < "${m3r}" | gawk -b -e 'BEGIN { FS=RS="^$" } END { print  match($0,/$/) - 1 }' | ecp); echo
 
      in0:  408MiB 0:00:00 [1.13GiB/s] [1.13GiB/s] [===================================>] 100%            
428814321
 
( pvE 0.1 in0 < "${m3r}" | gawk -b -e  | mawk ; )  13.47s user 0.66s system 100% cpu 14.042 total

time ( pvE0 < "${m3r}" | gawk -b -e 'BEGIN { FS=RS="^$" } END { print length }' | ecp); echo
 
      in0:  408MiB 0:00:00 [1.15GiB/s] [1.15GiB/s] [===================================>] 100%            
428814321
 
( pvE 0.1 in0 < "${m3r}" | gawk -b -e  | mawk ; )  0.28s user 0.67s system 115% cpu 0.825 total

one can obtain a tiny speed-up in gawk by summing row-by-row instead of all at once, while mawk2's match( ) is implemented in such a manner that match-only is hardly any slower than length( ) on small inputs:

 time ( pvE0 < "${m3r}" | gawk -e 'BEGIN { FS="^$" } { byte_cnt += match($0,/$/) } END { print  byte_cnt }' | ecp); echo
 
      in0:  408MiB 0:00:13 [30.3MiB/s] [30.3MiB/s] [===================================>] 100%            
428814321
 
( pvE 0.1 in0 < "${m3r}" | gawk -e  | mawk ; )  13.49s user 0.28s system 101% cpu 13.553 total

 time ( pvE0 < "${m3r}" | mawk2  'BEGIN { FS="^$" } { byte_cnt += match($0,/$/) } END { print  byte_cnt }' | ecp); echo
 
      in0:  408MiB 0:00:00 [1.47GiB/s] [1.47GiB/s] [===================================>] 100%            
428814321
 
( pvE 0.1 in0 < "${m3r}" | mawk2  | mawk ; )  0.11s user 0.28s system 124% cpu 0.310 total

 time ( pvE0 < "${m3r}" | mawk2  'BEGIN { FS="^$" } { byte_cnt += length($0) } END { print  byte_cnt+NR }' | ecp); echo
 
      in0:  408MiB 0:00:00 [1.50GiB/s] [1.50GiB/s] [===================================>] 100%            
428814321
 
( pvE 0.1 in0 < "${m3r}" | mawk2  | mawk ; )  0.10s user 0.27s system 124% cpu 0.300 total

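here, i've thrown in a 224MB .7z binary file, and gawk handles it just fine without any error messages. the - (RT=="") at the end subtracts the 1 byte that gets over-counted when the last row has no trailing \n. (i've also added the gnu-wc output for reference; with -lcm its columns are lines, characters and bytes) :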

 f='./MV82_ConsolidatedDesktop/new_m3t_need_append.txt.7z'; gwc -lcm "${f}" | lgp3; time ( pvE0 < "${f}" | gawk -e 'BEGIN { FS="^$" } { byte_cnt += match($0,/$/) } END { print  byte_cnt - (RT=="") }' | ecp); echo

   920308 125659415 235672582 ./MV82_ConsolidatedDesktop/new_m3t_need_append.txt.7z

 
      in0:  224MiB 0:00:07 [28.6MiB/s] [28.6MiB/s] [===================================>] 100%            
235672582
 
( pvE 0.1 in0 < "${f}" | gawk -e  | mawk ; )  7.83s user 0.22s system 101% cpu 7.892 total
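
The - (RT=="") correction in the command above can also be folded into the per-row sum. The sketch below is a different technique from the match( ) one: it uses gawk -b so that length( ) counts bytes, and adds length(RT) for each row instead of assuming a 1-byte terminator. The function name exact_bytes is hypothetical, and RT is gawk-specific, so this is not expected to work in other awks.

```shell
# Hypothetical helper: exact byte count with gawk, whether or not the input
# ends in a newline. -b makes length() count bytes; RT holds the text that
# terminated the current row, and is "" for a final unterminated row.
exact_bytes() {
    gawk -b '{ n += length($0) + length(RT) } END { print n + 0 }'
}

# Input without a trailing newline: both figures should agree.
if command -v gawk >/dev/null 2>&1; then
    printf 'ab\ncd' | exact_bytes
    printf 'ab\ncd' | wc -c
fi
```

Since the terminator is measured rather than assumed, no adjustment is needed in the END { } section.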
