NanoCSV, Faster C++11 multithreaded header-only CSV parser

NanoCSV is a fast, multithreaded, header-only CSV parser written in C++11 that depends only on the STL. It is designed for CSV data with numeric values.

Status

In development. NanoCSV is not recommended for production use at the moment.

Requirements

C++11 compiler (with thread support)

Usage

#include <cstdlib>   // std::atoi, EXIT_SUCCESS, EXIT_FAILURE
#include <iostream>
#include <string>

// Define this in only **one** C++ file.
#define NANOCSV_IMPLEMENTATION
#include "nanocsv.h"

int main(int argc, char **argv)
{
  if (argc < 2) {
    std::cout << "csv_parser_example input.csv (num_threads) (delimiter)\n";
  }

  std::string filename("./data/array-4-5.csv");
  int num_threads = -1; // -1 = use all system threads
  char delimiter = ' '; // delimiter character.

  if (argc > 1) {
    filename = argv[1];
  }

  if (argc > 2) {
    num_threads = std::atoi(argv[2]);
  }

  if (argc > 3) {
    delimiter = argv[3][0];
  }

  nanocsv::ParseOption<float> option;
  option.delimiter = delimiter;
  option.req_num_threads = num_threads;
  option.verbose = true; // verbose messages will be stored in `warn`.
  option.ignore_header = true; // Skip the header (the first line); default = true.

  std::string warn;
  std::string err;

  nanocsv::CSV<float> csv;

  bool ret = nanocsv::ParseCSVFromFile(filename, option, &csv, &warn, &err);

  if (!warn.empty()) {
    std::cout << "WARN: " << warn << "\n";
  }

  if (!ret) {
    if (!err.empty()) {
      std::cout << "ERROR: " << err << "\n";
    }

    return EXIT_FAILURE;
  }

  std::cout << "num records(rows) = " << csv.num_records << "\n";
  std::cout << "num fields(columns) = " << csv.num_fields << "\n";

  // values is a 1D array of length [num_records * num_fields]
  // std::cout << csv.values[4 * csv.num_fields + 3] << "\n";

  // header string is stored in `csv.header`
  if (!option.ignore_header) {
    for (size_t i = 0; i < csv.header.size(); i++) {
      std::cout << csv.header[i] << "\n";
    }
  }

  return EXIT_SUCCESS;
}

NaN, Inf

NanoCSV parses the following special floating-point strings:

nan, -nan as NaN, -NaN
inf, -inf as Inf, -Inf

Support for N/A and null value

By default, missing values (e.g. N/A, NaN, or an invalid numeric string) are replaced by nan, and null (empty) values (e.g. "") are also replaced by nan.

You can control this behavior with the following parameters in ParseOption.

replace_na : Whether to replace N/A and NaN values.
- na_value : The value that replaces N/A and NaN values.
replace_null : Whether to replace null (empty) values.
- null_value : The value that replaces null values.

Parse Text CSV

Parsing text CSV (where each field is just a string) is also supported. It uses a different API; see the source code for details.

Compiler options

NANOCSV_NO_IO : Disable I/O (file access, stdio, mmap).
NANOCSV_WITH_RYU : Use the Ryu library (https://github.com/ulfjack/ryu) to parse floating-point strings. This gives precise handling of floating-point values.
- NANOCSV_WITH_RYU_NOINCLUDE: Do not include Ryu header files in nanocsv.h. This is useful when you want to include Ryu header files outside of nanocsv.h.

TODO

Performance

The dataset is 8192 x 4096 values, 800 MB in file size (generated by tools/gencsv/gen.py).

Threadripper 1950X
DDR4 2666 64 GB memory

1 thread.

total parsing time: 3833.33 ms
  line detection : 1264.99 ms
  alloc buf      : 0.016351 ms
  parse          : 2508.83 ms
  construct      : 55.726 ms

16 threads.

total parsing time: 545.646 ms
  line detection : 159.078 ms
  alloc buf      : 0.077979 ms
  parse          : 337.207 ms
  construct      : 46.7815 ms

23 threads

23 threads are used here because they are faster than 32 threads on the 1950X.

total parsing time: 494.849 ms
  line detection : 127.176 ms
  alloc buf      : 0.050988 ms
  parse          : 314.287 ms
  construct      : 50.7568 ms

Roughly 7.7 times faster than single-threaded parsing.

Note on memory consumption

This is an estimate rather than a measurement: memory consumption should not exceed 3 * file size, so about 2.4 GB for the 800 MB dataset above.

In Python

Loading the same data with numpy.loadtxt takes 23.4 seconds.

NanoCSV parsing with 23 threads is more than 40 times faster than numpy.loadtxt (about 0.49 s versus 23.4 s).

References

RFC 4180 https://www.ietf.org/rfc/rfc4180.txt

License

MIT License

Third-party license

stack_container : Copyright (c) 2006-2008 The Chromium Authors. BSD-style license.
acutest : MIT license. Used for unit tests.
ryu : Apache 2.0 or Boost 1.0 dual license.
