Data Movement in the Node.js API

Concurrency and Large Data Sets in Node.js

Node.js provides concurrency by waiting on many IO responses at once instead of running multiple threads. This strategy avoids the challenges and risks of multi-threaded programming for middle-tier clients that are IO-intensive rather than compute-intensive.

In the Node.js API, this architectural principle means that better throughput for large data sets requires multiple concurrent requests to each e-node (instead of serial requests to one e-node).

In other words, while the Node.js client is submitting a request or processing a response, many other pending requests are waiting for the server to respond. Because a round trip over the network is typically much more expensive than the client-side work of submitting a request or processing a response (especially in the cloud), the client can typically submit many requests, and/or process many responses, during the time required for a single round trip to the server.
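The idea of keeping several requests pending at once can be sketched with plain promises. This is a minimal illustration, not the Node.js API implementation: `fakeRequest` and `runConcurrently` are hypothetical names, and the network round trip is simulated with a timer.

```javascript
// Hypothetical stand-in for a network round trip: resolves after latencyMs.
function fakeRequest(item, latencyMs) {
  return new Promise((resolve) => setTimeout(() => resolve(item * 2), latencyMs));
}

// Keep up to `limit` requests pending at once on the single Node.js thread.
async function runConcurrently(items, limit, request) {
  const results = new Array(items.length);
  let next = 0;        // index of the next item to submit
  let inFlight = 0;    // requests currently waiting for a response
  let maxInFlight = 0; // highest number of simultaneous pending requests seen

  // Each lane submits a new request as soon as its previous one returns.
  async function lane() {
    while (next < items.length) {
      const i = next++;
      inFlight++;
      maxInFlight = Math.max(maxInFlight, inFlight);
      results[i] = await request(items[i]);
      inFlight--;
    }
  }

  const lanes = Array.from({ length: Math.min(limit, items.length) }, () => lane());
  await Promise.all(lanes);
  return { results, maxInFlight };
}
```

Calling `runConcurrently(items, 3, (n) => fakeRequest(n, 20))` keeps at most three simulated requests pending, so the total elapsed time is governed by the number of round trips per lane rather than by the number of items.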

Optimal Concurrency

The optimal level of concurrency would provide full utilization of both the client and server:

From the client perspective, a new response becomes available to process at the moment that submitting a new request finishes. In essence, neither the single Node.js thread nor the request submitting or response processing routines ever wait.
From the server perspective, thread and memory consumption is at the sweet spot, with allowance for other requests to the server.

Clients should avoid exceeding the optimum concurrency level for either client or server.

Detection of Server Factors

To determine the appropriate level of concurrency, the client must become aware of server capacity, as reflected by the number of hosts for the database, the number of threads available on those hosts, and (for query management) the number of forests. The Node.js API objects for data movement call internal endpoints to inspect server state during initialization.

IO With Node.js Streams

Node.js provides streams as the standard representation for large data sets. Conforming to this standard, the Node.js API data movement functions are factories that return:

an object of type stream.Writable to the application for sending request input to the server
an object of type stream.Readable to the application for receiving response output from the server

Typically, the Node.js API reads the input stream repeatedly, accumulating a batch of data in a buffer, and then makes a batch request when the buffer is full or the input stream ends.
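The batching behavior described here can be sketched with a standard `stream.Writable`. This is an illustration under assumptions, not the library's code: `BatchingWritable`, `pendingItems`, and the `sendBatch` callback are hypothetical names, and the batch request is reduced to a synchronous callback.

```javascript
const { Writable } = require('stream');

// Sketch: collect written items and hand each full batch (or the final
// partial batch) to the hypothetical sendBatch callback.
class BatchingWritable extends Writable {
  constructor(batchSize, sendBatch) {
    super({ objectMode: true });
    this.batchSize = batchSize;
    this.sendBatch = sendBatch;
    this.pendingItems = [];
  }

  // Called once per item written by the application.
  _write(item, _encoding, callback) {
    this.pendingItems.push(item);
    if (this.pendingItems.length >= this.batchSize) {
      this.flushBatch();
    }
    callback();
  }

  // Called when the input ends: send whatever is left over.
  _final(callback) {
    if (this.pendingItems.length > 0) {
      this.flushBatch();
    }
    callback();
  }

  flushBatch() {
    const batch = this.pendingItems;
    this.pendingItems = [];
    this.sendBatch(batch);
  }
}
```

A real implementation would issue the batch request asynchronously and delay `callback()` when too many requests are already pending, which is how stream backpressure reaches the application.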

Typically, the Node.js API writes each item in a response batch separately to the output stream, and ends the output stream when the last response has been processed.

Where the data movement requires both input and output, the factory function returns a duplex stream to the application, for sending request input to the server and receiving response output from the server. Internally, the client implementation in the Node.js API sees the other side of that duplex: a readable stream for receiving the request input from the application and a writable stream for sending the response output to the application.
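A duplex of this kind can be sketched with `stream.Transform`, which is a `stream.Duplex` subclass in Node.js. The sketch below is hypothetical: `createReader` and `fakeBatchRead` are invented names, and the batch request is a local function rather than a call to the server. It batches the URIs written by the application and writes each item of a response batch separately to the output side.

```javascript
const { Transform } = require('stream');

// Hypothetical stand-in for a batch request: one descriptor per URI.
function fakeBatchRead(uris) {
  return uris.map((uri) => ({ uri, content: `content of ${uri}` }));
}

// Sketch of a factory returning a duplex: URIs in, document descriptors out.
function createReader(batchSize) {
  let pendingUris = [];
  return new Transform({
    objectMode: true,
    transform(uri, _encoding, callback) {
      pendingUris.push(uri);
      if (pendingUris.length >= batchSize) {
        const batch = pendingUris;
        pendingUris = [];
        // Write each item of the response batch separately to the output.
        for (const doc of fakeBatchRead(batch)) this.push(doc);
      }
      callback();
    },
    flush(callback) {
      // Input ended: request the final partial batch; the output then ends.
      for (const doc of fakeBatchRead(pendingUris)) this.push(doc);
      pendingUris = [];
      callback();
    },
  });
}
```

The application writes URIs to the returned stream and listens for `data` and `end` events on the same object.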

Data Movement Functions

When using multiple data movement functions in a pipeline to handle special cases (instead of the provided conveniences), the application has the responsibility for configuring each function to share the available client and server concurrency.

Data movement functions take options such as the batch size, the number of concurrent requests per forest or host, and success and/or error callbacks for each batch.
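As a rough illustration of how such options might be shaped and interpreted, the sketch below defines an options object and a helper that turns a per-forest or per-host multiplier into a request count. The property names, the `resolveConcurrency` helper, and the `serverState` fields are assumptions made for this example; consult the JS docs for the names the API actually accepts.

```javascript
// Hypothetical option names, chosen for illustration only.
const exampleOptions = {
  batchSize: 100,
  concurrentRequests: { multipleOf: 'forests', multiplier: 2 },
  onBatchSuccess: (batch) => { /* e.g. log progress */ },
  onBatchError: (batch, error) => null,
};

// Hypothetical helper: scale the multiplier by the number of forests or
// hosts that the client discovered from the server during initialization.
function resolveConcurrency(concurrentRequests, serverState) {
  const { multipleOf, multiplier } = concurrentRequests;
  let unitCount;
  if (multipleOf === 'forests') {
    unitCount = serverState.forestCount;
  } else if (multipleOf === 'hosts') {
    unitCount = serverState.hostCount;
  } else {
    throw new Error(`unsupported multipleOf: ${multipleOf}`);
  }
  // Always allow at least one request so the operation can make progress.
  return Math.max(1, multiplier * unitCount);
}
```

With three forests and a multiplier of 2, this sketch would allow six concurrent requests. When several data movement functions run in one pipeline, the application would need to divide that budget among them.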

Each data movement function maintains state for its operations (similar to the use of the Operation object for single-request calls).

When processing data in memory, clients need to work with request or response data as JavaScript in-memory objects. Alternatively, when dispatching request data from other sources or response data to other sinks (such as other databases), clients can achieve better throughput by working with that data as JavaScript strings or buffers.

In particular, most of the existing request functions return a ResultProvider, to let the application choose whether to get the response data as a Node.js Promise or as a Node.js Stream.
By contrast, a data movement function must:

write response data to the output stream
execute an application callback to determine the disposition of any error on a batch request
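The second requirement, letting an application callback decide what happens to a failed batch, can be sketched as follows. The convention used here (return an array to retry, return null to skip, throw to stop) is an assumption made for this example and is not taken from the API documentation; `processBatch` and `sendBatch` are hypothetical names.

```javascript
// Convention assumed for this sketch only:
//   onBatchError returns an array -> retry with that (possibly repaired) batch
//   onBatchError returns null     -> skip the failed batch and continue
//   onBatchError throws           -> stop the whole operation
function processBatch(batch, sendBatch, onBatchError, maxRetries = 2) {
  let current = batch;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return { status: 'written', items: sendBatch(current) };
    } catch (error) {
      // The callback may throw, which propagates and stops the operation.
      const replacement = onBatchError(current, error);
      if (replacement === null || replacement === undefined) {
        return { status: 'skipped', items: [] };
      }
      current = replacement;
    }
  }
  return { status: 'failed', items: [] };
}
```

For example, a callback that filters out the offending items and returns the remainder would cause the repaired batch to be sent again, while a callback that returns null lets the rest of the data movement proceed without that batch.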

Node-client-api - 2.8.0

Ingesting Documents - writeAll API

The Node.js API documents object adds a writeAll() function equivalent to the DMSDK WriteBatcher with the following signature:

writeAll(options)

The properties of the options object:

Example - An example of using the writeAll API has been added to the examples folder of node-client-api:

https://github.com/marklogic/node-client-api/blob/develop/examples/writeAll-documents.js

JS docs - https://docs.marklogic.com/jsdoc/documents.html#writeAll

Node-client-api - 2.9.0

Collecting Document URIs - queryAll API

The Node.js API documents object adds a queryAll() function equivalent to the DMSDK QueryBatcher with the following signature:

queryAll(query, options)

The parameters:

[Image: table of queryAll parameters]

The return value:

a stream.Readable that sends document URI output read from the database to the application in string mode or (for arrays of strings) object mode.

The properties of the options object:

[Image: table of queryAll options properties]

Example and JS docs - An example of using the queryAll API has been added to the examples folder of node-client-api:

https://github.com/marklogic/node-client-api/blob/develop/examples/queryAll-documents.js

JS docs - https://docs.marklogic.com/jsdoc/documents.html#queryAll

Exporting Documents - readAll API

The Node.js API documents object adds a readAll() function equivalent to the DMSDK ExportListener with the following signature:

readAll(options)

The parameters: [Image: table of readAll parameters]

The return value:

a stream.Duplex that receives document URI input from the application in string mode or (for arrays of strings) object mode, and sends document descriptors with the content and/or document URI as output to the application in object mode.
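Because the input of this duplex is a stream of URIs and its output is a stream of document descriptors, it composes naturally with a readable stream of URIs. The sketch below shows that composition using only the standard `stream` module. `fakeUriStream` and `fakeExportStream` are hypothetical stand-ins for a URI-producing stream and a URI-consuming duplex; they do not call the database, and `exportAll` is an invented helper.

```javascript
const { Readable, Transform, Writable, pipeline } = require('stream');

// Hypothetical stand-in for a readable stream of document URIs.
function fakeUriStream(uris) {
  return Readable.from(uris);
}

// Hypothetical stand-in for a duplex that takes URIs and emits descriptors.
function fakeExportStream() {
  return new Transform({
    objectMode: true,
    transform(uri, _encoding, callback) {
      callback(null, { uri, content: { source: uri } });
    },
  });
}

// Pipe URIs through the export stand-in and collect the descriptors.
function exportAll(uris, onDone) {
  const docs = [];
  const sink = new Writable({
    objectMode: true,
    write(doc, _encoding, callback) {
      docs.push(doc);
      callback();
    },
  });
  // pipeline() forwards errors from any stage to the final callback.
  pipeline(fakeUriStream(uris), fakeExportStream(), sink, (err) => onDone(err, docs));
}
```

Using `pipeline` rather than chained `pipe` calls means an error in any stage tears down the whole chain and is reported once, in the final callback.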

The properties of the options object: [Image: tables of readAll options properties]