The goal: export a million transactions from NetSuite to an FTP-based system every month, after applying some transformations. Assume the data set is around 700 MB to 1 GB for this use case, possibly even more.
Prime challenges: When you export data from one system to another, there are many things to consider, but let's focus on the challenges specific to a large number of records.
1) Export time.
Exporting around 1 GB of data can take a considerable amount of time using conventional approaches.
2) RAM availability on the client machine.
Assume the machine you have access to has limited RAM and other jobs running on it.
Consuming even close to a GB of RAM can push the application into an out-of-memory error, and heap usage can climb very high with conventional programming approaches. You need a programming model that makes the best use of the available RAM.
Using RESTlets/SOAP APIs:
Running a saved search for millions of records takes a considerable amount of time, and of CPU resources, whether you use SuiteScript or SuiteTalk. To exploit the power of a SuiteCloud Plus license, i.e. parallel web-service endpoints, you have to deal with multi-threaded programming, and even then, the limited capabilities of the saved-search APIs will not give you the performance and optimal RAM usage you are looking for. So these approaches might not be acceptable for this kind of scenario.
Let’s take each component in turn and then jump into the final solution.
a) SuiteAnalytics Connect
- It gives direct access to the DBMS, so you can read data much faster than with saved searches.
- There is no practical limit on the number of rows, unlike saved searches.
- It does not, however, solve the problem of heap usage.
b) Non-blocking asynchronous streams in Node.js
- In an application where data exchange is high, network and I/O load will be high, and Node.js is an optimal fit. Its streaming APIs let you read data in chunks rather than loading the whole file into RAM. This allowed us to transform the data and upload the final file chunk by chunk instead of loading the entire file into RAM. If you are interested in the streaming API, take a look here
“The advantage we get is that the whole SQL result, or the data in the CSV file, is never in memory; only one chunk of data is in memory at a time. While the I/O processor is busy fetching a chunk from the network or I/O devices, the CPU's idle time can be put to use on other tasks for better performance.
Streams are a sort of Unix pipe that let you easily read data from a source and pipe it to a destination as it flows, and the API handles backpressure as well.”
- Still, we are left with one limitation: no good ODBC driver is available for Node.js on Windows. NetSuite has good support for JDBC, so for DB access, Java and JDBC appear more useful here.
c) JDBC and Java
1) To overcome the ODBC limitation noted in (b), you can use Java's JDBC to get the data. You can invoke a Java process from Node.js as a child process to fetch it. The JDBC API also lets you specify the fetch size, i.e. the batch of rows fetched per round trip, so that network and memory utilisation stay optimal.
2) Java can dump the rows to stdout, and Node.js, being the master process, can stream that output into a file immediately (note that the entire data set is never in the console buffer at once), for effective RAM utilisation.
Below is an overview of the final solution.
NetSuite is kind enough to allow parallel queries. You can further split the SQL by transaction date ranges (obviously, this depends on whether your query can be broken up), which lets you pull the data even faster and gives a further performance boost, as the network and I/O lag time of one request is put to use by the others.
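As a sketch of the date-range split, the helpers below carve a year into monthly ranges and turn each into its own query. The table and column names in the SQL (`transaction`, `trandate`) are assumptions about the Connect schema; treat the exact SQL as something to verify against your own account.

```javascript
// Split one big export into independent monthly date ranges so several
// queries (each in its own Java child process) can run in parallel.
function monthlyRanges(year) {
  const pad = m => String(m).padStart(2, '0');
  const ranges = [];
  for (let m = 1; m <= 12; m++) {
    ranges.push({
      start: `${year}-${pad(m)}-01`,
      // half-open ranges: each end is the first day of the next month
      end: m === 12 ? `${year + 1}-01-01` : `${year}-${pad(m + 1)}-01`,
    });
  }
  return ranges;
}

// Build one SQL query per range (schema names are assumptions).
function rangeToSql({ start, end }) {
  return (
    'SELECT * FROM transaction ' +
    `WHERE trandate >= TO_DATE('${start}', 'YYYY-MM-DD') ` +
    `AND trandate < TO_DATE('${end}', 'YYYY-MM-DD')`
  );
}

// Each SQL string can then be handed to its own child process, e.g.
//   Promise.all(monthlyRanges(2023).map(r => runExport(rangeToSql(r))))
// where runExport is whatever hypothetical helper spawns your JDBC fetcher.
```

Half-open ranges (`>= start`, `< end`) make sure no transaction is counted twice or dropped at a month boundary.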
You will be able to export a large amount of data, transform it, and upload it to FTP in a matter of minutes (in my tests it took 10-15 minutes in most scenarios; in the worst case it was under 20 minutes), and RAM usage stayed very modest (not even close to half a GB).