[FAQ]: Loading a huge CSV file

When the job engine runs, if the source is a file: URL it uses the file directly. If it is a repository: URL, however, the engine must first copy the entire file into its local tmp directory so that it has a file: URL (the job engine does not know which machine(s) host the DaCapo store).

This is because the following steps happen:
1/ The data service receives a request from the Designer and creates a job in the job engine.
2/ The job engine fetches the xxxMB file from DaCapo into a local folder (slow)
3/ The job engine converts the CSV into XML in a local folder (slow)
4/ The job engine writes the XML back into DaCapo at /Temp//tmp-0000XXX.xml (slow)
5/ The data service loads the XML, keeps the first 500 records, and counts how many records there are (slow)
As explained above, this is slow because the data service, the job engine, and DaCapo can all be running on different machines, so each step may involve a network transfer.
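The conversion work in steps 3/ and 5/ can be sketched locally. This is only an illustration of the general technique, not the engine's actual code: the function name, the XML element layout, and the field handling are assumptions; the 500-record cap mirrors step 5/ above.

```python
import csv
import io
import xml.etree.ElementTree as ET

def csv_to_xml_preview(csv_text, keep=500):
    """Convert CSV text to XML, keeping only the first `keep` data
    records, and return (xml_string, total_record_count).

    Hypothetical sketch: the real job engine's format will differ.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    root = ET.Element("records")
    total = 0
    for row in reader:
        total += 1
        if total <= keep:  # only the first `keep` rows go into the preview
            rec = ET.SubElement(root, "record")
            for field, value in row.items():
                ET.SubElement(rec, field).text = value
    return ET.tostring(root, encoding="unicode"), total

# Small demo: three records in the CSV, preview capped at two.
sample = "name,qty\na,1\nb,2\nc,3\n"
xml, total = csv_to_xml_preview(sample, keep=2)
print(total)                   # 3  (all records are still counted)
print(xml.count("<record>"))   # 2  (only the first two are kept)
```

Note that even in this sketch the whole file must be scanned to count the records, which is why step 5/ is slow for a huge CSV regardless of the 500-record cap.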

Therefore, it is recommended to reference the CSV file through a file: URL instead. Below is an example of such a URL:

file:/<PATH of csv file on the local drive>/example.csv
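A file: URL like the one above can be built from a local path with Python's standard library. Note that `pathlib` produces the standard triple-slash form (`file:///…`), whereas the example above uses a single-slash form; which variants the job engine accepts is not specified here, and the path below is hypothetical.

```python
from pathlib import Path

# Hypothetical location; substitute the real path of your CSV file.
csv_path = Path("/data/imports/example.csv")

# as_uri() requires an absolute path and percent-encodes special characters.
print(csv_path.as_uri())  # file:///data/imports/example.csv
```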