Concatenating millions of xml files in a single json file using bash - javascript

I would like to merge 2.5 million smallish XML files from a directory tree into one large JSON file, and I have been trying to do this in bash with find and the xml2json utility.
I'm pretty new to bash and haven't done anything very complicated with it. My intuition is something like the following (but this is a long way from working):
find . -exec xml2json {} ; cat >> merged.json
Problem #1: I can't figure out how to use the xml2json utility with -exec.
find . -exec /usr/bin/xml2json < {}
doesn't work (seems like it's waiting for more input?). Neither does
find . -exec /usr/bin/xml2json {}
How do I get this working?
Problem #2: What is the most efficient way to concatenate the files? Obviously just using cat isn't going to create a well-formed json file, but can I just concatenate in brackets at the start and end and commas in between? Or should I use something like jq's -s? Do I need to stream it or parallelize it?
If it turns out bash is bad for this, efficient alternatives in JavaScript, R, or Python would also be useful. Thanks.
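Since the question allows for Python, here is a minimal sketch of one way to stream everything into a single JSON array, which is the "brackets at the start and end, commas in between" idea from Problem #2. It uses only the standard library rather than xml2json, so the output shape (the tag / attributes / text / children keys) is an assumption made for illustration and will not match what xml2json emits. The function names are hypothetical. On Problem #1, note that find's -exec needs a terminating \; or +, and that a redirection such as < {} is handled by the shell before find ever runs.

```python
import json
import os
import xml.etree.ElementTree as ET


def element_to_dict(elem):
    """Convert an element to a plain dict (a simple, lossy convention).

    Assumption: tail text and namespaces handling are ignored here.
    """
    node = {"tag": elem.tag}
    if elem.attrib:
        node["attributes"] = dict(elem.attrib)
    text = (elem.text or "").strip()
    if text:
        node["text"] = text
    children = [element_to_dict(child) for child in elem]
    if children:
        node["children"] = children
    return node


def merge_xml_to_json(root_dir, out_path):
    """Walk root_dir and write every parsed .xml file into one JSON array.

    Only one document is held in memory at a time; the array brackets and
    the separating commas are written by hand.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        out.write("[")
        for dirpath, _dirnames, filenames in os.walk(root_dir):
            for name in sorted(filenames):
                if not name.endswith(".xml"):
                    continue
                tree = ET.parse(os.path.join(dirpath, name))
                if count:
                    out.write(",\n")
                json.dump(element_to_dict(tree.getroot()), out)
                count += 1
        out.write("]\n")
    return count
```

Calling merge_xml_to_json(".", "merged.json") would write one array element per XML file. With millions of files the run is likely dominated by file-system access, so whether parallelizing helps would need measuring.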

Related

ENAMETOOLONG nodeJs ffmpeg-fluent

I am currently using fluent-ffmpeg to merge video files. (https://github.com/fluent-ffmpeg/node-fluent-ffmpeg)
Unfortunately, the loop in which I add the files to merge fails at the thousandth file with the exception below:
Error ENAMETOOLONG in /nodes_modules/fluent-ffmpeg.
My question is:
How can I get around this error so that I can build a command of unlimited length?
This is an old question and unfortunately there's no code to pinpoint the problem for sure. But getting ENAMETOOLONG from ffmpeg on Windows usually means the command line really is too long, and merging thousands of files makes that quite likely.
It still happens in 2020, but there is a workaround: put the source filenames (for the merge) into a text file and pass that text file to ffmpeg as the input.
A raw ffmpeg call would look like:
ffmpeg -f concat -safe 0 -i mylist.txt -c copy output.wav
with mylist.txt being like:
file '/path/to/file1.wav'
file '/path/to/file2.wav'
file '/path/to/file3.wav'
With fluent-ffmpeg it's not intuitive but still feasible:
const cmd = ffmpeg();
cmd.input('mylist.txt')
.inputOption(['-f concat', '-safe 0'])
.output('out.wav')
.run();
Note: Be careful with absolute paths for the list file and for the source files inside the list on Windows. Most ffmpeg versions prepend the directory of the list to the source file path, resulting in a corrupted path like this:
c:/ffmpeg/lists/c:/audiofiles/file1.wav
But you can still solve this by using URL-formatted source paths:
file 'file:c:/audiofiles/file1.wav'
file 'file:c:/audiofiles/file2.wav'
file 'file:c:/audiofiles/file3.wav'
I'm sure this will be helpful to someone searching for this ffmpeg error :-)
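The list file described above can be generated by a script instead of by hand. A minimal sketch, assuming the paths are already known; the function names are hypothetical, and the single-quote escaping follows the quoting convention I understand ffmpeg to use for the concat list, so check it against your ffmpeg version:

```python
def write_concat_list(paths, list_path):
    """Write one "file '<path>'" line per source file for ffmpeg's concat input."""
    with open(list_path, "w", encoding="utf-8") as fh:
        for path in paths:
            # Assumption: a single quote inside the path is written as '\''
            # (close the quote, escaped quote, reopen the quote).
            escaped = path.replace("'", "'\\''")
            fh.write(f"file '{escaped}'\n")


def concat_command(list_path, output_path):
    """Return the argument list for the raw ffmpeg call shown in the answer."""
    return ["ffmpeg", "-f", "concat", "-safe", "0",
            "-i", list_path, "-c", "copy", output_path]
```

Because the command only ever mentions the list file, its length stays constant no matter how many sources are merged.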

Exclude everything but Javascript files in RequireJS Optimizer

I am using RequireJS optimizer in a gulp recipe to compile and concatenate my Modules but redundant 3rd party library files like bower.json and *.nuspec files are being copied to my output directory.
I have successfully managed to exclude full directories using fileExclusionRegExp in the requirejs.optimize options object with the following expression:
/^\.|^styles$|^templates$|^tests$|^webdriver$/
However, I cannot figure out how to exclude everything except files with the .js extension. I could use the following:
/^\.|.json$|.nuspec$|^styles$|^templates$|^tests$|^webdriver$/
to exclude specific extensions but if a new type were to appear later, I would have to notice and then change the regex. Also, the regex would probably become unruly and hard to maintain with time. I have tried to use the following expressions:
/^\.|!js$|^styles$|^templates$|^tests$|^webdriver$/
/^\.|!.js$|^styles$|^templates$|^tests$|^webdriver$/
/^\.|^.js$|^styles$|^templates$|^tests$|^webdriver$/
/^\.|[^.js$]|^styles$|^templates$|^tests$|^webdriver$/
/^\.|[^.js]$|^styles$|^templates$|^tests$|^webdriver$/
The results ranged from doing nothing (the first three) to breaking the build (the last two). Any help anyone could provide would be appreciated.
Thanks
Try this regex:
^\.|\.(json|nuspec)$|^(styles|templates|tests|webdriver)$
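To see what the suggested expression actually excludes, here is a small sketch that runs it over a few sample names. It uses Python's re module instead of a JavaScript RegExp (the two behave the same for this pattern), and it assumes the optimizer tests the expression against each bare file or directory name; the sample names are made up.

```python
import re

# The regex from the answer, unchanged.
EXCLUDE = re.compile(r"^\.|\.(json|nuspec)$|^(styles|templates|tests|webdriver)$")


def kept(names):
    """Return the names that the exclusion regex does NOT match."""
    return [name for name in names if not EXCLUDE.search(name)]


# Hypothetical sample names, for illustration only.
names = [".git", "bower.json", "lib.nuspec", "styles", "tests",
         "main.js", "styles.js", "README.md"]
print(kept(names))
```

Note that README.md is kept: the expression escapes the dots and groups the alternatives, but it still lists the unwanted extensions explicitly, so a new file type would still need to be added by hand.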

Parsing flags/arguments from a string in javascript

I'm building a "terminal-thingy" in javascript. The idea is that each command is a separate .js file, in AMD format, and everything is loaded with requirejs.
I want the commands to be called like:
command -s "string u-l: extra" -g http://domain.com/random.txt -r -a --test fixed
and that would then translate into something like:
command({'-s': 'string u-l: extra', '-g': 'http://domain.com/random.txt', '-r': true, '-a': true, '--test': 'fixed'});
But this is where I get stuck. I've tried running different scenarios in my head, but I can't find a good answer; every approach I come up with has conflicts:
split() - what if there are some extra spaces? That breaks everything.
regex - a regex relies on getting a similar string every time; what if I want to have something like "wget http://code.jquery.com/jquery-1.8.3.min.js"?
defining rules in the command itself - I still need to figure out the parsing
piping - if I want piping, I have to figure out how not to break on pipes inside quotes, e.g. "command -s 'random | pipe' | command2 asd"
Any ideas or advice would be appreciated; I'm stuck on this.
Would things be easier if you separated:
parsing (with a special purpose lib like https://github.com/jfd/optparse-js ?)
translating the parsed input into a list of required modules (you would have to define a mapping between arguments and command modules, if I understand correctly)
requiring the said modules, and then passing the relevant arguments to each module?
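The parsing step on its own can be sketched as below. This is in Python, not JavaScript, and only illustrates the idea of tokenizing first (so quotes and extra spaces are handled) and then pairing flags with values; it is not the optparse-js API, and parse_command is a hypothetical name. It assumes a flag takes a value whenever the next token does not start with "-", which misreads a flag followed by a positional argument.

```python
import shlex


def parse_command(line):
    """Split a command line into (name, flags) using shell-like quoting rules."""
    tokens = shlex.split(line)  # handles quotes and repeated spaces
    if not tokens:
        return None, {}
    name, rest = tokens[0], tokens[1:]
    flags = {}
    i = 0
    while i < len(rest):
        token = rest[i]
        if token.startswith("-"):
            # Assumption: a following token that is not a flag is this flag's value.
            if i + 1 < len(rest) and not rest[i + 1].startswith("-"):
                flags[token] = rest[i + 1]
                i += 2
            else:
                flags[token] = True
                i += 1
        else:
            # Positional arguments are collected under the "_" key.
            flags.setdefault("_", []).append(token)
            i += 1
    return name, flags
```

For the example in the question, parse_command returns the name "command" and a dictionary equal to the one the asker wants. Pipes are not handled; a quoted "|" simply stays inside its token, so splitting on unquoted pipe tokens could be layered on top.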

merging my CSS and JS files breaks code (working on mac, not on my server)

I'm merging (and afterwards minifying with YUI Compressor) my CSS and JS files.
My web application works fine when just linking the separate files.
Now I want to merge the files into one file, so I basically do the following:
find /myapp/js/ -type f -name "incl_*.js" -exec cat {} + > ./temporary/js_backend_merged.js
That merges all my JavaScript files perfectly. When I do this on my Mac, all goes well and I can use the merged file in my application with no problems.
When I merge the same files with the same command on my CentOS server, this doesn't work: my JS starts throwing errors. I have the same problem when merging CSS files: the CSS doesn't render correctly when merged on my CentOS box, but it does when I merge the files on my Mac.
Also, I did the same process before on my previous CentOS server, with no problems at all.
I'm thinking in the direction of a character set problem on the server, maybe?
Who can solve this little mystery that has already taken two complete days of my time with no luck at all...
UPDATE: the problem is that the command: find /myapp/js/ -type f -name "incl_*.js" -exec cat {} + > ./temporary/js_backend_merged.js orders files from incl_01 to incl_02, ... correctly on the Mac, but the same command orders these files differently on the server.
I see that I can use sort -n to sort the results, but I can't get the above command working correctly with the sort option added to it.
(find /myapp/js/ -type f -name "incl_*.js" | sort | xargs cat) > ./temporary/js_backend_merged.js
That uses find to list the files, sorts the list, then pipes it to xargs, which takes the list of files from stdin and runs cat on them.
Then the output of the whole thing is redirected to js_backend_merged.js
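The same fix, an explicit sort instead of relying on whatever order the file system returns, can be sketched in Python for comparison. The function name and arguments are hypothetical, and it assumes the output file is written somewhere that does not itself match the pattern under the searched directory.

```python
import fnmatch
import os


def merge_sorted(root_dir, pattern, out_path):
    """Concatenate all files under root_dir matching pattern, in sorted path order."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            if fnmatch.fnmatch(name, pattern):
                matches.append(os.path.join(dirpath, name))
    # os.walk, like find, gives no ordering guarantee, so sort explicitly.
    matches.sort()
    with open(out_path, "wb") as out:
        for path in matches:
            with open(path, "rb") as src:
                out.write(src.read())
    return matches
```

Something like merge_sorted("/myapp/js/", "incl_*.js", "./temporary/js_backend_merged.js") would mirror the command above. The sort is plain string order, so incl_10 comes before incl_2 unless the numbers are zero-padded as in the question.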

Automatically concat several JavaScript files in one?

I have many JS files that follow a naming convention. Is there any tool that concatenates all these files into one, so I don't need to include them all separately?
You could look at using a JS Minifier, like:
http://code.google.com/p/minify/
http://code.google.com/closure/compiler/
http://www.crockford.com/javascript/jsmin.html
Something like:
$> cat file1.js file2.js ... > allfiles.js
?
In bash:
for file in *.js; do [ "$file" = outputscript.js ] || { cat "$file"; echo; }; done > outputscript.js
You might find some helpful information here. Several answers advocate creating multiple javascript files during development, just so developers don't go crazy scrolling through thousands of lines of code. Those files are then merged during deployment -- several approaches are suggested. Several respondents also advocate using minification, which can reduce the size of the resulting mega-file.
In addition to the straightforward answers like using cat or a minifier, you can also use Sprockets, which is a JavaScript build system written in Ruby. Prototype JS uses Sprockets (in fact, Sprockets was written for Prototype).
