Translation string extraction

Translation string extraction - javascript

We have written a translation extraction tool that extracts the strings we've marked for translation in our TypeScript code. The tool is written in JavaScript, reads our TypeScript files, and uses a regex like:
fileContent.match(/this\.translate\((.*?)\);/);
(simplified for readability; this part works fine)
The translation method takes three parameters: 1. the string to be translated, 2. any variables that might be interpolated, 3. a description. The last two are optional.
Examples of the implementation:
this.translate('text to translate');
this.translate('long text' +
'over multiple lines');
this.translate(`text to translate with backticks for interpolation`);
this.translate(`some test with a ${variable}`, [variable]);
this.translate(`some test with a ${variable}`, [variable], 'Description');
We need to extract these three parameters from the source text in JavaScript and are having trouble parsing it. We currently use a regex to check the first opening string character (' or `) and then try to match the closing character, but that is hard to do.
I'm currently trying to use eval (the script doesn't run in the browser, but on the CLI), like this:
function getParameters(text, variables, description){
return {text: text, variables: variables, description: description}
}
toEval = string.replace('this.translate', 'getParameters');
eval(toEval);
This works perfectly if there are no variables, but it complains that "variables" is not defined when we pass variables in.
Can anyone suggest a good/better way to deal with this text extraction?

Instead of regex, you can use either Babel or webpack to properly parse JavaScript (or TypeScript) and extract all the information.
I have a webpack plugin that works on static strings only, but it should give a good starting point:
https://github.com/grassator/webpack-extract-translation-keys
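If adding a parser is not an option right away, a small hand-written scanner is usually more robust than a single regex or eval. The sketch below is only an illustration: extractTranslateCalls is a hypothetical helper, it returns the raw source text of each argument without evaluating anything (so an undefined variable cannot cause an error), and it assumes there are no nested template literals, comments or regex literals inside the call.

```javascript
// Hypothetical helper: returns the raw source text of each argument
// for every this.translate(...) call found in `source`.
function extractTranslateCalls(source, marker = "this.translate(") {
  const calls = [];
  let from = 0;
  while (true) {
    const start = source.indexOf(marker, from);
    if (start === -1) break;
    const args = [];
    let current = "";
    let depth = 0;    // nesting of (), [] and {} inside the call
    let quote = null; // active string delimiter, or null outside strings
    let i = start + marker.length;
    for (; i < source.length; i++) {
      const ch = source[i];
      if (quote) {
        current += ch;
        if (ch === "\\") current += source[++i] || ""; // keep escaped char
        else if (ch === quote) quote = null;           // string ends
        continue;
      }
      if (ch === "'" || ch === '"' || ch === "`") {
        quote = ch;
        current += ch;
      } else if (ch === "(" || ch === "[" || ch === "{") {
        depth++;
        current += ch;
      } else if (ch === ")" && depth === 0) {
        break; // closing parenthesis of the translate call
      } else if (ch === ")" || ch === "]" || ch === "}") {
        depth--;
        current += ch;
      } else if (ch === "," && depth === 0) {
        args.push(current.trim()); // top-level comma separates arguments
        current = "";
      } else {
        current += ch;
      }
    }
    if (current.trim() !== "") args.push(current.trim());
    calls.push(args);
    from = i + 1;
  }
  return calls;
}

const example = "this.translate(`some test with a ${variable}`, [variable], 'Description');";
console.log(extractTranslateCalls(example));
```

Each call comes back as an array of one to three strings (text, variables, description), still in source form, so multi-line concatenations stay intact and can be post-processed separately.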

Related

Regex to find functions with a single parameter in editors [duplicate]

I'm trying to use a PowerShell regex to pull version data from the AssemblyInfo.cs file. The regex below is my best attempt; however, it only pulls the string [assembly: AssemblyVersion(". I've put this regex into a couple of web regex testers and it looks like it's doing what I want, but this is my first crack at a PowerShell regex, so I could be looking at it wrong.
$s = '[assembly: AssemblyVersion("1.0.0.0")]'
$prog = [regex]::match($s, '([^"]+)"').Groups[1].Value

You also need to include the opening double quote in the pattern; otherwise the match starts at the beginning of the string and captures everything up to the first ".
$prog = [regex]::match($s, '"([^"]+)"').Groups[1].Value

Try this regex "([^"]+)"
Regex101 Demo
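The effect of the leading quote does not depend on PowerShell and can be checked in any regex engine. A minimal sketch in JavaScript, using the same input string as the question (the variable names are only for illustration):

```javascript
const line = '[assembly: AssemblyVersion("1.0.0.0")]';

// Without the leading quote, [^"]+ starts matching at the beginning of the input.
const withoutQuote = line.match(/([^"]+)"/)[1];

// With the leading quote, the group is bounded by quotes on both sides.
const withQuote = line.match(/"([^"]+)"/)[1];

console.log(withoutQuote); // [assembly: AssemblyVersion(
console.log(withQuote);    // 1.0.0.0
```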

Regular expressions can get hard to read, so best practice is to make them as simple as they can be while still solving all possible cases you might see. You are trying to retrieve the only numerical sequence in the entire string, so we should look for that and bypass using groups.
$s = '[assembly: AssemblyVersion("1.0.0.0")]'
$prog = [regex]::match($s, '[\d\.]+').Value
$prog
1.0.0.0

For the generic solution of data between double quotes, the other answers are great. If I were parsing AssemblyInfo.cs for the version string however, I would be more explicit.
$versionString = [regex]::match($s, 'AssemblyVersion\("(\d+\.\d+\.\d+\.\d+)"\)').Groups[1].Value
$version = [version]$versionString
$versionString
1.0.0.0
$version
Major Minor Build Revision
----- ----- ----- --------
1 0 0 0
Update/Edit:
Related to parsing the version (again, assuming this is not a generic question about parsing text between double quotes): I would not actually have a version in the M.m.b.r format in my file. I have always found that Major.minor is enough, and a format like 1.2.* gives you some extra information without any effort.
See Compile date and time and Can I generate the compile date in my C# code to determine the expiry for a demo version?.
When using a * for the third and fourth part of the assembly version, then these two parts are set automatically at compile time to the following values:
third part is the number of days since 2000-01-01
fourth part is the number of seconds since midnight divided by two (although some MSDN pages say it is a random number)
Something to think about I guess in the larger picture of versions, requiring 1.2.*, allowing 1.2, or 1.2.3, or only accepting 1.2.3.4, etc.
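Assuming the rule quoted above holds (third part is days since 2000-01-01, fourth part is seconds since midnight divided by two), the compile time can be recovered from a version with a few lines of arithmetic. This is only a sketch; decodeAutoVersion is a hypothetical helper, and it works in local time.

```javascript
// Hypothetical helper: turn auto-generated build/revision numbers
// back into a local Date, following the rule described above.
function decodeAutoVersion(build, revision) {
  const date = new Date(2000, 0, 1);    // 2000-01-01, local midnight
  date.setDate(date.getDate() + build); // build = days since 2000-01-01
  date.setSeconds(revision * 2);        // revision = seconds since midnight / 2
  return date;
}

// Example: version 1.2.366.21600
const compiled = decodeAutoVersion(366, 21600);
console.log(compiled.toString());
```

With these inputs the result is noon on 2001-01-01 local time, since 2000 was a leap year (366 days) and 21600 * 2 seconds is 12 hours.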

Parse JS code for comments

I have a small NodeJS program that I use to extract code comments from files I point it to. It mostly works, but I'm having some issues dealing with it misinterpreting certain JS strings (glob patterns) as code comments.
I'm using the regex [^:](\/\/.+)|(\/\*[\W\w\n\r]+?\*\/) to parse the following test file:
function DoStuff() {
/* This contains the value of foo.
Foo is used to display "foo"
via http://stackoverflow.com
*/
this.foo = "http://google.com";
this.protocolAgnosticUrl = "//cdnjs.cloudflare.com/ajax/libs/jquery/3.2.1/core.js";
//Show a message about foo
alert(this.foo);
/// This is a triple-slash comment!
const globPatterns = [
'path/to/**/*.tests.js',
'!my-file.js',
"!**/folder/*",
'another/path/**/*.tests.js'
];
}
Here's a live demo to help visualize what is and is not properly captured by the regex: https://regex101.com/r/EwYpQl/1
I need to be able to only locate the actual code comments here, and not the comment-like syntax that can sometimes appear within strings.

I have to agree with the comments that in most cases it is better to use a parser, even when a RegExp can do the job for a specific and well-defined use case.
The problem is not that you can't make it work for that very specific use case, even though there are probably plenty of edge cases that you don't care about (and don't have to) but that may break the solution. The actual problem is that if you start building around a sub-optimal solution and your requirements evolve over time, you will start patching issues as they appear. Someday you may find yourself with an extensive codebase full of patches that doesn't scale anymore, and the only solution will probably be to start from scratch.
Anyway, you have been warned by a few of us, and it is still possible that your use case really is that simple and will not change in the future. I would still consider moving from RegExp to a parser at some point, but maybe you can use this in the meantime:
(^ +\/\/(.*))|(["'`]+.*["'`]+.*\/\/(.*))|(["'`]+.*["'`]+.*\/\*([\W\w\n\r]+?)\*\/)|(^ +\/\*([\W\w\n\r]+?)\*\/)
Just in case, I have added a few other cases, such as comments that come straight after some valid code.
Edit to prove the first point and what is being said in the comments:
I have just answered this with the previous RegExp that was solving just the issue that you pointed out in your question (your RegExp was misinterpreting strings containing glob patterns as code comments).
So, I fixed that and I even made it able to match comments that start on the same line as a valid (non-commented) statement. Just a moment after posting that, I noticed that this last feature only works if that statement contains a string.
This is the updated version, but please, keep in mind that this is exactly what we are warning you about...:
(^[^"'`\n]+\/\/(.*))|(["'`]+.*["'`]+.*\/\/(.*))|(["'`]+.*["'`]+.*\/\*([\W\w\n\r]+?)\*\/)|(^[^"'`\n]+\/\*([\W\w\n\r]+?)\*\/)
How does it work?
There are 4 main groups that compose the whole RegExp, the first two for single-line comments and the next two for multi-line comments:
(^[^"'`\n]+\/\/(.*))
(["'`]+.*["'`]+.*\/\/(.*))
(["'`]+.*["'`]+.*\/\*([\W\w\n\r]+?)\*\/)
(^[^"'`\n]+\/\*([\W\w\n\r]+?)\*\/)
You will see there are some repeated patterns:
^[^"'`\n]+: From the start of a line, match anything that doesn't include any kind of quote or line break.
` is for ES2015 template literals.
Line breaks are excluded as well to prevent matching empty lines.
Note the + will prevent matching comments that are not padded with at least one space. You can try replacing it with *, but then it will match strings containing glob patterns again.
["'`]+.*["'`]+.*: This matches anything that is between quotes, including anything that looks like a comment but is part of a string. Whatever you match after that will be outside the string, so using another group you can match comments.
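For comparison, here is a rough sketch of the scanner idea that the parser recommendation points toward: walk the source once, skip over string literals, and only then look for // and /* */. The function name findComments is hypothetical, and the sketch deliberately ignores regex literals and ${...} expressions inside template literals, which a real parser would handle.

```javascript
// Hypothetical helper: collect comments while skipping string literals.
function findComments(source) {
  const comments = [];
  let i = 0;
  while (i < source.length) {
    const ch = source[i];
    const next = source[i + 1];
    if (ch === "'" || ch === '"' || ch === "`") {
      // Skip the whole string so comment-like text inside it is ignored.
      i++;
      while (i < source.length && source[i] !== ch) {
        i += source[i] === "\\" ? 2 : 1; // step over escaped characters
      }
      i++; // step past the closing delimiter
    } else if (ch === "/" && next === "/") {
      // Single-line comment: runs to the end of the line.
      let end = source.indexOf("\n", i);
      if (end === -1) end = source.length;
      comments.push(source.slice(i, end));
      i = end;
    } else if (ch === "/" && next === "*") {
      // Block comment: runs to the next */ (or to the end of input).
      let end = source.indexOf("*/", i + 2);
      end = end === -1 ? source.length : end + 2;
      comments.push(source.slice(i, end));
      i = end;
    } else {
      i++;
    }
  }
  return comments;
}

console.log(findComments('var url = "//cdn.example"; // real comment'));
```

Because strings are consumed before comment markers are considered, glob patterns such as 'path/to/**/*.tests.js' and protocol-relative URLs never reach the comment branches.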

Finding text strings in JavaScript

I have a large valid JavaScript file (utf-8), from which I need to extract all text strings automatically.
For simplicity, the file doesn't contain any comment blocks in it, only valid ES6 JavaScript code.
Once I find an occurrence of ' or " or `, I need to scan for the end of the text block, which is where I got stuck, given all the possible variations, like "'", '"', "\'", '\"', '", `\``, etc.
Is there a known and/or reusable algorithm for detecting the end of a valid ES6 JavaScript text block?
UPDATE-1: My JavaScript file isn't just large; I also have to process it as a stream, in chunks, so regex is absolutely not usable. I didn't want to complicate my question by mentioning joining chunks of code. I will figure that out myself if I have an algorithm that works for a single piece of code that's in memory.
UPDATE-2: I got this working initially, thanks to the advice given here, but then I got stuck again because of regular expressions.
Examples of Regular Expressions that break any of the text detection techniques suggested so far:
/'/
/"/
/\`/
Having studied the matter closer, by reading this: How does JavaScript detect regular expressions?, I'm afraid that detecting regular expressions in JavaScript is a whole new ball game, worth a separate question, or else it gets too complicated. But I appreciate very much if somebody can point me in the right direction with this issue...
UPDATE-3: After much research I found, with regret, that I cannot come up with an algorithm that would work in my case, because the presence of regular expressions makes the task far more complicated than initially thought. According to the following: When parsing Javascript, what determines the meaning of a slash?, determining the beginning and end of regular expressions in JavaScript is one of the most complex and convoluted tasks. And without it we cannot figure out whether the symbols ', " and ` are opening a text block or are inside a regular expression.

The only way to parse JavaScript is with a JavaScript parser. Even if you were able to use regular expressions, at the end of the day they are not powerful enough to do what you are trying to do here.
You could either use one of several existing parsers, which are very easy to use, or write your own, simplified to focus on the string extraction problem. I can hardly imagine you want to write your own parser, even a simplified one. You will spend much more time writing and maintaining it than you might think.
For instance, an existing parser will handle something like the following without breaking a sweat.
`foo${"bar"+`baz`}`
The obvious candidates for parsers to use are esprima and babel.
By the way, what are you planning to do with these strings once you extract them?

If you only need an approximate answer, or if you want to get the string literals exactly as they appear in the source code, then a regular expression can do the job.
Given the string literal "\n", do you expect a single-character string containing a newline or the two characters backslash and n?
In the former case you need to interpret escape sequences exactly like a JavaScript interpreter does. What you need is a lexer for JavaScript, and many people have already programmed this piece of code.
In the latter case the regular expression has to recognize escape sequences like \x40 and \u2026, so even in that case you should copy the code from an existing JavaScript lexer.
See https://github.com/douglascrockford/JSLint/blob/master/jslint.js, function tokenize.
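For the chunked-stream constraint mentioned in the question, the usual trick is to keep the scanner state outside the loop so that a string (or an escape sequence) can span two chunks. The following is only a sketch under strong assumptions: createStringScanner is a hypothetical helper, it returns string literals exactly as they appear in the source, and it does not recognise regex literals, comments or ${...} nesting in template literals, which is exactly the part that needs a real lexer.

```javascript
// Hypothetical helper: a scanner whose state survives between chunks.
function createStringScanner() {
  let quote = null;    // active delimiter, or null when outside a string
  let escaped = false; // true when the previous character was a backslash
  let current = "";    // raw text of the string being read
  const strings = [];

  return {
    write(chunk) {
      for (const ch of chunk) {
        if (quote === null) {
          if (ch === "'" || ch === '"' || ch === "`") {
            quote = ch; // a string starts here
            current = "";
          }
        } else if (escaped) {
          current += ch; // escaped character never ends the string
          escaped = false;
        } else if (ch === "\\") {
          current += ch;
          escaped = true;
        } else if (ch === quote) {
          strings.push(current); // matching delimiter ends the string
          quote = null;
        } else {
          current += ch;
        }
      }
    },
    strings,
  };
}

const scanner = createStringScanner();
scanner.write("var a = 'it\\"); // chunk ends in the middle of an escape
scanner.write("'s'; var b = \"x`y\";");
console.log(scanner.strings);
```

The two chunks together contain the source text var a = 'it\'s'; var b = "x`y"; and the scanner reports the two literals it\'s and x`y, even though the first chunk ends on the backslash.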

Try the code below:
var txt = "var z,b \n;z=10;\n b='321`1123`321321';\n c='321`321`312`3123`';";
function fetchStrings(txt) {
var result = [];
for (var i = 0; i < txt.length; i++) {
// Possible string delimiters
if (txt[i] === "'" || txt[i] === '"' || txt[i] === "`") {
// Find the matching closing delimiter, skipping backslash-escaped characters
var quote = txt[i];
var end = i + 1;
while (end < txt.length && txt[end] !== quote) {
end += txt[end] === "\\" ? 2 : 1;
}
if (end >= txt.length) break; // unterminated string
result.push(txt.slice(i + 1, end));
// Jump to the closing delimiter of the fetched string
i = end;
}
}
return result;
}
console.log(fetchStrings(txt));

How does Google Closure Compiler handle quotes (string literals)?

The question 'How does Google Closure Compiler handle quotes (string literals)?' could also be re-phrased like:
Why does Closure replace/swap single quotes with double quotes?
How does Closure decide what quote formatting/style to use?
(How) can I change this (default) behavior?
Note 1: this question is not about why Closure (or some other minifiers) have chosen to prefer double quotes (as is asked here and here).
Note 2: this question is not about single vs double quote discussions, however some understanding what GCC does to our code (and why) is rather useful!

It is often stated that (or asked why) Google Closure Compiler (GCC) replaces single quotes with double-quotes (even when the compilation_level is set to WHITESPACE_ONLY !!!):
example xmp_1.js:
alert('Hello world!'); // output: alert("Hello world!");
However... this is only half of 'the truth', because:
example xmp_2.js:
alert('Hello "world"!'); // output: alert('Hello "world"!');
// *NOT*: alert("Hello \"world\"!");
GCC is essentially a 'your raw javascript' to 'smaller (and more efficient) javascript' translator: so it does not 'blindly' replace single quotes with double quotes, but tries to choose an 'optimal quote-character' (after all.. one of the primary goals is to 'mini-fy' the script).
From the source-code (CompilerOptions.java) and this issue-report one can learn that:
If the string contains more single quotes than double quotes then the
compiler will wrap the string using double quotes and vice versa.
If the string contains no quotes or an equal number of single and
double quotes, then the compiler will default to using double quotes.
Like this example xmp_3.js:
alert('Hello "w\'orld"!'); // output: alert('Hello "w\'orld"!');
alert('Hello "w\'o\'rld"!'); // alert("Hello \"w'o'rld\"!");
Note how the above xmp_3 results in a 'mixed' output that uses both ' and " as outer quotation: the optimal choice followed by the default (when it didn't matter).
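The rule quoted above can be written down in a few lines. This is only an illustration of the rule as described in this answer, not Closure's actual code; quoteLikeRuleAbove is a hypothetical name.

```javascript
// Illustration of the quoting rule described above (not Closure's implementation).
function quoteLikeRuleAbove(value) {
  const singles = (value.match(/'/g) || []).length;
  const doubles = (value.match(/"/g) || []).length;
  // More double quotes inside -> wrap in single quotes;
  // otherwise (more singles, a tie, or no quotes) default to double quotes.
  const quote = doubles > singles ? "'" : '"';
  const escaped = value
    .replace(/\\/g, "\\\\") // escape backslashes first
    .split(quote)
    .join("\\" + quote);    // then escape the chosen delimiter
  return quote + escaped + quote;
}

console.log(quoteLikeRuleAbove('Hello world!'));       // "Hello world!"
console.log(quoteLikeRuleAbove('Hello "world"!'));     // 'Hello "world"!'
console.log(quoteLikeRuleAbove('Hello "w\'orld"!'));   // 'Hello "w\'orld"!'
console.log(quoteLikeRuleAbove('Hello "w\'o\'rld"!')); // "Hello \"w'o'rld\"!"
```

The four calls reproduce the outputs of xmp_1, xmp_2 and both lines of xmp_3.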
How to change/override the default double quotes to single quotes?
As it turned out there are some serious legitimate real-world cases where defaulting to single-quotes would have been better. As explained in the issue 836 (from Oct 8, 2012) referenced above:
The FT web app (app.ft.com) and the Economist app for Playbook deliver
JavaScript updates to the client along with other resources by
transmitting them as part of a JSON-encoded object. JSON uses double
quotes natively, so all the double quotes in the compiled JavaScript
need to be escaped. This inflates the size of the FT web app's JS by
about 20kB when transmitting a large update.
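The inflation described in the quote is easy to reproduce on a small scale. In this sketch (the variable names and the js key are only for illustration), the same statement is JSON-encoded once with double quotes and once with single quotes:

```javascript
const withDoubleQuotes = 'alert("Hello world!");';
const withSingleQuotes = "alert('Hello world!');";

// JSON has to escape every double quote inside a string value.
const payloadDouble = JSON.stringify({ js: withDoubleQuotes });
const payloadSingle = JSON.stringify({ js: withSingleQuotes });

console.log(payloadDouble); // {"js":"alert(\"Hello world!\");"}
console.log(payloadSingle); // {"js":"alert('Hello world!');"}
console.log(payloadDouble.length - payloadSingle.length); // 2
```

Every double quote in the compiled script costs one extra backslash in the JSON payload, which adds up over a large bundle.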
The reporter of the issue came with a gift: a patch that added the option prefer_single_quotes to change the default quote-character from double quote to single quote.
This issue was taken seriously enough that project member Santos considered changing the default from double quotes to single quotes ('and see if anybody complains'), twice (also after the reporter and patch contributor stated that he had implemented it as an option so that it wouldn't have any backward-compatibility consequences, since 'someone might be relying on strings being output with double quotes for some bizarre reason').
However, about one week later the patch was accepted (r2258), another week later reworked (r2257) and on Oct 30, 2012 Santos reported back that the option could now be enabled with:
--formatting=SINGLE_QUOTES
(so a third option besides PRETTY_PRINT and PRINT_INPUT_DELIMITER for the formatting-key).
(Note: in the current source-code one can currently still find numerous references to 'prefer_single_quotes' as well.)
Usage:
If you (download and) use the (local java) application:
java -jar compiler.jar --js xmp_1.js --formatting SINGLE_QUOTES
and you will see that: alert('Hello world!'); now compiles to alert('Hello world!');
However, at the time of writing, the Compiler Service API and the UI (which most probably uses the API) located at http://closure-compiler.appspot.com do not accept this third formatting option (new, although it has existed for a year), SINGLE_QUOTES, and will throw an error:
17: Unknown formatting option single_quotes.
After digging (again) through the source, it seems (I'm not a Java-expert) that this is because jscomp/webservice/common/Protocol.java only accepts the older PRETTY_PRINT and PRINT_INPUT_DELIMITER
/**
* All the possible values for the FORMATTING key.
*/
public static enum FormattingKey implements ProtocolEnum {
PRETTY_PRINT("pretty_print"),
PRINT_INPUT_DELIMITER("print_input_delimiter"),
;
I will update this answer should this option become available in the API and/or UI.
Hope this helps and saves someone some time, since the only documentation and reference Google can find about SINGLE_QUOTES is currently in this one issue 836 and some comments in the source. Now it has some explanation on SO (where I'd expect it).

How can I convert JavaScript code into one big Java string

So I have 1000 lines of JavaScript. I need to turn it into a Java String so that I can output it (via System.out.println or whatever).
I'm looking for an online tool to escape all the quotes... something geared toward my specific need would be nice as I don't want other special characters changed. Lines like:
var rgx = /(\d+)(\d{3})/;
need to stay intact.
The situation mandates the JavaScript be put into a String so please no workarounds.

Here's a link which features Crockford's implementation of the quote() function. Use it to build your own JavaScript converter.
Edit: I also slightly modified the function to output an ascii-safe string by default.
Edit2: Just a suggestion: It might be smarter to keep the JavaScript in an external file and read it at runtime instead of hardcoding it...
Edit3: And here's a fully-featured solution - just copy to a .html file and replace the dummy script:
<script src="quote.js"></script>
<script>
// this is the JavaScript to be converted:
var foo = 'bar';
var spam = 'eggs';
function fancyFunction() {
return 'cool';
}
</script>
<pre><script>
document.writeln(quote(
document.getElementsByTagName('script')[1].firstChild.nodeValue, true));
</script></pre>

You can compress the file using one of the available tools to achieve this effect:
YUI Compressor Online
Dean Edwards' Packer
Douglas Crockford's JSMIN

You can use the jsmin tool to compress the JavaScript to a single line (hopefully), but it doesn't escape the quotes. This can be done with search/replace in an editor or in the server-side scripting language used.

So everything I tried ended up breaking the javascript. I finally got it to work by doing the following:
Using Notepad++:
Hit Shift + Tab a bunch of times to unindent every line
Do View -> Show End Of Line
Highlight the LF char and do a Replace All to replace with empty string
Repeat for the CR char
Highlight a " (quote character) and do a Replace All with \" (escaped quote)... just typing the quote character into the Replace prompt only grabbed some of the quotes for some reason.
Now you have one enormously long line. I ended up having to break the string apart into lines of about 2000 characters each. The crazy long line was killing IE and/or breaking the Java String limit.
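The manual steps above can also be scripted. The sketch below is one possible approach, not a complete converter: toJavaStringLiteral is a hypothetical helper that escapes only backslashes, double quotes and tabs, and emits one Java string literal per source line joined with +, so regex literals such as /(\d+)(\d{3})/ survive unchanged once Java unescapes them. Other control characters and non-ASCII text are passed through as-is.

```javascript
// Hypothetical helper: turn JavaScript source into a Java string expression,
// one concatenated literal per original line.
function toJavaStringLiteral(source) {
  const escapeLine = (line) =>
    line
      .replace(/\\/g, "\\\\") // backslashes first, so later escapes stay intact
      .replace(/"/g, '\\"')   // double quotes
      .replace(/\t/g, "\\t"); // tabs
  const lines = source.split(/\r\n|\r|\n/);
  return lines
    .map((line, i) => {
      // Re-add the line break as \n on every line except the last.
      const newline = i < lines.length - 1 ? "\\n" : "";
      return '"' + escapeLine(line) + newline + '"';
    })
    .join(" +\n");
}

console.log(toJavaStringLiteral('var rgx = /(\\d+)(\\d{3})/;\nalert("done");'));
```

The output can be pasted after String js = in a Java file; keeping one literal per line also avoids the single enormous line described above.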
