Javascript regular expression to find double quotes between certain characters - javascript

I'm trying to create a string that can be parsed into JSON. The string is dynamically created based on the content in a CMS.
That content could contain HTML markup with double quotes, which confuses the JSON parser. So, I need to replace the double quotes in the HTML with &quot; without replacing the double quotes which are actually a part of the JSON structure.
My idea is to wrap the HTML inside markers, which I could use to identify everything between these markers as the quotes I want to replace.
For instance, the string I want to parse into JSON could look like this...
str = '{"key1":"XXX<div id="divId"></div>YYY", "key2":"XXX<div id="divId"></div>YYY"}';
So, I want to replace every double quote between an XXX and a YYY with a &quot;.
Something like...
str = str.replace(/XXX(")YYY/g, '&quot;');
Hope that made sense. Thanks for any suggestions.

Given Stack Overflow's "we don't do your homework" principles, I don't think I'm going to work through the whole solution, but I can give you some pointers in half-finished code.
var xySearch = /this regex should find any text between XXX...YYY/g;
// note the g at the end! That's important
var result;
var doubleQuoteIndices = [];
// note the single-equals. I avoid them when possible inside a "condition" statement,
// but here it sort of makes sense.
while (result = xySearch.exec(str)) {
var block = result[0];
// inside of the block, find the index in str of each double-quote, and add it
// to doubleQuoteIndices. You will likely need result.index for the absolute position.
}
// loop backwards through str (so that left-side replacements don't change right-side indexes)
// to replace the characters at each doubleQuoteIndices with the appropriate HTML.
I find that, as great as regexes are for certain patterns, having the programming language do some of the work is often the best solution.
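One possible way to finish this idea, as a minimal sketch rather than the only solution: instead of collecting indices, pass a callback to replace so that each XXX...YYY block is rewritten on its own. The function name escapeMarkedQuotes is made up for illustration, and two things are assumed here: that the quotes should become &quot;, and that the XXX/YYY markers should be dropped once a block has been processed.

```javascript
// Hypothetical helper: escapes double quotes only inside XXX...YYY blocks.
function escapeMarkedQuotes(str) {
  // [\s\S]*? lazily matches anything (including line breaks) up to the next YYY
  return str.replace(/XXX([\s\S]*?)YYY/g, function (whole, inner) {
    // Assumption: quotes become &quot; and the markers themselves are removed
    return inner.replace(/"/g, '&quot;');
  });
}

var str = '{"key1":"XXX<div id="divId"></div>YYY", "key2":"XXX<div id="divId"></div>YYY"}';
var parsed = JSON.parse(escapeMarkedQuotes(str));
console.log(parsed.key1); // <div id=&quot;divId&quot;></div>
```

Quotes outside the markers are left alone, so the JSON structure itself stays parseable.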

Related

Use jQuery to Auto Escape characters from a var

First, I'm not sure I've titled my question properly. Please feel free to correct me if needed.
My Issue:
I've created a variable in jQuery called siteTitle. This variable is available for other .js files to use and then gets passed back to the .html page.
It all works great and there are no issues except when siteTitle contains certain characters that need to be escaped (double quote, single quote, and ampersand, to be specific).
What I would like to do is use a bit of jQuery that would search a particular DOM element, see if it is using any of those characters, and then automatically escape them.
I've searched for some similar functions and cannot seem to find exactly what I need ... the closest idea I have seen is something like this. It's not exactly what I need, but it is something like what I am looking for.
pathto: function(path, file) {
var rtrim = function(str, list) {
var charlist = !list ? '\\s\xA0': (list + '').replace(/([\[\]\(\)\.\?\/\*\{\}\+\$\^\:])/g, '\\$1');
var re = new RegExp('[' + charlist + ']+$', 'g');
return (str + '').replace(re, '');
};
So, I am trying to write a function that will automatically convert those characters to be escaped or their html equivalent.
So, if the var siteTitle is used in a dom element like this:
<h1 class="titleText">' + siteTitle + '</h1>
I need to be able to make sure that any characters get escaped in that element.
Here is a jsFiddle that shows exactly what I am trying to do ...
https://jsfiddle.net/bbyrdhouse/5jb2fdsr/1/
Any help is greatly appreciated.
Since you're using jQuery, use the .text() function to set the value into your HTML. It'll escape it appropriately.
var siteTitle = 'My Site "Title"';
$my('.titleText').text(siteTitle);
Also, in your fiddle, the siteTitle variable is not what you think it is, because the 2nd quotation closes that value since it's not yet escaped. I wrapped it in single quotes in my example.
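If the title has to be concatenated into an HTML string rather than set with .text(), a small escaping helper is one option. This is only a sketch: escapeHtml is a hypothetical name, not a jQuery function, and the entity list covers just the characters mentioned in the question plus < and >.

```javascript
// Hypothetical helper, not part of jQuery: escapes characters that are special in HTML.
function escapeHtml(text) {
  var entities = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;' };
  return String(text).replace(/[&<>"']/g, function (ch) {
    return entities[ch];
  });
}

var siteTitle = 'Tom & Jerry\'s "Site"';
var html = '<h1 class="titleText">' + escapeHtml(siteTitle) + '</h1>';
console.log(html);
// <h1 class="titleText">Tom &amp; Jerry&#39;s &quot;Site&quot;</h1>
```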

Remove new line in javascript code in string

I have a string with a line-break in the source code of a javascript file, as in:
var str = 'new
line';
Now I want to delete that line-break in the code. I couldn't find anything on this, I kept getting stuff about \n and \r.
Thanks in advance!
EDIT (2021)
This question was asked a long, long time ago, and it's still being viewed relatively often, so let me elaborate on what I was trying to do and why this question is inherently flawed.
What I was trying to accomplish was simply to use syntax like the above (i.e. multi-line strings), which raises a SyntaxError as written.
However, the code above is just invalid JS. You cannot use code to fix a syntax error, you just can't make syntax errors in valid usable code.
The above can now be accomplished if we use backticks instead of single quotes to turn the string into a template literal:
var str = `new
line`;
is totally valid and would be identical to
var str = 'new\n line';
As far as removing the newlines goes, I think the answers below address that issue adequately.
If you do not know in advance whether the "new line" is \r or \n (in any combination), easiest is to remove both of them:
str = str.replace(/[\n\r]/g, '');
It does what you ask; you end up with newline. If you want to replace the new line characters with a single space, use
str = str.replace(/[\n\r]+/g, ' ');
str = str.replace(/\n|\r/g,'');
Replaces all instances of \n or \r in a string with an empty string.
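One detail worth noting: with Windows line endings (\r\n), replacing \r and \n one at a time with a space gives two spaces per break. As an alternative to collapsing runs with +, here is a sketch that treats \r\n as a single break. The name replaceLineBreaks is made up for this example.

```javascript
// Hypothetical helper: replaces every line break (\r\n, \n or \r) with `replacement`.
function replaceLineBreaks(str, replacement) {
  // \r\n is listed first so a Windows line break counts as one break, not two
  return str.replace(/\r\n|\n|\r/g, replacement);
}

console.log(replaceLineBreaks('new\r\nline', ''));  // newline
console.log(replaceLineBreaks('new\r\nline', ' ')); // new line
```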

Confused with Regex JS pattern

OK, I have the following data in my div:
<div id="mydiv">
<!--
what is your present
<code>alert("this is my present");</code>
where?
<code>alert("here at my left hand");</code>
oh thank you! i love you!! hehe
<code>alert("welcome my honey ^^");</code>
-->
</div>
What I need to do is get all the scripts inside the <code> blocks and the text nodes of the HTML, without removing the HTML comments inside. It's homework given by my professor, and I can't modify that div block.
I need to use regular expressions for this and this is what i did
var block = $.trim($("div#mydiv").html()).replace("<!--","").replace("-->","");
var htmlRegex = new RegExp(""); //I don't know what to do here
var codeRegex = new RegExp("^<code(*n)</code>$","igm");
var code = codeRegex.exec(block);
var html = "";
it really doesn't work... please don't give the exact answer.. please teach me.. thank you
I need to have the following blocks for the variable code
alert("this is my present");
alert("here at my left hand");
alert("welcome my honey ^^");
and this is the blocks i need for variable html
what is your present
where?
oh thank you! i love you!! hehe
my question is what is the regex pattern to get the results above?
Parsing HTML with a regular expression is not something you should do.
I'm sure your professor thinks he/she was really clever and that there's no way to access the DOM API and can wave a banner around and justify some minor corner-case for using regex to parse the DOM and that sometimes it's okay.
Well, no, it isn't. If you have complex code in there, what happens? Your regex breaks, and perhaps becomes a security exploit if this is ever in production.
So, here:
http://jsfiddle.net/zfp6D/
Walk the DOM, get the nodeType 8 (comment) text value out of the node.
Invoke the HTML parser (that thing that browsers use to parse HTML, rather than regex, why you wouldn't use the HTML parser to parse HTML is totally beyond me, it's like saying "Yeah, I could nail in this nail with a hammer, but I think I'm going to just stomp on the nail with my foot until it goes in").
Find all the CODE elements in the newly parsed HTML.
Log them to console, or whatever you want to do with them.
First of all, you should be aware that because HTML is not a regular language, you cannot do generic parsing using regular expressions that will work for all valid inputs (generic nesting in particular cannot be expressed with regular expressions). Many parsers do use regular expressions to match individual tokens, but other algorithms need to be built around them.
However, for a fixed input such as this, it's just a case of working through the structure you have (though it's still often easier to use different parsing methods than just regular expressions).
First, let's get all the code:
var code = '', match = [];
var regex = new RegExp("<code>(.*?)</code>", "g");
while (match = regex.exec(content)) {
code += match[1] + "\n";
}
I assume content contains the content of the div that you've already extracted. Here the "g" flag says this is for "global" matching, so we can reuse the regex to find every match. The brackets indicate a capturing group, . means any character, * means repeated 0 or more times, and ? means "non-greedy" (see what happens without it to see what it does).
Now we can do a similar thing to get all the other bits, but this time the regex is slightly more complicated:
new RegExp("(<!--|</code>)([\\s\\S]*?)(-->|<code>)", "g")
Here | means "or". So this matches all the bits that start with either "start comment" or "end code" and end with "end comment" or "start code". Because . does not match line breaks and these bits span several lines, [\\s\\S] (any whitespace or non-whitespace character) is used in its place. Note also that we now have 3 sets of brackets, so the part we want to extract is match[2] (the second set).
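Putting the two loops together, a runnable sketch might look like this. The content string is rebuilt from the question's div, extractBlocks is a made-up name, and the sketch uses regex literals with non-capturing groups (?:...), so the text ends up in match[1] rather than match[2].

```javascript
// The div's content, rebuilt line by line from the question
var content = [
  '<!--',
  'what is your present',
  '<code>alert("this is my present");</code>',
  'where?',
  '<code>alert("here at my left hand");</code>',
  'oh thank you! i love you!! hehe',
  '<code>alert("welcome my honey ^^");</code>',
  '-->'
].join('\n');

// Hypothetical helper: splits the content into code blocks and plain text blocks.
function extractBlocks(content) {
  var code = [], html = [], match;
  var codeRegex = /<code>([\s\S]*?)<\/code>/g;
  while ((match = codeRegex.exec(content)) !== null) {
    code.push(match[1]);
  }
  // Text sits between "<!--" or "</code>" and the next "-->" or "<code>"
  var textRegex = /(?:<!--|<\/code>)([\s\S]*?)(?:-->|<code>)/g;
  while ((match = textRegex.exec(content)) !== null) {
    var text = match[1].trim();
    if (text) html.push(text); // skip the empty bit before the closing "-->"
  }
  return { code: code, html: html };
}

var blocks = extractBlocks(content);
console.log(blocks.code.join('\n'));
console.log(blocks.html.join('\n'));
```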
You're doing a lot of unnecessary stuff. .html() gives you the inner contents as a string. You should be able to use a regex to grab exactly what you need from there. Also, try to stick with regex literals (e.g. /^regexstring$/). With new RegExp you have to escape the backslashes themselves, which gets really messy. You generally only want to use new RegExp when you need to put a string var into a regex.
The match function of strings accepts a regex and returns an array of every match when you add the global flag (e.g. /^regexstring$/g <-- note the 'g'). I would do something like this:
var block = $('#mydiv').html(), //you can set multiple vars in one statement w/commas
matches = block.match(/<code>[^<]*<\/code>/g);
//[^<]* <-- 0 or more characters that aren't '<' - google 'negated character class'
matches = matches.join('_') //lazy way of avoiding a loop - join into a string with a safe character
.replace(/<\/*code>/g,'') //\/* 0 or more forward slashes
.split('_'); //turn the joined string back into an array and store the result
//Now do what you want with matches. Eval (ew) or append in a script tag (ew).
//You have no control over the 'ew'. I just prefer data to scripts in strings
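The join/split trick breaks if any code block happens to contain the joining character. A hedged alternative, assuming Array.prototype.map is available (the name codeContents is invented for this sketch):

```javascript
// Hypothetical helper: returns the text inside each <code>...</code> pair.
function codeContents(block) {
  // match returns null when nothing is found, so fall back to an empty array
  var matches = block.match(/<code>[^<]*<\/code>/g) || [];
  return matches.map(function (m) {
    return m.replace(/<\/?code>/g, ''); // strip the opening and closing tag
  });
}

var sample = 'a <code>alert("x_y");</code> b <code>alert("z");</code>';
console.log(codeContents(sample)); // [ 'alert("x_y");', 'alert("z");' ]
```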

Can't use javascript regex to get everything between html/xml tags

So I receive some xml in plaintext (and no I can't use DOM or JSON because apparently I am not allowed to), I want to strip all elements encased in a certain element and put them into an array, where I can strip out the text in the individual segments.
Now I am used to using POSIX regex and I will never actually understand the point behind PCRE regex, nor do I get the syntax.
Now here is the code I am using:
var strResponse = objResponse.text;
var strRegex = new RegExp("<item>(.*?)<\/item>","i");
var arrMatches = "";
var match;
while (match = strRegex.exec(strResponse)) {
arrMatches[] = match[1];
}
I have no idea why it won't find any matches with this code, can someone please help me on this and perhaps elaborate on what exactly it is I am continuously doing wrong with the PCRE syntax?
If those tags are on different lines, the . will not match the newline characters and therefore your expression will not match. This is just a guess; I don't know your source.
You can try
var strRegex = new RegExp("<item>([\\s\\S]*?)<\\/item>","ig");
[\\s\\S] is a character class containing all whitespace and all non-whitespace characters. Line breaks are covered by the whitespace characters. The g flag is needed so that each call to exec continues after the previous match; without it, the while loop would find the first match over and over.
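For completeness, here is a sketch of the whole loop in valid JavaScript. Note that arrMatches[] = ... is PHP syntax; in JavaScript the array has to be created with [] and filled with push. The function name extractItems and the sample input are assumptions made for this example.

```javascript
// Hypothetical helper: collects the contents of every <item>...</item> element.
function extractItems(xmlText) {
  var itemRegex = /<item>([\s\S]*?)<\/item>/gi; // g: keep searching after each match
  var items = [];
  var match;
  while ((match = itemRegex.exec(xmlText)) !== null) {
    items.push(match[1]); // JavaScript's equivalent of PHP's arr[] = value
  }
  return items;
}

var sampleXml = '<item>a</item>\n<ITEM>b\nc</ITEM>';
console.log(extractItems(sampleXml)); // [ 'a', 'b\nc' ]
```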
The best way to complete this task is to parse the text as proper HTML and navigate it with the DOM parser, as described in:
Javascript function to parse HTML string into DOM?
Regex is error-prone for this and is in general not very good at parsing irregular text such as HTML structure.

Trying to remove trailing text

I have the following code. I want to extract the last text (hello64) from it.
<span class="qnNum" id="qn">4</span><span>.</span> hello64 ?*
I used the code below but it removes all the integers
questionText = questionText.replace(/<span\b.*?>/ig, "");
questionText=questionText.replace(/<\/span>/ig, "");
questionText = questionText.replace(/\d+/g,"");
questionText = questionText.replace("*","");
questionText = questionText.replace(". ","");
I want to remove the first integer, and need to keep the rest of the integers.
It's the third line .replace(/\d+/g,"") which is replacing the integers. If you want to keep the integers, then don't replace \d+, because that matches one or more digits.
You could achieve most of that all on one line, by the way - there's no need to have multiple replaces there:
var questionText = questionText.replace(/((<span\b.*?>)|(<\/span>)|(\d+))/ig, "");
That would do the same as the first three lines of your code. (Of course, you'd need to drop the |(\d+), as per the first part of the answer, if you didn't want to get rid of the digits.)
[EDIT]
Re your comment that you want to replace the first integer but not the subsequent ones:
The regex string to do this would depend very heavily on what the possible input looks like. The problem is that you've given us a bit of random HTML code; we don't know from that whether you're expecting it to always be in this precise format (ie a couple of spans with contents, followed by a bit at the end to keep). I'll assume that this is the case.
In this case, a much simpler regex for the whole thing would be to replace everything within <span....</span> with blank:
var questionText = questionText.replace(/(<span\b.*?>.*?<\/span>)/ig, "");
This will eliminate the whole of the <span> tags plus their contents, but leave anything outside of them alone.
In the case of your example this would provide the desired effect, but as I say, it's hard to know if this will work for you in all cases without knowing more about your expected input.
In general it's considered difficult to parse arbitrary HTML code with regex. Regex is a contraction of "Regular Expressions", which is a way of saying that they are good at handling strings which have 'regular' syntax. Arbitrary HTML is not a 'regular' syntax due to its unlimited possible levels of nesting. What I'm trying to say here is that if you have anything more complex than the simple HTML snippets you've supplied, then you may be better off using an HTML parser to extract your data.
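As a quick check of the span-stripping replace above against the sample from the question (a sketch that assumes the input always has exactly this shape):

```javascript
var questionText = '<span class="qnNum" id="qn">4</span><span>.</span> hello64 ?*';
// Remove each <span ...>...</span> pair together with its contents
var stripped = questionText.replace(/<span\b.*?>.*?<\/span>/ig, '');
console.log(JSON.stringify(stripped)); // " hello64 ?*"
```

The digits in hello64 survive because only text inside the spans is removed.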
The regex below matches the complete string and puts the part after the last </span>, up to the next word boundary \b, into capturing group 1. You then just replace the whole match with group 1, i.e. $1.
searched_string = string.replace(/^.*<\/span>\s*([A-Za-z0-9]+)\b.*$/, "$1");
The captured word can consist of [A-Za-z0-9]. If you want to have anything else there just add it into that group.
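Wrapped in a function and run on the sample from the question, this could look as follows. It is a sketch: trailingWord is a made-up name, and when the pattern does not match, replace simply returns the input unchanged.

```javascript
// Hypothetical helper: returns the first word after the last </span>.
function trailingWord(html) {
  return html.replace(/^.*<\/span>\s*([A-Za-z0-9]+)\b.*$/, '$1');
}

var sampleHtml = '<span class="qnNum" id="qn">4</span><span>.</span> hello64 ?*';
console.log(trailingWord(sampleHtml)); // hello64
```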
