How to search/match regx in HTML content using javascript/jQuery? - javascript

I have an regex and i need to extract data from document.body.outerHTML i tried various methods like .search() , .match() & .exec() but it didn't work i need to extract data from whole document.body.outerHTML. My class is as below.
class executionPrompt {
constructor(){
this.regex = /(B[0-9]{2})/
this.data = [];
}
start(){
var html = jQuery(document.body.outerHTML);
var matches = jQuery(html).html();
console.log(matches.match(this.regex));
}
}
i have whole document and regex but i am not able to get array for my matching string. Please help me.

From this answer
You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML. Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts. ...
i as commented by #Wiktor Stribiżew i have change my mind and as of now i have requested to server to grabs id then get its index from tree,
FYI: to match all occurrences, a g modifier is required, /B[0-9]{2}/g may be seems an acceptable answer thumbs up for all.

Related

Confused with Regex JS pattern

ok i do have this following data in my div
<div id="mydiv">
<!--
what is your present
<code>alert("this is my present");</code>
where?
<code>alert("here at my left hand");</code>
oh thank you! i love you!! hehe
<code>alert("welcome my honey ^^");</code>
-->
</div>
well what i need to do there is to get the all the scripts inside the <code> blocks and the html codes text nodes without removing the html comments inside. well its a homework given by my professor and i can't modify that div block..
I need to use regular expressions for this and this is what i did
var block = $.trim($("div#mydiv").html()).replace("<!--","").replace("-->","");
var htmlRegex = new RegExp(""); //I don't know what to do here
var codeRegex = new RegExp("^<code(*n)</code>$","igm");
var code = codeRegex.exec(block);
var html = "";
it really doesn't work... please don't give the exact answer.. please teach me.. thank you
I need to have the following blocks for the variable code
alert("this is my present");
alert("here at my left hand");
alert("welcome my honey ^^");
and this is the blocks i need for variable html
what is your present
where?
oh thank you! i love you!! hehe
my question is what is the regex pattern to get the results above?
Parsing HTML with a regular expression is not something you should do.
I'm sure your professor thinks he/she was really clever and that there's no way to access the DOM API and can wave a banner around and justify some minor corner-case for using regex to parse the DOM and that sometimes it's okay.
Well, no, it isn't. If you have complex code in there, what happens? Your regex breaks, and perhaps becomes a security exploit if this is ever in production.
So, here:
http://jsfiddle.net/zfp6D/
Walk the dom, get the nodeType 8 (comment) text value out of the node.
Invoke the HTML parser (that thing that browsers use to parse HTML, rather than regex, why you wouldn't use the HTML parser to parse HTML is totally beyond me, it's like saying "Yeah, I could nail in this nail with a hammer, but I think I'm going to just stomp on the nail with my foot until it goes in").
Find all the CODE elements in the newly parsed HTML.
Log them to console, or whatever you want to do with them.
First of all, you should be aware that because HTML is not a regular language, you cannot do generic parsing using regular expressions that will work for all valid inputs (generic nesting in particular cannot be expressed with regular expressions). Many parsers do use regular expressions to match individual tokens, but other algorithms need to be built around them
However, for a fixed input such as this, it's just a case of working through the structure you have (though it's still often easier to use different parsing methods than just regular expressions).
First lets get all the code:
var code = '', match = [];
var regex = new RegExp("<code>(.*?)</code>", "g");
while (match = regex.exec(content)) {
code += match[1] + "\n";
}
I assume content contains the content of the div that you've already extracted. Here the "g" flag says this is for "global" matching, so we can reuse the regex to find every match. The brackets indicate a capturing group, . means any character, * means repeated 0 or more times, and ? means "non-greedy" (see what happens without it to see what it does).
Now we can do a similar thing to get all the other bits, but this time the regex is slightly more complicated:
new RegExp("(<!--|</code>)(.*?)(-->|<code>)", "g")
Here | means "or". So this matches all the bits that start with either "start comment" or "end code" and end with "end comment" or "start code". Note also that we now have 3 sets of brackets, so the part we want to extract is match[2] (the second set).
You're doing a lot of unnecessary stuff. .html() gives you the inner contents as a string. You should be able to use regEx to grab exactly what you need from there. Also, try to stick with regEx literals (e.g. /^regexstring$/). You have to escape escape characters using new RegExp which gets really messy. You generally only want to use new RegExp when you need to put a string var into a regEx.
The match function of strings accepts regEx and returns a collection of every match when you add the global flag (e.g. /^regexstring$/g <-- note the 'g'). I would do something like this:
var block = $('#mydiv').html(), //you can set multiple vars in one statement w/commas
matches = block.match(/<code>[^<]*<\/code>/g);
//[^<]* <-- 0 or more characters that aren't '<' - google 'negative character class'
matches.join('_') //lazy way of avoiding a loop - join into a string with a safe character
.replace(/<\/*code>/g,'') //\/* 0 or more forward slashes
.split('_');//return the matches string back to array
//Now do what you want with matches. Eval (ew) or append in a script tag (ew).
//You have no control over the 'ew'. I just prefer data to scripts in strings

Regex extraction of one letter inside html chunk

Hi need to extract ONE letter from a string.
The string i have is a big block of html, but the part where i need to search in is this text:
Vahvistustunnus M :
And I need to get the M inside the nbsp's
So, who is the quickest regex-guru out there? :)
Ok, according to this page in the molybdenum api docs, the results will be all of the groups concatenated together. Given that you just want the char between the two 's then it's not good enough to match the whole thing and then pull out the group. Instead you'll need to do something like this:
(?<=Vahvistustunnus )[a-zA-Z](?= )
Warning
This might not work for you because lookbehinds (?<=pattern) are not available in all regex flavors. Specifically, i think that because molybdenum is a firefox extension, then it's likely using ECMA (javascript) regex flavor. And ECMA doesn't support lookbehinds.
If that's the case, then i'm gonna have to ask someone else to answer your question as my regex ninja (amateur) skills don't go much further than that. If you were using the regex in javascript code, then there are ways around this limitation, but based on your description, it sounds like you have to solve this problem with nothing but a raw regex?
Looks like it uses JavaScript and if so
var str = "Vahvistustunnus M :";
var patt = "Vahvistustunnus ([A-Z]) :";
var result = str.match(patt)[1];
should work.

How to add special characters like & > in XML file using JavaScript

I am generating XML using Javascript. It works fine if there are no special characters in the XML. Otherwise, it will generate this message: "invalid xml".
I tried to replace some special characters, like:
xmlData=xmlData.replaceAll(">",">");
xmlData=xmlData.replaceAll("&","&");
//but it doesn't work.
For example:
<category label='ARR Builders & Developers'>
Thanks.
Consider generating the XML using DOM methods. For example:
var c = document.createElement("category");
c.setAttribute("label", "ARR Builders & Developers");
var s = new XMLSerializer().serializeToString(c);
s; // => "<category label=\"ARR Builder & Developers\"></category>"
This strategy should avoid the XML entity escaping problems you mention but might have some cross-browser issues.
This will do the replacement in JavaScript:
xml = xml.replace(/</g, "<");
xml = xml.replace(/>/g, ">");
This uses regular expression literals to replace all less than and greater than symbols with their escaped equivalent.
JavaScript comes with a powerful replace() method for string objects.
In general - and basic - terms, it works this way:
var myString = yourString.replace([regular expression or simple string], [replacement string]);
The first argument to .replace() method is the portion of the original string that you wish to replace. It can be represented by either a plain string object (even literal) or a regular expression.
The regular expression is obviously the most powerful way to select a substring.
The second argument is the string object (even literal) that you want to provide as a replacement.
In your case, the replacement operation should look as follows:
xmlData=xmlData.replace(/&/g,"&");
xmlData=xmlData.replace(/>/g,">");
//this time it should work.
Notice the first replacement operation is the ampersand, as if you should try to replace it later you would screw up pre-existing well-quoted entities for sure, just as "&gt;".
In addition, pay attention to the regex 'g' flag, as with it the replacement will take place all throughout your text, not only on the first match.
I used regular expressions, but for simple replacements like these also plain strings would be a perfect fit.
You can find a complete reference for String.replace() here.

Can't use javascript regex to get everything between html/xml tags

So I receive some xml in plaintext (and no I can't use DOM or JSON because apparently I am not allowed to), I want to strip all elements encased in a certain element and put them into an array, where I can strip out the text in the individual segments.
Now I am used to using POSIX regex and I will never actually understand the point behind PCRE regex, nor do I get the syntax.
Now here is the code I am using:
var strResponse = objResponse.text;
var strRegex = new RegExp("<item>(.*?)<\/item>","i");
var arrMatches = "";
var match;
while (match = strRegex.exec(strResponse)) {
arrMatches[] = match[1];
}
I have no idea why it won't find any matches with this code, can someone please help me on this and perhaps elaborate on what exactly it is I am continuously doing wrong with the PCRE syntax?
If those tags are in different rows the . will not match the newline characters and therefor your expression will not match. This is just a guess, I don't know your source.
You can try
var strRegex = new RegExp("<item>([\\s\\S]*?)<\\/item>","i");
[\\s\\S] is a character class. containing all whitespace and all non whitespace characters. linebreaks are covered by the whitespace characters.
The best way to complete this task is using the following, to parse it as proper HTML and navigate it with the DOM parser:
Javascript function to parse HTML string into DOM?
Regex has it with being very faulty and is in general not very good for parsing irregular text like HTML structure.

match text between two html custom tags but not other custom tags

I have something like the following;-
<--customMarker>Test1<--/customMarker>
<--customMarker key='myKEY'>Test2<--/customMarker>
<--customMarker>Test3 <--customInnerMarker>Test4<--/customInnerMarker> <--/customMarker>
I need to be able to replace text between the customMarker tags, I tried the following;-
str.replace(/<--customMarker>(.*?)<--\/customMarker>/g, 'item Replaced')
which works ok. I would like to also ignore custom inner tags and not match or replace them with text.
Also I need a separate expression to extract the value of the attribute key='myKEY' from the tag with Text2.
Many thanks
EDIT
actually I am trying to find things between comment tags but the comment tags were not displaying correctly so I had to remove the '!'. There's a unique situation that required comment tags... in anycase if anyone knows enough regex to help, it would be great. thank u.
In the end, I did something like the following (incase anyone else needs this. enjoy!!! But note: Word about town is that using regex with html tags is not ideal, so do your own research and make up your mind. For me, it had to be done this way, mostly bcos i wanted to, but also bcos it simplified the job in this instance);-
var retVal = str.replace(/<--customMarker>(.*?)<--\/customMarker>/g, function(token, match){
//question 1: I would like to also ignore custom inner tags and not match or replace them with text.
//answer:
var replacePattern = /<--customInnerMarker*?(.*?)<--\/customInnerMarker-->/g;
//remove inner tags from match
match = $.trim(match.replace(replacePattern, ''));
//replace and return what is left with a required value
return token.replace(match, objParams[match]);
//question 2: Also I need a separate expression to extract the value of the attribute key='myKEY' from the tag with Text2.
//answer
var attrPattern = /\w+\s*=\s*".*?"/g;
attrMatches = token.match(attrPattern);//returns a list of attributes as name/value pairs in an array
})
Can't you use <customMarker> instead? Then you can just use getElementsByTagName('customMarker') and get the inner text and child elements from it.
A regex merely matches an item. Once you have said match, it is up to you what you do with it. This is part of the problem most people have with using regular expressions, they try and combine the three different steps. The regex match is just the first step.
What you are asking for will not be possible with a single regex. You're going to need a mini state machine if you want to use regular expressions. That is, a logic wrapper around the matches such that it moves through each logical portion.
I would advise you look in the standard api for a prebuilt engine to parse html, rather than rolling your own. If you do need to do so, read the flex manual to get a basic understanding of how regular expressions work, and the state machines you build with them. The best example would be the section on matching multiline c comments.

Categories