Delimiting documents with regular expressions

I'm working with several documents that are stored in a single file, and before processing them I need to determine where each document begins and ends. For this, I am using the following regex:
MINISTÉRIO\sDO\sTRABALHO\sE\sEMPREGO(?:[^P]*(?:P(?!ÁG\s:\s\d+\/\d+)[^P]*)*)PÁG\s:\s\d+\/(\d+)\b(?:\D*(?:(?!\1\/\1)\d\D*)*)\1\/\1(?:[^Z]*(?:Z(?!6:\s\d+)[^Z]*)*)Z6:\s\d+
Example is here
It works 100% of the time on text in that form. The problem is that sometimes the text does not arrive the way I showed: it comes with extra spaces and line breaks. As you can see here, the document is the same as the previous one, but the regular expression does not match. Why is it not working, and how can I fix it?
Also, I need to modify the regex, not the text, because the regex is the only part I have access to.
Note: I'm using Node.js, which is why I tagged this post with JavaScript.

Related

Regex replace everything not starting with multiple times in line

I am making a parser of sorts in JavaScript that takes a mathematical expression given to the script as a string, evaluates it, and does some other things with it. If users want to use built-in JavaScript mathematical functions, they have to enter a string such as "1 + Math.log(x)". That becomes very tedious when things get nested, e.g. "Math.abs(Math.log(Math.pow(x, 2))) + Math.log2(x)". As you can see, the "Math." part not only takes longer to write but also makes the expression less readable. I want users to be able to leave out the "Math." part. The way I've done it is with a simple regex that lists all JavaScript Math constants and methods and prepends "Math." to each one it finds. Simple enough:
input = input.replace(/(E|PI|SQRT2|SQRT1_2|LN2|LN10|LOG2E|LOG10E)/g, "Math.$1");
The same thing happens for the methods. This works fine, but it is not very flexible and leaves room for misunderstanding: somebody may come along and insist on typing Math.log(x), which will in turn be replaced with Math.Math.log(x), which won't work.
What I want to know is whether there is some way to match any of these predefined strings (constants and methods) only if they don't already have the "Math." part in front of them. I have tried this:
^(?!Math\.)(log2|log|exp|abs)
but it is quite useless, as it doesn't work with nesting or even with multiple operands. Is there any way to do this purely in regex? That would make the entire process more elegant. Any help would be appreciated.

You can use the following trick: match the "Math." prefix optionally, so the replacement produces the same result whether or not the prefix was already there:
(?:Math\.)?(log2|log|exp|abs|pow)
And replace with Math.$1
See DEMO
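As a minimal sketch of the trick above, the pattern can be wrapped in a small helper. The function name `addMathPrefix` is hypothetical, and the `g` flag is added here on the assumption that every occurrence in the expression should be rewritten:

```javascript
// Optional "Math." prefix followed by one of the known names.
// log2 is listed before log so the longer name wins.
const MATH_NAMES = /(?:Math\.)?(log2|log|exp|abs|pow)/g;

// Hypothetical helper: normalises every listed name to its "Math." form.
function addMathPrefix(input) {
  // If "Math." was present it is consumed by the match and re-emitted once,
  // so "Math.log" never becomes "Math.Math.log".
  return input.replace(MATH_NAMES, "Math.$1");
}

console.log(addMathPrefix("abs(Math.log(pow(x, 2))) + log2(x)"));
```

Because the prefix is consumed when present, running the helper on its own output leaves the string unchanged. Note that this pattern has no word boundaries, so a name such as exp would also match inside a longer identifier; adding \b around the group is one possible refinement.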

Regex replace with multiple wildcards works in PHP, not in JavaScript

I'm attempting to implement center alignment for two Markdown parsers:
In PHP for Parsedown (successfully)
In JavaScript for Bootstrap Markdown (without success)
The idea I'm following and finding the easiest is to work with the final HTML output, and just snap inline styling onto the tags.
The following regex does what I need, it adds style="text-align:center;" to any element so far*, as needed:
$text = preg_replace('/\<(.*?)\>\->(.*?)<\-\<\/(.*?)\>/', '<$1 style="text-align:center;">$2</$3>', $text);
That is, <p>->text<-</p> becomes <p style="text-align:center;">text</p>.
However, when I attempted to port this into JavaScript to also make it available for previewing on client-side, the pattern does not match as it should:
content = content.replace('/\<(.*?)\>\->(.*?)<\-\<\/(.*?)\>/', '<$1 style="text-align:center;">$2</$3>');
The replacement in content does not occur.
I'm aware there are slight differences between Regex of PHP and JavaScript, but I have found examples for all the expected behavior here on both sides, working.
*If someone is wondering by any chance, I'm also successfully adding the center alignment to tags that already have a style attribute - on server side only, so far.

You'll need to use the literal syntax for regular expression in JavaScript, like so:
content = content.replace(/\<(.*?)\>\->(.+)<\-\<\/(.+)\>/gi, '<$1 style="text-align:center;">$2</$3>');
Note that the gi at the end of the regular expression simply enables global searching (that is, replace all occurrences matching the pattern) and case-insensitive matching. They are both technically optional, but you will most likely want the g flag enabled for certain. However, keeping the i flag is up to you (depends on whether or not your content contains &GT;, for example).
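To make the difference concrete, here is a small sketch comparing the two calls. The sample input and the function name `centerAlign` are assumptions for illustration, and the lazy groups from the question's PHP pattern are kept rather than the greedy ones:

```javascript
const replacement = '<$1 style="text-align:center;">$2</$3>';
const sample = '<p>->text<-</p>';

// A string first argument is searched for literally, slashes included,
// so nothing in the sample matches and it comes back unchanged.
const viaString = sample.replace('/<(.*?)>->(.*?)<-<\\/(.*?)>/', replacement);

// A regex literal is compiled as a pattern; g replaces every occurrence.
function centerAlign(html) {
  return html.replace(/<(.*?)>->(.*?)<-<\/(.*?)>/g, replacement);
}

console.log(viaString);           // unchanged input
console.log(centerAlign(sample)); // centered paragraph
```

If the pattern has to be built from a string at runtime, `new RegExp(source, 'g')` is the usual route, with the delimiting slashes left out of the source string.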

regex for background-image:url('URL');

I am trying to write a regex that finds background-image:url('URL'); where the URL is an external link to an image.
I have been trying something like this:
/\s*?[ \t\n]background-image:url('https?:\/\/(?:[a-z\-]+\.)+[a-z]{2,6}(?:\/[^\/#?]+)+\.(?:jpe?g|gif|png)$');/i
But I couldn't get it to work.
I am using this with JavaScript/jQuery.

Does this get what you want?:
/\s*?[ \t\n]background-image:url\('.+?'\);/i

I think you can simplify it to this if you know it will only change with the URL in the middle. I probably went overboard with the \ escapes but better to be safe than sorry.
/background\-image\:url\(\'.*?\'\)\;/

Epascarello hit the nail on the head. Is this source you control? Or at least a predictable website? What are multiple different examples of input and your expected results?
Will this always be inline in double quotes, and therefore your URL will always be in single quotes? Some old websites use double-quotes in their CSS Files or header CSS.
Do you want to capture the whole thing? Or are you just trying to extract the resulting URL?
SirCapsAlot brings up a good question: are you just looking for background image URLs in general? They can also be set through the background shorthand property, or even in JavaScript with .backgroundImage="url(image.jpg)".
And you definitely only want the ones that include http(s)?
With the limited requirements you gave, this is the best Regex:
background-image\s*:\s*url\('(https?://[^']+)
Comment here if you have answers to my questions; they may alter your requirements, and therefore my answer.
Breakdown:
background-image\s*:\s*url //Find the literal text to begin, allowing optional whitespace around the colon
\(' //Find the literal opening parens and quote
( //Begin Capture Group 1
https?:// //Require http:// or https:// (the s is optional because of the ?)
[^']+ //Require that everything until the next quote is matched
) //Capture the result into Group 1
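A minimal sketch of using this pattern from JavaScript follows. The helper name `extractBackgroundUrls` and the sample markup are assumptions; the forward slashes are escaped because the pattern is written as a regex literal, and the `g` flag is added so that `matchAll` can return every occurrence:

```javascript
// Same pattern as above, with / escaped for the regex literal.
const BG_URL = /background-image\s*:\s*url\('(https?:\/\/[^']+)/g;

// Hypothetical helper: returns capture group 1 (the URL) of every match.
function extractBackgroundUrls(source) {
  return Array.from(source.matchAll(BG_URL), (m) => m[1]);
}

const html =
  "<div style=\"background-image:url('http://example.com/a.png');\"></div>" +
  "<div style=\"background-image: url('img/local.png');\"></div>";

console.log(extractBackgroundUrls(html)); // only the external URL
```

As in the answer, only single-quoted URLs starting with http:// or https:// are captured; double-quoted or unquoted url(...) values would need a separate alternative.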
A Co-Worker pointed out that I might have been downvoted for not capturing the closing tick. Note: Capturing the closing tick would be a wasted step, and is not necessary for this regex to work.
He also pointed out somebody might have downvoted me for requiring http or https in the url portion. But the user's question was specifically for external URLs, not internal ones. So this is a valid requirement and gets him closer to what he asked.
Sooo... not sure why this got a downvote.

How to add special HTML chars without using innerHTML

So I'm working on a micro lib, html.js, which basically creates text nodes with document.createTextNode. When I want a text node containing a non-breaking space, as in a&nbsp;b, the entity is not decoded and I get the literal text a&nbsp;b instead. So I'm wondering how to escape the & char, ideally without using innerHTML.

JavaScript supports the \uXXXX notation, so in the case of a non-breaking space, that would be \u00A0.
document.createTextNode('a\u00A0b');
That's as far as you can get. It's a text node, consisting only of text, and there's no difference between texts created from entity references or from normal characters.
If that's not what you want, you should take a second look at innerHTML. Can't you read it, modify it and put it back?
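If the input strings already contain entity references, one option is to translate them to real characters before calling document.createTextNode. The sketch below is only an illustration under that assumption: `NAMED_ENTITIES` and `decodeEntities` are hypothetical names, and the table covers just a handful of entities rather than the full HTML list:

```javascript
// Hypothetical lookup table; extend it with whatever entities are needed.
const NAMED_ENTITIES = { nbsp: '\u00A0', amp: '&', lt: '<', gt: '>', quot: '"' };

// Replaces &name;, &#123; and &#x7B; references with the characters they stand for.
function decodeEntities(text) {
  return text.replace(/&(#x[0-9a-f]+|#\d+|[a-z]+);/gi, (whole, body) => {
    if (body[0] === '#') {
      const isHex = body[1] === 'x' || body[1] === 'X';
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range references untouched instead of throwing.
      return code <= 0x10ffff ? String.fromCodePoint(code) : whole;
    }
    // Unknown names are left as they were.
    return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, body)
      ? NAMED_ENTITIES[body]
      : whole;
  });
}

// In a browser: document.createTextNode(decodeEntities('a&nbsp;b'));
console.log(decodeEntities('a&nbsp;b') === 'a\u00A0b');
```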

There's not much functionality in JS to encode/decode HTML entities. It seems there are some libraries out there, though, that can help you achieve this. Here is one I found on Google. I haven't tried it, but you can check it out, or look for others.
http://www.strictly-software.com/htmlencode

Fast method to find regex matches in a large document using javascript?

I need to search the text in an HTML document for regexes (emails, phone numbers, etc.) and words. The matches need to be highlighted and made anchorable so that a link can be generated to jump to the location of each match. So not only does it need to find matches using patterns, it also needs to do a replace to add the proper HTML code.
I am currently using jQuery but I am not very happy with the speed. In a 1.5 MB file it takes about 5 seconds to match 2 regexes, and the time increases when I add more search criteria.
Does anyone know of a fast method to find regex matches in a large document using javascript?

You say you're "using jQuery" but you don't say how. Have you tried a "highlight" plugin (or, as it sounds like you'd need, a derivation of one)? I've used this one: http://johannburkard.de/blog/programming/javascript/highlight-javascript-text-higlighting-jquery-plugin.html and it doesn't seem slow to me. Again, you'd have to work on it to make it add the markup you need, but that should be pretty clear - it's not very big.
It seems like what you'd want to do for performance is take your regular expressions and combine them into what amounts to a "token grammar". In other words, you don't want to start from scratch looking for each regex individually throughout the entire document. Instead, you'd want to proceed through it with a regex that matches each possible target (one at a time of course), and each time it finds one you'd replace it with whatever's appropriate. That way you could make just one pass over the document, no matter how big it is and no matter how many patterns you're looking for.
Edit: Mr. Burkard's plugin doesn't let you search with regexes; it uses "indexOf" internally. Hmm.
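The single-pass idea could be sketched roughly as follows. Everything here is an assumption for illustration: the two patterns are deliberately simplistic, the names `PATTERNS`, `COMBINED` and `highlightAll` are hypothetical, and the function works on a plain text string, so in a real page it would be applied to text nodes rather than to raw HTML:

```javascript
// Deliberately simple example patterns; real ones would be stricter.
const PATTERNS = {
  email: /[\w.+-]+@[\w-]+\.[\w.-]+/,
  phone: /\d{3}-\d{3}-\d{4}/,
};

// One combined regex: each pattern becomes a named alternative.
const COMBINED = new RegExp(
  Object.entries(PATTERNS)
    .map(([name, re]) => `(?<${name}>${re.source})`)
    .join('|'),
  'g'
);

// Single pass over the text: wrap each match in an anchor with a unique id.
function highlightAll(text) {
  let counter = 0;
  return text.replace(COMBINED, (...args) => {
    const match = args[0];
    const groups = args[args.length - 1]; // named groups arrive as the last argument
    const kind = Object.keys(groups).find((k) => groups[k] !== undefined);
    counter += 1;
    return `<a id="match-${counter}" class="hl-${kind}">${match}</a>`;
  });
}

console.log(highlightAll('Mail bob@example.com or call 555-123-4567.'));
```

The ids give each match a jump target (for example #match-2), and adding another search criterion only adds one more alternative to the combined regex instead of another full pass over the document.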
