Best way to match and catch doubly-entified character entities/references?

I'm talking about things like &amp;amp;, which renders as &amp; when it should actually render as &. In an earlier question I asked how to match entities, but that doesn't seem possible or realistic with regexes. What, then, is the best way to match double entities?
EDIT: Is this a good way to do it? .replace(/&amp;(?=#?x?[0-9a-z]+;)/i, '&');
(I'm using JavaScript.)

I'd go with
pattern &([a-zA-Z0-9]+?;)\1
replacement &$1
to replace just double amps, or:
pattern &amp;([#a-zA-Z0-9]+?;)
EDIT:
Your pattern
/&amp;(?=#?x?[0-9a-z]+;)/i
also looks good to me.
Note: none of these can be fully trusted, since text that intentionally shows an escaped entity looks identical to a double-encoding mistake.
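As a rough illustration of the lookahead approach discussed above, here is a minimal sketch. The function name fixDoubleEntities is hypothetical, and the sketch assumes that every &amp; immediately followed by something shaped like an entity body is a double-encoding mistake, which will not always be true.

```javascript
// Hypothetical helper: collapse one level of double encoding.
// Assumption: "&amp;" followed by an entity-like body ("lt;", "#38;", "amp;")
// was produced by encoding an already-encoded string.
function fixDoubleEntities(text) {
  // The lookahead checks for the entity body without consuming it,
  // so only the leading "&amp;" is replaced by "&".
  return text.replace(/&amp;(?=#?[0-9a-z]+;)/gi, '&');
}

console.log(fixDoubleEntities('Tom &amp;amp; Jerry &amp;lt;3'));
// Tom &amp; Jerry &lt;3
```

A lone &amp; that is not followed by an entity body is left untouched, so correctly encoded ampersands survive.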

Possibly:
&amp;[a-zA-Z]+;
Though not foolproof.

Normalize your data first. Use whatever you know about the encoding to decode the data back to a form in which each character or piece of data has only one possible encoding. Then match this normalized data against a normalized pattern.
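One way to read the normalization advice is to decode repeatedly until the text stops changing. The sketch below assumes a small, hand-picked table of named entities (NAMED) and hypothetical helper names (decodeOnce, normalize). A real page would need the full HTML entity list, and fully decoding also discards any escaping the author intended, so treat this only as an outline.

```javascript
// Assumption: only these five named entities matter for the data at hand.
const NAMED = { amp: '&', lt: '<', gt: '>', quot: '"', apos: "'" };

// Decode one level of named, decimal, and hexadecimal references.
function decodeOnce(text) {
  const entity = /&(#[xX][0-9a-fA-F]+|#[0-9]+|[a-zA-Z][a-zA-Z0-9]*);/g;
  return text.replace(entity, (match, body) => {
    if (body[0] === '#') {
      const isHex = body[1] === 'x' || body[1] === 'X';
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range references as they are.
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    // Unknown names are kept verbatim.
    return Object.prototype.hasOwnProperty.call(NAMED, body) ? NAMED[body] : match;
  });
}

// Decode until the text is stable, with a cap on the number of passes.
function normalize(text, maxPasses = 5) {
  let current = text;
  for (let i = 0; i < maxPasses; i++) {
    const next = decodeOnce(current);
    if (next === current) break;
    current = next;
  }
  return current;
}

console.log(normalize('&amp;amp;lt;b&amp;gt;')); // <b>
```

After normalization, the data can be compared against plain, unencoded patterns and re-encoded exactly once on output.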

Related

RegEx: Removing select parts of a URL, except variables

I'm trying to remove parts of a URL via RegEx.
I'm getting my content via an AJAX request, so I cannot use
$(location).attr('search').split("&")[2]
My RegEx (Regex101)
Any direct answer would be greatly appreciated, as I cannot comprehend RegEx; better or more efficient suggestions are also welcome.

Straightforward approach: this question already has helpful answers on how to parse a URL with and without a RegEx. Try one of those methods, then keep the parts you want and discard the ones you don't.
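For the non-RegEx route, a minimal sketch using the standard URL and URLSearchParams APIs could look like the following. The function name keepParams and the extra page parameter are made up for illustration; the URL string would come from the AJAX response rather than from window.location.

```javascript
// Hypothetical helper: keep only the named query parameters of a URL string.
function keepParams(urlString, namesToKeep) {
  const url = new URL(urlString);
  const kept = new URLSearchParams();
  for (const [name, value] of url.searchParams) {
    if (namesToKeep.includes(name)) {
      kept.append(name, value);
    }
  }
  return kept.toString();
}

// "page" is an invented parameter, added here only to show something being dropped.
const sample = 'https://opskins.com/?loc=shop_search&app=730_2&page=3';
console.log(keepParams(sample, ['loc', 'app'])); // loc=shop_search&app=730_2
```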

I'm not sure I fully understood your question. For the example you gave, this is the regex I would use. It may not be the most elegant, but it should work for your example.
http[s]*://opskins.com/?loc=shop_search&app=730_2
https?://[a-z]+\.com/\?[a-z]+=[a-z_]+&[a-z]+=[0-9_]+

regex works in regex tester but not in pattern

It's quite a simple but, in my opinion, weird problem. I have this regex, entered a few tests, and they work.
(?=^\*)|(?=^.{1,254}$)(^(?:(?!\d+\.)[a-zA-Z0-9_\-\{\}]{1,63}\.?)+(?:[a-zA-Z\{\}]{1,})$)
https://regex101.com/r/hU6tP0/2
But when I try to use it in HTML it fails, whereas if I test it in JavaScript it works.
http://jsfiddle.net/ek6kby2q/9/
I don't have much knowledge of regex, so if anyone knows what's going wrong, or has any tips to make the regex better, that would be welcome.

As an HTML attribute, the pattern must match the whole string from beginning to end. That is why the (?=^\*) branch fails: it matches zero characters, so it can never cover the entire value.
Use this pattern instead:
\*.*|(?!.{255})(?:[A-Za-z_{}-][\w{}-]{0,62}\.?)+[A-Za-z{}]+
(You can omit the anchors since they are implicit)
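The difference between the regex tester and the pattern attribute can be reproduced in plain JavaScript by wrapping the pattern in implicit anchors. This is a simplified emulation under that one assumption; the helper name matchesLikePatternAttribute is hypothetical, and browsers apply further details that the sketch ignores.

```javascript
// Hypothetical helper: emulate the whole-value match of the pattern attribute.
function matchesLikePatternAttribute(pattern, value) {
  return new RegExp('^(?:' + pattern + ')$').test(value);
}

// The pattern from the question and the replacement from the answer.
const original = String.raw`(?=^\*)|(?=^.{1,254}$)(^(?:(?!\d+\.)[a-zA-Z0-9_\-\{\}]{1,63}\.?)+(?:[a-zA-Z\{\}]{1,})$)`;
const fixed = String.raw`\*.*|(?!.{255})(?:[A-Za-z_{}-][\w{}-]{0,62}\.?)+[A-Za-z{}]+`;

// An unanchored test succeeds because the lookahead branch matches zero characters.
console.log(new RegExp(original).test('*'));             // true
// With implicit anchors, zero matched characters cannot cover the value "*".
console.log(matchesLikePatternAttribute(original, '*')); // false
console.log(matchesLikePatternAttribute(fixed, '*'));    // true
```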

How can I remove escaping from a RegExp pattern?

I'm trying to simplify input for a particular regex for my users. A simple example of the regex might be
\b(C|C\+\+|Java)\b
I'm now giving the user the option of appending another branch at the end of the regex by inputting the raw string into a <input type="text"> field. The branch will be interpreted literally, so I need to escape it. I've used https://stackoverflow.com/a/2593661/785663 to get RegExp.quote to do this. I then store the complete regex in a database.
Now, when I retrieve the regex from the database and split it back up and display the branches to the user, I need to remove all the escape characters again. Is there some pre-made function for this or do I need to roll my own?
Yes, I know I ought to replace this with a list of strings to search for. But this is only a part of a larger (regex-based) picture.

The optimal solution is to change your design: store the unescaped regex, then only escape it when you actually use it. That way you don't have to worry about this messy business of converting it back and forth all the time.
If you use this regex a lot and are worried about the overhead of having to escape it all the time, then store both the unescaped and escaped versions. Update both whenever the user makes a change.
p.s. Allowing user-entered regexes may make your site vulnerable to attack. (Update: Though in this case it is less likely to be a problem, since you are only allowing literal strings)
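A minimal sketch of the "store unescaped, escape on use" design follows. The names escapeRegExp, storedBranches, and buildRegex are illustrative assumptions, and the array stands in for whatever the database actually returns.

```javascript
// Escape every character that has special meaning in a regular expression.
function escapeRegExp(literal) {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical storage: branches are kept unescaped, exactly as the user typed them.
const storedBranches = ['C', 'C++', 'Java'];

// Escaping happens only here, at the moment the regex is built.
function buildRegex(branches) {
  const alternation = branches.map(escapeRegExp).join('|');
  return new RegExp('\\b(' + alternation + ')\\b');
}

console.log(buildRegex(storedBranches).source); // \b(C|C\+\+|Java)\b
```

Because the stored values are never escaped, they can be shown to the user as they are, and no unescaping step is needed.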

How To Create This RegExp

I am looking to find this in a string: XXXX-XXX-XXX Where the X is any number.
I need to find this in a string using JavaScript so bonus points to those who can provide me the JavaScript too. I tried to create a regex and came out with this: ^[0-9]{4}\-[0-9]{3}\-[0-9]{3}$
Also, I would love to know of any cheat sheets or programs you guys use to create your regular expressions.

I suppose this is what you want:
\d{4}-\d{3}-\d{3}
In doubt? Search for "RegEx testers".

With your attempt:
^[0-9]{4}\-[0-9]{3}\-[0-9]{3}$
Since - is not a metacharacter outside a character class, there is no need to escape it. In JavaScript the escaped \- still matches a literal hyphen, so the backslashes are harmless, just noisy.
Also, you've anchored the match at the beginning and end of the string, so it will match only strings that consist entirely of your number. Drop the ^ and $ to find the number inside a longer string.
I know most people like the {3} style of counting, but when the thing being matched is a single digit, I find this more legible:
\d\d\d\d-\d\d\d-\d\d\d
Obviously, extending this one to match hexadecimal digits would be horrible, but for plain digits I think it is far more legible than the alternatives (the [[:digit:]] form is a POSIX class and is not supported by JavaScript):
\d{4}-\d{3}-\d{3}
[[:digit:]]{4}-[[:digit:]]{3}-[[:digit:]]{3}
[0-9]{4}-[0-9]{3}-[0-9]{3}
Go with whatever is easiest for you to read.
I tend to use the perlre(1) manpage as my main reference, knowing full well that it is far more featureful than many regexp engines. I'm prepared to handle the differences considering how conveniently available the perlre manpage is on most systems.

var result = (/\d{4}\-\d{3}\-\d{3}/).exec(myString);
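To collect every occurrence rather than only the first, a sketch along these lines may help. The function name findCodes and the sample text are invented, and the digit lookarounds are an added assumption so that longer digit runs are not partially matched.

```javascript
// Four digits, three digits, three digits, not embedded in a longer digit run.
const CODE_PATTERN = /(?<!\d)\d{4}-\d{3}-\d{3}(?!\d)/g;

// Hypothetical helper: return all matches, or an empty array when there are none.
function findCodes(text) {
  return text.match(CODE_PATTERN) || [];
}

console.log(findCodes('Order 1234-567-890 shipped; 12345-678-901 is not a match.'));
// [ '1234-567-890' ]
```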

Wipe a string but keep its middle part

With a string like "HorsieDoggieBirdie", is there a non-capturing regex replace that would kill "Horsie" and "Birdie", yet keep "Doggie" intact? I can only think of a capturing solution:
s/(Horsie)(Doggie)(Birdie)/$2/g
Is there a non-capturing solution like:
s/Horsie##Doggie##Birdie//g
where ## is some combination of regex codes? The specific problem is in JavaScript (innerHTML.replace) but I'll take Perl suggestions, too.

You don't have to capture the Horsie or the Birdie.
s/Horsie(Doggie)Birdie/$1/g;
A similar thing should work for JavaScript as well. This is probably as efficient as it gets, and at least as fast as using look-around assertions, although you should benchmark it if you want to know for sure. (The results, of course, will depend on the horsies, doggies and birdies in question.)
Mandatory disclaimer: you should know what happens when you use regular expressions with HTML...

You can use Look-Around Assertions:
s/(?:Horsie(?=Doggie))|(?:(?<=Doggie)Birdie)//g;
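For the JavaScript side of the question, both approaches can be sketched as below. The variable names are illustrative, and the lookbehind in the second pattern assumes an engine with ES2018 lookbehind support.

```javascript
const input = 'HorsieDoggieBirdie';

// Single capture group: match the whole run, keep only the middle.
const viaCapture = input.replace(/Horsie(Doggie)Birdie/g, '$1');

// No capture groups: delete the parts on either side of "Doggie".
const viaLookaround = input.replace(/Horsie(?=Doggie)|(?<=Doggie)Birdie/g, '');

console.log(viaCapture, viaLookaround); // Doggie Doggie
```

The two are not equivalent on every input: the look-around version strips "Horsie" from "HorsieDoggie" even when no "Birdie" follows, while the capturing version leaves that string unchanged.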
