Regex: Match string with substrings with the same pattern

I'm trying to match a string against a pattern that can contain substrings following the same pattern.
Here's an example string:
Nicaragua [[NOTE|note|Congo was a member of ICCROM from 1999 and Nicaragua from 1971. Both were suspended by the ICCROM General Assembly in November 2013 having omitted to pay contributions for six consecutive calendar years (ICCROM [[Statutes|s|url|www.iccrom.org/about/statutes/]], article 9).]]. Another [[link|url|google.com]] that might appear.
and here's the pattern:
[[display_text|code|type|content]]
So, what I want is to get the string within the brackets, and then look for further strings that match the pattern inside the top-level one.
This is what I want to match:
[[NOTE|s|note|Congo was a member of ICCROM from 1999 and Nicaragua from 1971. Both were suspended by the ICCROM General Assembly in November 2013 having omitted to pay contributions for six consecutive calendar years (ICCROM [[Statutes|s|url|www.iccrom.org/about/statutes/]], article 9).]]
1.1 [[Statutes|s|url|www.iccrom.org/about/statutes/]]
[[link|s|url|google.com]]
I was using /(\[\[.*]])/, but it matches everything up to the last ]].
What I want is to identify the matched strings and convert them to HTML elements, where |note| becomes a blockquote tag and |url| an a tag. So a blockquote tag can have a link tag inside it.
BTW, I'm using CoffeeScript to do that.
Thanks in advance.

In general, regex is not good at dealing with nested expressions. If you use greedy patterns, they'll match too much, and if you use non-greedy patterns, as @bjfletcher suggests, they'll match too little, stopping inside the outer content. The "traditional" approach here is a token-based parser, where you step through the characters one by one and build an abstract syntax tree (AST), which you then reformat as desired.
One slightly hacky approach I've used here is to convert the string to a JSON string, and let the JSON parser do the hard work of converting into nested objects: http://jsfiddle.net/t09q783d/1/
function toPoorMansAST(s) {
// Escape double quotes, as they'd otherwise break the JSON string. This converts
// them to the \u0022 escape sequence, which is safe for JSON parsing.
s = s.replace(/"/g, "\\u0022");
// Transform to a JSON string!
s =
// Wrap in array delimiters
('["' + s + '"]')
// replace token starts
.replace(/\[\[([^\|]+)\|([^\|]+)\|([^\|]+)\|/g,
'",{"display_text":"$1","code":"$2","type":"$3","content":["')
// replace token ends
.replace(/\]\]/g, '"]},"');
return JSON.parse(s);
}
This gives you an array of strings and structured objects, which you can then run through a formatter to spit out the HTML you'd like. The formatter is left as an exercise for the user :).
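For illustration, here is one way the formatter could look. This is only a sketch: toHtml and escapeHtml are hypothetical names, the parser is repeated so the snippet runs on its own (with an extra backslash escape added), and it assumes that every token has all four fields, that the input contains no line breaks, and that the content of a url token is the link target.

```javascript
// Parser from the answer above, repeated so this sketch is self-contained.
function toPoorMansAST(s) {
  // Escape backslashes and double quotes so the text survives JSON.parse.
  s = s.replace(/\\/g, '\\\\').replace(/"/g, '\\u0022');
  s = ('["' + s + '"]')
    // token starts: [[display_text|code|type|
    .replace(/\[\[([^|]+)\|([^|]+)\|([^|]+)\|/g,
      '",{"display_text":"$1","code":"$2","type":"$3","content":["')
    // token ends: ]]
    .replace(/\]\]/g, '"]},"');
  return JSON.parse(s);
}

// Hypothetical helper: escape text before placing it into HTML.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Hypothetical formatter: walks the nested array recursively.
// Assumption: "note" becomes a blockquote, "url" becomes a link whose
// href is the token content and whose label is display_text.
function toHtml(nodes) {
  return nodes.map(function (node) {
    if (typeof node === 'string') return escapeHtml(node);
    var inner = toHtml(node.content);
    if (node.type === 'note') return '<blockquote>' + inner + '</blockquote>';
    if (node.type === 'url') {
      return '<a href="' + inner + '">' + escapeHtml(node.display_text) + '</a>';
    }
    return inner; // unknown types: keep the content, drop the markup
  }).join('');
}

var sample = 'See [[NOTE|s|note|ICCROM [[Statutes|s|url|www.iccrom.org/about/statutes/]], article 9.]] and [[link|s|url|google.com]].';
console.log(toHtml(toPoorMansAST(sample)));
```

Because the formatter recurses into content, a link nested inside a note ends up inside the blockquote, which is the nesting the question asks for.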

Related

Javascript - how to use regex process the following complicated string

I have the following string that will occur repeatedly in a larger string:
[SM_g]word[SM_h].[SM_l] "
Notice that in this string, after the phrase "[SM_g]word[SM_h]", there are three components:
A period (.) This could also be a comma (,)
[SM_l]
"
Zero to all three of these components will always appear after "[SM_g]word[SM_h]". However, they can also appear in any order after "[SM_g]word[SM_h]". For example, the string could also be:
[SM_g]word[SM_h][SM_l]"
or
[SM_g]word[SM_h]"[SM_l].
or
[SM_g]word[SM_h]".
or
[SM_g]word[SM_h][SM_l].
or
[SM_g]word[SM_h].
or simply just
[SM_g]word[SM_h]
These are just some of the examples. The point is that there are three different components (more if you consider that the period can also be a comma) that can appear after "[SM_g]word[SM_h]", where these three components can be in any order and sometimes one, two, or all three of them will be missing.
Not only that, sometimes there will be up to one space between " and the previous component/[SM_g]word[SM_h].
For example:
[SM_g]word[SM_h] ".
or
[SM_g]word[SM_h][SM_l] ".
etc. etc.
I am trying to process this string by moving each of the three components inside of the core string (and preserving the space, in case there is a space before &\quot; and the previous component/[SM_g]word[SM_h]).
For example, [SM_g]word[SM_h].[SM_l]" would turn into
[SM_g]word.[SM_l]"[SM_h]
or
[SM_g]word[SM_h]"[SM_l]. would turn into
[SM_g]word"[SM_l].[SM_h]
or, to simulate having a space before "
[SM_g]word[SM_h] ".
would turn into
[SM_g]word ".[SM_h]
and so on.
I've tried several combinations of regex expressions, and none of them have worked.
Does anyone have advice?

You need to put each component in an alternation inside a group that can repeat up to 3 times:
(\[SM_g]word)(\[SM_h])((?:\.|\[SM_l]| ?"){0,3})
You may replace word with .*? if it is not a constant or a specific keyword.
Then use this replacement string:
$1$3$2
var re = /(\[SM_g]word)(\[SM_h])((?:\.|\[SM_l]| ?"){0,3})/g;
var str = `[SM_g]word[SM_h][SM_l] ".`;
console.log(str.replace(re, `$1$3$2`));
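As a rough sketch of how this could be applied to the cases listed in the question: the helper name moveTrailing is made up for illustration, word is replaced with .*? as suggested above, and [.,] is used so that a comma is accepted as well as a period.

```javascript
// Group 1: opening tag plus the word, group 2: closing tag,
// group 3: up to three trailing components (period or comma, [SM_l], optional space + quote).
var trailingRe = /(\[SM_g\].*?)(\[SM_h\])((?:[.,]|\[SM_l\]| ?"){0,3})/g;

// Hypothetical helper: moves the trailing components in front of [SM_h].
function moveTrailing(text) {
  return text.replace(trailingRe, '$1$3$2');
}

console.log(moveTrailing('[SM_g]word[SM_h].[SM_l]"'));
console.log(moveTrailing('[SM_g]word[SM_h] ".'));
console.log(moveTrailing('[SM_g]word[SM_h]'));
```

Note that {0,3} does not stop the same component from appearing twice (for example two periods), so this is looser than the strict "each component at most once" rule.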

This seems applicable to your task, in other words, changing the position of substrings.
(\[SM_g])([^[]*)(\[SM_h])((?=([,\.])|(\[SM_l])|( ?&\\?quot;)).*)?
Demo: all substrings are captured into their own capture groups for your post-processing.
[SM_g] is captured in group 1, word in group 2, [SM_h] in group 3, the whole trailing part in group 4, [,\.] in group 5, [SM_l] in group 6, and " ?&\\?quot;" in group 7.
Thus groups 1 to 3 are the core part, group 4 is the trailing part (useful for checking whether a trailing part exists at all), and groups 5 to 7 are sub-parts of group 4 for your post-processing.
You can therefore easily produce an output string with the matched parts reordered however you want, by replacing with the captured groups like this:
\1\2\7\3 or $1$2$7$3 etc.
For replacing in JavaScript, please refer to this post: JS Regex, how to replace the captured groups only?
However, the regex above is not sufficiently precise, because it allows any number of repetitions of the sub-parts of the trailing string, for example \1\2\3\5\5\5\5 or \1\2\3\6\7\7\7\7\5\5\5.
To avoid this, the regex needs a condition that accepts only the possible combinations of the sub-parts of the trailing string. Please refer to this example, https://regex101.com/r/6aM4Pv/1/, for the possible combinations in order.
But a regex that allows only the possible combinations is more complicated, so I have left the simplified regex above to help you understand the idea. Thank you :-)

How to get the nth (Unicode) character from a string in JavaScript

Suppose we have a string with some (astral) Unicode characters:
const s = 'Hi 👋 Unicode!'
The [] operator and .charAt() method don't work for getting the 4th character, which should be "👋":
> s[3]
'�'
> s.charAt(3)
'�'
The .codePointAt() does get the correct value for the 4th character, but unfortunately it's a number and has to be converted back to a string using String.fromCodePoint():
> String.fromCodePoint(s.codePointAt(3))
'👋'
Similarly, converting the string into an array using spread syntax yields valid Unicode characters, so that's another way of getting the 4th one:
> [...s][3]
'👋'
But i can't believe that going from string to number back to string, or having to split the string into an array are the only ways of doing this seemingly trivial thing. Isn't there a simple method for doing this?
> s.simpleMethod(3)
'👋'
Note: i know that the definition of "character" is somewhat fuzzy, but for the purpose of this question a character is simply the symbol that corresponds to a Unicode codepoint (no combining characters, no grapheme clusters, etc).
Update: the String.fromCodePoint(str.codePointAt(n)) method is not really viable, since the nth position there doesn't take previous astral symbols into account: String.fromCodePoint('👋🙈'.codePointAt(1)) // => '�'
(I feel kinda dumb asking this; like i'm probably missing something obvious. But previous answers to this question don't work on strings with Unicode symbols on astral planes.)

The string iterator is the only thing that iterates through code points rather than UCS-2/UTF-16 code units. So:
const string = 'Hi 👋 Unicode!';
for (const symbol of string) {
console.log(symbol);
}
So to get a specific code point based on its index from a string:
const string = 'Hi 👋 Unicode!';
// Note: The spread operator uses the string iterator under the hood.
const symbols = [...string];
symbols[3]; // '👋'
Still, this would break with grapheme clusters, or emoji sequences such as 👨‍👩‍👧‍👦 (👨 + U+200D ZERO WIDTH JOINER + 👩 + U+200D ZERO WIDTH JOINER + 👧 + U+200D ZERO WIDTH JOINER + 👦). Text segmentation helps with that.
Do you actually need to get the 4th code point in the string, though? What’s your use case?
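If building a whole array feels wasteful, the iterator can also be stopped as soon as the index is reached. The sketch below uses two made-up helper names, nthSymbol and nthGrapheme; the second one assumes a runtime that provides Intl.Segmenter and is one way to do the text segmentation mentioned above.

```javascript
// Hypothetical helper: returns the nth code point of str as a string,
// or undefined when n is out of range. Uses the string iterator,
// so astral symbols count as one position.
function nthSymbol(str, n) {
  let index = 0;
  for (const symbol of str) {
    if (index === n) return symbol;
    index += 1;
  }
  return undefined;
}

// Hypothetical helper: same idea, but counting grapheme clusters.
// Assumption: Intl.Segmenter is available in the runtime.
function nthGrapheme(str, n) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  let index = 0;
  for (const { segment } of segmenter.segment(str)) {
    if (index === n) return segment;
    index += 1;
  }
  return undefined;
}

console.log(nthSymbol('Hi 👋 Unicode!', 3));
console.log(nthSymbol('👋🙈', 1));
```

Both helpers still walk the string from the start, so lookups are linear in n; there is no constant-time way to index by code point in a UTF-16 string.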

You can use the new u flag to regexp if it's available to you.
const chars = 'Hi 👋 Unicode!'.match(/./ug);
console.log(chars);

The accepted answer to this question is out of date.
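There is a member of the String object called .at() that was proposed to do exactly what you're hoping for. Note, however, that the String.prototype.at() that was eventually standardized (ES2022) indexes UTF-16 code units and mainly adds support for negative indices, so 'Hi 👋 Unicode!'.at(3) returns a lone surrogate there; only the original proposal and its shim are code point aware. If you have shims, shams, a transcompiler like TypeScript or Babel, etc., check which behavior your local configuration gives you.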
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/at
Amusingly, the spec for this feature, as well as the most common implementation shim (the one that I use), was written by the person who authored the now out-of-date accepted answer here. So even when he's out of date, he's still up to date.
If shimming or transcompiling isn't appropriate for you, there's a library called jsesc that can handle it for you through simple escaping. I'll give you three guesses who wrote the library. First two don't count.
https://www.npmjs.com/package/jsesc

What Regex would capture both the beginning and end from of a string?

I am trying to edit a DateTime string in typescript file.
The string in question is 02T13:18:43.000Z.
I want to trim the first three characters, up to and including the letter T, from the beginning of the string AND also the last 5 characters, that is .000Z, including the dot character. Essentially I want the result to look like this: 13:18:43.
From what I found the following pattern (^(.*?)T) can accomplish only the first part of the trim I require, that leaves the initial result like this: 13:18:43.000Z.
What kind of Regex pattern must I use to include the second part of the trim I have mentioned? I have tried to include the following block in the same pattern (Z000.)$ but of course it failed.
Thanks.
Any help would be appreciated.

There is no need to use regular expression in order to achieve that. You can simply use:
let value = '02T13:18:43.000Z';
let newValue = value.slice(3, -5);
console.log(newValue);
This returns 13:18:43, assuming that your string always has the same format. According to the documentation, slice extracts the substring from beginIndex up to (but not including) endIndex; endIndex is optional, and a negative index counts back from the end of the string.

As I understand it, you only want a regex solution, so does this pattern work?
(\d{2}:)+\d{2} or simply \d{2}:\d{2}:\d{2}
It matches one or more groups of two digits followed by a colon, and then a final pair of digits.
The only disadvantage is that it doesn't validate the values, for example that the minutes don't exceed 59.
I left that check out because I assumed your dates come from a source, such as a database, where the stored data is already valid.

Solution
This should suffice to remove both the prefix from beginning to T and postfix from . to end:
/^.*T|\..*$/g
console.log(new Date().toISOString().replace(/^.*T|\..*$/g, ''))
See the visualization on debuggex
Explanation
The section ^.*T removes all characters up to and including the last encountered T in the string.
The section \..*$ removes all characters from the first encountered . to the end of the string.
The | in between coupled with the global g flag allows the regular expression to match both sections in the string, allowing .replace(..., '') to trim both simultaneously.
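The two regex approaches from the answers above (trimming both ends versus matching only the time) can be compared side by side. This is only a sketch; trimBothEnds and extractTime are made-up names, and both assume the input looks like the string from the question.

```javascript
// Trim approach: remove everything up to the last T, and everything from the first dot.
function trimBothEnds(value) {
  return value.replace(/^.*T|\..*$/g, '');
}

// Match approach: capture only the HH:MM:SS part that follows the T.
// Returns null when the input has no such part.
function extractTime(value) {
  var match = /T(\d{2}:\d{2}:\d{2})/.exec(value);
  return match ? match[1] : null;
}

console.log(trimBothEnds('02T13:18:43.000Z'));
console.log(extractTime('02T13:18:43.000Z'));
```

The match approach fails visibly (null) on unexpected input, whereas the trim approach returns whatever is left over, which may be the friendlier or the riskier behavior depending on where the string comes from.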

Javascript Regex capture multiline content from each newline entry

I am doing Javascript Regex to process and transform some raw data to 2D array.
Task Briefing (JS only):
Transforming raw string data to 2D array.
Raw Data Input :
Here is a sample with 4 entries; each new entry starts on a new line. Entry 3 has multi-line content.
2012/12/1, AM12:21 - user1‬: entry1_wasehhjdsaj
2012/12/2, AM9:42 - user2‬: entry2_bahbahbah_dsdeead
2012/12/2, AM9:44 - user3‬: entry3_Line1_ContdWithFollowingLine_bahbahbah
entry3_Line2_ContdWithABoveLine_bahbahbah_erererw
entry3_Line3_ContdWithABoveLine_bahbahbah_dsff
2012/12/4, AM11:48 - user7‬: entry4_bahbahbah_fggf
(raw string data, without empty line. )
Updated: Sorry for the confusion; the contents do not necessarily end with the same END pattern, just a line break.
How does the pattern actually end? (Thanks for @Tim Pietzcker's comment.)
The content ends with a line break followed by the start of the next entry's timestamp. (You can assume the entry contents do not contain anything resembling the timestamp pattern.)
I understand this may be a troublesome regex question, so ANY OTHER JS METHOD ACHIEVING THE SAME GOAL WILL ALSO BE ACCEPTED.
My current regex with capture group:
/^([0-9]{4}|[0-9]{2})[\/]([0]?[1-9]|[1][0-2])[\/]([0]?[1-9]|[1|2][0-9]|[3][0|1]), ([A|P])M([1-9]|1[0-2]):([0-5]\d) - (.*?): (.*)/gm
Desired capture result:
MATCH 1
2012
12
1
A
12
21
user1‬
entry1_wasehhjdsaj
MATCH 2
2012
12
2
A
9
42
user2‬
entry2_bahbahbah_dsdeead
MATCH 3
2012
12
2
A
9
44
user3‬
entry3_Line1_ContdWithFollowingLine_bahbahbah entry3_Line2_ContdWithABoveLine_bahbahbah_erererw entry3_Line3_ContdWithABoveLine_bahbahbah_dsff
MATCH 4
(to be skipped...)
Problem:
There is a problem when I capture Entry 3: I can't capture its 2nd and 3rd lines of content. If an entry contains only ONE line of content, the regex works fine.
How can I capture Entry 3 with its multi-line content? I tried the m modifier, but I have no idea how to deal with multi-line contents and newline-separated entries at the same time.
If this is impossible to achieve with a JS regex, please suggest another JS approach to transform the raw data into a 2D array, which is the ultimate goal.
THANKS!
Testing: https://regex101.com/r/eS9pY5/1

The multiline flag doesn't work that way in JavaScript, but you can work around it with [\s\S]. This class matches any character, including \n. Note the *? instead of * after it, which stops it from being greedy so that it only goes as far as the first END:
^([0-9]{4}|[0-9]{2})[\/]([0]?[1-9]|[1][0-2])[\/]([0]?[1-9]|[1|2][0-9]|[3][0|1]), ([A|P])M([1-9]|1[0-2]):([0-5]\d) - (.*?): ([\s\S]*?END)$
See: https://regex101.com/r/mT8rI4/3

Dots (.) don't match newline characters. There is a character class that matches everything ([\S\s]), but you don't want to use that without precautions - otherwise [\S\s]* would match all the entries at once.
So you need to tell the regex engine to stop matching when the next match begins. We can use a negative lookahead assertion for that, and we'll just feed the timestamp pattern into that:
/^([0-9]{4}|[0-9]{2})\/(0?[1-9]|1[0-2])\/(0?[1-9]|[12][0-9]|3[01]), ([AP])M([1-9]|1[0-2]):([0-5]\d) - ([^:]*): ((?:(?!^([0-9]{4}|[0-9]{2})\/(0?[1-9]|1[0-2])\/(0?[1-9]|[12][0-9]|3[01]), ([AP])M([1-9]|1[0-2]):([0-5]\d))[\S\s])*)/gm
Test it live on regex101.com.
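Since the question also accepts non-regex-only approaches, here is a sketch of a line-by-line alternative: each line is tested against the timestamp pattern, and any line that does not start a new entry is appended to the previous entry's content. The name parseEntries is hypothetical, and joining continuation lines with a single space is an assumption based on the desired output shown in the question.

```javascript
// Header pattern for one entry line: date, AM/PM time, user, first content line.
var headerRe = /^(\d{4}|\d{2})\/(0?[1-9]|1[0-2])\/(0?[1-9]|[12]\d|3[01]), ([AP])M(1[0-2]|[1-9]):([0-5]\d) - ([^:]*): (.*)$/;

// Hypothetical helper: turns the raw text into a 2D array,
// one row per entry, eight columns per row.
function parseEntries(raw) {
  var rows = [];
  raw.split(/\r?\n/).forEach(function (line) {
    var match = headerRe.exec(line);
    if (match) {
      rows.push(match.slice(1)); // new entry: keep only the capture groups
    } else if (rows.length > 0) {
      var last = rows[rows.length - 1];
      last[last.length - 1] += ' ' + line; // continuation of the previous entry
    }
  });
  return rows;
}

var raw = [
  '2012/12/1, AM12:21 - user1: entry1_wasehhjdsaj',
  '2012/12/2, AM9:44 - user3: entry3_Line1',
  'entry3_Line2',
  'entry3_Line3',
  '2012/12/4, AM11:48 - user7: entry4_bahbahbah_fggf'
].join('\n');
console.log(parseEntries(raw));
```

This avoids repeating the timestamp pattern inside a lookahead, at the cost of a small loop.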

Here is a single regex that will match the strings you have the way you need:
^(\d{4}|\d{2})\/(0?[1-9]|1[0-2])\/(0?[1-9]|[12]\d|3[01]), ([AP])M([1-9]|1[0-2]):([0-5]\d) - (.*?): ((?:(?!(?:\d{4}|\d{2})\/(?:0?[1-9]|1[0-2])\/(?:0?[1-9]|[12]\d|3[01]))[\s\S])*)(?=\n|$)
See demo
The last capturing group is no longer a greedy dot matching .* but a tempered greedy token (?:(?!([0-9]{4}|[0-9]{2})\/(0?[1-9]|1[0-2])\/(0?[1-9]|[12][0-9]|3[01]))[\s\S])* matching everything up to the end of string or the date pattern.
If we unroll it to make it more efficient:
^(\d{4}|\d{2})\/(0?[1-9]|1[0-2])\/(0?[1-9]|[12]\d|3[01]), ([AP])M([1-9]|1[0-2]):([0-5]\d) - (.*?): (\D*(?:\d(?!(?:\d{3}|\d)\/(?:0?[1-9]|1[0-2])\/(0?[1-9]|[12]\d|3[01]))\D*)*)(?=\n|$)
See another demo

Confused with Regex JS pattern

ok i do have this following data in my div
<div id="mydiv">
<!--
what is your present
<code>alert("this is my present");</code>
where?
<code>alert("here at my left hand");</code>
oh thank you! i love you!! hehe
<code>alert("welcome my honey ^^");</code>
-->
</div>
well what i need to do there is to get the all the scripts inside the <code> blocks and the html codes text nodes without removing the html comments inside. well its a homework given by my professor and i can't modify that div block..
I need to use regular expressions for this and this is what i did
var block = $.trim($("div#mydiv").html()).replace("<!--","").replace("-->","");
var htmlRegex = new RegExp(""); //I don't know what to do here
var codeRegex = new RegExp("^<code(*n)</code>$","igm");
var code = codeRegex.exec(block);
var html = "";
it really doesn't work... please don't give the exact answer.. please teach me.. thank you
I need to have the following blocks for the variable code
alert("this is my present");
alert("here at my left hand");
alert("welcome my honey ^^");
and this is the blocks i need for variable html
what is your present
where?
oh thank you! i love you!! hehe
my question is what is the regex pattern to get the results above?

Parsing HTML with a regular expression is not something you should do.
I'm sure your professor thinks he/she was really clever and that there's no way to access the DOM API and can wave a banner around and justify some minor corner-case for using regex to parse the DOM and that sometimes it's okay.
Well, no, it isn't. If you have complex code in there, what happens? Your regex breaks, and perhaps becomes a security exploit if this is ever in production.
So, here:
http://jsfiddle.net/zfp6D/
Walk the dom, get the nodeType 8 (comment) text value out of the node.
Invoke the HTML parser (that thing that browsers use to parse HTML, rather than regex, why you wouldn't use the HTML parser to parse HTML is totally beyond me, it's like saying "Yeah, I could nail in this nail with a hammer, but I think I'm going to just stomp on the nail with my foot until it goes in").
Find all the CODE elements in the newly parsed HTML.
Log them to console, or whatever you want to do with them.

First of all, you should be aware that because HTML is not a regular language, you cannot do generic parsing using regular expressions that will work for all valid inputs (generic nesting in particular cannot be expressed with regular expressions). Many parsers do use regular expressions to match individual tokens, but other algorithms need to be built around them.
However, for a fixed input such as this, it's just a case of working through the structure you have (though it's still often easier to use different parsing methods than just regular expressions).
First let's get all the code:
var code = '', match = [];
var regex = new RegExp("<code>(.*?)</code>", "g");
while (match = regex.exec(content)) {
code += match[1] + "\n";
}
I assume content contains the content of the div that you've already extracted. Here the "g" flag says this is for "global" matching, so we can reuse the regex to find every match. The brackets indicate a capturing group, . means any character except a line break, * means repeated 0 or more times, and ? means "non-greedy" (see what happens without it to see what it does).
Now we can do a similar thing to get all the other bits, but this time the regex is slightly more complicated:
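new RegExp("(<!--|</code>)([\\s\\S]*?)(-->|<code>)", "g")
Here | means "or". So this matches all the bits that start with either "start comment" or "end code" and end with "end comment" or "start code". [\s\S] is used instead of . because . does not match line breaks, and in your input the text sits on its own lines (the backslashes are doubled because the pattern is written as a string). Note also that we now have 3 sets of brackets, so the part we want to extract is match[2] (the second set).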
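Putting the two loops together on the sample from the question might look like the sketch below. The content string is typed in by hand here (in the page it would come from the div), extractParts is a made-up name, and the closing delimiter is matched with a lookahead instead of a third group so that it is not consumed; that is a small variation on the pattern above, not a requirement.

```javascript
// Sample input copied from the question: the inner HTML of div#mydiv.
var content = [
  '<!--',
  'what is your present',
  '<code>alert("this is my present");</code>',
  'where?',
  '<code>alert("here at my left hand");</code>',
  'oh thank you! i love you!! hehe',
  '<code>alert("welcome my honey ^^");</code>',
  '-->'
].join('\n');

// Hypothetical helper: collects the code blocks and the text between them.
function extractParts(source) {
  var code = [];
  var html = [];
  var codeRegex = /<code>([\s\S]*?)<\/code>/g;
  // Text starts after "<!--" or "</code>" and runs up to "-->" or "<code>".
  var textRegex = /(?:<!--|<\/code>)([\s\S]*?)(?=-->|<code>)/g;
  var match;
  while ((match = codeRegex.exec(source)) !== null) {
    code.push(match[1]);
  }
  while ((match = textRegex.exec(source)) !== null) {
    var text = match[1].trim();
    if (text) html.push(text); // skip whitespace-only gaps
  }
  return { code: code, html: html };
}

var parts = extractParts(content);
console.log(parts.code.join('\n'));
console.log(parts.html.join('\n'));
```

As noted above, this only works because the input has a fixed, flat structure; nested or malformed markup would need a real parser.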

You're doing a lot of unnecessary stuff. .html() gives you the inner contents as a string. You should be able to use regEx to grab exactly what you need from there. Also, try to stick with regEx literals (e.g. /^regexstring$/). You have to escape escape characters using new RegExp which gets really messy. You generally only want to use new RegExp when you need to put a string var into a regEx.
The match function of strings accepts regEx and returns a collection of every match when you add the global flag (e.g. /^regexstring$/g <-- note the 'g'). I would do something like this:
var block = $('#mydiv').html(), //you can set multiple vars in one statement w/commas
matches = block.match(/<code>[^<]*<\/code>/g) || []; //match returns null when nothing is found
//[^<]* <-- 0 or more characters that aren't '<' - google 'negated character class'
matches = matches.join('_') //lazy way of avoiding a loop - join into a string with a separator character
.replace(/<\/?code>/g,'') //\/? an optional forward slash
.split('_');//turn the joined string back into an array
//Caveat: this breaks if the code itself contains '_', so pick a separator that can't occur.
//Now do what you want with matches. Eval (ew) or append in a script tag (ew).
//You have no control over the 'ew'. I just prefer data to scripts in strings
