Regex to parse hash url from HTML content

I have a regex to parse all hash URLs in HTML content:
/(\#)([^\s]+")/g
The HTML content looks like this:
Some text <a href="#some-hash1">some link</a>some content <a href="#some-hash2">some link1</a>
The expected result is:
#some-hash1, #some-hash2
But the current regex returns the hashes with the trailing double quote attached:
#some-hash1", #some-hash2"
I can't work out why the double quotes come along with the match. Any suggestions would be very helpful.

I wouldn't use regex for this because it's overkill and because you can simply loop through the anchors pulling the value of their hrefs...
var anchors = document.querySelectorAll('a');
var hrefs = [];
anchors.forEach(function (e) {
  hrefs.push(e.getAttribute('href'));
});
console.log(hrefs);

Use a positive lookahead, so the closing quote is asserted but not included in the match:
/(\#)([^\s]+(?="))/g
var z = 'Some text <a href="#some-hash1">some link</a>some content <a href="#some-hash2">some link1</a>';
console.log( z.match(/(\#)([^\s]+(?="))/g) );

I am assuming that you are looking at the content of $2 for your result.
If so, the problem is the " inside the second capture group. Changing /(\#)([^\s]+")/g to /(\#)([^\s]+)"/g gives the correct result.
I suggest joining the capture groups: /(\#[^\s]+)"/g will return $1 => #some-hash1, #some-hash2.
Since $1 in your original pattern will always just be #, I suppose you trim it off elsewhere in your program, so perhaps you should use /\#([^\s]+)"/g, which will return some-hash1, some-hash2 without the #.

Just move the double quote outside the parentheses:
(\#)([^\s]+)"
See how it works: https://regex101.com/r/fmrDyu/1
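
One detail the answers above leave implicit: with the g flag, String.prototype.match returns whole matches only, so reading a capture group needs matchAll or an exec loop. A minimal sketch, assuming the hashes sit inside double-quoted href attributes; the sample markup and the extractHashes name are illustrative and not taken from the question:

```javascript
// Hypothetical sample input; the anchors are illustrative only.
const html =
  'Some text <a href="#some-hash1">some link</a> ' +
  'some content <a href="#some-hash2">some link1</a>';

// Capture "#" plus everything up to, but not including, the closing quote.
function extractHashes(source) {
  const pattern = /href="(#[^"\s]+)"/g;
  // matchAll yields one match object per hit; m[1] is capture group 1.
  return Array.from(source.matchAll(pattern), (m) => m[1]);
}

console.log(extractHashes(html)); // [ '#some-hash1', '#some-hash2' ]
```

Anchoring on href=" keeps stray # characters elsewhere in the text from matching, at the cost of ignoring single-quoted attributes.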

Related

Regex to convert markdown to html

My goal is to take a markdown text and create the necessary bold/italic/underline html tags.
Looked around for answers, got some inspiration but I'm still stuck.
I have the following TypeScript code, but the regex matches the expression including the double asterisks:
var text = 'My **bold\n\n** text.\n'
var bold = /(?=\*\*)((.|\n)*)(?<=\*\*)/gm
var html = text.replace(bold, '<strong>$1</strong>');
console.log(html)
The result of this is: My <strong>**bold\n\n**</strong> text.
Everything is fine apart from the leftover double asterisks.
I also tried to remove them in a later 'replace' statement, but this creates further issues.
How can I ensure they are removed properly?

With your pattern (?=\*\*)((.|\n)*)(?<=\*\*) you assert (not match) with (?=\*\*) that there is ** directly to the right.
Then directly after that, you capture the ** using ((.|\n)*) so then it becomes part of the match.
Then at the end you assert again with (?<=\*\*) that there is ** directly to the left, but ((.|\n)*) has already matched it.
As a result, you end up with all the ** in the match.
You don't need lookarounds at all, as you are already using a capture group.
In JavaScript you could match the ** on the left and right and capture any characters in between in a capture group:
\*\*([^]*?)\*\*
But I would suggest using a dedicated parser to parse markdown instead of using a regex.
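
As a rough sketch of the capture-group approach described above (the boldToHtml name is made up for illustration, and [\s\S] is used as a portable stand-in for the JavaScript-only [^]):

```javascript
// Replace **...** with <strong>...</strong>, consuming the asterisks
// as part of the match so they do not end up in the output.
function boldToHtml(markdown) {
  // [\s\S]*? lazily matches any characters, including newlines.
  return markdown.replace(/\*\*([\s\S]*?)\*\*/g, '<strong>$1</strong>');
}

const text = 'My **bold\n\n** text.\n';
console.log(JSON.stringify(boldToHtml(text)));
// "My <strong>bold\n\n</strong> text.\n"
```

This only covers the simplest case; nested or escaped asterisks are exactly where a dedicated markdown parser earns its keep.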

Just make another call to replaceAll, replacing the ** with an empty string.
var text = 'My **bold\n\n** text.\n'
var bold = /(?=\*\*)((.|\n)*)(?<=\*\*)/gm
var html = text.replace(bold, '<strong>$1</strong>');
html = html.replaceAll(/\*\*/gm,'');
console.log(html)

comparing and replacing using regex in javascript : leaving a word in between

I am trying to replace a pattern as below.
Original:
<a href="#idvalue">welcome</a>
Needs to be replaced with:
<a href="javascript:call('idvalue')">welcome</a>
I tried the approach below:
String text = "<a href=\"#idvalue\">welcome</a>";
Pattern linkPattern = Pattern.compile("a href=\"#");
text = linkPattern.matcher(text).replaceAll("a href=\"javascript:call()\"");
But I am not able to add the idvalue in between. Kindly help me out.
Thanks in advance.

how about a simple
text.replaceAll("#idvalue","javascript:call('idvalue')")
for this case only. If you are looking to do something more comprehensive, then as suggested in the other answer, an XML parser would be ideal.

Try getting the part that might change and you want to keep as a group, e.g. like this:
text = text.replaceAll("href=\"#(.*?)\"", "href=\"javascript:call('$1')\"");
This matches href="#whatever" and replaces it, with whatever being caught by capturing group 1 and reinserted in the replacement string by using $1 as a reference to the content of group 1.
Note that applying regex to HTML and JavaScript might be tricky (single or double quotes allowed, comments, nested elements etc.), so it might be better to use an HTML parser instead.
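
The question title mentions JavaScript although the code is Java, so here is the same capture-group idea as a JavaScript sketch. The rewriteHashLinks name is hypothetical, and the pattern assumes double-quoted href values:

```javascript
// Rewrite href="#id" into href="javascript:call('id')".
function rewriteHashLinks(html) {
  // Group 1 captures the id after "#"; $1 reinserts it in the replacement.
  return html.replace(/href="#([^"]*)"/g, "href=\"javascript:call('$1')\"");
}

console.log(rewriteHashLinks('<a href="#idvalue">welcome</a>'));
// <a href="javascript:call('idvalue')">welcome</a>
```

As in Java, $1 refers to the first capture group inside the replacement string.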

Add a capture group to the matcher regex and then reference the group in the replacement. I found in the JavaDoc for Matcher that you need to use '$' instead of '\' to access the capture group in the replacement.
Code:
String text = "<a href=\"#idvalue\">welcome</a>";
System.out.println("input: " + text);
Pattern linkPattern = Pattern.compile("a href=\"#([^\"]+)\"");
text = linkPattern.matcher(text).replaceAll("a href=\"javascript:call('$1')\"");
System.out.println("output: " + text);
Result:
input: <a href="#idvalue">welcome</a>
output: <a href="javascript:call('idvalue')">welcome</a>

Regex find all closing html tags, all opening separately

I have a javascript function performing filtering on strings. I currently have the filter stripping out all html tags.
return String(text).replace(/<[^>]+>/gm, '');
I've realized that I actually need to perform two operations:
First, to replace all closing tags with <br> and then a second operation to remove all opening tags.
I'm not too familiar with regEx. How could I specify /<[^>]+>/gm to be only opening or closing?

You need to chain two replace calls. Add the g flag so that every tag is replaced, not just the first one.
> var str = "<h1>foo bar</h1>"
undefined
> str.replace(/<\w[^>]*>/g, "").replace(/<\/[^>]+>/g, "<br>")
'foo bar<br>'
OR
Use a single replace call with a regex based on capturing groups.
> var str = "<h1>foo bar</h1>"
> str.replace(/<(\w+\b)[^>]*>([^<>]*)<\/\1>/g, '$2<br>')
'foo bar<br>'
The back-reference (\1) must point to the first capturing group rather than the second, because the first group contains the tag name.
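
A small sketch of the two-step version wrapped in a function (tagsToBreaks is an illustrative name, not from the question). The order matters: opening tags are removed first, otherwise the freshly inserted <br> tags would be stripped by the second pass.

```javascript
// Remove opening tags, then turn every closing tag into <br>.
function tagsToBreaks(text) {
  return String(text)
    .replace(/<\w[^>]*>/g, '')      // opening tags: "<" followed by a word char
    .replace(/<\/[^>]+>/g, '<br>'); // closing tags: "</" ... ">"
}

console.log(tagsToBreaks('<h1>foo</h1><p>bar <b>baz</b></p>'));
// foo<br>bar baz<br><br>
```

This treats anything shaped like a tag as a tag, so it is only suitable for simple, trusted markup.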

How to extract a particular text from url in JavaScript

I have a url like http://www.somedotcom.com/all/~childrens-day/pr?sid=all.
I want to extract childrens-day. How to get that? Right now I am doing it like this
url = "http://www.somedotcom.com/all/~childrens-day/pr?sid=all"
url.match('~.+\/');
But what I am getting is ["~childrens-day/"].
Is there a (definitely there would be) short and sweet way to get the above text without ["~ and /"] i.e just childrens-day.
Thanks

You could use a negated character class and a capture group ( ) and refer to capture group #1. The caret (^) inside of a character class [ ] is considered the negation operator.
var url = "http://www.somedotcom.com/all/~childrens-day/pr?sid=all";
var result = url.match(/~([^~]+)\//);
console.log(result[1]); // "childrens-day"
Note: If you have many URLs inside one string, you may want to add the ? quantifier for a non-greedy match.
var result = url.match(/~([^~]+?)\//);
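
To illustrate why the non-greedy quantifier matters once a string holds several URLs, here is a hedged sketch. The extractTildeSegments name and the a.example / b.example URLs are made up for the demonstration:

```javascript
// Collect every "~segment/" from a string that may contain several URLs.
function extractTildeSegments(text) {
  // +? stops at the first "/" after the tilde instead of the last one.
  return Array.from(text.matchAll(/~([^~]+?)\//g), (m) => m[1]);
}

const twoUrls = 'http://a.example/~one/x http://b.example/~two/y';

console.log(extractTildeSegments(twoUrls)); // [ 'one', 'two' ]

// The greedy variant runs on to the last "/" before the next tilde:
console.log(twoUrls.match(/~([^~]+)\//)[1]); // one/x http://b.example
```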

Like so:
var url = "http://www.somedotcom.com/all/~childrens-day/pr?sid=all"
var matches = url.match(/~(.+?)\//);
console.log(matches[1]);
Working example: http://regex101.com/r/xU4nZ6
Note that you passed a string to match() rather than a regex literal. That still works, because match() converts a non-RegExp argument with new RegExp(), but a literal such as /~(.+?)\// is clearer.

Use non-capturing groups with a captured group then access the [1] element of the matches array:
(?:~)(.+)(?:/)
Keep in mind that you will need to escape your / if using it also as your RegEx delimiter.

Yes, it is.
url = "http://www.somedotcom.com/all/~childrens-day/pr?sid=all";
url.match('~(.+)\/')[1];
Just wrap what you need in a parenthesized group. No other changes to your code are needed.
References: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp

You could just do a string replace on the part you already matched:
var part = url.match('~.+\/')[0];
part = part.replace('~', '').replace('/', '');
http://www.w3schools.com/jsref/jsref_replace.asp

Extract text from HTML with Javascript regex

I am trying to parse a webpage and to get the number reference after <li>YM#. For example I need to get 1234-234234 in a variable from the HTML that contains
<li>YM# 1234-234234 </li>
Many thanks for your help someone!
Rich

Currently, your regex only matches if there is a single digit before the dash and a single digit after it. This will let you match one or more digits in each place instead (the \s* allows for the space after YM# in your sample HTML):
/YM#\s*[0-9]+-[0-9]+/g
Then you also need to capture it, so we use a capture group:
/YM#\s*([0-9]+-[0-9]+)/g
Then we need to refer to the capture group, so we use the following code instead of String.match:
var regex = /YM#\s*([0-9]+-[0-9]+)/g;
var match = regex.exec(text);
var id = match[1];
// match[0]: match of the entire regex
// after that, each of the groups gets a number
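
A slightly more defensive sketch of the same idea, since exec returns null when nothing matches and reading index 1 of null throws. The extractYmReference name is illustrative, and the optional whitespace after YM# is an assumption based on the sample HTML in the question:

```javascript
// Return the reference number after "YM#", or null if it is absent.
function extractYmReference(html) {
  // \s* tolerates the space between "YM#" and the number.
  const match = /YM#\s*(\d+-\d+)/.exec(html);
  return match ? match[1] : null;
}

console.log(extractYmReference('<li>YM# 1234-234234 </li>')); // 1234-234234
console.log(extractYmReference('<li>no reference</li>'));     // null
```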

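(?<=<li>YM#\s)([\d-]+)
http://regexr.com?30ng5
This will match the numbers that directly follow <li>YM# and a space. It relies on a lookbehind, which needs an ES2018 or later JavaScript engine.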

Try this:
(<li>[^#<>]*?# *)([\d\-]+)\b
and get the result in $2.
