Extracting <a> from text in JavaScript

I have an array that gets its text from a website through JSON parsing.
The text sometimes contains an <a> tag, either at the end or somewhere in the middle.
I want to extract that <a> tag.

You could use a regex, like:
var str = "test sdf <a href='www.google.com'>test</a> sdfsdf";
var anchor = str.match(/<a[^>]*>([^<]+)<\/a>/);
console.log( anchor[0] ); //returns <a href='www.google.com'>test</a>
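If the text can contain more than one link, the same idea extends to collecting every match. This is only a sketch, assuming plain, non-nested anchors; the helper name extractAnchors and the sample string are made up for illustration:

```javascript
// Hypothetical helper (not from the original answer): collect every <a>...</a>.
// Assumes simple, non-nested anchors; a DOM parser is more robust for real pages.
function extractAnchors(str) {
  const pattern = /<a\b[^>]*>([\s\S]*?)<\/a>/gi;
  // matchAll requires the g flag and yields one match array per anchor.
  return Array.from(str.matchAll(pattern), (m) => ({ tag: m[0], text: m[1] }));
}

const sample = "test sdf <a href='www.google.com'>test</a> sdf <a href='#two'>more</a>";
const anchors = extractAnchors(sample);
console.log(anchors.map((a) => a.tag));
```

Each entry keeps the whole tag plus its inner text, so it does not matter whether the link sits at the end of the text or in the middle.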

I'm not sure exactly what you're trying to extract, but a regular expression might be what you're looking for:
/<a.*?>/.exec('Hello <a href="foo.html">World</a>!')
Output:
["<a href="foo.html">"]

Related

Replacing Text without losing inner tags using Javascript

Let suppose I have a string:
var firstString = "<h3>askhaks</h3><h3>1212</h3><h1 style='color:red;text-decoration:underline;'><a href=''><span id='123'><i class='fa fa-inr'></i> </span> Hello! Admin</span></a></h1><p>This is the content of page 'String Replace in Javascript'</p><h1>First</h1><span><h1>Hello! Admin</h1>Thank You for visiting this page</span><h1>Third</h1>";
I want to change the text of the first <h1> tag without losing the inner tags, i.e. <a href=''><span id='123'><i class='fa fa-inr'></i> </span>
I just want to replace Hello! Admin with another text. With the code below I can replace the text of the first <h1> tag without losing the inline styling on the <h1>, but I am losing the inner tags.
var innerText = document.getElementsByTagName('h1')[0].innerHTML;
How to achieve this?
If you want to change the text before inserting it into the document, you can use DOMParser on the input HTML string, get its trimmed textContent, and then replace the substring in the input with your desired string, thus preserving all HTML tags:
var firstString = "<h3>askhaks</h3><h3>1212</h3><h1 style='color:red;text-decoration:underline;'><a href=''><span id='123'><i class='fa fa-inr'></i></span> Hello! Admin</span></a></h1><p>This is the content of page 'String Replace in Javascript'</p><h1>First</h1><span><h1>Second</h1>Thank You for visiting this page</span><h1>Third</h1>";
const doc = new DOMParser().parseFromString(firstString, 'text/html');
const h1Text = doc.querySelector('h1').textContent.trim();
console.log(firstString.replace(h1Text, 'foo bar new string'));
document.getElementsByTagName("h1")[0].innerHTML = "Hello!";
If it's not in the DOM and it's a string, you can use a regex in a replace like:
str.replace(/(<h1[^>]*>)(.+?)(<\/h1>)/, "$1YOUR_NEW_TEXT$3");
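Note that replacing everything between <h1> and </h1> also drops the inner tags, which is what the question wants to avoid. A rough string-only sketch that touches just the text runs inside the first <h1> is below. The function name replaceTextInFirstH1 and the shortened sample markup are assumptions for illustration, and it only copes with simple, well-formed markup:

```javascript
// Hypothetical helper: swap oldText for newText inside the first <h1> only,
// leaving every tag (and its attributes) exactly as it was.
function replaceTextInFirstH1(html, oldText, newText) {
  // No g flag, so only the first <h1>...</h1> is rewritten.
  return html.replace(/(<h1\b[^>]*>)([\s\S]*?)(<\/h1>)/i, (whole, open, inner, close) => {
    // Walk the inner HTML piece by piece: tags are returned untouched,
    // text runs get the substitution.
    const updated = inner.replace(/<[^>]+>|[^<]+/g, (piece) =>
      piece.startsWith('<') ? piece : piece.replace(oldText, () => newText));
    return open + updated + close;
  });
}

const page = "<h1 style='color:red;'><a href=''><span id='123'><i class='fa'></i></span> Hello! Admin</a></h1><h1>Hello! Admin</h1>";
const updatedPage = replaceTextInFirstH1(page, 'Hello! Admin', 'Welcome');
console.log(updatedPage);
```

For anything beyond simple markup, parsing into a DOM and editing the text node is the safer route.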

Convert raw html to text with javascript and regex

I have raw HTML with link tags. My goal is to extract the href attribute from each link tag and keep all the text between tags, dropping the tags themselves.
For example:
<br>#EXTINF:-1 tvg-name="1377",Страшное HD<br>
<a title="Ссылка" rel="nofollow" href="http://4pda.ru/pages/go/?u=http%3A%2F%2F46.61.226.18%2Fhls%2FCH_C01_STRASHNOEHD%2Fbw3000000%2Fvariant.m3u8%3Fversion%3D2" target="_blank">http://46.61.226.18/hl…variant.m3u8?version=2</a>
<br>#EXTINF:-1 tvg-name="983" ,Первый канал HD<br>
<a title="Ссылка" rel="nofollow" href="http://4pda.ru/pages/go/?u=http%3A%2F%2F46.61.226.18%2Fhls%2FCH_C06_1TVHD%2Fbw3000000%2Fvariant.m3u8%3Fversion%3D2" target="_blank">http://46.61.226.18/hl…variant.m3u8?version=2</a>
This has to be converted to:
#EXTINF:-1 tvg-name="1377",Страшное HD
http://4pda.ru/pages/go/?u=http%3A%2F%2F46.61.226.18%2Fhls%2FCH_C01_STRASHNOEHD%2Fbw3000000%2Fvariant.m3u8%3Fversion%3D2
#EXTINF:-1 tvg-name="983" ,Первый канал HD
http://4pda.ru/pages/go/?u=http%3A%2F%2F46.61.226.18%2Fhls%2FCH_C06_1TVHD%2Fbw3000000%2Fvariant.m3u8%3Fversion%3D2
I tried different regexes. Here is what I did:
var source_text = $("#source").val();
var delete_start_of_link_tag = source_text.replace(/<a(.+?)href="/gi, "");
This deletes the beginning of each tag, up to the href attribute.
var delete_tags = delete_start_of_link_tag.replace(/<\/?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)\/?>/gi, "");
This deletes all remaining tags, such as </a> and <br>.
Then I want to delete all text after the href values to the end of the line.
What regex should I use in the replace method, or is there a different way to do this conversion?
Formatting Anchor Tags
In your example, you are not replacing the "> part of the HTML.
Use this code to remove everything from the closing quote of the href through the closing </a>. Matching through </a> leaves the quotes in the #EXTINF lines untouched.
var delete_tags = delete_start_of_link_tag.replace(/".*<\/a>/gi, "");
A few things to note:
1. The value in href can be enclosed in single quotes (') or double quotes ("); both are valid.
2. A regex that matches every href in a given string is href=["'].*?["']
3. Some patterns I have come across in href values are below.
http://www.so.com
https://www.so.com
www.so.com
//so.com
/socom.html
javascript*
mailto*
tel*
So if you want to format URLs, you have to consider the above cases, and I may have missed some.
Looks like you're already using jQuery.
Get the href of each anchor
$('a').each(function(){
var href = $(this).attr('href');
});
Get the text of each anchor:
$('a').each(function(){
var text = $(this).text();
});
You haven't shown a wrapper element around these but you can get the text (without tags) of any selection.
var text = $('#some_id').text();
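If the markup is only available as a string (for example the value of a textarea), the whole conversion can also be done in one pass without building a DOM. This is a sketch under assumptions: the function name htmlToPlaylist and the shortened sample URLs are invented for illustration, and the input is assumed to look like the question's example (one href per anchor, <br> as the only other tag):

```javascript
// Hypothetical helper: turn each <a ... href="X" ...>label</a> into X,
// drop <br> tags, and discard empty lines.
function htmlToPlaylist(source) {
  return source
    // (["']) remembers which quote opened the href; \1 requires the same one to close it.
    .replace(/<a\b[^>]*\bhref=(["'])(.*?)\1[^>]*>[\s\S]*?<\/a>/gi, '$2')
    .replace(/<br\s*\/?>/gi, '')
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line !== '')
    .join('\n');
}

const rawHtml = [
  '<br>#EXTINF:-1 tvg-name="1377",Channel One HD<br>',
  '<a title="Link" rel="nofollow" href="http://example.com/go/?u=http%3A%2F%2Fhost%2Fvariant.m3u8" target="_blank">http://host/…variant.m3u8</a>',
].join('\n');

console.log(htmlToPlaylist(rawHtml));
```

Because the anchor is matched as a whole, the quotes inside tvg-name="..." are never touched.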

Importing string from HTML into markdown parser

I am using a markdown parser that works great if I pass it a string like this:
el.innerHTML = marked('#Introduction:\nHere you can write some text.');
But if I have that string inside HTML and send it to the parser like this:
el.innerHTML = marked(otherEl.innerHTML);
it does not get parsed. Why is this? Does the string format of .innerHTML do something I am missing?
jsFiddle: http://jsfiddle.net/5p8be1b4/
My HTML:
<div id="editor">
<div class="contentTarget"></div>
<div class="contentSource">#Introduction:\nHere you can write some text.</div>
</div>
div.contentTarget should receive HTML (the parsed markdown), but it only receives an unparsed string.
The image below shows the jsFiddle output: an unformatted div.contentTarget, the original div.contentSource whose innerHTML I pass to div.contentTarget, and at the bottom a correctly parsed div#tester, which received a string passed directly to the parser.
The issue is around your newlines. When you put \n inside a string in javascript, you're putting an actual newline character in it.
The same \n inside your HTML content is just that, \n. It is not a newline. If you change your HTML to this (with an actual newline), it works as expected:
<div class="contentSource">#Introduction:
Here you can write some text.</div>
Alternatively, if you change your javascript string to:
test.innerHTML = marked('#Introduction:\\nHere you can write some text.');
So that the string actually contains \n rather than a newline, you see the same erroneous behaviour.
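The difference is easy to check without any markdown library. A small sketch, where fromMarkup stands in for what .innerHTML would return (the variable names are made up for illustration):

```javascript
// In a JS string literal, \n is a single newline character.
const fromLiteral = '#Introduction:\nHere you can write some text.';
// Text read from the HTML holds two characters instead: a backslash and an n.
const fromMarkup = '#Introduction:\\nHere you can write some text.';

console.log(fromLiteral.includes('\n')); // true
console.log(fromMarkup.includes('\n'));  // false

// Turning the two-character sequence into a real newline makes them identical.
const repaired = fromMarkup.replace(/\\n/g, '\n');
console.log(repaired === fromLiteral);   // true
```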
Got it.
In your HTML, you have \n, but it's supposed to be a line break, and you should use <br/> because this is HTML.
<div class="contentSource">#Introduction:<br/>Here you can write some text.</div>
instead of:
<div class="contentSource">#Introduction:\nHere you can write some text.</div>
When you debug the code, if you send the innerHTML to marked, it shows this as a function parameter:
#Introduction:\nHere you can write some text.
But when you send the string in the js, it shows the parameter like this:
#Introduction:
Here you can write some text.
Hope this helps.
JsFiddle: http://jsfiddle.net/gbrkj901/11/
Your HTML is rendering differently because JavaScript interprets \n inside a string literal as a newline, while the same characters in HTML stay a literal backslash followed by n.
Consider the following:
alert('a\ntest');
Which will have an alert with 2 lines.
And now, consider the following:
<span>a\ntest</span>
<script>
alert(document.getElementsByTagName('span')[0].innerHTML);
</script>
This will show a\ntest.
To fix it, use this:
el.innerHTML = marked(otherEl.innerHTML.replace(/\\n/g,'\n'));
Or, a more general and secure way:
el.innerHTML = marked(
    otherEl
        .innerHTML
        .replace(
            /\\([btnvfr"'\\])/g,
            // Map each escaped character to the character it stands for.
            function(_, c) {
                return {
                    b: '\b',
                    t: '\t',
                    v: '\v',
                    n: '\n',
                    f: '\f',
                    r: '\r',
                    '"': '"',
                    "'": "'",
                    '\\': '\\'
                }[c];
            }
        )
);
Or, if you like it minimal and you are ready to have cthulhu knocking on your front door, use this:
el.innerHTML = marked(otherEl.innerHTML.replace(/\\([btnvfr])/g,function(_,c){return eval('"\\'+c+'"');}));

Replacing HTML String & Avoiding Tags (regex)

I'm trying to use JS to replace a specific string inside text that contains HTML tags, attributes, and styles, without matching anything inside the tags themselves (and while keeping the original tags in the text).
For example, I want <span> this is span text </span> to become <span> this is s<span class="found">pan</span> text </span> when the keyword is "pan".
I tried using a regex for that.
My regex so far:
$(this).html($(this).html().replace(new RegExp("([^<\"][a-zA-Z0-9\"'\=;:]*)(" + search + ")([a-zA-Z0-9\"'\=;:]*[^>\"])", 'ig'), "$1<span class='found'>$2</span>$3"));
This regex fails in cases like <span class="myclass"> span text </span> when search="p". The result:
<s<span class="found">p</span>an class="myclass"> s<span class="found">p</span>an text</s<span class="found">p</span>an>
*This topic should help anyone who wants to find and replace a match while leaving strings surrounded by specific characters untouched.
As thg435 says, the right way to deal with HTML content is to use the DOM.
But if you want to avoid something in a replace, you can match what you want to avoid first and replace it with itself.
Example that avoids HTML tags:
var text = '<span class="myclass"> span text </span>';
function callback(p1, p2) {
return ((p2==undefined)||p2=='')?p1:'<span class="found">'+p1+'</span>';
}
var result = text.replace(/<[^>]+>|(p)/g, callback);
alert(result);
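The same trick works for an arbitrary keyword as long as it is escaped before being placed in the pattern. A sketch, where the function name highlight is made up and the found class is taken from the question:

```javascript
// Hypothetical helper: wrap every occurrence of keyword that sits outside a tag.
function highlight(html, keyword) {
  // Escape regex metacharacters so the keyword is matched literally.
  const escaped = keyword.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  // First alternative swallows whole tags; the capture group only fills in
  // when the keyword is found in text between tags.
  const pattern = new RegExp('<[^>]+>|(' + escaped + ')', 'gi');
  return html.replace(pattern, (match, found) =>
    found === undefined ? match : '<span class="found">' + found + '</span>');
}

console.log(highlight('<span class="myclass"> span text </span>', 'p'));
console.log(highlight('<span> this is span text </span>', 'pan'));
```

Tags are consumed by the first alternative, so the keyword can never match inside a tag name or attribute.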

How to delete string from string with regex and jQuery/JS?

I wonder, how to delete:
<span>blablabla</span>
from:
<p>Text wanted <span>blablabla</span></p>
I'm getting the text from p using:
var text = $('p').text();
And for some reason I'd like to remove from var text the span and its content.
How can I do it?
It's impossible to remove the <span> from the variable text, because it doesn't exist there — text is just text, without any trace of elements.
You have to remove the span earlier, while there is still some structure:
$('p').find('span').remove();
Or serialize the element using HTML (.html()) rather than plain text.
Don't edit HTML using regular expressions — HTML is not a regular language and regular expressions will fail in non-trivial cases.
var html = $('p').html();
var tmp = $('<p>').html(html);
tmp.find('span').remove();
var text = tmp.text();
Alternatively, with a regex on the HTML string:
text = text.replace(/<span>.*?<\/span>/g, '');
To remove the unwanted whitespace before the <span>, use
text = text.replace(/\s*<span>.*?<\/span>/g, '');
leaving you with
<p>Text wanted</p>
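For a quick check of the regex variant outside the browser, here is a sketch wrapped in a function. The name removeSpans is made up, and it assumes spans are not nested; as noted above, the DOM-based approach is the more reliable one.

```javascript
// Hypothetical helper: strip every <span>...</span> plus the whitespace before it.
// The lazy [\s\S]*? stops at the first closing tag, so two separate spans are
// removed one by one instead of everything between the first and the last.
function removeSpans(html) {
  return html.replace(/\s*<span\b[^>]*>[\s\S]*?<\/span>/gi, '');
}

console.log(removeSpans('<p>Text wanted <span>blablabla</span></p>'));
console.log(removeSpans('<p>A <span>x</span> B <span class="note">y</span></p>'));
```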
