Convert raw html to text with javascript and regex


I have raw HTML containing link tags. I want to extract the href attribute from each link and keep all the text between the tags, dropping the tags themselves.
For example:
<br>#EXTINF:-1 tvg-name="1377",Страшное HD<br>
<a title="Ссылка" rel="nofollow" href="http://4pda.ru/pages/go/?u=http%3A%2F%2F46.61.226.18%2Fhls%2FCH_C01_STRASHNOEHD%2Fbw3000000%2Fvariant.m3u8%3Fversion%3D2" target="_blank">http://46.61.226.18/hl…variant.m3u8?version=2</a>
<br>#EXTINF:-1 tvg-name="983" ,Первый канал HD<br>
<a title="Ссылка" rel="nofollow" href="http://4pda.ru/pages/go/?u=http%3A%2F%2F46.61.226.18%2Fhls%2FCH_C06_1TVHD%2Fbw3000000%2Fvariant.m3u8%3Fversion%3D2" target="_blank">http://46.61.226.18/hl…variant.m3u8?version=2</a>
should be converted to:
#EXTINF:-1 tvg-name="1377",Страшное HD
http://4pda.ru/pages/go/?u=http%3A%2F%2F46.61.226.18%2Fhls%2FCH_C01_STRASHNOEHD%2Fbw3000000%2Fvariant.m3u8%3Fversion%3D2
#EXTINF:-1 tvg-name="983" ,Первый канал HD
http://4pda.ru/pages/go/?u=http%3A%2F%2F46.61.226.18%2Fhls%2FCH_C06_1TVHD%2Fbw3000000%2Fvariant.m3u8%3Fversion%3D2
I tried several regexes. Here is what I have so far:
var source_text = $("#source").val();
// Delete the beginning of each link tag, up to and including href="
var delete_start_of_link_tag = source_text.replace(/<a(.+?)href="/gi, "");
// Delete all remaining tags, such as </a> and <br>
var delete_tags = delete_start_of_link_tag.replace(/<\/?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)\/?>/gi, "");
Next I want to delete everything after each href value up to the end of the line.
Which regex should I use in the replace method, or is there a different way to do this conversion?

Formatting Anchor Tags
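In your example, you are not removing the "> part from the HTML.
Use this code to remove everything after the closing quote of the href value:
var delete_tags = delete_start_of_link_tag.replace(/".*/gi, "");
Note that this pattern only handles double quotes and deletes from the first double quote on each line, so it also truncates lines such as the #EXTINF ones, which contain quotes of their own.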
A few things to note:
1. The value of href may be enclosed in single quotes (') or double quotes ("); both are valid.
2. A regex that matches every quoted href in a given string is href=["'].*?["']
3. Some patterns of href values that I have come across are listed below.
http://www.so.com
https://www.so.com
www.so.com
//so.com
/socom.html
javascript*
mailto*
tel*
So if you want to format URLs, you have to consider the cases above, and I may have missed some.
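The notes above can be combined into a single pass. The following is a minimal sketch, not a definitive implementation: htmlToText is a hypothetical helper name, and it assumes each link is a complete <a ...>...</a> element whose href value is quoted with either ' or ".

```javascript
// Hypothetical helper: converts the raw HTML from the question to plain text.
function htmlToText(html) {
  return html
    // Replace each whole <a ...>...</a> element with the value of its href.
    // Group 1 is the opening quote, group 2 is the href value.
    .replace(/<a\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1[^>]*>[\s\S]*?<\/a>/gi, "$2")
    // Drop every remaining tag, such as <br>.
    .replace(/<\/?[a-z][^>]*>/gi, "");
}
```

Because only tags are removed, the line breaks already present in the source text are preserved, which is what produces one entry per line.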

Looks like you're already using jQuery.
Get the href of each anchor:
$('a').each(function(){
  var href = $(this).attr('href');
});
Get the text of each anchor:
$('a').each(function(){
  var text = $(this).text();
});
You haven't shown a wrapper element around these but you can get the text (without tags) of any selection.
var text = $('#some_id').text();

Related

Changing asp.net core TagHelpers with Javascript

I have
<a id="continue-link" asp-controller="Account" asp-action="Register" asp-route-id="1">Continue </a>
in my ASP.NET Core application, which generates this HTML when compiled:
Continue
How can I change the asp-route-id value from JavaScript? I tried $().attr, but the attribute is not recognized.

You have to change the href attribute in the generated HTML.
You can achieve this by getting the href attribute, splitting it into an array, changing the value in the array, joining the array back into one string with the separator, and then setting the result as the href attribute of your a element.
Code example:
var $link = $('#continue-link');
var href = $link.attr('href').split('/');
href[3] = 4; //here you set your new asp-route-id value
$link.attr('href', href.join('/'));
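The fixed index 3 only works when the generated path has exactly three segments. As a hedged variation, the sketch below replaces the last path segment instead. setRouteId is a hypothetical helper name, and the path /Account/Register/1 used in the comment is an assumption about what the tag helper generates, since the generated HTML is not shown in the question.

```javascript
// Hypothetical helper: replaces the last segment of a URL path.
// Assumes the route id is the final segment, e.g. "/Account/Register/1",
// and that the href has no query string or trailing slash.
function setRouteId(href, newId) {
  var parts = href.split("/");
  parts[parts.length - 1] = String(newId);
  return parts.join("/");
}

// Usage with jQuery (not executed here):
// var $link = $('#continue-link');
// $link.attr('href', setRouteId($link.attr('href'), 4));
```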

jQuery modifies string that's partial html

I have a problem replacing the links in a string with jQuery. If I pass the string to jQuery using
$("<div />").html(myContentString)
jQuery strips out all unmatched HTML tags. If the string contains only the closing tag of a ul element without its opening tag, jQuery strips that closing tag away.
Any idea how to iterate over all href attributes of a "partial HTML string"? I made a quick example that illustrates this better:
http://jsfiddle.net/yporqgod/
var content = "<li>Option <a href='jadda1'>number 1</a></li><li>Option number <a href='2'>dos</a></li></ul></div>",
$content = $('<div/>').html(content);
console.log($content.html()); /* no </ul> nor </div> */
$content.find("a").each(function() {
var thisHref = $(this).attr('href');
$(this).attr('href', 'exchanged-' + thisHref);
});
console.log($content.html());
Thanks in advance.
DS.
The reason for having unclosed HTML tags in the string is that I have to split a bigger chunk of HTML into three pieces. Don't ask, that's a completely different story ;)

That's just how the DOM and jQuery work. The best solution I can give you is to add the tags back manually when you're done:
console.log($content.html() + '</ul></div>');
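An alternative that avoids parsing altogether is to rewrite the href values directly in the string, so unmatched tags are never touched. This is only a sketch under assumptions: prefixHrefs is a hypothetical helper name, and it assumes every href value is quoted with ' or ".

```javascript
// Hypothetical helper: prefixes every quoted href value inside <a ...> tags
// without building a DOM, so unmatched tags such as </ul> survive untouched.
function prefixHrefs(html, prefix) {
  return html.replace(
    /(<a\b[^>]*?\bhref\s*=\s*)(["'])(.*?)\2/gi,
    function (match, before, quote, value) {
      // Rebuild the attribute with the same quote character it was written with.
      return before + quote + prefix + value + quote;
    }
  );
}
```

Called with the content string from the question and the prefix 'exchanged-', it leaves the trailing </ul></div> in place.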

How can I Strip all regular html tags except <a></a>, <img>(attributes inside) and <br> with javascript?

When a user creates a message, there is a multibox connected to a design panel that lets users change fonts, color, size, etc. When the message is submitted, it is displayed with HTML tags if the user has changed the color, size, etc. of the font.
Note: I need the design panel. I know it's possible to remove it, but that is not an option here :)
It's a SharePoint standard. The only solution I have is to use JavaScript to strip these tags when the message is displayed. The user should only be able to insert links, images, and line breaks.
This means that all HTML tags should be stripped except the <a></a>, <img>, and <br> tags.
It's also important that the attributes inside the <img> tag are not removed. It could be displayed like this:
<img src="/image/Penguins.jpg" alt="Penguins.jpg" style="margin:5px;width:331px;">
How can I accomplish this with JavaScript?
I used to use the following C# code-behind, which worked perfectly, but it strips all HTML tags except <br>.
// requires: using System.Text.RegularExpressions;
public string Strip(string text)
{
    return Regex.Replace(text, @"<(?!br[\x20/>])[^<>]+>", string.Empty);
}
Any kind of help is appreciated a lot.

Does this do what you want? http://jsfiddle.net/smerny/r7vhd/
$("body").find("*").not("a,img,br").each(function() {
$(this).replaceWith(this.innerHTML);
});
Basically select everything except a, img, br and replace them with their content.

Smerny's answer works well, except when the HTML structure looks like this:
var s = '<div><div>Link<span> Span</span><li></li></div></div>';
var $s = $(s);
$s.find("*").not("a,img,br").each(function() {
$(this).replaceWith(this.innerHTML);
});
console.log($s.html());
The live code is here: http://jsfiddle.net/btvuut55/1/
This happens when there are two or more wrappers on the outside (two divs in the example above).
This is because jQuery reaches the outermost div first and replaces it with its innerHTML, which still contains the span, so the span is retained.
The answer using $('#container').find('*:not(br,a,img)').contents().unwrap() fails to deal with tags that have empty content.
A working solution is simple: loop from the innermost element outwards:
var $elements = $s.find("*").not("a,img,br");
for (var i = $elements.length - 1; i >= 0; i--) {
  var e = $elements[i];
  $(e).replaceWith(e.innerHTML);
}
The working copy is: http://jsfiddle.net/btvuut55/3/

With jQuery you can find all the elements you don't want, then use unwrap to strip the tags:
$('#container').find('*:not(br,a,img)').contents().unwrap()

I think it would be better to extract the good tags. It is easier to match a few tags than to remove all the other elements and every HTML possibility. Try something like this; I tested it and it works fine:
// the following regex matches the good tags with their attributes and inner content
var ptt = new RegExp("<(?:img|a|br){1}.*/?>(?:(?:.|\n)*</(?:img|a|br){1}>)?", "g");
var input = "<this string would contain the html input to clean>";
var result = "";
var match = ptt.exec(input);
while (match) {
  // match is an array; element 0 is the full matched text
  result += match[0];
  match = ptt.exec(input);
}
// result will contain the clean HTML with only the good tags
console.log(result);
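For comparison, the C# pattern from the question can be adapted to JavaScript and extended to keep a and img as well as br. This is a rough sketch rather than a sanitizer: stripExceptAllowed is a hypothetical name, and the regex assumes reasonably well-formed tags with no < or > inside attribute values.

```javascript
// Hypothetical helper: removes every tag except <a>, </a>, <img> and <br>,
// keeping the text content and the attributes of the allowed tags.
function stripExceptAllowed(html) {
  // The negative lookahead skips tags whose name is exactly a, img or br
  // (optionally preceded by "/" for closing tags).
  return html.replace(/<(?!\/?(?:a|img|br)\b)[^<>]+>/gi, "");
}
```

Unlike the DOM-based answers, this works on the string alone, so it does not depend on how the browser repairs nested or unclosed markup.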

Extracting Anchor tag with text separately

I am working in JavaScript and want to extract the link from my anchor tag and the text of the anchor separately. For example, I have the following anchor tag:
Learn More
I want to get the href in one variable and the text "Learn More" in another.
How can I do this?

Try this:
anchor.getAttribute("href")
Or, for an anchor inside some element:
<a href="relativeURL" >
var link = element.querySelector('a').getAttribute('href');
EDIT: I have updated the code; try that.

Give it an id:
<a id="myLink" href="http://www.defconpaintball.com/hiring">Learn More</a>
var anchor = document.getElementById("myLink");
alert(anchor.getAttribute("href")); // Extract link
alert(anchor.innerHTML); // Extract Text
EDIT:
I don't know what is preventing you from parsing the whole content.
Something like this should be enough:
var anchors = document.getElementsByTagName("a");
// Use an index loop; for...in would also visit non-element properties such as length
for (var i = 0; i < anchors.length; i++) {
  if (anchors[i].innerHTML == "Learn More") {
    alert(anchors[i].getAttribute("href"));
  }
}
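To keep the two values in separate variables, the lookups can be wrapped in a small function. This is an illustrative sketch: readAnchor is a hypothetical name, and it reads textContent rather than innerHTML so that any nested markup is not included in the text.

```javascript
// Hypothetical helper: returns the href and the visible text of an anchor.
// Works with any object that offers getAttribute() and textContent,
// such as a DOM element returned by document.getElementById().
function readAnchor(anchor) {
  return {
    href: anchor.getAttribute("href"), // the attribute exactly as written
    text: anchor.textContent.trim()    // text only, without nested tags
  };
}

// Usage in a browser (not executed here):
// var info = readAnchor(document.getElementById("myLink"));
// var link = info.href;
// var label = info.text;
```

Note that getAttribute("href") returns the value as written in the markup, while the anchor's href property returns the resolved absolute URL.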

Extracting <a> from Text in javascript

I have an array that is filled with text from a website using JSON parsing.
That text sometimes has an <a> tag at the end and sometimes in the middle.
I want to extract that <a> tag.

You could use a regex, like:
var str = "test sdf <a href='www.google.com'>test</a> sdfsdf";
var anchor = str.match(/<a[^>]*>([^<]+)<\/a>/);
console.log( anchor[0] ); //returns <a href='www.google.com'>test</a>

I'm not sure exactly what you're trying to extract, but a regular expression might be what you're looking for:
/\<a.*?\>/.exec('Hello <a href="foo.html">World</a>!')
Output:
["<a href=\"foo.html\">"]
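When the text may contain several links, the same idea can be extended to collect every anchor along with its href and inner text. The sketch below makes some assumptions: extractAnchors is a hypothetical name, href values are quoted with ' or ", and anchors without an href are skipped.

```javascript
// Hypothetical helper: finds every <a ...>...</a> in a string and returns
// the full tag, the href value and the inner text of each one.
function extractAnchors(str) {
  var pattern = /<a\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1[^>]*>([\s\S]*?)<\/a>/gi;
  var results = [];
  var m;
  // With the g flag, exec() continues from the end of the previous match.
  while ((m = pattern.exec(str)) !== null) {
    results.push({ tag: m[0], href: m[2], text: m[3] });
  }
  return results;
}
```

For the sample string from the first answer, this yields one entry whose href is www.google.com and whose text is test.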
