error in parsing web page using javascript - javascript

I am trying to parse a page using JavaScript. This is part of the page:
<div class="title">
<h1>
Affect and Engagement in Game-BasedLearning Environments
</h1>
</div>
This is the link to the page source: view-source:http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=6645369?tp=&arnumber=6645369
I am using this:
$(data).find('h1').each(function() {
    console.log($(this).text());
});
Now I am able to get the value inside the header, but the displayed value has a lot of whitespace before and after it. I tried to remove the whitespace with the replace function, but nothing is replaced. I don't understand what is in front of and behind the header's value. I just want to remove the extra space.

When replace is given a string pattern, it only replaces the first match, so it might have removed just one space. Try this instead, using a regular expression with the g flag:
text.replace(/ /g, '');
This removes all spaces, even the ones inside your text. To avoid that, you may want to collapse runs of two or more spaces into a single space instead:
text.replace(/ {2,}/g, ' ');
Also you may want to remove new lines:
text.replace(/\n/g, '');
Here is an example JSFiddle
If you know for sure that your string is only surrounded on either end by spaces, but you want to preserve everything inside, you can use trim:
text.trim();
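To make the difference between these options concrete, here is a small sketch. The sample string is made up to resemble the padded heading text from the question, and the variable names are only for illustration. Note that replace and trim return a new string rather than changing the original, so the result has to be assigned or used.

```javascript
// Hypothetical sample: heading text padded with newlines and spaces.
const raw = "\n   Affect and Engagement  \n";

// A string pattern only replaces the first match (one leading space here).
const firstOnly = raw.replace(" ", "");

// A global regex removes every space, including the ones between words.
const noSpaces = raw.replace(/ /g, "");

// trim() strips leading and trailing whitespace (spaces, tabs, newlines) only.
const trimmed = raw.trim();

console.log(JSON.stringify(firstOnly)); // "\n  Affect and Engagement  \n"
console.log(JSON.stringify(noSpaces));  // "\nAffectandEngagement\n"
console.log(JSON.stringify(trimmed));   // "Affect and Engagement"
```

For text pulled out of an h1 like the one in the question, trim() is probably the closest fit, since the padding includes newlines as well as spaces.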

Since you're already using jQuery, you can take advantage of its $.trim function, which removes leading and trailing whitespace.
$(data).find('h1').each(function() {
console.log($.trim($(this).text()));
});
Reference: $.trim()

Try using JavaScript's trim() to get rid of the whitespace on both sides.
Function reference:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/Trim

Your HTML actually contains these spaces within your <h1> element, so it is to be expected that they are present in the result of .text().
Normally you'd just use .trim(). However, you'll likely want to replace line breaks inside the text as well.
$(data).find('h1').each(function() {
var text = $(this).text();
// Replaces every run of whitespace with a single space.
// This also applies to whitespace inside the text.
// E.g. " \r\n \t" -> " "
text = text.replace(/\s+/g, " ");
// Trim leading/trailing whitespace.
text = text.trim();
console.log(text);
});
Fiddle for your pleasure.

Related

Parser: How to Turn a JavaScript String into One Line of JavaScript?

I'm trying to make a parser for formatting JavaScript in a contextual format. First I want to be able to convert the input JavaScript into one line of JavaScript and then format the code based on my requirements. This does not remove all of the enters or white space:
txt = $.trim(txt);
txt = txt.replace("\n", "");
How can I convert the text into one line?
Use a regular expression with the "global" flag set:
txt.replace(/\n/g, "");
However, you should be careful about removing line breaks in JavaScript. You might break code that depends on automatic semicolon insertion. Why don't you use an off-the-shelf parser like Esprima?
Use:
\s, the character class that matches any whitespace character (carriage return, line feed, tab, space, ...)
the global g flag, so that every match is replaced and not just the first one.
var text = txt.replace(/\s+/g, ' ');
Hope it helps
If the text comes from some operating systems (Windows, for example), it may use the \r\n line ending, so it is worth removing both characters.
You should also use /\r/g, which replaces all \r characters, not just the first one.
var noNewLines = txt.replace(/\r/g, "").replace(/\n/g, "");
You have to be pretty sure there are no single-line comments and that there are no missing semi-colons.
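As a rough illustration of both risks, the sketch below joins two made-up snippets into one line and checks what happens. The helper parses and the sample strings are hypothetical; new Function is used here only as a quick way to check syntax and is not a substitute for a real parser.

```javascript
// Returns true if the given text parses as a function body.
function parses(code) {
  try {
    new Function(code);
    return true;
  } catch (e) {
    return false;
  }
}

// Hypothetical input that relies on automatic semicolon insertion.
const noSemicolons = "var a = 1\nvar b = 2\nreturn a + b";
const joined = noSemicolons.replace(/\r?\n/g, "");

// Hypothetical input with a single-line comment.
const withComment = "var a = 1; // note\nreturn a;";
const joinedComment = withComment.replace(/\r?\n/g, "");

console.log(parses(noSemicolons));          // true
console.log(parses(joined));                // false: "1var" is a syntax error
console.log(new Function(withComment)());   // 1
console.log(new Function(joinedComment)()); // undefined: the comment swallowed "return a;"
```

The second case is the more dangerous one, because the joined code still parses and only its behavior changes.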
You can try to minify your code, using something like https://javascript-minifier.com/
however this will also change your variable names

Is it possible to combine those two regex or improve my code?

I would like to know if there is a way to combine those two regex below OR a way to combine my two tasks in another way.
1) /(<\/[a-z]>|<[a-z]*>)/g
2) /\s{2,}/g
Specifically, they are used to replace this:
This is <b>a test</b> and this <i> is also a test</i>
Into this:
This is <b> a test </b> and this <i> is also a test </i>
The first regex is used to add a space before and after every opening and closing tag, and the second regex is used to match every occurrence of two or more whitespace characters so that it can be reduced to a single space.
Here is the code
var inputString = 'This is <b>a test</b> and this <i> is also a test</i>',
spacedTags = inputString.replace(/(<\/[a-z]>|<[a-z]*>)/g, ' $1 '),
sanitizedSting = spacedTags.replace(/\s{2,}/g, ' ');
console.log(sanitizedSting);
and the jsfiddle.
I know those can be done using DOM manipulation which will probably be even faster but I'm trying to avoid this.
Thank you
If you also match the spaces before and after each tag, and then use the inner capture group as the replacement value, you can achieve something similar.
var inputString = 'This is <b>a test</b> and this <i> is also a test</i>',
spacedTags = inputString.replace(/(\s*(<\/[a-z]>|<[a-z]*>)\s*)/g, ' $2 ');
console.log(spacedTags);
JS Fiddle
This looks for anything that matches an opening or closing tag, optionally surrounded by whitespace. It then uses the inner match as the replacement, with a space added on either side.
Both implementations, though, always leave a trailing space after any closing tag: "</i> ".
I haven't looked into how this affects performance, but it does address the goal of using a single regular expression.
Is your problem that you may add a space where there already is one? In that case, discard all spaces before and after your tag:
sanitizedSting = inputString.replace(/\s*(<\/?[a-z]*>)\s*/g, ' $1 ');
This also adds a space at the end if you end with a tag (frankly, there are other problems with this exact code).
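One way to deal with the trailing space is to trim the result afterwards. The sketch below is only an illustration: the function name spaceTags is made up, it uses [a-z]+ rather than [a-z]* so that an empty <> is not treated as a tag, and like the patterns above it only handles lowercase tags without attributes.

```javascript
// Puts exactly one space on each side of simple tags such as <b> or </b>,
// then trims the space left over at the start or end of the string.
function spaceTags(input) {
  return input.replace(/\s*(<\/?[a-z]+>)\s*/g, " $1 ").trim();
}

const spaced = spaceTags("This is <b>a test</b> and this <i> is also a test</i>");
console.log(spaced);
// This is <b> a test </b> and this <i> is also a test </i>
```

Because the surrounding whitespace is consumed by the match, a tag that already has spaces around it does not end up with doubled spaces, so the second replace is no longer needed for this input.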

replace similar string in a text using javascript regex

We have a text like:
this is a test :rep more text more more :rep2 another text text qweqweqwe.
or
this is a test :rep:rep2 more text more more :rep2:rep another text text qweqweqwe. (without space)
We need to replace :rep with TEXT1 and :rep2 with TEXT2.
Problem:
When we try to replace using something like:
rgobj = new RegExp(":rep","gi");
txt = txt.replace(rgobj,"TEXT1");
rgobj = new RegExp(":rep2","gi");
txt = txt.replace(rgobj,"TEXT2");
we get TEXT1 in both cases, because :rep2 starts with :rep and :rep is processed first.
If you require that :rep always end with a word boundary, make it explicit in the regex:
new RegExp(":rep\\b","gi");
(If you don't require a word boundary, you can't distinguish what is meant by "hello I got :rep24 eggs" -- is that :rep, :rep2, or :rep24?)
EDIT:
Based on the new information that the match strings are provided by the user, the best solution is to sort the match strings by length and perform the replacements in that order. That way the longest strings get replaced first, eliminating the risk that the beginning of a long string will be partially replaced by a shorter substring match included in that long string. Thus, :replongeststr is replaced before :replong which is replaced before :rep .
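A minimal sketch of that longest-first approach might look like the following. The names replaceTokens and escapeRegExp are made up for this example, and the user-supplied match strings are assumed to arrive as the keys of a plain object. Escaping is included because user-provided strings may contain regex metacharacters.

```javascript
// Escapes regex metacharacters so a user-supplied string is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Replaces each key of `replacements` with its value, longest key first.
function replaceTokens(text, replacements) {
  const keys = Object.keys(replacements).sort((a, b) => b.length - a.length);
  let result = text;
  for (const key of keys) {
    const pattern = new RegExp(escapeRegExp(key), "gi");
    // A function replacement avoids special handling of "$" in the value.
    result = result.replace(pattern, () => replacements[key]);
  }
  return result;
}

const tokenResult = replaceTokens(
  "this is a test :rep:rep2 more :rep2:rep end",
  { ":rep": "TEXT1", ":rep2": "TEXT2" }
);
console.log(tokenResult); // this is a test TEXT1TEXT2 more TEXT2TEXT1 end
```

One caveat with sequential replacement: if a replacement value itself contains one of the match strings, a later pass will replace it again.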
If your data is always consistent, replace :rep2 before :rep.
Otherwise, you could search for :rep\s, searching for the space after the keyword. Just make sure you replace the space as well.

Remove image elements from string

I have a string that contains HTML image elements that is stored in a var.
I want to remove the image elements from the string.
I have tried: var content = content.replace(/<img.+>/,"");
and: var content = content.find("img").remove(); but had no luck.
Can anyone help me out at all?
Thanks
var content = content.replace(/<img[^>]*>/g,"");
[^>]* means any number of characters other than >. If you use .+ instead and there are multiple tags on the same line, the replace operation removes them all at once, including any content between them. Quantifiers are greedy by default, meaning they take the largest possible match.
/g at the end means replace all occurrences (by default, it only removes the first occurrence).
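To see the difference, here is a small comparison on a made-up string with two image tags (the sample markup and variable names are only for illustration):

```javascript
// Hypothetical sample with two image tags and text between them.
const sample = 'a <img src="x.png"> b <img src="y.png"> c';

// Greedy .+ runs from the first "<img" to the last ">" on the line,
// so the text between the two tags is removed as well.
const greedy = sample.replace(/<img.+>/, "");

// [^>]* stops at the first ">", so each tag is matched on its own.
const perTag = sample.replace(/<img[^>]*>/g, "");

console.log(JSON.stringify(greedy)); // "a  c"
console.log(JSON.stringify(perTag)); // "a  b  c"
```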
$('<p>').html(content).find('img').remove().end().html()
The following Regex should do the trick:
var content = content.replace(/<img[^>"']*((("[^"]*")|('[^']*'))[^"'>]*)*>/g,"");
It first matches the <img. Then [^>"']* matches any character except for >, " and ' any number of times. Then (("[^"]*")|('[^']*')) matches two " with any character in between (except " itself, which is this part [^"]*) or the same thing, but with two ' characters.
An example of this would be "asf<>!('" or 'akl>"<?'.
This is again followed by any character except for >, " and ' any number of times. The Regex concludes when it finds a > outside a set of single or double quotes.
This would then account for having > characters inside attribute strings, as pointed out by @Derek 朕會功夫 and would therefore match and remove all four image tags in the following test scenario:
<img src="blah.png" title=">:(" alt=">:)" /> Some text between <img src="blah.png" title="<img" /> More text between <img /><img src='asdf>' title="sf>">
This is of course inspired by @Matt Coughlin's answer.
Use the text() function, it will remove all HTML tags!
var content = $("<p>"+content+"</p>").text();
I'm in IE right now...this worked great, but my tags come out in upper case (after using innerHTML, i think) ... so I added "i" to make it case insensitive. Now Chrome and IE are happy.
var content = content.replace(/<img[^>]*>/gi,"");
Does this work for you?:
var content = content.replace(/<img[^>]*>/g, '')
You could load the text as a DOM element, then use jQuery to find all images and remove them. I generally try to treat XML (html in this case) as XML and not try to parse through the strings.
var element = $('<p>My paragraph has images like this <img src="foo"/> and this <img src="bar"/></p>');
element.find('img').remove();
newText = element.html();
console.log(newText);
To do this without regex or libraries (read jQuery), you could use DOMParser to parse your string, then use plain JS to do any manipulations and re-serialize to get back your string.

getElementById replace HTML

<script type="text/javascript">
var haystackText = document.getElementById("navigation").innerHTML;
var matchText = 'Subscribe to RSS';
var replacementText = '<ul><li>Some Other Thing Here</li></ul>';
var replaced = haystackText.replace(matchText, replacementText);
document.getElementById("navigation").innerHTML = replaced;
</script>
I'm attempting to try and replace a string of HTML code to be something else. I cannot edit the code directly, so I'm using Javascript to alter the code.
If I use the above method Matching Text on a regular string, such as just 'Subscribe to RSS', I can replace it fine. However, once I try to replace an HTML string, the code 'fails'.
Also, what if the HTML I wish to replace contains line breaks? How would I search for that?
<ul><li>\n</li></ul>
??
What should I be using or doing instead of this? Or am I just missing a small step? I did search around here, but maybe my keywords for the search weren't optimal to find a result that fit my situation...
Edit: Gonna mention, I'm writing this script in the footer of my page, well after the text I wish to replace, so it's not an issue of the script being written before what I want to overwrite to appear. :)
Currently you are using String.replace(substring, replacement) that will search for an exact match of the substring and replace it with the replacement e.g.
"Hello world".replace("world", "Kojichan") => "Hello Kojichan"
The limitation of a substring search is that it only finds exact, character-for-character matches.
To solve the problem, you'll have to start using regular expressions. When using regular expressions you have to be aware of:
special characters such as ?, /, and \ that need to be escaped as \?, \/, and \\
multiline mode /regexp/m, which makes ^ and $ match at line breaks
global matching /regexp/g if you want to replace more than one instance of the expression
quantifiers for allowing runs of whitespace: \s+ for one or more whitespace characters and \s* for zero or more whitespace characters.
To use regular expression instead of substring matching you just need to change String.replace("substring", "replacement") to String.replace(/regexp/, "replacement") e.g.
"Hello world".replace(/world/, "Kojichan") => "Hello Kojichan"
From MDN:
Note: If a <div>, <span>, or <noembed> node has a child text node that
includes the characters (&), (<), or (>), innerHTML returns these
characters as &amp;, &lt; and &gt; respectively. Use element.textContent
to get a correct copy of these text nodes' contents.
So since textContent (or innerText) won't get you the HTML, you'd have to modify your search string appropriately.
You can use Regular Expressions.
I recommend using a regular expression. Note that ? and / are special characters in regular expressions. To replace every occurrence you need the g flag; the m flag only changes how ^ and $ behave at line breaks, so it is needed only if your pattern uses those anchors.
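For the line-break case from the question, a pattern along these lines may work. This is only a sketch: the sample markup is made up, and the string a browser actually returns from innerHTML may differ from what is in the page source, so the pattern would need to be checked against the real value.

```javascript
// Hypothetical markup, as it might come back from innerHTML.
const haystackText = "<div>\n<ul><li>\n</li></ul>\n</div>";

// \s* tolerates any whitespace, including line breaks, between the tags.
// The forward slashes of the closing tags are escaped as \/.
const pattern = /<ul>\s*<li>\s*<\/li>\s*<\/ul>/g;

const replaced = haystackText.replace(
  pattern,
  "<ul><li>Some Other Thing Here</li></ul>"
);
console.log(replaced);
```

No m flag is needed here, because the pattern does not use ^ or $; the \s class already matches line breaks.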
Regular expression matching of HTML (other than plain text) that comes out of a web page is a bad idea and is troublesome to make work across browsers (particularly in IE). The HTML that comes out of a web page does not always look the same as what was put in, because some browsers reconstitute the HTML and don't actually store what went in. Attributes can change order, quote marks can change or disappear, entities can change, etc.
If you want to modify whole tags, then you should directly access the DOM and operate on the actual objects in the page.
