Problem with regexp in userscript for chrome - javascript

This might be a noob question, but I have searched here and on other sites and still haven't found an answer, at least not one I understand well enough to fix the problem.
This is used in a userscript for chrome.
I'm trying to select a date from a string. The string is the innerHTML from a tag that I have managed to select. The html structure, and also the string, is something like this: (the div is the selected tag so everything within is the content of the string)
<div id="the_selected_tag">
link
" 2011-02-18 23:02"
thing
</div>
If you have a solution that helps me select the date without this fuzz, it would also be great.
The javascript:
var pattern = /\"\s[\d\s:-]*\"/i;
var tag = document.querySelector('div.the_selected_tag');
var date_str = tag.innerHTML.match(pattern)[0]
When I use this script as ordinary javascript on a html document to test it, it works perfectly, but when I install it as a userscript in chrome, it doesn't find the pattern.
I can't figure out how to get around this problem.

Dump innerHTML into console. If it looks fine then start building regexp from more generic (/\d+/) to more specific ones and output everything into a console. There is a bunch of different quote characters in different encodings, many different types of dashes.
[\d\s:-]* is not a very good choice because it would match " 1", " ". I would rather write something as specific as possible:
/" \d{4}-\d{2}-\d{2} \d{2}:\d{2}"/
(Also document.querySelector('div.the_selected_tag') would return null on your sample but you probably wanted to write class instead of id)
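A minimal sketch of that advice, assuming the text really contains plain ASCII hyphens and colons (which is exactly what dumping it to the console first would confirm). The helper name extractDate is made up for illustration, and reading textContent rather than innerHTML is my own assumption, chosen so the pattern is not matched against markup:

```javascript
// Hypothetical helper: pulls the first "YYYY-MM-DD HH:MM" timestamp out of a string.
// The surrounding quotes are deliberately left out of the pattern, so it works
// whether or not the page text actually contains them.
function extractDate(text) {
  const pattern = /\d{4}-\d{2}-\d{2} \d{2}:\d{2}/;
  const match = text.match(pattern);
  // match() returns null when nothing matches, so guard before indexing.
  return match ? match[0] : null;
}

// In the userscript (assuming the element is selected by id, as in the sample markup):
// const tag = document.querySelector('#the_selected_tag');
// console.log(JSON.stringify(tag.textContent)); // makes hidden or unusual characters visible
// const dateStr = extractDate(tag.textContent);

console.log(extractDate('link\n" 2011-02-18 23:02"\nthing')); // 2011-02-18 23:02
console.log(extractDate('no date here'));                      // null
```

Returning null instead of indexing straight into the match result also avoids the TypeError the original one-liner throws when the pattern is not found.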

It's much more likely that tag.innerHTML doesn't contain what you think it contains.

Related

Hyperlink href incorrectly quoted in innerHTML?

Take this very simple example HTML:
<html>
<body>This is okay &amp; fine, but the encoding of this <a href="http://example.com?a=1&b=2">link</a> seems wrong.</body>
</html>
On examining document.body.innerHTML (e.g. in the browser's JS console, in JS itself, etc.), this is the value I see:
This is okay &amp; fine, but the encoding of this <a href="http://example.com?a=1&amp;b=2">link</a> seems wrong.
This behaviour is the same across browsers but I can't understand it, it seems wrong.
Specifically, the link in the original document is to http://example.com?a=1&b=2, whereas if the value of innerHTML is treated as HTML then it links to http://example.com?a=1&amp;b=2 which is NOT the same (e.g. if I created a new document, which actually had innerHTML as its inner HTML, and I clicked on the link then the browser would be sent to a materially different URL as far as I can see).
(EDIT #3: I'm wrong about the above. Firstly, yes, those two URLs are different; but secondly, the innerHTML which I thought was wrong is right, and it correctly represents the first URL, not the second! See the end of my own answer below.)
This is different from the issue discussed in question innerHTML gives me & as &amp; !. In my case (which is the opposite to the case in that question) the original HTML is correct and it looks to me as if it is the innerHTML which is wrong (i.e. because it is HTML which does not represent what the original HTML represented).
(EDIT #2: I was wrong about this, too: it's not really different. But I think it is not widely known that &amp; is the correct way to represent & inside an href, not just within body text. Once you realise that, then you can see that these are the same issue really.)
Can anyone explain this?
(EDIT #1+4: This only occurred to me a bit late, after writing my original question, but: "is &amp; actually correct within the href text, and & technically incorrect?" As I said when I first wrote those words, that "seems very unlikely! I've certainly never seen HTML written that way." But however 'unlikely', or not, that is the case, and is the main part of what I wasn't understanding!)
Also related and would be useful, can anyone explain how to cleanly get HTML which does correctly represent the target of document links? You definitely can't just un-encode all HTML character references within innerHTML, because (as shown in the example I've used, and also as discussed in innerHTML gives me & as &amp; !) the ones in the main run of text should be encoded, and just un-encoding everything would make these wrong.
I originally thought this was not a duplicate of innerHTML gives me & as &amp; ! (as discussed above; and in a way it still isn't, if it's agreed that it's not as obvious or widely known that the same issues apply inside href as in body text). It's still definitely not a duplicate of A href in innerHTML (which somewhat unclearly asks about how to set innerHTML using JS).
Most browser tools don't show the actual HTML because it wouldn't be of much help:
HTML is often generated dynamically after page load with the help of CSS and JavaScript.
HTML is often broken and the browser needs to repair it in order to generate the memory representation needed for rendering and other stuff.
So the HTML you see is not the actual source; it's generated on the fly from the current state of the document, which of course includes all the fixes applied (in your case, to the invalid HTML entities).
The following example hopefully illustrates all the combinations:
const section = document.querySelector("section");
const invalid = document.createElement("p");
invalid.innerHTML = '<a href="http://example.com/?a=1&b=2">Invalid HTML (dynamic)</a>';
const valid = document.createElement("p");
valid.innerHTML = '<a href="http://example.com/?a=1&amp;b=2">Valid HTML (dynamic)</a>';
section.appendChild(valid);
section.appendChild(invalid);
const paragraphs = document.querySelectorAll("p");
for (const p of paragraphs) {
  console.log(p.innerHTML);
}
const links = document.querySelectorAll("a");
for (const a of links) {
  console.log(a.getAttribute("href"));
}
<section>
<p><a href="http://example.com/?a=1&b=2">Invalid HTML (static)</a></p>
<p><a href="http://example.com/?a=1&amp;b=2">Valid HTML (static)</a></p>
</section>
Is &amp; actually correct within the href text, and & technically incorrect? It seems very unlikely! I've certainly never seen HTML written that way.
There's no such thing as "technically correct", let alone today when HTML is pretty well standardised. (Well, yes, there're two competing standards bodies and specs are continuously evolving, but the basics were set up long ago.)
The & symbol starts a character entity and &b is an invalid character entity. Period.
But it works! Doesn't that mean it's technically correct?
It works because browsers are explicitly designed to deal with completely broken markup, what's known as tag soup, because it was thought that it would ease usage:
<p><strong>Hello, World!</u>
<body><br itspartytime="yeah">
<pink>It works!!!</red>
But HTML entities are just an encoding artefact. That doesn't mean that URLs are not allowed to contain literal ampersands; it just means that, when in HTML context, they need to be represented as &amp;. It's the same as when you type a backslash in a JavaScript string to escape some quotes: the backslash does not become part of your data.
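To make the "encoding artefact" point concrete, here is a small sketch that treats the href as data once the HTML layer is removed. The single replace() call is only a simplified stand-in for what the HTML parser does with character references, and the variable names are made up; URL is the standard global available in browsers and Node.js:

```javascript
// The same href as it is written in HTML source, and as the browser sees it
// after decoding the character reference.
const asWrittenInHtml = 'http://example.com/?a=1&amp;b=2';
const afterHtmlDecoding = asWrittenInHtml.replace(/&amp;/g, '&'); // simplified stand-in for the parser

// Parse both strings as URLs and list the query parameter names.
const decodedParams = [...new URL(afterHtmlDecoding).searchParams.keys()];
const undecodedParams = [...new URL(asWrittenInHtml).searchParams.keys()];

console.log(afterHtmlDecoding);          // http://example.com/?a=1&b=2
console.log(decodedParams.join(', '));   // a, b
console.log(undecodedParams.join(', ')); // a, amp;b
```

The last line is what you would get if the encoded text were mistaken for the URL itself: a parameter literally named "amp;b". That is why getAttribute("href") returns the decoded value while innerHTML shows the encoded one.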
Having thought up a possible (but I thought 'unlikely') explanation - which I put in as an edit in the original question - I've realised that it is the answer:
Using & to represent & inside an href is technically incorrect, and &amp; is technically correct
I gathered this initially from this SO answer https://stackoverflow.com/a/16168585/795690, and I think it is relevant that (as it also says in that answer) the idea that &amp; is the correct way to represent & in an href is not as widely understood as the idea that &amp; is the correct way to represent & in body text.
Once you do understand this, it makes sense that what the browser is doing is right, and that the innerHTML value which comes back represents the link correctly.
EDIT:
@ÁlvaroGonzález gives a much longer answer, and it took me a while to see how everything he says applies, so I thought I'd try to explain what I didn't understand starting from where I started from, in case it helps someone else!
If you start with raw HTML with <a href="http://example.com/?a=1&b=1"> and then you inspect the DOM in the browser, or look at the value of the href attribute in JS then you see "http://example.com/?a=1&b=1" everywhere. So it looks as if nothing has changed, and nothing was wrong. What I didn't understand is that actually the browser has parsed a technically incorrect href (with invalid entities) to be able to display this to you! (Yes, LOTS of people use this 'broken' format!)
To see this first hand, load this longer HTML example into your browser:
<html>
<body style="font-family: sans-serif">
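<p>Now & then <a href="http://example.com/?a=1&b=2">http://example.com/?a=1&b=2</a></p>
<p>Now &amp; then <a href="http://example.com/?a=1&amp;b=2">http://example.com/?a=1&amp;b=2</a></p>
<p>Now &amp;amp; then <a href="http://example.com/?a=1&amp;amp;b=2">http://example.com/?a=1&amp;amp;b=2</a></p>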
</body>
</html>
Then, in your JavaScript console, try running this code taken from @ÁlvaroGonzález's answer:
const paragraphs = document.querySelectorAll("p");
for (const p of paragraphs) {
  console.log(p.innerHTML);
}
const links = document.querySelectorAll("a");
for (const a of links) {
  console.log(a.getAttribute("href"));
}
Also try clicking on the links to see where they go.
Once you've made sense of everything that you see there, it is no longer surprising how innerHTML works!

Scan dom/webpage after certain pattern and get domtag

I'm writing a small Chrome extension which includes adding buttons at specific positions.
These positions are mostly random and can't be determined with normal css/jQuery selectors.
I need to scan the whole page for a certain text pattern (regex).
After I found matches I need to get the dom tag where the text is in.
I tried parsing the whole source with body.innerHTML, but I can't get the tag object afterwards.
Any ideas on how to accomplish such a task are highly appreciated!
Sounds like you could use :contains() for this.
$(":contains('Your Text')")
For finding text using a regular expression use .filter()
var regex = new RegExp("Your Text");
$("*").filter(function () {
return regex.test($(this).text());
});
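One caveat with the $("*").filter(...) approach: .text() includes the text of all descendants, so every ancestor of a match (html, body, wrapper divs) passes the filter too. Below is a sketch without jQuery that returns only the elements that directly contain a matching text node. The name findTextParents is made up, and the function relies on nothing but childNodes, nodeType and nodeValue:

```javascript
// Hypothetical helper: collects the elements whose own text nodes match the regex.
function findTextParents(node, regex, found = []) {
  for (const child of node.childNodes) {
    if (child.nodeType === 3) {            // 3 = text node
      regex.lastIndex = 0;                 // keeps regexes with the g flag from carrying state
      if (regex.test(child.nodeValue) && !found.includes(node)) {
        found.push(node);
      }
    } else if (child.nodeType === 1) {     // 1 = element node, so recurse into it
      findTextParents(child, regex, found);
    }
  }
  return found;
}

// In the extension's content script:
// const targets = findTextParents(document.body, /Your Text/);
// targets.forEach(function (el) { /* insert a button next to el */ });
```

Text that is split across several text nodes (for example by an inline tag in the middle of the phrase) will not be found this way, so treat it as a starting point rather than a complete solution.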

Javascript regex to replace ampersand in all links href on a page

I've been trying to find an answer to this question that fits my need, but either I'm too much of a noob to make other use cases work, or they're not specific enough for my case.
Basically I want to use javascript/jQuery to replace any and all ampersands (&) on a web page that may occur in a link's href with just the word "and". I've tried a couple of different versions of this with no luck:
var link = $("a").attr('href');
link.replace(/&/g, "and");
Thank you
Your current code reads the href of the first matched element and calls replace() on that string, but replace() returns a new string which is then discarded, so no element in the DOM is updated.
You can instead achieve what you need by providing a function to attr() which will be executed against all elements in the matched set. Try this:
$("a").attr('href', function(i, value) {
return value.replace(/&/g, "and");
});
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
link
link
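For comparison, roughly the same thing without jQuery, as a sketch. The helper name replaceAmpersandsInHrefs is made up; it accepts any iterable of elements, so the selector stays at the call site:

```javascript
// Hypothetical helper: rewrites the href attribute of every element it is given.
function replaceAmpersandsInHrefs(links) {
  for (const link of links) {
    const href = link.getAttribute('href');
    if (href === null) continue; // element without an href attribute
    // replace() returns a new string; it has to be written back explicitly.
    link.setAttribute('href', href.replace(/&/g, 'and'));
  }
}

// In the page:
// replaceAmpersandsInHrefs(document.querySelectorAll('a[href]'));
```

Using getAttribute/setAttribute works on the attribute exactly as written in the document, whereas the link.href property returns the fully resolved URL.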
Sometimes when replacing &, I've found that even though I replaced &, I still have amp;. There is a fix to this:
var newUrl = "@Model.UrlToRedirect".replace(/&/gi, '%').replace(/%amp;/gi, '&');
With this solution you replace & twice and it will work. In my particular problem in an MVC app, window.location.href = @Model.UrlToRedirect, the url was already partially encoded and had a query string. I tried encoding/decoding, using Uri as the C# class, escape(), everything before coming up with this solution. The problem with using my above logic is other things could blow up the query string later. One solution is to put a hidden field or input on the form like this:
<input type="hidden" value="@Model.UrlToRedirect" id="url-redirect" />
then in your javascript:
window.location.href = document.getElementById("url-redirect").value;
in this way, javascript won't take the c# string and change it.
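If the only problem is that the value arrives with its ampersands HTML-encoded, decoding those references directly may be less fragile than the two-step replace. This is only a sketch: decodeAmpersands is a made-up name and the URLs are invented sample values.

```javascript
// Hypothetical helper: undo HTML encoding of ampersands in a URL string.
function decodeAmpersands(url) {
  return url.replace(/&amp;/gi, '&');
}

const encoded = '/orders?id=7&amp;page=2'; // made-up example value
const plain = '/orders?id=7&page=2';       // same URL, not encoded

console.log(decodeAmpersands(encoded)); // /orders?id=7&page=2
console.log(decodeAmpersands(plain));   // /orders?id=7&page=2 (left alone)

// For comparison, the two-step replace turns a bare & into %:
console.log(plain.replace(/&/gi, '%').replace(/%amp;/gi, '&')); // /orders?id=7%page=2
```

It only handles &amp;, so other character references in the value would still need the hidden-field approach or a real HTML decoder.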

Extract data from url with JavaScript

EDIT_2: I forgot to specify that it's for an Android app, so I don't think this is of any use; I made a new post instead :( Added the Android tag.
EDIT: I'm making an Android app.
I need help extracting a number from a URL whose content is generated by JavaScript!
Site is:
http://www.oddsportal.com/sure-bets/
And the path looks like this:
<span class="logos l60"> </span>
<div class="odds-nowrp" xodd="xzoxfxzox">2.62</div> // <- 2.62 is the number I need
For full path see this screenshot:
Which library would do this best? (I know Jsoup can't do it.) I have looked at a few, like:
HtmlUnit
Java Script Engine
Apache Commons BSF
Rhino
But I can't really make sense of them or find any examples for android which look like my problem
or find any examples for android which look like my problem
You need it for android?
Pretty much any library allowing DOM traversing will allow you to do this providing you know how to find your value.
is this value exactly at the same position in DOM every time?
is it wrapped by an easy to identify element? i.e. with a static ID
are there any other value that look alike in the DOM that you don't want?
Based on that, using JQuery for example, you could select it like this :
$('.table-main td.center > a[href^="/bookmaker"] + div[xodd]')
or this:
$('.table-main tr:nth-child(3) div.odds-nowrp[xodd]')
Use jQuery:
var number = $(".odds-nowrp").text();
You can just use a regex if you already have the page content as an escaped string (url below):
var reg = /[A-z\"\>\<=?()0-9 \/]*(\d+\.\d+)[A-z\"\>\<=?()0-9 \/]*/
reg.exec(url)[1] // this will return your number
If it's already rendered and the xodd value doesn't change, you could do something like this:
document.querySelectorAll('.odds-nowrp[xodd=xzoxfxzox]')[0].innerText
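If all you have is the markup as a string, a narrower pattern anchored on the div from the question is easier to reason about than a broad character class. This is a sketch that assumes the class attribute looks exactly as in the snippet above; the name extractOdds is made up, and parsing HTML with a regex stays fragile if the page layout changes:

```javascript
// Hypothetical helper: returns every number found inside <div class="odds-nowrp" ...>.
function extractOdds(html) {
  const pattern = /<div class="odds-nowrp"[^>]*>\s*(\d+(?:\.\d+)?)\s*<\/div>/g;
  // matchAll yields one match per div; the first capture group holds the number.
  return Array.from(html.matchAll(pattern), (m) => Number(m[1]));
}

const sample =
  '<span class="logos l60"> </span>\n' +
  '<div class="odds-nowrp" xodd="xzoxfxzox">2.62</div>';

console.log(extractOdds(sample).join(', ')); // 2.62
```

Returning an array keeps it usable when a row contains several odds, which the single exec() call would not give you.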

Javascript Bookmarklet Unresponsive

Javascript newb here. Creating a bookmarklet to automate a simple task at work. Mostly a learning exercise. It will scan a transcript on CNN.com, for instance: (http://transcripts.cnn.com/TRANSCRIPTS/1302/28/acd.01.html). It will grab the lead stories at the top of the page, the name and title of the guests on the show, and format them so that they can be copy pasted into another document.
I've come up with a simple version that includes some jQuery that grabs the subheading and then uses a regular expression to find the names of the guests (it will also exclude everything between (begin videoclip) and (end videoclip), but I haven't gotten that far yet). It then alerts them (it will eventually print them in a pop-up window; alert is just for troubleshooting purposes).
I'm using http://benalman.com/code/test/jquery-run-code-bookmarklet/ to create the bookmarklet. My problem is that once the bookmarklet is created it is completely unresponsive. Click on it and nothing happens. I've tried minimizing the code first with no result. My guess is that cnn.com's javascript is conflicting with mine but I'm not sure how to get around that. Or do I need to include some code to load and store the text on the current page? Here's the code (I've included comments, but I took these out when I used the bookmarklet generator.) Thanks for any help!
//Grabs the subheading
var leadStories=$(".cnnTransSubHead").text();
//Scans the webpage for guest name and title. Includes a regular expression to find any
//string that starts with a capital letter, includes a comma, and ends in a colon.
var scanForGuests=/[A-Z ].+,[A-Z0-9 ].+:/g;
//Joins the array created by scanForGuests with a semicolon instead of a comma
var guests=scanForGuests.join(‘; ‘);
//Creates an alert in the proper format including stories and guests.
alert(“Lead Stories: “ + leadStories + “. ” + guests + “. SEE TRANSCRIPT FIELD FOR FULL TRANSCRIPT.“)
Go to the page. Open up developer tools (ctrl+shift+j in chrome) and paste your code in the console to see what's wrong.
The $ in var leadStories = $(".cnnTransSubHead").text(); is from jQuery and the link provided does not have jQuery loaded into the page.
On any modern browser you should be able to achieve the same results without jQuery:
var leadStories = Array.from(document.getElementsByClassName('cnnTransSubHead'))
.map(function(el) { return el.innerText; });
next we have:
var scanForGuests=/[A-Z ].+,[A-Z0-9 ].+:/g;
var guests=scanForGuests.join('; ');
scanForGuests IS a regular expression, you never actually matched it to anything - so .join() is going to throw an error. I'm not exactly sure what you're trying to do. Are you trying to scan the full text of the page for that regex? In that case something like this would be your best bet
document.body.innerText.match(scanForGuests);
keep in mind that while innerText removes html markup, it's far from perfect and what pops up in it is very much at the mercy of how the page's html is structured. That said, on my quick test it seems to work.
Finally, for something like this you should use an immediately invoked function or you're sticking all your variables into the global context.
So putting it all together you get something like this:
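(function() {
  // getElementsByClassName returns an HTMLCollection, which has no map(); convert it first
  var leadStories = Array.from(document.getElementsByClassName('cnnTransSubHead'))
    .map(function(el) { return el.innerText; });
  var scanForGuests = /[A-Z ].+,[A-Z0-9 ].+:/g;
  // match() returns null when nothing matches, so fall back to an empty array
  var guests = (document.body.innerText.match(scanForGuests) || []).join("; ");
  alert("Leads: " + leadStories + " Guests: " + guests);
})();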
