css selector code to scrape/parse data from tricky website

I'm having difficulty writing an adequate CSS selector to scrape the odds from the following HTML. I am relatively new to Node.js, but I've successfully scraped similar websites in the past. Unfortunately this website is a little more tricky (for me anyway!). I can tell the problem must be the CSS selector I'm using. Could you please show me how to write code that scrapes the odds 11/2 from:
<div class="market"..............>
<header class=..........>
<div class="market-content">
<div class="selection">
<div class="selection name" data-bind="html:selection.getTitle()"> Aston Villa </div>
<selection-button params="....>
<div>
<div class="odds-button"..........>
<span class="price">
<span class="odds-convert"......> 11/2 </span>

Hard to say with just that snippet of HTML, but for that (using jQuery):
$('.odds-button > .price > .odds-convert').text();
Of course it could be that selector matches somewhere else too, then you'd have to make it more specific by including a longer path. However, making it too specific from the get go makes it too brittle if the structure of the HTML changes.

I don't know exactly how you're scraping the content in Node.js, which libraries or techniques you're using, but this is how I'd do it client-side:
var oddsElement = document.querySelector([
  ".market",
  ".market-content",
  ".selection",
  // obviously, change below to match your data attribute;
  // the name div is closed before <selection-button>, so step to that sibling with "~"
  ".selection.name[data-SOME_KEY=\"SOME_VALUE\"] ~ selection-button",
  ".odds-button",
  "span.price",
  "span.odds-convert"
].join(" "));
// querySelector returns null when nothing matches, and typeof null is "object",
// so compare against null explicitly
if (oddsElement !== null) {
  var odds = (oddsElement.textContent || oddsElement.innerText).trim();
  // or could use "let" keyword in strict-mode Node.js for block-scope
} else { // no match
  console.warn("Odds cannot be found.");
}
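The same lookup can be wrapped in a small function that takes the root to search as a parameter, which makes the no-match case explicit. This is only a sketch: readOdds is a hypothetical name, and the selector assumes the structure shown in the question's snippet.

```javascript
// Hypothetical helper: returns the trimmed odds text, or null when nothing matches.
// `root` is anything exposing querySelector (document or an element);
// the selector assumes the markup from the question.
function readOdds(root) {
  var element = root.querySelector(".odds-button > .price > .odds-convert");
  if (element === null) {
    return null; // querySelector signals "no match" with null
  }
  return element.textContent.trim();
}

// In a browser console:
if (typeof document !== "undefined") {
  console.log(readOdds(document));
}
```

Passing the root in, rather than reaching for the global document, also lets the function be exercised with a stub object when no browser is available.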

Related

Inject html after review widget loads for schema markup

My webstore uses Kudobuzz for product reviews, but our e-commerce platform (PDG) isn't supported for SEO markup data.
This widget does not support schema markup on its own, so I want to somehow select the relevant pieces and inject the schema markup into the various divs/spans that make up the widget. One problem is figuring out how to inject code that Google can parse, and another is figuring out how to write the actual selectors for this super bloated widget.
Here is a CodePen of the widget and some markup data that is already on the site: http://codepen.io/anon/pen/GpddpO
Here is a link to a product page if you want to see how everything works: https://www.asseenontvhot10.com/product/2835/Professional-Leather--Vinyl-Repair-Kit
This is (roughly) the markup I'm trying to add if it helps:
<div itemscope itemtype="http://schema.org/Review">
<div itemprop="reviewBody">Blah Blah it works 5 star</div>
<div itemprop="author" itemscope itemtype="http://schema.org/Person">
Written by: <span itemprop="name">Author</span></div>
<div itemprop="itemReviewed" itemscope itemtype="http://schema.org/Thing">
<span itemprop="name">Stop Snore</span></div>
<div><meta itemprop="datePublished" content="2015-10-07">Date published: 10/07/2015</div>
<div itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
<meta itemprop="worstRating" content="1"><span itemprop="ratingValue">5</span> / <span itemprop="bestRating">5</span> stars</div>
</div>

Theoretically you could write a very small amount of microdata using the CSS :before and :after pseudo-elements with the content property, but every symbol would need to be escaped as a hexadecimal character code, e.g.
#name:before { content: "\003cspan\2002itemprop=\0022name\0022\003e"; }
#name:after { content: "\003c/span\003e"; }
Even spaces need to be substituted with \2002 or an equivalent whitespace code.
This should wrap the following microdata around the element with the id name:
<span itemprop="name">...</span>
Bear in mind that browsers render generated content as text rather than parsing it into the DOM, so test whether a crawler actually picks the markup up.
Clearly this can only work if the widget gives you clear ids or class names for the elements it adds, and it may be useless unless you know the type of object reviewed first (e.g. Book, Movie), since this needs to go at the start in the example I gave (which is incomplete). The code would need to be nested correctly, so if you want further help, can you edit your question with example HTML for a completed review?
Writing your own JSON-LD script at the top of the page is another option. That would be a different question (if you get stuck), and the markup would not be embedded within the data itself.
Edit
It's a good idea to test the CSS in a separate environment first, e.g. set up a jsfiddle.
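The JSON-LD option mentioned in this answer could look roughly like the sketch below. It builds the same Review data as the microdata in the question and, when run in a browser, appends it to the page as a script element. buildReviewJsonLd and the shape of its argument are hypothetical, and whether a given crawler reads script-injected markup is an assumption worth checking with a structured data testing tool.

```javascript
// Hypothetical helper: maps plain review values onto a schema.org Review object.
function buildReviewJsonLd(review) {
  return {
    "@context": "http://schema.org",
    "@type": "Review",
    reviewBody: review.body,
    author: { "@type": "Person", name: review.author },
    itemReviewed: { "@type": "Thing", name: review.item },
    datePublished: review.date,
    reviewRating: {
      "@type": "Rating",
      worstRating: 1,
      ratingValue: review.rating,
      bestRating: 5
    }
  };
}

// Sample values copied from the microdata in the question.
var jsonLdText = JSON.stringify(buildReviewJsonLd({
  body: "Blah Blah it works 5 star",
  author: "Author",
  item: "Stop Snore",
  date: "2015-10-07",
  rating: 5
}));

// In a browser, attach the data as a JSON-LD script element.
if (typeof document !== "undefined") {
  var script = document.createElement("script");
  script.type = "application/ld+json";
  script.textContent = jsonLdText;
  document.head.appendChild(script);
}
```

In practice the values would be read from the rendered widget with selectors instead of being hard-coded.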

Are there any syntax highlighting plugins that will allow you to embed an ignorable html element into a snippet?

I am trying to make dynamic code examples for our API that can be constructed from input HTML elements.
A pared-down example looks like this: I give the user an input to name the device they would like to create.
<input class="observable-input" data-key="deviceName" type="text" value="deviceKey" />
I would then like that input to update code examples (replacing the device name in the example with the one the user inputs).
<code lang="python">
device = { "name": "<span data-observeKey="deviceName">Name</span>" }
client.createDevicewrite(device)
</code>
I have all of the code set up for observing a change in the input and updating the code examples, and this works great. All of the syntax highlighters I have looked at chop the snippet up and re-render the example wrapped in their own HTML (for styling). Is there an option or configurable way to get a syntax highlighter not to strip these tags, or is there a different approach I should be looking at for preserving the syntax highlighting while still supporting dynamic updates, without having to do a full text search of each snippet's rendered tags?
Here is example output from Pygments (the syntax highlighter I'm currently using):
<li>
<div class="line">
<span class="n">device</span>
<span class="o">=</span>
<span class="n">{</span>
<span class="s">"name"</span>
<span class="p">:</span>
<span class="s">"Name"</span>
<span class="n">}</span>
</div>
</li>

I decided to just go with a brute-force approach, and it ended up being decently performant. I'll leave my code here in case anyone is interested in what I did:
https://gist.github.com/selecsosi/5d41dae843b9dea4888f
Since I use Backbone, Lodash, and jQuery as my base app frameworks, the gist uses those. I have a manager that pushes updates from inputs to spans on the page, which I use to dynamically update the code examples.
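For comparison, here is a different technique from the one in the gist, sketched under stated assumptions: write a sentinel such as __deviceName__ in the snippet source, let the highlighter run, then swap each sentinel in the highlighted HTML for an observable span. injectObservableSpans, the __key__ sentinel format, and the defaults object are all hypothetical, and the sketch assumes the highlighter leaves the sentinel intact inside a single token.

```javascript
// Escape text before placing it inside generated HTML.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical helper: replaces __key__ sentinels in already-highlighted HTML
// with spans that an input observer can update later.
function injectObservableSpans(highlightedHtml, defaults) {
  return highlightedHtml.replace(/__([A-Za-z0-9]+)__/g, function (match, key) {
    if (!Object.prototype.hasOwnProperty.call(defaults, key)) {
      return match; // unknown sentinel: leave it untouched
    }
    return '<span data-observeKey="' + key + '">' + escapeHtml(defaults[key]) + "</span>";
  });
}

// A string token as a highlighter might emit it for: "__deviceName__"
var highlighted = '<span class="s">"__deviceName__"</span>';
var result = injectObservableSpans(highlighted, { deviceName: "Name" });
```

Because the span is injected after highlighting, the highlighter never sees it and has nothing to strip.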

a few questions regarding innerHTML

I am converting an old programmer's joke program created here in Brazil that is similar to MIT's SCIgen but uses artistic jargon instead of business gibberish.
As the program is far too old (GeoCities-era old), it of course uses lots of document.write instead of innerHTML.
First question: is it safe to place tons of code inside innerHTML? The original program loads 4 sets of arrays with every possible piece of text that can be combined to form a pseudo-essay. This is a piece of the code:
new_window.document.write("<body bgcolor=\"#000000\">");
new_window.document.write("<body text=\"#00FF00\">")
new_window.document.write("<p align=\"center\"><b>"+atitle+"</b><hr></p>");
firstshot = 1;
paragraph = 0;
while(lines > 0) {
if (firstshot == 1) {
if (lines % 101 == 0 && lines % 19 == 0) {
new_window.document.write(tab0.chr(1,0)+tab0.chr(0,1)+tab3.chr(0,0)
.....
...
This continues in an inch-long, non-nested chunk of code; the entire code is here: http://jsfiddle.net/jmqdx09g/
I'm experimenting and this is what I got so far:
<body>
<div id="target"></div>
<div id = "myDiv"></div>
<span id = "mySpan"></span>
<br>
<button id="restore">restore</button>
<p> </p>
<form name="form1" method="post" action="">
<input type="submit" name="remove" id="remove" value="remove">
</form>
<p> </p>
</body>
<script type = "text/javascript">
var message =
'<li>Home</li>'+
'<li>About</li>'+
'<li>Contact</li>'+
'<li>Works</li>'+
' <li>Projects</li>'+
'<li>Curriculum</li>'
var message2 =
'<div class="content">'+
'<iframe src="/yourpage.html" frameborder="0" width="600" height="650" scrolling="no">'+
'<p>Your browser does not support iframes.</p>'+
'</iframe></div>'
document.getElementById("myDiv").innerHTML = message; // use innerHTML for block and inline HTML elements
document.getElementById("remove").addEventListener("click", function ()
{
document.getElementById("myDiv").innerHTML = message2;
});
document.getElementById("restore").addEventListener("click", function ()
{
document.getElementById("myDiv").innerHTML = message;
});
</script>
And it works as expected: it loads some HTML, and at the press of a button that content is replaced by an iframe.
Is the iframe the best solution for this, or is replacing the entire HTML with JS the way to go?
The
var message =
'somecode'+
'somecode'+
pattern looks safe so far, but as I get further into the conversion, am I going to have headaches, or is this method as straightforward as it looks?
Should I use window.onload instead of replacing the content holder div?

My two cents worth...
Is it safe to place like tons of code inside of innerHTMLs?
Safe, yes... Easily maintainable, no. Front end code is for the client so if they choose to hack themselves let them... Of course anything that is sent back to the server should be sanitised and not trusted, but that is a completely different issue.
In my opinion the greatest problem is maintainability.
Next, the JS: refactor it into a separate file and start caching variables, which makes the code easier to read.
Finally, do you need the iframe? Or a new window? Couldn't you simply append the "artistic jargon" to the bottom of the current HTML, thus saving the headache of the iframes?
I am a complete advocate of non-jQuery code, but for you jQuery's HTML editing API might be a great idea. It could help to abstract some issues into a more readable and maintainable form. Then again, vanilla JS is really awesome, and if it can be done that way it's a great way to learn.
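As a rough illustration of the maintainability point, the generated text can be collected in an array and written to the page in one assignment, instead of one document.write call per fragment. buildEssayHtml and escapeHtml are hypothetical helpers and the sample strings are placeholders; the real generator's arrays and rules are not reproduced here.

```javascript
// Escape text so generated jargon cannot break the surrounding markup.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical helper: builds the whole essay as one HTML string.
function buildEssayHtml(title, paragraphs) {
  var parts = ["<h2>" + escapeHtml(title) + "</h2>"];
  paragraphs.forEach(function (paragraph) {
    parts.push("<p>" + escapeHtml(paragraph) + "</p>");
  });
  return parts.join("\n");
}

// In the browser, a single assignment fills the placeholder.
// "target" is the id of the empty div in the question's test page.
if (typeof document !== "undefined") {
  var target = document.getElementById("target");
  if (target) {
    target.innerHTML = buildEssayHtml("Sample title", ["First paragraph.", "Second paragraph."]);
  }
}
```

Keeping the string building in a plain function also makes it easy to move into its own file and test without a browser.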

jQuery highlight pieces of text in an element across tags

I want to select and return searched text using jQuery.
The problem is; parts of the text may be located in <span> or other inline elements, so when searching for 'waffles are tasty' in this text: 'I'm not sure about <i>cabbages</i>, but <b>waffles</b> <span>are</span> <i>tasty</i>, indeed.', you wouldn't get any matches, while the text appears uninterrupted to people.
Let's use this HTML as an example:
<div id="parent">
<span style="font-size: 1.2em">
I
</span>
like turtles
<span>
quite a
</span>
lot, actually.
<span>
there's loads of
</span>
tortoises over there, OMG
<div id="child">
<span style="font-size: 1.2em">
I
</span>
like turtles
<span>
quite a
</span>
lot, actually.
TURTLES!
</div>
</div>
With this (or similar) JavaScript:
$('div#parent').selectText({query: ['i like', 'turtles', 'loads of tortoises'], caseinsensitive: true}).each(function () {
$(this).css('background-color', '#ffff00');
});
//The (hypothetical) SelectText function would return an array of wrapped elements to chain .each(); on them
You would want to produce this output: (without the comments, obviously)
<div id="parent">
<span style="font-size: 1.2em">
<span class="selected" style="background-color: #ffff00">
I
</span> <!--Wrap in 2 separate selection spans so the original hierarchy is disturbed less (as opposed to wrapping 'I' and 'like' in a single selection span)-->
</span>
<span class="selected" style="background-color: #ffff00">
like
</span>
<span class="selected" style="background-color: #ffff00"> <!--Simple match, because the search query is just the word 'turtles'-->
turtles
</span>
<span>
quite a
</span>
lot, actually.
<span>
there's
<span class="selected" style="background-color: #ffff00">
loads of
</span> <!--Selection span needs to be closed here because of HTML tag order-->
</span>
<span class="selected" style="background-color: #ffff00"> <!--Capture the rest of the found text with a second selection span-->
tortoises
</span>
over there, OMG
<div id="child"> <!--This element's children are not searched because it's not a span-->
<span style="font-size: 1.2em">
I
</span>
like turtles
<span>
quite a
</span>
lot, actually.
TURTLES!
</div>
</div>
The (hypothetical) SelectText function would wrap all selected text in <span class="selected"> tags, regardless of whether parts of the search are located in other inline elements such as <span>. It does not search the child <div>'s contents because that's not an inline element.
Is there a jQuery plugin that does something like this? (wrap search query in span tags and return them, oblivious to whether parts of the found text may be located in other inline elements?)
If not, how would one go about creating such a function? This function's kinda what I'm looking for, but it doesn't return the array of selected spans and breaks when parts of the found text are nested in other inline elements.
Any help would be greatly appreciated!

Piece of cake! See this.
Folded notation:
$.each(
$(...).findText(...),
function (){
...
}
);
In-line notation:
$(...).findText(...).each(function (){
...
}
);

Three options:
1. Use the browser's built-in methods for this. For the finding, IE has TextRange with its findText() method; other browsers (with the exception of Opera, last time I checked, which was a long time ago) have window.find(). However, window.find() may be killed off without being replaced at some point, which is not ideal. For the highlighting, you can use document.execCommand().
2. Use my Rangy library. There's a demo here: http://rangy.googlecode.com/svn/trunk/demos/textrange.html
3. Build your own code to search text content in the DOM and style it.
The first two options are covered in more detail in this answer:
https://stackoverflow.com/a/5887719/96100

Since I just so happened to be working on a similar thing right now, in case you'd like to see the beginnings of my interpretation of "option 3", I thought I'd share this, with the main feature being that all text nodes are traversed, without altering existing tags. Not tested across any unusual browsers yet, so no warranty whatsoever with this one!
function DOMComb2(oParent) {
  var oNode = oParent.firstChild;
  while (oNode) {
    // Remember the next sibling first: once oNode is moved into the highlight
    // span, its nextSibling no longer points into oParent
    var oNext = oNode.nextSibling;
    if (oNode.nodeType === 3) { // text node
      if (oNode.nodeValue) { // Add regexp to search the text here
        var highlight = document.createElement('span');
        highlight.className = 'selected';
        oParent.replaceChild(highlight, oNode);
        highlight.appendChild(oNode);
        // Or, the shorter, uglier jQuery hybrid one-liner
        // oParent.replaceChild($('<span class="selected">' + oNode.nodeValue + '</span>')[0], oNode);
      }
    } else if (oNode.nodeType === 1 && oNode.tagName !== 'DIV') { // Add any other element you want to avoid
      DOMComb2(oNode);
    }
    oNode = oNext;
  }
}
Then search through things selectively with jQuery perhaps:
$('aside').each(function(){
DOMComb2($(this)[0]);
});
Of course, if you have asides within your asides, strange things might happen.
(DOMComb function adapted from the Mozilla dev reference site
https://developer.mozilla.org/en-US/docs/Web/API/Node)
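To match a phrase that spans several text nodes, one possible approach (a sketch, not part of the function above) is to join the text of consecutive text nodes, search the joined string, and map the match back to an offset range in each node. locatePhrase is a hypothetical name, and the sketch assumes whitespace has already been normalized and that lowercasing does not change the string length.

```javascript
// Hypothetical helper: nodeTexts holds the text of consecutive text nodes in
// document order. Returns, for the first occurrence of phrase, the slice of
// each node that the match covers.
function locatePhrase(nodeTexts, phrase, caseInsensitive) {
  var joined = nodeTexts.join("");
  var haystack = caseInsensitive ? joined.toLowerCase() : joined;
  var needle = caseInsensitive ? phrase.toLowerCase() : phrase;
  if (needle.length === 0) {
    return [];
  }
  var matchStart = haystack.indexOf(needle);
  if (matchStart === -1) {
    return [];
  }
  var matchEnd = matchStart + needle.length;
  var slices = [];
  var offset = 0; // position of the current node within the joined string
  nodeTexts.forEach(function (text, index) {
    var nodeStart = offset;
    var nodeEnd = offset + text.length;
    var from = Math.max(matchStart, nodeStart);
    var to = Math.min(matchEnd, nodeEnd);
    if (from < to) { // the match overlaps this node
      slices.push({ node: index, start: from - nodeStart, end: to - nodeStart });
    }
    offset = nodeEnd;
  });
  return slices;
}

var slices = locatePhrase(["I", " like turtles ", "quite a"], "i like", true);
```

Each returned slice can then be wrapped in its own span (for example by splitting the text node at start and end), which keeps the original element hierarchy intact.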

I wrote a draft as a fiddle. The main steps:
I made a plugin for jQuery
$.fn.selectText = function(params){
  var phrases = params.query,
      ignorance = params.ignorecase,
      wrapper = $(this);
  . . .
  return(wrapper);
};
Now I can call the selection as $(...).selectText({query:["tortoise"], ignorecase: true, style: 'selection'});
I know you want to have an iterator after the function call, but in your case that is impossible, because the iterator has to return valid jQuery selectors. For example:
word <tag>word word</tag> word is not a valid selector.
After sanitizing the content of the wrapper, makeRegexp() builds a dedicated regular expression for each search phrase.
Each matched piece of the HTML source goes to emulateSelection() and then wrapWords().
The main idea is to wrap each piece of the phrase that is not separated by tags in <span class="selection">...</span>, without analyzing the whole tree of nodes.
NOTE:
It does NOT work with <b><i>... tags in the HTML. You have to adjust the regexp string for that.
I can't guarantee it will work with Unicode. But who knows...

As I understand it, we are talking about iterators like $.each($(...).searchText("..."),function (str){...});.
Consider the David Herbert Lawrence poem:
<div class="poem"><p class="part">I never saw a wild thing<br />
sorry for itself.<br />
A small bird will drop frozen dead from a bough<br />
without ever having felt sorry for itself.<br /></p></div>
Actually, after rendering, the browser will interpret it like this:
<div class="poem">
<p class="part">
<br>I never saw a wild thing</br>
<br>sorry for itself.</br>
<br>A small bird will drop frozen dead from a bough</br>
<br>without ever having felt sorry for itself.</br>
</p>
</div>
For example, I am looking for the phrase "wild thing sorry for". Therefore, I have to highlight the excerpt:
wild thing</br><br>sorry for
I cannot wrap it like this: <span>wild thing</br><br>sorry for</span>, then create a jQuery selector from some temporary id="search-xxxxxx" and return it, because that is invalid HTML. I can wrap each piece of text like this:
<span search="search-xxxxx">wild thing</span></br><br><span search="search-xxxxx">sorry for</span>
Then I have to call some function and return a jQuery array of selectors:
return($("[search=search-xxxxx]"));
Now we have two "results": a) "wild thing"; b) "sorry for". Is that really what you want?
OR
You have to write your own each() function as another jQuery plugin:
$.fn.eachSearch = function(arr, func){
...
};
where arr will be not an array of selectors but an array of arrays of selectors, like:
arr = [
[selector for the whole match],
[selector for the first part of the match, selector for the second part of the match],
...
]

Is using custom HTML tags and replacing custom tags with outerHTML okay?

Is it alright to define and use custom tags (that will not conflict with future HTML tags) while replacing/rendering them by changing outerHTML?
I created a demo below and it seems to work fine
<!DOCTYPE HTML>
<html lang="en-US">
<head>
<script type="text/javascript" src="jquery.min.js"></script>
</head>
<body>
<div id="customtags">
<c-TextField name="Username" ></c-TextField> <br/>
<c-NameField name="name" id="c-NameField"></c-NameField> <br/>
<c-TextArea name="description" ></c-TextArea> <br/>
<blahblah c-t="d"></blahblah>
</div>
</body>
<script>
/* Code below to replace the cspa's with the actual html -- woaah it works well */
function ReplaceCustomTags() {
// cspa is a random term-- doesn;t mean anything really
var grs = $("*");
$.each(grs, function(index, value) {
var tg = value.tagName.toLowerCase();
if(tg.indexOf("c-")==0) {
console.log(index);
console.log(value);
var obj = $(value);
var newhtml;
if(tg=="c-textfield") {
newhtml= '<input type="text" value="'+obj.attr('name')+'"></input>';
} else if(tg=="c-namefield") {
newhtml= '<input type="text" value="FirstName"></input><input type="text" value="LastName"></input>';
} else if(tg=="c-textarea") {
newhtml= '<textarea cols="20" rows="3">Some description from model</textarea>';
}
value.outerHTML = newhtml; // value is the DOM element itself; jQuery 3 removed the .context property
}
z = obj;
});
}
if(typeof(console)=='undefined' || console==null) { console={}; console.log=function(){}}
$(document).ready(ReplaceCustomTags);
</script>
</html>
Update to the question:
Let me explain a bit further. Please assume that JavaScript is enabled in the browser, i.e. the application is not supposed to run without JavaScript.
I have seen libraries that use custom attributes to define custom behavior in specified tags. For example, Angular.js heavily uses custom attributes. (It also has examples of custom tags.) Although my question is not from a technical strategy perspective, I fail to understand why it would strategically cause problems in the scalability/maintainability of the code.
To me, code like <ns:contact .....> is more readable than something like <div custom_type="contact" ....>. The only difference is that custom tags are ignored and not rendered, while the div gets rendered by the browser.
Angular.js does show a custom-tag example (pane/tab). In my example above I am using outerHTML to replace these custom tags, whereas I do not see such code in the libraries. Am I doing something shortsighted and wrong by using outerHTML to replace custom tags?

I can't think of a reason why you'd want to do this.
What would you think if you had to work on a project written by someone else who ignored all common practices and conventions? What would happen if they were no longer at the company to find out why they did something a certain way?
The fact that you have to just go through with JavaScript to make it work at all should be a giant red flag. Unless you have a VERY good reason to, do yourself a favor and use the preexisting tags. Six months from now, are you going to remember why you did things that way?

It may well work, but it's probably not a good idea. Screen readers and search engines may have a hard/impossible time reading your page, since they may not interpret the JavaScript. While I can see the point, it's probably better to use this template to develop with, then "bake" it to HTML before putting it on the server.
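The "bake it to HTML" suggestion could be sketched as a build-time string transform like the one below. This is a rough illustration only: bakeCustomTags, parseAttributes, and the renderers table are hypothetical, the replacement markup mirrors the question's demo, and the regular expression assumes empty, non-nested custom elements with double-quoted attributes. A real HTML parser would be more robust.

```javascript
// Hypothetical renderers keyed by lower-cased tag name, mirroring the demo.
var renderers = {
  "c-textfield": function (attrs) {
    return '<input type="text" value="' + (attrs.name || "") + '">';
  },
  "c-namefield": function () {
    return '<input type="text" value="FirstName"><input type="text" value="LastName">';
  },
  "c-textarea": function () {
    return '<textarea cols="20" rows="3">Some description from model</textarea>';
  }
};

// Collects name="value" pairs from the text between the tag name and ">".
function parseAttributes(source) {
  var attrs = {};
  var pattern = /([\w-]+)="([^"]*)"/g;
  var found;
  while ((found = pattern.exec(source)) !== null) {
    attrs[found[1]] = found[2];
  }
  return attrs;
}

// Replaces <c-Name ...></c-Name> pairs with plain HTML; unknown tags are kept.
function bakeCustomTags(html) {
  return html.replace(/<(c-[\w-]+)([^>]*)><\/\1>/gi, function (whole, tag, attrText) {
    var render = renderers[tag.toLowerCase()];
    return render ? render(parseAttributes(attrText)) : whole;
  });
}

var baked = bakeCustomTags('<c-TextField name="Username" ></c-TextField> <br/>');
```

Run once before deployment, this leaves only standard elements in the served page, so nothing depends on client-side JavaScript to render the form.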
