In JS, using Cheerio, for this block, where "*" is dynamic text (such as an ID number):
<a class="article-link*" href="https://www.somedomain.com">
How do I extract the URL?
Is it possible to use a wildcard to get information after part of an element's name? I tried:
$("[class = 'article-link']*")
//fails probably because the string is terminated prematurely
$("*[class = 'article-link]*")
//malformed attribute (obviously, but thought I'd give it a whack)
$("*[class = 'article-link*']")
//fails (again, obviously)
$("*[class = 'article-link\*']")
//I was trying to escape the asterisk, but I believe cheerio treats the escape character as part of the string because it's inside the [] - and I don't know if the wildcard can even be used this way
FYI: I can use a wildcard like this to get another element where the markup before a given attribute isn't the same (itemprop in this example), such as with different header tags coming before it:
var titleElem = $("*[itemprop = 'title']").get()
//gets [itemprop = 'title'] regardless of previous tag(s)
If the dynamic text is generated by JavaScript, then you won't be able to access it via cheerio, as cheerio is just a DOM parser.
If this is the case and you need to simulate browser action you could look into PhantomJS or Puppeteer.
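If the class is present in the static HTML, though, and only the suffix varies, cheerio's selector engine supports the standard CSS attribute selectors that match on a prefix or substring, which is what the wildcard attempts above were reaching for. A minimal sketch:
const cheerio = require('cheerio');

// [class^='...'] matches a prefix, [class*='...'] a substring, [class$='...'] a suffix
const html = `<a class="article-link42" href="https://www.somedomain.com">x</a>`;
const $ = cheerio.load(html);
const url = $("a[class^='article-link']").attr('href');
console.log(url); // => https://www.somedomain.com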
The problem with request is that it doesn't execute JavaScript, so it never sees client-rendered content. Try using a headless browser instead. Nightmare is a great one.
npm install nightmare --save
You make a call using a nightmare instance, then pass the HTML code to cheerio. Here is a sample:
const Nightmare = require('nightmare')
const nightmare = Nightmare({ show: true })
const cheerio = require('cheerio');

nightmare
  .goto(url)
  // do something in the chain to get to your desired page
  .evaluate(() => document.querySelector('body').outerHTML)
  .end() // shut the browser down once we have the HTML
  .then(function (html) {
    const $ = cheerio.load(html);
    // do something in cheerio, perhaps something like:
    let links = $("a[class^='article-link']").map(function (i, element) {
      return $(this).attr('href');
    }).toArray();
    console.log(links); // => [link1, link2, ...]
  })
  .catch(function (error) {
    console.error('Error:', error);
  });
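Note that .evaluate() runs inside the page in the headless browser, so it can see the JavaScript-rendered DOM; the string it returns is plain static HTML, which cheerio can then parse as usual.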
The way I accessed it was this:
const cheerio = require('cheerio');
const $ = cheerio.load(html);
// article is the element directly above this link, .list-wrapper the div before that, a is this element
const rows = $('.list-wrapper article a');
// .attr reads an element's attribute
const url = rows.attr('href').trim();
I had other elements to grab from this wrapper, or I would've done it on a single line:
const url = $('.list-wrapper article a').attr('href').trim();
My school blocked CTRL + U, but you can use 'view-source:' before a link to view the code. It takes a while, so I've been trying to make a script to automatically redirect to the source code. However, I keep getting errors because it is not a link.
I have tried the following:
var code = fetch(`view-source:https://${location.hostname}${location.pathname}`);
location.href = (code);
and
var code = (`view-source:https://${location.hostname}${location.pathname}`);
location.href = (code);
In the first one, I see a bad request, and in the second, I see a blank page with the words "view-source:" followed by the link.
view-source: isn't a real protocol you can fetch().
However, just
// (inside an async function, or anywhere top-level await is allowed)
var resp = await fetch('http://...');
var text = await resp.text();
document.body.textContent = text;
should replace the current document's body with the text contents of that URL...
If you try to fetch the source code from the frontend, you will run into CORS problems. But you can use a proxy, as in the example below:
fetch('https://api.codetabs.com/v1/proxy?quest=https://stackoverflow.com/questions/75440023/script-to-get-source-code-of-website-js#75440023')
  .then((response) => response.text())
  .then((text) => console.log(text));
I am quite new to Node.js and Express, but I am trying to build a website which serves static files. After some research I've found out that Node.js with Express can be quite useful for this.
So far I managed to serve some static html files which are located on my server, but now I want to do something else:
I have an URL to an html page, and in that html page, there is a table with some information.
I want to extract a couple of specific values from it, then 1) save them as JSON in a file, and 2) write those values into an HTML page. I've tried playing with jQuery, but so far I've been unsuccessful.
This is what I have so far:
1. A node app running on port 8081, which I will access from anywhere through an NGINX reverse proxy (I already have nginx set up and it works).
2. I can get the URL and serve it as HTML when I use the proper URI.
3. I see that the table doesn't have an ID, only the "details" class associated with it. Also, I am only interested in getting these rows:
<div class='group'>
  <table class='details'>
    <tr>
      <th>Status:</th>
      <td>
        With editors
      </td>
    </tr>
From what I've seen so far, jQuery would work fine if the table had an ID.
This is my code in app.js
var express = require('express');
var app = express();
var request = require('request');
const path = require('path');

var content;

app.use('/', function (req, res, next) {
  var status = 'It works';
  console.log('This is very %s', status);
  //console.log(content);
  next();
});

request(
  {
    uri: 'https://authors.aps.org/Submissions/status?utf8=%E2%9C%93&accode=CH10674&author=Poenaru&commit=Submit'
  },
  function (error, response, body) {
    content = body;
  }
);

app.get('/', function (req, res) {
  console.log('Got a GET request for the homepage');
  res.sendFile(path.join(__dirname, '/', 'index.html'));
});

app.get('/url', function (req, res) {
  console.log('You requested table data!!!');
  // TODO: send only the values of that table instead of the whole HTML page
  res.send(content);
});

var server = app.listen(8081, function () {
  var host = server.address().address;
  var port = server.address().port;
  console.log('Node-App listening at http://%s:%s', host, port);
});
Basically, the HTML content of that URL is saved into the content variable, and now I would like to keep only the table from it, and output just that part to the new HTML page.
Any ideas?
Thank you in advance :)
OK, so I've come across this package called cheerio, which basically allows one to use jQuery on the server. Having the HTML code from that specific URL, I could search that table for the elements I need. Cheerio is quite straightforward, and with this code I got the results I needed:
var cheerio = require('cheerio');

request(
  'https://authors.aps.org/Submissions/status?utf8=%E2%9C%93&accode=CH10674&author=Poenaru&commit=Submit',
  (error, res, html) => {
    if (!error && res.statusCode === 200) {
      const $ = cheerio.load(html);
      const details = $('.details');
      const articleInfo = details.find('th').eq(0);
      const articleStatus = details.find('th').next().eq(0);
      //console.log(details.html());
      console.log(articleInfo.html());
      console.log(articleStatus.html());
    }
  }
);
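To cover the save-it-as-JSON half of the original goal, here is a sketch that could sit inside that same callback (the key names are just placeholders):
const fs = require('fs');

// Hypothetical shape: adapt the keys to whatever the table actually holds
const data = {
  label: articleInfo.text().trim(),
  status: articleStatus.text().trim()
};
fs.writeFileSync('status.json', JSON.stringify(data, null, 2));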
Thank you @O.Jones and @AvcS for guiding me to jsdom and node-html-parser. I will definitely play with those in the near future :)
Cheers!
Your task is called "scraping." You want to scrape a particular chunk of data from some web page you did not create and then return it as part of your own web page.
You have noticed a problem with scraping: often the page you're scraping does not cleanly identify the data you want with a distinctive id, so you must use some guesswork to find it. @AvcS pointed out a server-side npm library called jsdom you can use for this purpose.
Notice this: Even though browsers and nodejs both use Javascript, they are still very different environments. Browser Javascript has lots of built-in APIs to access web pages' Document Object Models (DOMs). But nodejs doesn't have those APIs. If you try to load jQuery into node.js, it won't work, because it depends on browser DOM APIs. The jsdom package gives you some of those DOM APIs.
Once you have fetched that web page to scrape, code like this may help you get what you need.
const jsdom = require("jsdom");
const { JSDOM } = jsdom;
...
const page = new JSDOM(page_in_text_string).window;
Then you can use a subset of the DOM APIs to find the elements you want in your page. In your example, you are looking for elements matching the selector div.group table.details: the table with class details inside the div with class group, and it's that div.group element you ultimately want.
You can do this sort of thing to find what you need:
const desiredTbl = page.document.querySelector("div.group table.details");
const desiredDiv = desiredTbl ? desiredTbl.parentNode : null;
const result = desiredDiv ? desiredDiv.textContent : null;
Finally do this:
page.close();
Your question says you want certain rows from your document. HTML documents don't have rows; they have elements. If you want to extract just parts of elements (part of the table rather than the whole thing), you'll need some text-string code. Just sayin'.
Also, I have not debugged any of this. That is left to you.
There's a smaller, faster library that does similar things called node-html-parser. If performance is important, you may want that one instead.
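For comparison, a rough node-html-parser version of the same lookup (using the same assumed selector as above):
const { parse } = require('node-html-parser');

// page_in_text_string is the fetched HTML, as in the jsdom example
const root = parse(page_in_text_string);
const tbl = root.querySelector('div.group table.details');
console.log(tbl ? tbl.parentNode.text : null);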
I am writing an application in node.js that will navigate to a website, click a button on the website, and then extract certain pieces of data from the website. All is going well except for the button-clicking aspect. I cannot seem to simulate a button click. I'm extremely new at this, so I'd appreciate any suggestions y'all have! Sadly I've scoured the internet looking for a solution to this issue and have been unable to find one.
I have used .click() and .bind('click', ...) in a .js file that uses 'request' and 'cheerio'.
I have also tried using page.click() and page.evaluate() in a different .js file that uses 'chrome-launcher', 'chrome-remote-interface', and 'puppeteer'.
Here is my code for the 'request' and 'cheerio' file:
const request = require('request');
const cheerio = require('cheerio');

let p1 = {}, p2 = {}, p3 = {}, p4 = {}, p5 = {};
p1.name = 'TheJackal666';
p2.name = 'Naether Raviel';
p3.name = 'qman37';
p4.name = 'ranger51';
p5.name = 'fernanda12x';
const team = { 1: p1, 2: p2, 3: p3, 4: p4, 5: p5 };

for (var x in team) {
  let url = 'https://na.op.gg/summoner/userName=' + team[x].name;
  request(url, (error, response, html) => {
    if (!error && response.statusCode == 200) {
      const $ = cheerio.load(html);
      $('.SummonerRefreshButton.Button.SemiRound.Blue').click();
      //FIXME: MAKE A FUNCTION THAT SUCCESSFULLY "CLICKS" UPDATE BUTTON
      team[x].overallWR = $('.winratio');
      team[x].overallWR = team[x].overallWR.text().match(/\d/g);
      team[x].overallWR = team[x].overallWR.join("");
      console.log(team[x].overallWR);
    }
  });
}
I expect to successfully click the update button on any of the pages (there is a section on each page that says when it was last updated) without getting an error. As it is, I either get the error
"$(...).click is not a function"
or (if I incorporate that line into an outer function) I get no error, but also no result.
See the documentation:
Cheerio is not a web browser
Cheerio parses markup and provides an API for traversing/manipulating the resulting data structure. It does not interpret the result as a web browser does. Specifically, it does not produce a visual rendering, apply CSS, load external resources, or execute JavaScript. If your use case requires any of this functionality, you should consider projects like PhantomJS or JSDom.
Cheerio is an HTML parser.
Cheerio can be used to select and manipulate DOM elements, but it is not a full browser.
Cheerio only has access to the original source DOM, which means that if the DOM of a webpage is manipulated by JavaScript, Cheerio will not notice that change.
Cheerio cannot be used to interact with DOM elements (à la jQuery) because it does not execute within a browser window.
As of this moment, if you need to manipulate or select against JS-rendered HTML, your best option is puppeteer. This is likely to change, though.
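As a rough illustration, here is a puppeteer sketch of the click-then-scrape flow from the question (selectors copied from the question's code, not verified against the live site):
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://na.op.gg/summoner/userName=TheJackal666');

  // Really click the button, inside a real (headless) browser
  await page.click('.SummonerRefreshButton.Button.SemiRound.Blue');

  // Read the win ratio after the click; a waitForSelector may be
  // needed in practice if the page updates asynchronously
  const winRatio = await page.$eval('.winratio', el => el.textContent);
  console.log(winRatio);

  await browser.close();
})();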
HTH
Hi, I am trying to get an iframe from watchfree.to.
Here is the code I am using for this purpose:
const request = require('request');
const cheerio = require('cheerio');

function getSecondBody(url2) {
  url2 = 'http://www.watchfree.to/watch-366-Forrest-Gump-movie-online-free-putlocker.html';
  request(url2, function (err, resp, body) {
    var $ = cheerio.load(body);
    var embedcode = $('.links_left_container');
    var embedcodetext2 = embedcode.html();
    console.log(embedcodetext2);
    return embedcodetext2;
  });
}
But the response returned doesn't contain the iframe that I need. Comparing the response I receive with the actual page (screenshots omitted here), only the iframe part is missing from my response.
view-source:http://www.watchfree.to/watch-366-Forrest-Gump-movie-online-free-putlocker.html
Please take a look at the source of the page: there is no iframe in it, just generic content, so when you request the raw HTML the iframe never appears in the DOM; it is injected later by JavaScript.
You probably need to crawl this line instead if you want the openload embed link:
var locations = ["http:\/\/streamin.to\/embed-k2msgp8yhhd8-580x326.html","http:\/\/streamin.to\/embed-cbp8zxw3yo3f-580x326.html","http:\/\/embed.nowvideo.sx\/embed.php?v=2729070a4365a","http:\/\/thevideo.me\/embed-wm815ejm5uvb-580x326.html","https:\/\/estream.to\/embed-6u49dntawgr0.html","http:\/\/thevideo.me\/embed-n14mh2bjj2nz-580x326.html","http:\/\/vidtodo.com\/embed-bmjw5u1e18js.html","http:\/\/vidtodo.com\/embed-z8614uxy1824.html","https:\/\/openload.co\/embed\/xTLPXjSGz7o\/","https:\/\/openload.co\/embed\/F8gJS5Y1o1o\/"];
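Since that array is just a JavaScript literal in the page source, here is a rough sketch of pulling it out of the raw HTML with a regex (this assumes the `var locations = [...]` pattern stays stable):
const request = require('request');

request(url2, function (err, resp, body) {
  // The array literal happens to be valid JSON, escaped slashes and all
  const match = body.match(/var locations = (\[[^\]]*\]);?/);
  if (match) {
    const locations = JSON.parse(match[1]);
    console.log(locations); // => ["http://streamin.to/embed-...", ...]
  }
});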
I know that this could be a very stupid question, but since I'm totally new to JavaScript, I'm not sure how to do this. I want to write a script and run it through node on my laptop, and in this script I want to interact with a web page in order to use functions like document.getElementById and stuff like that.
In Python you could do this by using something like Beautiful Soup or requests, but how do you do it in JavaScript?
I have implemented a crawler using cheerio and request-promise as follows:
https://www.npmjs.com/package/cheerio
let request = require('request-promise');
let cheerio = require('cheerio');
request = request.defaults({
  transform: function (body) {
    return cheerio.load(body);
  }
});
// ... omitted
request({ uri: 'http://example.org' })
  .then($ => {
    const element = $('.element-with-class');
  });
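The transform option in request-promise runs on every response body before the promise resolves, so each request yields a ready-to-use cheerio $ function instead of raw HTML, which keeps the scraping code free of repeated cheerio.load calls.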