I'm writing a crawler in Node.js, both for study and to collect data for future use. I know how to crawl a single element on a page, but after a whole day's study I couldn't figure out how to get the values of a variable number of child elements.
Here is the part of the HTML that I want to crawl. Each 'attrgroup' has a different number of child elements:
<p class="attrgroup">
<span><b>4</b>BR / <b>1</b>Ba</span>
<span><b>1200</b>ft<sup>2</sup></span>
<span>duplex</span>
<span>laundry on site</span>
<span>street parking</span>
<br><span>cats are OK - purrr</span></p>
Here is my code
topics = topics.map(function (topicPair) {
var topicUrl = topicPair[0];
var topicHtml = topicPair[1];
var $ = cheerio.load(topicHtml);
return ({
//[1]I got correct value,such as duplex, using following clauses.
att1: $('.attrgroup').children().eq(0).text().trim(),
att2: $('.attrgroup').children().eq(1).text().trim(),
att3: $('.attrgroup').children().eq(2).text().trim(),
//[2] I want all of them, but the .each function doesn't return the correct data
atts: $('.attrgroup').children().each(function(){
$(this).text()
}),
});
});
I got result like this:
att1: '4BR / 1Ba',
att2: '1200ft2',
att3: 'duplex'
atts: { '0': [Object],
'1': [Object],
'2': [Object],
'3': [Object],
'4': [Object],
'5': [Object],
options: [Object],
_root: [Object],
length: 7,
prevObject: [Object] },
I suspect the cause is $(this), which is a jQuery object. I tried converting it to a DOM object, but that didn't work either.
Could anyone help me correct that part of my code, or tell me how to fix it? It doesn't have to use the each method; any method that works is welcome. Even a hint would help a lot. Thanks in advance!
Maybe something like this?
return (function () {
var object = {};
$('.attrgroup').children().each(function(i, element){
object["att" + i] = $(element).text().trim();
});
return object;
})();
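As for why the attempt in the question returned a collection instead of text: the value returned from the callback passed to .each is discarded, and .each hands back the collection it was called on. Plain arrays show the same split between forEach and map. The sketch below uses a hard-coded array as a stand-in for the span texts (an assumption, so that it runs without cheerio); with cheerio itself the analogous call would be .map(...) followed by .get().

```javascript
// Stand-in for the text of each <span>; hard-coded so the sketch runs without cheerio.
const spanTexts = [' 4BR / 1Ba ', '1200ft2', 'duplex', 'laundry on site'];

// forEach ignores whatever the callback returns, much like .each in the question.
const fromForEach = spanTexts.forEach(function (text) {
  return text.trim();
});

// map collects the callback's return values into a new array.
const fromMap = spanTexts.map(function (text) {
  return text.trim();
});

console.log(fromForEach); // undefined
console.log(fromMap);
```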
I am receiving a JSON file from an API call (see below). Now I would like to access the different keys of the JSON file to present the various values separately. However, I get an 'undefined' message in the console. Any ideas?
This is the code to call the API and process JSON file:
const input = req.body.var_1;
console.log('Request Query:' + JSON.stringify({ "user_input": input}));
const articles = [];
const sent = 'test';
const api_url = '...';
const options = {
method: 'POST',
body: JSON.stringify({"user_input": input}),
headers: {'Content-Type': 'application/json'}
}
const response = await fetch(api_url, options);
const results = await Promise.all([response.json()]);
console.log(results);
data = JSON.parse(JSON.stringify(results))
console.log(data);
console.log(data.Topic);
This is the console output I get:
Request Query:{"user_input":"xgboost"}
[
{
Topic: {
'0': 'random forest , Machine Learning and Wine Quality: Finding a good wine using multiple classifications',
'1': 'decision tree learning , Gradient Boost for Classification',
'2': 'feature selection , Understanding Multilabel Text Classification and the related process',
'3': 'decision tree learning , Gradient Boost for Regression',
'4': '<strong class="markup--strong markup--h3-strong">Understanding AdaBoost</strong> , Anyone starting to learn Boosting technique should start first with AdaBoost or…'
},
'URL/PDF': {
'0': 'https://dev.to//leading-edje/machine-learning-and-wine-quality-finding-a-good-wine-using-multiple-classifications-4kho',
'1': 'https://dev.to//xsabzal/gradient-boost-for-classification-2f15',
'2': 'https://dev.to//botreetechnologies/understanding-multilabel-text-classification-and-the-related-process-n66',
'3': 'https://dev.to//xsabzal/gradient-boost-for-regression-1e42',
'4': 'https://towardsdatascience.com/understanding-adaboost-2f94f22d5bfe'
},
Doc_Type: {
'0': 'article',
'1': 'article',
'2': 'article',
'3': 'article',
'4': 'article'
}
}
]
[
{
Topic: {
'0': 'random forest , Machine Learning and Wine Quality: Finding a good wine using multiple classifications',
'1': 'decision tree learning , Gradient Boost for Classification',
'2': 'feature selection , Understanding Multilabel Text Classification and the related process',
'3': 'decision tree learning , Gradient Boost for Regression',
'4': '<strong class="markup--strong markup--h3-strong">Understanding AdaBoost</strong> , Anyone starting to learn Boosting technique should start first with AdaBoost or…'
},
'URL/PDF': {
'0': 'https://dev.to//leading-edje/machine-learning-and-wine-quality-finding-a-good-wine-using-multiple-classifications-4kho',
'1': 'https://dev.to//xsabzal/gradient-boost-for-classification-2f15',
'2': 'https://dev.to//botreetechnologies/understanding-multilabel-text-classification-and-the-related-process-n66',
'3': 'https://dev.to//xsabzal/gradient-boost-for-regression-1e42',
'4': 'https://towardsdatascience.com/understanding-adaboost-2f94f22d5bfe'
},
Doc_Type: {
'0': 'article',
'1': 'article',
'2': 'article',
'3': 'article',
'4': 'article'
}
}
]
undefined
Happy for any help, thx
Note the [ ] in the output - results (or data, for that matter) is an array, so the Topic would be results[0].Topic (the array itself doesn't have a property Topic, only its element has one, hence the undefined result).
But: It shouldn't even be an array in the first place, it only is one because you make it one here:
const results = await Promise.all([response.json()]);
Promise.all will wait for an array of promises and return an array of their resolved values. You provide an array with one promise, so you will get back an array with one value, instead of getting only the value itself.
There is no point in using Promise.all with just one promise though, so you can get rid of that part and thereby also get rid of the array:
const results = await response.json()
console.log(results.Topic)
(As you can see, the whole JSON.parse(JSON.stringify(...)) shenanigans are superfluous as well.)
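To see the difference without calling the real API, here is a minimal sketch. fakeJson is a hypothetical stand-in for response.json(), and the object it resolves to is a cut-down version of the output in the question.

```javascript
// Hypothetical stand-in for response.json(): resolves to an object shaped like the API result.
function fakeJson() {
  return Promise.resolve({ Topic: { '0': 'random forest' } });
}

async function demo() {
  // Promise.all always resolves to an array, even when given a single promise.
  const wrapped = await Promise.all([fakeJson()]);
  // Awaiting the promise directly yields the value itself.
  const direct = await fakeJson();
  return {
    wrappedTopic: wrapped.Topic,       // the array itself has no Topic property
    firstTopic: wrapped[0].Topic['0'], // the first element does
    directTopic: direct.Topic['0'],    // no array in the way
  };
}

demo().then(console.log);
```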
It looks like you have the Object nested in an array so you have to access the first item in the array with data[0].Topic.
Given: Express app, AVA, supertest
When: I test generated html in response and the test case fails
Then: AVA displays the whole response object in the console which slows down analysis of the issue
Example of the test:
test('Positive case: book is found in /api/v1/books/1', async (t) => {
t.plan(2)
const response = await request(app).get('/api/v1/books/1')
t.is(response.status, 200)
const expectedBook = '<h3>Book 1</h3>'
t.truthy(response.res.text.match(expectedBook), 'Book title is not found')
})
Example of the output in the console
/express-project/src/books/index.test.js:22
21: const text = response.res.text
22: t.truthy(response.res.text.match(expectedBook), 'Book t…
23: })
Book title is not found
Value is not truthy:
null
response.res.text.match(expectedBook)
=> null
expectedBook
=> '<h3>Book 2</h3>'
response.res.text
=> '<!DOCTYPE html><html><head><title>BOOK</title><link rel="stylesheet" href="/stylesheets/style.css"></head><body><h1>BOOK</h1>
<h3>Book 1</h3><h4></h4></body></html>'
response.res
=> IncomingMessage {
_consuming: false,
_dumped: false,
_events: {
close: Function bound emit {},
data: [
Function {},
Function {},
Function bound emit {},
],
end: [
Function responseOnEnd {},
Function {},
Function bound emit {},
],
error: [
Function bound onceWrapper { … },
Function bound emit {},
],
},
_eventsCount: 4,
_maxListeners: undefined,
_readableState: ReadableState [
awaitDrain: 0,
.......... VERY LONG LIST WITH HUNDREDS OF LINES
SO HAVE TO SCROLL UP AND UP AND UP BEFORE YOU GET TO THE POINT
AVA tries to help with debugging the failed test, so it prints the value of each sub-expression in the assertion in turn:
response.res.text
response.res
response
which generates hundreds or even thousands of lines in the console.
So the solution is pretty simple - use an intermediate variable for the assertion
Instead of
t.truthy(response.res.text.match(expectedBook), 'Book title is not found')
use
const text = response.res.text
t.truthy(text.match(expectedBook), 'Book title is not found')
Continuing yesterday's saga, now I can retrieve json objects in a response but I can't extract the data from them.
The following node.js snippet is from the file "accounts.js" which is in an ETrade api library that exists in the path /lib. It returns json containing data about the accounts of the authenticated user. The authentication part is working great.
exports.listAccounts = function(successCallback,errorCallback)
{
var actionDescriptor = {
method : "GET",
module : "accounts",
action : "accountlist",
useJSON: true,
};
this._run(actionDescriptor,{},successCallback,errorCallback);
};
The ETrade website says this call will produce the following sample response:
{
"AccountListResponse": {
"Account": [
{
"accountDesc": "MyAccount-1",
"accountId": "83405188",
"marginLevel": "MARGIN",
"netAccountValue": "9999871.82",
"registrationType": "INDIVIDUAL"
},
{
"accountDesc": "MyAccount-3",
"accountId": "83405553",
"marginLevel": "CASH",
"netAccountValue": "100105468.99",
"registrationType": "INDIVIDUAL"
},
{
"accountDesc": "SIMPLE IRA",
"accountId": "83405188",
"marginLevel": "CASH",
"netAccountValue": "99794.13",
"registrationType": "IRA"
}
]
}
}
In my app.js file, I have the following:
var etrade = require('./lib/etrade');
var et = new etrade(configuration);
et.listAccounts(
function(res){
var listAccountsRes = res;
console.log('account list success!');
console.log(listAccountsRes)
},
function(error) {
console.log("Error encountered while attempting " +
"to retrieve account list: " +
error);
});
When I run this code, the console log shows the following message:
{ 'json.accountListResponse':
{ response:
[ [Object],
[ [Object],
[ [Object],
[ [Object],
[ [Object],
[ [Object],
[ [Object],
[ [Object] ] } }
Suppose in app.js I want to put the accounts data in a variable called myAccounts.
One of our members, Jack, solved yesterday's problem and when I commented that I still couldn't access the data in the response, he suggested this: "That property has a dot in it so you'll have to use [ ... ] rather than dot notation to access it. See what's inside the objects with a['json.accountListResponse'].response." So far I have not been able to get that to work, even when I use ['json.accountListResponse'].res like this:
var listAccountsRes = [json.accountListResponse].res;
This returns undefined when printed to the console.
Thanks to Adam for his suggestion which led to this which works:
var listAccountsRes = res['json.accountListResponse'];
var listAccounts = listAccountsRes['response'];
console.log('account list success!');
console.log(listAccounts)
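For anyone else hitting this, here is a small sketch of why the brackets and quotes matter. The res object is hypothetical: it assumes the shape shown in the console log above, with response holding an array, and it reuses two entries from the ETrade sample.

```javascript
// Hypothetical response; the shape is assumed from the console output in the question.
const res = {
  'json.accountListResponse': {
    response: [
      { accountDesc: 'MyAccount-1', marginLevel: 'MARGIN' },
      { accountDesc: 'SIMPLE IRA', marginLevel: 'CASH' },
    ],
  },
};

// The key contains a dot, so it has to be a quoted string inside brackets.
const accounts = res['json.accountListResponse'].response;

// Dot notation would look for a property named json first, which does not exist.
const viaDot = res.json;

console.log(accounts.length);
console.log(accounts[0].accountDesc);
console.log(viaDot);
```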
Now the console log reports almost exactly what ETrade says I should get. (They appear to have changed the name "Account" to "response"). I presume my variable listAccounts now contains the json with eight sample accounts in it that I can see in my console log. But I still don't know how to access individual elements. There should be some simple code that will iterate over the json file and produce an array of arrays that I could actually use for something. I tried accessing it like an array: console.log(listAccounts[0]) but that returns undefined. Do I need to stringify it or something?
I'm using node.js/express and I have a Mongodb to store some sets of data. On a webpage the user can enter, edit and delete data (which all works fine). For example, to add data I have the following code:
router.post('/addset', function(req,res) {
var db = req.db;
var collection = db.get('paramlist');
collection.insert(req.body, function(err, result){
res.send(
(err === null) ? { msg: '' } : { msg: err }
);
});
});
In my app.js file I include the lines
// Database
var mongo = require('mongodb');
var monk = require('monk');
var db = monk('localhost:27017/paramSet1');
as well as
app.use(function(req,res,next){
req.db = db;
next();
});
to make the database accessible in the rest of the code (following this tutorial: http://cwbuecheler.com/web/tutorials/2013/node-express-mongo/ , I'm a beginner with these things).
So all this works fine. My problem is the following: I would like to test whether a dataset with the same name is already in the database and show a message to the user. Following the answer to "How to query MongoDB to test if an item exists?", I tried using collection.find.limit(1).size() but I get the error
undefined is not a function
I tried the following things. In the code above (router.post) I tried adding, after the line var collection...
var findValue = collection.find({name:req.body.name});
If I then do console.log(findValue), I get a huge JSON output. I then tried console.log(findValue.next()) but I get the same error (undefined is not a function). I also tried
collection.find({name:req.body.name}).limit(1)
as well as
collection.find({name:req.body.name}).limit(1).size()
but also get this error. So in summary, collection.insert, collection.update and collection.remove all work, but find() does not. On the other hand, when I enter the mongo shell, the command works fine.
I would be grateful for any hints and ideas.
Edit:
The output to console.log(findValue) is:
{ col:
{ manager:
{ driver: [Object],
helper: [Object],
collections: [Object],
options: [Object],
_events: {} },
driver:
{ _construct_args: [],
_native: [Object],
_emitter: [Object],
_state: 2,
_connect_args: [Object] },
helper: { toObjectID: [Function], isObjectID: [Function], id: [Object] },
name: 'paramlist',
col:
{ _construct_args: [],
_native: [Object],
_emitter: [Object],
_state: 2,
_skin_db: [Object],
_collection_args: [Object],
id: [Object],
emitter: [Object] },
options: {} },
type: 'find',
opts: { fields: {}, safe: true },
domain: null,
_events: { error: [Function], success: [Function] },
_maxListeners: undefined,
emitted: {},
ended: false,
success: [Function],
error: [Function],
complete: [Function],
resolve: [Function],
fulfill: [Function],
reject: [Function],
query: { name: 'TestSet1' } }
find returns a cursor, not the matching documents themselves. But a better fit for your case would be to use findOne:
collection.findOne({name:req.body.name}, function(err, doc) {
if (doc) {
// A doc with the same name already exists
}
});
If you're following the tutorial at http://cwbuecheler.com/web/tutorials/2013/node-express-mongo/, pass the options as a second argument instead of chaining them. Change
collection.find({name:req.body.name}).limit(1)
to
collection.find({name:req.body.name},{limit:1})
To set more options, add them to the same object:
collection.find({name:req.body.name},{limit:1,project:{a:1},skip:1,max:10,maxScan:10,maxTimeMS:1000,min:100})
The available options are listed at http://mongodb.github.io/node-mongodb-native/2.0/api/Cursor.html
You can also call toArray on the result of find, like this:
app.get('/', (req, res) => {
db.collection('quotes').find().toArray((err, result) => {
console.log(result);
})
})
The first thing that looks wrong is your syntax is incorrect for find. You need to call it as a function. Try:
collection.find({name:req.body.name}).limit(1).size();
You may not have connected to the database.
Try adding the code below before running your queries:
mongodb.connect(dbURI).then((result)=>console.log('connected to DB')).catch((err)=>console.log('failed to connect to DB', err));
I've got an issue reading a nested array from JSON (BSON from MongoHQ) using Node and Angular.
JSON snippet: http://pastie.org/9305682. Specifically look for the edges array.
Mongoose model: http://pastie.org/9305685
Basically I call the character from the DB and then attempt to log it to the console with
console.log(char); before sending it back to the Angular call with res.json(char); 'char' is the character returned from the database, saved as my mongoose model.
Attempting to log the character to the console. I get everything looking normal except for the portions with the nested "effects" arrays. Anywhere they show up I receive the following:
edges:
[ { name: 'Super Hacker', notes: '', effects: [Object] },
{ name: 'Witty', notes: '', effects: [Object] },
{ name: 'Attractive', notes: '', effects: [Object] },
{ name: 'Encyclopedic Memory',
notes: 'Prereq: d8 Smarts',
effects: [Object] },
{ name: 'Daywalker', notes: '', effects: [Object] },
{ name: 'Tough', notes: '', effects: [Object] } ],
From here if I try to call it with:
From NodeJS - console.log(char[0].edges[0].effects[0].type); - Returns undefined.
From Angular View - {{cur_char.edges[0].effects[0].type}} - Displays nothing.
Thanks in advance for the help. Let me know if I can provide more info.
I think what you're asking is how to see more depth to the object in the console output. You can use util.inspect to print out more information:
console.log(util.inspect(char, { depth: 5 }));
By default util.inspect only goes to a depth of 2 which explains why you can only see the contents of the array (1) and the primitive properties of each element in the array (2).
See: http://nodejs.org/api/util.html#util_util_inspect_object_options
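A minimal sketch of the effect, assuming a made-up document with the same nesting as in the question (the fields inside effects are invented for illustration). Note that util has to be required first, and that passing depth: null removes the limit altogether.

```javascript
const util = require('util');

// Hypothetical character document; only the nesting mirrors the question.
const char = {
  name: 'Example',
  edges: [
    { name: 'Witty', notes: '', effects: [{ type: 'skill', value: 2 }] },
  ],
};

// With the default depth of 2, values nested deeper are collapsed to a
// placeholder such as [Object] or [Array].
const shallow = util.inspect(char);

// depth: null tells util.inspect to recurse without a limit.
const deep = util.inspect(char, { depth: null });

console.log(shallow);
console.log(deep);
```

If the nested values still come back undefined in code after they show up in the deep output, the problem is in the access path rather than in the logging.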