How to crawling using Node.js - javascript

I can't believe that I'm asking an obvious question, but I still get the wrong in console log.
Console shows crawl like "[]" in the site, but I've checked at least 10 times for typos. Anyways, here's the javascript code.
I want to crawl in the site.
This is the kangnam.js file :
const axios = require('axios');
const cheerio = require('cheerio');
const log = console.log;
const getHTML = async () => {
try {
return await axios.get('https://web.kangnam.ac.kr', {
headers: {
Accept: 'text/html'
}
});
} catch (error) {
console.log(error);
}
};
getHTML()
.then(html => {
let ulList = [];
const $ = cheerio.load(html.data);
const $allNotices = $("ul.tab_listl div.list_txt");
$allNotices.each(function(idx, element) {
ulList[idx] = {
title : $(this).find("list_txt title").text(),
url : $(this).find("list_txt a").attr('href')
};
});
const data = ulList.filter(n => n.title);
return data;
}). then(res => log(res));
I've checked and revised at least 10 times
Yet, Js still throws this result :
root#goorm:/workspace/web_platform_test/myapp/kangnamCrawling(master)# node kangnam.js
[]

Mate, I think the issue is you're parsing it incorrectly.
$allNotices.each(function(idx, element) {
ulList[idx] = {
title : $(this).find("list_txt title").text(),
url : $(this).find("list_txt a").attr('href')
};
});
The data that you're trying to parse for is located within the first index of the $(this) array, which is really just storing a DOM Node. As to why the DOM stores Nodes this way, it's most likely due to efficiency and effectiveness. But all the data that you're looking for is contained within this Node object. However, the find() is superficial and only checks the indexes of an array for the conditions you supplied, which is a string search. The $(this) array only contains a Node, not a string, so when you you call .find() for a string, it will always return undefined.
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/find
You need to first access the initial index and do property accessors on the Node. You also don't need to use $(this) since you're already given the same exact data with the element parameter. It's also more efficient to just use element since you've already been given the data you need to work with.
$allNotices.each(function(idx, element) {
ulList[idx] = {
title : element.children[0].attribs.title,
url : element.children[0].attribs.href
};
});
This should now populate your data array correctly. You should always analyze the data structures you're parsing for since that's the only way you can correctly parse them.
Anyways, I hope I solved your problem!

Related

Approach to selecting a document

I am using Couchbase in a node app. Every time I insert a document, I am using a random UUID.
It inserts fine and I could retrieve data based on this id.
But in reality, I actually want to search by a key called url in the document. To be able to get or update or delete a document.
I could possibly add the url as the id I suppose but that is not what I see in any database concepts. Ids are not urls
or any unique names. They are typically random numbers or incremental numbers.
How could I approach this so that I can use a random UUID as id but be able to search by url?
Cos lets say the id was 56475-asdf-7856, I am not going to know this value to search for right.
Whereas if the id was https://www.example.com I know about this url and searching for it would give me what I want.
Is it a good idea making the url the id.
This is in a node app using Couchbase.
databaseRouter.put('/update/:id', (req, res) => {
updateDocument(req)
.then(({ document, error }) => {
if (error) {
res.status(404).send(error);
}
res.json(document);
})
.catch(error => res.status(500).send(error));
});
export const updateDocument = async (req) => {
try {
const result = await collection.get(req.params.id); // Feels like id should be the way to do this, but doesn't make sense cos I won't know the id beforehand.
document.url = req.body.url || document.url;
await collection.replace(req.params.id, document);
return { document };
} catch (error) {
return { error };
}
};
I think it's okay to use URLs as IDs, especially if that's the primary way you're going to lookup documents, and you don't need to change the URL later. Yes, often times IDs are numbers or UUIDs, but there is no reason you have to be restricted to this.
However, another approach you can take is to use a SQL query (SQL++, technically, since this is a JSON database).
Something like:
SELECT d.*
FROM mybucket.myscope.mydocuments d
WHERE d.url = 'http://example.com/foo/baz/bar'
You'll also need an index with that, something like:
CREATE INDEX ix_url ON mybucket.myscope.mydocuments (url)
I'd recommend checking out the docs for writing a SQL++ query (sometimes still known as "N1QL") with Node.js: https://docs.couchbase.com/nodejs-sdk/current/howtos/n1ql-queries-with-sdk.html
Here's the first example in the docs:
async function queryPlaceholders() {
const query = `
SELECT airportname, city FROM \`travel-sample\`.inventory.airport
WHERE city=$1
`;
const options = { parameters: ['San Jose'] }
try {
let result = await cluster.query(query, options)
console.log("Result:", result)
return result
} catch (error) {
console.error('Query failed: ', error)
}
}

How do I create an array from Promise results?

I'm using React to build a web app. At one point I have a list of ids, and I want to use those to retrieve a list of items from a database, get a list of metrics from each one, and then push those metrics to an array. My code so far is:
useEffect(() => {
const newMetrics = [];
currentItems.forEach((item) => {
const url = `items/listmetrics/${item.id}`;
Client.getData(url).then(async (metrics) => {
let promises = metrics.map((metricId: string) => {
// Get metric info
const urlMetric = `metrics/${metricId}`;
return Client.getData(urlMetric);
});
await Promise.all(promises).then((metrics: Array<any>) => {
metrics.forEach((metric: MetricModel) => {
const metricItem = {
id: metric.id,
metricName: metric.name
};
newMetrics.push(metricItem);
}
});
});
});
setMetrics(newMetrics);
});
}, [currentItems]);
where "metrics" is a state variable, set by setMetrics.
This appears to work ok, but when I try to access the resulting metrics array, it seems to be in the wrong format. If I try to read the value of metrics[0], it says it's undefined (although I know there are several items in metrics). Looking at it in the console, metrics looks like this:
However, normally the console shows arrays like this (this is a different variable, I'm just showing how it's listed with (2) [{...},{...}], whereas the one I've created shows as []):
I'm not confident with using Promise.all, so I suspect that that's where I've gone wrong, but I don't know how to fix it.

Javascript : Filter a JSON object from API call to only get the infos I need

I'm trying to create a small project to work on API calls. I have created an async that recovers infos about a track using the MusicBrainz API. You can check the result of the request by clicking there : https://musicbrainz.org/ws/2/recording/5935ec91-8124-42ff-937f-f31a20ffe58f?inc=genres+ratings+releases+artists&fmt=json (I chose Highway to Hell from AC/DC).
And here is what I got so far as reworking the JSON response of my request :
export const GET_JSON = async function (url) {
try {
const res = await Promise.race([
fetch(url),
timeout(CONSTANTS.TIMEOUT_SEC),
]);
const data = await res.json();
if (!res.ok) throw new Error(`${data.message} (${res.status})`);
return data;
} catch (err) {
throw err;
}
};
export const loadTrackDetail = async function (id) {
try {
const trackData = await GET_JSON(
encodeURI(
`${CONSTANTS.API_URL}${id}?inc=genres+artists+ratings+releases&fmt=json`
)
);
details.trackDetails = {
trackTitle: trackData.title,
trackID: trackData.id,
trackLength: trackData.length ?? "No duration provided",
trackArtists: trackData["artist-credit"].length
? trackData["artist-credit"]
: "No information on artists",
trackReleases: trackData["releases"].length
? trackData["releases"]
: "No information on releases",
trackGenres: trackData["genres"].length
? trackData["genres"]
: "No information on genres",
trackRating: trackData.rating.value ?? "No rating yet",
};
console.log(details.trackDetails);
} catch (err) {
throw err;
}
Now this isn't half bad, but the releases property for example is an array of objects (each one being a specific release on which the track is present) but for each of those releases, I want to "reduce" the object to its id and title only. The rest does not interest me. Moreover, I'd like to say that if, for example, the title of a release is similar to that of a previous one already present, the entire object is not added to the new array.
I've thought about doing a foreach function, but I just can't wrap my head around how to write it correctly, if it's actually possible at all, if I should use an array.map for example, or another iterative method.
If anyone has some nice way of doing this in pure JS (not Jquery !), efficient and clean, it'd be much appreciated !
Cheers
There are a few things that make this question a little difficult to answer, but I believe the below will get you pointed in the right direction.
You don't include the GET_JSON method, so your example isn't complete and can't be used immediately to iterate on.
In the example you bring, there isn't a name property on the objects contained in the releases array. I substituted name with title below to demonstrate the approach.
You state
Moreover, I'd like to say that if, for example, the name of a release
is similar to that of a previous one already present, the entire
object is not added to the new array.
But you don't define what you consider that would make releases similar.
Given the above, as stated, I assumed you meant title when you said name and I also assumed that what would constitute a similar release would be one with the same name/title.
Assuming those assumptions are correct, I just fetch to retrieve the results. The response has a json method on it that will convert the response to a JSON object. The I map each release to the smaller data set you are interested in(id, title) and then reduce that array to remove 'duplicate' releases.
fetch('https://musicbrainz.org/ws/2/recording/5935ec91-8124-42ff-937f-f31a20ffe58f?inc=genres+ratings+releases+artists&fmt=json')
.then(m => m.json())
.then(j => {
const reducedReleases = j.releases
.map(release => ({ id: release.id, name: release.title }))
.reduce(
(accumulator, currentValue, currentIndex, sourceArray) => {
if (!accumulator.find(a => a.name === currentValue.name)) {
accumulator.push(currentValue);
}
return accumulator;
},
[]);
console.log(reducedReleases);
});
const releasesReduced = []
const titleNotExist = (title) => {
return releasesReduced.every(release => {
if(release.title === title) return false;
return true
})
}
trackData["releases"].forEach(release => {
if (titleNotExist(release.title))
releasesReduced.push({id: release.id, title: release.title})
})
console.log(releasesReduced)
The array details.trackDetails.trackReleases has a path to an id and name from different objects. If you meant: ["release-events"]=>["area"]["id"]and["area"]["name"]` then see the demo below.
Demo uses flatMap() on each level of path to extract "release-events" then "area" to return an array of objects
[{name: area.name, id: area.id}, {name: area.name, id: area.id},...]
Then runs the array of pairs into a for...of loop and sets each unique name with id into a ES6 Map. Then it returns the Map as an object.
{name: id, name: id, ...}
To review this functioning, go to this Plunker
const releaseEvents = (details.trackDetails.trackReleases) => {
let trackClone = JSON.parse(JSON.stringify(objArr));
let areas = trackClone.flatMap(obj => {
if (obj["release-events"]) {
let countries = obj["release-events"].flatMap(o => {
if (o["area"]) {
let area = {};
area.name = o["area"]["name"];
area.id = o["area"]["id"];
return [area];
} else {
return [];
}
});
return countries;
} else {
return [];
}
});
let eventAreas = new Map();
for (let area of areas) {
if (!eventAreas.has(area.name)) {
eventAreas.set(area.name, area.id);
}
}
return Object.fromEntries([...eventAreas]);
};
console.log(releaseEvents(releases));

Firebase Firestore - Async/Await Not Waiting To Get Data Before Moving On?

I'm new to the "async/await" aspect of JS and I'm trying to learn how it works.
The error I'm getting is Line 10 of the following code. I have created a firestore database and am trying to listen for and get a certain document from the Collection 'rooms'. I am trying to get the data from the doc 'joiner' and use that data to update the innerHTML of other elements.
// References and Variables
const db = firebase.firestore();
const roomRef = await db.collection('rooms');
const remoteNameDOM = document.getElementById('remoteName');
const chatNameDOM = document.getElementById('title');
let remoteUser;
// Snapshot Listener
roomRef.onSnapshot(snapshot => {
snapshot.docChanges().forEach(async change => {
if (roomId != null){
if (role == "creator"){
const usersInfo = await roomRef.doc(roomId).collection('userInfo');
usersInfo.doc('joiner').get().then(async (doc) => {
remoteUser = await doc.data().joinerName;
remoteNameDOM.innerHTML = `${remoteUser} (Other)`;
chatNameDOM.innerHTML = `Chatting with ${remoteUser}`;
})
}
}
})
})
})
However, I am getting the error:
Uncaught (in promise) TypeError: Cannot read property 'joinerName' of undefined
Similarly if I change the lines 10-12 to:
remoteUser = await doc.data();
remoteNameDOM.innerHTML = `${remoteUser.joinerName} (Other)`;
chatNameDOM.innerHTML = `Chatting with ${remoteUser.joinerName}`;
I get the same error.
My current understanding is that await will wait for the line/function to finish before moving forward, and so remoteUser shouldn't be null before trying to call it. I will mention that sometimes the code works fine, and the DOM elements are updated and there are no console errors.
My questions: Am I thinking about async/await calls incorrectly? Is this not how I should be getting documents from Firestore? And most importantly, why does it seem to work only sometimes?
Edit: Here are screenshots of the Firestore database as requested by #Dharmaraj. I appreciate the advice.
You are mixing the use of async/await and then(), which is not recommended. I propose below a solution based on Promise.all() which helps understanding the different arrays that are involved in the code. You can adapt it with async/await and a for-of loop as #Dharmaraj proposed.
roomRef.onSnapshot((snapshot) => {
// snapshot.docChanges() Returns an array of the documents changes since the last snapshot.
// you may check the type of the change. I guess you maybe don’t want to treat deletions
const promises = [];
snapshot.docChanges().forEach(docChange => {
// No need to use a roomId, you get the doc via docChange.doc
// see https://firebase.google.com/docs/reference/js/firebase.firestore.DocumentChange
if (role == "creator") { // It is not clear from where you get the value of role...
const joinerRef = docChange.doc.collection('userInfo').doc('joiner');
promises.push(joinerRef.get());
}
});
Promise.all(promises)
.then(docSnapshotArray => {
// docSnapshotArray is an Array of all the docSnapshots
// corresponding to all the joiner docs corresponding to all
// the rooms that changed when the listener was triggered
docSnapshotArray.forEach(docSnapshot => {
remoteUser = docSnapshot.data().joinerName;
remoteNameDOM.innerHTML = `${remoteUser} (Other)`;
chatNameDOM.innerHTML = `Chatting with ${remoteUser}`;
})
});
});
However, what is not clear to me is how you differentiate the different elements of the "first" snapshot (i.e. roomRef.onSnapshot((snapshot) => {...}))). If several rooms change, the snapshot.docChanges() Array will contain several changes and, at the end, you will overwrite the remoteNameDOM and chatNameDOM elements in the last loop.
Or you know upfront that this "first" snapshot will ALWAYS contain a single doc (because of the architecture of your app) and then you could simplify the code by just treating the first and unique element as follows:
roomRef.onSnapshot((snapshot) => {
const roomDoc = snapshot.docChanges()[0];
// ...
});
There are few mistakes in this:
db.collection() does not return a promise and hence await is not necessary there
forEach ignores promises so you can't actually use await inside of forEach. for-of is preferred in that case.
Please try the following code:
const db = firebase.firestore();
const roomRef = db.collection('rooms');
const remoteNameDOM = document.getElementById('remoteName');
const chatNameDOM = document.getElementById('title');
let remoteUser;
// Snapshot Listener
roomRef.onSnapshot(async (snapshot) => {
for (const change of snapshot.docChanges()) {
if (roomId != null){
if (role == "creator"){
const usersInfo = roomRef.doc(roomId).collection('userInfo').doc("joiner");
usersInfo.doc('joiner').get().then(async (doc) => {
remoteUser = doc.data().joinerName;
remoteNameDOM.innerHTML = `${remoteUser} (Other)`;
chatNameDOM.innerHTML = `Chatting with ${remoteUser}`;
})
}
}
}
})

Unable to add async / await and then unable to export variable. Any help appreciated

Background: Been trying for the last 2 day to resolve this myself by looking at various examples from both this website and others and I'm still not getting it. Whenever I try adding callbacks or async/await I'm getting no where. I know this is where my problem is but I can't resolve it myself.
I'm not from a programming background :( Im sure its a quick fix for the average programmer, I am well below that level.
When I console.log(final) within the 'ready' block it works as it should, when I escape that block the output is 'undefined' if console.log(final) -or- Get req/server info, if I use console.log(ready)
const request = require('request');
const ready =
// I know 'request' is deprecated, but given my struggle with async/await (+ callbacks) in general, when I tried switching to axios I found it more confusing.
request({url: 'https://www.website.com', json: true}, function(err, res, returnedData) {
if (err) {
throw err;
}
var filter = returnedData.result.map(entry => entry.instrument_name);
var str = filter.toString();
var addToStr = str.split(",").map(function(a) { return `"trades.` + a + `.raw", `; }).join("");
var neater = addToStr.substr(0, addToStr.length-2);
var final = "[" + neater + "]";
// * * * Below works here but not outside this block* * *
// console.log(final);
});
// console.log(final);
// returns 'final is not defined'
console.log(ready);
// returns server info of GET req endpoint. This is as it is returning before actually returning the data. Not done as async.
module.exports = ready;
Below is an short example of the JSON that is returned by website.com. The actual call has 200+ 'result' objects.
What Im ultimately trying to achieve is
1) return all values of "instrument_name"
2) perform some manipulations (adding 'trades.' to the beginning of each value and '.raw' to the end of each value.
3) place these manipulations into an array.
["trades.BTC-26JUN20-8000-C.raw","trades.BTC-25SEP20-8000-C.raw"]
4) export/send this array to another file.
5) The array will be used as part of another request used in a websocket connection. The array cannot be hardcoded into this new request as the values of the array change daily.
{
"jsonrpc": "2.0",
"result": [
{
"kind": "option",
"is_active": true,
"instrument_name": "26JUN20-8000-C",
"expiration_timestamp": 1593158400000,
"creation_timestamp": 1575305837000,
"contract_size": 1,
},
{
"kind": "option",
"is_active": true,
"instrument_name": "25SEP20-8000-C",
"expiration_timestamp": 1601020800000,
"creation_timestamp": 1569484801000,
"contract_size": 1,
}
],
"usIn": 1591185090022084,
"usOut": 1591185090025382,
"usDiff": 3298,
"testnet": true
}
Looking your code we find two problems related to final and ready variables. The first one is that you're trying to console.log(final) out of its scope.
The second problem is that request doesn't immediately return the result of your API request. The reason is pretty simple, you're doing an asynchronous operation, and the result will only be returned by your callback. Your ready variable is just the reference to your request object.
I'm not sure about what is the context of your code and why you want to module.exports ready variable, but I suppose you want to export the result. If that's the case, I suggest you to return an async function which returns the response data instead of your request variable. This way you can control how to handle your response outside the module.
You can use the integrated fetch api instead of the deprecated request. I changed your code so that your component exports an asynchronous function called fetchData, which you can import somewhere and execute. It will return the result, updated with your logic:
module.exports = {
fetchData: async function fetchData() {
try {
const returnedData = await fetch({
url: "https://www.website.com/",
json: true
});
var ready = returnedData.result.map(entry => entry.instrument_name);
var str = filter.toString();
var addToStr = str
.split(",")
.map(function(a) {
return `"trades.` + a + `.raw", `;
})
.join("");
var neater = addToStr.substr(0, addToStr.length - 2);
return "[" + neater + "]";
} catch (error) {
console.error(error);
}
}
}
I hope this helps, otherwise please share more of your code. Much depends on where you want to display the fetched data. Also, how you take care of the loading and error states.
EDIT:
I can't get responses from this website, because you need an account as well as credentials for the api. Judging your code and your questions:
1) return all values of "instrument_name"
Your map function works:
var filter = returnedData.result.map(entry => entry.instrument_name);
2)perform some manipulations (adding 'trades.' to the beginning of each value and '.raw' to the end of each value.
3) place these manipulations into an array. ["trades.BTC-26JUN20-8000-C.raw","trades.BTC-25SEP20-8000-C.raw"]
This can be done using this function
const manipulatedData = filter.map(val => `trades.${val}.raw`);
You can now use manipulatedData in your next request. Being able to export this variable, depends on the component you use it in. To be honest, it sounds easier to me not to split this logic into two separate components - regarding the websocket -.

Categories