Pulling Articles from Large JSON Response - javascript

I'm trying to code something which tracks the Ontario Immigrant Nominee Program Updates page for updates and then sends an email alert if there's a new article. I've done this in PHP but I wanted to try and recreate it in JS because I've been learning JS for the last few weeks.
The OINP has a public API, but the entire body of the webpage is stored in the JSON response (you can see this here: https://api.ontario.ca/api/drupal/page%2F2020-ontario-immigrant-nominee-program-updates?fields=body)
Looking through the safe_value, the common trend is that the Date / Title is always between <h3> tags. What I did in PHP was create a function that stored the text between the <h3> tags in a Date / Title variable. Then, to store the article body text, I just grabbed all the text between </h3> and </p><h3> (basically everything after the title, until the beginning of the next title), stored it in a 'bodytext' variable, and then iterated through all occurrences.
I'm stumped figuring out how to do this in JS.
So far - trying to keep it simple, I literally have:
const fetch = require("node-fetch");

fetch(
  "https://api.ontario.ca/api/drupal/page%2F2020-ontario-immigrant-nominee-program-updates?fields=body"
)
  .then((result) => {
    return result.json();
  })
  .then((data) => {
    let websiteData = data.body.und[0].safe_value;
    console.log(websiteData);
  });
This outputs all of the body. Can anyone point me in the direction of a library / some tips that can help me:
Read through the entire safe_value response and break down each article (Date / Title + Article body) into an array.
I'm probably then just going to upload each article into MongoDB and have it checked twice daily; if there's a new article I'll send an email notification.
Any advice is appreciated!!
Thanks,

You can use regex to get the content of tags, e.g.
/<h3>(.*?)<\/h3>/g.exec(data.body.und[0].safe_value)[1]
returns August 26, 2020
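If you need every date rather than just the first, matchAll iterates over all matches. A quick sketch, using a made-up snippet standing in for the real safe_value:

```javascript
// Hypothetical snippet standing in for safe_value; the real page has one <h3> per update.
const sample = "<h3>August 26, 2020</h3><p>First update</p><h3>July 13, 2020</h3><p>Second update</p>";

// matchAll yields every match; index 1 of each match is the capture group without the tags.
const dates = [...sample.matchAll(/<h3>(.*?)<\/h3>/g)].map(m => m[1]);
console.log(dates); // [ 'August 26, 2020', 'July 13, 2020' ]
```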

With the use of some regex you can get this done pretty easily.
I wasn't sure about what the "date / title / content" parts were, but this shows how to parse some HTML.
I also changed the code to async/await; this is more of a personal preference. The code should work the same with then/catch.
(async () => {
  try {
    // Make request
    const response = await fetch("https://api.ontario.ca/api/drupal/page%2F2020-ontario-immigrant-nominee-program-updates?fields=body");
    // Parse response into json
    const data = await response.json();
    // Get the parsed data we need
    const websiteData = data.body.und[0].safe_value;
    // Split the html into separate articles (every <h2> is the start of a new article)
    const articles = websiteData.split(/(?=<h2)/g);
    // Get the data for each article
    const articleInfo = articles.map((article) => {
      // Capture group 1 holds the text between the first <h3> tags (the date);
      // ?. guards chunks that have no match, like any intro before the first <h2>
      const date = /<h3>(.*)<\/h3>/m.exec(article)?.[1];
      // The text between the first <h4> tags is the title
      const title = /<h4>(.*)<\/h4>/m.exec(article)?.[1];
      // Everything between the first <p> and last </p> is the content of the article
      const content = /<p>(.*)<\/p>/m.exec(article)?.[1];
      return {date, title, content};
    });
    // Show results
    console.log(articleInfo);
  } catch(error) {
    // Show error if there are any
    console.log(error);
  }
})();
Without comments:
(async () => {
  try {
    const response = await fetch("https://api.ontario.ca/api/drupal/page%2F2020-ontario-immigrant-nominee-program-updates?fields=body");
    const data = await response.json();
    const websiteData = data.body.und[0].safe_value;
    const articles = websiteData.split(/(?=<h2)/g);
    const articleInfo = articles.map((article) => {
      const date = /<h3>(.*)<\/h3>/m.exec(article)?.[1];
      const title = /<h4>(.*)<\/h4>/m.exec(article)?.[1];
      const content = /<p>(.*)<\/p>/m.exec(article)?.[1];
      return {date, title, content};
    });
    console.log(articleInfo);
  } catch(error) {
    console.log(error);
  }
})();

I just completed creating a .NET Core worker service for this.
The value you are looking for is "metatags.description.og:updated_time.#attached.drupal_add_html_head..#value"
The idea is: if the last updated time changes, you send an email notification!
Try this in your JavaScript:
fetch(`https://api.ontario.ca/api/drupal/page%2F2021-ontario-immigrant-nominee-program-updates`)
  .then((result) => {
    return result.json();
  })
  .then((data) => {
    let lastUpdated = data.metatags["og:updated_time"]["#attached"].drupal_add_html_head[0][0]["#value"];
    console.log(lastUpdated);
  });
I will be happy to add you to the email list for the app I just created!

Related

tensorflowjs - universal sentence encoder taking 8 seconds to encode input

I'm running some tests using tensorflow. I have this code to provide the inputData to train the model:
import * as use from '@tensorflow-models/universal-sentence-encoder';

const encodeData = async (data) => {
  const texts = data.map(d => d.text.toLowerCase())
  const trainingData = await use.load()
    .then(async (model) => {
      return model.embed(texts)
        .then(embeddings => {
          return embeddings
        })
    })
    .catch((err) => console.log(err))
  return trainingData
}
It works okay. The problem is that when I try to predict something from a text, I also need to encode that text first. For example, when I'm doing:
const dummyData = await encodeData([{ text: 'watching movie tonight?' }])
console.log(model.predict(dummyData).dataSync())
it takes like 8 seconds to encode this string to provide a prediction.
So let's say I want to put this website live: I can't make the user wait 8 seconds for a result; I guess it is too long. Is there something I could do, or maybe a faster library to use instead of this one? It is really slow.
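The slow step is most likely use.load() running inside encodeData on every call: it downloads and initializes the model each time. Loading the model once at startup and reusing the promise should leave only the embed cost per prediction. A minimal sketch of that caching pattern, with fakeLoad standing in for use.load() so it runs without TensorFlow:

```javascript
// Cache the load promise at module scope so the expensive load happens only once.
// fakeLoad is a stand-in for use.load(); swap in the real call in your app.
let loadCount = 0;
const fakeLoad = async () => {
  loadCount += 1; // tracks how often the expensive load actually runs
  return { embed: async (texts) => texts.map(t => t.length) };
};

const modelPromise = fakeLoad(); // kicked off once, at startup

const encodeData = async (data) => {
  const model = await modelPromise; // later calls reuse the already-resolved promise
  return model.embed(data.map(d => d.text.toLowerCase()));
};
```

With the real library this becomes `const modelPromise = use.load();`, and each encodeData call then only pays for model.embed.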

Returning an attribute from a fetch call using the Star Wars API

I am using the Star Wars API (SWAPI) for a project. I am in the process of learning fetch API. An issue I have is that, I have a section for species and when I press the button it returns certain attributes for that species. This is the code:
function getSpecies() {
  let numberSpecies = Math.floor((Math.random()*37)+1)
  let apiUrl = 'https://swapi.dev/api/species/' + numberSpecies
  fetch(apiUrl)
    .then(function(response){
      return response.json()
    })
    .then(function(json){
      console.log(json)
      let name = document.getElementById('species-name')
      let classification = document.getElementById('classification')
      let designation = document.getElementById('designation')
      let language = document.getElementById('language')
      let lifespan = document.getElementById('lifespan')
      name.innerText = `Species Name: ${json['name']}`;
      classification.innerText = `Classification: ${json['classification']}`;
      designation.innerText = `Designation: ${json['designation']}`;
      language.innerText = `Language: ${json['language']}`;
      lifespan.innerText = `Lifespan: ${json['average_lifespan']}-years`;
    })
}
So, when I press the button it fetches the info I want. However, there is a data attribute in the species section labelled 'people', and this shows the people that belong to that species. The 'people' attribute is another link within the SWAPI. In other words, the attribute is made up of more URLs which require using fetch.
My problem is I want to call those multiple URLs, return the 'name' from each, and then show them. So, if the species is 'Human' it has four people and therefore 4 URLs, and I want to fetch all those URLs and only grab the 'name' of each 'Human'. This is what I have tried (it goes directly below the last line of code, i.e. below lifespan.innerText):
const peopleUrl = json.people
peopleUrl.forEach((data) => {
  fetch(data)
    .then(function(response){
      return response.json()
    })
    .then(function(json){
      console.log(json)
      let inhabitants = document.getElementById('inhabitants')
      inhabitants.innerText = json.name
    })
})
This, however, only returns the name of the last person in the array and not all of them.
Is there a way I can fetch all the 'names'?
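One approach: the forEach fires all the requests, but each .then overwrites inhabitants.innerText, so only the last response to arrive survives (and `people = (data) =>` is an assignment, not a valid callback). Collecting the requests with Promise.all and joining the names fixes both. A sketch with the fetching step injected so the logic runs anywhere; getJson stands in for `url => fetch(url).then(r => r.json())`:

```javascript
// Fetch every person URL in parallel, then join all names into one string
// instead of overwriting the element once per response.
async function fetchInhabitantNames(peopleUrls, getJson) {
  const people = await Promise.all(peopleUrls.map(getJson));
  return people.map(person => person.name).join(', ');
}
```

In the page code this would be called as `fetchInhabitantNames(json.people, url => fetch(url).then(r => r.json())).then(names => { document.getElementById('inhabitants').innerText = names; });`.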

Simplifying JS query from JSON API

I'm trying to display some data from an API, using something similar to the example below.
I need to feed in data such as a Video ID from a CMS, which will then retrieve additional data from the API about the video.
The below example works, but I'm not sure I've written it in the best way, and I would like to be able to set a variable using a Video ID from the CMS's front-end template string. Below I am having to repeat the Video ID in order to get it to work, and I'm struggling to simplify this.
Any help appreciated!
async function getSomething() {
  let request = {
    item: {
      name: "example",
      colour: "red"
    },
    video: ["12345"]
  };
  let url = new URL("https://example.com");
  url.searchParams.append('data', JSON.stringify(request));
  let response = await fetch(url);
  let data = await response.json();
  console.log(data.video);
  return data;
}

getSomething().then(data =>
  document.querySelector('[data-video-id]').innerHTML = data.video["12345"].label
);

How to call a function after it's already defined in Puppeteer

I have a function that grabs text data from this site and stores it in a JSON file. This function only crawls the first page of this website, but I'd like to click through or "goto" each URL (there are 10 pages) and grab the text data from each page:
await page.goto('http://quotes.toscrape.com/page/1/')

// grab quote data
const quotes = await page.evaluate(() => {
  const grabFromDiv = (div, selector) => Array.from(div
    .querySelectorAll(selector), (el => el.innerText.trim()))
  // ...rest of the page function
})
Currently, it just navigates to page 1, grabs the data, stores it, and then exits. Is there a way to call the quotes function over and over until I've navigated through all 10 pages and collected all the data?
I would just do the same for each page.
If you know the number of pages, then just do:
var quotes = ''
for (const url of pages) { // pages = the list of page URLs
  await page.goto(url)
  quotes += await page.evaluate(myPageFunction)
}
If you don't know the number of pages, you need to get that information from the actual page.
Then, in the evaluate function, just search for the next page:
myPageFunction = function() {
  // get your data
  const nextPage = document.querySelector('.next a')?.href
  return {data: yourData, nextPage: nextPage}
}
You would then have something like:
let nextPage = 'http://quotes.toscrape.com/page/1/'
while (nextPage) {
  await page.goto(nextPage)
  const result = await page.evaluate(myPageFunction)
  quotes += result.data
  nextPage = result.nextPage
}
The code is just an example and won't work as is.
Best!
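A runnable version of that loop, with the Puppeteer calls replaced by an injected visit function (standing in for `page.goto` plus `page.evaluate(myPageFunction)`) so the control flow can be checked without a browser:

```javascript
// Follow nextPage links until a page reports none, accumulating each page's data.
async function crawlQuotes(startUrl, visit) {
  let quotes = [];
  let nextPage = startUrl;
  while (nextPage) {
    // real code: await page.goto(nextPage); const result = await page.evaluate(myPageFunction)
    const result = await visit(nextPage);
    quotes = quotes.concat(result.data);
    nextPage = result.nextPage; // undefined on the last page, which ends the loop
  }
  return quotes;
}
```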

How can I update my dictionary with nested HTTP request?

I'm gonna try to explain this as clearly as I can, but it's very confusing to me so bear with me.
For this project, I'm using Node.js with the modules Axios and Cheerio.
I am trying to fetch HTML data from a webshop (similar to Amazon/eBay), and store the product information in a dictionary. I managed to store most things (title, price, image), but the product description is on a different page. To do a request to this page, I'm using the URL I got from the first request, so they are nested.
This first part is done with the following request:
let request = axios.get(url)
  .then(res => {
    // This gets the HTML for every product
    getProducts(res.data);
    console.log("Got products in HTML");
  })
  .then(res => {
    // This parses the product HTML into a dictionary of product items
    parseProducts(productsHTML);
    console.log("Generated dictionary with all the products");
  })
  .then(res => {
    // This loops through the products to fetch and add the description
    updateProducts(products);
  })
  .catch(e => {
    console.log(e);
  })
I'll also provide the way I'm creating product objects, as it might clarify the function where I think the problem occurs.
function parseProducts(html) {
  for (item in productsHTML) {
    // Store the data from the first request
    const $ = cheerio.load(productsHTML[item]);
    let product = {};
    let mpUrl = $("a").attr("href");
    product["title"] = $("a").attr("title");
    product["mpUrl"] = mpUrl;
    product["imgUrl"] = $("img").attr("src");
    let priceText = $("span.subtext").text().split("\xa0")[1].replace(",", ".");
    product["price"] = parseFloat(priceText);
    products.push(product);
  }
}
The problem resides in the updateProducts function. If I console.log the dictionary afterwards, the description is not added. I think this is because the console will log before the description gets added. This is the update function:
function updateProducts(prodDict) {
  for (i in prodDict) {
    let request2 = axios.get(prodDict[i]["mpUrl"])
      .then(res => {
        const $ = cheerio.load(res.data);
        description = $("div.description p").text();
        prodDict[i]["descr"] = description;
        // If I console.log the product here, the description is included
      })
  }
  // If I console.log the product here, the description is NOT included
}
I don't know what to try anymore, I guess it can be solved with something like async/await or putting timeouts on the code. Can someone please help me with updating the products properly, and adding the product descriptions? Thank you SO much in advance.
To refactor this with async/await one would do:
async function fetchAndUpdateProducts() {
  try {
    const response = await axios.get(url);
    getProducts(response.data);
    console.log("Got products in HTML");
    parseProducts(productsHTML);
    console.log("Generated dictionary with all the products");
    await updateProducts(products);
  } catch(e) {
    console.log(e);
  }
}

fetchAndUpdateProducts().then(() => console.log('Done'));
and
async function updateProducts(prodDict) {
  for (const i in prodDict) {
    const response = await axios.get(prodDict[i]["mpUrl"]);
    const $ = cheerio.load(response.data);
    const description = $("div.description p").text();
    prodDict[i]["descr"] = description;
  }
}
The call to fetchAndUpdateProducts will not conclude until the promise returned by updateProducts has resolved, so the descriptions are in place before anything logs the products.
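Since each product page is independent, the sequential await inside the loop could also be swapped for Promise.all so the description requests run in parallel. A hedged sketch; getDescription stands in for the axios + cheerio steps above:

```javascript
// Parallel variant: start every description request at once and wait for all of them.
async function updateProductsParallel(products, getDescription) {
  await Promise.all(products.map(async (product) => {
    product.descr = await getDescription(product.mpUrl);
  }));
  return products;
}
```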
