Web scraper iterating over pages with Rx.js - javascript

About a month ago I built this web scraper using async/await to collect the info asynchronously. I'm now trying to build that very same scraper again using Rx.js. I've read through the docs and it seems to make sense; getting started is the hardest part, but after that hump I've made some progress.
You can see here that I fetch the first page on the site (page 0), and I need to use that page to get the count of pages (which is around 6000). With that count and getPageURI(page) I can build each page URL; my issue is that I can't figure out how to trigger, fire, or pipe that information back into the original pageRequestStream. I have this page count number and I need a way to iterate over it, pushing requests back into the first, original pageRequestStream.
import cheerio from 'cheerio'
import Rx from 'rx'
import fetch from 'isomorphic-fetch'
const DIGITAL_NYC_URI = 'http://www.digital.nyc'
let getPageURI = (page) => `${DIGITAL_NYC_URI}/startups?page=${page}`
let getProfileURI = (profile) => `${DIGITAL_NYC_URI}${profile}`
function fetchURL(stream, dataType = 'json') {
  return stream.flatMap(requestURL => {
    return Rx.Observable.fromPromise(fetch(requestURL).then(res => res[dataType]()))
  })
}
function getNumberOfPages($) {
  let summary = $('.result-summary').text()
  let match = summary.match(/Showing 1 - 20 of (\d+) Startups/)
  return parseInt(match[1], 10)
}
function getCompaniesOnPage ($) {
  let companySelector = 'h3.node-title a'
  let companies = $(companySelector).map(function (i, el) {
    let name = $(this).text()
    let profile = $(this).attr('href')
    return {
      'name': name,
      'profile': profile
    }
  }).get()
  return companies
}
let pageRequestStream = Rx.Observable.just(getPageURI(0))
let pageResponseStream = fetchURL(pageRequestStream, 'text')
let parsedPageHTMLStream = pageResponseStream.map(html => cheerio.load(html))
let numberOfPagesStream = parsedPageHTMLStream.map(html => getNumberOfPages(html))
// not sure how to get this to iterate over the count and fire URLs into pageRequestStream
numberOfPagesStream.subscribe(pageCount => console.log(pageCount))
let companiesOnPageStream = parsedPageHTMLStream.flatMap(html => getCompaniesOnPage(html))
// not sure how to build up the company object to include async value company.profileHTML
companiesOnPageStream.subscribe(companies => console.log(companies))
// let companyProfileStream = companiesOnPageStream.map((company) => {
//   return fetch(getProfileURI(company.profile))
//     .then(res => res.html())
//     .then(html => {
//       company.profileHTML = html
//       return company
//     })
// })
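To illustrate, what I'm after is something like this (untested sketch, assuming 20 results per page as in the summary text, so the count emitted by numberOfPagesStream is divided by 20):
let remainingPageRequestStream = numberOfPagesStream
  .flatMap(total => Rx.Observable.range(1, Math.ceil(total / 20) - 1)) // pages 1..last (page 0 already fetched)
  .map(page => getPageURI(page))
let remainingPageResponseStream = fetchURL(remainingPageRequestStream, 'text')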

Have a look at Subjects; they allow you to fire events into a stream as you go.
Maybe this can serve as some inspiration:
import cheerio from 'cheerio';
import Rx from 'rx';
import fetch from 'isomorphic-fetch';
function getCheerio(url) {
  var promise = fetch(url)
    .then(response => response.text())
    .then(body => cheerio.load(body));
  return Rx.Observable.fromPromise(promise);
}
const DIGITAL_NYC_URI = 'http://www.digital.nyc';
var pageRequest = new Rx.Subject();
pageRequest
  .flatMap(pageUrl => getCheerio(pageUrl))
  .flatMap(page$ => {
    // here we pipe discovered URLs back into our original observable
    var nextPageUrl = page$('ul.pagination li.arrow a').attr('href');
    if (nextPageUrl) pageRequest.onNext(DIGITAL_NYC_URI + '/' + nextPageUrl);
    var profileUrls = page$('h3.node-title a')
      .map(function () {
        var url = page$(this).attr('href');
        return DIGITAL_NYC_URI + '/' + url;
      }).get();
    return Rx.Observable.from(profileUrls);
  })
  .flatMap(url => getCheerio(url))
  .map(profile$ => {
    // build the company profile here
    return profile$('title').text();
  })
  .subscribe(value => console.log('profile ', value));
pageRequest.onNext(DIGITAL_NYC_URI + '/startups');
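One caveat: the subject never completes on its own, so the subscriber will never see onCompleted. If you need the pipeline to finish, you can signal completion once no next-page link is found, e.g. (sketch):
if (nextPageUrl) {
  pageRequest.onNext(DIGITAL_NYC_URI + '/' + nextPageUrl);
} else {
  pageRequest.onCompleted(); // no more pages: let downstream finish
}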

Related

Google Apps Script working on the back end but not in Sheets

I am trying to create a script that pulls from the CoinMarketCap API and displays the current price. The script works fine on the back end when I assign the variable a value. However, when I try to run the function in Sheets, the returned value is null.
function marketview(ticker) {
  var url = "https://pro-api.coinmarketcap.com/v1/cryptocurrency/quotes/latest?CMC_PRO_API_KEY=XXX&symbol=" + ticker;
  var data = UrlFetchApp.fetch(url);
  const jsondata = JSON.parse(data);
  Logger.log(jsondata.data[ticker].quote['USD'].price)
}
My execution logs show that the script is running, but when I use the function and try to quote ETH, for example, the script runs for BTC.
When I do this on the back end and assign ETH, the script works fine and returns the right quote. Any ideas on what I'm missing?
I did the same with the CoinGecko API and had an issue with all my requests being rejected with a quota-exceeded error.
I understood that Google Sheets servers' IP addresses were already spamming the CoinGecko server (I was obviously not the only one to try this).
This is why I used an external service like apify.com to pull the data and re-expose it over their API.
This is my Apps Script coingecko.gs:
/**
 * get latest coingecko market prices dataset
 */
async function GET_COINGECKO_PRICES(key, actor) {
  const coinGeckoUrl = `https://api.apify.com/v2/acts/${actor}/runs/last/dataset/items?token=${key}&status=SUCCEEDED`
  return ImportJSON(coinGeckoUrl);
}
You need ImportJSON function, available here: https://github.com/bradjasper/ImportJSON/blob/master/ImportJSON.gs
Then in a cell I write =GET_COINGECKO_PRICES(APIFY_API_KEY,APIFY_COINGECKO_MARKET_PRICES); you will have to create two named ranges, APIFY_API_KEY and APIFY_COINGECKO_MARKET_PRICES, for this to work.
Then register on apify.com and create an actor by forking the apify-webscraper actor.
I set the start URL to https://api.coingecko.com/api/v3/coins/list, which gives me the total number of existing cryptos (approximately 11,000 as of today) and hence the number of pages, so I can run the requests concurrently (the rate limit is 10 concurrent requests on CoinGecko). Then I just replace /list with /markets and set the proper limit to get all the pages I need.
I use the following as the task's page function:
async function pageFunction(context) {
  let marketPrices = [];
  const ENABLE_CONCURRENCY_BATCH = true;
  const PRICE_CHANGE_PERCENTAGE = ['1h', '24h', '7d'];
  const MAX_PAGE_TO_SCRAP = 10;
  const MAX_PER_PAGE = 250;
  const MAX_CONCURRENCY_BATCH_LIMIT = 10;
  await context.waitFor(5000);
  const cryptoList = readJson();
  const totalPage = Math.ceil(cryptoList.length / MAX_PER_PAGE);
  context.log.info(`[Coingecko total cryptos count: ${cryptoList.length} (${totalPage} pages)]`)
  function readJson() {
    try {
      const preEl = document.querySelector('body > pre');
      return JSON.parse(preEl.innerText);
    } catch (error) {
      throw Error(`Failed to read JSON: ${error.message}`)
    }
  }
  async function loadPage($page) {
    try {
      const params = {
        vs_currency: 'usd',
        page: $page,
        per_page: MAX_PER_PAGE,
        price_change_percentage: PRICE_CHANGE_PERCENTAGE.join(','),
        sparkline: true,
      }
      let pageUrl = `${context.request.url.replace(/\/list$/, '/markets')}?`;
      pageUrl += [
        `vs_currency=${params.vs_currency}`,
        `page=${params.page}`,
        `per_page=${params.per_page}`,
        `price_change_percentage=${params.price_change_percentage}`,
      ].join('&');
      context.log.info(`GET page ${params.page} URL: ${pageUrl}`);
      const page = await fetch(pageUrl).then((response) => response.json());
      context.log.info(`Done GET page ${params.page} size ${page.length}`);
      marketPrices = [...marketPrices, ...page];
      return page
    } catch (error) {
      throw Error(`Failed to load page ${$page}: ${error.message}`)
    }
  }
  try {
    if (ENABLE_CONCURRENCY_BATCH) {
      // build one fetcher per page, capped at MAX_PAGE_TO_SCRAP
      const fetchers = Array.from({ length: totalPage }).map((_, i) => {
        const pageIndex = i + 1;
        if (pageIndex > MAX_PAGE_TO_SCRAP) {
          return null;
        }
        return () => loadPage(pageIndex);
      }).filter(Boolean);
      // run the fetchers in batches of MAX_CONCURRENCY_BATCH_LIMIT
      while (fetchers.length) {
        await Promise.all(
          fetchers.splice(0, MAX_CONCURRENCY_BATCH_LIMIT).map((f) => f())
        );
      }
    } else {
      // sequential fallback: fetch pages one at a time
      let pageIndex = 1
      let page = await loadPage(pageIndex)
      while (page.length !== 0 && pageIndex <= MAX_PAGE_TO_SCRAP) {
        pageIndex += 1
        page = await loadPage(pageIndex)
      }
    }
  } catch (error) {
    context.log.info(`Fetchers failed: ${error.message}`);
  }
  context.log.info(`End: Updated ${marketPrices.length} prices for ${cryptoList.length} cryptos`);
  const data = marketPrices.sort((a, b) => a.id.toLowerCase() > b.id.toLowerCase() ? 1 : -1);
  context.log.info(JSON.stringify(data.find((item) => item.id.toLowerCase() === 'bitcoin')));
  function sanitizer(item) {
    item.symbol = item.symbol.toUpperCase()
    return item;
  }
  return data.map(sanitizer)
}
I presume you are hitting the same issue I had with CoinMarketCap, and that you could do the same with it.
You're not returning anything to the sheet, just logging it. Return it:
return jsondata.data[ticker].quote['USD'].price
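Put together, a minimal corrected version of the function (same URL and response shape as in the question) would be:
function marketview(ticker) {
  var url = "https://pro-api.coinmarketcap.com/v1/cryptocurrency/quotes/latest?CMC_PRO_API_KEY=XXX&symbol=" + ticker;
  var response = UrlFetchApp.fetch(url);
  var jsondata = JSON.parse(response.getContentText());
  // the value you return, not what you log, is what the cell displays
  return jsondata.data[ticker].quote['USD'].price;
}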

How to make a chained fetch and show results as each step becomes available?

I'm fetching an article list from my API, then using the Unsplash API to fetch a relevant image for each fetched article title.
This is my code:
let url = 'http://127.0.0.1:8000/api';
async function getData(url) {
  const res = await fetch(url);
  const objects = await res.json();
  await Promise.all(objects.map(async (object) => {
    const res = await fetch('https://api.unsplash.com/search/photos?client_id=XXX&content_filter=high&per_page=1&query=' + object.title);
    const image = await res.json();
    object.image_url = image.results[0].urls.small
    object.image_alt = image.results[0].alt_description
  }));
}
let articles_1 = getData(url + '/articles/index/1/');
let articles_2 = getData(url + '/articles/index/2/');
let articles_3 = getData(url + '/articles/index/3/');
I am showing three different categories at once on the same page. That's why I call that function three times.
Question:
When this function runs, results are shown only after both the article data and the images have been fetched. But I want to show the article data first, as soon as it's fetched, and then the images when they arrive, to shorten the user's waiting time. How can I achieve this, whether with a Svelte reactive declaration or plain JavaScript?
You would want to separate the two functions, so they can be called in sequence.
const endpoint = 'http://127.0.0.1:8000/api';
const getArticles = async (url) => {
  const res = await fetch(url);
  return res.json();
};
const renderArticles = async (articles) => {
  // render the article set and return it as an array of DOM nodes
  return articlesDOM;
};
const getImageForArticle = async (articleNode) => {
  // assumes the article node carries its title, e.g. in a data attribute
  const res = await fetch('https://api.unsplash.com/search/photos?client_id=XXX&content_filter=high&per_page=1&query=' + articleNode.dataset.title);
  const data = await res.json();
  const img = new Image();
  img.src = data.results[0].urls.small;
  img.alt = data.results[0].alt_description;
  return {img, articleNode};
};
const renderImage = async (stuff) => {
  const {img, articleNode} = stuff;
  // inject your img into your article
};
// now call in sequence
getArticles(endpoint + '/articles/index/1/').then(renderArticles).then(articleNodes => {
  const promises = articleNodes.map(articleNode => {
    return getImageForArticle(articleNode).then(renderImage);
  });
  return Promise.all(promises);
});
While I'm not completely sure what you're trying to do (I don't have a minimal working example), here's my best attempt at it:
var url = 'http://127.0.0.1:8000/api';
async function getData(url) {
  var data = fetch(url)
    .then(data => data.json())
  await data.then(data => ArticleFunc(data))
  await data.then(function(data) {
    data.map(function(object) {
      fetch('https://api.unsplash.com/search/photos?client_id=XXX&content_filter=high&per_page=1&query=' + object.title)
        .then(data => data.json())
        .then(function(image) {
          object.image_url = image.results[0].urls.small
          object.image_alt = image.results[0].alt_description
          ImageFunc(object)
        })
    })
  })
}
function ArticleFunc(data){
  //display article
}
function ImageFunc(data){
  //display image
}
getData(url + '/articles/index/1/');
getData(url + '/articles/index/2/');
getData(url + '/articles/index/3/');
Note that this is to be treated as pseudocode, as again, it is untested due to the absence of a minimal working example.
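For what it's worth, here is a tighter sketch of the same two-phase idea with async/await (equally untested, and it assumes ArticleFunc and ImageFunc render as above): articles render as soon as their fetch resolves, and each image is patched in independently when its own request lands.
async function getData(url) {
  const res = await fetch(url);
  const objects = await res.json();
  ArticleFunc(objects); // show the article text immediately
  // fire off the image lookups without awaiting them as a group
  objects.forEach(async (object) => {
    const imgRes = await fetch('https://api.unsplash.com/search/photos?client_id=XXX&content_filter=high&per_page=1&query=' + object.title);
    const image = await imgRes.json();
    object.image_url = image.results[0].urls.small;
    object.image_alt = image.results[0].alt_description;
    ImageFunc(object); // patch the image in when it arrives
  });
}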

How to call web service endpoints one after the other?

I want to implement a function in JavaScript which calls a series of web service endpoints and checks for a value in the response of each API call.
The first endpoint should be called first, then a filter method should pick out the specific object from the response. If the object is found, the process should stop and the object should be returned. However, if the object is not found in the first endpoint's response, the second endpoint must be called, and the same process repeated until the object is found.
The Web service endpoint that I am working on is:
https://jsonmock.hackerrank.com/api/countries?page=1
This API returns a list of country data. The value of the page query parameter varies from 1 to 25. I need to call the endpoint and check for a specific country, page by page from 1 to 25, until the country object is found.
I tried achieving this using JavaScript Promises and the Fetch API, but couldn't think of a way to call the APIs one after the other.
I am really looking forward to your answers. Thank you in advance.
You can use async and await for this:
async function findCountry(country) {
  for (let page = 1; page < 26; page++) {
    console.log("page = " + page); // for debugging only
    let response = await fetch("https://jsonmock.hackerrank.com/api/countries?page=" + page);
    let {data} = await response.json();
    let obj = data.find(obj => obj.name == country);
    if (obj) return obj;
  }
}
let country = "Belgium";
findCountry(country).then(obj => {
  if (obj) {
    console.log("The capital of " + country + " is " + obj.capital);
  } else {
    console.log("Could not find " + country);
  }
});
If you know that the data is sorted by country name, then you could reduce the average number of requests by using a binary search.
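As a rough sketch of that idea (assuming the names really are sorted alphabetically across pages 1 to 25):
async function findCountrySorted(country) {
  let lo = 1, hi = 25;
  while (lo <= hi) {
    const page = Math.floor((lo + hi) / 2); // probe the middle page
    const response = await fetch("https://jsonmock.hackerrank.com/api/countries?page=" + page);
    const { data } = await response.json();
    const obj = data.find(obj => obj.name == country);
    if (obj) return obj;
    if (data.length === 0 || country < data[0].name) {
      hi = page - 1; // the country would sort before this page
    } else {
      lo = page + 1; // the country would sort after this page
    }
  }
}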
Here's a way that you can do it.
const url = 'https://jsonmock.hackerrank.com/api/countries'
const fetchFromApi = async (page) => {
  const res = await fetch(`${url}?page=${page}`)
  return res.json()
}
const getCountryFromResults = (countryName, data) => {
  const country = countryName.toLowerCase()
  return data.find(({name}) => name.toLowerCase() === country)
}
const findCountry = async (countryName) => {
  let page = 1;
  let totalPages = 1;
  while (page <= totalPages) {
    const res = await fetchFromApi(page);
    if (totalPages < res.total_pages) {
      totalPages = res.total_pages
    }
    const country = getCountryFromResults(countryName, res.data)
    if (country) {
      return country
    }
    page = page + 1
  }
};
(async () => {
  console.log(await findCountry("Afghanistan"))
  console.log(await findCountry("Argentina"))
})()

How to avoid data duplication while fetching?

I have a simple fetch function that gets data (messages from a DB) and puts it into a list to display with simple vanilla JS. The thing is, I am calling this function every 2 seconds to check for new messages. But when I do that, my messages get duplicated: it keeps adding instead of replacing. I am struggling to understand what I should do to change the list rather than add to it.
(A little dummy question, sorry.)
const list = document.getElementById('message-list');
const getData = () => {
  fetch('/messages')
    .then((resp) => resp.json())
    .then(function(data) {
      console.log(data)
      for (let i = 0; i < data.length; i++) {
        const listItem = document.createElement('li');
        listItem.innerText = data[i].message;
        const delButton = document.createElement('button');
        delButton.innerHTML = 'Delete';
        delButton.addEventListener('click', () => {
          const message_id = data[i].message_id;
          deleteItem(message_id);
        })
        listItem.appendChild(delButton);
        list.appendChild(listItem)
      }
    })
}
setInterval(getData, 2000)
Make a Set of the message_ids processed so far, and on further calls ignore messages whose message_id is already in the set:
const seenIds = new Set();
const getData = () => {
  fetch('/messages')
    .then((resp) => resp.json())
    .then(function(data) {
      data
        .filter(({ message_id }) => !seenIds.has(message_id))
        .forEach(({ message, message_id }) => {
          seenIds.add(message_id);
          const listItem = document.createElement('li');
          listItem.innerText = message;
          const delButton = document.createElement('button');
          delButton.textContent = 'Delete';
          delButton.addEventListener('click', () => {
            deleteItem(message_id);
          });
          listItem.appendChild(delButton);
          list.appendChild(listItem)
        });
    });
};
That said, it would probably be better to change your backend so that it can filter the items for you, rather than sending objects over the network only for them to be ignored.
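As a hypothetical example (the since parameter here is made up; your backend would need to support something like it), the client could then ask only for messages newer than the last one it has seen:
let lastSeenId = 0;
const getNewMessages = () => {
  fetch('/messages?since=' + lastSeenId)
    .then((resp) => resp.json())
    .then((data) => {
      data.forEach(({ message, message_id }) => {
        lastSeenId = Math.max(lastSeenId, message_id); // remember the newest id
        // ...build and append the <li> exactly as above...
      });
    });
};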

Batch get DocumentReferences?

I'm trying to improve a Firestore get function. I have something like:
return admin.firestore().collection("submissions").get().then(
  async (x) => {
    var toRet: any = [];
    for (var i = 0; i < 10; i++) {
      try {
        var hasMedia = x.docs[i].data()['mediaRef'];
        if (hasMedia != null) {
          var docData = (await x.docs[i].data()) as MediaSubmission;
          let submission: MediaSubmission = new MediaSubmission();
          submission.author = x.docs[i].data()['author'];
          submission.description = x.docs[i].data()['description'];
          var mediaRef = await admin.firestore().doc(docData.mediaRef).get();
          submission.media = mediaRef.data() as MediaData;
          toRet.push(submission);
        }
      }
      catch (e) {
        console.log("ERROR GETTING MEDIA: " + e);
      }
    }
    return res.status(200).send(toRet);
  });
The first get is fine, but performance is worst on this line:
var mediaRef = await admin.firestore().doc(docData.mediaRef).get();
I think this is because the call is not batched.
Would it be possible to do a batch get on an array of mediaRefs to improve performance?
Essentially I have a collection of documents with foreign references stored as string paths into a separate collection, and fetching those references has proven to be slow.
What about this? I did some refactoring to use more async/await code; hopefully my comments are helpful.
The main idea is to use Promise.all and await all the mediaRef retrievals in parallel:
async function test(req, res) {
  // get all docs
  const { docs } = await admin
    .firestore()
    .collection('submissions')
    .get();
  // keep the data of the docs that have a mediaRef (data() is synchronous)
  const datas = docs
    .map(doc => doc.data())
    .filter(data => data.mediaRef);
  // get all media in one parallel batch - this is the important change
  const mediaRefs = await Promise.all(
    datas.map(({ mediaRef }) =>
      admin
        .firestore()
        .doc(mediaRef)
        .get(),
    ),
  );
  // create the return objects
  const toRet = datas.map((data: MediaSubmission, i) => {
    const submission = new MediaSubmission();
    submission.author = data.author;
    submission.description = data.description;
    submission.media = mediaRefs[i].data() as MediaData;
    return submission;
  });
  return res.status(200).send(toRet);
}
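One further option, if you want a true batched read rather than N parallel ones: the Node Admin SDK exposes getAll(), which fetches multiple document references in a single round trip. Under the same assumptions as the code above, the mediaRefs step could become:
// build the refs, then fetch them all in one request
const refs = datas.map(({ mediaRef }) => admin.firestore().doc(mediaRef));
const mediaRefs = refs.length ? await admin.firestore().getAll(...refs) : [];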
