Apify web scraper ignoring URL Fragment - javascript

I have a list of URLs that I want to scrape, so I put them into startUrls like this:
"startUrls": [
{
"url": "https://www.example.com/sample#000000",
"method": "GET"
},
{
"url": "https://www.example.com/sample#111111",
"method": "GET"
}
]
And this is the excerpt from my pageFunction code.
async function pageFunction(context) {
const { request } = context;
var name;
try {
name = document.querySelector('h1').textContent;
} catch (e) {
name = "null";
}
return {
link: request.url,
name
};
}
This works fine for URLs that differ in the domain or the path. But if the only difference is in the fragment, only the first URL is processed; the second URL is considered a duplicate and is skipped.
I've tried adding this bit of code as the second line of the pageFunction:
await context.enqueueRequest({
url: context.request.url,
keepUrlFragment: true,
});
But that leads to another problem: it produces duplicate results for each URL.
So what should I do to make this work correctly? Is there another way, other than calling enqueueRequest, to set keepUrlFragment to true?

Unfortunately, you currently cannot set keepUrlFragment directly in startUrls, so I suggest not using startUrls for these URLs at all. Instead, pass them as an array in customData, use a single dummy start URL such as http://example.com with the label START, and enqueue the real URLs from the page function like this:
async function pageFunction(context) {
const { request, customData } = context;
if (request.userData.label === 'START') {
for (const url of customData) {
await context.enqueueRequest({
url,
keepUrlFragment: true,
});
}
} else {
// Your main scraping logic here
}
}
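To see why the two start URLs collapse into one, here is a rough illustration of fragment-insensitive deduplication. The uniqueKey helper is hypothetical and is not Apify's actual normalization logic; it only assumes that the queue key is derived from the URL with the fragment removed unless keepUrlFragment is set.

```javascript
// Hypothetical helper: derive a deduplication key from a URL.
// By default the fragment is dropped, so URLs that differ only after "#" collide.
function uniqueKey(rawUrl, keepUrlFragment = false) {
  const url = new URL(rawUrl);
  if (!keepUrlFragment) url.hash = "";
  return url.toString();
}

const a = "https://www.example.com/sample#000000";
const b = "https://www.example.com/sample#111111";

console.log(uniqueKey(a) === uniqueKey(b));             // true: treated as duplicates
console.log(uniqueKey(a, true) === uniqueKey(b, true)); // false: kept apart
```

With the fragment kept, each URL gets its own key, which is the effect the enqueueRequest call with keepUrlFragment: true is relying on.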


Using Bottleneck to rate-limit API requests in a library

I am writing an API wrapper in TypeScript. I would like the code to be asynchronous in order to maximally meet the rate limit of the API in question. The API wants requests to be submitted at a maximum rate of 1/second.
I intend to implement an API wrapper which is instantiated once, and allows the use of objects to reach the different endpoints. For instance, within the greater API there is a post and pool endpoint. I would like to access them like post_object.post.submit_request(argument1, ...) or post_object.pool.submit_request(argument1, ...).
I have created an object called state_info which is passed between the various objects, within which is contained a user-agent header, login information if provided, and a rate-limiter object from the Bottleneck library.
The issue I'm running into while testing is that my program doesn't seem to actually limit the rate of requests: no matter what limit I pass to Bottleneck, all the requests complete in about 0.6 seconds every time.
I am thinking this has something to do with passing around the rate-limiter object, or in accessing it from multiple places, but I'm unsure.
First, here is the code for the Model object, which represents access into the API.
import axios, { AxiosRequestConfig } from "axios";
import { StateInfo, Method } from "./interfaces";
export class Model {
public stateInfo: StateInfo;
constructor(stateInfo: StateInfo) {
// Preserve rate limiter, user agent, etc.
this.stateInfo = stateInfo;
}
//Updated to funcName = () => {} syntax to bind "this" to this class context.
private submit_request = (query_url: string, method: Method) => {
if (this.stateInfo.username && this.stateInfo.api_key) {
const axiosConfig: AxiosRequestConfig = {
method: method,
url: query_url,
headers: { "User-Agent": this.stateInfo.userAgent },
auth: {
username: this.stateInfo.username,
password: this.stateInfo.api_key,
},
};
return axios(axiosConfig);
} else {
const axiosConfig: AxiosRequestConfig = {
method: "get",
url: query_url,
headers: { "User-Agent": this.stateInfo.userAgent },
};
return axios(axiosConfig);
}
};
public submit_throttled_request = (url: string, method: Method) => {
return this.stateInfo.rateLimiter.schedule(
this.submit_request,
url,
method
);
};
}
Then, the code from which I call this class:
import { Model } from "./models/model";
import Bottleneck from "bottleneck";
const limiter: Bottleneck = new Bottleneck({ mintime: 1000, maxconcurrent: 1 });
const stateInfo = {
rateLimiter: limiter,
userAgent: "email#website.com | API Dev",
};
let modelObj: Model = new Model(stateInfo);
async function makeRequest() {
try {
let response = await modelObj.submit_throttled_request(
"https://www.website.com/api",
"get"
);
console.log(response.data.id + "|" + Date.now());
} catch (err) {
console.log(err);
}
}
let start = new Date();
for (let i = 0; i < 20; i++) {
makeRequest();
}
My expectation is that the operation would take, at a minimum, 10 seconds if only one request can be submitted per second. Yet I'm averaging half that, no matter what I include for mintime.
I've learned the answer to my own question after much head-scratching.
It turns out, in the "gotchas" section of the bottleneck API reference they note:
If you're passing an object's method as a job, you'll probably need to bind() the object:
with the following code:
// instead of this:
limiter.schedule(object.doSomething);
// do this:
limiter.schedule(object.doSomething.bind(object));
// or, wrap it in an arrow function instead:
limiter.schedule(() => object.doSomething());
This is the issue I was running into. I was handing off axios(axiosConfig) without binding the scope, so nothing was being sent off to the Bottleneck rate limiter. By wrapping it like so: this.stateInfo.rateLimiter.schedule(() => axios(axiosConfig)); I have managed to correctly bind the context as needed.
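The binding gotcha quoted above can be reproduced without any library. This is a minimal sketch, not Bottleneck itself: the schedule function and the Client class are hypothetical stand-ins, and the only assumption is that a scheduler calls the job as a plain function, without a receiver.

```javascript
// Hypothetical stand-in for a scheduler such as limiter.schedule(fn, ...args):
// it calls the job later as a plain function, so `this` is not the object.
function schedule(job, ...args) {
  return Promise.resolve().then(() => job(...args));
}

class Client {
  constructor(userAgent) {
    this.userAgent = userAgent;
  }
  // Ordinary method: `this` depends on how the method is called.
  describe(url) {
    return `${this.userAgent} -> ${url}`;
  }
}

const client = new Client("demo-agent");

async function demo() {
  const results = {};
  // Unbound method reference: `this` is undefined inside describe, so it throws.
  results.unbound = await schedule(client.describe, "/a").catch(
    (err) => err.constructor.name
  );
  // Bound reference and arrow wrapper both keep the receiver.
  results.bound = await schedule(client.describe.bind(client), "/b");
  results.wrapped = await schedule(() => client.describe("/c"));
  return results;
}

demo().then((results) => console.log(results));
```

Separately, Bottleneck's documented option names are camelCase (minTime, maxConcurrent), so the lowercase spellings in the question's constructor call may also be worth double-checking.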

Read settings contained in a json file in javascript

I'm trying to read a simple setting from a JSON file. The JSON is this:
{
"Label": "some string here"
}
From my JavaScript code I do:
import settings from '../settings.json';
then:
var settings= ()=> {
const headers = new Headers();
const requestOptions = {
method: 'GET',
headers: { ...headers.authentication, ...headers.culture, 'ContentType':'application/json',
};
return fetch(`${settings.Label}`, requestOptions).then(() => {
return response.text().then(text => {
const data = text ? text && JSON.parse(text) : {};
let token = response.headers.get('X-Token');
if (token) {
data.token = token;
}
if (!response.ok) {
// manage error here
}
return Promise.reject(error);
}
return data;
})
});
};
// use settings here
I'm not very experienced in JavaScript, and despite many searches and attempts, my variable 'settings' does not contain anything.
I believe you need to add an export to your JSON file
export const settings = {
"label": "some string here"
}
Not much information given here, but this probably has to do with transpiling your javascript. You can use:
const settings = require('../settings.json')
instead.
Try this answer: https://stackoverflow.com/a/59844868/7701381
Also, rename either the imported JSON settings or the var settings, because having both under the same name might cause unexpected behavior.
I had the approach completely wrong: the file is already available, so I don't have to request it from the server. I just have to return the string, without using fetch or anything else:
return `${settings.Label}`;
Sorry, and thanks a lot for the support.
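As a small sketch of that final approach combined with the renaming advice: the names appSettings and getLabel are hypothetical, and the inline object stands in for the imported settings.json so the snippet runs on its own.

```javascript
// Stand-in for: import appSettings from '../settings.json';
const appSettings = { Label: "some string here" };

// A distinct name, so the accessor does not collide with the imported object.
const getLabel = () => appSettings.Label;

console.log(getLabel()); // some string here
```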

Problem w/state of a vue data object in component

I am updating my original Vue project and am getting an error with the sports_feeds_boxscores_* data objects. The site has three tabs that pull down scores for the three major leagues, and I am now adding the player stats for each game. I did baseball first and everything worked fine. Now I am doing football and the problem arises. I have three objects set up for the stats, one per league. The NFL object also contains an entry for each of the three days of the week they play. What happens is that the stats for Sunday are pulled down correctly, but Thursday's stats, which should contain only one game, instead contain all of Sunday's games plus the one Thursday game. Monday then contains both Sunday's and Thursday's results in addition to Monday's. I have made all the components separate and use three separate data objects for the component props. And if I first click the NFL tab and then go to the MLB tab, all the results from the NFL data object are in sports_feeds_boxscores_mlb. I set up a site here so you can see what is going on using Vue.js devtools. Here is the pertinent code:
index.html:
<component
v-if="currentTabComponent === 'tab-mlb'"
v-bind:is="currentTabComponent"
v-bind:props_league_data="sports_feeds_data"
v-bind:props_league_standings="standings"
v-bind:props_baseball_playoffs="baseball_playoffs"
v-bind:props_end_of_season="end_of_season[this.currentTab.toLowerCase()]"
v-bind:props_box_game_scores_mlb="sports_feeds_boxscores_mlb"
class="tab"
>
</component>
<component
v-if="currentTabComponent === 'tab-nfl'"
v-bind:is="currentTabComponent"
v-bind:props_league_data="sports_feeds_data"
v-bind:props_league_data_nfl="nfl_feeds"
v-bind:props_league_standings="standings"
v-bind:props_nfl_playoffs="nfl_playoffs"
v-bind:props_end_of_season="end_of_season[this.currentTab.toLowerCase()]"
v-bind:props_box_game_scores_nfl="sports_feeds_boxscores_nfl"
class="tab"
>
</component>
vue.js:
data() {
return {
sports_feeds_boxscores_mlb: null,
sports_feeds_boxscores_nfl: {
sun: null,
mon: null,
thurs: null
},
sports_feeds_boxscores_nba: null,
etc
/* Component Code */
// First let's get the Game and BoxScores Data
const nflScores = async () => {
this.nfl_feeds.sunday_data = await getScores(
nflDate.sundayDate,
config
);
this.nfl_feeds.thurs_data = await getScores(
nflDate.thursdayDate,
config
);
this.nfl_feeds.mon_data = await getScores(nflDate.mondayDate, config);
// Next we need the gameid's to retrieve the game boxscores for each day
this.nfl_feeds.sunday_data.forEach(function(item, index) {
if (item.isCompleted === "true") {
nflGameIDs.sunday[index] = item.game.ID;
}
});
this.nfl_feeds.thurs_data.forEach(function(item, index) {
if (item.isCompleted === "true") {
nflGameIDs.thursday[index] = item.game.ID;
}
});
this.nfl_feeds.mon_data.forEach(function(item, index) {
if (item.isCompleted === "true") {
nflGameIDs.monday[index] = item.game.ID;
}
});
// Check if boxscores have been retrieved on previous tab click for each day
// if not retrieve the boxscores
this.sports_feeds_boxscores_nfl.sun =
this.sports_feeds_boxscores_nfl.sun ||
(await getBoxScores(nflGameIDs.sunday, url, params));
this.sports_feeds_boxscores_nfl.thurs =
(await getBoxScores(nflGameIDs.thursday, url, params));
this.sports_feeds_boxscores_nfl.mon =
this.sports_feeds_boxscores_nfl.mon ||
(await getBoxScores(nflGameIDs.monday, url, params));
}; /* End nflScores Async function */
getBoxScores.js:
try {
const getBoxScores = async (gameIDs, myUrl, params) => {
gameIDs.forEach(function(item) {
promises.push(
axios({
method: "get",
headers: {
Authorization:
"Basic &&*&&^&&=="
},
url: myUrl + item,
params: params
})
);
});
// axios.all returns a single Promise that resolves when all of the promises passed
// as an iterable have resolved. This single promise, when resolved, is passed to the
// "then" and into the "values" parameter.
await axios.all(promises).then(function(values) {
boxScores = values;
});
console.log(`boxScores is ${boxScores.length}`)
return boxScores;
};
module.exports = getBoxScores;
} catch (err) {
console.log(err);
}
I have split up all the sports_feeds_boxscores objects and am at a loss as to why they are sharing state. Sorry for the verbosity of the question, but it is somewhat complex. That is why I provided the site, where you can see in devtools that, for instance, this.sports_feeds_boxscores_nfl.thurs has 14 elements instead of one after the call to the API. And if the MLB tab is clicked after the NFL tab, the MLB results include the NFL results. I would really appreciate help in figuring this out. Thanks in advance.
Update:
I have added getBoxScores.js because it seems the extra stats are being returned from this call.
This was my bad. I didn't realize I had created a closure over shared state in getBoxScores.js, with the arrays declared outside the function:
let boxScores = [];
let promises = [];
try {
const getBoxScores = async (gameIDs, myUrl, params) => {
gameIDs.forEach(function(item) {
promises.push(
axios({
method: "get",
headers: {
Authorization:
"Basic &&^^&^&&^FGG="
},
url: myUrl + item,
params: params
})
);
});
Moving the declarations inside the async function quickly solved the trouble. URRRRGGGHHH!!!

Cannot get response content in mithril

I've been trying to make a request to a NodeJS API. For the client, I am using the Mithril framework. I used their first example to make the request and obtain data:
var Model = {
getAll: function() {
return m.request({method: "GET", url: "http://localhost:3000/store/all"});
}
};
var Component = {
controller: function() {
var stores = Model.getAll();
alert(stores); // The alert box shows exactly this: function (){return arguments.length&&(a=arguments[0]),a}
alert(stores()); // Alert box: undefined
},
view: function(controller) {
...
}
};
After running this I noticed through Chrome Developer Tools that the API is responding correctly with the following:
[{"name":"Mike"},{"name":"Zeza"}]
I can't find a way to obtain this data into the controller. They mentioned that using this method, the var may hold undefined until the request is completed, so I followed the next example by adding:
var stores = m.prop([]);
Before the model and changing the request to:
return m.request({method: "GET", url: "http://localhost:3000/store/all"}).then(stores);
I might be doing something wrong because I get the same result.
The objective is to get the data from the response and send it to the view to iterate.
Explanation:
m.request returns a function (a getter-setter), and so does m.request(...).then(), which is why the value of "stores" is:
"function (){return arguments.length&&(a=arguments[0]),a}"
"stores()" is undefined because the AJAX request is asynchronous: the result is not available immediately, so you need to wait. If you call "stores()" after some delay, your data will be there. That is why you need promises (the "then" feature). The function passed as a parameter to "then(param)" is executed when the response is ready.
Working sample:
You can start playing with this sample, and implement what you need:
var Model = {
getAll: function() {
return m.request({method: "GET", url: "http://www.w3schools.com/angular/customers.php"});
}
};
var Component = {
controller: function() {
var records = Model.getAll();
return {
records: records
}
},
view: function(ctrl) {
return m("div", [
ctrl.records().records.map(function(record) {
return m("div", record.Name);
})
]);
}
};
m.mount(document.body, Component);
If you have more questions, feel free to ask here.
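The timing described in the explanation can be sketched without Mithril. Everything here is an assumption made for illustration: prop is a hypothetical minimal getter-setter loosely modelled on the m.prop idea, and fakeRequest stands in for the real request.

```javascript
// Hypothetical minimal getter-setter: call with no arguments to read,
// with one argument to write.
function prop(initial) {
  let value = initial;
  return function (...args) {
    if (args.length) value = args[0];
    return value;
  };
}

// Hypothetical stand-in for an asynchronous request.
function fakeRequest() {
  return new Promise((resolve) =>
    setTimeout(() => resolve([{ name: "Mike" }, { name: "Zeza" }]), 10)
  );
}

const stores = prop();
const before = stores(); // read immediately: nothing has arrived yet
const done = fakeRequest().then(stores); // store the response when it arrives

done.then(() => {
  console.log(before); // undefined
  console.log(stores()); // the array from the response
});
```

Reading the getter-setter right after starting the request gives undefined; reading it inside "then", or later in the view, gives the data.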

Node.js - Can't post nested/escaped JSON to body using Fermata REST client

The problem may be with the actual client, but he's not responding on github, so I'll give this a shot!
I'm trying to post, in the body, nested JSON:
{
"rowkeys":[
{
"rowkey":"rk",
"columns":[
{
"columnname":"cn",
"columnvalue":"{\"date\":\"2011-06-21T00:53:10.309Z\",\"disk0\":{\"kbt\":31.55,\"tps\":6,\"mbs\":0.17},\"cpu\":{\"us\":5,\"sy\":4,\"id\":90},\"load_average\":{\"m1\":0.85,\"m5\":0.86,\"m15\":0.78}}",
"ttl":10000
},
{
"columnname":"cn",
"columnvalue":"cv",
"ttl":10000
}
]
},
{
"rowkey":"rk",
"columns":[
{
"columnname":"cn",
"columnvalue":"fd"
},
{
"columnname":"cn",
"columnvalue":"cv"
}
]
}
]
}
When I remove the columnvalue's JSON string, the POST works. Maybe there's something I'm missing regarding escaping? I've tried a few built-in escape utilities to no avail.
var jsonString='the json string above here';
var sys = require('sys'),
rest = require('fermata'), // https://github.com/andyet/fermata
stack = require('long-stack-traces');
var token = ''; // Username
var accountId = ''; // Password
var api = rest.api({
url : 'http://url/v0.1/',
user : token,
password : accountId
});
var postParams = {
body: jsonString
};
(api(postParams)).post(function (error, result) {
if (error)
sys.puts(error);
sys.puts(result);
});
The API I'm posting to can't deserialize this.
{
"rowkeys":[
{
"rowkey":"rk",
"columns":[
{
"columnname":"cn",
"columnvalue":{
"date":"2011-06-21T00:53:10.309Z",
"disk0":{
"kbt":31.55,
"tps":6,
"mbs":0.17
},
"cpu":{
"us":5,
"sy":4,
"id":90
},
"load_average":{
"m1":0.85,
"m5":0.86,
"m15":0.78
}
},
"ttl":10000
},
{
"columnname":"cn",
"columnvalue":"cv",
"ttl":10000
}
]
},
{
"rowkey":"rk",
"columns":[
{
"columnname":"cn",
"columnvalue":"fd"
},
{
"columnname":"cn",
"columnvalue":"cv"
}
]
}
]
}
Two problems occurring at the same time led me to find an issue with the fermata library handling large JSON posts. The JSON above is just fine!
I think the real problem here is that you are trying to post data via a URL parameter instead of via the request body.
You are using Fermata like this:
path = fermata.api({url:"http://example.com/path"});
data = {key1:"value1", key2:"value2"};
path(data).post(callback);
What path(data) represents is still a URL, with data showing up in the query part. So your code is posting to "http://example.com/path?key1=value1&key2=value2" with an empty body.
Since your data is large, I wouldn't be surprised if your web server looked at such a long URL and sent back a 400 instead. Assuming your API can also handle JSON data in the POST body, a better way to send a large amount of data would be to use Fermata like this instead:
path = fermata.api({url:"http://example.com/path"});
data = {key1:"value1", key2:"value2"};
path.post(data, callback);
This will post your data as a JSON string to "http://example.com/path", and you will be a lot less likely to run into data size problems.
Hope this helps! The "magic" of Fermata is that unless you pass a callback function, you are getting local URL representations, instead of calling HTTP functions on them.
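The difference between the two call shapes can be illustrated with plain JavaScript, without Fermata. This sketch uses only the standard URL API; the bodyRequest object is a hypothetical description of a request, not something Fermata produces.

```javascript
const data = { key1: "value1", key2: "value2" };

// Data placed in the URL: every key ends up in the query string.
const withQuery = new URL("http://example.com/path");
for (const [key, value] of Object.entries(data)) {
  withQuery.searchParams.set(key, value);
}

// Data placed in the body: the URL stays short regardless of payload size.
const bodyRequest = {
  url: "http://example.com/path",
  body: JSON.stringify(data),
};

console.log(withQuery.toString()); // http://example.com/path?key1=value1&key2=value2
console.log(bodyRequest.url); // http://example.com/path
```

The query-string form grows with every field added to the payload, which is why a large nested document is more likely to hit URL length limits than the same document sent in the body.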
