Using Wikipedia API for autocomplete search - javascript

I want to use the Wikipedia API to select a famous person's name from the People category from my JavaScript application. Basically, I would like to send the name or a partial name and get results that contain the Wikipedia URL, title, an excerpt of the content and, if possible, the main picture.
I have been trying two ways, but I cannot make it work as I want.
First I tried search, but I cannot find a way to make it return the URL. Would sectiontitle work as a unique identifier? Can snippet be returned as plain text somehow? I also cannot find how to filter by category.
Second, I have tried with opensearch, but the JSON response does not contain images, while the XML response does:
JSON: http://en.wikipedia.org/w/api.php?action=opensearch&search=mariano&namespace=0&format=json
XML: http://en.wikipedia.org/w/api.php?action=opensearch&search=mariano&namespace=0&format=xml
It is not possible to filter by category. Also, some results include a link to the disambiguation page, whereas I would prefer to get the list of possible matches rather than that link.
How can I search by title and get the full title, URL, a short description and a picture link?

Opensearch is for input-field autocompletion; it's based on an external spec and not very flexible. You should use the search API as a generator for another module such as info, which can return more details (example).
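For example, a minimal sketch of that combination (generator=search together with prop=info|pageimages|extracts, all standard modules on en.wikipedia.org); which fields you keep from the response is up to you:

// Sketch: search by a (partial) name and get title, URL, plain-text excerpt and thumbnail.
// generator=search feeds the matching pages to prop=info|pageimages|extracts;
// origin=* enables CORS from the browser.
const endpoint = 'https://en.wikipedia.org/w/api.php?' + new URLSearchParams({
  action: 'query',
  generator: 'search',
  gsrsearch: 'mariano',   // the (partial) name typed by the user
  gsrnamespace: '0',
  gsrlimit: '10',
  prop: 'info|pageimages|extracts',
  inprop: 'url',          // adds fullurl to each page
  piprop: 'thumbnail',
  pithumbsize: '100',
  exintro: '1',           // only the lead section
  explaintext: '1',       // plain text instead of HTML
  exlimit: 'max',
  format: 'json',
  origin: '*'
});

fetch(endpoint)
  .then(res => res.json())
  .then(data => {
    const pages = Object.values(data.query.pages);
    pages.forEach(p =>
      console.log(p.title, p.fullurl, p.extract, p.thumbnail && p.thumbnail.source));
  });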

Related

I can't find the XPath on this website, or maybe my syntax is wrong

I'm trying to scrape data from this URL: https://drive.getbigger.io/#/stores. However, I can't find the XPath of the text I want to extract, which is the producers' offers.
First I tried the IMPORTXML function in Google Sheets:
=IMPORTXML(A1;"/html/body/flt-ruler-host/div[23]/p")
and it gave me an N/A error: "the imported content is empty".
So I tried to scrape the website with add-ons and ParseHub, and every time I got a .csv file in which I can't find the data I want to extract.
I also can't find the right XPath for the data I would like to scrape; when I use the inspection tool, the data isn't in the <body> part.
However, the XPath I use in my IMPORTXML function points to some code I found in the <body> part, close to the text I'd like to extract (the producers' offers).
It seems that the XPath I am looking for is tied to some JavaScript code in the <head> part. Also, when I hover over the page with the selection tool to pick the data, it selects the whole page, maybe because there is a scrolling <div>.
So I wonder whether the website uses some kind of protection against scraping.
Please tell me:
Can I find the right XPath to scrape with the IMPORTXML function?
Should I extract the data with a Python script instead?
If the website blocks my attempts, how could I get around it?
You won't be able to scrape anything with the IMPORTXML formula, since the website uses dynamic rendering (JavaScript).
So yes, Python + Selenium (or other combinations) could do the job. The website won't block you if you follow some rules (switch the user agent, add pauses between requests).
You would probably need these XPaths:
Product description:
//p[1][string-length(text())>5][parent::flt-dom-canvas]
Product price:
//p[3][contains(text(),"€") and not (contains(text(),","))][parent::flt-dom-canvas]
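If you end up driving Selenium from JavaScript instead of Python (the bindings behave the same way), a rough sketch using those two XPaths could look like the following; the fixed sleep is a placeholder for whatever wait the dynamic rendering actually needs:

// Sketch: load the page in a real browser and pull the texts matched by the XPaths above.
// Requires: npm install selenium-webdriver (plus a chromedriver on the PATH).
const { Builder, By } = require('selenium-webdriver');

(async function scrape() {
  const driver = await new Builder().forBrowser('chrome').build();
  try {
    await driver.get('https://drive.getbigger.io/#/stores');
    await driver.sleep(5000); // crude wait for the JavaScript rendering; adjust as needed

    const descriptions = await driver.findElements(
      By.xpath('//p[1][string-length(text())>5][parent::flt-dom-canvas]'));
    const prices = await driver.findElements(
      By.xpath('//p[3][contains(text(),"€") and not (contains(text(),","))][parent::flt-dom-canvas]'));

    for (const el of descriptions) console.log(await el.getText());
    for (const el of prices) console.log(await el.getText());
  } finally {
    await driver.quit();
  }
})();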
However, I think the most elegant way to get the data is probably to use the API the website relies upon. With Google Sheets and a custom ImportJSON script, you can obtain something like this (result for "fromage" as the query):
It won't work out of the box; you'll have to modify parts of the script, since it can't load JSON that is requested with POST and needs headers. In a nutshell, you need to construct the payload, add headers to the request ("Bearer XXXXX"), and add a parameter to a function to retrieve the results.
All this depends on your objective and your expected output.
EDIT: For references (constructing the payload, adding parameters) you can read:
https://developers.google.com/apps-script/reference/url-fetch/url-fetch-app#fetchurl,-params
Also look at the Network tab of your browser's developer tools to find the URL of the API and the correct parameters to send.
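To illustrate the kind of change needed, here is a rough Apps Script sketch of such a POST call; the API URL, the payload fields and the token are placeholders that you have to copy from the request you see in the Network tab:

// Sketch (Google Apps Script): POST request with a Bearer token and a JSON payload.
// API_URL, the payload fields and the token are hypothetical; copy the real values
// from the request shown in the browser's Network tab.
function fetchOffers(query) {
  var API_URL = 'https://example.com/api/search';   // placeholder
  var options = {
    method: 'post',
    contentType: 'application/json',
    headers: { Authorization: 'Bearer XXXXX' },     // token seen in the Network tab
    payload: JSON.stringify({ query: query })       // construct the payload part here
  };
  var response = UrlFetchApp.fetch(API_URL, options);
  return JSON.parse(response.getContentText());
}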

Check a set of URLs if it contains a particular table ID

I have a set of URLs, some of which contain a table and some of which do not return any table.
I've inspected the source, and there is a particular table ID that loads at runtime. Is there a script or tool to check each URL's content, see if it contains that particular table ID, and return true?
I could then use this script or tool to load a batch of URLs and see which of them contain the table I am looking for.
This is web scraping. Many languages have a way to do this.
Pick what you are most comfortable with. However, Python with the Beautiful Soup library is a nice choice.
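A comparable sketch in Node.js (instead of Python with Beautiful Soup) could look like this; note that since the question says the table loads at runtime, a plain HTTP fetch will not see a table injected by JavaScript and you would need a headless browser instead:

// Sketch: check a batch of URLs for a given table id in the served HTML.
// The URLs and the id below are placeholders. Requires Node 18+ (built-in fetch).
// Only works if the table is present in the static HTML; a table injected at
// runtime by JavaScript requires a headless browser (e.g. Puppeteer) instead.
const urls = [
  'https://example.com/page1',
  'https://example.com/page2'
];
const tableId = 'my-table-id';   // the table ID you are looking for

async function checkUrls() {
  for (const url of urls) {
    const html = await fetch(url).then(res => res.text());
    const hasTable = html.includes(`id="${tableId}"`);
    console.log(url, hasTable);
  }
}

checkUrls();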

Is it possible to have one formula that can scrape any given url for universal properties?

As I am new to the world of scraping, I need some help. I am creating an app called "Job Pocket" that will help with saving and recommending jobs. It has two sections: a "Recommended" section and a "My List" section.
Within the "Recommended" section I have successfully scraped the RSS feed from Craigslist and appended the main information that I wanted to the DOM as a list (URL, title, description, and date). The title links back to the actual job posting on Craigslist. Right now I am scraping the RSS feed from Craigslist with jQuery and JavaScript. I will have to refactor this later into an AngularJS approach.
What I am having a difficult time with is the "My List" section. This section allows users to copy the URL of a job posting from any site and add it to the "My List" section for saving.
I would like to be able to scrape any given URL for the main tags, such as the title, the first image, and the first 200-500 characters of the body.
I feel that these properties should be a part of most if not all pages. Is it possible to make a single function, or very few functions, that will be able to do this?
I would like to see a solution using JavaScript with either AngularJS or jQuery.
Here is a link to my repo on GitHub if you need it: Github Repo
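As a rough sketch of the kind of generic extractor described above: most pages expose a <title>, often an og:image meta tag or a first <img>, and body text you can truncate. The function below assumes the page's HTML is already available as a string, since browsers block cross-origin requests to arbitrary sites and in practice you would fetch it through your own server or a proxy:

// Sketch: pull "universal" properties (title, first image, short description) out of
// a page's HTML. Assumes the HTML string was fetched elsewhere (e.g. via your own
// server), because browsers block cross-origin requests to arbitrary job sites.
function extractPreview(html) {
  var doc = new DOMParser().parseFromString(html, 'text/html');

  var titleEl = doc.querySelector('title');
  var title = titleEl ? titleEl.textContent : '';

  var ogImage = doc.querySelector('meta[property="og:image"]');
  var firstImg = doc.querySelector('img');
  var image = ogImage ? ogImage.getAttribute('content')
                      : (firstImg ? firstImg.getAttribute('src') : '');

  var bodyText = doc.body ? doc.body.textContent.replace(/\s+/g, ' ').trim() : '';

  return {
    title: title,
    image: image,
    description: bodyText.slice(0, 300)   // roughly the first 200-500 characters of the body
  };
}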

Fetch name of company from Google Finance page

I need to get the name of a company into a Google spreadsheet.
The GOOGLEFINANCE function doesn't include the name of the company in its attributes, so I'm trying to create a custom function for that.
So, for IBM, for example, I can fetch the URL:
https://www.google.com/finance?q=ibm
And using JavaScript, I'm trying to get the text of the name using:
document.getElementsByClassName('appbar-snippet-primary')[0].getElementsByTagName("span")[0].innerHTML
Which is returning:
undefined
If you are trying to do that inside Apps Script, it won't be possible: the version of JavaScript in Apps Script does not have the document object, so you won't be able to do it like that.
If you return that response to the client (where the JavaScript does have the document object) in order to look for the item that way, you would first have to add the information to that object.
A possible solution would be to treat the result of the URL fetch as a string and then look for the information you require.
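A sketch of that string-based approach in Apps Script; the regular expression is only a guess tied to the appbar-snippet-primary markup from the question and will break if Google changes the page:

// Sketch (Google Apps Script): fetch the page as text and pull the company name out
// with a regular expression. The pattern is based on the 'appbar-snippet-primary'
// markup mentioned in the question and may break if the page changes.
function getCompanyName(ticker) {
  var html = UrlFetchApp.fetch('https://www.google.com/finance?q=' + ticker).getContentText();
  var match = html.match(/class="appbar-snippet-primary"[^>]*>\s*<span[^>]*>([^<]+)<\/span>/);
  return match ? match[1] : 'Not found';
}

// Usage in a spreadsheet cell: =getCompanyName("ibm")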

Blogger JSON API Post fetch and Content Parse

I'm new to Blogger and its JSON API. I've found out I can retrieve all posts, or retrieve a specific post using its post ID.
I'm trying to build a lazy-loading Blogger post list page, where blog content is fetched dynamically. What is the best way to retrieve the latest 5 posts in a single request? (I don't want to make 5 requests for 5 posts.)
Another thing: I want to show each post's first image on the post list page. How can I fetch the first image and only the textual content?
I've searched Google but couldn't find any good tutorial. I hope you guys can help me.
Cheers
One option is to use JavaScript callbacks, for example:
http://jayunit100.blogspot.com/feeds/posts/default?alt=json-in-script&callback=myFunc
This will return a piece of executable JavaScript code which
1) calls a function which you defined
2) sends that function a JSON object
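For instance, a minimal sketch of that pattern; max-results is the standard feed parameter for capping the response at the latest 5 posts, and each entry's content is HTML you can strip down to a text preview and a first image:

// Sketch: JSONP-style loading of the latest 5 posts via the feed above.
// myFunc receives the JSON object described in the answer.
function myFunc(json) {
  json.feed.entry.forEach(function (entry) {
    var title = entry.title.$t;
    var html = entry.content.$t;                       // full post HTML
    var text = html.replace(/<[^>]+>/g, ' ').trim();   // crude strip to plain text
    var imgMatch = html.match(/<img[^>]+src="([^"]+)"/);
    var firstImage = imgMatch ? imgMatch[1] : null;
    console.log(title, firstImage, text.slice(0, 200));
  });
}

var script = document.createElement('script');
script.src = 'http://jayunit100.blogspot.com/feeds/posts/default' +
             '?alt=json-in-script&max-results=5&callback=myFunc';
document.body.appendChild(script);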
What follows is my speculation on this matter, because it clearly is much more difficult than one might think it should be:
It is not clear to me why a simpler, authentication-free JSON REST API that simply takes a blog ID and returns plain text is not available (maybe one does exist); I suspect it might be that Blogger wants to discourage crawling.
http://blogname.blogspot.com/feeds/posts/default?alt=json
Replace blogname with your blog's name.
You will get a JSON object; use a JSON formatter to format it.
