Copy data from a dynamic website using scrapy - javascript
I started writing a scraper to collect data on cars from a site. It turned out that the data structure varies between ads: sellers do not fill in all the fields, so the set of fields changes from page to page, and in the resulting CSV file the values end up in the wrong columns.
page example:
https://www.olx.ua/obyavlenie/prodam-voikswagen-touran-2011-goda-IDBzxYq.html#87fcf09cbd
https://www.olx.ua/obyavlenie/fiat-500-1-4-IDBjdOc.html#87fcf09cbd
data example:
One approach was to match the field name with text() = "Category name", but I'm not sure how to write each result to the correct cell.
I also used the browser's built-in developer tools and ran the command document.getElementsByClassName('margintop5')[0].innerText
This printed the whole contents of the table, but the results are not structured.
If the output could be in JSON format, would that solve my problem?
innerText result
In addition, when I studied the page source, I came across a JavaScript block in which all the necessary data is already structured, but I do not know how to extract it.
<script type="text/javascript">
var GPT = GPT || {};
GPT.targeting = {"cat_l0":"transport","cat_l1":"legkovye-avtomobili","cat_l2":"volkswagen","cat_l0_id":"1532","cat_l1_id":"108","cat_l2_id":"1109","ad_title":"volkswagen-jetta","ad_img":"https:\/\/img01-olxua.akamaized.net\/img-olxua\/676103437_1_644x461_volkswagen-jetta-kiev.jpg","offer_seek":"offer","private_business":"private","region":"ko","subregion":"kiev","city":"kiev","model":["jetta"],"modification":[],"motor_year":[2006],"car_body":["sedan"],"color":["6"],"fuel_type":["543"],"motor_engine_size":["1751-2000"],"transmission_type":["546"],"motor_mileage":["175001-200000"],"condition":["first-owner"],"car_option":["air_con","climate-control","cruise-control","electric_windows","heated-seats","leather-interior","light-sensor","luke","on-board-computer","park_assist","power-steering","rain-sensor"],"multimedia":["acoustics","aux","cd"],"safety":["abs","airbag","central-locking","esp","immobilizer","servorul"],"other":["glass-tinting"],"cleared_customs":["no"],"price":["3001-5000"],"ad_price":"4500","currency":"USD","safedealads":"","premium_ad":"0","imported":"0","importer_code":"","ad_type_view":"normal","dfp_user_id":"e3db0bed-c3c9-98e5-2476-1492de8f5969-ver2","segment":[],"dfp_segment_test":"76","dfp_segment_test_v2":"46","dfp_segment_test_v3":"46","dfp_segment_test_v4":"32","adx":["bda2p24","bda1p24","bdl2p24","bdl1p24"],"comp":["o12"],"lister_lifecycle":"0","last_pv_imps":"2","user-ad-fq":"2","ses_pv_seq":"1","user-ad-dens":"2","listingview_test":"1","env":"production","url_action":"ad","lang":"ru","con_inf":"transportxxlegkovye-avtomobilixx46"};
data in json dict
How can I get the data from the pages using python and scrapy?
You can do it by extracting the JS code from the <script> block, using a regex to get only the JS object with the data and then loading it using the json module:
import json
query = 'script:contains("GPT.targeting = ")::text'
# Capture only the object literal assigned to GPT.targeting
js_code = response.css(query).re_first(r'targeting = (\{.*\});')
data = json.loads(js_code)
This way, data is a Python dict containing the data from the JS object.
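The same extraction can be tried outside Scrapy with only the standard library. The following is a minimal sketch, assuming the script text is already available as a string; SAMPLE_SCRIPT is a shortened stand-in for the real block, and extract_targeting is a hypothetical helper name, not part of Scrapy.

```python
import json
import re

# Shortened stand-in for the <script> text found on the ad page.
SAMPLE_SCRIPT = '''
var GPT = GPT || {};
GPT.targeting = {"cat_l2":"volkswagen","model":["jetta"],"motor_year":[2006],"ad_price":"4500","currency":"USD"};
'''


def extract_targeting(script_text):
    """Return the GPT.targeting object as a dict, or None if it is absent."""
    # "." does not match newlines here, so the capture stays on one line,
    # which mirrors the regex used with re_first above.
    match = re.search(r'GPT\.targeting\s*=\s*(\{.*\});', script_text)
    if match is None:
        return None
    return json.loads(match.group(1))


data = extract_targeting(SAMPLE_SCRIPT)
print(data["ad_price"], data["currency"])  # prints: 4500 USD
```

Returning None when the pattern is missing lets a spider skip pages that lack the block instead of failing inside json.loads.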
More about the re_first method here: https://doc.scrapy.org/en/latest/topics/selectors.html#using-selectors-with-regular-expressions
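Once each ad is a dict, the original problem of values landing in the wrong CSV columns can be avoided by writing every row against a fixed list of column names. The sketch below is one possible approach, not the only one: FIELDNAMES is a hypothetical selection of GPT.targeting keys, the second ad is made-up sample data with a missing field, and list values are joined with "|" purely as an illustrative convention.

```python
import csv
import io

# Hypothetical column set; pick whichever GPT.targeting keys you need.
FIELDNAMES = ["ad_title", "model", "motor_year", "fuel_type", "ad_price", "currency"]


def to_row(data):
    """Flatten one targeting dict into a row with a fixed set of columns."""
    row = {}
    for key in FIELDNAMES:
        value = data.get(key, "")  # missing fields become empty cells
        if isinstance(value, list):
            value = "|".join(str(v) for v in value)
        row[key] = value
    return row


# Made-up sample ads; the second one lacks "model" and "motor_year".
ads = [
    {"ad_title": "volkswagen-jetta", "model": ["jetta"], "motor_year": [2006],
     "ad_price": "4500", "currency": "USD"},
    {"ad_title": "fiat-500", "fuel_type": ["543"],
     "ad_price": "3900", "currency": "USD"},
]

buffer = io.StringIO()  # stands in for an open CSV file
writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
writer.writeheader()
for ad in ads:
    writer.writerow(to_row(ad))
csv_text = buffer.getvalue()
print(csv_text)
```

Inside a Scrapy project, yielding the flattened dict from the spider and listing the columns in the FEED_EXPORT_FIELDS setting should give the same fixed column order without a hand-written writer.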
Related
get only specific element of a JSON database stored in server by url request
My website relies on a database which is a big JSON file like this:
var myjsonData = [ { "ID": 0, "name": "Henry", "surname": "McLarry", "...": "...", }]
I generate this data every month at high cost to me, so I would like to avoid loading it directly in my HTML <head>, because that allows any user to download the full database in no time. I would like to build "something" that can fetch only specific items from the JSON file (just the one I want to show) without exposing the full .json on the client side. Today I use the call var myvar= myjsonData.ID.Name to get "Henry" into myvar; I would like to build something like var myvar = mycallfunction(ID,Name). I tried PHP as an intermediary, but my AJAX calls from JavaScript fail to fetch the data. Can I use jQuery with the JSON URL to get only the item I need?
You can parse your JSON into an object and then read any value you want from it. Example:
var myjsonData = '{"ID": 0,"name": "Henry","surname": "McLarry"}';
var obj = JSON.parse(myjsonData);
console.log(obj.ID); // prints the id
console.log(obj.name); // prints the name
console.log(obj.surname); // prints the surname
So you have a NoSQL database which has only one kind of document: the full JSON element you use in your website. In that scenario you have three options:
1. Depending on the NoSQL database you're using, you can limit the fields that are returned (e.g. for MongoDB, see https://docs.mongodb.com/manual/tutorial/project-fields-from-query-results/).
2. Change the way you store your data into more modular documents and put the logic to connect them in your application. So instead of one big document you'll have modular ones such as Users, Products, Transactions, etc., and your application can query them individually.
3. Build server-side logic as an API to deal with your data and provide only what you need. The API (which can be Node.js, PHP, or whatever you like) will get the full JSON, and its endpoints will return only the data you want. For example: myapi.com/getUser, myapi.com/getProducts and so on.
If you're able to provide more info on the technologies you're using, that would help us. Hope that helped :).
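The third option (a server-side lookup that returns only the requested item) could look roughly like the sketch below. It is written in Python only for illustration, since the answer leaves the server language open; RECORDS reuses the sample record from the question, while ALLOWED_FIELDS and get_field are hypothetical names. A real endpoint would wrap get_field in whatever web framework the server uses.

```python
import json

# Stand-in for the large JSON file, loaded once on the server and never
# sent to the client as a whole (sample record taken from the question).
RECORDS = json.loads('[{"ID": 0, "name": "Henry", "surname": "McLarry"}]')

# Index by ID so a lookup does not scan the whole list.
INDEX = {record["ID"]: record for record in RECORDS}

# Hypothetical whitelist of fields the API is willing to expose.
ALLOWED_FIELDS = {"name", "surname"}


def get_field(record_id, field):
    """Return one field of one record, or None if unknown or not exposed."""
    if field not in ALLOWED_FIELDS:
        return None
    record = INDEX.get(record_id)
    if record is None:
        return None
    return record.get(field)


print(get_field(0, "name"))  # prints: Henry
```

Because the client only ever receives a single value per request, the full data set stays on the server; rate limiting would still be needed to stop someone from enumerating every ID.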
How to query JSON with JS API to return JSON properties?
Apologies if this seems basic to some, but I'm new to JS/Node.js/JSON and still finding my way. I've searched this forum for an hour but cannot find a specific solution. I have a basic website running off a local Node.js server, along with 2 JSON data files with information about 32 local suburbs. An example of an API GET request URL on the site would be: .../api/b?field=HECTARES
The JSON files are structured like this: (image: JSON structure)
In the JSON file there are 32 Features (suburbs), each with its own list of Properties as shown above. What I am trying to do is use the API 'field' query to push the HECTARES value of each of the 32 Features into a single output variable. The code below shows how far I have got:
var fieldStats = [];
var fieldQ = req.query['field'];
for (i in suburbs.features) {
x = suburbs.features[i].properties.HECTARES;
fieldStats.push(x);
}
As you can see, "HECTARES" is hard-coded above. I need to be able to pass the 'fieldQ' variable to this code but have no idea how to. Advice appreciated!
Use bracket notation, exactly the same syntax you are using just above for features[i]: suburbs.features[i].properties[fieldQ];
How to Access Oracle Database Table Column Data within Javascript in Oracle ApEx
I have a column in a database table that contains several urls and I was wondering what is the best way to get these urls from the database table into a javascript function. Example code of how to approach this would be much appreciated. Thanks.
If you can get the URLs into page items via PL/SQL code, then you can access the page item values from JavaScript like this:
url1 = $v('P1_URL1');
url2 = $v('P1_URL2');
For example, you could have an on-load PL/SQL process like:
select url1, url2 into :p1_url1, :p1_url2 from my_urls where ...;
To put several URLs into an array you could use the PL/JSON library - see this example. Again, this would be PL/SQL code to put the JSON array into a page item, which you can then access from JavaScript using $v(). Or you could use AJAX as described here.
Local HTML 5 database usable in Mac Dashboard widgets?
I'm trying to use HTML 5's local database feature on a Mac Dashboard widget. I'm programming in Dashcode the following javascript: if (window.openDatabase) { database = openDatabase("MyDB", "1.0", "Sample DB", 1000); if (database) { ...database code here... } } Unfortunately the database-variable remains always null after the call to openDatabase-method. I'm starting to think that local databases are not supported in Widgets... Any ideas? /pom
No, you will not be able to do the above. And even if you could, you would not be able to distribute the widget without distributing the database, assuming it was MySQL or SQLite (I'm not sure what you mean by HTML 5's local DB). There are a number of ways round this:
You can add a data source, which can be a JSON file, an XML file or an RSS feed. To do this with JSON, for example, you would write a page on a server in PHP or similar that accesses a database, so that when the URL is called the result is a JSON string. Take the JSON string, parse it and use it in the widget. This will let you get data but not save it.
Another way would be to use the user preferences. This allows you to save and retrieve data in the individual widget:
var preferenceKey = "key"; // replace with the key for a preference
var preferenceValue = "value"; // replace with a preference to save
// Preference code
widget.setPreferenceForKey(preferenceValue, preferenceKey);
You can then retrieve it with:
var preferenceForKey = "key"; // replace with the key for a preference
// Preference code
preferenceForKey = widget.preferenceForKey(preferenceForKey);
The external call (you could also use REST) will let you read any amount of data in, and the preferences will let you save data for later reuse that will survive logouts and shutdowns. The Apple site has a lot of information about widgets, and tutorials as well, that are worth working through. Hope this helps.
JQuery and JSON
Here's something I want to learn and do. I have a JSON file that contains my products and their details (size, color, description). On the website I can't use PHP and MySQL; I can only use JavaScript and HTML. What I want is to read and write a JSON file using jQuery (the JSON file will serve as my database). I am not sure if it can be done using only jQuery and JSON.
First, how do I query a JSON file? (Example: I would search for the name and color of a product.)
How do I render the JSON data that was found into HTML?
How do I add products and details to the JSON file?
It would also be great if you could point me to a good tutorial covering my questions. I'm new to both jQuery and JSON. Thanks!
Since JavaScript is client side, you won't be able to write to the JSON file on the server using only JavaScript. You would need some server-side code in order to do that. Reading and parsing the JSON file is not a problem though. You would use the jQuery.getJSON function. You would supply both a url and a callback parameter (data isn't needed, because you're reading a file, so there is no need to send data). The url would be the path to your JSON file, and the callback would be a function that uses the data. Here's an example of what your code might look like. I don't know exactly what your JSON is, but if you have a set called "products" containing a set of objects with the details "name" and "price", this code would read those out:
$.getJSON("getProductJSON.htm", function(data) {
$.each(data.products, function(i, item) {
var name = item.name;
var price = item.price;
// now display the name and price on the page here!
});
});
Basically, the data variable in $.getJSON makes the entire contents of the JSON available to you, very easily. And $.each is used to loop over a set of JSON objects.