I'm trying to parse a specific part of a URL after a search, using
any language (ideally JavaScript, but I'm open to Python).
How do I get a specific part of the URL and save/store it?
For example,
on songkick.com,
the way to get the artist_id is to check a specific part of the URL after searching for the artist name
in the search bar of the website.
In the case below,
the artist ID is 301329.
https://www.songkick.com/artists/301329-rac
I strongly believe there is a way to parse this part using either Python or JS,
given that I have a CSV file with artist names in one column. Instead of searching for all the artists one by one, I'm looking for an algorithm that iterates over my CSV column, searches for each name, parses the URL, and saves/stores the ID.
I would be very grateful for even a hint to start with.
Thank you so much.
It can be done using regular expressions.
Here's an example of a JavaScript implementation
const url = "https://www.songkick.com/artists/301329-rac";
const regex = /https:\/\/www\.songkick\.com\/artists\/(\d+)-.+/;
const match = url.match(regex);
if (match) {
  console.log('Artist ID: ' + match[1]);
} else {
  console.log('No Artist ID found!');
}
This regular expression, /https:\/\/www\.songkick\.com\/artists\/(\d+)-.+/, matches a string that starts with https://www.songkick.com/artists/, followed by a group of digits, a dash, and then one or more characters.
The match() method retrieves the result of matching a string against a
regular expression.
It returns an array with the overall match at index 0, followed by the captured (\d+) group at index 1 (match[1] in our case), or null when there is no match.
If you're not sure of the protocol (http vs https) you can add a ? in the regex right after https. That makes the s in https optional. So the regex would become /https?:\/\/www\.songkick\.com\/artists\/(\d+)-.+/.
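To connect this with the CSV part of the question, here is a minimal sketch that applies the optional-s regex to a list of URLs. The helper name extractArtistId and the sample list are assumptions for illustration (the third URL is made up to show the no-match case); how you obtain the artist page URL for each name in your CSV is a separate step not covered here.

```javascript
// Hypothetical helper: returns the numeric artist ID as a string, or null
// when the URL is not a Songkick artist page.
function extractArtistId(url) {
  const match = url.match(/https?:\/\/www\.songkick\.com\/artists\/(\d+)-.+/);
  return match ? match[1] : null;
}

// Assumed sample input. In practice these would be the artist page URLs
// collected for each name in the CSV column.
const urls = [
  'https://www.songkick.com/artists/301329-rac',
  'http://www.songkick.com/artists/301329-rac',
  'https://www.songkick.com/not-an-artist-page', // made-up non-matching URL
];

const ids = urls.map(extractArtistId);
console.log(ids); // [ '301329', '301329', null ]
```

Returning null for non-matching URLs makes it easy to spot rows that need a manual check.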
Let me know if you need more explanation.
You can simply use a regular expression.
In Python:
import re
url = 'https://www.songkick.com/artists/301329-rac'
# Raw string so that \d and \w reach the regex engine unchanged
pattern = r'/artists/(\d+)-\w'
match = re.search(pattern, url)
if match:
    artist_id = match.group(1)
I hope this will help you.
I'm struggling to figure out the best way to strip out all the content in a URL from a specific keyword onwards (including the keyword), using either regex or a substring operation. So if I have an example dynamic URL http://example.com/category/subcat/filter/size/1/ - I would like to strip out the /filter/size/1 element of the URL and leave me with the remaining URL as a separate string. Grateful for any pointers. I should clarify that the number of arguments after the filter keyword isn't fixed and could be more than in my example and the number of category arguments prior to the filter keyword isn't fixed either
To be a little safer you could use the URL object to handle most of the parsing and then
just sanitize the pathname.
const filteredUrl = 'http://example.com/category/subcat/filter/test?param1&param2=test';
console.log(unfilterUrl(filteredUrl));
function unfilterUrl(urlString) {
  const url = new URL(urlString);
  url.pathname = url.pathname.replace(/(?<=\/)filter(\/|$).*/i, '');
  return url.toString();
}
You can tweak this a little based on your needs, for example to handle the case where filter is not present in the URL. Assuming it is present, consider the following regular expression:
/(.*)\/filter\/(.*)/g
The first captured group (available as $1) is the portion of the string before the filter keyword, and the second captured group (available as $2) contains all the filters that follow the filter keyword.
Have a look at the example I tried on regextester.com.
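As a rough sketch of how the two groups can be read in code: the function name splitAtFilter and the fallback for URLs without the keyword are my own assumptions, not part of the answer above.

```javascript
const url = 'http://example.com/category/subcat/filter/size/1/';
const pattern = /(.*)\/filter\/(.*)/;

// Via replace(): the pattern covers the whole string, so substituting a
// single group reference leaves just that group.
const base = url.replace(pattern, '$1');
const filters = url.replace(pattern, '$2');
console.log(base);    // http://example.com/category/subcat
console.log(filters); // size/1/

// Hypothetical helper using match(), which also copes with URLs that do not
// contain the keyword (replace() would return the URL unchanged there).
function splitAtFilter(input) {
  const match = input.match(pattern);
  if (!match) return { base: input, filters: '' };
  return { base: match[1], filters: match[2] };
}

console.log(splitAtFilter('http://example.com/category/'));
```

Because the first (.*) is greedy, a URL containing /filter/ more than once is split at the last occurrence.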
Use the split() function.
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/split
const url = 'http://example.com/category/subcat/filter/size/1/';
console.log(url.split('/filter')[0]);
Split
The simplest solution that occurs to me is the following:
const url = 'http://example.com/category/subcat/filter/size/1/';
const [base, filter] = url.split('/filter/');
// where:
// base == 'http://example.com/category/subcat'
// filter == 'size/1/'
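If you expect more than one occurrence of '/filter/', you can pass the limit parameter of String.split(): url.split('/filter/', 2). Note that limit truncates the result to that many pieces; it does not keep the remainder of the string in the last piece.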
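To illustrate what the limit parameter does when the keyword occurs twice, here is a small sketch. The sample URL with two '/filter/' segments is made up, and the indexOf/slice variant is an alternative I am suggesting for when you want to keep everything after the first occurrence.

```javascript
// Made-up URL containing the keyword twice.
const url = 'http://example.com/category/filter/size/1/filter/color/red/';

// limit = 2 keeps only the first two pieces; the third piece is discarded.
const pieces = url.split('/filter/', 2);
console.log(pieces); // [ 'http://example.com/category', 'size/1' ]

// Alternative: cut at the first occurrence and keep the whole remainder.
const keyword = '/filter/';
const cut = url.indexOf(keyword);
const base = cut === -1 ? url : url.slice(0, cut);
const rest = cut === -1 ? '' : url.slice(cut + keyword.length);
console.log(base); // http://example.com/category
console.log(rest); // size/1/filter/color/red/
```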
RegExp
The assumption of the above is that after the filter parameter, everything is part of the filter. If you need more granularity, you can use a regex that terminates at the '?', for example. This removes 'filter/anything/that/follows' wherever it immediately follows a /, up to but not including the first query string separator ?.
const filterRegex = /(?<=\/)filter(\/|$)[^?]*/i;
function parseURL(url) {
  const match = url.match(filterRegex);
  if (!match) { return [url, null, null]; } // expect anything
  const stripped = url.replace(filterRegex, '');
  return [url, stripped, match[0]];
}
const [full, stripped, filter] = parseURL('http://example.com/category/subcat/filter/size/1/?query=string');
// where:
// stripped == 'http://example.com/category/subcat/?query=string'
// filter == 'filter/size/1/'
I'm sadly not able to post the full answer here, as it's telling me 'it looks like spam'. I created a gist with the original answer. In it I talk about the details of String.prototype.match and of JS/ES regex in general, including named capture groups and pitfalls, and include a link to a great regex tool: regex101. I'm not posting the link here for fear of triggering the filter again. But back to the topic:
In short, a simple regex can be used to split and format it (using filter as the keyword):
/^(.*)(\/filter\/.*)$/
or with named groups:
/^(?<main>.*)(?<stripped>\/filter\/.*)$/
(note that the forward slashes need to be escaped in a regex literal)
Using String.prototype.match with that regex will return an array of the matches: index 1 will be the first capture group (so everything before the keyword), index 2 will be everything after that (including the keyword).
Again, all the details can be found in the gist
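For reference, here is a short sketch of what match() returns with the named-group version of the regex, using the example URL from the question. Nothing here goes beyond String.prototype.match and the groups property of its result.

```javascript
const url = 'http://example.com/category/subcat/filter/size/1/';
const match = url.match(/^(?<main>.*)(?<stripped>\/filter\/.*)$/);

if (match) {
  // Positional access: index 1 and 2 are the two capture groups.
  console.log(match[1]); // http://example.com/category/subcat
  console.log(match[2]); // /filter/size/1/

  // Named access: the same values under match.groups.
  console.log(match.groups.main);     // http://example.com/category/subcat
  console.log(match.groups.stripped); // /filter/size/1/
}

// match() returns null when the keyword is absent, so always check first.
console.log('http://example.com/category/'.match(/^(?<main>.*)(?<stripped>\/filter\/.*)$/)); // null
```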
I am writing an application in Node.js that allows users to mention each other in messages, like on Twitter. I want to be able to find the user and send them a notification. In order to do this, I need to pull #usernames out of a string in Node.js to find the mentions.
Any advice, regex, problems?
I have found that this is the best way to find mentions inside of a string in javascript.
var str = "#jpotts18 what is up man? Are you hanging out with #kyle_clegg";
var pattern = /\B#[a-z0-9_-]+/gi;
str.match(pattern);
// ["#jpotts18", "#kyle_clegg"]
I have purposefully restricted it to upper- and lowercase alphanumeric characters and the (-, _) symbols, so that periods are not treated as part of a username, as in (#j.potts).
This is what twitter-text.js is doing behind the scenes.
// Mention related regex collection
twttr.txt.regexen.validMentionPrecedingChars = /(?:^|[^a-zA-Z0-9_!#$%&*#@]|RT:?)/;
twttr.txt.regexen.atSigns = /[#@]/;
twttr.txt.regexen.validMentionOrList = regexSupplant(
  '(#{validMentionPrecedingChars})' +  // $1: Preceding character
  '(#{atSigns})' +                     // $2: At mark
  '([a-zA-Z0-9_]{1,20})' +             // $3: Screen name
  '(\/[a-zA-Z][a-zA-Z0-9_\-]{0,24})?'  // $4: List (optional)
, 'g');
twttr.txt.regexen.endMentionMatch = regexSupplant(/^(?:#{atSigns}|[#{latinAccentChars}]|:\/\/)/);
Please let me know if you have used anything that is more efficient, or accurate. Thanks!
Twitter has a library that you should be able to use for this. https://github.com/twitter/twitter-text-js.
I haven't used it, but if you trust its description, "the library provides autolinking and extraction for URLs, usernames, lists, and hashtags." You should be able to use it in Node with npm install twitter-text.
While I understand that you're not looking for Twitter usernames, the same logic still applies and you should be able to use it fine (it does not validate that extracted usernames are valid twitter usernames). If not, forking it for your own purposes may be a very good place to start.
Edit: I looked at the docs closer, and there is a perfect example of what you need right here.
var usernames = twttr.txt.extractMentions("Mentioning #twitter and #jack")
// usernames == ["twitter", "jack"]
Here is how you can extract mentions from an Instagram caption with JavaScript and Underscore.
var _ = require('underscore');
function parseMentions(text) {
  var mentionsRegex = new RegExp('#([a-zA-Z0-9\_\.]+)', 'gim');
  var matches = text.match(mentionsRegex);
  if (matches && matches.length) {
    matches = matches.map(function(match) {
      return match.slice(1);
    });
    return _.uniq(matches);
  } else {
    return [];
  }
}
I would also allow names with diacritics, or letters from any language, by using \p{L}.
/(?<=^| )#\p{L}+/gu
Example on Regex101.com with description.
PS:
Don't use \B since it will match ##wrong.
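A quick side-by-side sketch of the two patterns from this thread may help show the difference. The sample sentence is made up, and it assumes a runtime that supports lookbehind and Unicode property escapes (recent Node.js and current browsers do).

```javascript
// Made-up sample text with diacritics and a doubled marker.
const text = '#josé please ping #müller, not ##wrong';

// \B version: stops at the first non-ASCII letter and also picks up the
// second marker of '##wrong'.
const withBoundary = text.match(/\B#[a-z0-9_-]+/gi);
console.log(withBoundary); // [ '#jos', '#m', '#wrong' ]

// Lookbehind version: the marker must be at the start of the string or
// after a space, and \p{L} accepts letters from any script.
const withLookbehind = text.match(/(?<=^| )#\p{L}+/gu);
console.log(withLookbehind); // [ '#josé', '#müller' ]
```

Note that \p{L}+ only allows letters, so digits and underscores in a username would need to be added to the pattern if you want to accept them.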
I currently pass two variables into the URL for use on another page. I get the last variable (i.e. #12345) with location.hash. Then from the other part of the URL (john%20jacob%202) all I need is the '2'. I've got it working, but I feel there must be a cleaner, more succinct way to handle this. The (john%20jacob%202) part will change all the time and have different string lengths.
url: http://localhost/index.html?john%20jacob%202?#12345
<script>
var hashUrl = location.hash.replace("?","");
// function here to use this data
var fullUrl = window.location.href;
var urlSplit = fullUrl.split('?');
var justName = urlSplit[1];
var nameSplit = justName.split('%20');
var justNumber = nameSplit[2];
// function here to use this data
</script>
A really quick one-liner could be something like:
let url = 'http://localhost/index.html?john%20jacob%202?#12345';
url.split('?')[1].split('').pop();
// returns '2'
How about something like
decodeURI(window.location.search).replace(/\D/g, '')
Since your window.location.search is URI encoded we start by decoding it. Then replace everything that is not a number with nothing. For your particular URL it will return 2
Edit for clarity:
Your example location http://localhost/index.html?john%20jacob%202?#12345 consists of several parts, but the interesting one here is the part after the ? and before the #.
In Javascript this interesting part, the query string (or search), is available through window.location.search. For your specific location window.location.search will return ?john%20jacob%202?.
The %20 is a URI encoded space. To decode (ie. remove) all the URI encodings I first run the search string through the decodeURI function. Then I replace everything that is not a number in that string with an empty string using a regular expression.
The regular expression /\D/ matches any character that is not a number, and the g is a modifier specifying that I want to match everything (not just stop after the first match), resulting in 2.
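Here is the same idea as a sketch that runs outside the browser by parsing the example address with the URL object instead of window.location. The second address (a name that itself contains digits) and the trailing-number regex are assumptions I added to show a limitation of stripping every non-digit.

```javascript
const url = new URL('http://localhost/index.html?john%20jacob%202?#12345');

const search = decodeURI(url.search);         // '?john jacob 2?'
const digitsOnly = search.replace(/\D/g, ''); // every non-digit removed
console.log(digitsOnly);                      // 2
console.log(url.hash.replace('#', ''));       // 12345

// Made-up case: if the name itself contains digits, stripping all
// non-digits glues them together.
const other = decodeURI(new URL('http://localhost/index.html?agent%2047%202?#12345').search);
console.log(other.replace(/\D/g, ''));        // 472

// Taking only the last run of digits avoids that.
const last = other.match(/(\d+)\D*$/);
console.log(last ? last[1] : null);           // 2
```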
If you know you are always after the part that follows the "#", you could remove everything up to and including the "#":
url.replace(/^.+#/, '');
Alternatively, this regex will match the last numbers in your URL:
url.match(/(?<=\D)\d+$/);
// (positive lookbehind for any non-digit) one or more digits until the end of the string
I'm currently using Modernizr to determine which link to serve users based on their device. If they're using a mobile device I want to return a URI; if not, I just return a traditional URL.
URI: spotify:album:1jcYwZsN7JEve9xsq9BuUX
URL: https://open.spotify.com/album/1jcYwZsN7JEve9xsq9BuUX
Right now I'm using slice() to retrieve the last 22 characters of the URI. Though it works I'd like to parse the string via regex in the event that the URI exceeds the aforementioned character amount. What would be the best way to get the string of characters after the second colon of the URI?
$(".spotify").attr("href", function(index, value) {
  if (Modernizr.touch) {
    return value;
  } else {
    return "https://open.spotify.com/album/" + value.slice(-22);
  }
});
I would do something like this using split.
var url = 'spotify:album:1jcYwZsN7JEve9xsq9BuUX'.split(':');
var part = url[url.length-1];
// alert(part);
return "https://open.spotify.com/album/" + part;
Regex is appropriate for this task because the pattern is quite simple. Here's a regex that still works no matter how many : characters the URI contains:
/[\w\:]*\:(\w+)/
How it works
[\w\:]* Matches any run of word characters (letters, digits, underscore) and colons
\: Matches a literal colon. Because the preceding [\w\:]* is greedy by default, this ends up being the last colon
(\w+) Matches the word characters after that colon and stores them in a group so we can access them
Use this like:
var string = 'spotify:album:1jcYwZsN7JEve9xsq9BuUX',
parseduri = string.match(/[\w\:]*\:(\w+)/)[1];
parseduri is the result
And then you can finally combine this:
var url = 'https://open.spotify.com/album/'+parseduri;
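Wrapped up as a function, the steps above could look roughly like this. The name spotifyUriToUrl is hypothetical, the album path is hard-coded as in the question, and the short ID in the second call is made up to show that the result no longer depends on a fixed length of 22 characters.

```javascript
// Hypothetical helper: takes the part after the last colon of the URI and
// appends it to the album URL used in the question.
function spotifyUriToUrl(uri) {
  const match = uri.match(/[\w:]*:(\w+)/);
  if (!match) return null; // no colon followed by word characters
  return 'https://open.spotify.com/album/' + match[1];
}

console.log(spotifyUriToUrl('spotify:album:1jcYwZsN7JEve9xsq9BuUX'));
// https://open.spotify.com/album/1jcYwZsN7JEve9xsq9BuUX

console.log(spotifyUriToUrl('spotify:album:abc123')); // made-up shorter ID
// https://open.spotify.com/album/abc123
```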
I have a list of urls like this:
http://www.mylocal.com
http://v1.mylocal.com
http://v2.mylocal.com
http://www.mylocal2.com
http://www.mylocal3.com
I want to write some JS so that if I define the search string as "*.mylocal.com", it returns www.mylocal.com, v1.mylocal.com and v2.mylocal.com. And if the search string is "www.mylocal.com", it returns only www.mylocal.com.
How should I write it?
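The following regex will match what you want when given a host string (host must be set before the regex is built, and the regex has to be rebuilt whenever host changes):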
var reg = new RegExp('^https?://([^.]*' + host + ')');
So, for example:
var host = '.mylocal.com';
reg.exec('http://www.mylocal.com'); // ["http://www.mylocal.com", "www.mylocal.com"]
reg.exec('http://v1.mylocal.com/path'); // ["http://v1.mylocal.com", "v1.mylocal.com"]
reg.exec('https://v3.mylocal.com'); // ["https://v3.mylocal.com", "v3.mylocal.com"]
host = 'www.mylocal.com';
reg.exec('http://www.mylocal.com'); // ["http://www.mylocal.com", "www.mylocal.com"]
reg.exec('http://v1.mylocal.com/path'); // null
reg.exec('https://v3.mylocal.com'); // null
You could also refer to the following post for a full URI regex:
Regular expression validation for URL in ASP.net.
If you want to search on each part of the URL, then do just that.
Split the URL into 3 search strings, then match each one against the corresponding part of your split search term. This way you can control matching at the beginning and end of each term and, if you like, order the rest of the terms appropriately.
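One possible reading of this suggestion, as a minimal sketch: split both the host name and the search string on dots and compare them part by part, treating * as "anything". The function names hostMatches and findHosts are hypothetical, and the sketch assumes the search string has the same number of dot-separated parts as the hosts it should match.

```javascript
// Hypothetical helper: true when every dot-separated part of the host
// equals the corresponding part of the pattern, or that part is '*'.
function hostMatches(pattern, urlString) {
  const hostParts = new URL(urlString).hostname.split('.');
  const patternParts = pattern.split('.');
  if (hostParts.length !== patternParts.length) return false;
  return patternParts.every(
    (part, i) => part === '*' || part === hostParts[i]
  );
}

// The list from the question.
const urls = [
  'http://www.mylocal.com',
  'http://v1.mylocal.com',
  'http://v2.mylocal.com',
  'http://www.mylocal2.com',
  'http://www.mylocal3.com',
];

// Hypothetical wrapper: returns the host names that match the pattern.
function findHosts(pattern) {
  return urls
    .filter((u) => hostMatches(pattern, u))
    .map((u) => new URL(u).hostname);
}

console.log(findHosts('*.mylocal.com'));   // [ 'www.mylocal.com', 'v1.mylocal.com', 'v2.mylocal.com' ]
console.log(findHosts('www.mylocal.com')); // [ 'www.mylocal.com' ]
```

Comparing whole parts avoids having to escape the dots in the search string, which a regex built by string concatenation would otherwise treat as "any character".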