Next, put yourself in their shoes and realize that they usually don't monitor their traffic that closely, or simply don't care as long as you don't slow down their site. It's usually only certain big sites with heavy bot traffic, such as LinkedIn or sneaker shoe sites, that implement bot protections. Most others don't care.
Some websites are built almost as if they want to be scraped. The JSON API used by the frontend is ridiculously clean and accessible. Perhaps they benefit when people see their results and invest in their stock. You never fully know whether a site wants to be scraped or not.
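To make that concrete: instead of parsing HTML, you hit the same JSON endpoint the site's own JavaScript calls. A minimal sketch, where the endpoint and parameter names are purely hypothetical:

```python
import json
import urllib.parse
import urllib.request

# Hypothetical frontend JSON endpoint; find the real one in your
# browser's network inspector while using the site normally.
API_BASE = "https://example.com/api/search"

def build_url(query, page=1):
    """Build the same URL the site's own frontend JS would request."""
    params = urllib.parse.urlencode({"q": query, "page": page})
    return f"{API_BASE}?{params}"

def fetch_json(url):
    """Fetch and decode the JSON response."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

# usage: fetch_json(build_url("widgets", page=2))
```

No HTML parsing, no brittle selectors; the data arrives already structured.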
The reality of the scraping industry, as it relates to your question, is this:
1. Scraping companies generally don't use an honest user agent such as 'my friendly data science bot'; they hide behind a set of fake ones and/or route the traffic through a proxy network. You don't want to get banned so easily by revealing your user agent when you know your competitors don't reveal theirs.
2. This one is obvious. The general rule is to scrape continuously over a long time period and add large delays between requests, of at least 1 second. Be careful if you go below that.
3. robots.txt is controversial and no longer serves its original purpose. It should be renamed google_instructions.txt, because site owners mostly use it to guide googlebot through their site. It is generally ignored by the industry, again because you know your competitors ignore it.
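Points 1 and 2 can be sketched in a few lines of Python. The user-agent strings below are shortened placeholders, not a curated pool:

```python
import random
import time

# Placeholder browser-like user agents (truncated for the example).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
    "Mozilla/5.0 (X11; Linux x86_64) ...",
]

def pick_headers():
    """Rotate through fake user agents instead of announcing a bot."""
    return {"User-Agent": random.choice(USER_AGENTS)}

def polite_delay(base=1.0, jitter=0.5):
    """Sleep at least `base` seconds between requests, plus random jitter."""
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay
```

The jitter matters: perfectly regular 1-second intervals are themselves a bot signature.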
Just remember the rule of 'don't piss off the site owner' and then go ahead and scrape. Also keep in mind that you are in a free country, and we don't discriminate here, whether on racial or gender grounds or on whether you are a biological or mechanical website visitor.
I have simply described the reality of the data science industry around scraping, after several years of being in it. Note that this will probably not be liked by the HN audience, as they are mostly website devs and site owners.
Oh, and for 3: if you can, apply some heuristics to your reading of robots.txt. If it's just "deny everything", then ignore it, but you really don't want to be the one responsible for crawling all of the GET /delete/:id pages of a badly-designed site… (those should definitely be POST requests, and authenticated, by the way).
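That heuristic can be sketched with the standard library's robotparser; the sample rules below are made up for illustration:

```python
from urllib import robotparser

SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /delete/
Disallow: /admin/
"""

def parse_robots(text):
    """Parse robots.txt content into a queryable rule set."""
    rp = robotparser.RobotFileParser()
    rp.parse(text.splitlines())
    return rp

def is_blanket_deny(rp):
    # A "deny everything" policy is the kind the parent suggests ignoring;
    # targeted rules (like /delete/) are the ones worth respecting.
    return not rp.can_fetch("*", "/")

rules = parse_robots(SAMPLE_ROBOTS)
print(rules.can_fetch("*", "/products/42"))  # True: not disallowed
print(rules.can_fetch("*", "/delete/42"))    # False: skip dangerous paths
```

So the crawler respects specific disallows but can choose to treat a bare `Disallow: /` as noise.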
Anonymity is part of the right to privacy; IMO, such a right should extend to bots as well. There should be no shame in anonymously accessing a website, whether via automated means or otherwise.
No, it very much shouldn't, but (as you probably meant) it should extend to the person (not, e.g., the company) using a bot, which amounts to the same thing in this case.
If sensitive data can be scraped, it is not really stored securely. So I would not worry too much about it; I would just notify the owner if I noticed it.
Totally agree with the rest, though. Maybe adapt the "large delay" of 1 second to the kind of website I'm scraping.
Thanks for your feedback!
If the bots aren't querying from residential IPs, you could match their IPs to ASNs and then filter on that to separate residential and data-center origins.
So if you see a known hoster there, you can exclude it from your statistics.
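A minimal sketch of that filtering in Python, using a couple of hand-picked example CIDRs rather than a real ASN feed (in practice you'd map IP to ASN via whois data or a GeoIP-style database):

```python
import ipaddress

# Illustrative data-center ranges only; a real setup would use a full
# IP-to-ASN dataset kept up to date.
DATACENTER_CIDRS = [
    ipaddress.ip_network("151.101.0.0/16"),  # e.g. a Fastly range
    ipaddress.ip_network("104.16.0.0/13"),   # e.g. a Cloudflare range
]

def is_datacenter(ip):
    """True if the IP falls inside a known hosting/CDN range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in DATACENTER_CIDRS)

def split_traffic(ips):
    """Separate residential-looking origins from data-center origins."""
    hosted = [ip for ip in ips if is_datacenter(ip)]
    residential = [ip for ip in ips if not is_datacenter(ip)]
    return residential, hosted
```

Anything landing in the hosted bucket can then be dropped from your visitor statistics.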
$ dig +short reddit.com
126.96.36.199
$ whois 188.8.131.52
NetRange: 184.108.40.206 - 220.127.116.11
CIDR: 18.104.22.168/16
OrgName: Fastly
[...]