PubCrawl Detects Automated Abuse Of Websites

Published: 22/11/2024   Category: security




Academic researchers create a program to detect unwanted and malicious Web crawlers, blocking them from harvesting proprietary and sensitive data



A group of researchers from the University of California, Santa Barbara, and Northeastern University has created a system called PubCrawl for detecting Web crawlers, even if the automated bots are coming from a distributed collection of Internet addresses.
The system combines multiple methods of discriminating between automated traffic and normal user requests, using both content and timing analysis to model traffic from a collection of IP addresses, the researchers stated in a paper to be presented at the USENIX Security Conference on Friday.
Websites want to allow legitimate visitors to get the data they need from their pages while blocking wholesale scraping of content by competitors, attackers, and others who want to use the data for non-beneficial purposes, says Christopher Kruegel, an associate professor at UCSB and one of the authors of the paper.
"You want to make it easy for one person to get a small slice of the data," he says. "But you don't want to allow one person to get all the information."
Using data from a large, unnamed social network, the team trained the PubCrawl system to detect automated crawlers and then deployed the system to block unwanted traffic to a production server. The researchers had a high success rate: Crawlers were positively identified more than 95 percent of the time, with perfect detection of unauthorized crawlers and nearly 99 percent recognition of crawlers that masquerade as Web bots from a legitimate service.
A significant advance for crawler detection is recognizing the difference in traffic patterns between human visitors and Web bots, says Gregoire Jacob, a research scientist at UCSB and another co-author of the paper. By looking at the distribution of requests over time, the system can more accurately detect bots. When the researchers graphed a variety of traffic patterns, the differences became obvious, says Jacob.
"We realized that there is a fundamental difference," he says. "A crawler is a very stable signal -- it's almost a square signal. With a user, there is a lot of variation."
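That flatness can be captured with a simple statistic over a source's request timestamps: how little its per-window request rate varies. The sketch below only illustrates the idea, not PubCrawl's actual model; the window size and cutoff are assumed values.

import statistics

def looks_like_crawler(timestamps, window=60.0, cv_threshold=0.3):
    """Bucket one source's request timestamps (in seconds) into fixed windows
    and measure how flat the request rate is.  A near-constant ('square') rate
    suggests automation; bursty, irregular rates suggest a human.  The window
    size and threshold here are illustrative assumptions."""
    if len(timestamps) < 10:
        return False                        # too little traffic to judge
    start = min(timestamps)
    counts = {}
    for t in timestamps:
        bucket = int((t - start) // window)
        counts[bucket] = counts.get(bucket, 0) + 1
    rates = list(counts.values())
    if len(rates) < 3:
        return False
    mean = statistics.mean(rates)
    cv = statistics.pstdev(rates) / mean    # coefficient of variation
    return cv < cv_threshold                # flat signal -> likely a crawler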
[A Web-security firm launches a site for cataloging Web bots, the automated programs that crawl websites to index pages, grab competitive price information, gather information on social-networking users, or scan for vulnerabilities. See Gather Intelligence On Web Bots To Aid Defense.]
The researchers did not stop at using the signal patterns to improve the accuracy of their system. The team also tried to link similar patterns between disparate Internet sources that could indicate a distributed Web crawler. The PubCrawl system clusters Internet addresses that demonstrate similar traffic patterns into crawling campaigns.
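One simple way to realize that kind of grouping -- offered here only as a sketch, not as PubCrawl's published algorithm -- is to treat each address's per-window request counts as a time series and merge addresses whose series rise and fall together:

from itertools import combinations

def cluster_campaigns(rate_series, max_dist=0.2):
    """Greedy single-link clustering of per-IP request-rate series.
    rate_series maps an IP address to a list of per-window request counts
    (assumed to cover the same time windows).  Distance is 1 minus the
    Pearson correlation, so addresses with synchronized traffic shapes end
    up in the same 'campaign'.  Purely illustrative."""
    def distance(a, b):
        n = min(len(a), len(b))
        a, b = a[:n], b[:n]
        mean_a, mean_b = sum(a) / n, sum(b) / n
        cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
        var_a = sum((x - mean_a) ** 2 for x in a) ** 0.5
        var_b = sum((y - mean_b) ** 2 for y in b) ** 0.5
        if var_a == 0 or var_b == 0:
            return 1.0
        return 1.0 - cov / (var_a * var_b)

    clusters = {ip: {ip} for ip in rate_series}        # every IP starts alone
    for ip_a, ip_b in combinations(rate_series, 2):
        if distance(rate_series[ip_a], rate_series[ip_b]) <= max_dist:
            merged = clusters[ip_a] | clusters[ip_b]   # merge the two groups
            for ip in merged:
                clusters[ip] = merged
    return {frozenset(group) for group in clusters.values()}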
Such distributed networks are the main threat to any attempt to prevent content scraping. PubCrawl can be set to allow a certain number of free requests per Internet address -- under that limit, no request will be denied. Once above that limit, the system will attempt to identify the traffic pattern. Attacks that use a very large number of low-bandwidth requests could therefore escape notice.
"That is the limit of the detection: when attackers are able to mimic users in a distributed, non-regular fashion, it makes it difficult to catch," says UCSB's Kruegel. "But right now, attackers are very far from that."
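In effect, the per-address allowance is a cheap gate in front of the heavier pattern analysis. The limit below is an arbitrary placeholder, not a value from the paper, and handle_request is a hypothetical helper used only for illustration:

FREE_REQUESTS = 500    # illustrative per-address allowance, not PubCrawl's setting

def handle_request(ip, request_counts, analyze_pattern):
    """Serve everything under the free allowance; only traffic above the limit
    is handed to the more expensive pattern analysis (for example, the timing
    check sketched earlier)."""
    request_counts[ip] = request_counts.get(ip, 0) + 1
    if request_counts[ip] <= FREE_REQUESTS:
        return "allow"
    return "suspect" if analyze_pattern(ip) else "allow"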
For traffic above the minimum threshold that does not match any known pattern, the PubCrawl system uses an active countermeasure, forcing the user to input the occasional CAPTCHA. Sources of requests that ask for non-existent pages, fail to revisit pages, have odd referrer fields, and ignore cookies will all be flagged as automated crawlers much more quickly.
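Those secondary signals lend themselves to a simple per-source score that decides when to escalate to a CAPTCHA. The counters, weights, and cutoff below are assumptions made for illustration only:

def escalation_score(stats):
    """Combine behavioural red flags for one traffic source.
    `stats` is an assumed dict of per-source counters, for example:
      {'requests': 1200, 'not_found': 90, 'revisits': 0,
       'odd_referrer': 40, 'cookies_ignored': True}
    Returns a score; a caller might serve a CAPTCHA above some cutoff."""
    total = max(stats.get("requests", 0), 1)
    score = 0.0
    score += 2.0 * stats.get("not_found", 0) / total     # many requests for non-existent pages
    if stats.get("revisits", 0) == 0 and total > 100:    # never revisits a page
        score += 1.0
    score += 1.5 * stats.get("odd_referrer", 0) / total  # implausible referrer fields
    if stats.get("cookies_ignored", False):              # drops cookies on every request
        score += 1.0
    return score

# Example policy (the cutoff is another assumption): serve a CAPTCHA when
# escalation_score(stats) exceeds 1.5, otherwise keep watching the source.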
Much of this is not new to the industry, says Matthew Prince, CEO of Cloudflare, a website availability and security service. Companies such as Incapsula, Akamai, and Cloudflare have already created techniques to find and classify Web crawlers.
"We see a huge amount of traffic and are able to automatically classify most of the Web's bots and crawlers in order to better protect sites from bad bots while ensuring they're still accessible by good bots," Prince says.
Rival security firm Incapsula has noted the increase in automated Web traffic, which, in February, reached 51 percent of all traffic seen by websites. While 20 percent of Web requests come from search-engine indexers and other good bots, 31 percent come from competitors' and other intelligence-gathering bots, as well as site scrapers, comment spammers, and vulnerability scanners.
Yet with Web traffic set to increase five-fold by 2016, teasing out which traffic is good and which is bad will become more difficult, says Sumit Agarwal, vice president of product management for security start-up Shape Security.
"Being able to control your website while being open and accessible is going to be the biggest challenge for Web service firms in the future," he says.
