Web crawling and its limitations
Let's say we host a file on the Internet that can be publicly accessed if you know the direct URL. There are no links pointing to the file, and directory listings have been disabled on the server. So while it is publicly available, the page cannot be reached except by entering the exact URL to the file. What are the chances that a web crawler of any kind (benign or malicious) will find this file by scanning, and then index it?
To me, even though it is publicly available, finding the file would require luck or specific knowledge. It's like burying gold in the backyard: for someone to find it without a map, they would first have to know that something is buried there at all.
I just don't see any other way this would be discovered, but that's why I'm asking the Stack Overflow community.
Thanks.
Security through obscurity never works. You say you are not going to link to the file, and I believe you. But nothing prevents your users from sharing the link, intentionally or unintentionally. As ceejayoz stated, there are so many different places to post links. And there are even "bookmark sync" services that people assume are private but are actually open to the world.
So, use real authentication. If you don't, you will regret it later.
Links can appear anywhere - someone can tweet a link to it, or post it on Facebook, or in a blog comment. It only takes one.
If it is important that it does not appear anywhere, put it behind a password.
If it doesn't matter that much, but you would still prefer it not to be easily reachable through a search engine, use a robots.txt file to block well-behaved crawlers.
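As a minimal sketch (the /private/ path is a placeholder, not anything from the question), a robots.txt at the site root could look like this. Note that it only deters crawlers that choose to honor it, and that it publicly advertises the disallowed path to anyone who reads the file, so disallow a directory rather than the exact file:

```
# robots.txt served at the site root, e.g. https://example.com/robots.txt
# Well-behaved crawlers will skip this path; malicious ones will not.
User-agent: *
Disallow: /private/
```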
Clickstream data that is bought and sold upstream may also result in otherwise unlinked content being discovered: http://en.wikipedia.org/wiki/Clickstream
Assuming this:
- Directory listing: disabled.
- Nobody knows about the existence of the page.
- Your file contains no outbound links (otherwise a visitor's browser may send your URL as the Referer header to the linked site).
- robots.txt is configured correctly.
- You trust everyone who knows the link not to pass it on to anyone else.
- You are lucky.
Then yes, your page probably won't be found.
Conclusion?
Use a .htaccess file to protect your data.
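As a hedged sketch for Apache (the paths and realm name are placeholders), HTTP Basic Auth via .htaccess might look like the following; the password file is created separately with the htpasswd utility:

```
# .htaccess in the directory you want to protect (Apache, mod_auth_basic)
AuthType Basic
AuthName "Restricted Area"
# /etc/apache2/.htpasswd is a placeholder path; create it with:
#   htpasswd -c /etc/apache2/.htpasswd someuser
AuthUserFile /etc/apache2/.htpasswd
Require valid-user
```

Unlike robots.txt, this actually denies the request instead of politely asking crawlers to stay away.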
You're right. Web robots (metaphorically, spiders) need to be able to traverse the web by following hyperlinks in order to reach your page.
To get your hypothetical page into search engine results, you would have to manually submit its URL to the search engine. There are several services for doing this; see "Submitting URLs to Search Engines".
Also, your page will only show up if the search engine determines that it has enough metadata / karma within the search engine's ranking system. See SEO and meta keywords.
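Conversely, if the goal is to keep the page out of the index even when a crawler does reach it, the usual signal is a robots meta tag in the page itself (again, honored only by well-behaved engines):

```html
<!-- In the <head> of the page you want kept out of search results -->
<meta name="robots" content="noindex, nofollow">
```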
Yes, you're right. When a crawler visits a URL, it identifies all the hyperlinks on the page and adds them to its list of URLs to visit, called the crawl frontier. But some of those hyperlinks are bad links. Once users click a bad link and land on a malware site, they are often greeted with a fake codec dialog box. If that fails, the site is still loaded with dozens of other tactics to infect their computer: fake toolbars, scareware, rogue software, and so on. These sites have it all. One site even tried to install 25 different pieces of malware. Such sites leave people vulnerable to spam bots, rootkits, password stealers, and a host of Trojan horses, among other things.
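To make the "crawl frontier" idea concrete, here is a minimal sketch of a crawler loop in Python (standard library only; the seed URL, depth limit, and helper names are placeholder choices, not anything from the answers above). It also shows why a page nobody links to never enters the frontier:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags on a fetched page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=10):
    frontier = deque([seed_url])  # the crawl frontier: URLs waiting to be visited
    seen = {seed_url}
    while frontier and len(seen) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip unreachable or non-HTTP pages
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            # A URL can only be appended here if some fetched page links to it,
            # which is why an unlinked file stays invisible to crawlers.
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return seen

if __name__ == "__main__":
    print(crawl("https://example.com/"))  # placeholder seed URL
```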