Web crawling and its limitations
Let's say we host a file on the Internet that can be publicly accessed if you know the direct URL. There are no links pointing to the file, and directory listings have been disabled on the server. So while it is publicly available, the page cannot be reached except by entering the exact URL to the file. What are the chances that a web crawler of any kind (benign or malicious) will find this file by scanning, and then index it?
To me, even though it is publicly available, finding the file would require luck or specific knowledge. It's like burying gold in the backyard: for someone to find it without a map, they would first have to know that something is buried there at all.
I just don't see any other way this would be discovered, but that's why I'm asking the Stack Overflow community.
Thanks.
Security through obscurity never works. You say you are not going to link to the file, and I believe you. But nothing prevents your users from sharing the link, intentionally or unintentionally. As ceejayoz stated, there are so many different places to post links. And there are even "bookmark sync" services that people assume are private but are actually open to the world.
So, use real authentication. If you don't, you will regret it later.
Links can appear anywhere - someone can tweet a link to it, or post it on Facebook, or in a blog comment. It only takes one.
If it is important that it does not appear anywhere, put it behind a password.
If it doesn't matter that much, but you would still prefer it not to be easily reachable through a search engine, use a robots.txt file to block well-behaved crawlers.
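As a minimal sketch (the /private/ path is a placeholder, not anything from the question), a robots.txt at the site root could look like this. Note that it only deters crawlers that choose to honor it, and that it publicly advertises the disallowed path to anyone who reads the file, so disallow a directory rather than the exact file:

```
# robots.txt served at the site root, e.g. https://example.com/robots.txt
# Well-behaved crawlers will skip this path; malicious ones will not.
User-agent: *
Disallow: /private/
```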
Clickstream data that is bought and sold upstream may also result in otherwise unlinked content being discovered: http://en.wikipedia.org/wiki/Clickstream
Assuming this:
- Directory listing: disabled.
- Nobody knows about the existence of the page.
- Your file contains no outbound links (otherwise a visitor's browser may send your URL as the Referer header to the linked site).
- robots.txt is configured correctly.
- You trust everyone who knows the link not to pass it on to anyone else.
- You are lucky.
Then yes, your page probably won't be found.
Conclusion?
Use a .htaccess file to protect your data.
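As a hedged sketch for Apache (the paths and realm name are placeholders), HTTP Basic Auth via .htaccess might look like the following; the password file is created separately with the htpasswd utility:

```
# .htaccess in the directory you want to protect (Apache, mod_auth_basic)
AuthType Basic
AuthName "Restricted Area"
# /etc/apache2/.htpasswd is a placeholder path; create it with:
#   htpasswd -c /etc/apache2/.htpasswd someuser
AuthUserFile /etc/apache2/.htpasswd
Require valid-user
```

Unlike robots.txt, this actually denies the request instead of politely asking crawlers to stay away.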
You're right. Web robots (metaphorically, spiders) need to be able to traverse the web by following hyperlinks in order to reach your page.
To get your hypothetical page into search engine results, you would have to manually submit its URL to the search engine. There are several services for doing this; see "Submitting URLs to Search Engines".
Also, your page will only show up if the search engine determines that it has enough metadata / karma within the search engine's ranking system. See SEO and meta keywords.
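Conversely, if the goal is to keep the page out of the index even when a crawler does reach it, the usual signal is a robots meta tag in the page itself (again, honored only by well-behaved engines):

```html
<!-- In the <head> of the page you want kept out of search results -->
<meta name="robots" content="noindex, nofollow">
```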
Yes, you're right. When a crawler visits a URL, it identifies all the hyperlinks on the page and adds them to its list of URLs to visit, called the crawl frontier. But some of those hyperlinks are bad links. Once users click a bad link and land on a malware site, they are often greeted with a fake codec dialog box. If that fails, the site is still loaded with dozens of other tactics to infect their computer: fake toolbars, scareware, rogue software, and so on. These sites have it all. One site even tried to install 25 different pieces of malware. Such sites leave people vulnerable to spam bots, rootkits, password stealers, and a host of Trojan horses, among other things.
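To make the "crawl frontier" idea concrete, here is a minimal sketch of a crawler loop in Python (standard library only; the seed URL, depth limit, and helper names are placeholder choices, not anything from the answers above). It also shows why a page nobody links to never enters the frontier:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags on a fetched page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=10):
    frontier = deque([seed_url])  # the crawl frontier: URLs waiting to be visited
    seen = {seed_url}
    while frontier and len(seen) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip unreachable or non-HTTP pages
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            # A URL can only be appended here if some fetched page links to it,
            # which is why an unlinked file stays invisible to crawlers.
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return seen

if __name__ == "__main__":
    print(crawl("https://example.com/"))  # placeholder seed URL
```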