Adam Fields (weblog)

This blog is largely deprecated, but is being preserved here for historical interest. Check out my index page at adamfields.com for more up to date info. My main trade is technology strategy, process/project management, and performance optimization consulting, with a focus on enterprise and open source CMS and related technologies. More information. I write periodic long pieces here, shorter stuff goes on twitter or app.net.

1/29/2006

What’s the big fuss about IP addresses?

Given the recent fuss about the government asking for search terms and what qualifies as personally identifiable information, I want to explain why IP address logging is a big deal. This explanation is somewhat simplified to make the cases easier to understand without going into complete detail of all of the possible configurations, of which there are many. I think I’ve kept the important stuff without dwelling on the boundary cases, and be aware that your setup may differ somewhat. If you feel I’ve glossed over something important, please leave a comment.

First, a brief discussion of what IP addresses are and how they work. Slightly simplified, every device that is connected to the Internet has a unique number that identifies it, and this number is called an IP address. Whenever you send any normal network traffic to any other computer on the network (request a web page, send an email, etc…), it is marked with your IP address.

There are three standard cases to worry about:

  1. If you use dialup, your analog modem has an IP address. Remote computers see this IP address. (This case also applies if you’re using a data aircard, or using your cell phone as a modem.)
  2. If you have a DSL or cable connection, your DSL/cable modem has an IP address when it’s connected, and your computer has a separate internal IP address that it uses only to communicate with the DSL or cable modem, typically mediated by a home router. Remote computers see the IP address of the DSL/cable modem. (This case also applies if you’re using a mobile wifi hotspot.)
  3. If you’re directly connected to the internet via a network adapter, your network adapter has an IP address. Remote computers see this IP address.
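To make the difference between an internal address (case #2) and the address remote computers see a little more concrete, here is a minimal Python sketch using the standard `ipaddress` module. The sample addresses and the helper name `is_internal` are my own illustrations, not part of any particular setup.

```python
import ipaddress

def is_internal(address: str) -> bool:
    """Return True if the address falls in a range reserved for private/internal use."""
    return ipaddress.ip_address(address).is_private

# Illustrative addresses only: the first two are typical home-network
# addresses, the third stands in for "some public address".
for addr in ["192.168.1.10", "10.0.0.5", "8.8.8.8"]:
    label = "internal (not seen by remote computers)" if is_internal(addr) else "public"
    print(addr, "->", label)
```

An internal address like 192.168.1.10 is shared by countless home networks, so it is the public address in front of it that ends up in remote logs.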

Sometimes, IP addresses are static, meaning they’re manually assigned and don’t change automatically unless someone changes them (typically, only for case #3). Often, they’re dynamic, which means they’re assigned automatically with a protocol called DHCP, which allows a new network connection to automatically pick up an IP address from an available pool. But just because they can change doesn’t mean they will change. Even dynamic IP addresses can remain the same for months or years at a time. (The servers you’re communicating with also have IP addresses, and they are typically static.)

In order to see how an IP address may be personally identifiable information, there’s a critical question to ask – “where do IP addresses come from, and what information can they be correlated with?”.

Depending on how you connect to the internet, your IP address may come from different places:

  • If you use dialup, your modem will get its IP address from the dialup ISP, with which you have an account. The ISP knows who you are and can correlate the IP address they give you with your account. Your name and billing details are part of your account information. By recording the phone number you call from, they may be able to identify your physical location.
  • If you have a DSL or cable connection, your DSL/cable modem will get its IP address from the DSL/cable provider. The ISP knows who you are and can correlate the IP address they give you with your account. Your name and physical location, and probably other information about you, are part of your account information.
  • If you’re using a public wifi access point, you’re probably using the IP address of the access point itself. If you had to log in to an account to use it, your name and physical location, and probably other information about you, are part of that account information. If you’re using someone else’s open wifi point, you look like them to the rest of the internet. This case is an exception to the rest of the points outlined in this article.
  • If you’re directly connected to the internet via a network adapter, your network adapter will get its IP address from the network provider. In an office, this is typically the network administrator of the company. Your network administrator knows which computer has which IP address.

None of this information is secret in the traditional sense. It is probably confidential business information, but in all cases, someone knows it, and the only thing keeping it from being further revealed is the willingness or lack thereof of the company or person who knows it.

While an IP address may not be enough to identify you personally, there are strong correlations of various degrees, and in most cases, those correlations are only one step away. By itself, an IP address is just a number. But it’s trivial to find out who is responsible for that address, and thus who to ask if you want to know who it’s been given out to. In some cases the logs are kept indefinitely, in others they’re destroyed on a regular basis; it’s entirely up to each individual organization.
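As a rough illustration of the "who is responsible for that address" step, here is a small Python sketch. Real lookups go through WHOIS and the regional registries; the table below is a made-up stand-in, the address blocks are reserved documentation ranges, and the owner names and the function `responsible_party` are invented for the example.

```python
import ipaddress

# Hypothetical allocation table. The blocks are reserved documentation
# ranges and the owner names are invented.
ALLOCATIONS = {
    "192.0.2.0/24": "Example Dialup ISP",
    "198.51.100.0/24": "Example Cable Co.",
    "203.0.113.0/24": "Example Corp office network",
}

def responsible_party(address):
    """Return the owner of the block containing the address, or None if unknown."""
    ip = ipaddress.ip_address(address)
    for block, owner in ALLOCATIONS.items():
        if ip in ipaddress.ip_network(block):
            return owner
    return None

print(responsible_party("198.51.100.77"))
```

The lookup only tells you who to ask. Who actually had the address at a given time is in that organization's own logs.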

Up until now, I’ve only discussed the implications of having an IP address. The situation gets much much worse when you start using it. Because every bit of network traffic you use is marked with your IP address, it can be used to link all of those disparate transactions together.
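Here is a minimal sketch of what "linking disparate transactions together" can look like in practice. The log format (an IP address followed by a URL) is a simplification I've made up, and the log lines are invented; real server logs carry more fields.

```python
from collections import defaultdict

# Invented log lines in a simplified "ip url" format.
ACCESS_LOG = """\
203.0.113.7 /search?q=knee+pain
198.51.100.22 /news/local
203.0.113.7 /search?q=orthopedic+surgeon+brooklyn
203.0.113.7 /forum/thread/1234
"""

def requests_by_ip(log_text):
    """Group the requested URLs by the IP address that made them."""
    grouped = defaultdict(list)
    for line in log_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        ip, url = line.split(maxsplit=1)
        grouped[ip].append(url)
    return dict(grouped)

for ip, urls in requests_by_ip(ACCESS_LOG).items():
    print(ip, urls)
```

Each request on its own says little. Grouped by address, three unrelated-looking requests start to read like one person's afternoon.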

Despite these possible correlations, not one of the major search engines considers your IP address to be personally identifiable information. [Update: someone asked where I got this conclusion. It's from my reading of the Google, Yahoo, and MSN Search privacy policies. In all cases, they discuss server logs separately from the collection of personal information (although MSN Search does have it under the heading of "Collection of Your Personal Information", it's clearly a separate topic). If you have some reason to believe I've made a mistake, I'm all ears.] While this may technically be true if you take an IP address by itself, it is a highly disingenuous position to take when logs exist that link IP addresses with computers, physical locations, and account information… and from there with people. Not always, but often. The inability to link your IP address with you depends always on the relative secrecy of these logs, what information is gathered before you get access to your IP address, and what other information you give out while using it.

Let’s bring one more piece into the puzzle. It’s the idea of a key. A key is a piece of data in common between two disparate data sources. Let’s say there’s one log that records which websites you visit, and it contains only the URL of each website and your IP address. No personal information, right? But there’s another log somewhere that records your account information and the IP address that you happened to be using. Now, the IP address is a key into your account information, and bringing the two logs together allows the website list to be associated with your account information.

  • Have you ever searched for your name? Your IP address is now a key to your name in a log somewhere.
  • Have you ever ordered a product on the internet and had it shipped to you? Your IP address is now a key to your home address in a log somewhere.
  • Have you ever viewed a web page with an ad in it served from an ad network? Both the operator of the web site and the operator of the ad network have your IP address in a log somewhere, as a key to the sites you visited.

The list goes on, and it’s not limited to IP addresses. Any piece of unique data – IP addresses, cookie values, email addresses – can be used as a key.
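The two-log scenario described above can be sketched in a few lines of Python. Everything here is invented for illustration: the names, addresses, URLs, and the function `join_on_ip`. It also ignores timestamps, which a real correlation would need, since a dynamic address can belong to different accounts at different times.

```python
# Two invented logs held by different parties.
site_log = [
    {"ip": "203.0.113.7", "url": "/articles/bankruptcy-faq"},
    {"ip": "198.51.100.22", "url": "/articles/gardening"},
    {"ip": "203.0.113.7", "url": "/articles/divorce-lawyers"},
]
isp_log = [
    {"ip": "203.0.113.7", "account": "Pat Example, 12 Sample St."},
    {"ip": "198.51.100.22", "account": "Sam Placeholder, 9 Demo Ave."},
]

def join_on_ip(site_log, isp_log):
    """Attach account details to each visited URL, using the IP address as the key."""
    account_for_ip = {row["ip"]: row["account"] for row in isp_log}
    return [
        (account_for_ip[row["ip"]], row["url"])
        for row in site_log
        if row["ip"] in account_for_ip
    ]

for account, url in join_on_ip(site_log, isp_log):
    print(account, "visited", url)
```

Neither log contains "personal browsing history" by itself. The join does.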

Data mining is the act of taking a whole bunch of separate logs, or databases, and looking for the keys to tie information together into a comprehensive profile representing the correlations. To say that this information is definitely being mined, used for anything, stored, or even ever viewed is certainly alarmist, and I don’t want to imply that it is. But the possibility is there, and in many cases these logs are being kept. If they’re not being used in that way now, the only thing really standing in the way is the inaction of those who have access to the pieces, or can get it.
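To show how several different keys can chain together into one profile, here is a rough Python sketch. It is only one possible approach (a small union-find over keys that appear in the same record), not a description of what any company actually does. The records, field names, and the function `build_profiles` are all invented.

```python
from collections import defaultdict

# Invented records from three separate logs. Each exposes one or more keys.
RECORDS = [
    {"source": "search log", "ip": "203.0.113.7", "detail": "searched: pat example"},
    {"source": "shop log", "ip": "203.0.113.7", "email": "pat@example.com",
     "detail": "shipped to 12 Sample St."},
    {"source": "ad log", "cookie": "abc123", "email": "pat@example.com",
     "detail": "saw ad on news site"},
    {"source": "ad log", "cookie": "zzz999", "detail": "saw ad on sports site"},
]
KEY_FIELDS = ("ip", "cookie", "email")

def build_profiles(records):
    """Group record details into profiles by linking keys that co-occur in a record."""
    parent = {}

    def find(x):
        # Find the representative key for x, shortening the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def keys_of(rec):
        return [(field, rec[field]) for field in KEY_FIELDS if field in rec]

    # Keys seen together in one record are treated as belonging to the same profile.
    for rec in records:
        keys = keys_of(rec)
        for other in keys[1:]:
            parent[find(keys[0])] = find(other)

    profiles = defaultdict(list)
    for rec in records:
        keys = keys_of(rec)
        if not keys:
            continue  # a record with no key cannot be linked to anything
        profiles[find(keys[0])].append(rec["detail"])
    return list(profiles.values())

for profile in build_profiles(RECORDS):
    print(profile)
```

Note that the search record and the ad record share no key directly. They end up in the same profile only because the shop record contains both the IP address and the email address.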

If the information is recorded somewhere, it can be used. This is a big problem.

There are various ways to mask your IP address, but that’s not the whole scope of the problem, and it’s still very easy to leak personally identifiable information.

I’ll start with one suggestion for how to begin to address this problem:

Any key information associated with personally identifiable information must also be considered personally identifiable.

[Update: I've put up a followup post to this one with an additional suggestion.]



21 Responses to “What’s the big fuss about IP addresses?”

  1. Dick Davies Says:

    Not treating an IP as a personally identifiable piece of information isn’t a contentious position to take (I’m not paying for your powerbook just because we share a http proxy server/ shell server).

    Yes, your IP appears in many logs, because you’re not going to be doing much online without one.
    Unless someone has access to all those logs, so what?

    If anyone cares that much about your activity they’ll pull your ISP into court, it’s much easier.

  2. Westar Says:

    My ISP gives me a new IP address every few weeks. Without much trouble, correlation could be used to determine my IP address history. There are a few blogs where I explicitly check the Remember-Me option, so when I return it knows my handle. The IP logs would clearly show when my handle has switched IPs.

    Pretty much all websites that enable cookies to remember when you return (or ones that ask for your email) get to know when your IP switches, and activity from those other IPs can be pinned to you.

    The little centralized webpage hit counters and embedded ads from big advertisers have the best IP logs, and comprehensive logs of which IPs reference which URLs, do what searches, and do what IP drifting.

  3. adam Says:

    Setting aside the other conclusions, as I said, for you, it may be the case that your IP address doesn’t identify you in any meaningful way. But the fact remains that it is the case now that for many users, that’s not true anymore, and we need to deal with that.

  4. James Wetterau Says:

    Other important factors – DHCP assigned addresses are typically assigned from small pools. Thus if you get a new one, it’s likely to be “close to” your old address, according to some measure.

    When using the web, browser specific information (the type of browser, version number, and operating system version number), is almost always available, too, and commonly recorded. This is sometimes called the “browser fingerprint”. It’s not a true fingerprint — millions of other people likely have the same browser fingerprint as you do, but it can help distinguish you as participating in one or other comparatively small group, such as Mac users, or Windows 98 users, or Windows XP users with the Opera browser. Browser fingerprints vary pretty widely, so at the point at which a DHCP IP address changes, the browser fingerprint can be the clue that ties the old address to the new one.

    This would be done by putting three facts together: IP address A used to visit a particular site regularly, with browser fingerprint B. As of a certain time, IP address A stopped visiting the site, but now a “close” IP address C that never showed up before starts regularly visiting the site. The visits share the browser fingerprint, B, and the two IP addresses are fairly close.

    Obviously this is not proof beyond any possibility of doubt that IP addr C is the new IP address assigned to a DHCP service user who formerly had IP addr A, but it can be good evidence for a statistical analysis. This is especially true if most users have cookies. If 10,000 people are regular visitors of a web site in any given month, and 9,800 use cookies that successfully identify them, then there are only 200 quasi-anonymous site visitors. Of those 200, piecing together a story based on IP addresses and browser fingerprints may be no big data-mining chore.

    This gets even easier if big popular sites share their weblogs for combined statistical analysis. Each site can figure out who its regular visitors are and then share the info with the others to build up a shared profile. This may not be as unlikely as it sounds, since many web sites have hosted advertisements from other companies (i.e. when you hit the web page for a site you may also pull down an ad from another site). If these ads are widely distributed, the ad companies are in a good position to cross-correlate the actions of web browsers across a wide diversity of sites. Access to more data makes the statistical determination of who’s who even easier.

  5. vlidi Says:

    “Any key information associated with personally identifiable information must also be considered personally identifiable.”

    OK, should be the standard.

    once the definition of the “key information” is agreed upon, as well as how deep the warrants can dig, and for what reasons, that is… as we know it will not happen anytime soon, and as we talk about web 2.0 while internet 2.0 is still just a vision (my favourite recipe is a new & more detailed version of TCP/IP on-the-fly AND “advanced” strings, or cookies on steroids, if you like, able to be transferred from hardware to hardware or activated per session online from a remote server), there is not much else to work with if you really want to be “untraceable” but to mask your IP, with (still) a suspicious amount of success and a willingness to step back on speed and once again join another “the-success-is-in-our-(possible)-multitude” group of activists (eg TOR) or similar…

    they still do not use it like they could, and we can not presume that they are not aware of the possibility, and we saw them cross-referencing before.

    is a fundamental restructure of protocol a possible solution, or is it a solution at all?

    great post, I am curious about the “multitude intelligence” answering the challenge…

  6. Jason Says:

    Why is it that you rail against the retention of personal data yet your blog comment box requires me to give you my email address? Do you have a privacy policy other than it “will not be published”? Are you storing this address securely?

  7. adam Says:

    Thanks for commenting, Jason. Given who your email provider is, I think you have bigger things to worry about than whether I’m storing your email address.

    But still, good question. I’ve never really thought about a formal privacy policy before, because this is a personal blog. For the record, I don’t think I’ve ever rejected a comment because it had a non-working or obviously fake email address, but I have on occasion contacted a poster to explain why I rejected a comment or to get further information before approving it.

    I will consider this. In the meantime, please feel free to use a bogus email address, but bear in mind that I may simply reject your comment out of hand if I have no way to get in touch with you.

  8. adam Says:

    On a similar note, while I don’t have any ads on the site, I do have embedded flickr pictures. So, here’s a question – is flickr just a cover for a huge web bug operation used to track visits to sites that have embedded flickr pictures, or is that being overly paranoid?

  9. Alex Barnett Says:

    Interesting post. You are aware that IP addresses were not handed over last week, yes?

    http://blogs.msdn.com/alexbarn/archive/2006/01/26/517791.aspx

  10. adam Says:

    Yes, I’m aware of that. I do think this discussion goes beyond this particular subpoena.

  11. Westar Says:

    The infatuation with warrants and subpoenas does seem to totally miss the point. The issue is private companies *have* this vast Person->IP->URL info, and sleazy employees or the companies themselves can do whatever they want with the information. The assumption that since we don’t know who works at flickr, google, msn, yahoo, doubleclick, or webhit, that they therefore are not trading and collating this information seems sort of wrong. It’s not even illegal for them to quietly give the info away to the US govt.

    Note the recent case where private investigators were selling a list of calls made from anyone’s cellphone. These idiot cellphone companies can not even figure out which employees/affiliates are giving out the information. Not that URLs are as interesting as who someone calls, but how much would it cost to get the list of URLs referenced from a given IP? sitemeter.com and technorati have some pretty good databases.

    Adam’s post shining light on this is excellent and fresh. I don’t see much knowledge elsewhere on this topic.

  12. Matt Says:

    If you use TOR to view websites the whole debate about IP addresses is pretty much thrown out the window.

    http://tor.eff.org/

  13. adam Says:

    TOR is, in my view, a partial solution. For one thing, it can be almost unusably slow. For some people, this is an acceptable tradeoff, but most people will get frustrated when web pages start taking 15-30 seconds to load and require several reloads before the DNS request goes through. It’s a good start, and people should use it, but the user experience is hardly ready for the general public. And it’s two more things that people have to install, on every computer that they use.

    But, as I pointed out, this problem isn’t limited to IP addresses, and it’s representative of a deeper issue – that the quality of “personally identifiable” is cumulative when you start putting databases together. Once two pieces of data have been linked, it’s hard to separate them out again.

    Understanding that is a prerequisite to understanding why things like TOR are useful. I think the public dialogue on this has been lacking.

  14. Chris Says:

    The problem with raising this sort of issue is that it brings to light the reality that anonymity on the web is largely illusory. Politically, privacy has far fewer constituents today than “responsibility.” If lawmakers came to realize that, by mandating the keeping and publishing of a few keys by all ISPs, citizens could be made “responsible” for their web use, we could kiss all net privacy goodbye. Lots of people see anonymity as antithetical to responsibility. It would be technically simple to create a distributed DNS-like database that links IP not to domain name, but instead to real name. If a law was passed mandating that ISPs make DHCP allocations searchable, voila: instant responsibility and zero privacy.

    Given the lobbying power of the content owners and folks who would love to be able to target advertisements, and the security spin that could be put on such a policy, it seems very very dangerous to bring such ideas to the attention of politicians who might try to get the Internet Responsibility and Terrorist Catching Act passed. Looking at the way the courts are going, it would probably be constitutional too. Yay democracy!

  15. Matt Says:

    Tor can give browsing speeds similar to dial up or more, up to about 20kbps currently, and the more people that install TOR and operate servers the faster it will get. The slowdowns on TOR have been caused by peer to peer filesharers abusing the service.

    TOR used to be cumbersome to use, but now TOR distribute a bundle with TOR, the TOR control panel and privoxy already configured, so it’s quite easy to use.

  16. Kip Patterson Says:

    Your information about “standard case 2” is totally incorrect. If your computer is connected to a cable or DSL modem without a router, your computer will be assigned a public IP address and this address is what is seen by the sites you visit. The IP address assigned to your modem is a private address for the use of your ISP and is not part of the browsing process ever.

  17. Robert Says:

    This is an interesting topic, and something that could be very scary, especially for those that have been searching for things they really shouldn’t have been searching for. For me, there may be a little embarrassment involved, but other than that, I have no worries. It does, however, upset me that there is even the most remote possibility someone could be tracking my surfing habits legally, without warrant. The internet has evolved so quickly, laws have not had a chance to keep up with this evolution. There should be some standard in place that will protect our right to privacy. If you want to see what I’ve been up to, first determine if I’ve possibly broken any laws, then obtain a warrant. In my opinion, this would be acceptable.

  18. /pd Says:

    what happens when you are tunneling 6over4 natted addresses? won’t this make it more difficult to find out who was actually at the terminal??

  19. CPCcurmudgeon Says:

    For comparison purposes, you may be interested in the privacy policy of a once-famous search engine (now owned by Yahoo).

    http://www.altavista.com/about/priv_details

  20. Sioen Says:

    thanks for the great discussion. this needs changing.

    But CPCcurmudgeon, I’m curious as to what comparison you were making with Altavista’s privacy policy. I have always used Altavista, just cuz I like it the best, but when I read the privacy policy, it doesn’t seem to be any different from others.

    They, too, explicitly say that anonymous information includes IP addresses. Curious.

    But is there something in it I missed?

  21. CPCcurmudgeon Says:

    The AV privacy policy notes that IP addresses can potentially be personally identifying when they are linked to information that is stored in other places, such as RIRs (Regional Internet Registries) or domain name registrars.

    I would also like to point out that Google’s example of what’s in a typical web server log is just that — an example. A lot more information can be collected. Potentially, anything that is sent in an HTTP request can be collected.
