Given the recent fuss about the government asking for search terms and what qualifies as personally identifiable information, I want to explain why IP address logging is a big deal. This explanation is somewhat simplified to make the cases easier to understand; it doesn’t cover every possible configuration, of which there are many. I think I’ve kept the important stuff without dwelling on the boundary cases, but be aware that your setup may differ somewhat. If you feel I’ve glossed over something important, please leave a comment.
First, a brief discussion of what IP addresses are and how they work. Slightly simplified, every device that is connected to the Internet has a unique number that identifies it, and this number is called an IP address. Whenever you send any normal network traffic to any other computer on the network (request a web page, send an email, etc…), it is marked with your IP address.
There are three standard cases to worry about:
- If you use dialup, your analog modem has an IP address. Remote computers see this IP address. (This case also applies if you’re using a data aircard, or using your cell phone as a modem.)
- If you have a DSL or cable connection, your DSL/cable modem has an IP address when it’s connected, and your computer has a separate internal IP address that it uses only to communicate with the DSL or cable modem, typically mediated by a home router. Remote computers see the IP address of the DSL/cable modem. (This case also applies if you’re using a mobile wifi hotspot.)
- If you’re directly connected to the internet via a network adapter, your network adapter has an IP address. Remote computers see this IP address.
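The difference between an internal address and the address that remote computers see can be checked with Python’s standard `ipaddress` module. This is only a minimal sketch: the function name is my own, and the sample addresses are illustrations (192.168.1.10 is the kind of address a home router typically hands out), not taken from any real setup.

```python
import ipaddress


def is_internal_address(address: str) -> bool:
    """Return True if the address belongs to a range reserved for private use.

    Addresses like these are only meaningful inside a home or office
    network; remote computers see the modem's or router's address instead.
    """
    return ipaddress.ip_address(address).is_private


if __name__ == "__main__":
    for sample in ("192.168.1.10", "10.0.0.5", "8.8.8.8"):
        kind = "internal" if is_internal_address(sample) else "public"
        print(sample, "is", kind)
```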
Sometimes, IP addresses are static, meaning they’re manually assigned and don’t change unless someone changes them (typically only in the third case above, a direct connection). Often, they’re dynamic, which means they’re assigned automatically with a protocol called DHCP, which allows a new network connection to automatically pick up an IP address from an available pool. But just because they can change doesn’t mean they will change. Even dynamic IP addresses can remain the same for months or years at a time. (The servers you’re communicating with also have IP addresses, and they are typically static.)
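To see why a dynamic address can stay the same for a long time, here is a toy model of an address pool. It is not an implementation of DHCP; real servers deal with lease times, renewals, and much more. The class name, the device identifiers, and the addresses (from a range reserved for documentation) are all assumptions made for illustration.

```python
class ToyLeasePool:
    """A drastically simplified address pool, loosely modelled on DHCP."""

    def __init__(self, addresses):
        self.free = list(addresses)   # addresses nobody holds yet
        self.leases = {}              # device id -> address it was given

    def request(self, device_id):
        # A device that already holds a lease gets the same address back,
        # which is why "dynamic" does not have to mean "changing".
        if device_id in self.leases:
            return self.leases[device_id]
        if not self.free:
            raise RuntimeError("address pool exhausted")
        address = self.free.pop(0)
        self.leases[device_id] = address
        return address


if __name__ == "__main__":
    pool = ToyLeasePool(["192.0.2.10", "192.0.2.11"])
    print(pool.request("modem-a"))
    print(pool.request("modem-b"))
    print(pool.request("modem-a"))  # reconnecting: same address as before
```

Note that the `leases` table is itself a record of which device held which address.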
In order to see how an IP address may be personally identifiable information, there’s a critical question to ask – “where do IP addresses come from, and what information can they be correlated with?”.
Depending on how you connect to the internet, your IP address may come from different places:
- If you use dialup, your modem will get its IP address from the dialup ISP, with which you have an account. The ISP knows who you are and can correlate the IP address they give you with your account. Your name and billing details are part of your account information. By recording the phone number you call from, they may be able to identify your physical location.
- If you have a DSL or cable connection, your DSL/cable modem will get its IP address from the DSL/cable provider. The ISP knows who you are and can correlate the IP address they give you with your account. Your name and physical location, and probably other information about you, are part of your account information.
- If you’re using a public wifi access point, you’re probably using the IP address of the access point itself. If you had to log in to an account to use it, your name and physical location, and probably other information about you, are part of your account information. If you’re using someone else’s open wifi point, you look like them to the rest of the internet. This case is an exception to the rest of the points outlined in this article.
- If you’re directly connected to the internet via a network adapter, your network adapter will get its IP address from the network provider. In an office, this is typically the network administrator of the company. Your network administrator knows which computer has which IP address.
None of this information is secret in the traditional sense. It is probably confidential business information, but in all cases, someone knows it, and the only thing keeping it from being further revealed is whether the company or person who knows it is willing to reveal it.
While an IP address may not be enough to identify you personally, there are strong correlations of various degrees, and in most cases, those correlations are only one step away. By itself, an IP address is just a number. But it’s trivial to find out who is responsible for that address, and thus whom to ask if you want to know who it’s been given out to. In some cases the logs will be kept indefinitely; in others they will be destroyed on a regular basis. It’s entirely up to each individual organization.
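Finding out who is responsible for an address amounts to checking which allocated block the address falls into. The sketch below uses a made-up allocation table; in practice this information comes from public registration records (WHOIS), which this code does not query. The organisation names are hypothetical and the address blocks are ones reserved for documentation.

```python
import ipaddress

# Hypothetical allocation table: address block -> organisation responsible.
ALLOCATIONS = {
    ipaddress.ip_network("192.0.2.0/24"): "Example Dialup ISP",
    ipaddress.ip_network("198.51.100.0/24"): "Example Cable Provider",
    ipaddress.ip_network("203.0.113.0/24"): "Example Corp office network",
}


def responsible_party(address: str):
    """Return the organisation whose block contains the address, or None."""
    ip = ipaddress.ip_address(address)
    for network, owner in ALLOCATIONS.items():
        if ip in network:
            return owner
    return None


if __name__ == "__main__":
    print(responsible_party("198.51.100.77"))
    print(responsible_party("10.1.2.3"))
```

The answer is the organisation, not the person. Getting from the organisation to the person takes the organisation’s own logs.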
Up until now, I’ve only discussed the implications of having an IP address. The situation gets much, much worse when you start using it. Because every bit of network traffic you send is marked with your IP address, it can be used to link all of those disparate transactions together.
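Linking transactions by IP address takes very little work. The following sketch groups entries from a single made-up log by the address that sent them; the log contents and the function name are invented for illustration.

```python
from collections import defaultdict

# Made-up log entries: (IP address, what was requested).
LOG = [
    ("203.0.113.7", "search: symptoms of flu"),
    ("198.51.100.4", "GET /news"),
    ("203.0.113.7", "GET /pharmacy/locations"),
    ("203.0.113.7", "send mail via smtp.example.com"),
]


def group_by_ip(entries):
    """Collect every logged action under the IP address that performed it."""
    grouped = defaultdict(list)
    for ip, action in entries:
        grouped[ip].append(action)
    return dict(grouped)


if __name__ == "__main__":
    for ip, actions in group_by_ip(LOG).items():
        print(ip, actions)
```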
Despite these possible correlations, not one of the major search engines considers your IP address to be personally identifiable information. [Update: someone asked where I got this conclusion. It's from my reading of the Google, Yahoo, and MSN Search privacy policies. In all cases, they discuss server logs separately from the collection of personal information (although MSN Search does have it under the heading of "Collection of Your Personal Information", it's clearly a separate topic). If you have some reason to believe I've made a mistake, I'm all ears.] While this may technically be true if you take an IP address by itself, it is a highly disingenuous position to take when logs exist that link IP addresses with computers, physical locations, and account information… and from there with people. Not always, but often. Whether your IP address can be linked to you always depends on the relative secrecy of these logs, what information is gathered before you get access to your IP address, and what other information you give out while using it.
Let’s bring one more piece into the puzzle: the idea of a key. A key is a piece of data in common between two disparate data sources. Let’s say there’s one log that records which websites you visit, containing only the URL of each website and your IP address. No personal information, right? But there’s another log somewhere that records your account information and the IP address that you happened to be using. Now the IP address is a key into your account information, and bringing the two logs together allows the website list to be associated with your account information.
- Have you ever searched for your name? Your IP address is now a key to your name in a log somewhere.
- Have you ever ordered a product on the internet and had it shipped to you? Your IP address is now a key to your home address in a log somewhere.
- Have you ever viewed a web page with an ad in it served from an ad network? Both the operator of the web site and the operator of the ad network have your IP address in a log somewhere, as a key to the sites you visited.
The list goes on, and it’s not limited to IP addresses. Any piece of unique data – IP addresses, cookie values, email addresses – can be used as a key.
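The two-log scenario described above can be sketched in a few lines. Both logs here are invented, as are the field names and the person; the point is only that neither log names a person next to a website until the two are joined on the shared IP address.

```python
# Log 1 (made up): which URL was visited from which IP address.
WEB_LOG = [
    {"ip": "203.0.113.7", "url": "http://example.com/medical/advice"},
    {"ip": "198.51.100.4", "url": "http://example.org/news"},
    {"ip": "203.0.113.7", "url": "http://example.net/jobs"},
]

# Log 2 (made up): which account was using which IP address.
ACCOUNT_LOG = [
    {"ip": "203.0.113.7", "account": "Alice Example, 1 Sample Street"},
]


def join_on_ip(web_log, account_log):
    """Pair each visited URL with the account that held the same IP address."""
    accounts = {row["ip"]: row["account"] for row in account_log}
    return [
        (accounts[row["ip"]], row["url"])
        for row in web_log
        if row["ip"] in accounts
    ]


if __name__ == "__main__":
    for account, url in join_on_ip(WEB_LOG, ACCOUNT_LOG):
        print(account, "->", url)
```

This sketch ignores timing. With dynamic addresses, a real join would also have to match on when the address was in use.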
Data mining is the act of taking a whole bunch of separate logs, or databases, and looking for the keys to tie information together into a comprehensive profile representing the correlations. To say that this information is definitely being mined, used for anything, stored, or even ever viewed is certainly alarmist, and I don’t want to imply that it is. But the possibility is there, and in many cases these logs are being kept. If they’re not being used in that way now, the only thing really standing in the way is the inaction of those who have access to the pieces, or can get them.
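As a rough illustration of how separate records end up in one profile, the sketch below merges records that share any key, even indirectly: an IP address links to a cookie, and the cookie links to an email address. The records, the key format, and the function name are all made up, and the merging technique (a small union-find) is my choice for the example, not a description of what any particular company does.

```python
from collections import defaultdict

# Made-up records from separate logs. Each carries whatever identifiers
# that log happened to capture, plus one fact about the activity.
RECORDS = [
    {"keys": {"ip:203.0.113.7"}, "fact": "visited example.com/medical"},
    {"keys": {"ip:203.0.113.7", "cookie:abc123"}, "fact": "saw ad on example.net"},
    {"keys": {"cookie:abc123", "email:alice@example.com"}, "fact": "ordered a book"},
    {"keys": {"ip:198.51.100.4"}, "fact": "visited example.org/news"},
]


def build_profiles(records):
    """Group facts whose records are connected through any chain of shared keys."""
    parent = {}

    def find(key):
        # Return the representative key of the group that `key` belongs to.
        parent.setdefault(key, key)
        while parent[key] != key:
            parent[key] = parent[parent[key]]  # shorten the chain as we go
            key = parent[key]
        return key

    def union(a, b):
        parent[find(a)] = find(b)

    # Keys that appear together in one record belong to the same group.
    for record in records:
        keys = sorted(record["keys"])
        find(keys[0])
        for other in keys[1:]:
            union(keys[0], other)

    # Collect each record's fact under its group's representative.
    profiles = defaultdict(list)
    for record in records:
        root = find(sorted(record["keys"])[0])
        profiles[root].append(record["fact"])
    return sorted(profiles.values(), key=len, reverse=True)


if __name__ == "__main__":
    for profile in build_profiles(RECORDS):
        print(profile)
```

In this made-up data the first record contains only an IP address, yet it ends up in the same profile as the email address, two steps removed.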
If the information is recorded somewhere, it can be used. This is a big problem.
There are various ways to mask your IP address, but that’s not the whole scope of the problem, and it’s still very easy to leak personally identifiable information.
I’ll start with one suggestion for how to begin to address this problem:
Any key information associated with personally identifiable information must also be considered personally identifiable.
[Update: I've put up a followup post to this one with an additional suggestion.]