Adam Fields (weblog)

This blog is largely deprecated, but is being preserved here for historical interest. Check out my index page at adamfields.com for more up to date info. My main trade is technology strategy, process/project management, and performance optimization consulting, with a focus on enterprise and open source CMS and related technologies. More information. I write periodic long pieces here, shorter stuff goes on twitter or app.net.

1/31/2006

iRAM SATA ramdisk is now available

Filed under: — adam @ 1:22 pm

Someone pointed out to me that the iRAM is now available. It’s a PCI card with standard DIMM slots that plugs into a standard SATA port to give you up to a 4GB ramdisk that doesn’t require separate drivers. It draws power off of the PCI slot to keep the memory intact even if the machine is off as long as it’s plugged in, and it even has a battery backup for up to ten hours of complete shutoff.

Tech Report did some testing, and the results are pretty impressive:

http://www.techreport.com/reviews/2006q1/gigabyte-iram/index.x?pg=1

Tags: , , ,


1/30/2006

More specific Google tracking questions

I asked two very specific questions in a conversation with John Battelle, and he’s received unequivocal answers from Google:

1) “Given a list of search terms, can Google produce a list of people who searched for that term, identified by IP address and/or Google cookie value?”

2) “Given an IP address or Google cookie value, can Google produce a list of the terms searched by the user of that IP address or cookie value?”

The answer to both of them is “yes”.

http://battellemedia.com/archives/002283.php

Tags: , , ,


Flickr pictures, web beacons, and a modest proposal

As I noted in the comments of the previous post, I don’t have ads on the site, but I do have flickr pictures directly linked from my flickr account.

It is conceivable to me that flickr pictures could qualify as “web beacons” under the Yahoo privacy policy, and thus be used for tracking purposes. Presumably, this was not the original intention of the flickr developers, but it’s certainly a possibility now that they’re owned by Yahoo. Are the access logs for the static flickr pictures available to Yahoo? Probably. Are they correlated with other sorts of usage information? It’s not clear. Presumably, flickr pictures are linked in places where standard Yahoo web beacons can’t go, because they’re not invited (like on this site, for example).

I think my conclusion is that this is probably not a problem, but maybe it is. It and other sorts of distributed 3rd party tracking all have one thing in common:

It’s called HTTP_REFERER.

Here’s how it works. When you make a request for any old random web page that contains a 3rd party ad or an image or a javascript library or whatever, your browser fetches the embedded piece of content from the 3rd party. When it does that, as part of the request, it sends the URL of the page you visited as part of the request, in a field called the referer header (yes, it’s misspelled).

So, every time you visit a web page:

  • You send the URL to the owner of the page. So far so good.
  • You send your IP address to the owner of the page. Not terrible in itself.
  • You send the URL of the page you visited to the owner of the 3rd party content. And this is where it starts to degrade a little.
  • You send your IP address to the owner of the 3rd party content. The owner of the 3rd party content may be able to set a cookie identifying you. Modern browsers are set by default to refuse 3rd party cookies. However, if that 3rd party has ever set a cookie on your browser before (say, if you hit their site directly), they can still read it. In any case, you can be identified in some incremental way.
  • The next time you visit another site with content from the same 3rd party, they can probably identify you again.

That referer URL is a significant key that ties a lot of browsing habits together.

There’s an important distinction to be made here. The referer header makes it possible for 3rd party sites to track your content, and it’s only one of many ways. Doing away with the referer header won’t prevent the sites running 3rd party tracking content from doing so. The owner of the site can always send the URL you’re looking at to the 3rd party as part of the request, even if your browser isn’t. However, what this does prevent is tracking without the consent of the owner of the site you’re looking at. Of all of the sites you’re looking at, actually. Judging from my admittedly limited conversations with site owners, there are a LOT of people out there who have no idea that their users can be tracked if they include 3rd party ads on their site, or flickr images, or whatever. (Again, not to say that their users are being tracked, but the possibility is there.)

Again, the site that includes the ad or image or whatever isn’t sending that information – your browser is, and this is a legacy of the early days of the web. Some browsers allow you to turn it off and not send any referer information. I’d argue that this should be off by default, because there disadvantages outweigh the benefits. I’m told that legitimate advertisers don’t rely on the referer header anyway, because it can be unreliable. If that’s true, that’s even less reason to keep it around.

Suggestion number one was “Tracking information that’s linked to personally identifiable information should also be considered personally identifiable“.

Perhaps suggestion two is “Let’s do away with the Referer header”. (Of course, this comes on the heels of a Google-employed Firefox developer adding more tracking features instead of taking them away.)

Arguments for or against? Are there any good uses for this that are worth the potential for abuse?

Tags: , , , , , ,


1/29/2006

What’s the big fuss about IP addresses?

Given the recent fuss about the government asking for search terms and what qualifies as personally identifiable information, I want to explain why IP address logging is a big deal. This explanation is somewhat simplified to make the cases easier to understand without going into complete detail of all of the possible configurations, of which there are many. I think I’ve kept the important stuff without dwelling on the boundary cases, and be aware that your setup may differ somewhat. If you feel I’ve glossed over something important, please leave a comment.

First, a brief discussion of what IP addresses are and how they work. Slightly simplified, every device that is connected to the Internet has a unique number that identifies it, and this number is called an IP address. Whenever you send any normal network traffic to any other computer on the network (request a web page, send an email, etc…), it is marked with your IP address.

There are three standard cases to worry about:

  1. If you use dialup, your analog modem has an IP address. Remote computers see this IP address. (This case also applies if you’re using a data aircard, or using your cell phone as a modem.)
  2. If you have a DSL or cable connection, your DSL/cable modem has an IP address when it’s connected, and your computer has a separate internal IP address that it uses to only communicate with the DSL or cable modem, typically mediated by a home router. Remote computers see the IP address of the DSL/cable modem. (This case also applies if you’re using a mobile wifi hotspot.)
  3. If you’re directly connected to the internet via a network adapter, your network adapter has an IP address. Remote computers see this IP address.

Sometimes, IP addresses are static, meaning they’re manually assigned and don’t change automatically unless someone changes them (typically, only for case #3). Often, they’re dynamic, which means they’re assigned automatically with a protocol called DHCP, which allows a new network connection to automatically pick up an IP address from an available pool. But just because they can change doesn’t mean they will change. Even dynamic IP addresses can remain the same for months or years at a time. (The servers you’re communicating with also have IP addresses, and they are typically static.)

In order to see how an IP address may be personally identifiable information, there’s a critical question to ask – “where do IP addresses come from, and what information can they be correlated with?”.

Depending on how you connect to the internet, your IP address may come from different places:

  • If you use dialup, your modem will get its IP address from the dialup ISP, with which you have an account. The ISP knows who you are and can correlate the IP address they give you with your account. Your name and billing details are part of your account information. By recording the phone number you call from, they may be able to identify your physical location.
  • If you have a DSL or cable connection, your DSL/cable modem will get its IP address from the DSL/cable provider. The ISP knows who you are and can correlate the IP address they give you with your account. Your name and physical location, and probably other information about you, are part of your account information.
  • If you’re using a public wifi access point, you’re probably using the IP address of the access point itself. If you had to log in your account, your name and physical location, and probably other information about you, are part of your account information. If you’re using someone else’s open wifi point, you look like them to the rest of the internet. This case is an exception to the rest of the points outlined in this article.
  • If you’re directly connected to the internet via a network adapter, your network adapter will get its IP address from the network provider. In an office, this is typically the network administrator of the company. Your network administrator knows which computer has which IP address.

None of this information is secret in the traditional sense. It is probably confidential business information, but in all cases, someone knows it, and the only thing keeping it from being further revealed is the willingness or lack thereof of the company or person who knows it.

While an IP address may not be enough to identify you personally, there are strong correlations of various degrees, and in most cases, those correlations are only one step away. By itself, an IP address is just a number. But it’s trivial to find out who is responsible for that address, and thus who to ask if you want to know who it’s been given out to. In some cases, the logs will be kept indefinitely, or destroyed on a regular basis – it’s entirely up to each individual organization.

Up until now, I’ve only discussed the implications of having an IP address. The situation gets much much worse when you start using it. Because every bit of network traffic you use is marked with your IP address, it can be used to link all of those disparate transactions together.

Despite these possible correlations, not one of the major search engines considers your IP address to be personally identifiable information. [Update: someone asked where I got this conclusion. It's from my reading of the Google, Yahoo, and MSN Search privacy policies. In all cases, they discuss server logs separately from the collection of personal information (although MSN Search does have it under the heading of "Collection of Your Personal Information", it's clearly a separate topic). If you have some reason to believe I've made a mistake, I'm all ears.] While this may technically be true if you take an IP address by itself, it is a highly disingenuous position to take when logs exist that link IP addresses with computers, physical locations, and account information… and from there with people. Not always, but often. The inability to link your IP address with you depends always on the relative secrecy of these logs, what information is gathered before you get access to your IP address, and what other information you give out while using it.

Let’s bring one more piece into the puzzle. It’s the idea of a key. A key is a piece of data in common between two disparate data sources. Let’s say there’s one log which records which websites you visit, and it stores a log that only contains the URL of the website and your IP address. No personal information, right? But there’s another log somewhere that records your account information and the IP address that you happened to be using. Now, the IP address is a key into your account information, and bringing the two logs together allows the website list to be associated with your account information.

  • Have you ever searched for your name? Your IP address is now a key to your name in a log somewhere.
  • Have you ever ordered a product on the internet and had it shipped to you? Your IP address is now a key to your home address in a log somewhere.
  • Have you ever viewed a web page with an ad in it served from an ad network? Both the operator of the web site and the operator of the ad network have your IP address in a log somewhere, as a key to the sites you visited.

The list goes on, and it’s not limited to IP addresses. Any piece of unique data – IP addresses, cookie values, email addresses – can be used as a key.

Data mining is the act of taking a whole bunch of separate logs, or databases, and looking for the keys to tie information together into a comprehensive profile representing the correlations. To say that this information is definitely being mined, used for anything, stored, or even ever viewed is certainly alarmist, and I don’t want to imply that it is. But the possibility is there, and in many cases, these logs are being kept, if they’re not being used in that way now, the only thing really standing in the way is the inaction of those who have access to the pieces, or can get it.

If the information is recorded somewhere, it can be used. This is a big problem.

There are various ways to mask your IP address, but that’s not the whole scope of the problem, and it’s still very easy to leak personally identifiable information.

I’ll start with one suggestion for how to begin to address this problem:

Any key information associated with personally identifiable information must also be considered personally identifiable.

[Update: I've put up a followup post to this one with an additional suggestion.]

Tags: , , , , ,


1/27/2006

Google does keep cookie- and IP-correlated logs

I asked John Battelle the question about whether Google keeps personally identifiable search log information, particularly search logs correlated with IP address. He asked Google PR, who confirmed that they do.

http://battellemedia.com/archives/002272.php

From my comment there, ultimately, this is bad for users. If the information is kept, it’s available for request, abuse, or theft.

Tags: , , , , , ,


Some evidence that Google does keep personally identifiable logs

This article from Internet Week has Alan Eustace, VP of Engineering for Google, on the record talking about the My Search feature.

http://internetweek.cmp.com/showArticle.jhtml?articleID=161500556

“Anytime, you give up any information to anybody, you give up some privacy,” Eustace said.

With “My Search,” however, information stored internally with Google is no different than the search data gathered through its Google .com search engine, Eustace said.

“This product itself does not have a significant impact on the information that is available to legitimate law enforcement agencies doing their job,” Eustace said.

This seems pretty conclusive to me – signing up for saved searches doesn’t (or didn’t, in April 2005) change the way the search data is stored internally.

Comments?

(This was pointed out to me by Ray Everett-Church in the comments of the previous post, covered on his blog: http://www.privacyclue.com/index.php?p=68)

Tags: , , , , , ,


1/26/2006

Does Google keep logs of personal data?

The question is this – is there any evidence that Google is keeping logs of personally identifiable search history for users who have not logged in and for logged-in users who have not signed up for search history? What about personal data collected from Gmail, and Google Groups, and Google Desktop? Aggregated with search? Kept personally identifiably? (Note: For the purposes of this conversation, even though Google does not consider your IP address to be personally identifiable, at least according to their privacy policy, I do.)

It is not arguable that they could keep those logs, but I think every analysis I’ve seen is simply repeating the assumption that they do, based on the fact that they could.

Has there ever been a hard assertion, by someone who’s in a position to know, that these logs do in fact exist?

I have a suspicion about one possible source of all this. Google’s privacy policy used to say (amended 7/2004):

Google notes and saves [emphasis mine] information such as time of day, browser type, browser language, and IP address with each query.“.

But the policy no longer says that. The current version reads: “When you use Google services, our servers automatically record information that your browser sends whenever you visit a website. These server logs may include information such as your web request, Internet Protocol address, browser type, browser language, the date and time of your request and one or more cookies that may uniquely identify your browser.“. Again, no information about what’s being done with that data or how long it’s kept.

Given the possibility that they don’t, I think it drastically changes the value proposition of those free subsidiary tools. Obviously, if you ask for your search history to be saved, they’re going to keep it. But maybe that decision is predicated on the assumption that they’re going to keep it anyway, and you might as well have access to it. If the answer is that they’re not keeping it, that’s a different question.

It’s critical to point out that these issues are not even close to limited to Google. Every search engine, every “free” service you give your data to, every hub of aggregated data on the web has the same problems.

Currently, there’s no way to make an informed decision, because privacy policies don’t include specific information about what data is kept, in what form, and for how long. With all of the disclosures in the past year of personal data lost, compromised, and requested, isn’t it time for us to know? In the beginning of the web, having a privacy policy at all was unheard of, but now everybody has one. I don’t think it’s too much to ask of the companies we do business with that the same be done with log retention policies.

I agree with the request to ask Google to delete those logs if they’re keeping them, but I haven’t seen any evidence that they are. Personally, I’d like to know.

Tags: , , , , , ,


1/25/2006

Tim Wu article on Google and search engine privacy

Filed under: — adam @ 11:03 am

This is pretty much exactly the point I’ve been trying to make – while Google is commendable for standing up to the government, they created this problem in the first place by aggregating search data.

“Imagine we were to find out one day that Starbucks had been recording everyone’s conversations for the purpose of figuring out whether cappuccino is more popular than macchiato. Sure, the result, on the margin, might be a better coffee product. And, yes, we all know, or should, that our conversations at Starbucks aren’t truly private. But we’d prefer a coffee shop that wasn’t listening – and especially one that won’t later be able to identify the macchiato lovers by name. We need to start to think about search engines the same way and demand the same freedoms.”

http://www.slate.com/id/2134670/?nav=tap3


1/20/2006

More thoughts on Google

Having examined the motion and letters, I see a different picture emerging.

I am not a lawyer, but from my reading of the motion, it appears that Google’s objections are thin. Really thin.
Also, they seem to have been completely addressed by the scaling back of the DOJ requests. Of course, that’s not the complete story, but if the arguments in the motion are correct, it seems like to me that Google will lose and be compelled to comply.

Based on the letters and other analysis, they’re also pulling the slippery slope defense – “we’re not going to comply with this because it will give you the expectation that we’re open for business and next time you can ask for personal information”. If that’s true, I think that’s the first good news I’ve heard out of them in years. Good luck with that.

Google’s own behavior is inconsistent with their privacy FAQ, which states Google does comply with valid legal process, such as search warrants, court orders, or subpoenas seeking personal information. These same processes apply to all law-abiding companies. As has always been the case, the primary protections you have against intrusions by the government are the laws that apply to where you live. (Interestingly, this language is inconsistent with their full privacy policy, which states that Google only shares personal information … [when] We have a good faith belief that access, use, preservation or disclosure of such information is reasonably necessary to (a) satisfy any applicable law, regulation, legal process or enforceable governmental request.

I wonder if they intend to challenge the validity of the fishing expedition itself, which would be the real kicker (and probably invalidate the above paragraph). I also idly wonder if they expect to lose anyway and have simply refused to comply with bogus arguments in order to get the request entered into the public record.

Interesting stuff. A lot of my criticisms of Google are about their unwillingness to publicly state their intentions with respect to the data they get (and the extent to which they may or may not be retaining, aggregating, and correlating that data), and I don’t think this case is any different. I think Google’s interest here in not releasing records is aligned with the public good, and as such, I wish them well. It’s been asserted that Google has taken extraordinary steps to preserve the anonymity of its records, and that well may be true. It’s also kind of irrelevant. Beyond this specific case, of whether the govnernment can request information about Google searches (let alone any of their more invasive services, or anyone’s more invasive services), is the issue of the ramifications of collecting, aggregating, and correlating this data in the first place.

There is no question that Google has access to a tremendous amount of data on everyone who interacts with its service. It is still troubling that its privacy policy is inadequate. It’s still troubling that Google (and Yahoo, and how many others) considers your IP address to be not personally identifiable information. It’s still troubling that Google (and Yahoo and how many others) do all of their transactions unencrypted and that search terms are included in the URL of the request. As this case has shown, Google’s actual behavior may not correlate to their stated intentions, of which there are few in the first place. By Google’s own slippery slope logic, this time it works for you – will it next time?

Perhaps it’s time to hold companies accountable for the records they keep.


1/19/2006

Update on DOJ/Google

This is a fascinating deconstruction of the court documents and letters available so far:

http://blog.searchenginewatch.com/blog/060119-161802


DOJ demands large chunk of Google data

The Bush administration on Wednesday asked a federal judge to order Google to turn over a broad range of material from its closely guarded databases.

The move is part of a government effort to revive an Internet child protection law struck down two years ago by the U.S. Supreme Court. The law was meant to punish online pornography sites that make their content accessible to minors. The government contends it needs the Google data to determine how often pornography shows up in online searches.

In court papers filed in U.S. District Court in San Jose, Justice Department lawyers revealed that Google has refused to comply with a subpoena issued last year for the records, which include a request for 1 million random Web addresses and records of all Google searches from any one-week period.

http://www.siliconvalley.com/mld/siliconvalley/13657386.htm

I’m sort of out of analysis about why this is bad, because I’ve said it all before.

See (particularly 4 and 5):

  1. http://www.aquick.org/blog/2005/04/21/google-adds-prove-adam-right-button/
  2. http://www.aquick.org/blog/2005/05/06/they-dont-necessarily-know-who-you-are/
  3. http://www.aquick.org/blog/2005/06/23/beware-the-google-threat/
  4. http://www.aquick.org/blog/2005/05/08/google-is-destroying-the-private/
  5. http://www.aquick.org/blog/2005/05/05/google-wants-your-logs/
  6. http://www.aquick.org/blog/2005/11/21/google-really-wants-your-logs/

It really comes down to one thing.

If data is collected, it will be used.

It’s far past the time for us all to take an interest in who’s collecting what.


1/18/2006

By the way, now’s probably a good time to update your hosts file

Filed under: — adam @ 11:54 am

The hosts file is a long list of known advertising and spyware domains. Using the hosts file makes these sites invisible to your computer.

http://www.mvps.org/winhelp2002/hosts.htm


Sometimes it hurts to be right.

Filed under: — adam @ 11:37 am

‘The Mozilla Team has quietly enabled a new feature in Firefox that parses ‘ping’ attributes to anchor tags in HTML. Now links can have a ‘ping’ attribute that contains a list of servers to notify when you click on a link. Although link tracking has been done using redirects and Javascript, this new “feature” allows notification of an unlimited and uncontrollable number of servers for every click, and it is not noticeable without examining the source code for a link before clicking it.’

http://yro.slashdot.org/yro/06/01/18/1427212.shtml

‘I’m sure this may raise some eye-brows among privacy conscious folks, but please know that this change is being considered with the utmost regard for user privacy. The point of this feature is to enable link tracking mechanisms commonly employed on the web to get out of the critical path and thereby reduce the time required for users to see the page they clicked on. Many websites will employ redirects to have all link clicks on their site first go back to them so they can know what you are doing and then redirect your browser to the site you thought you were going to. The net result is that you end up waiting for the redirect to occur before your browser even begins to load the site that you want to go to. This can have a significant impact on page load performance.’

http://weblogs.mozillazine.org/darin/archives/009594.html

Oh, well, that makes it all okay then. It’s for the user experience.

Where does Darin’s next paycheck come from? Oh, right. It’s Google. But I’m sure they have only our best interests at heart.


1/15/2006

They’ve finally figured out how bees fly

Filed under: — adam @ 12:31 pm

http://www.vnunet.com/vnunet/news/2148481/bees-fly-official


1/14/2006

This is very very bad for Google’s stock price

Filed under: — adam @ 7:46 pm

“Billy Hoffman, an engineer at Atlanta company SPI Dynamics unveiled a new, smarter web-crawling application that behaves like a person using a browser, rather than a computer program. “Basically this nullifies any traditional form of forensics,” says Hoffman. The program comes from different internet addresses, simulates different browsers and throttles itself to human-like speeds.”

Currently, it’s hard to tell the difference between a human click and a robot click, but it’s still possible to make a reasonable guess, and cheap as they are, getting banks of low-paid clickers in 3rd world countries is still comparatively pricey.

But the ability to run a crawler that’s indistinguishable from a human blows all of that out the airlock. And if it’s impossible to tell the difference between an automated click and a human, the AdWords value proposition goes away.

http://www.wired.com/news/technology/0,70016-0.html?tw=rss.index


1/11/2006

Best of 2005

Filed under: — adam @ 6:58 pm



Best of 2005

Originally uploaded by Caviar.



1/7/2006

WMF official patch is out

Filed under: — adam @ 12:26 pm

You should have the MS patch by now for the WMF exploit.

You can verify that the MS one is successfully installed by checking the box in Add or Remove Programs that says “show updates”. The proper one is KB912919.

Once this is installed, you should remove the unofficial patch, if you installed it.


1/3/2006

Retrievr

Filed under: — adam @ 12:09 pm

Draw a picture, and Retriever will fetch “similar” images from flickr (based mostly on color and rough shapes).

http://labs.systemone.at/retrievr/
http://labs.systemone.at/retrievr/about

Update: I did a pretty random search, and it turned up one of my images. Cool!

One of my shots in retrievr


WMF exploit unofficial patch

Filed under: — adam @ 11:55 am

This is pretty unbelievable. A major exploit was announced, diagnosed, and confirmed. While Microsoft has sat on their ass and said they won’t have a patch available FOR ANOTHER WEEK, someone has reverse engineered the binary and issued their own patch. The patch has been verified by a number of reliable sources as being trustworthy, effective, and reversible. Install it now, if you use Windows.

http://isc.sans.org/diary.php?storyid=994

I’m not a lawyer, but this sounds like grounds for bringing a negligence lawsuit against Microsoft. It is completely unacceptable that the fix is simple enough that it can be done by someone without access to the source, there are known exploits in the wild, and it’s going to take another week for an official patch.


Powered by WordPress