What can be guessed about you from your online behavior? Two computer privacy experts — economist Alessandro Acquisti and computer scientist Jennifer Golbeck — on how little we know about how much others know.
The best indicator of high intelligence on Facebook is apparently liking a page for curly fries. At least, that’s according to computer scientist Jennifer Golbeck (TED Talk: The curly fry conundrum), whose job is to figure out what we reveal about ourselves through what we say — and don’t say — online. Of course, the lines between online and “real” are increasingly blurred, but as Golbeck and privacy economist Alessandro Acquisti (TED Talk: Why privacy matters) both agree, that’s no reason to stop paying attention. TED got the two together to discuss what the web knows about you, and what we can do about the things we’d rather it forgot. An edited version of the conversation follows.
I hear so much conflicting information about what I should and shouldn’t be posting online. It’s confusing and unnerving not to know what I can do to protect myself. Can you both talk about that?
Alessandro Acquisti: My personal view is that individual responsibility is important, but we are at a stage where it is not sufficient. The problem is much larger than any one individual’s ability to control their personal information, because there are so many new ways every week or every month in which we can be tracked or things can be inferred about us. It’s absolutely unreasonable to expect consumers and citizens, who are all engaged in so many other activities, to also have the ability to continuously update their knowledge about what new tracking method the industry has discovered and to be able to fend it off. I think it’s a larger problem that requires policy intervention.
Jennifer Golbeck: I agree with that. Even if you did have a person who wanted to be on top of this and was willing to dedicate themselves full-time to keeping track of what technology can do, and then try to make decisions about what they can post, they still actually don’t have control.
Take language analysis, a really powerful tool where we look at the kinds of words that you use — not even necessarily obvious things like curse words, but things like function words: how often you use “I” versus “we,” how often you use “the” versus “a,” these little words that are natural in the way that you develop language and inherent to your personality. It turns out that those reveal all sorts of personal traits. There’s a whole field of psycholinguistics in which people are doing deeper research into comparing the kinds of words you use and how often you use them with personal attributes, and that’s not something you can understand or control.
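The kind of analysis Golbeck describes starts with very simple feature extraction. As a rough illustration (the word list and function name here are invented for the example, not taken from any actual model), counting the relative frequency of a handful of function words might look like this:

```python
from collections import Counter
import re

# Hypothetical feature set: a few of the "little words" such
# psycholinguistic analyses track. Real tools use far larger categories.
FUNCTION_WORDS = {"i", "we", "the", "a", "an", "you", "it"}

def function_word_profile(text: str) -> dict:
    """Relative frequency of each tracked function word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens) or 1  # avoid division by zero on empty input
    counts = Counter(t for t in tokens if t in FUNCTION_WORDS)
    # Normalize to rates so texts of different lengths are comparable.
    return {w: counts[w] / total for w in sorted(FUNCTION_WORDS)}

profile = function_word_profile("I think we should take the train, not a bus.")
```

Tools in this vein (LIWC is a well-known example) track dozens of word categories rather than seven words, but the raw inputs are this mundane: rates of common words, which models then correlate with personality traits across large populations.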
AA: It’s also difficult to predict how information you reveal now could be used five or ten years out, in the sense of new inferences that could be discovered. Researchers may find that a piece of information “A” combined with a piece of information “B” can lead to the prediction of something particularly sensitive; how that sensitive information could then be used is equally hard to foresee. These uses are literally impossible to predict, because researchers come up with new ideas for using data every month. So we literally do not know how this will play out in the future.
What would a policy solution look like?
JG: Right now in the U.S. it’s essentially the case that when you post information online, you give up control of it. So there are terms of service that regulate the sites you use, like on Facebook and Twitter and Pinterest — though those can change — but even within those, you’re essentially handing control of your data over to the companies. And they can kind of do what they want with it, within reason. You don’t have the legal right to request that data be deleted, to change it, to refuse to allow companies to use it. Some companies may give you that right, but you don’t have a natural, legal right to control your personal data. So if a company decides they want to sell it or market it or release it or change your privacy settings, they can do that.
In Europe, users have more of a right to their data, and recently there was a decision in Spain where a man had sued Google because when people searched for him, the results surfaced information about financial problems he had had a long time ago. He was basically arguing that he has a right to have this information about him forgotten. When we declare bankruptcy, for example, that stays on our credit report for seven years; it’s not going to be there 30 years later. But a lot of this stuff on the Internet, including public-record stuff, sticks around well past the time we would previously have allowed it to expire, and users don’t have control over that online.
So Europe is saying users have a right to own their data in a certain way, and in the U.S. we don’t have that. That’s one of the spaces where there are some clear and straightforward legal solutions that could hand control, at least in some part, back to the users.
“There are a number of ways in which transparency and control can be bypassed or muted.” Alessandro Acquisti
AA: If you go back to the 1970s, the Organisation for Economic Co-operation and Development (OECD) — so not exactly an anti-business or anti-capitalist organization — came up with a number of principles related to handling personal data. These Fair Information Practices, or FIPs, were guidelines for what policymakers could do to make the handling of personal information fair.
If you look at those principles, and then you look now at the state of policymaking in the United States when it comes to privacy, you see a significant difference. The policymaking effort in the U.S. focuses almost exclusively on control and transparency, i.e., telling users how their data is used and giving them some degree of control. And those are important things! However, they are not sufficient means of privacy protection, in that there are a number of ways in which transparency and control can be bypassed or muted. What we are missing from the Fair Information Practices are other principles, such as purpose specification (the reason data is being gathered should be specified before or at the time of collection), use limitation (subsequent uses of data should be limited to specific purposes) and security safeguards.
AA: Indeed. A very interesting aspect of that experiment is that people do remember what we told them about how we would use their data. But adding this delay between the time that we told them how their data would be used and the time when we actually started asking them to make choices about their data was enough to render that notice ineffective. That’s probably because their minds started wandering.
JG: When it first came out, I remember thinking, “Oh, I’m not going to bother trying this, because I’m one of the people in the world who knows the most about Facebook privacy settings. I have them cranked up so high; there’s no way it could possibly see anything on my profile.” A week later, I thought, “Well, you know, let’s click on it and see,” and it got all this data that I didn’t think it could get. I remember thinking, if I don’t understand what kind of data is being given to apps, how can anybody else understand?
AA: To explain this phenomenon I borrow the term “rational ignorance.” Rational ignorance has been used in other fields to refer to situations where people rationally decide to remain ignorant about a certain topic because they expect that the costs involved in making an effort will not be offset by the benefit of getting this information. Sometimes, in privacy, we may feel the same way. Sadly, sometimes correctly so: we may do everything to protect ourselves and do everything right, and still our data is compromised or used in ways that we don’t know about and don’t want. And therefore, some of us may give up, and decide not even to start protecting ourselves.
“It’s really important that people understand that there are computational techniques that will reveal all kinds of information about you that you’re not aware that you’re sharing.” Jennifer Golbeck
JG: At the same time, the thing that I’ve had a hard time communicating in the years that I’ve been doing this work is for people to really understand that we can find things out computationally that they’re not sharing explicitly. So you can “like” these pages, you can post these things about yourself, and then we can infer a completely unrelated trait about you based on a combination of likes or the type of words that you’re using, or even what your friends are doing, even if you’re not posting anything. It’s things that are inherent in what you’re sharing that reveal these other traits, which may be things you want to keep private and that you had no idea you were sharing.
So on the one hand, it’s true that even if you know about all these computational techniques, you can’t necessarily protect yourself. On the other hand, it’s really important that people understand that there are computational techniques that will reveal all kinds of information about you that you’re not aware that you’re sharing.
How is what’s happening on the Internet different from people analyzing the way I dress, cut my hair, where I work or where I live? I don’t give people on the streets permission to judge those things, but they do it anyway.
AA: It’s different on at least two grounds. One is scale. We are talking here about technologies that vastly increase the kind of abilities that you’re describing. They make them more sophisticated. They allow many more entities — not just the friends you meet in your day-to-day life, but entities across the world — to make inferences about you. This data remains somewhere and could be used later to influence you.
The second is asymmetry. We all grow up developing the ability to modulate our public and private spheres, how much we want to reveal to friends, how much we want to protect. And we are pretty good at that. But when we go online, there is an element of asymmetry, because there are entities we don’t even know exist, and they are continuously gathering information about us.
“The point is, we really don’t know how this information will be used.” Alessandro Acquisti
To be clear, I’m not suggesting that all this information will be used negatively, or that online disclosures are inherently negative. That’s not at all the point. The point is, we really don’t know how this information will be used. For instance, say I’m a merchant — once I get information about you, I can use it to try to extract more economic surplus from the transaction. I can price-discriminate against you, so that I get more out of the transaction than you do.
That’s why I’m interested in working in this area, not because disclosure is bad — human beings disclose all the time, it’s an innate need as much as privacy is – but because we really don’t know how this information will be used in the long run.
JG: You pick what clothes you wear, you pick the neighborhood you live in, you pick the job that you have — and in some way you know what that’s saying about you. Say you’re Catholic. Some people are going to associate one thing with you being Catholic, and some people are going to associate another. You have an easy way to understand what all the reactions will be.
But the kinds of things that we’re talking about online aren’t things you can necessarily anticipate. One example is a pretty early project from a couple of undergrads at MIT called “Project Gaydar.” They were able to infer people’s sexual orientation by completely ignoring anything the person had actually said and instead looking at the person’s friends and what they had disclosed about themselves. So even if you wanted to keep your sexual orientation private, we can still find it out, and there’s nothing you can do about it.
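The friend-based inference Golbeck describes can be sketched very simply. Assuming a hypothetical toy social graph (the names, traits, and function below are invented for illustration), the core idea is to predict an undisclosed trait as the majority value among the friends who did disclose it:

```python
# Toy graph: who is friends with whom, and what some people disclosed.
friends = {"alice": ["bob", "carol", "dave"]}
disclosed = {"bob": "A", "carol": "A", "dave": "B"}

def infer_from_friends(person, friends, disclosed):
    """Guess a person's undisclosed trait from friends' disclosures.

    Takes the majority vote among friends who revealed the trait;
    returns None when no friend disclosed anything.
    """
    votes = [disclosed[f] for f in friends.get(person, []) if f in disclosed]
    if not votes:
        return None
    return max(set(votes), key=votes.count)

guess = infer_from_friends("alice", friends, disclosed)
```

Real systems use far richer network models, but this captures why the target’s own silence doesn’t help: the signal lives entirely in other people’s disclosures.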
We have such a huge base of data: hundreds of millions of people, combinations of actions, likes and words. By itself, it’s a pile of traits that doesn’t mean anything. Yet we can detect small patterns among these hundreds of millions of people to infer, pretty accurately, information that has basically no relationship to what they’re choosing to disclose.
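A minimal sketch of like-based inference, using invented toy data (the page names, trait labels, and function are illustrative only, not real findings): estimate how strongly each liked page co-occurs with a trait across many users. Correlations of this kind, aggregated over millions of profiles, are what findings like the curly-fries result rest on.

```python
from collections import defaultdict

# Toy data: each user is a set of page likes plus a binary trait (1/0)
# that we would like to predict for future users.
users = [
    ({"curly_fries", "science_daily"}, 1),
    ({"science_daily", "chess_club"}, 1),
    ({"monster_trucks"}, 0),
    ({"monster_trucks", "curly_fries"}, 0),
]

def like_trait_rates(data):
    """Estimate P(trait = 1 | liked page) from co-occurrence counts."""
    liked, positive = defaultdict(int), defaultdict(int)
    for likes, trait in data:
        for page in likes:
            liked[page] += 1       # how many users liked this page
            positive[page] += trait  # how many of them have the trait
    return {page: positive[page] / liked[page] for page in liked}

rates = like_trait_rates(users)
```

On this toy data, a page liked only by trait-positive users gets a rate of 1.0 and one liked only by trait-negative users gets 0.0; production models combine thousands of such weak signals rather than trusting any single like.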
Once those algorithms are mapped out, how do you keep them from being used for evil? Do you worry your research could be used by less scrupulous entities?
JG: I sometimes tell people that I feel a bit like I’m working on the Manhattan Project. But I actually approach this from a scientific perspective. I’m interested in the science of it, and I think pretty universally, with very few exceptions, it’s always worth doing the science. And in fact, what we’re doing to infer these things can actually be used to teach people how to protect themselves. One of the things that we learned through this research is that the more data we have about people, the easier it is to make inferences about them. That has led me to be a regular purger of all my information online. That’s a lesson that comes out of the science.
You’re right that this stuff is going to get into the hands of companies, governments, and could potentially be used in evil ways. I don’t think not doing the science is the solution to that. As Alessandro said earlier, the only solution is a legal one where people have control over how their data is used and there are limitations and real regulation on data brokers and other companies that have this data.
AA: I doubt that even the best researchers can give the industry ideas it has not already come up with, or will not soon come up with on its own. In the best scenario, we are maybe one or two steps ahead of the game. And that is important, because it’s about raising awareness among individuals and among policymakers about things that are about to happen, or have started to happen.
Featured artwork by Dawn Kim.