Tolle has worked with students toward this aim, using data on the Gulf Region as well as the Eastern Seaboard, where the larger flood problems are. Interestingly, the models that worked best were those that used only recent data, from within the last 5 years. “Though anecdotal, this is a clear signal that we are living in a changing world when it comes to weather and climate,” Tolle observed. Now she plans to scale the project to the national level (3 million river reaches), generating a near-real-time hydrologic simulation. One application would be to send direct notifications to first responders, or even to individuals at greater risk.
The first phase of the project has already been very challenging. “Scientific data collection is dirty, dangerous, and hard. The statistics are much easier by comparison. And even if you have the data, there is the hard part of cleaning and transforming the data for use: such as dealing with bad data or missing values,” she points out. Bad data creates the risk of drawing the wrong conclusions. The danger of blindly applying machine learning to data with biases and gaps was a recurrent theme throughout the session.
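To make the cleaning step concrete, here is a minimal sketch in Python. It is not Tolle's pipeline: the sentinel value `-9999.0`, the plausible range, and the function name `clean_readings` are all assumptions chosen for illustration. It marks missing or implausible gauge readings and fills interior gaps by linear interpolation, which is itself a modelling choice that can introduce bias.

```python
import math

# Hypothetical sentinel that a gauge might report when a reading fails.
SENTINEL = -9999.0


def clean_readings(readings, low=0.0, high=50.0):
    """Mark missing or implausible values as None, then fill interior gaps
    by linear interpolation between the nearest valid neighbours.

    The bounds `low` and `high` are illustrative assumptions, not real limits.
    """
    marked = []
    for value in readings:
        bad = (
            value is None
            or math.isnan(value)
            or value == SENTINEL
            or not (low <= value <= high)
        )
        marked.append(None if bad else float(value))

    valid = [i for i, v in enumerate(marked) if v is not None]
    filled = list(marked)
    for left, right in zip(valid, valid[1:]):
        span = right - left
        for i in range(left + 1, right):
            weight = (i - left) / span
            filled[i] = marked[left] + weight * (marked[right] - marked[left])
    # Leading and trailing gaps stay None rather than being guessed.
    return filled


cleaned = clean_readings([1.0, None, 3.0, SENTINEL, 5.0])
```

Keeping the unfilled gaps as `None` makes it visible downstream which values were never observed, instead of silently hiding them from the model.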
The National Flood Interoperability Experiment (NFIE) is an instance of interoperability, one of the most attractive features of Big Data and also one of the most worrisome. Interoperability consists of triangulating separate sets of data to infer information that none of them contains alone. “It is really the interesting type of research that is taking place today. And while it can be good and save lives, it can also create the greatest ability to violate each other’s privacy – when an innocuous dataset is combined with others to triangulate information about groups and individuals,” pointed out Tolle. Her answer is “do no evil”. “Data collected by researchers should be made available openly, and also data applications. When these things stay hidden, that does us a disservice. More data, not less data should be available,” she concluded.
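The privacy risk of triangulation can be sketched in a few lines of Python. Everything below is invented for illustration: the records are fictional, and the choice of ZIP code, birth year, and sex as linking fields is an assumption, not something described in the session. The sketch joins a dataset without names to a public one with names and keeps only the unambiguous matches.

```python
from collections import defaultdict

# Entirely fictional records, invented for illustration.
health_records = [  # "anonymized": no names
    {"zip": "02139", "birth_year": 1975, "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth_year": 1982, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "94110", "birth_year": 1990, "sex": "F", "diagnosis": "migraine"},
]
public_roll = [  # public: names but no health data
    {"name": "Alice Example", "zip": "02139", "birth_year": 1975, "sex": "F"},
    {"name": "Bob Example", "zip": "02139", "birth_year": 1982, "sex": "M"},
    {"name": "Carol Example", "zip": "94110", "birth_year": 1990, "sex": "F"},
    {"name": "Dana Example", "zip": "94110", "birth_year": 1990, "sex": "F"},
]

# Fields that are harmless alone but identifying in combination (assumed).
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def link(records_a, records_b, keys=QUASI_IDENTIFIERS):
    """Join two datasets on shared fields, keeping only unique matches."""
    index = defaultdict(list)
    for person in records_b:
        index[tuple(person[k] for k in keys)].append(person)

    linked = []
    for record in records_a:
        candidates = index.get(tuple(record[k] for k in keys), [])
        if len(candidates) == 1:  # exactly one person fits: re-identified
            linked.append({**candidates[0], **record})
    return linked


reidentified = link(health_records, public_roll)
for row in reidentified:
    print(row["name"], "->", row["diagnosis"])
```

In this toy data two of the three nameless records end up attached to a name. The third stays ambiguous only because two people happen to share the same three attributes.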
The most worrisome aspect of Big Data is that online tracking built for advertising is being co-opted for government spying, which poses risks to civil liberties. Scientists can do more to extract value from Big Data without sacrificing privacy.
Jeremy Gillula is a staff technologist at the Electronic Frontier Foundation (EFF). The EFF is a US-based civil society organization that brings together lawyers, activists, and technologists to defend civil liberties in the digital world. Its priorities include free speech, copyright reform, freedom from threats for computer security researchers, government transparency, and privacy.
The NSA Data Center, a row of gigantic, black warehouses in Utah, is the visible part of “a 2B dollar project aimed at holding data that the NSA has sucked from us all”, said Jeremy Gillula in his plenary talk. It is the icon of the “surveillance-industrial complex made by companies and governments that track all our communications,” according to Gillula. “Big data is revolutionizing medicine and physics, but also surveillance,” he warned.
“I acknowledge the benefits of autonomous technologies, but I am simultaneously aware of the threats they pose to our civil liberties,” said Gillula. His worries focus on the misuse of online tracking for advertising and spying. He thinks scientists can do more to get results without sacrificing privacy.
The first step in the surveillance mechanism is simply visiting a website, whether or not you log in. “Sometimes you login. But if you don’t, the website assigns unique device identifiers to you anyway: a random number stored in a cookie in your browser, or permanently linked to your mobile device,” Gillula explained.
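The identifier Gillula describes takes very little code on the server side. The following Python sketch uses only the standard library; the cookie name `visitor_id`, the one-year lifetime, and the function name are assumptions for illustration, not details from the talk.

```python
import uuid
from http.cookies import SimpleCookie

COOKIE_NAME = "visitor_id"  # hypothetical cookie name


def identify_visitor(cookie_header):
    """Return (visitor_id, set_cookie_value).

    `cookie_header` is the raw Cookie request header, or None if absent.
    `set_cookie_value` is None when the browser already carries an ID.
    """
    jar = SimpleCookie()
    jar.load(cookie_header or "")
    if COOKIE_NAME in jar:
        # Returning visitor: recognized without any login.
        return jar[COOKIE_NAME].value, None

    # New visitor: mint a random identifier and ask the browser to keep it.
    visitor_id = uuid.uuid4().hex
    response = SimpleCookie()
    response[COOKIE_NAME] = visitor_id
    response[COOKIE_NAME]["max-age"] = 365 * 24 * 3600  # one year (assumed)
    response[COOKIE_NAME]["path"] = "/"
    return visitor_id, response[COOKIE_NAME].OutputString()


first_id, set_cookie = identify_visitor(None)
second_id, nothing_to_set = identify_visitor(f"{COOKIE_NAME}={first_id}")
```

On the first request the server sends the value in a `Set-Cookie` header. On every later request the browser returns it, so the second call recognizes the same visitor.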
“Things get creepy when we get to ad networks,” said Gillula, “that is, when a third party comes in.” This is the second step of surveillance. For example, a user visits a newspaper’s website. The website loads resources from third parties, typically ad networks, that put cookies on the user’s machine. Because the same third party is present on many websites, it can track a large part of the user’s browsing. When the user is on a mobile device, the third party may even know where the user is physically located. “All this without any login, without knowing I am being tracked. It’s non-consensual ubiquitous tracking,” summarized Gillula.
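A toy simulation shows why one third party embedded on many sites can reconstruct a browsing history. The classes `AdNetwork` and `Browser` and the site names are hypothetical, and real browsers and ad networks are far more complex. In practice the embedding page is typically revealed to the third party through the `Referer` header or a parameter in the embed URL.

```python
from collections import defaultdict


class AdNetwork:
    """Toy third party embedded on many sites (hypothetical)."""

    def __init__(self):
        self._next_id = 0
        self.history = defaultdict(list)  # tracker cookie -> sites seen

    def serve(self, cookie, embedding_site):
        """Handle one ad request and return the cookie the browser keeps."""
        if cookie is None:  # first contact: assign an identifier
            self._next_id += 1
            cookie = f"tracker-{self._next_id}"
        self.history[cookie].append(embedding_site)
        return cookie


class Browser:
    """Toy browser that stores one cookie per third party."""

    def __init__(self):
        self.third_party_cookies = {}

    def visit(self, site, embedded_networks):
        # Loading the page also triggers a request to each embedded network,
        # and the browser sends back whatever cookie that network set before.
        for network in embedded_networks:
            cookie = self.third_party_cookies.get(network)
            self.third_party_cookies[network] = network.serve(cookie, site)


ads = AdNetwork()
browser = Browser()
for site in ("news.example", "shop.example", "health.example"):
    browser.visit(site, [ads])

print(dict(ads.history))
```

The user never logged in anywhere, yet the ad network ends up holding a single identifier linked to all three visits.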
The final step comes when governments step in. “One may argue that we need tracking to fund online services. And that one can encrypt traffic, for example using https,” Gillula pointed out. “However, governments can tap into communications precisely in the part of the connections where encryption is removed,” he remarked. Intelligence agencies use software, such as the NSA’s XKeyscore, that links together the collected information.
“People going to a website are in fact going to an ad network, and the ad network can be monitored by a government agency. Big Data for Big Brother piggybacks on Big Data for advertisers, that piggyback on Big Data for websites,” Gillula remarked.