“Encouraging research in this field, stimulating debate, raising people’s awareness, and regulatory approaches are as important as technical solutions”, Ryan pointed out. He also asked a broader question: “Are we sure that investing in yet more surveillance is the most cost-effective way to fight terrorism?”
“We are still in the trenches of ‘crypto war II’, and the frontline is cryptography itself,” observed Bryan Davis, from the Department of Statistics and Operational Research at the University of North Carolina at Chapel Hill, who reported on the workshop to the plenary. [Prof. Ryan] “encouraged us to consider both sides of the power of intelligence and law enforcement. But so far the balance has been much in favour of Big Brother,” he concluded.
Getting the figures right
Crunching Big Data, even with advanced technologies, can tell misleading stories if the data are biased, heterogeneous, or riddled with gaps. The right statistics must be applied to extract eye-opening stories, as happens in human rights applications.
Mountains of documents, Xerox copies and printed pictures: this is what Big Data looks like in certain situations, instead of light, immaterial terabytes of digitised information. Such material feeds the statistical analysis of wars and massive human rights violations. Megan Price, director of research (now executive director) at the Human Rights Data Analysis Group (HRDAG), a US not-for-profit organization that uses statistics to find the most accurate truth in these controversial issues, is one of the world’s leading experts in this field.
“Quantitative analysis can contribute to human rights research, but if we get it wrong we can cause big damage,” Price warned at the beginning of her workshop. “Combining the aching desire for answers with some data can be misleading. Applying technology to data does not provide miraculous answers,” she pointed out. Her concern resonated among all the speakers: crunching data blindly, hoping to find whatever correlations emerge, is a dangerous path. “We really need more science, not more technology, because technology can get it wrong,” said Price.
How many people have died in a conflict? Which party to the conflict claimed more victims? Which groups suffered the most violence? These are the questions HRDAG was asked to answer in trials concerning conflicts such as those in Guatemala and Kosovo, and is now trying to answer in real time in Syria.
HRDAG tries to answer them based on data collected by sources on the ground: NGOs, activists, official registers. “If we rely only on these observations we can get the answer completely wrong,” Price observed. “These are rarely complete data – or random samples. Usually, they are incomplete convenience datasets,” she pointed out.
It’s not necessarily that somebody is trying to bias the data intentionally: the very process of data collection in wartime is biased. People collecting data may have limited resources, may cover only certain regions, or may face cultural barriers to getting information from certain communities. Certain groups of victims may report more casualties than others simply because they have access to the Internet. Moreover, the worst massacres are paradoxically the least reported, because they leave few living witnesses. “Even a very large dataset must be checked for bias. Applying technology to a biased dataset emphasizes the bias,” said Price.
Between December 2012 and March 2013, the four organizations collecting data in the Syrian city of Hama registered a few hundred deaths per month, with a decreasing trend. This coincidence may lead to the conclusion that violence was decreasing: a seemingly reasonable, but wrong, conclusion. “The first thing we do is record linkage. We get different lists that contain names of victims with pieces of demographic information, and we identify records in different lists that refer to the same victim,” Price explained.
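The record-linkage step Price describes can be illustrated with a minimal sketch. The records, names, date field and similarity threshold below are all hypothetical, and real linkage systems use far richer demographic comparisons; this only shows the core idea of matching near-duplicate records across two lists.

```python
# Hypothetical sketch of record linkage: match records across two victim
# lists when the dates agree and the names are sufficiently similar.
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase and collapse whitespace so trivial spelling variants match."""
    return " ".join(name.lower().split())

def same_victim(rec_a, rec_b, threshold=0.85):
    """Judge whether two (name, date) records likely refer to the same person."""
    name_a, date_a = rec_a
    name_b, date_b = rec_b
    if date_a != date_b:
        return False
    ratio = SequenceMatcher(None, normalize(name_a), normalize(name_b)).ratio()
    return ratio >= threshold

def link_lists(list_a, list_b):
    """Return index pairs (i, j) of records judged to be the same victim."""
    return [(i, j)
            for i, rec_a in enumerate(list_a)
            for j, rec_b in enumerate(list_b)
            if same_victim(rec_a, rec_b)]

# Illustrative (invented) records:
list_a = [("Ahmad  Khalil", "2013-01-05"), ("Omar Haddad", "2013-01-07")]
list_b = [("ahmad khalil", "2013-01-05"), ("Sami Nasser", "2013-01-09")]
print(link_lists(list_a, list_b))  # [(0, 0)] -- one shared record found
```

The output of this step, the overlap between lists, is exactly the quantity the estimation methods described next depend on.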
The overlap between the lists contains key information about how the data are sampled. This information can be fed into statistical tools that adjust for the sampling biases and estimate how many victims have not yet been reported by any of the sources. These tools are called Multiple Systems Estimation, and they are standard statistical methods inspired by the classic capture-recapture models of ecology. After HRDAG’s analysis, a peak of more than 1,000 victims appeared in January. In those months, the warring parties were fighting neighbourhood by neighbourhood for control of the city: this likely resulted in more deaths on the streets and fewer in the registers.