There are systematic methods for the solution of all problems
The Internet today is more centralized and censored than ever before. Snowden’s revelations have made clear that corporations such as Google [1] and Microsoft [2] collaborate with intelligence agencies to make the data of private communications available to government authorities.
Leaving political reasons aside, we will focus on the technical conditions that made possible the programs of mass surveillance Snowden revealed.
With the advent of the internet, the amount of data generated by humanity has grown, and continues to grow, exponentially [3]. This phenomenon is known as the information explosion. Meanwhile, the price of storage and computing power keeps dropping, in line with Moore’s Law. Together, massive amounts of digital information and inexpensive computing power set the conditions for systems of mass data processing.
Such systems have been created for many purposes, including surveillance.
Surveillance systems intercept the digital communications of every citizen while machine learning algorithms recognize and classify the data, dramatically improving the efficiency of any human analysis.
This results in a tremendous asymmetry of information, as well as the obliteration of citizens’ digital privacy [4].
The implementation of this type of infrastructure is virtually impossible without the centralization of control over massive amounts of data in the hands of a few companies. A single top-level agreement between a company and a government agency can instantly render millions of individuals vulnerable. Even when a company publicly renounces such cooperation [5], the fact that an enormous amount of valuable user data is kept and protected centrally [6] creates enormous incentives for intruders [7].
Since the invention of public key protocols by Merkle, Diffie, Hellman and others in the 70s, humanity has made great progress in cryptography and distributed systems, and created an entire toolbox of technologies that address the risks due to the centralization of information. In this article we will explore some of them: we will start with decentralized systems, go through end-to-end encryption and zero knowledge proofs, and conclude with a definition of decentralized proof systems.
Complex systems are composed of multiple independent components. A component is independent to the extent that it can withstand the failures of other components. The entire system is called fault-tolerant if it can handle the failures of its components up to a certain degree. Complex systems are generally fault-tolerant: such a system naturally incorporates some redundancy and implements mechanisms for detecting and bypassing failures.
In the world of computers, fault-tolerant systems usually deal with simple failures: cases in which a component stops responding to requests, or returns meaningless, inconsistent or incomplete responses. Failures of this type occur naturally due to hardware malfunctions, misconfiguration or rare incidental circumstances.
A hierarchical system vs a peer-to-peer (decentralized) system
There exists another, more subtle type of failure: the Byzantine failure. A component may still be online and responding to requests with well-formed, consistent messages, yet violate some higher-level protocol. Suppose, for instance, that the component is required to forward any message it receives. A component that fails in the simple manner just stops forwarding; such abnormal behavior is recognized and bypassed. A Byzantine-faulty component, in contrast, remains online, reports no failures, and forwards all received messages. It may, however, deliberately alter messages in order to introduce inconsistency or bias at the system level. It remains unnoticed by standard failure detection mechanisms, yet its impact can lead to a systemic failure.
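The contrast between a simple failure and a Byzantine failure can be sketched in a few lines of Python (a toy simulation; the node classes and the payment message are purely illustrative, not a real protocol):

```python
class HonestNode:
    def forward(self, message):
        return message  # relays the message unchanged

class CrashedNode:
    def forward(self, message):
        return None  # simple failure: stops responding and is easily detected

class ByzantineNode:
    def forward(self, message):
        # Stays online with well-formed responses, but silently alters content.
        return message.replace("pay Alice", "pay Mallory")

def relay(nodes, message):
    for node in nodes:
        response = node.forward(message)
        if response is None:
            continue  # simple failure detected: route around the node
        message = response  # a Byzantine alteration passes unnoticed
    return message

result = relay([HonestNode(), CrashedNode(), ByzantineNode()], "pay Alice 10 coins")
# The message arrives well-formed, yet its meaning has changed.
```

The crashed node is trivially bypassed, while the Byzantine node's output is accepted because nothing about its format betrays the tampering.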
Byzantine failures are common in many kinds of complex systems. They appear naturally in environments controlled simultaneously by multiple participants: markets of various types, as well as social and biological systems. Such failures allow systems to evolve and adapt to changing conditions. A non-standard business strategy may create an opportunity for a start-up (or its competitors). A leader who breaks old rules has more chances to succeed (and, equally, to fail).
Computer systems aim to model the processes of the natural world. As their components gain more autonomy and intelligence, and their growing complexity demands more automation, Byzantine failures start to occur in the world of strict logic too [8].
A system capable of mitigating Byzantine failures is called a Byzantine fault-tolerant system. At its core lies a specific communication protocol [9, 10, 11, 12] built around the notion of consensus: an agreement among participants. The theory states that such a system, composed of multiple components controlled by different participants, remains resistant to any kind of deviation from the protocol. In practice this means that no participant is able to abuse the system unilaterally. In other words, as long as the participants respecting the protocol are in the majority, they are always capable of reaching agreement regardless of a dishonest minority.
An agreement is defined as the outcome of a computation; transaction and data processing are particular forms of computation. An agreement between two participants is reached either when one trusts the other and blindly accepts its outcome, or, when there is no trust, when each participant independently arrives at the same outcome by reproducing the computation. The first case is trivial: the agreement always exists, as it is implicitly assumed by default. In environments without explicit trust, however, agreement becomes a valuable asset. The agreement reached by a majority is called consensus.
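The idea of agreement-by-reproduction can be condensed into a short sketch (a deliberate simplification: real Byzantine fault-tolerant protocols involve several rounds of message exchange, not a single tally; all names here are illustrative):

```python
from collections import Counter

def run_computation(state, tx):
    # A deterministic state transition every participant can reproduce.
    return state + tx

def consensus(outcomes):
    """Accept the outcome reported by a strict majority of participants."""
    value, votes = Counter(outcomes).most_common(1)[0]
    if votes > len(outcomes) // 2:
        return value
    raise RuntimeError("no majority: agreement cannot be reached")

# Four honest participants reproduce the computation independently; one
# Byzantine participant reports a biased outcome. The honest majority wins.
honest_outcomes = [run_computation(100, 5) for _ in range(4)]
agreed = consensus(honest_outcomes + [999])  # the biased report is outvoted
```

Because the computation is deterministic, every honest participant arrives at the same value, and the dishonest minority cannot shift the result.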
A decentralized system is a Byzantine fault-tolerant system that is controlled by multiple participants with competing incentives.
Decentralized systems should not be confused with another widely-used notion: distributed systems. Despite the apparent similarity, there are core differences in the way consensus is found.
A distributed system, in its classical definition, is designed to be efficient and scalable; it runs in a trusted environment and is owned by a single authority. Take the well-known model of distributed computation: map-reduce. The caller distributes (maps) the computation among multiple components (executors), waits for them to finish, accepts their outcomes and performs the remaining non-parallelizable computation (reduce). The computation is reproduced only in the case of a simple, explicit failure of one of the components. The agreement is trivial: the outcome is accepted as long as the component didn’t explicitly report a failure. Such systems can attain tremendous performance, horizontally scaled to execute multiple workflows simultaneously. Centralization is the price paid for this performance: components are tightly coupled, and their implicit trust relations make it impossible for the system to be owned by participants with competing incentives.
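The map-reduce flow just described can be sketched as a minimal word-count example (note that the caller accepts each executor's outcome without verification, reflecting the implicit trust inside a distributed system):

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def word_count(documents):
    """Minimal map-reduce: count words across documents in parallel."""
    def map_phase(doc):           # executed by independent executors
        counts = {}
        for word in doc.split():
            counts[word] = counts.get(word, 0) + 1
        return counts

    def reduce_phase(acc, part):  # non-parallelizable merge by the caller
        for word, n in part.items():
            acc[word] = acc.get(word, 0) + n
        return acc

    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(map_phase, documents))  # distribute (map)
    return reduce(reduce_phase, partials, {})            # combine (reduce)

counts = word_count(["to be or not to be", "to err is human"])
```

Each partial outcome is merged as-is; a Byzantine executor returning inflated counts would corrupt the total without ever being noticed.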
Decentralized systems, on the contrary, reach agreement in a completely trustless environment. A component never blindly accepts another’s outcomes; instead, it reproduces the computation by itself. Provided the majority of components follow the protocol and the computation is deterministic, the majority will eventually arrive at the same outcome. The consensus protocol makes possible the secure exchange of outcomes as well as the synchronization of data.
Whereas the primary goals of distributed systems are performance and scalability, decentralized systems focus mostly on security. The strict requirement of reproducibility prevents decentralized systems from scaling as efficiently as distributed ones: each participant is obliged to keep the entire dataset and process every computation. Various attempts have been made to find the best tradeoff between security and performance [13, 14], but the problem remains open [15].
In the presence of multiple competing parties, trust relations are usually established through a neutral trusted third party. This natural structure inevitably carries the previously mentioned risks of centralization. Alternatively, the parties may put in place a decentralized system that none of them is able to abuse. This transition, called disintermediation, greatly reduces the cost introduced by an intermediary and has extraordinary potential benefits for society.
Decentralized and centralized multi-user systems manage users’ data differently. The difference comes from how, and by whom, the data is accessed. A centralized system keeps all the data within a single security perimeter, which requires putting in place complex data protection infrastructure; access is regimented by means of authentication and authorization.
In many cases it is enough to find and break a weak link of a centralized system’s perimeter to gain access to the entire dataset. This alone explains why information security is a non-negligible expenditure for companies storing sensitive data.
By definition, decentralized systems have no centralized data storage; it is common for each participant to store a full copy of the dataset. Mass replication, however, makes the data almost public. Even though consensus protects against inconsistent modifications, it does nothing to restrict unauthorized access.
Not only is the data public, but so is the logic of computation — the code. The strong condition of reproducibility requires opening access to the code and the entire history of computations, input arguments and outcomes.
When everything is public, a reasonable question arises — how can the data be protected from unauthorized access? One of the solutions is to use encryption.
A simple illustration of the encryption of a plaintext message
The virtue of encryption lies in decoupling access to data from its transfer and storage.
Encrypted data is indistinguishable from a random byte sequence. This draws a clear line between the owner, who knows the key, and everybody else, who doesn’t. The ownership of gigabytes of data reduces to the ownership of a 16-byte key, which can be embedded securely in a chip or even memorized. This principle was stated by Auguste Kerckhoffs [16] long before strong algorithms were actually invented. Consequently, encrypted data can be stored publicly [17] in multiple locations in a trustless environment, and decrypted only at the very ends of communication channels by the actual keyholders. This protocol is known as end-to-end encryption.
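The reduction of ownership to a small key can be illustrated with a toy scheme built from the standard library (the SHA-256 counter-mode keystream below is a stand-in for a real, audited cipher such as AES-GCM, and must not be used in production):

```python
import hashlib
import secrets

def keystream(key, nonce, length):
    # Toy keystream: SHA-256 in counter mode. Illustrative only.
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def encrypt(key, plaintext):
    nonce = secrets.token_bytes(16)
    stream = keystream(key, nonce, len(plaintext))
    return nonce + bytes(a ^ b for a, b in zip(plaintext, stream))

def decrypt(key, ciphertext):
    nonce, body = ciphertext[:16], ciphertext[16:]
    stream = keystream(key, nonce, len(body))
    return bytes(a ^ b for a, b in zip(body, stream))

key = secrets.token_bytes(16)           # ownership reduces to these 16 bytes
box = encrypt(key, b"private message")  # looks random; safe to store anywhere
```

Whoever holds `key` owns the data; the ciphertext itself can be replicated across any number of untrusted machines.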
Data protection is closely related to the idea of data ownership. It is somewhat complicated to give a theoretical definition of ownership and possession [18], so let us instead take some popular methods of sharing digital information and try to figure out who owns the data in each case.
In its Statement of Rights and Responsibilities [19], Facebook says:
For content that is covered by intellectual property rights … you grant us [Facebook] a non-exclusive, transferable, sub-licensable, royalty-free, worldwide license to use any IP content that you post on or in connection with Facebook.
In plain English, this statement explicitly grants Facebook the right to do with users’ data as it wishes. Facebook obtains the same rights to the data as its author; the user loses exclusive ownership.
Google’s Terms of Service [20] look no better than Facebook’s:
When you upload, submit, store, send or receive content to or through our Services, you give Google … a worldwide license to use, host, store, reproduce, modify, create derivative works …, communicate, publish, publicly perform, publicly display and distribute such content.
However, many of Google’s services (Gmail, Drive, etc.) are perceived to be private when, according to the Terms of Service, they are in fact not. There should be no illusions concerning ownership: Google has the same rights to a file uploaded to Drive as Facebook does to a publicly shared wall post. A user’s data is protected against unauthorized access by other users, but never against the company itself. The very notion of “private data” therefore has to be reconsidered.
Dropbox’s Terms of Service [21] give the company limited rights to content stored on Dropbox:
‘When you use our Services, you provide us with things like your files, content, email messages, contacts and so on (“Your Stuff”). Your Stuff is yours. These Terms don’t give us any rights to Your Stuff except for the limited rights that enable us to offer the Services.
‘We need your permission to do things like hosting Your Stuff, backing it up and sharing it when you ask us to. Our Services also provide you with features like photo thumbnails, document previews, email organisation, easy sorting, editing, sharing and searching. These and other features may require our systems to access, store and scan Your Stuff. You give us permission to do those things, and this permission extends to our affiliates and trusted third parties we work with.’
Dropbox’s statements are more favorable to the end user than those of Google and Facebook, but they still lack a clear definition of the limits of its rights. Its well-known and controversial cross-user content de-duplication [22] counts as part of the Services: it optimizes storage and access, but leaves unclear to whom a piece of shared data belongs. The Services may also include analysis of the data, such as automated content recognition, in order to comply with the DMCA [23]:
‘Dropbox will take whatever action … including removal of the challenged content from the Site.’
Telegram has widely advertised itself as a secure messaging app [24]. Messages sent through a secret chat are end-to-end encrypted. The code of the client app [25], which performs key generation, encryption and decryption, is publicly available, as is detailed documentation of the protocol [26]. However, much controversy [27, 28] has been raised over the homebrew cryptographic protocol, which has not been extensively tested and reviewed in public. As of today, there is no known break of the Telegram crypto-system, but there are no guarantees it won’t happen in the future.
The first version of GnuPG was released 17 years ago; today it still remains one of the most reliable cryptographic tools. Its cryptographic algorithms and their implementations are widely accepted and reviewed by cryptographic and open-source software communities, and the absence of known vulnerabilities gives its users a high level of certainty that the software is secure.
A message encrypted with GnuPG and stored remotely cannot be said to be owned by anyone except the keyholder, and it is very unlikely that the underlying cryptography will be broken in the next decade. However, no one knows what the future holds for this technology [29].
One has to be extremely careful when reading various security pamphlets [30, 31, 32]. The oft-mentioned term encryption rarely means end-to-end encryption, which can give users the false impression that software is secure. Unless the user generates the keys and performs encryption and decryption using widely audited open-source tools, such ‘encryption’ is no better than an additional security measure that makes the data harder to reach for an unauthorized user. If the key is generated by a closed-source ‘official client’, the encrypted data can always be compromised.
Coming back to the original question of data protection, we may now give a complete description of a decentralized system through its properties:
The system has a publicly known set of rules defining all possible data mutations. The rules are validated independently by each participant; the outcomes are exchanged according to a Byzantine fault-tolerant protocol. The protocol protects the majority of honest participants against any dishonest minority.
The end user fully controls his data, relying on end-to-end encryption. Keys are known only to the user.
Data is considered public. Private data is stored publicly in encrypted form. Encryption keys are kept separately by independent keyholders.
Cryptographic algorithms are widely accepted by the community and considered secure.
An open-source implementation with high-level documentation is publicly available for independent review and audit.
Centralized systems provide great efficiency and flexibility, but fail nonetheless to deliver enough credibility to give users complete trust in the security of their data. This explains today’s gradual shift towards decentralization, and the rise of public interest in modern security: consensus algorithms and advanced cryptography.
The strong guarantees given by an encryption protocol bring us to the other side of the spectrum. Knowing or not knowing the key is an ‘all or nothing’ game, with a huge gap in between. Information accessible to a single person has little value for society: encrypted data cannot be processed in any way except decryption with the right key, and data that cannot be shared is worth nothing, since value comes from sharing.
Sharing, however, involves the risk of losing control. Today the cost of storing and transferring information is negligible; there is no longer an economic barrier preventing the copying and storing of someone else’s data. Existing legal barriers that treat information as a material commodity have never worked well [33, 34].
The solution for the problem of data protection should rely on technologies, rather than laws and regulations.
Here is a typical scenario. Alice wants to book an airline ticket, but due to regulations she must be at least 16 years old. Bob is a travel agent, responsible for complying with the regulations. Alice, wishing to prove her age, shows Bob her ID card; he verifies the date of birth and performs the transaction if everything is fine.
An illustration of a zero knowledge proof from Proof of Process [35]
The problem with this information exchange is that Bob receives more data than he needs to accomplish his duty. Alice risks being judged on her gender, place of birth and exact age. She also risks digital identity theft, since by sharing her data with Bob, a new actor, she loses control of it. Alice therefore has to explicitly trust Bob.
A neat solution would be to keep the data private and share only facts about the data. Instead of revealing her ID card, Alice may send Bob the message ‘I’m at least 16 years old’. Such a message on its own, however, is completely useless to Bob: Alice can lie at no cost. This is not the case with an ID card, whose physical form makes falsifying its data much harder and thereby gives the data its value.
The theory of computation is one of the pillars of cryptography. This branch of mathematics classifies problems by the amount of computing power required to solve them. Some problems are said to be intractable: impossible to solve efficiently. Intractable problems play the same role in cryptography as stamps, watermarks and holograms do in securing papers and cards.
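This asymmetry is easy to demonstrate with a cryptographic hash function: verifying a claimed input takes microseconds, while recovering an unknown input from its fingerprint is believed to require brute force over an astronomically large space (a simple illustration; the phrases are arbitrary):

```python
import hashlib

# Verifying a claimed input against a fingerprint is instantaneous.
fingerprint = hashlib.sha256(b"attack at dawn").hexdigest()

# Recovering an unknown input is another matter entirely: the only known
# strategy is to try candidates one by one, and a real search space has
# on the order of 2**256 of them.
def brute_force(target, candidates):
    for candidate in candidates:
        if hashlib.sha256(candidate).hexdigest() == target:
            return candidate
    return None  # the tiny list above barely scratches the space

found = brute_force(fingerprint, [b"retreat", b"attack at dawn"])
```

The verifier's work is trivial; the forger's work is intractable, exactly the role a watermark plays on paper.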
Returning to the example, modern cryptographic protocols allow Alice to produce the message ‘I’m at least 16 years old’ in such a way that its veracity can be verified against her encrypted ID card data alone. The original data remains encrypted, so Alice retains ownership of it. At the same time, the message is valuable to Bob, because providing false data is intractable for Alice due to the theoretical limits imposed by the algorithms. The branch of cryptography that provides ways to construct such messages is known as zero knowledge proof.
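A classic building block here is Schnorr’s identification protocol, a zero knowledge proof that a prover knows a secret without revealing it. The sketch below uses deliberately tiny, illustrative parameters to show the structure of the commit-challenge-response exchange; it is not a deployable scheme, and proving a statement like an age range requires far more elaborate machinery on top:

```python
import secrets

# Toy group parameters: p = 2q + 1 with p, q prime; g generates the
# subgroup of order q. Real deployments use large, standardized groups.
p, q, g = 1019, 509, 4

x = secrets.randbelow(q)   # Alice's secret (standing in for private data)
y = pow(g, x, p)           # public value; reveals nothing useful about x

r = secrets.randbelow(q)   # 1. Alice commits to a fresh random value
t = pow(g, r, p)
c = secrets.randbelow(q)   # 2. Bob issues a random challenge
s = (r + c * x) % q        # 3. Alice responds, blinding x with r

# Bob checks g^s == t * y^c (mod p) using public values only. It holds
# because g^(r + c*x) = g^r * (g^x)^c, yet s alone leaks nothing about x.
verified = pow(g, s, p) == (t * pow(y, c, p)) % p
```

Bob ends up convinced that Alice knows `x` while learning nothing about it, which is precisely the trade the ID card example calls for.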
Alice shares no data beyond the facts she deliberately reveals, so her privacy is protected. The facts she communicates to Bob may leak so little about her that she may be completely fine with having them publicly known. In that case Bob is not obliged to keep and process private data; he can be a completely public and transparent executor. If Bob’s decision-making process can be partially automated, some of his duties can be replaced with software. Bob, as a piece of software, has to behave well and resist abuse by the travel agency, or by the airline in pursuit of economic interests. Since he operates on public data, he can be fully transparent and auditable. These requirements make the perfect use case for decentralized systems. In addition, the exchange of verifiable facts about data gives users the ability to make transactions. We can define such systems as decentralized proof systems.
A decentralized proof system is a decentralized system which transparently provides trust, strong privacy protection, and process automation with Byzantine fault-tolerant protocols, end-to-end encryption, zero knowledge proofs, and reproducible deterministic computation.
Most decentralized systems offer a binary choice between utility and privacy (utility in the sense of how much value and service the system provides) [36]. Some systems [37, 38, 39] require keeping the data in cleartext, with little or no privacy protection; others are data-agnostic and can store encrypted gibberish [40, 41, 42], but fail to impose any logic on the data itself. In recent years a few hybrid systems have appeared [43, 44, 45]. They all leverage the power of zero knowledge proofs to execute code in a decentralized manner while preserving privacy. However, these systems remain confined to a simple digital asset exchange model.
A digital asset exchange model covers many use cases, but the terrain is still largely unexplored. At Stratumn, our major goal is to build generalized proof systems capable of fulfilling requirements across a wide spectrum of applications: from energy to insurance, supply chains to medicine, banking to public services. Such systems model natural processes within the framework of a prover-verifier dialogue (see Proof of Process [46]). We believe that the instrumentalization of trust is an inevitable process over the coming years, and that generalized proof systems are the right tools to make this transition happen.
“Don’t think. If you think, then don’t speak. If you think and speak, then don’t write. If you think, speak and write, then don’t sign. If you think, speak, write and sign, then don’t be surprised.”
This saying circulated among the Soviet intelligentsia under omnipresent government surveillance. Surveillance narrows the freedom of expression and thought; it undermines trust. Without trust, society disintegrates. In today’s increasingly complex world, technological innovation with regard to the very fundamentals of trust and information exchange is perhaps the best way both to help organizations run more smoothly and to help individuals feel free to express themselves, without fear of surprise.
Barton Gellman, Ashkan Soltani (October 30, 2013). ‘NSA infiltrates links to Yahoo, Google data centers worldwide, Snowden documents say’. The Washington Post.
Greenwald, MacAskill, Poitras, Ackerman, Rushe (July 11, 2013). ‘Microsoft handed the NSA access to encrypted messages’. The Guardian.
Gil Press (May 9, 2013). ‘A Very Short History Of Big Data’. Forbes.
Russell Walker (August 12, 2015). ‘Big Data, Open Data, and the Asymmetry of Information’.
Will Evans (December 12, 2016). ‘Uber said it protects you from spying. Security sources say otherwise’. The Center for Investigative Reporting.
Charles Arthur (September 1, 2014) ‘Naked celebrity hack: security experts focus on iCloud backup theory’. The Guardian.
Matthew Philips (August 3, 2012). ‘Knight Shows How to Lose $440 Million in 30 Minutes’. Bloomberg.
Lamport, Shostak, Pease (July 1982). ‘The Byzantine Generals Problem’. ACM Transactions on Programming Languages and Systems, Vol. 4, No. 3.
Castro, Liskov (November 2002). ‘Practical Byzantine Fault Tolerance and Proactive Recovery’. ACM Transactions on Computer Systems, Vol. 20, No. 4.
Kotla (2008). ‘xBFT: Byzantine Fault Tolerance with High Performance, Low Cost, and Aggressive Fault Isolation’. UT Electronic Theses and Dissertations.
Lamport (December 5, 2010). ‘Byzantine Paxos by Refinement’. Proceedings of the 25th international conference on Distributed computing. pp. 211-224
Poon, Dryja (January 14, 2016). ‘The Bitcoin Lightning Network: Scalable Off-Chain Instant Payments’.
Buchmann, Demirel, Derler, Schabhüser, Slamanig (May 4, 2016). ‘Overview of Verifiable Computing Techniques Providing Private and Public Verification’.
Marko Vukolic (May 1, 2016). ‘The Quest for Scalable Blockchain Fabric: Proof-of-Work vs. BFT Replication’. Open Problems in Network Security. Lecture Notes in Computer Science, volume 9591.
Peticolas, Fabien. Electronic version and English translation of ‘La cryptographie militaire’.
Jason Murdock (June 21, 2016). ‘WikiLeaks unleashes new 88GB “insurance file” onto the web – but what’s inside them?’. International Business Times.
Hegel (1820). ‘Elements of the Philosophy of Right’, §54-58.
Facebook Statement of Rights and Responsibilities, rev.: January 30, 2015.
Google Terms of Service, rev.: April 14, 2014.
Dropbox Terms of Service, rev.: May 1, 2015.
Christopher Soghoian (April 12, 2011). ‘How Dropbox sacrifices user privacy for cost savings’.
‘What is Telegram?’. Telegram FAQ.
‘Secret chats, end-to-end encryption’. Telegram API.
Moxie Marlinspike (December 19, 2013). ‘A Crypto Challenge For The Telegram Developers’.
William Turton (June 26, 2016). ‘Why You Should Stop Using Telegram Right Now’. Gizmodo.
‘Encryption’. Microsoft Trust Center.
Fred von Lohmann (May 2, 2007). ‘09 f9: A Legal Primer’. EFF Deeplinks.
‘Digital Rights Management: A failure in the developed world, a danger to the developing world’, (March 23, 2005). EFF Whitepapers.
Das Gupta, Caetano, Ali Ansar, Florquin, Cieplak (2017). ‘Proof of Process’.
Richard Caetano (20 November 2016). ‘Breaking out of Distributed Ledgers’.
Satoshi Nakamoto (October 2008). ‘Bitcoin: A Peer-to-Peer Electronic Cash System’.
McConaghy, Marques, Müller, De Jonghe, McConaghy, McMullen, Henderson, Bellemare, Granzotto (February 2016). BigchainDB White Paper.
Juan Benet (April 1, 2015). ‘IPFS - Content Addressed, Versioned, P2P File System’.
Bram Cohen (May 22, 2003). ‘Incentives Build Robustness in BitTorrent’.
Maymounkov, Mazières (March 2002) ‘Kademlia: A Peer-to-peer Information System based on the XOR metric’. Proceedings of the 1st International Workshop on Peer-to-Peer Systems (IPTPS '02), p. 53-65
Nicolas van Saberhagen (October 17, 2013). ‘CryptoNote v 2.0’.
Kosba, Miller, Shi, Wen, Papamanthou (2016). ‘Hawk: The Blockchain Model of Cryptography and Privacy-Preserving Smart Contracts’. IEEE Symposium on Security & Privacy (Oakland).
Ben-Sasson, Chiesa, Garman, Green, Miers, Tromer, Virza (2014). ‘Zerocash: Decentralized Anonymous Payments from Bitcoin’. In proceedings of the IEEE Symposium on Security & Privacy (Oakland).
Das Gupta, Caetano, Ali Ansar, Florquin, Cieplak (2017). ‘Proof of Process’.