# VeriStream – A Framework for Verifiable Data Streaming

- First Online:

## Abstract

In a Verifiable Data Streaming (VDS) protocol a computationally weak client outsources his storage to an untrusted storage provider. Later, the client can efficiently append and update data elements in the already outsourced and authenticated data set. Other users can stream arbitrary subsets of the authenticated data and verify their integrity on-the-fly, using the data owner’s public verification key. In this work, we present VeriStream, a fully-fledged framework for verifiable data streaming with integration into Dropbox. At its core, our framework is based upon a novel construction of an authenticated data structure, which is the first one that allows verifiable data streams of unbounded length and at the same time outperforms the best known constructions in terms of bandwidth and computational overhead. We provide a detailed performance evaluation, showing that VeriStreamonly incurs a small bandwidth overhead, while providing various security guarantees, such as freshness, integrity, authenticity, and public verifiability, at the same time.

## 1 Introduction

Cloud storage providers like Dropbox, Amazon Cloud Drive, and Google Drive are on the rise and constantly gain popularity. Users are able to outsource their storage into the “cloud” of some dedicated provider and access or share their data with others later on. The advantages of cloud storage are manifold. Among many, users are no longer bound to specific devices or locations when accessing their data and users can share or collaborate on their data with others easily. Many of these providers allow their users to retrieve, i.e. stream, smaller subsets of the initially outsourced data set. In the case of multimedia content, prominent examples are YouTube and SoundCloud. They allow users to upload their audio and video files and share them with others. A different user can stream the whole uploaded file or just smaller parts of it. This streaming scenario is not solely limited to multimedia content. Another interesting example can be found in the stock market. Here, stock brokers base their purchasing decisions on the latest published stock quotes. These stock quotes are published by trusted stock managers and distributed through web services like Yahoo Finance or quote.com Brokers use these services to stream the latest published stock quotes and buy or sell stocks accordingly.

Comparison of existing and proposed CAT constructions. N is the upper bound of elements that can be authenticated, whereas M is the number of already authenticated elements. Security proof indicates whether the construction’s proof of security is given in the standard model or whether it requires the random oracle model.

Proof size | Client’s state | Upload time/space | Update time/space | Streaming time/space | Unbounded | Security proof | |
---|---|---|---|---|---|---|---|

[22] | \(\mathcal {O}(\log N)\) | \(\mathcal {O}(\log N)\) | \(\mathcal {O}(\log N)\) | \(\mathcal {O}(\log N)\) | \(\mathcal {O}(\log N)\) | ✗ | Standard |

Dynamic | \(\mathcal {O}(\log M)\) | \(\mathcal {O}(1)\) | \(\mathcal {O}(\log M)\) | \(\mathcal {O}(\log M)\) | \(\mathcal {O}(\log M)\) | ✓ | ROM |

\(\delta \)-bounded | \(\mathcal {O}(\log M)\) | \(\mathcal {O}(1)\) | \(\mathcal {O}(\log M)\) | \(\mathcal {O}(\log M)\) | \(\mathcal {O}(\log M)\) | ✗ | Standard |

### 1.1 Our Contribution

On the practical side, we present VeriStream, the first fully-fledged framework for providing streaming applications with security guarantees, such as stream authenticity, integrity, correct ordering of the streamed elements, public verifiability, and efficient updates simultaneously. The VeriStreamstandalone client can be used upload, update, and stream content from personal web servers. In addition, VeriStreamallows its users to use their Dropbox account as the underlying storage layer. Apart from up- and downloading arbitrary files in an authenticated fashion, our framework also supports video and audio streaming with on-the-fly verification. In Sect. 5 we provide a detailed performance evaluation of VeriStreamand compare its performance to the construction from [22].

On the theoretical side we improve upon the state-of-the-art for verifiable data streaming [22]. Their construction is upper bounded during the initialization by some parameter *N*, meaning that it can authenticate up to *N* elements. Their construction incurs a computational and bandwidth overhead of \(\mathcal {O}(\log N)\) for each outsourced, updated, or streamed element. The size of their client’s state is \(\mathcal {O}(\log N)\). In particular this means, either *N* is chosen large, e.g. polynomial, to be able to authenticate a quasi unbounded amount of elements, which incurs a prohibitively large overhead, or *N* is chosen small, which in turn means that the resulting construction can only authenticate a limited number of elements.

*unbounded*number of elements and is secure in the random oracle model. The second one, the \(\delta \)-bounded CAT, has an upper bound on the number of elements it can authenticate, and we prove its security in the standard model. Both of our constructions only incur a computational and bandwidth overhead of \(\mathcal {O}(\log M)\) for each outsourced, updated, or streamed element, where

*M*is the number of authenticated elements so far. Note that in general the number of outsourced elements

*M*is significantly smaller than the upper bound

*N*. The size of our client’s state is \(\mathcal {O}(1)\). For a concise comparison of the existing and our proposed constructions see Table 1.

### 1.2 System Overview

In this section we outline the high-level workflow and usage of VeriStreambased on the classic task of outsourcing and sharing data. An overview is given in Fig. 1. The major entities are the data owner, other clients that may read data uploaded by the data owner, and the untrusted server storing the data. To allow an easy integration of our framework into already deployed systems, we designed VeriStreamto coexist with the existing system. This means that our framework does not directly alter or modify any of the transmitted data, but only appends and strips its own additional data to and from the transmitted data chunks. All involved entities use VDS handlers to authenticate or verify transmitted data. When the data owner wants to upload some data to the server, he initializes his local VDS handler with his secret key. Rather than transmitting all data chunks directly to the server, they are passed through the VDS handler, which authenticates them on-the-fly by appending a proof of correctness to the data chunk. The retrieving server uses his VDS handler to strip the proof from each data chunk. The data itself is stored in a database and the proofs are stored in a CAT. It should be noted, that uploading data to the server does not require the owner to update his verification key after each chunk, which would put an unrealistic burden on the public directory or PKI that handles the keys.

A client can request data chunks by transmitting their indices. The server fetches the data chunks from the database and computes the corresponding proofs of correctness using his VDS handler, which has access to the CAT. The data and the appended proofs are sent to the client, who can verify the correctness and authenticity of each received data chunk separately on-the-fly, using the VDS handler and the data owner’s public verification key.

When the data owner wants to update some data chunk in the database, he first receives the element the same way other clients do. After verifying the authenticity of the retrieved chunk, he uses his VDS handler in combination with the retrieved data chunk, its proof of correctness, and the new data chunk to compute a new proof of correctness for the new chunk. The new data chunk with the appended proof is then sent to the server. The owner updates his public verification key at the PKI or the public directory. This is necessary to invalidate the now stale chunk and protect other users from retrieving old data from the server. However, even though the owner only requested and modified the updated chunk, all other outsourced data chunks remain valid under the new verification key.

### 1.3 Related Work

Verifiable data streaming (VDS) protocols have been introduced by Schröder and Schröder [22]. The authors formalized the problem on a theoretical level and gave a first construction for a bounded number of elements. Their shortcomings in comparison to our constructions are already discussed in Sect. 1.1. A related line of research investigates verifiable databases (VDBs). VDBs have been extensively investigated in the context of accumulators [5, 6, 16] and authenticated data structures [14, 15, 19, 26]. These approaches, however, often rely on non-constant assumptions, such as the *q*-Strong Diffie-Hellman assumption, as observed in [4]. More recent works, such as [4] or [8], only support a polynomial number of values instead of exponentially many, and the scheme of [4] is not publicly verifiable. Furthermore, the VDB schemes require the data owner to update his verification key after each newly uploaded element. In contrast, in a VDS protocol, data can be added non-interactively and without updating the verification key by sending a single message to the server. VDS can be seen as a generalization of VDBs.

Another line of research deals with (dynamic) proofs of retrievability (PoR). Here, the client uploads his data to some untrusted server. A PoR protocol allows the user to efficiently verify, whether all of his data is still stored on the server [7, 10, 23, 24]. A weaker form of PoR are so called proofs of data possession [9]. They only ensure that the server stores most of the data. In these scenarios the protocols only ensure that all or most of the data is stored, but they do not provide any security guarantees w.r.t. the authenticity of streamed content.

Recently, the notion of streaming authenticated data structures was introduced by Papamanthou, Shi, Tamassia and Yi [18], where a computationally weak client and a server observe a stream of data independently. Afterwards the client can perform range queries and verify the results from the server against a verification value it computed while observing the stream. However both notions differ in the following aspects: The verification token of their scheme changes after each streamed/uploaded element, while ours does not. In their scheme, no secret key is involved, which means that a client can only verify responses if he has either seen the seen stream, or if he obtained the verification token from a trusted party. Furthermore, since the key changes after each new element, all elements that are transmitted after receiving the verification token cannot be verified. Our proofs are logarithmic in the size of the uploaded dataset, while theirs are logarithmic in the size of the universe from which the elements are drawn. Finally, we provide comprehensive benchmark results, while their work only provides a asymptotical run time analysis.

Another successful line of research consider “pure” streaming protocols between a sender and possibly multiple clients, such as TESLA and their variants such e.g., [20, 21]. In contrast to our setting, the TESLA protocols assume that the sender and the receiver are loosely synchronized and these protocols do not offer public verifiability. The signature based solution of [21] is also different, because the protocol does not support efficient updates, which is a necessary property for our applications, such as e.g., verifiable cloud storage.

Open image in new window There are a few seemingly simple solutions to the described problem, which do not work. In this paragraph we would like to discuss the shortcomings of some of them. The first idea might be to use a simple Merkle Tree. In a Merkle tree the data is stored in the leaves and the value of each internal node is defined as the hash of the concatenation of its children’s values. The verification key is the value of the root node and a proof of correctness for some data chunk consists of all nodes that are required to compute the root node’s value starting from the leaf, where the data chunk is stored. This solution would recompute the tree after each uploaded element. This means, that whenever we upload some new data chunk, the verification key is updated, which puts an infeasible burden on the public directory or PKI that stores the public keys. A different approach would be to use signature chains, i.e. for all two adjacent data chunks we compute one signature. Here, the problem is that efficient updates are not possible. When updating a data chunk, we need to invalidate its old version, but here the verification key is just the signature’s public key and does not depend on the uploaded data itself. The data owner would need to update the signature’s public key and recompute all signature in the chain, which is clearly infeasible. The same argument also holds for forward-secure signature schemes [13], where the secret key is updated from time to time. Since the public key remains the same, freshness cannot be ensured, i.e. a user is not able to distinguish the fresh data chunk from a stale version thereof.

## 2 Chameleon Authentication Trees

Our formal definition of CATs differs slightly from [22], since we directly model updates as a property of the CATs. This allows us to build VDS protocols in a black-box way from CATs, while [22] needed to make specific *non*black-box assumptions about the proof that might not hold in general. The second difference is that we do not put an upper bound on the number of leaves. Thus, the only input of \({{\mathsf {catGen}}}\) is the security parameter. The formal definition of VDS from [22] can easily be adapted.

**Definition 1**

\({{\mathsf {catGen}}}(1^{{\lambda }})\): The key generation algorithm takes the security parameter \({{\lambda }}\) and outputs a key pair \(({\mathsf {vp}}, {\mathsf {sp}})\). For simplicity we always assume that \({\mathsf {vp}}\) is contained in \({\mathsf {sp}}\).

\({\mathsf {catAdd}}({\mathsf {sp}}, \ell )\): The insertion algorithm takes a secret key \({\mathsf {sp}}\), and a datum \(\ell \) from some data space \(\mathcal {L}\). It outputs a new secret key \({\mathsf {sp}}'\), a position

*i*at which \(\ell \) was inserted and a proof \(\pi _i\), which is a publicly verifable proof showing that \(\ell \) is indeed stored at position*i*.\({\mathsf {catUpdate}}({\mathsf {sp}}, i, \ell )\): The update algorithm takes the secret key \({\mathsf {sp}}\), a position

*i*at which we want to perform the update, and the new datum \(\ell \in \mathcal {L}\) as input. It replaces the current datum at*i*with \(\ell \) and outputs a new key pair \(({\mathsf {vp}}', {\mathsf {sp}}')\) as well as a proof \(\pi _i\) for the new datum.\({\mathsf {catVerify}}({\mathsf {vp}}, i, \ell , \pi _i)\): The verification algorithm takes the public key \({\mathsf {vp}}\), a position

*i*, a datum \(\ell \) and a proof \(\pi _i\) as input and outputs 1 iff \(\ell \) is stored at position*i*. It outputs 0 otherwise.

*secure*if no efficient adversary can modify the tree by changing the sequence of the data stored in it, substituting any datum, or by adding further data to it. In particular, the definition also prevents the adversary from returning stale to clients. The game is defined as follows:

Setup: The challenger generates a key-pair \(({\mathsf {sp}},{\mathsf {vp}}){{\,}\leftarrow {\,}}{\mathsf {catGen}}(1^{{\lambda }})\) and hands \({\mathsf {vp}}\) over to the adversary \({\mathcal {A}}\).

Uploading: Proceeding adaptively, the attacker \({\mathcal {A}}\) uploads a datum \(\ell \in \mathcal {L}\) to the challenger. The challenger adds \(\ell \) to the database, computes \(({\mathsf {sp}}',i,{\mathsf {\hat{\pi }}}) {{\,}\leftarrow {\,}}{\mathsf {catAdd}}({\mathsf {sp}},\ell )\), and returns \((i,{\mathsf {\hat{\pi }}})\) to \({\mathcal {A}}\). Alternatively, the adversary may update any element in the outsourced database by sending an index

*i*, a datum \(\ell '\) to the challenger. The challenger then runs the update algorithm with \({\mathcal {A}}\) updating \(\ell _i\) to \(\ell '\). At the end of the update protocol the challenger returns the updated proof \(\pi '_i\) and the updated public-key \({\mathsf {vp}}'\) to \({\mathcal {A}}\). Denote by \(Q:=\{(\ell _1,1,{\mathsf {\hat{\pi }}}_{1}), \dots ,(\ell _{q({{\lambda }})},q({{\lambda }}),{\mathsf {\hat{\pi }}}_{q({{\lambda }})})\}\) the ordered sequence of the latest versions of all uploaded elements and let \({\mathsf {vp}}^*\) be the corresponding public key.- Output: Eventually, \({\mathcal {A}}\) outputs \((\ell ^*,i^*,{\mathsf {\hat{\pi }}}^*)\). The attacker \({\mathcal {A}}\) is said to win the game if one of the following two conditions is true:
- a)
If \(1 \le i^* \le q({{\lambda }})\) and \((\ell ^*,i^*,{\mathsf {\hat{\pi }}}^*)\not \in Q\) and \({\mathsf {catVerify}}({\mathsf {vp}}^*,i^*,\ell ^*,{\mathsf {\hat{\pi }}}^*)=1\).

- b)
If \(i^* > q({{\lambda }}) \) and \({\mathsf {catVerify}}({\mathsf {vp}}^*, i^*,\ell ^*,{\mathsf {\hat{\pi }}}^*)=1\).

- a)

We define \({\mathbf {Adv}}^\mathsf{sec}_{\mathcal {A}}\) to be the probability that the adversary \({\mathcal {A}}\) wins in the above game.

**Definition 2**

A chameleon authentication tree \({\varPi _\mathsf {CAT}}=({{\mathsf {catGen}}}, {\mathsf {catAdd}},{\mathsf {catUpdate}},{\mathsf {catVerify}})\) is secure if for any \(q \in \mathbb {N}\), and for any efficient algorithm \({\mathcal {A}}\), the probability \({\mathbf {Adv}}^\mathsf{sec}_{\mathcal {A}}\) evaluates to 1 is negligible (as a function of \({{\lambda }}\)).

## 3 Constructing Fully Dynamic CATs

We now present our fully dynamic CAT construction, which is the first construction that is able to authenticate an unbounded number of data elements and improves upon the state-of-the-art in terms of computational and bandwidth overhead. In the following we first recall the definition of chameleon hash functions and then present our construction.

### 3.1 Chameleon Hash Functions

A chameleon hash function is a randomized hash function that is collision-resistant but provides a trapdoor to efficiently compute collisions. It is defined through the tuple \({\mathcal {CH}}= ({\mathsf {chGen}}, {\mathsf {ch}},{\mathsf {col}})\) [12], where the key generation algorithm \({\mathsf {chGen}}(1^{{\lambda }})\) returns a key pair \(({ csk }, { cpk })\). We set \({\mathsf {ch}}(\cdot ) := {\mathsf {ch}}({ cpk },\cdot )\) for the remainder of this paper. The function \({\mathsf {ch}}(x;r)\) produces a hash value \(h \in \{0,1\}^{\text {out}}\) for a message \(m \in \{0,1\}^{\text {in}}\) and a randomness \(r \in \{0,1\}^{{{{\lambda }}}}\). The function is collision-resistant meaning that given \({ cpk }\) it is computationally difficult to compute a tuple \((m,r),(m',r')\) such that \((m,r)\ne (m',r')\) and \({\mathsf {ch}}(m,r)={\mathsf {ch}}(m',r')\). However, using the trapdoor \({ csk }\) and the collision-finding algorithm \({\mathsf {col}}({ csk }, x, r, x')\) we can break the collision-resistance property and find a value \(r'\) such that both, (*x*, *r*) and \((x',r')\) map to the same hash value.

We call \({\mathcal {CH}}\) invertible if it is surjective and there exists an efficient algorithm \({\mathsf {scol}}({ csk }, x, y)\) that outputs an *r* for any input *x* and *y* such that \(y={\mathsf {ch}}(x;r)\). This property has previously been defined by Shamir and Tauman [25].

Chameleon hash functions can be instantiated from the discrete-logarithm assumption [2, 12], the factoring assumption [25], the RSA assumption [2, 11], or in a generic way from certain \(\varSigma \)-protocols [3].

### 3.2 Intuition

The main idea of our construction is to build a binary tree, which stores the data elements in its leaves and grows dynamically from bottom up. Whenever the tree of a certain depth *d* is full, the data owner can increase the tree’s depth by one using his secret trapdoor. The resulting tree is of depth \(d+1\) and can therefore store another \(2^d\) elements. The previous full tree becomes the left child under the new root and the right child serves as a place holder for an empty subtree, which can be used to store new data.

This new approach of dynamically increasing the depth of the tree means that we cannot simply store the root node’s value in the public key, since it changes whenever the depth is extended. Instead, our idea is to define the new root through a deterministic function that is applied to a fixed value in the public key and depends on the current depth of the tree. More precisely, let \(\rho \) be the value in the public key \({ pk }\) and assume that the depth of the tree is *d*. Then, the root node is defined as \(H^{d}(\rho )\), where *H* is a collision-resistant hash function and by \(H^{d}(\rho )\) we denote the *d*-fold application of *H* to \(\rho \).

Our binary tree is structured as follows: Each node value is computed as the output of a function of the concatenated values of its children. For nodes which are left children themselves we use a collision-resistant hash function. For right children we use a chameleon hash function. The only exception to this rule are all nodes, that have been a root at some point, i.e., all nodes at the very left of each level. We insert the data elements into the trees starting from the leftmost leaf moving to the right. Whenever we insert some data element into a leaf we have to ensure that, roughly speaking, the root node’s value computed from that leaf remains the same as before the insertion. Therefore, we search for a node computed by a chameleon hash function on the path from the inserted leaf to the root and compute an appropriate collision using our secret trapdoor.

In the following, we exemplify basic idea of our construction with a small example, where we will refer to a node *v* at height *h* and index *i* by \(v_{h,i}\). Please note, even though we will be including two leaves at the same time, this should not be seen as a restriction or a problem. It is done for the sake of clarity and the construction can be easily extended to insert one leaf at a time, as it is done in our framework.

Open image in new window In the first step, we append the elements \(\ell _1\) and \(\ell _2\) to the tree. Since the tree is empty, i.e. it has depth 0, it is necessary to increase its depth by one. In order to add \(\ell _1\) and \(\ell _2\) to the tree without changing the root, the data owner uses his secret trapdoor to compute \(r_{1,0} {{\,}\leftarrow {\,}}{\mathsf {scol}}({ csk }, \ell _1 \Vert \ell _2 , H^1(\rho ))\). Hence \({\mathsf {ch}}(\ell _1\Vert \ell _2;r_{1,0}) = H(\rho )\). Recall that \({\mathsf {scol}}\) outputs some randomness *r* when given (*y*, *x*) and the secret key \({ csk }\) such that \(y={\mathsf {ch}}(x;r)\). At this stage the entire tree consists only of two leaves and one root node as depicted in Fig. 2 (level 1). To verify that the leaves \((\ell _1,\ell _2)\) are in the tree, the verification algorithm checks whether \(H^1(\rho ) = {\mathsf {ch}}(\ell _1 \Vert \ell _2; r_{1,0})\) holds.

Open image in new window Next, we add \(\ell _3\) and \(\ell _4\) to the tree. Since the current tree is full, we need to extend its height to obtain new free leaf positions. Therefore, we pick a random \(x_{1,1}\) and \(r_{1,1}\) and we compute the dummy node \(v_{1,1}{{\,}\leftarrow {\,}}{\mathsf {ch}}(x_{1,1};r_{1,1})\). The randomly chosen pre-images are stored by the client in his secret local state. To ensure the integrity of the tree, we need to find a randomness \(r_{2,0}\) for the new root \(v_{2,0}\) such that \({\mathsf {ch}}(H^1(\rho ) \Vert v_{1,1}; r_{2,0}) = H^2(\rho )\). Again, this is achieved by exploiting the inversion property of the chameleon hash function to compute \(r_{2,0} {{\,}\leftarrow {\,}}{\mathsf {scol}}({ csk }, H^1(\rho ) \Vert v_{1,1}, H^2(\rho ))\). We can now add our leaves \(\ell _3\) and \(\ell _4\) to the tree by appending them to the lowest free right child, which is \(v_{1,1}\). Thus, we compute \(r'_{1,1} {{\,}\leftarrow {\,}}{\mathsf {col}}({ csk }, x_{1,1}, r_{1,1}, \ell _3 \Vert \ell _4)\). The resulting proof for \(\ell _3, \ell _4\) would therefore contain \((v_{1,0}, r_{2,0}, r_{1,1})\). The corresponding tree is shown in Fig. 2 (level 2).

Open image in new window Since the tree is full again, we need to increase its depth the same way we did before. Afterwards, we search for the lowest right child, which does not have any children yet. In this case the node is \(v_{2,1}\) that has been computed by \({\mathsf {ch}}(x_{2,1};r_{2,1})\). The dummy values \((x_{2,1},r_{2,1})\) can be retrieved from the local client state. In order to append \(\ell _5\) and \(\ell _6\), we generate an empty subtree below \(v_{2,1}\). This subtree consists of a dummy node \(v_{1,3}\) and the leaves \(\ell _5\) and \(\ell _6\). After appending \(\ell _5\) and \(\ell _6\) below \(v_{1,2}\), we compute \(r'_{2,1} {{\,}\leftarrow {\,}}{\mathsf {col}}({ csk }, x_{2,1}, r_{2,1}, v_{1,2} \Vert v_{1,3})\). The proof for these elements contains \((r'_{2,1}, v_{1,3}, v_{2,0})\). Next, \(\ell _7\) and \(\ell _8\) can authenticated by appending them to \(v_{1,3}\) and computing a collision in the same fashion as in the previous steps.

Open image in new window The verification algorithm works analogously to the one of a Merkle tree. One might get the impression that the size of the proofs grows with the number of leaves *for all leaves*. This, however, is not the case. For instance, the node \(v_{1,0}\) verifies the leaves \(\ell _1\) and \(\ell _2\) even if \(2^{50}\) elements are stored in the tree. The verification algorithm still simply checks whether \(H(\rho )={\mathsf {ch}}(\ell _0 \Vert \ell _1;r_{{1,0}})\).

Open image in new window Whenever we wish to update the *i*-th element in the database to some element \(\ell _i'\), we simply replace the element, recompute the values on the path from \(\ell _i'\) to the corresponding root node, pick a fresh value \(\rho '\), and update all sub-roots w.r.t. \(\rho '\). Updating the sub-roots means that the client has to compute logarithmically many collisions.

Open image in new window In the intuitive description of our construction, the client’s state is logarithmic in the depth of the tree, since all created dummy nodes \(y{{\,}\leftarrow {\,}}{\mathsf {ch}}(x;r)\) are stored by the client. To reduce the client’s state to \(\mathcal {O}(1)\) we use a pseudorandom function PRF to compute the dummy elements on the fly. That is, for each dummy node \(v_{h,i}\), the clients computes the pair \((x_{h,i},r_{h,i}) {{\,}\leftarrow {\,}}{\mathsf {PRF}}(k,h\Vert i)\), rather than choosing it randomly. This allows us to recompute the dummy nodes we need on-the-fly without storing them.

The secret seed of the PRF is stored as part of the secret key. Therefore, the final secret key in our construction consists of the trapdoor \({ csk }\) of the chameleon hash function, the seed *k* of the PRF, and a counter *c* that keeps track of the next free leaf index. In practice and in our framework, one can instantiate the PRF using a symmetric encryption scheme, such as AES.

### 3.3 Formal Construction

We now provide a detailed description of all algorithms, that have been sketched in the previous section. We avoid using the PRF in this description for the sake of clarity, but the modification is absolutely straightforward as described above.

**Construction 1**

Let \(H: \{0,1\}^{*} \mapsto \{0,1\}^{len}\) be a hash function and \({\mathcal {CH}}= ({\mathsf {chGen}}, {\mathsf {ch}}, {\mathsf {col}}, {\mathsf {scol}})\) an invertible chameleon hash function that maps strings of length \(\{0,1\}^{*}\) to \(\{0,1\}^{len}\). The fully-dynamic chameleon authentication tree \({\varPi _\mathsf {CAT}}=({{\mathsf {catGen}}}, {\mathsf {catAdd}}, {\mathsf {catUpdate}}, {\mathsf {catVerify}})\) consists of the following efficient algorithms:

Open image in new window The setup algorithm generates a key-pair of the chameleon hash \(({ cpk }, { csk }) {{\,}\leftarrow {\,}}{\mathsf {chGen}}(1^{{\lambda }})\), it picks a uniformly random value \(\rho {{\,}\leftarrow {\,}}\{0,1\}^{{\lambda }}\), and denote by \({\mathsf {st}}\) the private state. This state stores the next free leaf index *c*, a set of pre-images of unused dummy nodes and the last computed proof. Initially we set \(c {{\,}\leftarrow {\,}}0\), while the set of pre-images and the last computed proof are both empty. It returns the public verification key \({\mathsf {vp}}= ({ cpk }, \rho )\) and the private key \({\mathsf {sp}}= ({ csk }, {\mathsf {st}}, {\mathsf {vp}})\).

Open image in new window Parse \({\mathsf {sp}}\) as \(({ csk }, {\mathsf {st}}, {\mathsf {vp}})\) and check whether the current tree is full, i.e., whether *c* is a power of two:

Open image in new window In this case the current tree is full, we need to increase its current depth by one to obtain a tree of depth *d*, which has free leaves again. To do so, we store the old root node \(H^{d-1}(\rho )\) as the left child of the new root node \(H^{d}(\rho )\) and we create a dummy node \(v_{d-1,1}\) for the right child as follows: First, we pick the values \(x_{d-1, 1}\) and \(r_{d-1, 1}\) uniformly at random and we compute the dummy node \(v_{d-1, 1} {{\,}\leftarrow {\,}}{\mathsf {ch}}(x_{d-1,1};r_{d-1,1})\). Second, we exploit the inversion property of the chameleon hash function in order to compute \(r_{d, 0} {{\,}\leftarrow {\,}}{\mathsf {scol}}({ csk }, H^{d-1}(\rho ) \Vert v_{d-1,1}, H^{d}(\rho ))\). Next, we add \((x_{d-1,1}, r_{d-1,1})\) to the set of pre-images and \(v_{d-1,1}\) to the proof in \({\mathsf {st}}\) and proceed as in the case where *c* is not a power of two.

Open image in new window Since the tree is not full, we search for the lowest right child \(v_{i, j}\), which has no children, in the proof that was stored in \({\mathsf {st}}\) during the last run of \({\mathsf {catAdd}}\). Then, we generate a skeleton subtree below \(v_{i, j}\) the following way. First, we descend from \(v_{i, j}\) along the left edge until we reach height 1. Then we create one adjacent dummy node at each height, i.e., we create dummy nodes at \(v_{i-k, 2^k \cdot j + 1}\) for \(k = 1 \ldots i-1\). The pre-image of each created dummy node is added to \({\mathsf {st}}\). Now, we append the given leaf \(\ell \) as the left most child to the newly generated subtree at height 0. Given the leaf and the dummy nodes, the value of \(v_{i, j}\) can now be determined recursively by computing \(v_{i, j} {{\,}\leftarrow {\,}}v_{i-1, 2 \cdot j} \Vert v_{i-1, 2 \cdot j + 1}\). We re-establish the tree’s integrity by computing a randomness \(r'_{i, j} {{\,}\leftarrow {\,}}{\mathsf {col}}({ csk }, x_{i, j}, r_{i, j}, v_{i, j})\). We create a proof \(\pi \) for \(\ell \), which contains all newly created dummy nodes, \(r'_{i, j}\), the node adjacent to \(v_{i, j}\) and all nodes from the old proof, which were above \(v_{i, j}\). Finally, we increase the next free leaf index *c* in the client state by one, replace the proof in \({\mathsf {st}}\) with the newly generated one, and return it.

Open image in new window Parse \({\mathsf {vp}}\) as \(({ cpk }, \rho )\). In order to verify, whether \(\pi \) authenticates \(\ell \) we compute, starting from the bottom, each node as the hash or the chameleon hash of the concatenation of its two children until we compute a node with index 0. All nodes and randomnesses that are needed are taken from the given \(\pi \). In case the node we want to compute has a odd index, we use the chameleon hash function. Otherwise we use the hash function. Let \(v_{d,0}\) be the node at which we terminated. We return 1 iff \(v_{d,0} = H^{d}(\rho )\), and 0 otherwise.

Open image in new window Parse \({\mathsf {sp}}\) as \(({ csk }, {\mathsf {st}}, {\mathsf {vp}})\) and \({\mathsf {vp}}\) as \(({ cpk }, \rho )\). Request \(\ell _i\), with its proof of correctness \(\pi _i\). Request \(\ell _0\) with its proof of correctness \(\pi _0\). Compute \({\mathsf {catVerify}}({\mathsf {vp}}, i, \ell _i, \pi _i)\) and \({\mathsf {catVerify}}({\mathsf {vp}}, 0, \ell _0, \pi _0)\) and abort if one of them outputs 0. Let \(\pi = \pi _i \cup \pi _0\) denote the total set of nodes and randomnesses obtained by the client at this point. Replace \(\ell _i\) with \(\ell '\) and recompute all values that are on the path from \(\ell '\) to the root recessively. Pick a new \(\rho ' {{\,}\leftarrow {\,}}\{0,1\}^{{\lambda }}\) and replace \(\rho \) with \(\rho '\) in \({\mathsf {vp}}\). This means that at each height *h* we now have to ensure again that \(H^{h}(\rho ') = {\mathsf {ch}}(v_{h-1,0} \Vert v_{h-1,1})\). Therefore, we compute \(r'_{h,0} {{\,}\leftarrow {\,}}{\mathsf {scol}}({ csk }, v_{h-1,0} \Vert v_{h-1,1}, H^{h}(\rho '))\) at each height *h* and add all newly computed \(r'_{h,0}\) to \(\pi \) and return \(\pi \).

**Theorem 1**

If \(\mathcal {CH}\) is an invertible one-way collision-resistant chameleon hash function and *H* is a collision-resistant hash function modeled as a random oracle, then Construction 1 is a secure unbounded verifiable data streaming protocol.

Due to space constraints the security proof will be available in the full version.

## 4 Implementation

VeriStreamis written in Java and it contains all protocols described in this paper as well as a separate library for chameleon hash functions, which contains implementations of the Krawczyk-Rabin [12], the Ateniese and de Medeiros chameleon hash [2], and its elliptic curve equivalent. For the elliptic curve operations we used the Bouncy Castle Cryptographic API (Release 1.49) [1]. In addition we provide a generic interface for transforming \(\varSigma \)-protocols, that fulfil certain properties into chameleon hash functions [3]. Using this interface we instantiated a chameleon hash function from the Fiat-Shamir protocol [3]. Many chameleon hash functions only take input from certain message spaces, e.g., Krawczyk-Rabin expects messages from \(\mathbb {Z}_q^*\). We provide a simple wrapper that transforms them into functions that take arbitrary large inputs, by first hashing the input with a common collision-resistant hash function, like SHA-256, before passing it to the chameleon hash function. The remaining algorithms of the chameleon hash functions are adapted accordingly by the wrapper.

We developed a platform independent standalone client that uses VeriStream (see Fig. 3). It allows its users to manage, upload, download, or share files of an arbitrary format in an authenticated fashion. Users can choose whether they want to upload their data to a private web storage or whether they want to use their Dropbox account as the underlying storage layer. The client is able to stream audio and video content with on-the-fly-verification, even if Dropbox is the underlying storage layer.

**Efficiency Optimizations.** For performance and bandwidth reasons, our framework differs from the theoretical description in several points that we discuss in the following.

Open image in new window When uploading data to the server it is not necessary to always send the whole proof. Instead, it is sufficient to only send all nodes that are below the chameleon hash node that was extended. One can think of this optimization as transmitting only the delta between the previous proof(s) and the current one.

Open image in new window Recall that the insertion algorithm always generates a certain amount of chameleon dummy nodes by picking random pre-images and storing them in the client state. Later on, when we insert elements below one such dummy node, we use the pre-images from the state to compute a collision in that dummy accordingly. Now, instead of picking these pre-images completely at random, we provide the possibility to pick them using a pseudorandom function, which takes the dummy nodes position as input. This way, we reduce the client state to constant size, since we do not need to store the pre-images anymore. Furthermore, being able to compute the values of dummy nodes independent of the actual existing tree allows us to obtain concurrent versions of all protocols. The position of an element in the data stream while uploading uniquely defines its position in the CAT. Having the element, and using the pseudorandom function to obtain the dummy value to which that element will be appended, we can compute the short proof independent of the remaining tree. We believe that it is not straightforward to see that the parallelization of the CAT indeed works, because almost half of the nodes are computed using a collision-resistant hash functions and these nodes cannot be pre-computed without knowing the pre-images. However, a closer look at our construction shows that all these values belong to the left part of the tree and these elements have all been pre-computed before.

Open image in new window Depending on the concrete scenario, a single or multiple elements in succession can be requested. In the case, where multiple adjacent elements are requested, we exploit the following observation: Given the proof \(\pi _i\) for some leaf *i* and the proof \(\pi _{i+1}\) for its successor \(i+1\), all nodes in \(\pi _{i+1}\), that are above the node which was extended when inserting the leaf \(i+1\) are also contained in \(\pi _i\). Hence when a set of adjacent elements is requested, we send the full proof for the first and short proofs for the remaining elements.

## 5 Evaluation

In this section, we provide a comprehensive efficiency analysis of VeriStream. In this analysis, the chunk size is an important variable, because we compute one proof for each chunk and we therefore test our implementation with different chunk sized to obtain detailed insights into the protocols performance. In addition, we conduct several different benchmarks highlighting all possible operations for all discussed protocols. Our experimental analysis was conducted on a Intel Core i3-2120 CPU with 8 GB of RAM running Ubuntu 12.04 with Java 1.6.0.

*static*. To obtain meaningful and detailed performance results, we computed the averages of the following measurements:

Time for hashing a chunk and authenticating it.

Time for obtaining and verifying a proof from the CAT.

Bandwidth overhead produced by a proof retrieved from the CAT.

In the next step we analyzed the computational and bandwidth overhead of the retrieval operation. We measured the time it took to compute the proof from the CAT upon a client request for a certain element and verify the returned proof. More precisely we created a CAT that contained proofs for a 2 GB large data set and requested all chunks from it, such that each proof in the CAT had to be computed once. For each received proof the verification algorithm was executed once. We stress that we did not use the efficent method for retrieving sequential parts of the uploaded data, but purposely requested each chunk on its own with its full proof. The results of this experiment can be seen in Fig. 5. At the top one can see the average time it took to obtain a proof from the CAT and verify it. At the bottom one can see the average size of such a retrieved proof. The protocol from [22] performs worst w.r.t. to computational and bandwidth overhead, what confirms our expectation, since, in contrast to the dynamically growing trees, all elements in their construction verify against the very top level root value. This requires, on average, much more bandwidth and more computational power. Our two constructions perform roughly equally well as expected. Using the elliptic curve variant of the Ateniese and de Medeiros chameleon hash results in a improvement of roughly factor 5 with regards to the size of the proofs and a speed up of about factor 3.

## Acknowledgements

Dominique Schröder and Mark Simkin were supported by the German Federal Ministry of Education and Research (BMBF) through funding for the Center for IT-Security, Privacy, and Accountability (CISPA; see www.cispa-security.org). Dominique Schröder is also supported by an Intel Early Career Faculty Honor Program Award.