My main research area is coding theory. Anytime information is transmitted or stored, it is inevitable that errors or erasures will occur in transit. Left uncorrected, these would render the message unreadable. Coding theory is the study of how to encode a raw message into a codeword capable of preserving the information of the original message despite the presence of errors or erasures. The collection of all possible codewords in a particular encoding scheme is known as a code.
My particular research deals with constructing high-rate codes capable of correcting multiple erasures. I have specifically studied a family of repair schemes known as linear exact repair schemes. There are several natural applications of these constructions, but I have mainly focused on developing codes for distributed storage systems and for distributed multiplication schemes. To learn more, please see the sections below.
All codes are designed to transmit information efficiently; however, efficiency means different things in different contexts. One measure of efficiency is the rate of a code, which indicates how long the codewords of the code are compared to the amount of information they encode. The higher the rate of a code, the shorter its codewords are, and the more efficiently the information of the original message is packed into the codeword. Higher-rate codes are sought after as they require less space to store and less data to be sent in transmission.
More broadly, however, there are many different codes designed for many different applications. There are codes designed to detect/correct as many errors as possible, codes designed to repair as many erasures as possible, codes designed to fix some errors/erasures while using the shortest codewords possible, codes designed to repair erasures using as little information from the remaining codeword as possible, and so on. However, we cannot both minimize the length of codewords and maximize our error/erasure fixing capabilities in the same code. This is due to what I call the "fundamental relationship in coding theory."
The fundamental relationship in coding theory is the trade-off between the number of errors or erasures a code can fix and the length of its constituent codewords. Essentially, every code works by adding redundancy to the information of the original message. This redundancy is often introduced by creating codewords with a structure. An example is the natural code found in the English language. Consider the following sentence, which you could say to communicate that you are leaving.

"I am going away."
This is not only a beautiful farewell, but also a great demonstration of the structures hidden in natural language. Think about how in the above sentence the structures of the English language force certain words to be used in certain places, regardless of the intent of the speaker. Take these examples.
(1) Once the subject of the sentence was chosen to be "I", the present "to be" verb in the sentence had to be "am". It could not be "is" or "are" by the conjugations of English.
(2) As "am" indicates present/future tense the action verb here virtually has to end in "ing" to match the present/future tense.
(3) Since the place you are travelling to is "away", the only possible travelling verb which matches that locale is "go". Drastically oversimplifying here: in English you can essentially "come" or "go", and while you can both "come home" and "go home", you cannot "come away". Thus the sentence must employ a "go" verb here.
The point is that the structure of English forces each word of our sentence to depend on all of the words around it. This adds redundancy to our message, as certain words are not chosen by the speaker but determined by the rest of the message. The fact that the sentence starts with "I" is completely redundant information: once the "to be" verb is chosen to be "am", the only possible start of the sentence is "I am".
Importantly, we can use this redundancy to fix errors/erasures! For example, if you heard the sentence "__ am going away" where you missed the first word, you could be pretty sure the missing word should be "I" based on the redundancy we discussed above. Or if you heard "I am joined away" you could detect there is an error somewhere based on how "joined" does not make sense with the present tense of "am" or the locale of "away". And you can even use the structures above to determine that most likely the speaker must have said "I am going away" as "going" is one of the only words that works with both of these structures.
However, while this structure in English is great for fixing errors/erasures, it comes with a cost. Only a very small percentage of possible strings of English words are actually valid sentences. Let's say there are M potential ideas that a human might want to communicate via a sentence and q words in the English language. If there were no structure to the English language, then every string of English words would represent a sentence. This would mean there are roughly q one-word sentences, q^2 two-word sentences, q^3 three-word sentences, etc. Hence, to cover every possible idea, we would roughly need about n words per sentence, where n is such that

q^n ≈ M,  i.e.  n ≈ log_q(M).
But now let's think about the average sentence length given the actual structure of English. Suppose that only 5% of strings of English words are valid sentences. Then, very roughly, there are 0.05·q^n valid n-word sentences. Thus, redoing the math above,

0.05·q^n ≈ M,  i.e.  n ≈ log_q(M) + log_q(20) > log_q(M),
i.e., the average length of a sentence must increase. This means the same structure which allows us to fix errors/erasures in spoken language also forces our sentences to be longer, requiring more breath, time, and pages to communicate. This is equivalent to lowering the rate of the code.
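To make the arithmetic concrete, here is a quick numerical check of this argument in Python; the vocabulary size, the number of ideas, and the 5% figure are all made-up illustrative values.

```python
import math

q = 10_000        # hypothetical vocabulary size (made-up value)
M = 10 ** 15      # hypothetical number of ideas to express (made-up value)

# No structure: every string of words is a sentence, so q**n >= M
# and the required length is n = log_q(M).
n_unstructured = math.log(M, q)

# With structure: only 5% of strings are sentences, so 0.05 * q**n >= M
# and the required length grows to log_q(M / 0.05) = log_q(M) + log_q(20).
n_structured = math.log(M / 0.05, q)

print(f"without structure: {n_unstructured:.2f} words per sentence")  # 3.75
print(f"with structure:    {n_structured:.2f} words per sentence")    # ~4.08
```

Whatever values you plug in, the structured language always pays an extra log_q(20) words per sentence.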
Another way to think about this is to imagine the length of English sentences if we removed all of the redundancy. For example, as discussed above, in the sentence "I am going away" the "I" and the "ing" (and arguably even the "go") are completely redundant. So if we take out all of the redundant information, this sentence would just read "am go away". Notice this sentence is shorter, which illustrates the point that adding redundancy to a message must fundamentally increase its length.
This is the fundamental relationship of coding theory, and coding theorists can spend their entire careers optimizing tradeoffs like this for an endless variety of applications.
While I described the so-called fundamental relationship in coding theory above qualitatively, it can also be described quantitatively in the form of a bound. First we need to quantify the number of errors/erasures a code is capable of fixing. To do this, we will introduce what is called the minimum distance of a code. The minimum distance of a code is just the minimum number of symbols in which two valid codewords differ. The distance is a measure of the error correcting capability of a code, since it is known that a code with minimum distance d can correct ⌊(d - 1)/2⌋ errors.
Then, given a linear code (this is a special type of code which is a vector space) with codewords of length n, rate R, and minimum distance d, the relationship between the rate of this code and the maximum number of errors it can correct may be expressed as

d ≤ n(1 − R) + 1,

which is known as the Singleton bound. Note if you are familiar with the Singleton bound this inequality may look strange, but it is simply the traditional inequality d ≤ n − k + 1 with the dimension k replaced by Rn.
Using this bound, given a rate R we can determine the maximum number of errors a code with that rate could ever hope to correct. Any code meeting the Singleton bound is optimal in the sense that it corrects the most errors possible for its rate. That is not to say there is an optimal rate or an optimal number of corrected erasures for every code; the exact balance we strike between these two opposing parameters depends completely on the application for which a code is designed. The Singleton bound, however, tells us the best parameters achievable within those constraints.
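As a small illustration, here is this calculation as a Python sketch; the length and rate below are arbitrary example values.

```python
import math

def max_correctable_errors(n: int, rate: float) -> int:
    """Singleton-bound ceiling on correctable errors for a length-n code:
    d <= n(1 - R) + 1, and distance d corrects floor((d - 1) / 2) errors."""
    d_max = math.floor(n * (1 - rate)) + 1
    return (d_max - 1) // 2

# A rate-1/2 code of length 100 can never correct more than 25 errors.
print(max_correctable_errors(100, 0.5))  # 25
```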
A defining theme in my research is the application of linear exact repair schemes. Linear exact repair schemes are a subtopic of coding theory concerned with repairing erasures. In certain codes, known as evaluation codes, we are able to repair erasures in transmitted or stored data by interpolating polynomials over finite fields. These codes have the form

C = { (f(a_1), ..., f(a_n)) : f ∈ F },
where F is some set of allowable polynomials. If an entry f(a*) is erased or corrupted, a user can use a subset of the other entries to interpolate what the polynomial f must have been, and then evaluate it at the missing point a* to reproduce the erased data.
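Here is a toy sketch of this traditional interpolation repair in Python, over the prime field GF(7) with an illustrative message polynomial and erased position of my own choosing.

```python
p = 7  # work over the prime field GF(7)

def interpolate(points, x_star):
    """Lagrange interpolation over GF(p): recover the unique polynomial of
    degree < len(points) through the given (x, y) pairs, evaluated at x_star."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x_star - xj) % p
                den = den * (xi - xj) % p
        total = (total + yi * num * pow(den, p - 2, p)) % p  # Fermat inverse
    return total

# Encode with the message polynomial f(x) = 3 + 5x + 2x^2 (degree < k = 3).
f = lambda x: (3 + 5 * x + 2 * x * x) % p
codeword = {a: f(a) for a in range(p)}  # one symbol per evaluation point

a_star = 4
del codeword[a_star]  # erase the symbol at a* = 4

# Traditional repair: download k = 3 surviving symbols, rebuild the whole
# polynomial, then re-evaluate it at the erased point.
survivors = list(codeword.items())[:3]
print(interpolate(survivors, a_star) == f(a_star))  # True
```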
Despite their historical success, there is an inefficiency in the traditional way of repairing evaluation codes. Traditional interpolation schemes are wasteful in that they recover the whole polynomial just to repair a single evaluation. Linear exact repair schemes are an alternative way to repair an erased entry without ever recreating the full polynomial. They work by leveraging the fact that for any polynomial r(x) in the dual code of C,

r(a_1)f(a_1) + r(a_2)f(a_2) + ... + r(a_n)f(a_n) = 0.

In particular, if r(a*) ≠ 0, the erased symbol can be written in terms of the remaining symbols as f(a*) = −r(a*)^(−1) Σ_{a ≠ a*} r(a)f(a).
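And here is a matching sketch of this dual-code repair, using the standard fact that the dual of a full-length Reed-Solomon code is another Reed-Solomon code; again all concrete values are illustrative.

```python
p, k = 7, 3
f = lambda x: (3 + 5 * x + 2 * x * x) % p  # message polynomial, degree < k
r = lambda x: (1 + x) % p                  # degree < p - k, so r is in the dual code

# Dual code relation: the sum of r(a) f(a) over all of GF(7) vanishes.
assert sum(r(a) * f(a) for a in range(p)) % p == 0

# Repair the erasure at a* = 4 without ever rebuilding f:
# f(a*) = -r(a*)^(-1) * sum over a != a* of r(a) f(a)
a_star = 4
acc = sum(r(a) * f(a) for a in range(p) if a != a_star) % p
repaired = (-acc * pow(r(a_star), p - 2, p)) % p
print(repaired == f(a_star))  # True
```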
Importantly, as these schemes do not require the user to interpolate polynomials, they can be applied in settings where erasure repair was traditionally very tedious or nonexistent, such as multivariate evaluation codes. These codes deserve special attention as a well of untapped potential, which my collaborators and I have utilized with great success in several applications.
Beyond just applying to codes devoid of traditional repair schemes, linear exact repair schemes can also be competitive in settings where there are previously well-established alternatives. To start, we need to discuss how we compare different repair schemes. There are many different metrics, but in general we quantify the efficiency of an exact repair scheme with what is known as the bandwidth. The bandwidth of a scheme is the amount of information we need from the remaining symbols of a codeword to repair an erased symbol. Generally this is measured in symbols of the ambient finite field, but it is also sometimes measured in bits, in which case it is known as the bitwidth. In almost every setting it is advantageous to minimize the bandwidth, as this requires the least amount of information to be transmitted.
Naively, it would seem that linear exact repair schemes have very high bandwidth. For example, using the most general formulation of a linear exact repair scheme above, it would seem that every remaining symbol of the codeword is required to repair a single erased symbol. If there are n symbols in our codeword this would translate to a bandwidth of n - 1. This is literally the maximum bandwidth possible, which is not ideal considering we would like to minimize the bandwidth. However, with some simple additions to the general formulation, it is possible to construct linear exact repair schemes with very competitive bandwidths.
In their 2016 paper Repairing Reed-Solomon Codes, Guruswami and Wootters outlined a twist on linear exact repair schemes that makes them more efficient than traditional repair schemes in certain contexts. The core idea of their scheme revolves around a particular finite field operation known as the field trace. Given any finite field with q^t elements, the field trace is the polynomial denoted Tr(x) and defined as

Tr(x) = x + x^q + x^(q^2) + ... + x^(q^(t-1)).
For reasons beyond the scope of this discussion, the field trace winds up having the following two properties:

(1) Tr(x) lies in the subfield F_q for every x, and

(2) Tr is F_q-linear, i.e. Tr(cx + y) = c·Tr(x) + Tr(y) for every c in F_q.
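These two properties can be checked by brute force in a small field. The sketch below does so in GF(16) = GF(2^4), built with the (illustrative) modulus x^4 + x + 1.

```python
def gf16_mul(a, b):
    """Multiply in GF(16) = GF(2)[x]/(x^4 + x + 1), elements as 4-bit ints."""
    result = 0
    for _ in range(4):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0b10000:  # reduce by the modulus x^4 + x + 1
            a ^= 0b10011
    return result

def trace(x):
    """Tr(x) = x + x^2 + x^4 + x^8, computed by repeated squaring."""
    total, power = 0, x
    for _ in range(4):
        total ^= power
        power = gf16_mul(power, power)
    return total

# Property 1: every trace value lands in the subfield GF(2) = {0, 1}.
assert all(trace(x) in (0, 1) for x in range(16))

# Property 2: the trace is GF(2)-linear (here, simply additive).
assert all(trace(x ^ y) == trace(x) ^ trace(y)
           for x in range(16) for y in range(16))
print("both trace properties hold in GF(16)")
```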
These properties are what allowed Guruswami and Wootters to create a competitive linear exact repair scheme. The setting for their scheme is a Reed-Solomon code over a finite field with q^t elements, where z_1, ..., z_t is a basis for this field over F_q. Then there must exist some dual basis z_1', ..., z_t' such that for any element of the finite field, f(a*), we have

f(a*) = Tr(z_1 f(a*))·z_1' + ... + Tr(z_t f(a*))·z_t'.
Notice then, if f(a*) is an erased symbol of a codeword (f(a_1), ..., f(a_n)) in an evaluation code, we may recover f(a*) using the t subfield elements Tr(z_1 f(a*)), ..., Tr(z_t f(a*)). As the large field is a degree t extension of the subfield F_q, each subfield element contains 1/t of the information of an element in the large field. Thus each trace Tr(z_i f(a*)) only contains 1/t of the information of a large field element, so it may theoretically require less information from the remaining symbols to recover. However, there are t of these traces, so it would appear that in the aggregate it takes the same amount of information to recover these traces as it would f(a*). Interestingly, this is not the case.
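The dual basis identity above can likewise be verified by brute force. The sketch below repeats the GF(16) arithmetic from the previous snippet, finds the dual of the (illustrative) basis {1, x, x^2, x^3}, and checks the reconstruction formula for every field element.

```python
def gf16_mul(a, b):  # same GF(16) arithmetic as in the previous snippet
    result = 0
    for _ in range(4):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0b10000:
            a ^= 0b10011
    return result

def trace(x):  # Tr(x) = x + x^2 + x^4 + x^8
    total, power = 0, x
    for _ in range(4):
        total ^= power
        power = gf16_mul(power, power)
    return total

basis = [0b0001, 0b0010, 0b0100, 0b1000]  # the basis {1, x, x^2, x^3}

# Brute-force the dual basis: z_i' satisfies Tr(z_i z_j') = 1 if i == j else 0.
dual = []
for i in range(4):
    for w in range(16):
        if all(trace(gf16_mul(basis[j], w)) == (1 if i == j else 0)
               for j in range(4)):
            dual.append(w)
            break

# Check the reconstruction x = sum over i of Tr(z_i x) z_i' for every element.
for x in range(16):
    recon = 0
    for z, z_dual in zip(basis, dual):
        if trace(gf16_mul(z, x)):  # the trace coordinate is 0 or 1 here
            recon ^= z_dual
    assert recon == x
print("dual basis reconstruction works for all 16 elements")
```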
To see why, let us perform t independent general linear exact repair schemes, one to recover Tr(z_i f(a*)) for each i. This requires us to construct a polynomial r_i in the dual code such that r_i(a*) = z_i. The brilliance of the Guruswami and Wootters paper is the construction of this polynomial. In that article, they define

r_i(x) = Tr(z_i(x − a*)) / (x − a*).
Then, for each i in {1, 2, ..., t} we can proceed with a general linear exact repair scheme like normal,

z_i f(a*) = r_i(a*) f(a*) = −Σ_{a ≠ a*} r_i(a) f(a).
Taking the trace of both sides,

Tr(z_i f(a*)) = −Σ_{a ≠ a*} Tr(r_i(a) f(a)) = −Σ_{a ≠ a*} Tr(z_i(a − a*)) · Tr(f(a)/(a − a*)),
we can compute Tr(z_i f(a*)) using the above sum, where the last equality holds because Tr(z_i(a − a*)) lies in F_q and the trace is F_q-linear. Let's compare all of these sums for each i in {1, ..., t}:

Tr(z_1 f(a*)) = −Σ_{a ≠ a*} Tr(z_1(a − a*)) · Tr(f(a)/(a − a*))
Tr(z_2 f(a*)) = −Σ_{a ≠ a*} Tr(z_2(a − a*)) · Tr(f(a)/(a − a*))
...
Tr(z_t f(a*)) = −Σ_{a ≠ a*} Tr(z_t(a − a*)) · Tr(f(a)/(a − a*))
Note that the coefficients Tr(z_i(a − a*)) only require computations involving z_i, a, and a*, which need no information from the remaining symbols. The only terms whose computation requires information from the remaining symbols are the traces Tr(f(a)/(a − a*)), and fortunately these are identical in every equation. Thus for each entry a not equal to a*, we only need this one trace term to calculate all of Tr(z_1 f(a*)), ..., Tr(z_t f(a*)) with the above sums. And remember, using the dual basis elements from before, these t traces are enough to calculate the erased symbol f(a*).
As the field trace outputs subfield elements containing 1/t of the information of an element in the larger field, and the above process only requires one trace element Tr(f(a)/(a − a*)) from each of the remaining symbols, the bandwidth of the Guruswami and Wootters repair scheme is not n − 1 but actually (n − 1)/t: an improvement by a factor of t! Not only is this a massive improvement on the general linear exact repair scheme, but in certain situations it will even have a lower bandwidth than the traditional interpolation repair schemes common for Reed-Solomon codes.
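To put numbers to this, here is a small comparison counted in subfield symbols; the parameters n = 256, k = 128, t = 8 are my own illustrative choices, and whether the trace scheme actually achieves this bandwidth for given parameters depends on the conditions in the Guruswami-Wootters paper.

```python
# Repair bandwidth, counted in subfield symbols, for a length-n,
# dimension-k Reed-Solomon code over a degree-t extension field.
n, k, t = 256, 128, 8  # assumed rate-1/2 code over GF(2^8), repaired over GF(2)

naive_linear_repair = (n - 1) * t  # download n - 1 full field symbols
interpolation_repair = k * t       # download k full field symbols
trace_repair = n - 1               # one subfield symbol per remaining node

print(naive_linear_repair)   # 2040
print(interpolation_repair)  # 1024
print(trace_repair)          # 255
```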
My research has been dedicated to extending these ideas to a multivariate and multi-erasure context to create schemes ideal for the distributed storage and distributed computation setting.
A distributed storage system is a system in which a single data file is stored over multiple storage nodes. The data must be distributed to these nodes in such a way that if a single storage node fails, the information in the remaining nodes may be used to recover it. Models like these have direct applications to server farms, cloud storage systems, and really anywhere that vast amounts of important data are stored, which includes organizations such as Amazon, Google, Microsoft, and more.
Distributed storage systems are a natural application of coding theory, in particular erasure correcting codes, as exact repair schemes may be utilized in the following way. The core idea is to encode the file being stored into a codeword (f(a_1), ..., f(a_n)) and then store each symbol f(a_i) of the codeword at its own server a_i. Then, if any server a* goes down, simply use the remaining servers, i.e., the remaining symbols of the codeword, to recover the missing symbol f(a*) using any applicable exact repair scheme.
The ideal repair scheme would utilize a code of high rate (as this translates to less data required to store the file) and have a low bandwidth (as this translates to less information transmitted over the network).
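Putting the pieces together, here is a toy round trip of this storage-and-repair loop in Python, reusing the GF(7) example from above; the code, the stored file, and the failed node are all illustrative choices.

```python
p, k = 7, 3
f = lambda x: (3 + 5 * x + 2 * x * x) % p  # the encoded file, degree < k
nodes = {a: f(a) for a in range(p)}        # node a stores the symbol f(a)

failed = 4
nodes.pop(failed)  # node 4 goes down

# Exact repair from the survivors via the dual codeword r(x) = 1 + x.
r = lambda x: (1 + x) % p
acc = sum(r(a) * symbol for a, symbol in nodes.items()) % p
nodes[failed] = (-acc * pow(r(failed), p - 2, p)) % p

assert nodes[failed] == f(failed)
print("node", failed, "restored")
```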
My research on the topic has been focused on developing linear exact repair schemes of high rate capable of repairing multiple erasures simultaneously.
In my master’s thesis I developed a linear exact repair scheme for repairing an erasure in a particular type of multivariate evaluation code known as a Cartesian code. The scheme I developed required the construction of a novel multivariate polynomial which vanishes at exactly one point. Importantly, this idea can be extended to the m-variate case through a recursive construction. It has been shown that in certain situations, the schemes developed here are more efficient than schemes based on Reed-Solomon codes themselves.
In this ISIT (IEEE International Symposium on Information Theory) conference paper, I worked to develop new evaluation codes, known as augmented Reed-Muller codes, which are monomial-Cartesian codes optimized to have very high rates. It is important to note that the higher the rate of a code, the more efficiently it can store information, so this is a significant contribution. We developed repair schemes specifically for these codes based on the linear exact repair concept. In situations where the data storage capacity of nodes is limited, distributed storage systems based on augmented Reed-Muller codes 1 and 2 are the most efficient, or sometimes the only, systems capable of both respecting the storage requirements and restoring failed nodes.
In a continuation of the above paper, I worked to develop linear exact repair schemes for a class of codes known as decreasing monomial-Cartesian codes. These are the most general multivariate evaluation codes known thus far to admit a linear exact repair scheme. We also developed extended versions of augmented Reed-Muller codes known as augmented Cartesian codes, an extension in which the degrees of polynomials can be restricted in each variable independently. The extra degrees of freedom present when constructing decreasing monomial-Cartesian codes allow for the development of highly specialized codes which can be optimized for any situation.
In this same paper we also generalized our previous results further by introducing schemes capable of repairing two, and in some cases three, failed nodes simultaneously. These schemes were inspired by the 2018 paper of Dau et al. on repairing multiple failed nodes in the Reed-Solomon setting. I hope to generalize these results even further in the future, creating systems capable of restoring four or more failed nodes at once.
Secure distributed matrix multiplication (SDMM) schemes have gained popularity in the last decade, as several cutting-edge technologies from optimization problems to machine learning applications require fast and efficient matrix multiplication. The basic idea of an SDMM scheme is to compute the product of two matrices A and B in parallel, distributing the computation across several helper servers. In order to provide security, it is important that the scheme is designed such that no T helper servers can collude to obtain any information about A or B. When successful, SDMM schemes provide valuable tools to drastically reduce the time required to compute matrix products, speeding up any technology previously bottlenecked by these computations.
SDMM schemes are another natural application of coding theory when approached from the right perspective. The idea is to somehow partition the matrices A and B into submatrices A_1, ..., A_L and B_1, ..., B_L respectively, encode these partitions into codewords (A_1, ..., A_L, c_1, ..., c_N) and (B_1, ..., B_L, d_1, ..., d_N) of some erasure correcting code, and send each pair of symbols c_i and d_i, which lie outside the partitions, to helper server i. Because we specifically do not send the symbols directly containing a submatrix of A or B, a small group of helper servers will find it impossible to recover any information about the original matrices. This is what ensures the T-security of our scheme.
Further, however, since the helper servers do not possess any direct information about A and B, it is also difficult for them to compute the product AB. Here is where coding theory can be utilized. We essentially view the submatrices we purposefully did not transmit as erased information. Then, since our c_i and d_i symbols came from an erasure correcting code, we can use any recovery scheme we would like to repair the 'lost' information and eventually compute the product AB.
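To illustrate the coding-theoretic view, here is a toy MatDot-style scheme in Python over GF(101), with L = 2 partitions and N = 3 helper servers. This sketches the plain (non-secure) MatDot idea; secure MatDot codes additionally pad the encoding polynomials with random matrices to obtain T-security, which this sketch omits. All concrete values are illustrative.

```python
import random

p, N = 101, 3  # field size and number of helper servers (L = 2 partitions)

def matmul(X, Y):
    """Multiply matrices over GF(p)."""
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y))) % p
             for j in range(len(Y[0]))] for i in range(len(X))]

def matadd(X, Y, c):
    """Entry-wise X + c*Y over GF(p)."""
    return [[(x + c * y) % p for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

# Partition A into column blocks and B into row blocks, so AB = A1 B1 + A2 B2.
A = [[random.randrange(p) for _ in range(4)] for _ in range(2)]
B = [[random.randrange(p) for _ in range(2)] for _ in range(4)]
A1, A2 = [row[:2] for row in A], [row[2:] for row in A]
B1, B2 = B[:2], B[2:]

# Encode with pA(x) = A1 + A2 x and pB(x) = B2 + B1 x; the product
# pA(x) pB(x) then carries AB = A1 B1 + A2 B2 as its x^1 coefficient.
xs = list(range(1, N + 1))  # one evaluation point per helper server
answers = [matmul(matadd(A1, A2, x), matadd(B2, B1, x)) for x in xs]

# Decode: extract the x^1 coefficient of the degree-2 product polynomial
# from the three server answers, using Lagrange coefficients.
AB = [[0, 0], [0, 0]]
for i, xi in enumerate(xs):
    others = [xj for j, xj in enumerate(xs) if j != i]
    den = 1
    for xj in others:
        den = den * (xi - xj) % p
    lam = -(others[0] + others[1]) * pow(den, p - 2, p) % p
    AB = matadd(AB, answers[i], lam)

assert AB == matmul(A, B)
print("AB recovered from the 3 helper answers")
```

Note how the decoder treats the unsent symbols exactly like erasures: it only ever touches the helper answers, which play the role of the c_i and d_i above.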
My research on SDMM schemes has been devoted to adapting linear exact repair schemes to this new context. Specifically, I have utilized multivariate evaluation codes and their additional degrees of freedom to construct more tailored and efficient SDMM schemes.
In this conference paper I worked to apply our knowledge of linear exact repair schemes to the SDMM problem. It is known that locally recoverable codes can be used to great effect as the basis of an SDMM scheme. Previously, there existed codes such as MatDot codes, which can compute AB even if several helper nodes are delayed, and DFT codes, which provide T-security to ensure no T helper nodes can collude to get any information about A or B. Our contribution was the development of a new code known as a secure MatDot code, based on Reed-Solomon codes, which defends against both delayed helper servers and collusion. Interestingly, it has been shown that in many contexts secure MatDot codes are more efficient than DFT codes and more secure than MatDot codes.
H. H. Lopez, G. Matthews, D. Valvo. Secure MatDot codes: a secure, distributed matrix multiplication scheme. Information Theory Workshop, November 2022.
H. H. Lopez, G. Matthews, D. Valvo. Erasure repair for decreasing monomial-Cartesian and augmented Reed-Muller codes of high rate. IEEE Transactions on Information Theory, 2022.
H. H. Lopez, G. Matthews, D. Valvo. Augmented Reed-Muller codes of high rate and erasure repair. IEEE International Symposium on Information Theory, May 2021.