Lumen Researcher Interview Series: Professor Daniel Seng
In the second part of the Lumen Researcher interview series, we spoke with Daniel Seng, Associate Professor at the National University of Singapore (NUS) and Director of NUS’s Centre for Technology, Robots, Artificial Intelligence, and the Law.
Shreya Tewari: Can you tell us about your articles where you've used Lumen in your research?*
Daniel Seng: Sure. I actually have two other papers that talk about Lumen. They are [chapters] in books, so they're not as easily accessible. One is called “”, in which I illustrated quantitative legal analysis with reference to the Lumen database. This has just been published in the Handbook of IP Research: Lenses, Methods, and Perspectives by Oxford University Press. The other is “”, in a collection edited by Dr. Roland Vogl from Stanford University; it has just been published by Edward Elgar. There, I explained how we could do empirical data research using copyright databases. Obviously, there are very, very few such databases around.
So Lumen features prominently in a lot of my writings in
addition to the first three papers. The trilogy of papers you referred to is
actually part of a set of papers that I submitted for my doctoral thesis at
Stanford University. They are a progression from a description or overview of
the takedown notice mechanism to a survey [of notices and an analysis of the takedown
mechanism]. [The ]
cites the Chilling Effects database, which is the predecessor to Lumen.
The last one is the “” paper. It's a work in progress. I have to admit that this is
probably the most difficult paper of all because it involved [applying a lot of]
machine learning. This one talks about how the intermediaries are actually
processing copyright takedown notices and the mechanisms that make the takedown
system work. So it's related to my research in artificial intelligence systems and
their regulation, because this is a paper that researches AI systems built for
processing takedown notices.
Adam Holland: How did you begin on this line of research? What motivated you to look for the takedown notices in these instances, and what led you to your research methodology?
Daniel Seng:
I felt there was an incomplete picture of the mechanics behind takedown notices. . . Having taught information technology and law for many years, there's only so much that law can tell you about how it actually works. For instance, we talk about the mechanisms in the DMCA - the safe harbors. I wanted to find answers to a simple question: which provision was most relied upon, by whom, and for what purpose? When consulting the literature, I couldn't find any useful information to answer this very important question, which I think goes to the root of the DMCA. [The DMCA] is very often underappreciated because in my view it is actually one of the two pieces of American legislation that make the Internet what it is today. [The other is the Communications Decency Act.] It's also not well appreciated that the DMCA has actually “gone viral” because it's “applied” in a lot of countries around the world, such as, for instance, China.
So the question then is: how do the mechanisms actually work? And what do fellow academics say about this? I found very little out there. And that got me really bothered. Because here I was, writing all these monographs [about intermediary liability], but I felt that I was not getting a complete picture of the nuts and bolts behind the DMCA. Thus, for my Stanford thesis, I had the idea of doing an empirical survey of the DMCA takedown notices on Chilling Effects. At that point in time, the largest data set anyone had yet analyzed at Stanford involved about 3,000 data points for empirical analysis. The doctoral committee was quite skeptical that I could handle something as large as this, because of course, no human being can look through (then) 12,000 notices and still maintain any level of coherence in his or her analysis.
So the committee advised me to take a year off and do a proof of concept to establish that this was a viable thesis. It was quite disruptive, but the one year gave me the chance to dive deep into the mechanics of the takedown mechanism, examine the Chilling Effects database (subsequently Lumen) in far more detail, and work out how to collect and correlate notices in an organized way for my own research. And then I spent a lot of time at the Stanford School of Engineering, picking up techniques in natural language processing and machine learning to achieve this scale of analysis, essentially building a set of tools that would allow me to process the takedown notices.
By 2015, I had analyzed half a million notices -- from the 12,000 for my pilot study to 500,000. So we called it quits then and I published my thesis. But I really felt dissatisfied because there was so much left to do. So upon my return, I carried on the research. I was just going over the figures: to date, I have analyzed 9.4 million notices, 203 million complaints and 5.66 billion URLs.
It's a quest that I've started because I [am] trying to
bring a greater understanding of the details and mechanics of the takedown
process to the entire community, hoping that this will help us better
understand its strengths and its weaknesses, to figure out the roles and
responsibilities of the players in the ecosystem, what they're actually doing and
what they're doing right and what they're doing wrong. And [I hope that this], together
with the wonderful work that the Harvard team is doing with the Lumen database,
will contribute towards an informed policy decision-making process, so that
we can find this very elusive balance between regulating to protect the wonderful,
rich commercial Internet and ensuring that free speech and the exchange of ideas
remain unimpeded and free-flowing on the Internet.
Adam Holland: Could you talk a little bit more please about your research methodology?
Daniel Seng:
The methodology was challenging, I must say. But when I
started, I went to the Stanford Engineering School to pick up on machine
learning and natural language processing and statistical analysis. I knew that
these were the pieces of the puzzle I needed [to parse and analyze the takedown
notices]. But when I asked around as to whether there were any pieces of
software that could allow me to work on takedown notices, I was very quickly
rebuffed. As in, "No. We don't have anything that allows you to do all
these things that you want to do." So it was a learning experience for me because I had to quickly figure out [how
to move forward]... So I ended up spending a lot of time developing my own
tools. I call it a platform for parsing, analyzing, tagging, detecting patterns
and extracting information from the notices. I did all the coding myself. It
also made me become aware of a lot of the cutting-edge developments that happen
in natural language processing, which I tried as far as possible to use because
of the nature of my work. I threw whatever I learned into this platform that
I developed to make sense of this huge dataset of notices out there, to be able
to rationalize it, and to be able to use it.
Shreya Tewari:
What would you have done if the notices had not been available? Would there have been an alternate approach for your research? If so, would that approach have been as effective as the current one?
Daniel Seng:
That question really got me thinking. I haven't had the chance
to watch the recent Marvel movies where they talk about alternative timelines.
But if indeed I do have access to an alternate timeline and the Lumen database
were not available, what I would do is I would apply to do my doctoral studies
at Harvard and then find Adam and create the database and hopefully after creating
it, [use it to] work towards my doctoral thesis! Of course, that also means
that even now I will not have finished my doctoral thesis because I know how
much work is involved in putting together the database! So, yeah, to me, it's
as if I'm talking about a parallel universe that I cannot contemplate, which
actually tells you how critical the Lumen database and, of course, its
predecessor Chilling Effects, have been for me and my research.
Adam Holland: What other features would you like to see added to the Lumen database? How could we make it better for you and for researchers like you to do your research more effectively?
Daniel Seng:
A few things come to mind. First, it’s always useful to be able to see the statements of accuracy and good faith and the electronic signature that are part of any compliant DMCA notice. One of the areas of my interest has been whether those formalities have been observed. I also think that the geolocation details and IP address information (if available) associated with each notice would be useful.
Finally, I wonder if it is possible, on a selective basis, to
make available to some researchers the unredacted information [in notices]. [I
understand that the redaction, which may be made by the contributing
intermediary or by the Lumen team, is largely to protect the privacy and
interest of the parties.] That can actually come in handy. I can offhand tell
you that there have been several cases where I've been quite puzzled by several
notices that have their contents or URLs redacted. For instance, a redacted URL
could interfere with the analysis of the URL, especially when I want to find
out if there are [certain] characteristics associated with the [sites targeted for
takedowns]. This [access to the redacted information] would be only on a
selective case-by-case basis just to verify that the [redaction] did not affect
any material information. Maybe one way forward is to have two different types
of redaction tags: [one for redactions from the source and the other for
redactions by the Lumen team]. That would make it easy then for us [as
researchers] to determine if this is a notice that requires further review or
further inspection.
Shreya Tewari: As a final question, what are your thoughts, as a scholar in copyright intermediary liability, about the importance of transparency through notice sharing? How does that reflect on policymaking and research?
Daniel Seng:
[It is] most certainly [important]. I always cite the need for transparency, as it goes
hand in hand with accountability. This is [represented by] the availability of [complete
information] about a takedown notice. I think in this regard Lumen's
achievements are just incredible because you've actually set the path forward
for much of this debate that is taking place right now amongst the engineering
community regarding what pertains to explainable AI and the need for
transparency and accountability. What better way to promote accountability than
to be transparent about it? [In this case] this is to, within the limits of
considerations like information and privacy and data protection, explain the
mechanics behind what [the content companies and intermediaries] are actually
doing. I'm actually surprised that you yourselves have been able to get so much
cooperation from so many of these Internet intermediaries to make this a
sustainable project. [The Lumen project] would actually be the hallmark of what
every intermediary should do if they want to encourage accountability. The
world will have to thank Lumen for this wonderful job in shedding light upon this
very [special] area of copyright protection. Because it's not often
appreciated that copyright is a formality-free system. The Lumen system is its
proxy because if you think about it, trademark systems and patent systems
require registration, but copyright does not. So what's the alternative or substitute
for that? The Lumen database. So thank you very much for illuminating the way
forward for the Internet community!
-----
* “” (2014), “” (2015, published 2021), “” (2015)