Lumen Researcher Interview Series: Professor Daniel Seng
In the second part of the Lumen Researcher interview series, we spoke with Daniel Seng, Associate Professor at the National University of Singapore (NUS) and Director of NUS’s Centre for Technology, Robots, Artificial Intelligence, and the Law.
Shreya Tewari: Can you tell us about your articles where you've used Lumen in your research?*
Daniel Seng: Sure. I actually have two other papers that talk about Lumen. They are [chapters] in books, so they're not as easily accessible. One is called “”, in which I illustrated quantitative legal analysis with reference to the Lumen database. This has just been published in the Handbook of IP Research: Lenses, Methods, and Perspectives by Oxford University Press. The other is “”, in a collection edited by Dr. Roland Vogl from Stanford University; it has just been published by Edward Elgar. There, I explained how we could do empirical data research using copyright databases. Obviously, there are very, very few such databases around.
So Lumen features prominently in a lot of my writings in
addition to the first three papers. The trilogy of papers you referred to is
actually part of a set of papers that I submitted for my doctoral thesis at
Stanford University. They are a progression from a description or overview of
the takedown notice mechanism to a survey [of notices and an analysis of the takedown
mechanism]. [The ]
cites the Chilling Effects database, which is the predecessor to Lumen.
The last one is the “” paper. It's a work in progress. I have to admit that this is
probably the most difficult paper of all because it involved [applying a lot of]
machine learning. This one talks about how the intermediaries are actually
processing copyright takedown notices and the mechanisms that make the takedown
system work. So it's related to my research in artificial intelligence systems and
their regulation, because this is a paper that researches AI systems built for
processing takedown notices.
Adam Holland: How did you begin on this line of research? What motivated you to look for the takedown notices in these instances, and what led you to your research methodology?
Daniel Seng:
I felt there was an incomplete picture of the mechanics behind takedown notices. . . Having taught information technology and law for many years, there's only so much that law can tell you about how it actually works. For instance, we talk about the mechanisms in the DMCA - the safe harbors. I wanted to find answers to a simple question: which provision was most relied upon, by whom, and for what purpose? When consulting the literature, I couldn't find any useful information to answer this very important question, which I think goes to the root of the DMCA. [The DMCA] is very often underappreciated because in my view it is actually one of the two pieces of American legislation that make the Internet what it is today. [The other is the Communications Decency Act.] It's also not well appreciated that the DMCA has actually “gone viral” because it's “applied” in a lot of countries around the world, such as, for instance, China.
So the question then is: how do the mechanisms actually work? And what do fellow academics say about this? I found very little out there. And that got me really bothered. Because here I was, writing all these monographs [about intermediary liability], but I felt that I was not getting a complete picture of the nuts and bolts behind the DMCA. Thus, for my Stanford thesis, I had the idea of doing an empirical survey of the DMCA takedown notices on Chilling Effects. At that point in time, the largest data set anyone had yet analyzed at Stanford involved about 3,000 data points for empirical analysis. The doctoral committee was quite skeptical that I could handle something as large as this, because of course, no human being can look through (then) 12,000 notices and still maintain any level of coherence in his or her analysis.
So the committee advised me to take a year off and do a proof of concept to establish that this was a viable thesis. It was quite disruptive, but the one year gave me the chance to dive deep into the mechanics of the takedown mechanism, examine the Chilling Effects database (subsequently Lumen) in far more detail, and work out how to collect and correlate notices in an organized way for my own research. And then I spent a lot of time at the Stanford School of Engineering, picking up techniques in natural language processing and machine learning to achieve this scale of analysis, essentially building a set of tools that would allow me to process the takedown notices.
By 2015, I had analyzed half a million notices -- from the 12,000 for my pilot study to 500,000. So we called it quits then and I published my thesis. But I really felt dissatisfied because there was so much left to do. So upon my return, I carried on the research. I was just going over the figures: to date, I have analyzed 9.4 million notices, 203 million complaints and 5.66 billion URLs.
It's a quest that I've started because I [am] trying to
bring a greater understanding of the details and mechanics of the takedown
process to the entire community, hoping that this will help us better
understand its strengths and its weaknesses, to figure out the roles and
responsibilities of the players in the ecosystem, what they're actually doing and
what they're doing right and what they're doing wrong. And [I hope that this], together
with the wonderful work that the Harvard team is doing with the Lumen database,
will contribute towards an informed policy decision-making process, so that
we can find this very elusive balance between regulating to protect the wonderful,
rich commercial Internet and ensuring that free speech and the exchange of ideas
remain unimpeded and free-flowing on the Internet.
Adam Holland: Could you talk a little bit more please about your research methodology?
Daniel Seng:
The methodology was challenging, I must say. But when I
started, I went to the Stanford Engineering School to pick up on machine
learning and natural language processing and statistical analysis. I knew that
these were the pieces of the puzzle I needed [to parse and analyze the takedown
notices]. But when I asked around as to whether there were any pieces of
software that could allow me to work on takedown notices, I was very quickly
rebuffed. As in, "No. We don't have anything that allows you to do all
these things that you want to do." So it was a learning experience for me because I had to quickly figure out [how
to move forward]... So I ended up spending a lot of time developing my own
tools. I call it a platform for parsing, analyzing, tagging, detecting patterns
and extracting information from the notices. I did all the coding myself. It
also made me become aware of a lot of the cutting-edge developments that happen
in natural language processing, which I tried as far as possible to use because
of the nature of my work. I threw whatever I learned into this platform that
I developed to make sense of this huge dataset of notices out there, to be able
to rationalize it, and to be able to use it.
Shreya Tewari:
What would you have done if the notices had not been available? Would there have been an alternate approach for your research? If so, would that approach have been as effective as the current one?
Daniel Seng:
That question really got me thinking. I haven't had the chance
to watch the recent Marvel movies where they talk about alternative timelines.
But if indeed I do have access to an alternate timeline and the Lumen database
were not available, what I would do is I would apply to do my doctoral studies
at Harvard and then find Adam and create the database and hopefully after creating
it, [use it to] work towards my doctoral thesis! Of course, that also means
that even now I will not have finished my doctoral thesis because I know how
much work is involved in putting together the database! So, yeah, to me, it's
as if I'm talking about a parallel universe that I cannot contemplate, which
actually tells you how critical the Lumen database and, of course, its
predecessor Chilling Effects, have been for me and my research.
Adam Holland: What other features would you like to see added to the Lumen database? How could we make it better for you and for researchers like you to do your research more effectively?
Daniel Seng:
A few things come to mind. First, it’s always useful to be able to see the statements of accuracy and good faith and the electronic signature that are part of any compliant DMCA notice. One of the areas of my interest has been whether those formalities have been observed. I also think that the geolocation details and IP address information (if available) associated with each notice would be useful.
Finally, I wonder if it is possible, on a selective basis, to
make available to some researchers the unredacted information [in notices]. [I
understand that the redaction, which may be made by the contributing
intermediary or by the Lumen team, is largely to protect the privacy and
interest of the parties.] That can actually come in handy. I can offhand tell
you that there have been several cases where I've been quite puzzled by several
notices that have their contents or URLs redacted. For instance, a redacted URL
could interfere with the analysis of the URL, especially when I want to find
out if there are [certain] characteristics associated with the [sites targeted for
takedowns]. This [access to the redacted information] would be only on a
selective case-by-case basis just to verify that the [redaction] did not affect
any material information. Maybe one way forward is to have two different types
of redaction tags: [one for redactions from the source and the other for
redactions by the Lumen team]. That would make it easy then for us [as
researchers] to determine if this is a notice that requires further review or
further inspection.
Shreya Tewari: As a final question, what are your thoughts, as a scholar in copyright intermediary liability, about the importance of transparency through notice sharing? How does that reflect on policymaking and research?
Daniel Seng:
[It is] most certainly [important]. I always cite the need for transparency, as it goes
hand in hand with accountability. This is [represented by] the availability of [complete
information] about a takedown notice. I think in this regard Lumen's
achievements are just incredible because you've actually set the path forward
for much of this debate that is taking place right now amongst the engineering
community regarding what pertains to explainable AI and the need for
transparency and accountability. What better way to promote accountability than
to be transparent about it? [In this case] this is to, within the limits of
considerations like information and privacy and data protection, explain the
mechanics behind what [the content companies and intermediaries] are actually
doing. I'm actually surprised that you yourselves have been able to get so much
cooperation from so many of these Internet intermediaries to make this a
sustainable project. [The Lumen project] would actually be the hallmark of what
every intermediary should do if they want to encourage accountability. The
world will have to thank Lumen for this wonderful job in shedding light upon this
very [special] area of copyright protection. Because it's not often
appreciated that copyright is a formality-free system. The Lumen system is its
proxy because if you think about it, trademark systems and patent systems
require registration, but copyright does not. So what's the alternative or substitute
for that? The Lumen database. So thank you very much for illuminating the way
forward for the Internet community!
-----
* “” (2014), “” (2015, published 2021), “” (2015)