Cloudflare Workers: Key Technical Points

2023-05-30 08:53:26


See also: a collection of Cloudflare-related articles worth reading.
The following is excerpted from: https://blog.cloudflare.com/cloud-computing-without-containers/
A basic Node Lambda running no real code consumes 35 MB of memory. When you can share the runtime between all of
the Isolates as we do, that drops to around 3 MB.
Isolates are lightweight contexts that group variables with the code allowed to mutate them. Most importantly, a
single process can run hundreds or thousands of Isolates, seamlessly switching between them. They make it
possible to run untrusted code from many different customers within a single operating system process. They’re
designed to start very quickly (several had to start in your web browser just for you to load this web page), and
to not allow one Isolate to access the memory of another.
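To make the isolate model concrete, here is a minimal embedding sketch against the public V8 C++ API (standard hello-world setup; illustrative only, not Cloudflare's actual runtime code). Two tenants' scripts run in separate isolates, with separate heaps, inside one OS process:

```cpp
#include <libplatform/libplatform.h>
#include <v8.h>

#include <memory>

// Run one tenant's script in its own isolate. Each isolate has its own
// heap; code in one isolate cannot reach another isolate's memory.
void RunTenant(const char* source) {
  v8::Isolate::CreateParams params;
  params.array_buffer_allocator =
      v8::ArrayBuffer::Allocator::NewDefaultAllocator();
  v8::Isolate* isolate = v8::Isolate::New(params);
  {
    v8::Isolate::Scope isolate_scope(isolate);
    v8::HandleScope handle_scope(isolate);
    v8::Local<v8::Context> context = v8::Context::New(isolate);
    v8::Context::Scope context_scope(context);
    v8::Local<v8::String> src =
        v8::String::NewFromUtf8(isolate, source).ToLocalChecked();
    v8::Local<v8::Script> script =
        v8::Script::Compile(context, src).ToLocalChecked();
    script->Run(context).ToLocalChecked();
  }
  isolate->Dispose();
  delete params.array_buffer_allocator;
}

int main(int argc, char* argv[]) {
  v8::V8::InitializeICUDefaultLocation(argv[0]);
  v8::V8::InitializeExternalStartupData(argv[0]);
  std::unique_ptr<v8::Platform> platform = v8::platform::NewDefaultPlatform();
  v8::V8::InitializePlatform(platform.get());
  v8::V8::Initialize();
  RunTenant("let a = 1 + 1;");        // tenant #1
  RunTenant("let b = 'untrusted';");  // tenant #2, same process
  v8::V8::Dispose();
  return 0;
}
```

Switching between tenants is just entering a different isolate's scope; no process boundary is crossed, which is where the startup-time and memory savings come from.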
The following is excerpted from: https://blog.cloudflare.com/jamstack-podcast-with-kenton-varda/
But the one that has received by far the most scrutiny, and the most real-world battle testing over the years,
would be the V8 JavaScript engine from Google Chrome. We took that and embedded it in a new server environment
written in C++ from scratch.
The following is excerpted (and translated back from the Chinese) from: https://blog.cloudflare.com/mitigating-spectre-and-other-security-threats-the-cloudflare-workers-security-model/
First, we need to create an execution environment in which code cannot access anything it is not supposed to.
Our primary tool for this is V8, the JavaScript engine developed by Google for use in Chrome. V8 executes code inside "isolates", which prevent that code from accessing memory outside the isolate, even within the same process. Importantly, this means we can run many isolates within a single process. This is essential for an edge compute platform like Workers, where we must host thousands of guest applications on every machine and switch between thousands of guests per second with minimal overhead. If we had to run a separate process for every guest, the number of tenants we could support would be drastically reduced, and we would have to limit edge compute to a small number of large enterprise customers willing to pay a lot of money. With isolate technology, we can make edge compute available to everyone.
That said, we do sometimes decide to schedule a worker in its own private process. We do this when it uses certain features that we feel need an extra layer of isolation. For example, when a developer inspects their worker with the devtools debugger, we run that worker in a separate process. This is because, historically in the browser, the inspector protocol was only usable by the browser's trusted operator, and so it has not received as much security scrutiny as the rest of V8. To cover the increased risk of bugs in the inspector protocol, we move inspected workers into a separate process with a process-level sandbox. We also use process isolation as an extra defense against Spectre, described later in this post.
Additionally, even for isolates that run in a shared process with other isolates, we run multiple instances of the whole runtime on each machine, which we call "cordons". Workers are assigned to cordons by giving each worker a trust level and separating low-trust workers from those we trust more highly. As one example of this in operation: a customer who signs up for our free plan will not be scheduled in the same process as an Enterprise customer. This provides defense in depth in the event that a zero-day security vulnerability is found in V8. But I will say more about V8 bugs, and how we address them, later in this post.
At the whole-process level, we apply another layer of sandboxing for defense in depth. The "layer 2" sandbox uses Linux namespaces and seccomp to prohibit all access to the filesystem and network. Namespaces and seccomp are commonly used to implement containers, but our use of these technologies is much stricter than what is typical in container engines, because we configure namespaces and seccomp after the process starts (but before any isolates are loaded). This means, for example, that we can (and do) use a completely empty filesystem (mount namespace) and use seccomp to block absolutely all filesystem-related system calls. Container engines usually cannot prohibit all filesystem access, because doing so would make it impossible to exec() the guest program from disk; in our case, our guest programs are not native binaries, and the Workers runtime itself has already finished loading before filesystem access is blocked.
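A rough sketch of that ordering, assuming unshare(2) and libseccomp (the syscall list here is illustrative; the real filter is far stricter): the runtime finishes loading first, then the process gives up the filesystem and network before any guest code exists.

```cpp
#include <sched.h>       // unshare(2)
#include <seccomp.h>     // libseccomp; link with -lseccomp
#include <sys/socket.h>  // AF_UNIX
#include <cerrno>

// Called after the runtime has finished loading, but before any guest
// isolate is created. From here on there is no usable filesystem.
void LockDownSandbox() {
  // Private, empty mount namespace; no network namespace access.
  // (CLONE_NEWUSER lets an unprivileged process create the others.)
  unshare(CLONE_NEWUSER | CLONE_NEWNS | CLONE_NEWNET);

  // Default-allow, then refuse filesystem syscalls outright. A real
  // filter would enumerate many more calls (or default-deny).
  scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);
  seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(open), 0);
  seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(openat), 0);
  seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(creat), 0);
  seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(mkdir), 0);
  seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(unlink), 0);
  // Allow new sockets only in the Unix domain (for local IPC).
  seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(socket), 1,
                   SCMP_A0(SCMP_CMP_NE, AF_UNIX));
  seccomp_load(ctx);
  seccomp_release(ctx);
}
```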
The layer 2 sandbox also completely prohibits network access. Instead, the process is limited to communicating over local Unix domain sockets with other processes on the same system. Any communication with the outside world must be mediated by some other local process outside the sandbox.
One such process in particular, which we call the "supervisor", is responsible for fetching worker code and configuration from disk or from other internal services. The supervisor ensures that the sandbox process cannot read any configuration except that which is relevant to the workers it is supposed to run.
For example, when the sandbox process receives a request for a worker it has not seen before, the request includes the encryption key for that worker's code (including attached secrets). The sandbox can then pass that key to the supervisor to request the code. The sandbox cannot request any worker for which it has not received the appropriate key. It cannot enumerate known workers. Nor can it request configuration it does not need; for example, it cannot request the TLS keys used for the worker's HTTPS traffic.
Aside from reading configuration, the other reason the sandbox talks to other processes on the system is to implement the APIs exposed to Workers. Which brings us to API design.
In a sandbox environment, API design takes on a new level of responsibility. Our APIs define exactly what a worker can and cannot do. We must design each API very carefully so that it can express only the operations we want to allow, and nothing more. For example, we want to allow Workers to make and receive HTTP requests, but we do not want them to be able to access the local filesystem or internal network services.
...
How would such an API be implemented? As described above, the sandbox process has no access to the real filesystem, and we would like to keep it that way. Instead, file access would be mediated by the supervisor process. The sandbox talks to the supervisor using Cap'n Proto RPC, a capability-based RPC protocol. (Cap'n Proto is an open source project currently maintained by the Cloudflare Workers team.) This protocol makes it very easy to implement capability-based APIs, so that we can strictly limit the sandbox to accessing only the files that belong to the Workers it is running.
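The value of a capability-based protocol is that the sandbox can only act through object references it has explicitly been handed; there is no ambient "open any path" authority. A hypothetical C++ rendering of that shape (plain interfaces standing in for Cap'n Proto RPC stubs; all of these names are invented for illustration, not taken from the real protocol):

```cpp
#include <memory>
#include <string>
#include <vector>

// A capability to one directory. Holding this object is the ONLY way
// to touch the files under it; there is no global open(path).
struct Directory {
  virtual std::vector<std::string> List() = 0;
  virtual std::string ReadFile(const std::string& name) = 0;
  virtual ~Directory() = default;
};

// What the supervisor hands the sandbox for one worker: exactly the
// capabilities that worker needs, and nothing else.
struct WorkerHandle {
  virtual std::shared_ptr<Directory> Files() = 0;  // this worker's files only
  virtual ~WorkerHandle() = default;
};

// The sandbox's entire view of the supervisor. It must present a
// worker's key to get anything; it cannot enumerate other workers.
struct Supervisor {
  virtual std::shared_ptr<WorkerHandle> GetWorker(
      const std::string& code_key) = 0;
  virtual ~Supervisor() = default;
};
```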
Now, what about network access? Today, Workers are allowed to talk to the rest of the world only via HTTP, both inbound and outbound. There is no API for other forms of network access, so they are prohibited (though we plan to support other protocols in the future).
As mentioned before, the sandbox process cannot connect to the network directly. Instead, all outbound HTTP requests are sent over a Unix domain socket to a local proxy service. That service enforces restrictions on each request: for example, it verifies that the request is addressed either to a public Internet service or to the Worker's own zone's origin server, and not to internal services that might be visible on the local machine or network. It also adds headers to every request identifying the worker it originates from, so that abusive requests can be traced and blocked. Once everything checks out, the request is forwarded to our HTTP caching layer and then out to the Internet.
Similarly, inbound HTTP requests do not go directly to the Workers Runtime. They are first received by an inbound proxy service, which handles TLS termination (the Workers Runtime never sees TLS keys) and identifies the correct Worker script to run for a given request URL. Once everything checks out, the request is passed over a Unix domain socket to the sandbox process.
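Both proxy hops ride on ordinary Unix domain sockets. A minimal sketch of the sandbox side of the outbound hop (the socket path is hypothetical; error handling trimmed):

```cpp
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>
#include <cstring>
#include <string>

// Connect to the local outbound proxy. The sandbox has no other
// network access, so every outbound HTTP request funnels through here.
int ConnectToProxy(const std::string& path /* e.g. "/run/outbound.sock" */) {
  int fd = socket(AF_UNIX, SOCK_STREAM, 0);
  if (fd < 0) return -1;
  sockaddr_un addr{};
  addr.sun_family = AF_UNIX;
  std::strncpy(addr.sun_path, path.c_str(), sizeof(addr.sun_path) - 1);
  if (connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0) {
    close(fd);
    return -1;
  }
  return fd;  // serialized HTTP request bytes are written to this fd
}
```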
The following is excerpted from: https://www.infoq.com/presentations/cloudflare-v8/
On resource control and security for Workers:
Linux's timer-creation system call (timer_create) is used to cap each isolate at 50 milliseconds of CPU time per request.
For CPU time, we actually limit each isolate to 50 milliseconds of CPU execution per request. The way we do that
is the Linux timer create system call lets you set up to receive a signal when a certain amount of CPU time has
gone by. Then from that signal handler, we can call a V8 function, called terminate execution, which will
actually cancel execution wherever it is.
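A sketch of the mechanism he describes, assuming timer_create(2) on the thread's CPU clock plus v8::Isolate::TerminateExecution(), which V8 documents as callable from any thread. The signal plumbing is simplified; production code would aim the signal at the exact thread (e.g. with SIGEV_THREAD_ID) and track the active isolate per thread.

```cpp
#include <csignal>
#include <ctime>  // timer_create; link with -lrt on older glibc

#include <v8.h>

// One-thread sketch; a real runtime tracks the active isolate per thread.
static v8::Isolate* g_active_isolate = nullptr;

// Fires when the thread has burned its CPU budget. TerminateExecution()
// cancels the running JavaScript wherever it happens to be.
static void OnCpuLimit(int) {
  if (g_active_isolate) g_active_isolate->TerminateExecution();
}

// Give the current request 50 ms of CPU time, then run the JS handler.
void RunWithCpuLimit(v8::Isolate* isolate, void (*run_js)(v8::Isolate*)) {
  struct sigaction sa = {};
  sa.sa_handler = OnCpuLimit;
  sigaction(SIGRTMIN, &sa, nullptr);

  sigevent sev = {};
  sev.sigev_notify = SIGEV_SIGNAL;  // deliver a signal on expiry
  sev.sigev_signo = SIGRTMIN;
  timer_t timer;
  // CLOCK_THREAD_CPUTIME_ID counts CPU time, not wall time: a worker
  // parked on I/O spends none of its budget.
  timer_create(CLOCK_THREAD_CPUTIME_ID, &sev, &timer);

  itimerspec budget = {};
  budget.it_value.tv_nsec = 50 * 1000 * 1000;  // 50 ms
  timer_settime(timer, 0, &budget, nullptr);

  g_active_isolate = isolate;
  run_js(isolate);  // interrupted mid-flight if it exceeds the budget
  g_active_isolate = nullptr;
  timer_delete(timer);
}
```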
Each V8 thread runs only one isolate at a time; concurrent requests are handled across multiple threads.
An isolate in JavaScript is a single-threaded thing. JavaScript is inherently a single threaded event driven
language. So an isolate is only running on one thread at a time, other isolates can be on other threads. We don't
technically have to, but in our design, we never run more than one isolate on a thread at a time. We could have
multiple isolates assigned to one thread and handle the events as they come in. But what we don't want is for one
isolate to be able to block another with a long computation and create latency for someone else, so we put them
each on different threads.
Memory use of the code inside an isolate is policed by monitoring; an isolate that goes over its limit is killed. It is also mentioned that a new request may start a new isolate, though it is unclear whether an existing thread is reused.
Instead, we end up having to do more of a monitoring approach. After each time we call into JavaScript when it
returns, we check how much heap space it is now using. If it's gone a little bit over its limit, then we'll do a
soft eviction where it can continue handling in-flight requests. But for any new requests, we can just start up
another isolate. If it goes way over then we'll just kill it and cancel all the requests. This works in
conjunction with the CPU time limit because generally, you can't allocate a whole lot of data without spending
some CPU time on that, at least not JavaScript objects. TypedArrays are something different, but it's a long
story.
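That check is easy to picture with V8's public heap API (v8::HeapStatistics). The limits below are invented for illustration, not Cloudflare's actual numbers:

```cpp
#include <cstddef>

#include <v8.h>

enum class MemoryVerdict { kOk, kSoftEvict, kKill };

// Called each time a call into JavaScript returns.
MemoryVerdict CheckMemoryAfterJsCall(v8::Isolate* isolate) {
  constexpr std::size_t kSoftLimit = 128u * 1024 * 1024;  // illustrative
  constexpr std::size_t kHardLimit = 256u * 1024 * 1024;  // illustrative

  v8::HeapStatistics stats;
  isolate->GetHeapStatistics(&stats);
  const std::size_t used = stats.used_heap_size();

  if (used > kHardLimit) return MemoryVerdict::kKill;       // cancel everything
  if (used > kSoftLimit) return MemoryVerdict::kSoftEvict;  // finish in-flight
                                                            // work, no new
                                                            // requests
  return MemoryVerdict::kOk;
}
```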
After serverless code is submitted, it is published to the edge servers within about 3 seconds, which keeps startup fast when requests arrive.
Another problem is we need to get our code, or the user's code, to all the machines that run that code. It sure
would be sad if we had achieved our 5 millisecond startup time only to spend 200 milliseconds waiting for some
storage server to return the code to us before we could even execute it. So what we're doing right now is
actually we distribute the code to all of the machines in our fleet up front. We already had technology for this
to distribute configuration changes to the edge, and we just said code is another kind of configuration, and
threw it in there and it works. It takes about three seconds between when you upload your code and when it's on
every machine in our fleet.
To control V8 bugs and security risk, they watch the V8 repository for updates, automatically sync the code, and automatically build and release new versions to production.
We can see when the commit lands in the V8 repository, which happens before the Chrome update, and automate our
build system so that we can get that out into production within hours automatically. We don't even need someone to click.
Programs are not allowed to use dynamic-execution features like eval, and uploads are watched for 0-day attack code; anything found is examined and submitted to Google. It is also mentioned that timer functionality and concurrency features are not supported.
There are some things, some risk management things we can do on the server, that we cannot do so easily on the
browser. One of them is we store every single piece of code that executes on our platform, because we do not
allow you to call eval to evaluate code at runtime. You have to upload your code to us and then we distribute it.
What that means is that if anyone tries to upload an attack, we now have a record of that attack. If it's a
zero-day that they have attacked, they have now burned their zero day, when we take a look at that code. We'll
submit it to Google, and then the person who uploaded won't get their $15,000.
They also watch for segfaults across all of their servers and alert on them, and they examine the resulting crash reports.
Each incoming HTTP connection gets a V8 thread that runs an isolate; these requests come from an nginx on the same machine, each thread runs only one isolate at a time, and concurrency comes from multiple threads. (One open question: an audience member asked whether spare idle threads are kept, and what the minimum and maximum counts are; this did not seem to get a direct answer.)
As I said earlier, we start up a thread or we have different isolates running on different threads. We actually
start a thread for each incoming HTTP connection, which are connections incoming from an nginx server on the
same machine. This is kind of a neat trick because nginx will only send one HTTP request on that connection at
a time. So this is how we know that we only have one isolate executing at a time. But we can potentially have as
many threads as are needed to handle the concurrent requests. The workers will usually be spending most of their
time waiting for some back end, so not actually executing that whole time. 
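The arrangement he describes reduces to a listener on a local Unix socket (standing in here for the nginx hand-off; the path and the HandleRequests body are placeholders) spawning one detached thread per accepted connection:

```cpp
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>
#include <cstring>
#include <thread>

// Placeholder: parse the one in-flight request, run the right isolate,
// write the response, repeat until the connection closes.
void HandleRequests(int conn_fd) { close(conn_fd); }

// One thread per connection from the local nginx. nginx sends at most
// one request at a time per connection, so each thread runs at most one
// isolate at any moment; concurrency = number of live connections.
void ServeForever(const char* path /* e.g. "/run/workers.sock" */) {
  int listener = socket(AF_UNIX, SOCK_STREAM, 0);
  sockaddr_un addr{};
  addr.sun_family = AF_UNIX;
  std::strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
  bind(listener, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
  listen(listener, SOMAXCONN);
  for (;;) {
    int conn = accept(listener, nullptr, nullptr);
    if (conn >= 0) std::thread(HandleRequests, conn).detach();
  }
}
```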
Each edge location must keep enough CPU available, since a CPU can serve only one HTTP request at a time; when a location runs short of CPU, users are shifted to other locations that have spare capacity.
We need to make sure that we have plenty of CPU capacity in all of our locations. When we don't, when one
location gets a little overloaded, what we do is we actually shift. Usually the free users we just shift them to
other locations- the free users of our general service, there isn't a free tier of workers yet.
Someone questioned Cloudflare's practice of monitoring and viewing customer code; their answer was that this is done only to diagnose bugs and respond to incidents. They also mentioned they have built an app store where customers can publish their own serverless applications, so code in the marketplace likewise gets seen by more customers.
We look at code only for debugging and incident response purposes. We don't dig through code to see what people
are doing for fun. That's not what we want to do. We have something called the Cloudflare app store, which
actually lets you publish a worker for other people to install on their own sites. Being able to do it with
Workers is in beta right now. So this will be something that will ramp up soon. But then you sell that to other
users, and we'd much rather have people selling their neat features that they built on Cloudflare to each other
in this marketplace, than have us just build it ourselves. 
They have since launched serverless KV storage. A detail worth noting in this passage: the KV store at the edge is performance-tuned for read-heavy, write-light workloads.
One of the first ones that's already in beta is called Workers KV. It's fairly simple right now, it's KV Store
but it's optimized for read-heavy workloads, not really for lots of writes from the edge. But there are things
we're working on that I'm very excited about but not ready to talk about yet, that will allow whole databases to
be built on the edge.
The current serverless billing model: deploying to the edge costs $5 per month, plus $0.50 per million requests, with the first 10 million requests free. Cheaper than AWS Lambda.
Varda: Great question. If you go to cloudflareworkers.com, you can actually play around with it just in your web
browser. You write some code and it immediately runs, and it shows you what the result would be. That's free.
Then when you want to actually deploy it on your site, the cost is $5 per month minimum, and then it's 50 cents
per million requests.
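Taking those figures at face value: a worker serving 100 million requests in a month would cost $5 + 90 × $0.50 = $50, since the first 10 million requests are included in the base fee.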
We can do a lot of monitoring. For example, we can watch for segfaults anywhere on any of our servers. They are
rare, and when they happen, we raise an alert, we look at it. And we see in the crash report, it says what script
was running. So we're going to immediately look at that script, which we have available.
