The good news about AI Joanna: She never loses her voice, she has outstanding posture and not even a convertible driving 120 mph through a tornado could mess up her hair.

The bad news: She can fool my family and trick my bank.

Maybe you’ve played around with chatbots like OpenAI’s ChatGPT and Google’s Bard, or image generators like Dall-E. If you thought they blurred the line between AI and human intelligence, you ain’t seen—or heard—nothing yet.

Over the past few months, I’ve been testing Synthesia, a tool that creates artificially intelligent avatars from recorded video and audio (aka deepfakes). Type in anything and your video avatar parrots it back.

Since I do a lot of voice and video work, I thought this could make me more productive, and take away some of the drudgery. That’s the AI promise, after all. So I went to a studio and recorded about 30 minutes of video and nearly two hours of audio that Synthesia would use to train my clone. A few weeks later, AI Joanna was ready.

Then I attempted the ultimate day off, Ferris Bueller style. Could AI me—paired with ChatGPT-generated text—replace actual me in videos, meetings and phone calls? It was…eye-opening or, dare I say, AI-opening. (Let’s just blame AI Joanna for my worst jokes.)

Eventually AI Joanna might write columns and host my videos. For now, she’s at her best illustrating the double-edged sword of generative-AI voice and video tools.

My video avatar looks like an avatar.
Video is a lot of work. Hair, makeup, wardrobe, cameras, lighting, microphones. Synthesia promises to eradicate that work, and that’s why corporations already use it. You know those boring compliance training videos? Why pay actors to star in a live-action version when AI can do it all? Synthesia charges $1,000 a year to create and maintain a custom avatar, plus an additional monthly subscription fee. It offers stock avatars for a lower monthly cost.

I asked ChatGPT to generate a TikTok script about an iOS tip, written in the voice of Joanna Stern. I pasted it into Synthesia, clicked “generate” and suddenly “I” was talking. It was like looking at my reflection in a mirror, albeit one that removes hand gestures and facial expressions. For quick sentences, the avatar can be quite convincing. The longer the text, the more her bot nature comes through. See for yourself in my video.

On TikTok, where people have the attention span of goldfish, those computer-like attributes are less noticeable. Still, some quickly picked up on it. For the record, I would rather eat live eels than utter the phrase “TikTok fam,” but AI me had no problem with it.

The bot-ness got very obvious on work video calls. I downloaded clips of her saying common meeting remarks (“Hey everyone!” “Sorry, I was muted.”) then used software to pump them into Google Meet. Apparently AI Joanna’s perfect posture and lack of wit were dead giveaways.

All this will get better, though. Synthesia has some avatars in beta that can nod up and down, raise their eyebrows and more.

My AI voice sounds a lot like me.
When my sister’s fish died, could I have called with condolences? Yes. On a phone interview with Snap CEO Evan Spiegel, could I have asked every question myself? Sure. But in both cases, my AI voice was a convincing stand-in. At first.

I didn’t use Synthesia’s voice clone for those calls. Instead, I used one generated by ElevenLabs, an AI speech-software developer. My producer Kenny Wassus gathered about 90 minutes of my voice from previous videos and we uploaded the files to the tool—no studio visit needed. In under two minutes, it cloned my voice. In ElevenLabs’s web-based tool, type in any text, click Generate, and within seconds “my” voice says it aloud. Creating a voice clone with ElevenLabs starts at $5 a month.

Compared with Synthesia Joanna, the ElevenLabs me sounds more humanlike, with better intonations and flow.

My sister, whom I call several times a week, said the bot sounded just like me, but noticed the bot didn’t pause to take breaths. When I called my dad and asked for his Social Security number, he only knew something was up because it sounded like a recording of me.

The potential for misuse is real.
The ElevenLabs voice was so good it fooled my Chase credit card’s voice biometric system.

I cued AI Joanna up with several things I knew Chase would ask, then dialed customer service. At the biometric step, when the automated system asked for my name and address, AI Joanna responded. Hearing my bot’s voice, the system recognized it as me and immediately connected to a representative. When our video intern called and did his best Joanna impression, the automated system asked for further verification.

A Chase spokeswoman said the bank uses voice biometrics, along with other tools, to verify callers are who they say they are. She added that the feature is meant for customers to quickly and securely identify themselves, but to complete transactions and other financial requests, customers must provide additional information.

What’s most worrying: ElevenLabs made a very good clone without much friction. All I had to do was click a button saying I had the “necessary rights or consents” to upload audio files and create the clone, and that I wouldn’t use it for fraudulent purposes.

That means anyone on the internet could take hours of my voice—or yours, or Joe Biden’s or Tom Brady’s—to save and use. The Federal Trade Commission is already warning about AI-voice related scams.

Synthesia requires that the audio and video include verbal consent, which I did when I filmed and recorded with the company.

ElevenLabs only allows cloning in paid accounts, so any use of a cloned voice that breaks company policies can be traced to an account holder, company co-founder Mati Staniszewski told me. The company is working on an authentication tool so people can upload any audio to check if it was created using ElevenLabs technology.

Both systems allowed me to generate some horrible things in my voice, including death threats.

A Synthesia spokesman said my account was designated for use with a news organization, which means it can say words and phrases that might otherwise be filtered. The company said its moderators flagged and deleted my problematic phrases later on. When my account was changed to the standard type, I was no longer able to generate those same phrases.

Mr. Staniszewski said ElevenLabs can identify all content made with its software. If content breaches the company’s terms of service, he added, ElevenLabs can ban its originating account and, in case of law breaking, assist authorities.

This stuff is hard to spot.
When I asked Hany Farid, a digital-forensics expert at the University of California, Berkeley, how we can spot synthetic audio and video, he had two words: good luck.

“Not only can I generate this stuff, I can carpet-bomb the internet with it,” he said, adding that you can’t make everyone an AI detective.

Sure, my video clone is clearly not me, but it will only get better. And if my own parents and sister can’t really hear the difference in my voice, can I expect others to?

I got a bit of hope from hearing about the Adobe-led Content Authenticity Initiative. Over 1,000 media and tech companies, academics and more aim to create an embedded “nutrition label” for media. Photos, videos and audio on the internet might one day come with verifiable information attached. Synthesia is a member of the initiative.

I feel good about being a human.
Unlike AI Joanna, who never smiles, real Joanna had something to smile about after this. ChatGPT generated text lacking my personality and expertise. My video clone was lacking the things that make me me. And while my video producer likes using my AI voice in early edits to play with timing, my real voice has more energy, emotion and cadence.

Will AI get better at all of that? Absolutely. But I also plan to use these tools to afford me more time to be a real human. Meanwhile, I’m at least sitting up a lot straighter in meetings now.
