Then a clever weaver, Joseph Jacquard, designed a way to explain the intricate patterns of brocade and damask to his looms, using racks of punch cards to deliver the instructions. Wonderfully complicated new lines of communication were opened in his noisy mills. At the time, Jacquard couldn’t have guessed where his cue cards would eventually lead.
Today we have numerous programming languages to explain to our computers what we need, and they can reply with routines, solutions, and reports. You can ask your phone for travel directions, and it will locate where you are and talk you through, turn by turn, how to get where you want to go. You can even ask your laptop to read your email to you while you have breakfast. We’re finally getting more comfortable with these kinds of conversations, and today, multimodal, multilingual conversations flow continuously back and forth among us, our computers, and the machines around us with embedded computers.
4IR
The director of corporate messaging at online customer-relationship-management giant Salesforce offers a comprehensive definition of the Fourth Industrial Revolution (4IR) we’re currently navigating: “The 4IR is a fusion of advances in artificial intelligence (AI), robotics, the Internet of Things (IoT), genetic engineering, quantum computing, and more.” It’s quite a noisy maelstrom, and it makes sense to pay some attention to how our interfaces with all these technologies are serving and changing us.
To start, a review of the state-of-the-art systems that enable machines to read and write would be useful. They’re called large language models (LLMs), and one of the current leaders in this area of AI is the Generative Pre-trained Transformer 3 (GPT-3). GPT-3 was released by OpenAI in June 2020, and it’s one of the most powerful text-generating AIs yet developed, with 175 billion parameters. A new LLM from DeepMind was announced in a series of three papers released in December 2021. DeepMind’s system is called Gopher, and its training has produced a 280-billion-parameter language model. Despite the staggering factual knowledge of the two systems, both suffer from a number of problems inherent in LLMs, including toxic language, misinformation, and bias.
Both GPT-3 and Gopher work on a simple principle of autocompletion. If you provide them some text, they will guess at what comes next, and they’re incredibly good at this. In fact, they can write essays, opinion pieces, and computer code, and they’re even better at summarizing material, including scholarly papers. Today, news coverage of sporting events can be done quickly and efficiently by computers. Just input the data—hits, runs, errors, and players—and you can have a recap for tomorrow’s early edition in no time.
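For readers who want to see what that autocompletion looks like in practice, here is a minimal sketch using the legacy Completion endpoint of OpenAI’s Python library, the interface available around GPT-3’s release. The model name, the box-score prompt, and the sampling settings are our own illustrative assumptions, not a recipe any newsroom has published.

```python
import openai  # pip install openai (legacy Completion interface, circa GPT-3)

openai.api_key = "YOUR_API_KEY"  # placeholder; supply your own key

# Autocompletion in action: give the model a prompt built from box-score
# data and let it guess what comes next -- in this case, a short game recap.
prompt = (
    "Write a two-sentence recap of this baseball game.\n"
    "Final score: Riverton 5, Lakeside 3. Hits: 9-7. Errors: 0-2.\n"
    "Player of the game: J. Ortega, 2 RBIs.\n\nRecap:"
)

response = openai.Completion.create(
    engine="text-davinci-002",  # illustrative model name
    prompt=prompt,
    max_tokens=80,       # cap the length of the continuation
    temperature=0.7,     # allow some variety while staying close to the facts
)

print(response.choices[0].text.strip())
```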
GPT-3 was originally trained on 45 terabytes of text; that’s hundreds of billions of words. Adam Binks, a researcher at Clearer Thinking, offers this context: “It is far more information than a human (reading a book per day) could read in 30 lifetimes.” But that kind of scope has a downside: GPT-3 takes in massive amounts of disinformation, prejudice, and toxic language, which it can then reproduce. The GPT-3 team readily admits that “internet-trained models have internet-scale biases.”
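As a rough sanity check on that comparison, a bit of back-of-envelope arithmetic (with our own assumed figures for book length and reading span, and “hundreds of billions” taken as a round 300 billion words) suggests the claim is, if anything, conservative:

```python
# Back-of-envelope check of the "30 lifetimes of reading" comparison.
# All figures below are our own assumptions, not from the article.
WORDS_PER_BOOK = 90_000           # a typical full-length book
READING_YEARS = 75                # one book a day over an adult lifetime
TRAINING_WORDS = 300_000_000_000  # "hundreds of billions" taken as ~300 billion

books_per_lifetime = 365 * READING_YEARS
words_per_lifetime = WORDS_PER_BOOK * books_per_lifetime  # ~2.5 billion
lifetimes_needed = TRAINING_WORDS / words_per_lifetime

print(f"Words read in one lifetime: {words_per_lifetime:,}")
print(f"Lifetimes needed to match the training text: {lifetimes_needed:.0f}")
# Prints roughly 120 lifetimes, comfortably "far more than" 30.
```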
Both OpenAI and DeepMind have paid more attention to the problems of toxicity and bias in the most recent versions of their language systems. According to Will Douglas Heaven, writing in MIT Technology Review, the fully trained GPT-3 model recently had a second round of training “using reinforcement learning to teach the model what it should say and when, based on the preferences of human users.” The end result was InstructGPT, and OpenAI found that users of its application programming interface (API) now favor InstructGPT over GPT-3 more than 70% of the time.
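The core idea behind training on human preferences can be sketched in a few lines: a small reward model learns to score the completion that human raters preferred above the one they rejected, and that score is then used to steer further training. The network shape, the stand-in embeddings, and the loss form below are toy assumptions of ours, not OpenAI’s actual InstructGPT pipeline.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy reward model: maps a text embedding to a single preference score.
reward_model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

def preference_loss(chosen_emb, rejected_emb):
    # Pairwise loss: push the human-preferred completion's score above
    # the score of the completion the raters ranked lower.
    r_chosen = reward_model(chosen_emb)
    r_rejected = reward_model(rejected_emb)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# One illustrative update step on random stand-in embeddings.
chosen = torch.randn(8, 768)    # embeddings of completions raters preferred
rejected = torch.randn(8, 768)  # embeddings of completions raters rejected
loss = preference_loss(chosen, rejected)
loss.backward()
optimizer.step()
# In a full system, the language model would then be fine-tuned with
# reinforcement learning to produce text this reward model scores highly.
```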
One of the three papers announcing DeepMind’s Gopher project, “Ethical and Social Risks of Harm from Language Models,” specifically addresses six risk areas: (1) discrimination, exclusion, and toxicity; (2) information hazards; (3) misinformation harms; (4) malicious uses; (5) human-computer interaction harms; and (6) automation, access, and environmental harms. That covers just about everything from simple bias to the high environmental cost of running the hardware for natural language processing of this type. It’s encouraging to see this next major LLM being developed with attention paid to “the points of origin of different risks and…potential risk mitigation approaches” for the 21 specific risks DeepMind has chosen to address.
SYNTHESIZED STORYTELLING
As with reading and writing, computers originally had problems with human speech. The first attempts at synthesizing speech were flat, affectless, syntactically clumsy imitations that were more like parody than serious AI. Those efforts have since improved considerably, and there’s an interesting update on the quality of computer speech in an article about publishers using computer-generated speech programs to narrate their audiobooks.
In his Wired article “Synthetic Voices Want to Take Over Audiobooks,” Tom Simonite profiles two start-ups that offer speech synthesis for audiobooks, Speechki in San Francisco, Calif., and DeepZen in London.
Speechki currently has more than 300 synthetic voices for audiobook publishing across 77 languages and dialects, and its production process is a three-step effort. According to Simonite, “[Speechki] analyzes text with in-house software to mark up how to inflect different words, voices it with technology adapted from cloud providers including Amazon, Microsoft, and Google, and employs proof listeners who check for mistakes.” DeepZen uses in-house speech synthesis to clone the voices of professional narrators. Its software is trained to seek clues in the narration for where it should apply its seven different emotional colorings. Simonite notes that Google is developing its own “auto-narration” service for publishers to generate audiobooks read by 20 different synthetic voices.
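To make the “voicing” step concrete, here is a minimal sketch using Amazon Polly, one of the cloud providers Simonite names, with SSML markup standing in for the inflection-annotation step. The voice choice, the sample sentence, and the output file name are our own illustrative assumptions; the start-ups’ in-house tooling is certainly more sophisticated.

```python
import boto3  # pip install boto3; requires AWS credentials to be configured

# Step 1 (stand-in): text marked up for inflection using SSML tags.
ssml = (
    "<speak>"
    'It was a <emphasis level="moderate">dark</emphasis> and stormy night.'
    '<break time="400ms"/> The rain fell in torrents.'
    "</speak>"
)

# Step 2: voice the marked-up text with a cloud text-to-speech service.
polly = boto3.client("polly", region_name="us-east-1")
response = polly.synthesize_speech(
    Text=ssml,
    TextType="ssml",       # tell Polly the input carries inflection markup
    VoiceId="Joanna",      # one of Polly's stock narration voices
    OutputFormat="mp3",
)

# Step 3 would be human proof-listening; here we simply save the audio.
with open("chapter_01.mp3", "wb") as f:
    f.write(response["AudioStream"].read())
```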
DeepZen’s CEO, Taylan Kamis, told Simonite that synthetic narration will help right the global imbalance in audiobooks, most of which are produced in English. Kamis explained, “A large backlist of titles never gets converted into audio or are converted only into English.” Overall, the market has grown from just under 5,000 audiobooks published in the United States in 2008 to 71,000 in 2020, according to Statista. The Association of American Publishers notes that total U.S. book publisher revenue declined slightly between 2015 and 2020, with e-book revenues also shrinking. In the same period, audiobook revenue grew 157%.
Simonite was reassured by both start-ups that “they are not a threat to professional narrators because their technology will be used to make audiobooks that would not otherwise have been recorded.” Given the growth of the audiobook market and the disparity in production costs (professional narrators charge $250 per finished hour of audio versus DeepZen’s $120 per hour), any threat would seem to rest heavily on the quality of Speechki’s and DeepZen’s recordings and how well listeners accept them.
Taking the long view of this evolving machine-human dialogue, from Jacquard’s punch-card notes to his looms in 1801 to the present, we have made impressive progress.