On August 8th 2023, it was reported that Google was in discussions with Universal Music, a record label, to license artists' voices to feed a songwriting AI tool.

Rumours swirl about AI labs approaching the BBC, Britain's public broadcaster. Another supposed target is JSTOR, a digital library of academic journals.

Holders of information are taking advantage of their greater bargaining power. Reddit, a discussion forum, and Stack Overflow, a question-and-answer site popular with coders, have increased the cost of access to their data.

Both sites are particularly valuable because users "upvote" preferred answers, helping models know which are most relevant.

X (formerly Twitter), a social-media site, has put in place measures to limit bots' ability to scrape the site and now charges anyone who wishes to access its data. Elon Musk, its owner, plans to build his own AI business using the data.

Expanding the frontier

As a consequence, model-builders are working hard to improve the quality of the inputs they already have. Many AI labs employ armies of data annotators to perform tasks such as labelling images and rating answers.

Some of that work is complex; an advert for one such job seeks applicants with a master's degree or doctorate in life sciences. But much of it is mundane, and is being outsourced to places such as Kenya and Pakistan, where labour is cheap.

Firms are also gathering data through users' interactions with their tools. Many of these have a feedback mechanism, through which users indicate which outputs are useful.

Adobe's Firefly, a text-to-image generator, allows users to pick from one of four options. Bard, Google's chatbot, proposes three answers. Users can give ChatGPT's responses a thumbs-up or thumbs-down.

That information can be fed back as an input into the underlying model, forming what Douwe Kiela, a co-founder of Contextual AI, a startup, calls the "data fly-wheel".
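The fly-wheel Mr Kiela describes can be pictured with a minimal sketch. All names below are hypothetical, not any lab's actual system: user ratings on model outputs are logged, and upvoted examples are filtered out as candidate fine-tuning data.

```python
# Minimal sketch of a feedback "fly-wheel" (hypothetical names throughout):
# user judgements on model outputs are logged, and upvoted examples are
# kept as candidates for fine-tuning the underlying model.
from dataclasses import dataclass, field


@dataclass
class FeedbackLog:
    records: list = field(default_factory=list)

    def rate(self, prompt: str, response: str, thumbs_up: bool) -> None:
        # Record one user's thumbs-up or thumbs-down on a response.
        self.records.append(
            {"prompt": prompt, "response": response, "up": thumbs_up}
        )

    def training_candidates(self) -> list:
        # Only upvoted prompt/response pairs feed back into training.
        return [r for r in self.records if r["up"]]


log = FeedbackLog()
log.rate("Translate 'chat' from French", "cat", thumbs_up=True)
log.rate("Translate 'chat' from French", "dog", thumbs_up=False)
print(len(log.training_candidates()))  # 1 upvoted example survives
```

In practice the signal is noisier and richer (the copy-and-paste behaviour mentioned below is one such implicit signal), but the loop is the same: usage generates ratings, ratings generate data, data improves the model.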

A stronger signal still of the quality of a chatbot's answers is whether users copy the text and paste it elsewhere, he adds. That information helped Google rapidly improve its translation tool.

There is, however, one source of data that remains largely untapped: the information that exists within the walls of the tech firms' corporate customers.

Many businesses possess, often unwittingly, vast amounts of useful data, from call-centre transcripts to customer spending records. 

Such information is especially valuable because it can be used to fine-tune models for specific business purposes, such as helping call-centre workers answer queries or analysts spot ways to boost sales.

Yet making use of that rich resource is not always straightforward. Roy Singh of Bain, a consultancy, notes that most firms have historically paid little attention to the vast but unstructured datasets that would prove most useful for training AI tools.

Often these are spread across various systems, buried in company servers rather than in the cloud.

Unlocking that information would help companies customise AI tools to serve their needs better. Amazon and Microsoft, two tech giants, now offer tools to help companies manage their unstructured datasets, as does Google.

Christian Kleinerman of Snowflake, a database firm, says that business is booming as clients look to "tear down data silos". Startups are piling in.

In April Weaviate, an AI-focused database business, raised $50 million at a valuation of $200 million. Barely a week later Pinecone, a rival, raised $100 million at a $750 million valuation.

Earlier this month, Neon, another database startup, raised an additional $46 million in funding. The scramble for data is only just getting started.
