The disadvantages of using a small data set, including bias and inaccurate information, are well documented. Giving ChatGPT access to current results on the internet could make it much more useable and enlarge the data set it draws on. However, if anything, this increases the risk for ChatGPT users and everyone else.
Just because the data appears on the internet does not mean the data:
- is accurate. If you wouldn't trust a source in your Google search, you probably wouldn't want it included in AI learning.
- is completely unbiased. Social media in particular is full of everyone's opinion.
- doesn't include any personal data. Data breaches can result in personal data being leaked onto the internet. That personal data could end up processed again and again without the data subjects' knowledge and without a legitimate basis. This could result in inadvertent breaches of data protection law and fines.
- doesn't include someone's intellectual property. From music to artistic works, there is a huge amount of copyrighted material on the internet. Just because it's online doesn't mean it's free to use. It might have been shared online because it's available to purchase or because it forms part of a portfolio. Trade marks, design registrations and patents are also all publicly available online. It's unclear how such information will be used or modified by AI chatbots.
Access to up-to-date information via the internet is necessary for ChatGPT to develop, but it doesn't de-risk the output if you want to use ChatGPT in your business.
We are hosting a roundtable event on 25 January at our office in central London to discuss the risks. If you would like to attend or you need advice before then, contact me: firstname.lastname@example.org or +44 (0) 20 7611 2343.
For the technology to be truly useful, the guardrails have to come off, or at least loosen - but doing that makes it potentially more dangerous and open to misuse.