SmartHORIZON 23

The network is omnipresent, the computing power is ubiquitous, and the intelligence is everywhere :-)

Smart technology incorporates the Internet of Things (IoT), artificial intelligence (AI), and machine learning (ML) to gather, process, and analyze data, and to make decisions or perform tasks autonomously. Everything is now considered smart because of advancements in technology and the proliferation of the internet. Internet connectivity enables smart devices to communicate with other devices and systems, creating a network of interconnected devices that can work together to provide more sophisticated and automated services. The increasing demand for convenience and efficiency is also driving the trend towards smart technology. Some examples of smart products and services include:
  • Smart home devices: products such as smart thermostats, smart lighting, and smart appliances that can be controlled remotely through a smartphone app or voice commands.
  • Smart wearables: products such as smartwatches, fitness trackers, and smart glasses that can track various health and fitness metrics, and provide notifications and other information.
  • Smart transportation: services such as ride-sharing apps and self-driving cars, which use technology such as GPS, real-time data, and traffic management systems to optimize transportation.
  • Smart healthcare: products such as wearable devices and telemedicine systems that can monitor patients' health and provide remote medical care.
  • Smart energy management: use of technology such as IoT and analytics to optimize energy consumption and reduce waste.
  • Smart industry: using technology such as IoT and automation to improve efficiency and productivity in manufacturing and other industries.
  • Smart cities: using technology such as IoT, sensors and data analytics to improve urban infrastructure and services.
  • Smart network infrastructure: networks that use IoT, automation, and analytics to configure, monitor, and optimize themselves.
Financial technology (fintech) innovations refer to the application of technology to the financial services industry. This can include advancements in areas such as payments, lending, banking, and asset management. These innovations have the potential to increase access to financial services, reduce costs, and improve the efficiency and security of financial transactions. As technology continues to evolve, it is likely that fintech will continue to play an important role in shaping the future of finance. 
  • Mobile payments: The widespread adoption of mobile payments, such as Apple Pay and Google Wallet, in the 2010s greatly increased the convenience and accessibility of digital payments.
  • Online lending platforms: The emergence of online lending platforms, such as Lending Club and Prosper, in the 2010s greatly expanded the reach and accessibility of personal and small business loans.
  • Digital currencies and blockchain technology: The development of digital currencies such as Bitcoin and blockchain technology in the 2010s greatly increased the security and speed of digital transactions.
  • Robo-advisers: The emergence of robo-advisers, which use algorithms to provide investment advice, in the 2010s greatly expanded the accessibility of investment management services.
  • Digital banking: The popularization of digital banking in the form of online and mobile banking apps greatly increased the convenience and accessibility of banking services in the 2010s.
  • Open banking: The concept of open banking, which allows customers to share their financial data with third-party providers, became more prevalent in the 2010s.
  • Biometrics: The use of biometrics such as fingerprint and facial recognition for identity verification in fintech became more prevalent in the 2010s.
  • Insurtech: The emergence of insurtech, which uses technology to make insurance more efficient and accessible, became more prevalent in the 2010s.
  • Digital identities: The development of digital identities, which allow individuals to prove their identity online, became more prevalent in the 2010s.
  • Artificial intelligence and machine learning: The use of artificial intelligence and machine learning in fintech, such as for fraud detection and risk management, became more prevalent in the 2010s.
Legal technology (legal tech) innovations refer to the application of technology to the legal industry. This can include advancements in areas such as legal research, contract review, and dispute resolution.  These innovations have the potential to increase the efficiency and accessibility of legal services, reduce costs, and improve the overall experience for clients. Legal tech is a rapidly evolving field, with new breakthroughs and applications being discovered regularly. Some examples of legal tech innovations include:
  • Legal research platforms, which use artificial intelligence to quickly search and analyze large volumes of legal information.
  • Contract review software, which uses natural language processing to automatically review and identify key terms and clauses in legal documents.
  • E-discovery tools, which automate the process of identifying, collecting, and reviewing electronic documents in the context of legal proceedings.
  • Online dispute resolution (ODR), which uses technology to facilitate the resolution of disputes outside of traditional court proceedings.
  • Legal chatbots, which use natural language processing to provide automated legal advice to individuals and businesses.
Biotechnology innovations refer to the application of technology to the study and manipulation of biological systems. This can include advancements in areas such as genetic engineering, medical research, and agricultural technology. These innovations have the potential to improve human health, increase crop yields, and enhance the sustainability of the environment. Biotechnology is a rapidly evolving field, with new breakthroughs and applications being discovered regularly.
  • CRISPR-Cas9: The development of CRISPR-Cas9 gene editing technology in the 2010s greatly expanded the capabilities of genetic engineering and opened up new opportunities for treating genetic diseases.
  • Personalized medicine: The use of biotechnology in the form of personalized medicine, which tailors medical treatment to an individual's genetic makeup, became more prevalent in the 2010s.
  • Synthetic biology: The development of synthetic biology in the 2010s allowed for the design and construction of new biological parts and systems.
  • Bioprinting: The use of bioprinting technology to print living tissue and organs became an active area of research in the 2010s.
  • Biodegradable plastics: The development of biodegradable plastics made from natural materials such as corn starch and sugarcane became more prevalent in the 2010s.
  • Immunotherapy: The use of immunotherapy, which harnesses the body's immune system to fight cancer and other diseases, became an active area of research in the 2010s.
  • Microbial biotechnology: The use of microbial biotechnology in the form of probiotics and fermented foods became more prevalent in the 2010s.
  • Biopesticides: The development of biopesticides, which use natural materials such as bacteria and viruses to control pests, became more prevalent in the 2010s.
  • Biorefineries: The use of biorefineries to convert biomass into a range of products including biofuels, chemicals, and materials became more prevalent in the 2010s.
  • Biodegradable packaging: The development of biodegradable packaging made from natural materials such as plant fibers and starch became more prevalent in the 2010s.
  • DNA data storage: In the last decade, DNA has been increasingly investigated as an alternative medium for cold data storage, presenting several advantages over standard hard drives such as higher density, longer lifespan, and lower energy consumption. However, such coding methods are limited by biochemical constraints that elevate the probability of errors being added to the coded nucleotides during synthesis, storage, and sequencing. Although such errors can be limited by carefully designing the produced strands, it is unfeasible to avoid them completely.
Medical technology (medtech) innovations refer to the application of technology to the healthcare industry. This can include advancements in areas such as diagnostic tools, medical devices, telemedicine, and digital health. These innovations have the potential to increase the efficiency and accessibility of healthcare services, improve patient outcomes, and reduce costs. Medtech is a rapidly evolving field, with new breakthroughs and applications being discovered regularly. The COVID-19 pandemic accelerated the adoption of telemedicine and digital health tools, as it was necessary to minimize in-person contact and continue providing healthcare in a safe way. Some examples of medtech innovations include:
  • Imaging technology such as MRI, CT and PET scans, which allow for non-invasive examination of internal organs and body structures.
  • Robotic surgery, which uses robots to assist with or perform complex surgical procedures.
  • Telemedicine, which allows for remote consultations and monitoring of patients using technology such as video conferencing and wearable devices.
  • Biomedical devices such as pacemakers, artificial joints and insulin pumps, which improve the quality of life of patients with chronic conditions.
  • Digital health tools, such as electronic health records (EHRs) and personal health apps, which allow for improved data sharing, monitoring and analysis of patients' health.
Meteorology technology (meteotech) has witnessed significant advancements and breakthroughs, particularly in the integration of artificial intelligence (AI) and machine learning to improve weather forecasting accuracy and speed. 
  • Pangu-Weather by Huawei: Pangu-Weather is an AI model developed by Huawei that can predict weekly weather patterns worldwide much faster than traditional forecasting methods, while maintaining comparable accuracy. It employs a deep neural network trained on 39 years of reanalysis data, which combines historical weather observations with modern models. Unlike conventional methods that analyze weather variables one by one, Pangu-Weather can analyze all variables simultaneously within seconds. The AI system has demonstrated the ability to accurately predict extreme weather events, such as the path of a tropical cyclone, even without specific training data on such events. Pangu-Weather's success indicates that AI models can generalize weather patterns and handle previously unseen scenarios effectively.
  • FourcastNet by Nvidia: Nvidia's FourcastNet is another AI-based weather forecasting model that has shown comparable accuracy to conventional methods while significantly reducing the time needed to produce forecasts. As one of the recent contributions to the AI-powered weather forecasting domain, FourcastNet aims to challenge traditional approaches with its efficient and rapid prediction capabilities. This technology, along with others like Pangu-Weather and GraphCast, is encouraging meteorologists to rethink how machine learning can complement existing weather prediction methods to achieve more reliable forecasts[2].
  • GraphCast by Google DeepMind: GraphCast is another AI model developed by Google DeepMind that contributes to the advancement of weather forecasting. Though specific details about GraphCast's performance are not provided in the reference, it is mentioned alongside Pangu-Weather and FourcastNet as one of the AI systems that is pushing the boundaries of weather forecasting using machine learning techniques.
The emergence of AI-powered weather forecasting systems presents several advantages, including rapid forecast generation, the ability to analyze multiple weather variables simultaneously, and the potential for improved prediction accuracy. These innovations come at a crucial time, as the impacts of climate change are leading to more unpredictable and extreme weather events, making reliable forecasts essential for disaster preparedness and prevention.
However, it's important to note that while AI models have shown great promise, they might not entirely replace traditional forecasting methods. AI models heavily rely on historical weather data for training, which could limit their effectiveness in predicting rare and unprecedented weather events. Meteorologists are likely to adopt a hybrid approach, combining AI-powered forecasting models with conventional methods to achieve the most accurate predictions.
Overall, the integration of AI and machine learning in meteorology technology is transforming weather forecasting, and these breakthroughs are poised to have a significant impact on our ability to prepare for and respond to weather-related challenges.
Automotive innovations refer to the development and implementation of new technologies and designs in the automotive industry. This can include advancements in areas such as propulsion systems, vehicle design, safety features, and connectivity.  These innovations have the potential to improve energy efficiency, reduce emissions, increase safety, and enhance the overall driving experience. Additionally, they also contribute to make transportation more sustainable in the long run. 
  • Electric vehicles: The development and popularization of electric vehicles in the 2010s greatly increased the availability and accessibility of emissions-free transportation options.
  • Autonomous vehicles: The development of autonomous vehicle technology in the 2010s greatly increased the capabilities and safety of vehicles, with companies such as Waymo, Tesla, Uber, and GM Cruise working on it.
  • Connected cars: The integration of internet connectivity and advanced communication technology into vehicles in the 2010s greatly increased the capabilities and functionality of cars.
  • Advanced driver assistance systems (ADAS): The widespread adoption of advanced driver assistance systems (ADAS) such as lane departure warning, adaptive cruise control, and automatic emergency braking in the 2010s greatly increased the safety of vehicles.
  • Lightweight materials: The use of lightweight materials such as aluminum, carbon fiber, and composites in the 2010s greatly improved the fuel efficiency and performance of vehicles.
  • Hybrid and plug-in hybrid vehicles: The development and popularization of hybrid and plug-in hybrid vehicles in the 2010s greatly increased the fuel efficiency and reduced emissions of vehicles.
  • Fuel cell vehicles: The development of fuel cell vehicles, which use hydrogen fuel cells to generate electricity, became an active area of research in the 2010s.
  • 3D printing: The use of 3D printing in the automotive industry in the 2010s greatly increased the efficiency and flexibility of manufacturing processes.
Artificial intelligence innovations refer to the development and application of new techniques and algorithms in the field of AI. This can include advancements in areas such as machine learning, natural language processing, computer vision, and robotics. Innovations are also driven by advancements in hardware, such as specialized AI chips, and by increased access to large amounts of data. These innovations have a wide range of applications, including healthcare, finance, transportation, and entertainment.
  • Machine learning: The development and popularization of machine learning techniques such as deep learning and neural networks in the 2010s greatly expanded the capabilities of artificial intelligence. Advancements in areas such as deep learning, reinforcement learning, natural language processing, and computer vision have led to significant improvements in the ability of machines to perform tasks such as image and speech recognition, language translation, and decision-making. The increased availability of large amounts of data and computational power has also been a key driver of machine learning innovation.
  • Natural Language Processing (NLP): The development of natural language processing (NLP) technologies in the 2010s greatly improved the ability of machines to understand and generate human language.
  • Computer Vision: The development of computer vision technologies in the 2010s greatly improved the ability of machines to understand and interpret visual information.
  • Robotics: The development of advanced robotics and autonomous systems in the 2010s greatly expanded the capabilities and applications of artificial intelligence.
  • Virtual assistants: The emergence of virtual assistants such as Siri, Alexa, and Google Assistant in the 2010s greatly increased the accessibility and convenience of artificial intelligence.
  • Recommendation Systems: The popularization of recommendation systems in the 2010s greatly improved the ability of machines to personalize and optimize experiences for users.
  • Chatbots: The development and deployment of chatbots for customer service and other applications in the 2010s greatly increased the accessibility and convenience of artificial intelligence.
  • Generative Adversarial Networks (GANs): The development of Generative Adversarial Networks (GANs) in the 2010s greatly improved the ability of machines to generate new data, such as images and text.
  • Reinforcement Learning: The development of reinforcement learning in the 2010s greatly improved the ability of machines to learn and make decisions in complex, dynamic environments.
  • Explainable AI (XAI): The development of explainable AI (XAI) in the 2010s greatly improved the ability of machines to provide understandable and transparent explanations for their decisions.
The top 19 technology trends predicted to reach adoption in 2023 are:
  • Remote Healthcare & Wearables (B+): Remote healthcare with advanced wearables will enable patients to obtain remote medical assistance and physicians to perform procedures, consult with remote experts, and have access to vital health information.
  • Augmented Reality (B): Seamless integration between the real world and cyberspace will increasingly materialize.
  • Software for the Edge2Cloud Continuum (B): This includes new software for the development and deployment of next-generation computing components, systems, and platforms that enable a transition to a compute continuum with strong capacities at the edge and far edge in an energy-efficient and trustworthy manner.
  • Open Hardware (B): From open systems (OCP) to ISAs (RISC-V) and interconnects (CXL, UCIe), the open-source movement has expanded into hardware.
  • AI-Assisted DevOps (B): The traditional DevOps approach will be improved to address the increasing complexity of software systems.
  • 3D Printing in Personalized Healthcare (B-): 3D printing in healthcare will evolve towards customized additive manufacturing for individuals.
  • Generative AI (B-): In the next few years generative AI will be used more and more, increasing effectiveness and enabling new services. It is also bound to raise ethical and societal issues. Expect strong impact on business (short term), education (long term), and society (medium to long term). Text-to-Image: Demand for illustration and design is increasing exponentially due to the growth of digital channels, social-media outlets, TV, and visual ads globally. Brands and entities across industries are struggling to get the perfect illustrations and designs onto their digital channels. The time to illustrate and design a storyline within an app, a website, or an ad is high, impacting time to market, and translating business requirements and briefs into designs is challenging when done manually.
  • IT for Sustainability (B-): Technology will evolve from sustainable IT to novel uses of IT for sustainability, clean energy, and a green economy.
  • Autonomous Driving (B/C): Self-driving vehicles in controlled environments are starting to gain adoption at scale, backed by strong business cases.
  • Digital Distributed Manufacturing (B/C): Digital Distributed Manufacturing will reduce energy and environmental footprints and increase the resilience of supply chains.
  • Trusted Computing (B/C): There will be increased public awareness and attention to trusted/assured computation across all industry sectors. Governments will increase focus on legislative actions to ensure that public facing systems can be trusted.
  • Huge Graph Neural Networks (B/C): Applications that use huge models, such as ChatGPT, have demonstrated a real impact on a substantial set of problems. Graph Neural Networks can represent complex, “real-world” structures. We predict that huge GNN models will be widely used in machine learning.
  • Adaptive, Generative Pharmaceuticals (C+): Advances in nanotechnology and AI could shorten the time to vaccine development and broaden their efficacy.
  • Autonomous Robots & Brain-Machine I/F (C+): Pervasive uptake of robotic platforms will take place, including as extensions of the human body.
  • Artificial General Intelligence (AGI) (C+): Advances in AI will lead to AGI systems that can understand or learn any intellectual task that a human being can perform.
  • Global Digitalization of Monetary Transactions (C+): Digital transformation of monetary transactions will open new disruptive opportunities in global markets.
  • Space ITC (C): As more companies send technology to space, the barriers to entry are decreasing rapidly.
  • Sustainable Space Manufacturing (C/D): Space manufacturing and recycling technologies and services will improve the sustainability, resilience, and cost of the space ecosystem.
  • Disinformation Detection/Correction (C/D): Improving the reliability of the information in public health, politics, and science will improve public information required for sound decisions from personal to societal levels.
The broad concept of convergence seems to be quite simple: combine the ideas, skills, and/or methods of multiple disciplines to create something new. More specific definitions vary, and while interest in convergence and convergent problems continues to increase, there is no easy, operational definition of convergence.
ICon (intelligent connections) is a concept that refers to the use of advanced technology to improve connectivity and communication between devices, systems, and people. It can include the use of artificial intelligence, machine learning, and other technologies to make connections more efficient, accurate, and secure. The goal of ICon is to create a more seamless and intuitive experience for users, and to enable new applications and services that were previously not possible. The concept is based on the combination of new 5th-generation (5G) networks and AI, in order to accelerate technological development and digital transformation. This technology trend will lead to more efficient, secure, and reliable connections between devices and systems.
  • Internet of Things (IoT): 5G networks are expected to play a major role in enabling the IoT by providing the necessary connectivity and low-latency required for IoT devices.
  • Industry 4.0: 5G networks will also enable Industry 4.0, by providing fast, low-latency, and reliable connections for industrial automation and robotics, and allowing for real-time monitoring and control of industrial processes.
  • Autonomous vehicles: With their low latency and high-bandwidth capabilities, 5G networks will enable the development of autonomous vehicles and smart transportation systems.
  • Augmented Reality and Virtual Reality (AR/VR): 5G networks will provide the necessary connectivity and low-latency required for immersive AR and VR experiences, making it possible to use them in a variety of applications such as remote collaboration, education, and entertainment.
  • Edge computing: 5G networks will enable the deployment of edge computing, which will bring computation and data storage closer to the devices that need it, reducing latency and providing faster response times.
  • Network slicing: 5G networks will allow for the creation of multiple "slices" of the network, each with different characteristics and capabilities, to support the specific needs of different use cases and applications.
  • Security: 5G networks will provide new security capabilities, such as network slicing, that will allow for more secure and reliable connections between devices and systems.
5G/6G wireless technology is seen as one of the enablers of real-time fusion of the physical and digital realms, as in the Metaverse, extended reality (XR), or Digital Twins (DT). The synchronization between physical and virtual events, as well as between different players located all around the world, will be a significant challenge: maintaining the illusion of locality and a coherent timeline across thousands or millions of endpoints stresses 5G/6G capabilities. These technologies would allow people to interact, work, and entertain themselves in immersive online 3D virtual environments.
  • The Metaverse virtual world contains digital representations of individuals and objects in the physical world, allowing interaction and engagement in various contexts.
  • Extended Reality (XR) is an umbrella term that refers to computer-generated environments that merge the physical and virtual worlds or create an entirely virtual experience for users. XR encompasses three technologies: Virtual Reality (VR), Augmented Reality (AR), and Mixed Reality (MR). XR takes the human-to-PC screen interface and modifies it by either immersing users in a virtual environment (VR) or adding elements to the user's surroundings (AR) or both (MR). 
  • Digital twins bridge the gap between the physical and digital realms by creating rich digital models of any physical asset, object, or environment that is important.
The Metaverse concept refers to a virtual world that is fully immersive and interconnected, where users can interact with each other and with computer-generated environments in real time. It is often described as a digital universe where people can live, work, and play together, and where the boundaries between the virtual and physical world are blurred. The metaverse is not limited to gaming or entertainment; it could also be applied to areas such as education, business, and social interaction. The concept is still largely in the development stage, and various technologies such as VR, AR, blockchain, 5G, and more are expected to play a role in bringing the metaverse to reality. Because the metaverse is still taking shape, there is a lot of innovation happening in the field: many companies, startups, and researchers are working to develop new technologies and ideas to make the metaverse a reality, and massive game networking (GN) platforms serve as precursors to the Metaverse. Some of the key areas where innovation is taking place include:
  • Virtual and Augmented Reality: The use of VR and AR technologies is crucial for creating fully immersive and believable virtual worlds.
  • Blockchain: Blockchain technology is being explored as a way to create secure and decentralized virtual worlds, where users have full control over their data and virtual assets.
  • 5G and Edge computing: High-speed internet and edge computing are necessary to support real-time interactions and large numbers of users within the metaverse.
  • Artificial Intelligence: AI is being used to create more realistic and responsive virtual characters and environments.
  • Game engines: Game engines such as Unity and Unreal Engine are being used to create metaverse experiences.
  • Interoperability: Interoperability between different virtual worlds, platforms and devices is important for creating a seamless metaverse experience.

The process of Metaverse standardisation includes the following steps:
  • Identification of Use Cases, either existing or possible (automotive, defence, education, enterprise, eSports, events, finance, food, gaming, healthcare, hospitality, professional training, real estate, remote work, retail, social media, travel, virtual spaces, workflows).
  • Identification of External Services that a Metaverse can use (content creation /Roblox, MagicaVoxel, OmniVerse/, marketplace /OpenSea/, crypto wallets, cryptocurrency exchanges, development services /LandVault, InfiniteReality, DigiSomni/, platforms /Metaverse.network, Vircadia/).
  • Identification of Functionalities from use cases and external services (instance, environment, content representation, perception, user, interaction, information search, economy support).
  • Analysis of the state and characteristics of the Technologies required to support the functionalities (sensory information, data processing, user devices, network, energy).
  • Analysis of the issues regarding the management of the Metaverse Ecosystem and legal issues (governance /regulation, property, trademark, authorship, contract, tort, defamation, privacy, taxation, mental health/, stakeholders /manager, operator, user/).

Digital Twin (DT) technology refers to the creation of a virtual representation of a physical object or system, allowing for simulation, analysis, and monitoring of its real-world counterparts. This technology can be applied in various industries such as manufacturing, healthcare, and transportation to improve efficiency, performance, and decision making. Digital twin technology is also used in IoT and Industry 4.0, as it enables real-time monitoring and analysis of data from connected devices and systems. The main features of digital twins typically include:
  • Real-time data integration: Digital twins are built with real-time data from sensors and other sources, providing a complete and accurate representation of the physical asset.
  • Simulation and modeling: Digital twins can be used to simulate the behavior and performance of physical assets, allowing for analysis and testing of different scenarios.
  • Predictive maintenance: Digital twins can analyze data to detect potential issues and predict when maintenance is required, reducing downtime and improving efficiency.
  • Remote monitoring: Digital twins allow for remote monitoring and control of physical assets, reducing the need for on-site visits and increasing accessibility.
  • Collaboration and decision making: Digital twins enable stakeholders to collaborate and make informed decisions based on a shared understanding of the physical asset's state and performance.
  • Contextual information: Digital twins can include a range of contextual information, such as design specifications, historical data, and environmental conditions.

In the telecommunications industry, a digital twin can monitor and manage complex 5G networks. As operators add layers of technology, including connected devices and additional spectrum bands, a digital twin can monitor and augment these complex systems in real time. A digital twin emulator can support design decisions, test what-if system configurations, verify and validate the actual behavior of the complete system off-line, test realistic reactions, and provide statistics on the system's performance. With the support of AI, digital transformation through the notion of the Digital Twin has been taking off in the development and deployment of complex 5G environments.
Generative AI refers to a class of algorithms that generate new data, such as images, text, sounds, or videos, based on a learned pattern from existing data. The generated data can be used for various purposes such as creating new art, generating realistic simulations, or synthesizing new data for machine learning tasks.
Language is a prominent ability of human beings to express and communicate, which develops in early childhood and evolves over a lifetime. Machines, however, cannot naturally grasp the abilities of understanding and communicating in the form of human language unless equipped with powerful artificial intelligence (AI) algorithms. It has been a longstanding research challenge to achieve this goal: to enable machines to read, write, and communicate like humans. Technically, language modeling (LM) is one of the major approaches to advancing the language intelligence of machines. In general, LM aims to model the generative likelihood of word sequences, so as to predict the probabilities of future (or missing) tokens. The research on LM has received extensive attention in the literature, and can be divided into four major development stages:
  • statistical language models (SLM)
  • neural language models (NLM)
  • pre-trained language models (PLM)
  • large language models (LLM)

An LLM (large language model) is a neural network designed to understand, generate, and respond to human-like text. These models are deep neural networks trained on massive amounts of text data, sometimes encompassing large portions of the entire publicly available text on the internet. The "large" in large language model refers to both the model's size in terms of parameters and the immense dataset on which it is trained. Models like this often have tens or even hundreds of billions of parameters, which are the adjustable weights in the network that are optimized during training to predict the next word in a sequence. Next-word prediction is sensible because it harnesses the inherent sequential nature of language to train models on a very simple task, and so it is surprising to many researchers that it can produce such capable models.
LLMs utilize an architecture called the transformer, which allows them to pay selective attention to different parts of the input when making predictions, making them especially adept at handling the nuances and complexities of human language. The transformer architecture consists of two submodules, an encoder and a decoder. The encoder module processes the input text and encodes it into a series of numerical representations or vectors that capture the contextual information of the input. Then, the decoder module takes these encoded vectors and generates the output text from them. A key component of transformers and LLMs is the self-attention mechanism, which allows the model to weigh the importance of different words or tokens in a sequence relative to each other. Since LLMs are capable of generating text, they are also often referred to as a form of generative artificial intelligence, often abbreviated as generative AI or GenAI.
The general process of creating an LLM includes pretraining and finetuning. The "pre" in pretraining refers to the initial phase where a model like an LLM is trained on a large, diverse dataset to develop a broad understanding of language. This pretrained model then serves as a foundational resource that can be further refined through finetuning, a process where the model is specifically trained on a narrower dataset that is more specific to particular tasks or domains. The first training stage creates an initial pretrained LLM, often called a base or foundation model; a typical example of such a model is GPT-3. The GPT architecture employs only the decoder portion of the original transformer. It is designed for unidirectional, left-to-right processing, making it well suited for text generation and next-word prediction, generating text iteratively one word at a time. After obtaining a pretrained LLM from training on large text datasets, where the LLM is trained to predict the next word in the text, we can further train the LLM on labeled data, also known as finetuning. The two most popular categories of finetuning LLMs are instruction-finetuning and finetuning for classification tasks. In instruction-finetuning, the labeled dataset consists of instruction and answer pairs, such as a query to translate a text accompanied by the correctly translated text. In classification finetuning, the labeled dataset consists of texts and associated class labels, for example, emails associated with spam and non-spam labels.
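To make the transformer's self-attention mechanism concrete, here is a minimal sketch of scaled dot-product self-attention with a causal (decoder-style) mask over a toy sequence, written in NumPy. The dimensions and the random stand-ins for the learned projection matrices are illustrative assumptions, not the actual GPT architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "token embeddings": a sequence of 4 tokens, each a 6-dimensional vector.
X = rng.normal(size=(4, 6))

# Learned projection matrices (random stand-ins here) map tokens to queries, keys, values.
d_k = 6
W_q, W_k, W_v = (rng.normal(size=(6, d_k)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Scaled dot-product attention: each token scores every position in the sequence.
scores = Q @ K.T / np.sqrt(d_k)

# Causal mask for decoder-style (GPT-like) models: a token may only attend to
# itself and earlier tokens, which is what enables left-to-right next-word prediction.
mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
scores[mask] = -np.inf

# Softmax over positions turns scores into attention weights.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

context = weights @ V    # each row: a context-aware representation of one token
print(weights.round(2))  # rows sum to 1; the upper triangle is 0 due to the causal mask
print(context.shape)     # (4, 6)
```

In a real LLM this operation is repeated across many attention heads and stacked layers, and the projection matrices are learned during pretraining rather than drawn at random.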

Large language models (LLM) are typically characterized by a very large number of parameters (many billions or even trillions) whose values are learned by crunching on an enormous set of training data. Through a process called unsupervised learning, large language models automatically learn meaningful representations (known as embeddings) as well as semantic relationships among short segments of text. Then, given a prompt from a person, they use a probabilistic approach to generate new text. In its most elemental sense, what the neural network does is use a sequence of words to choose the next word to follow in the sequence, based on the likelihood of finding that particular word next in its training corpus. The neural network doesn't always just choose the most likely word, though. It can also select lower-ranked words, which gives it a degree of randomness, and therefore "interestingness", as opposed to generating the same thing every time. After adding the next word to the sequence, it just needs to rinse and repeat to build longer sequences, as the toy sketch below illustrates. In this way, large language models can create very human-looking output of various forms: stories, poems, tweets, whatever, all of which can appear indistinguishable from the works people produce.
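As a toy illustration of this generate-one-word-then-repeat loop, the following sketch builds next-word counts from a tiny made-up corpus and samples a continuation. The bigram table is the crudest possible stand-in for a neural language model; the corpus and prompt are illustrative assumptions.

```python
import random
from collections import Counter, defaultdict

# A tiny toy corpus standing in for the training data of a real language model.
corpus = "the cat sat on the mat the cat saw the dog the dog sat on the rug".split()

# Count which word follows which: a bigram table instead of a neural network.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def next_word(prev, greedy=False):
    counts = following[prev]
    if not counts:                          # dead end in the toy table: fall back to any word
        return random.choice(corpus)
    if greedy:                              # always pick the single most likely word
        return max(counts, key=counts.get)
    words, freqs = zip(*counts.items())
    total = sum(freqs)                      # sample in proportion to likelihood,
    probs = [f / total for f in freqs]      # which adds variety ("interestingness")
    return random.choices(words, weights=probs, k=1)[0]

random.seed(0)
for greedy in (True, False):
    seq = ["the"]
    for _ in range(8):                      # rinse and repeat: append one word at a time
        seq.append(next_word(seq[-1], greedy))
    print("greedy " if greedy else "sampled", " ".join(seq))
```

The greedy run produces the same repetitive continuation every time, while the sampled run varies, which mirrors the trade-off between determinism and "interestingness" described above.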
Ever since the Turing Test was proposed in the 1950s, humans have explored the mastering of language intelligence by machines. Language is essentially a complex, intricate system of human expression governed by grammatical rules, and it poses a significant challenge to develop capable artificial intelligence (AI) algorithms for comprehending and grasping a language. As a major approach, language modeling has been widely studied for language understanding and generation in the past two decades, evolving from statistical language models to neural language models. Recently, pre-trained language models (PLMs) have been proposed by pretraining Transformer models over large-scale corpora, showing strong capabilities in solving various natural language processing (NLP) tasks. Since researchers have found that model scaling can lead to improved model capacity, they have further investigated the scaling effect by increasing the parameter scale to an even larger size. Interestingly, when the parameter scale exceeds a certain level, these enlarged language models not only achieve a significant performance improvement, but also exhibit some special abilities (e.g., in-context learning) that are not present in small-scale language models (e.g., BERT). To discriminate the language models in different parameter scales, the research community has coined the term large language models (LLM) for the PLMs of significant size (e.g., containing tens or hundreds of billions of parameters). Recently, the research on LLMs has been largely advanced by both academia and industry, and a remarkable progress is the launch of ChatGPT (a powerful AI chatbot developed based on LLMs), which has attracted widespread attention from society. The technical evolution of LLMs has been making an important impact on the entire AI community, and would revolutionize the way we develop and use AI algorithms.
  • Theory and principle. To understand the underlying working mechanism of LLMs, one of the greatest mysteries is how information is distributed, organized, and utilized through the very large, deep neural network.
  • Model architecture. Due to its scalability and effectiveness, the Transformer, consisting of stacked multi-head self-attention layers, has become the de facto architecture for building LLMs.
  • Model training. In practice, it is very difficult to pretrain capable LLMs, due to the huge computation consumption and the sensitivity to data quality and training tricks.
  • Model utilization. Since fine-tuning is very costly in real applications, prompting has become the prominent approach to using LLMs.
  • Safety and alignment. Despite their capacities, LLMs pose similar safety challenges as small language models.
  • Application and ecosystem. As LLMs have shown strong capacity in solving various tasks, they can be applied in a broad range of real-world applications (i.e., following task-specific natural language instructions). As a remarkable progress, ChatGPT has potentially changed the way humans access information.
The advanced language model GPT-3 (Generative Pre-trained Transformer) uses deep learning techniques, specifically the Transformer architecture, to generate text that is similar to human language. GPT-3 is trained on a massive corpus of diverse texts, allowing it to generate highly coherent and context-aware text. The models are first pre-trained on large-scale corpora collected from the Web via a self-supervised learning task (auto-regressive language modeling), and then fine-tuned on specific downstream tasks. It has been used for a wide range of applications, including natural language processing, chatbots, machine translation, and content generation. GPT-3 has received significant attention for its advanced capabilities and its potential to transform various industries, but it also raises concerns about its potential impact on employment and the ethics of AI. GPT is proposed for language generation tasks via pre-training a Transformer decoder model (multi-head attention) on a large-scale text corpus with a classical causal language modeling task, where the model learns to predict the next word token dependent only on the previous word tokens. Further, GPT-3, with a larger model size pre-trained on a larger-scale text corpus, achieves remarkable performance in various downstream tasks (e.g., translation, summarization), including classification tasks (e.g., reading comprehension), even without fine-tuning (zero-shot) via appropriate prompt design. OpenAI GPT-3 is a machine learning model that can be used to generate predictive text via an API. OpenAI offers different models, and the most capable one is called "text-davinci-002". The main request parameters are listed below; a local sketch of how temperature and top-p sampling work follows the list.
  • Engine parameter specifies the AI model employed to generate predictions.
  • Max tokens parameter specifies the maximum number of tokens that can be generated by the model. A token can be seen as a piece of word. As a rule of thumb, 1 token is around 4 characters.
  • Temperature parameter: a value close to 1 means that the logits are passed through the softmax function without modification. If the temperature is close to zero, the most probable tokens become very likely compared to the other tokens, i.e. the model becomes more deterministic and will always output the same set of tokens after a given sequence of words.
  • Top p parameter specifies a sampling threshold during inference time. Top-p sampling (sometimes called nucleus sampling) is a technique used to sample possible outcomes of the model.
  • Frequency penalty parameter controls the model’s tendency to repeat predictions.  It reduces the probability of words that have already been generated. The penalty depends on how many times a word has already occurred in the prediction.
  • Presence penalty parameter encourages the model to make novel predictions. The presence penalty lowers the probability of a word if it already appeared in the predicted text. Unlike the frequency penalty, the presence penalty does not depend on the frequency at which words appear in past predictions.
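The effect of the temperature and top-p parameters can be reproduced locally. The following is a minimal sketch of temperature scaling and nucleus (top-p) sampling over a made-up next-token distribution; the toy vocabulary and logits are illustrative assumptions, not values returned by the API.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["cat", "dog", "mat", "rug", "the"]       # toy vocabulary
logits = np.array([2.0, 1.5, 0.3, 0.1, -1.0])     # made-up model scores for the next token

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def sample(logits, temperature=1.0, top_p=1.0):
    # Temperature: values near 1.0 leave the distribution unchanged; values near 0
    # make the most likely token dominate (nearly deterministic output).
    probs = softmax(logits / max(temperature, 1e-6))

    # Top-p (nucleus) sampling: keep only the smallest set of tokens whose
    # cumulative probability reaches top_p, then renormalize and sample.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cumulative, top_p) + 1]
    kept_probs = probs[keep] / probs[keep].sum()
    return vocab[rng.choice(keep, p=kept_probs)]

print(sample(logits, temperature=1.0, top_p=1.0))   # full distribution
print(sample(logits, temperature=0.1, top_p=1.0))   # nearly always the top token
print(sample(logits, temperature=1.0, top_p=0.5))   # only the most probable few tokens
```

The frequency and presence penalties act at the same point: they lower the logits of tokens that have already appeared in the generated text before the softmax is applied.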

The ethics of GPT-3 and other advanced language models like it raise important questions about the responsible development and deployment of AI technology. It is important to address these ethical considerations in the development and deployment of GPT-3 and other language models, to ensure that the benefits of AI technology are widely distributed and the potential harm is minimized. Some of the ethical considerations include:
  • Bias and discrimination: GPT-3 and other language models are trained on data that reflects societal biases and prejudices. This can result in generated text that perpetuates harmful stereotypes and reinforces existing inequalities.
  • Misinformation and propaganda: The ability of GPT-3 to generate coherent and context-aware text makes it susceptible to use for spreading misinformation and propaganda.
  • Responsibility and accountability: As GPT-3 and other language models become more widely used, questions arise about who is responsible for their actions and who should be held accountable for any harm they cause.
  • Privacy and security: GPT-3 and other language models process large amounts of sensitive data, raising questions about data privacy and security.
  • Job displacement: The increasing capabilities of GPT-3 and other language models raise concerns about their potential impact on employment and the displacement of workers in language-related jobs.
Prompt engineering is a concept in AI that refers to the process of designing and fine-tuning the input given to a generative AI system or language model to generate specific outputs. This is often done by selecting or modifying the prompt or initial text, or by using techniques such as constraint satisfaction, to guide the generation process towards a desired outcome. The goal of prompt engineering is to improve the quality and reliability of the generated output, and to increase the range of applications for generative AI systems. Prompt engineering is an important area of research and development in AI, as it can help to address some of the challenges associated with AI models, such as bias, incompleteness, and control. GPT-3 uses hard prompts (manually designed discrete language phrases) to generate outputs for different tasks.
  • ChatGPT capabilities include answering questions, generating text, and translating text. GPT-3 demonstrates remarkable text generation ability: given words, phrases, or simple sentences as a prefix, it can continue to generate text that is syntactically correct and semantically smooth, conditioned on the given text. ChatGPT was fine-tuned on top of GPT-3.5 using supervised learning as well as reinforcement learning. In addition, OpenAI continues to gather data from users that could be used to further train and fine-tune ChatGPT. Unlike most chatbots, ChatGPT remembers previous prompts given to it in the same conversation.
  • OpenAI plug-ins are tools designed specifically for language models that help ChatGPT access up-to-date information, run computations, or use third-party services. The plug-ins enable ChatGPT Plus to access live web data rather than only the information the large language model was already trained on. ChatGPT uses the Bing Application Programming Interface (API) to do searches and a text-based web browser to explore and interact with page results. It may combine data from different sources into a coherent response.
  • DALL-E is a deep learning model based on Markov chains trained using variational inference. The goal of latent diffusion models (LDM) is to learn the latent structure of a dataset by modeling the way in which data points diffuse through the latent space. A feature space or embedding space is an embedding of a set of items within a manifold in which items resembling each other are positioned closer to one another in the latent space. DALL-E's model is a multimodal implementation of GPT-3 trained on text-image pairs from the Internet to generate digital images from natural language descriptions (prompts), combining concepts, attributes, and styles. The model can generate imagery in multiple styles, including photorealistic imagery, paintings, and emoji. It can manipulate and rearrange objects in its images, and can correctly place design elements in novel compositions without explicit instruction.
  • Stable Diffusion is a deep learning text-to-image model primarily used to generate detailed images conditioned on text descriptions, though it can also be applied to other tasks such as inpainting, outpainting, and generating image-to-image translations guided by a text prompt.
Stable Diffusion is in a class of models called latent diffusion models. Latent diffusion models encode images and their text captions into vectors (basically a unique numerical representation for each image). During training time, the model adds random values (noise) to the vectors. And then you train a model to go from a slightly more noisy vector to a slightly less noisy vector. In other words, the model tries to reproduce the original numerical representation of every image in its training set, based on that image’s accompanying text caption. So a generated image wants to be similar to the images that most influenced it, by having a similar numerical representation. 
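A toy numerical sketch of the noising step described above, under stated assumptions: an 8-dimensional vector stands in for an encoded image latent, a single noise level is used, and a dummy prediction stands in for the denoiser network that a real latent diffusion model would train and condition on the caption embedding.

```python
import numpy as np

rng = np.random.default_rng(0)

z = rng.normal(size=8)     # toy latent vector: the encoded image
alpha = 0.7                # how much of the original signal survives at this noise level

noise = rng.normal(size=z.shape)
z_noisy = np.sqrt(alpha) * z + np.sqrt(1 - alpha) * noise   # forward (noising) step

# Training objective, conceptually: a denoiser network predicts the added noise
# from the noisy latent (and the caption embedding). The zero vector below is a
# dummy stand-in for denoiser(z_noisy, caption_embedding); it only shows the
# mean-squared-error loss that training would minimize.
predicted_noise = np.zeros_like(noise)
loss = np.mean((predicted_noise - noise) ** 2)
print(round(float(loss), 3))

# Generation runs the reverse direction: start from pure noise and repeatedly
# subtract the predicted noise, guided by the text prompt, until a clean latent
# remains, which the decoder turns back into an image.
```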
This guide shares strategies and tactics for getting better results from large language models like GPT-4. The methods described here can sometimes be deployed in combination for greater effect:
  1. Write clear instructions.  These models can’t read your mind. If outputs are too long, ask for brief replies. If outputs are too simple, ask for expert-level writing. If you dislike the format, demonstrate the format you’d like to see. The less the model has to guess at what you want, the more likely you’ll get it.
Tactic: Include details in your query to get more relevant answers. In order to get a highly relevant response, make sure that requests provide any important details or context. Otherwise you are leaving it up to the model to guess what you mean.
Tactic: Ask the model to adopt a persona. The system message can be used to specify the persona used by the model in its replies.
Tactic: Use delimiters to clearly indicate distinct parts of the input. Delimiters like triple quotation marks, XML tags, section titles, etc. can help demarcate sections of text to be treated differently.
Tactic: Specify the steps required to complete a task. Some tasks are best specified as a sequence of steps. Writing the steps out explicitly can make it easier for the model to follow them.
Tactic: Provide examples. Providing general instructions that apply to all examples is generally more efficient than demonstrating all permutations of a task by example, but in some cases providing examples may be easier. For example, if you intend for the model to copy a particular style of responding to user queries which is difficult to describe explicitly. This is known as "few-shot" prompting.
Tactic: Specify the desired length of the output. You can ask the model to produce outputs that are of a given target length. The targeted output length can be specified in terms of the count of words, sentences, paragraphs, bullet points, etc. Note however that instructing the model to generate a specific number of words does not work with high precision. The model can more reliably generate outputs with a specific number of paragraphs or bullet points.
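As an illustration of the persona and delimiter tactics, here is a minimal sketch using the OpenAI Python SDK (v1-style client). The model name, the system message, the article placeholder, and the presence of an OPENAI_API_KEY environment variable are assumptions, and error handling is omitted.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

article = "..."    # the text to be summarized, supplied by the caller

response = client.chat.completions.create(
    model="gpt-4",  # assumed model name; substitute whichever model you have access to
    messages=[
        # Persona: the system message fixes the role and tone of the replies.
        {"role": "system",
         "content": "You are a concise technical editor who answers in plain English."},
        # Delimiters: triple quotes mark off the text to be treated as data, not instructions.
        {"role": "user",
         "content": f'Summarize the article delimited by triple quotes in 3 bullet points.\n"""{article}"""'},
    ],
)
print(response.choices[0].message.content)
```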
  2. Provide reference text.  Language models can confidently invent fake answers, especially when asked about esoteric topics or for citations and URLs. In the same way that a sheet of notes can help a student do better on a test, providing reference text to these models can help in answering with fewer fabrications.
Tactic: Instruct the model to answer using a reference text. If we can provide a model with trusted information that is relevant to the current query, then we can instruct the model to use the provided information to compose its answer.
Tactic: Instruct the model to answer with citations from a reference text. If the input has been supplemented with relevant knowledge, it's straightforward to request that the model add citations to its answers by referencing passages from provided documents. Note that citations in the output can then be verified programmatically by string matching within the provided documents.
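The string-matching verification mentioned above can be sketched without any API call. The bracketed-quote citation format, the sample document, and the hand-written stand-in for a model answer are illustrative assumptions.

```python
import re

document = "The warranty covers battery and drivetrain defects for 8 years. It excludes tires."

# Hand-written stand-in for a model answer that cites passages in the assumed ["..."] format.
model_answer = (
    'The warranty covers battery and drivetrain defects for 8 years '
    '["The warranty covers battery and drivetrain defects for 8 years."] '
    'Tires are excluded ["It excludes tires."]'
)

def verify_citations(answer: str, source: str) -> dict:
    """Check that every bracketed, quoted citation occurs verbatim in the source text."""
    citations = re.findall(r'\["([^"]+)"\]', answer)
    return {c: c in source for c in citations}

print(verify_citations(model_answer, document))
# Both citations map to True because they appear verbatim in the provided document.
```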
  3. Split complex tasks into simpler subtasks.  Just as it is good practice in software engineering to decompose a complex system into a set of modular components, the same is true of tasks submitted to a language model. Complex tasks tend to have higher error rates than simpler tasks. Furthermore, complex tasks can often be re-defined as a workflow of simpler tasks in which the outputs of earlier tasks are used to construct the inputs to later tasks.
Tactic: Use intent classification to identify the most relevant instructions for a user query. For tasks in which lots of independent sets of instructions are needed to handle different cases, it can be beneficial to first classify the type of query and to use that classification to determine which instructions are needed. This can be achieved by defining fixed categories and hardcoding instructions that are relevant for handling tasks in a given category. This process can also be applied recursively to decompose a task into a sequence of stages. The advantage of this approach is that each query will contain only those instructions that are required to perform the next stage of a task, which can result in lower error rates compared to using a single query to perform the whole task. This can also result in lower costs since larger prompts cost more to run.
Tactic: For dialogue applications that require very long conversations, summarize or filter previous dialogue. Since models have a fixed context length, dialogue between a user and an assistant in which the entire conversation is included in the context window cannot continue indefinitely.
Tactic: Summarize long documents piecewise and construct a full summary recursively. Since models have a fixed context length, they cannot be used to summarize a text longer than the context length minus the length of the generated summary in a single query.
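A minimal sketch of piecewise, recursive summarization under stated assumptions: the summarize() helper below is a placeholder standing in for a single model call, and the character-based chunk size is arbitrary.

```python
def summarize(text: str) -> str:
    """Stand-in for one model call that returns a short summary of `text`."""
    # In practice this would call a chat/completions endpoint with a
    # "summarize the following text" instruction.
    return text[:60] + "..."          # placeholder behaviour for the sketch

def summarize_long_document(document: str, chunk_chars: int = 2000) -> str:
    # 1. Split the document into chunks that fit comfortably in the context window.
    chunks = [document[i:i + chunk_chars] for i in range(0, len(document), chunk_chars)]
    # 2. Summarize each chunk independently.
    partial = [summarize(c) for c in chunks]
    combined = "\n".join(partial)
    # 3. If the concatenated partial summaries are still too long, recurse on them;
    #    otherwise produce the final summary-of-summaries.
    if len(combined) > chunk_chars:
        return summarize_long_document(combined, chunk_chars)
    return summarize(combined)

print(summarize_long_document("some very long report text " * 500))
```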
  4. Give the model time to "think".  If asked to multiply 17 by 28, you might not know it instantly, but can still work it out with time. Similarly, models make more reasoning errors when trying to answer right away, rather than taking time to work out an answer. Asking for a "chain of thought" before an answer can help the model reason its way toward correct answers more reliably.
Tactic: Instruct the model to work out its own solution before rushing to a conclusion. Sometimes we get better results when we explicitly instruct the model to reason from first principles before coming to a conclusion. Suppose for example we want a model to evaluate a student's solution to a math problem. The most obvious way to approach this is to simply ask the model if the student's solution is correct or not.
Tactic: Use inner monologue or a sequence of queries to hide the model's reasoning process. The previous tactic demonstrates that it is sometimes important for the model to reason in detail about a problem before answering a specific question. For some applications, the reasoning process that a model uses to arrive at a final answer would be inappropriate to share with the user. For example, in tutoring applications we may want to encourage students to work out their own answers, but a model's reasoning process about the student's solution could reveal the answer to the student.
Tactic: Ask the model if it missed anything on previous passes. Suppose that we are using a model to list excerpts from a source which are relevant to a particular question. After listing each excerpt the model needs to determine if it should start writing another or if it should stop. If the source document is large, it is common for a model to stop too early and fail to list all relevant excerpts. In that case, better performance can often be obtained by prompting the model with followup queries to find any excerpts it missed on previous passes.
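A minimal sketch of the inner-monologue tactic: the prompt asks the model to keep its working inside an assumed triple-quote delimiter, and only the text outside the delimiter is shown to the student. The example model reply is a hand-written stand-in, not real model output.

```python
tutor_prompt = (
    "First work out your own solution to the problem. Enclose all of your working "
    'in triple quotes ("""). Then, outside the quotes, state only whether the '
    "student's answer is correct or incorrect, without revealing the solution."
)

# Hand-written stand-in for a model reply that follows the requested format.
model_reply = '"""2 apples at 3 euro each is 6 euro; plus 4 euro delivery gives 10 euro."""\nIncorrect.'

def visible_part(reply: str, delimiter: str = '"""') -> str:
    """Strip everything between the first and last delimiter (the hidden reasoning)."""
    first = reply.find(delimiter)
    last = reply.rfind(delimiter)
    if first == -1 or last == first:
        return reply                      # no hidden section found; show the reply as-is
    return (reply[:first] + reply[last + len(delimiter):]).strip()

print(visible_part(model_reply))          # prints only "Incorrect."
```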
  5. Use external tools.  Compensate for the weaknesses of the model by feeding it the outputs of other tools. For example, a text retrieval system (sometimes called RAG or retrieval augmented generation) can tell the model about relevant documents. A code execution engine like OpenAI's Code Interpreter can help the model do math and run code. If a task can be done more reliably or efficiently by a tool rather than by a language model, offload it to get the best of both.
Tactic: Use embeddings-based search to implement efficient knowledge retrieval. A model can leverage external sources of information if provided as part of its input. This can help the model to generate more informed and up-to-date responses. For example, if a user asks a question about a specific movie, it may be useful to add high quality information about the movie (e.g. actors, director, etc.) to the model's input. Embeddings can be used to implement efficient knowledge retrieval, so that relevant information can be added to the model input dynamically at run-time.
Tactic: Use code execution to perform more accurate calculations or call external APIs. Language models cannot be relied upon to perform arithmetic or long calculations accurately on their own. In cases where this is needed, a model can be instructed to write and run code instead of making its own calculations. In particular, a model can be instructed to put code that is meant to be run into a designated format such as triple backticks. After an output is produced, the code can be extracted and run. Finally, if necessary, the output from the code execution engine (i.e. a Python interpreter) can be provided as an input to the model for the next query.
Tactic: Give the model access to specific functions. The Chat Completions API allows passing a list of function descriptions in requests. This enables models to generate function arguments according to the provided schemas. Generated function arguments are returned by the API in JSON format and can be used to execute function calls. Output provided by function calls can then be fed back into a model in the following request to close the loop. This is the recommended way of using OpenAI models to call external functions.
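A minimal sketch of embeddings-based retrieval using the OpenAI embeddings endpoint. The embedding model name, the toy document set, and the single-query flow are assumptions; a production system would cache embeddings and use a vector store.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

documents = [
    "The movie was directed by Denis Villeneuve and released in 2021.",
    "Photosynthesis converts light energy into chemical energy in plants.",
    "The recipe calls for two cups of flour and one egg.",
]
query = "Who directed the film?"

def embed(texts):
    # Assumed embedding model name; any available embeddings model would work here.
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

doc_vecs = embed(documents)
q_vec = embed([query])[0]

# Cosine similarity between the query and every document.
sims = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
best = documents[int(np.argmax(sims))]

# The best-matching passage is then added to the model's input at run-time.
prompt = f'Answer using the reference text delimited by triple quotes.\n"""{best}"""\n\n{query}'
print(prompt)
```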
  1. Test changes systematically.  Improving performance is easier if you can measure it. In some cases a modification to a prompt will achieve better performance on a few isolated examples but lead to worse overall performance on a more representative set of examples. Therefore to be sure that a change is net positive to performance it may be necessary to define a comprehensive test suite (also known an as an "eval").
Tactic: Evaluate model outputs with reference to gold-standard answers.  Suppose it is known that the correct answer to a question should make reference to a specific set of known facts. Then we can use a model query to count how many of the required facts are included in the answer.
Prompt engineering is a concept in AI that refers to the process of designing and fine-tuning the input given to a generative AI system or language model to generate specific outputs. This is often done by selecting or modifying the prompt or initial text, or by using techniques such as constraint satisfaction, to guide the generation process towards a desired outcome. The goal of prompt engineering is to improve the quality and reliability of the generated output, and to increase the range of applications for generative AI systems. Prompt engineering is an important area of research and development in AI, as it can help to address some of the challenges associated with AI models, such as bias, incompleteness, and control. GPT-3 uses hard prompts (manually designed discrete language phrases) to generate for different tasks. 
  • ChatGPT capabilities include answering questions, generating text, and translating text. GPT-3 demonstrate remarkable text generation ability. Given words, phrases or simple sentences as prefix, they can continue to generate text that are syntactically correct and semantically smooth conditioning on the given text. ChatGPT was fine-tuned on top of GPT-3.5 using supervised learning as well as reinforcement learning. In addition, OpenAI continues to gather data from users that could be used to further train and fine-tune ChatGPT. Unlike most chatbots, ChatGPT remembers previous prompts given to it in the same conversation. 
  • OpenAI plug-ins are tools designed specifically for language models that help ChatGPT access up-to-date information, run computations, or use third-party services. The plug-ins enables ChatGPT Plus to access live web data versus information the large language model was already trained on. ChatGPT uses the Bing Application Programming Interface (API) to do searches and a text-based web browser to explore and interact with page results. It may combine data from different sources into a coherent response. 
  • DALL-E is deep learning model based on Markov chains trained using variational inference. The goal of latent diffusion models (LDM) is to learn the latent structure of a dataset by modeling the way in which data points diffuse through the latent space. Feature space or embedding space is an embedding of a set of items within a manifold in which items resembling each other are positioned closer to one another in the latent space. DALL-E's model is a multimodal implementation of GPT-3 trained on text-image pairs from the Internet to generate digital images from natural language descriptions (prompts) which combines concepts, attributes, and styles. The model can generate imagery in multiple styles, including photorealistic imagery, paintings, and emoji. It can manipulate and rearrange objects in its images, and can correctly place design elements in novel compositions without explicit instruction.
  • Stable Diffusion is a deep learning text-to-image model primarily used to generate detailed images conditioned on text descriptions, though it can also be applied to other tasks such as inpainting, outpainting, and generating image-to-image translations guided by a text prompt
Stable Diffusion is in a class of models called latent diffusion models. Latent diffusion models encode images and their text captions into vectors (basically a unique numerical representation for each image). During training time, the model adds random values (noise) to the vectors. And then you train a model to go from a slightly more noisy vector to a slightly less noisy vector. In other words, the model tries to reproduce the original numerical representation of every image in its training set, based on that image’s accompanying text caption. So a generated image wants to be similar to the images that most influenced it, by having a similar numerical representation. 
OpenAI Service provides large-scale language models LLM including the GPT-3, Codex and Embeddings model series. These models can be easily adapted to specific task including but not limited to content generation, summarization, semantic search, and natural language to code translation.
  • First, OpenAI charges for access to its platform and services. Companies pay for access to its proprietary technology and the ability to integrate their own data and algorithms into OpenAI's platform. 
  • Second, OpenAI makes money by licensing its technology to other organizations and companies. Over 300 applications are delivering GPT-3–powered search, conversation, text completion, and other advanced AI features through our API. 
The completions endpoint is the core component of the API service. This API provides access to the model's text-in, text-out interface. Users simply need to provide an input prompt containing the English text command, and the model will generate a text completion.The models used by Azure OpenAI use natural language instructions and examples provided during the generation call to identify the task being asked and skill required. When you use this approach, the first part of the prompt includes natural language instructions and/or examples of the specific task desired. The model then completes the task by predicting the most probable next piece of text. This technique is known as in-context learning. These models aren't retrained during this step but instead give predictions based on the context you include in the prompt.The service provides access to many different models, grouped by family and capability. A model family typically associates models by their intended task.
  • ChatGPT summarise and generate written content quickly and coherently for research, marketing and customer service scenarios. Capabilities include answering questions, generating text, and translating text. ChatGPT was fine-tuned on top of GPT-3.5 using supervised learning as well as reinforcement learning. In addition, OpenAI continues to gather data from users that could be used to further train and fine-tune ChatGPT. Unlike most chatbots, ChatGPT remembers previous prompts given to it in the same conversation.
  • GPT-4 LLM model is a large multimodal model that can accept both image and text inputs and generate text outputs. GPT-4 and GPT-3 differ significantly because GPT-4 includes more data than GPT-3. Compared to GPT-3's 17 gigabytes of data, GPT-4, the most recent iteration of OpenAI, has 45 gigabytes of training data. As a result, GPT-4 can deliver significantly more accurate results than GPT-3. The new GPT-4 model has a long-form mode that offers a context window of 32,000 tokens (52 pages of text). That's more than an order of magnitude larger than the previous GPT-3 API that offered only 2,049 tokens (three pages of text).
  • ClaudeAI context window was about 9,000 tokens, but the company has now increased it to 100,000 tokens (75,000 words). An average human can read 100,000 tokens in about five hours' time. However, this is only the time taken to read the tokens, and more time might be needed if one has to remember and analyze this information. OpenAI's GPT-4 LLM has a context window of 4,096 tokens (~3,000 words) when used with ChatGPT, but this can increase to 32,768 tokens while using GPT-4 API. 
  • Codex LLM model generate code and documentation to help developers build apps faster. GitHub Copilot is a cloud-based artificial intelligence tool developed by GitHub on top of OpenAI Codex, a system that translates natural language to code make recommendations in different programming languages based on the appropriate prompts , to assist users of integrated development environments (IDEs) by autocompleting code.
  • Embeddings LLM model embeddings measure the relatedness of text strings commonly used for: search (where results are ranked by relevance to a query string), clustering (where text strings are grouped by similarity), recommendations (where items with related text strings are recommended), anomaly detection (where outliers with little relatedness are identified), diversity measurement (where similarity distributions are analyzed), classification (where text strings are classified by their most similar label).
  • MLLM model with multimodal perception will be better equipped to acquire commonsense knowledge beyond the information they glean from text alone; and that this perception enrichment will facilitate LLM applications in new domains such as robotics and document intelligence. Multimodal perception also has the benefit of unifying multiple APIs to form a single general graphical user interface. Microsoft Introduces Kosmos-1 that can perceive general modalities, follow instructions, and perform In-iontext learningKOSMOS-1 follows the MetaLM training process, where a transformer-based LLM acts as a general-purpose interface and is augmented with various perception modules. Consistent with the MetaLM philosophy, the team treats language models as a universal task layer, enabling KOSMOS-1 to unify various task predictions as texts and capably handle natural-language instructions and action sequences. Given a previous context, KOSMOS-1 learns to generate texts in an autoregressive manner. All non-text input modalities are embedded and then fed into its backbone transformer-based causal language model, with the transformer decoder serving as a general-purpose interface for all modalities. By interacting with natural language and the other modalities, KOSMOS-1 naturally inherits the capabilities of in-context learning and instruction following; and can thus handle both language and perception-intensive tasks.
  • MS 365 Copilot combines the power of large language models LLMs with your data in the Microsoft Graph and the Microsoft 365 apps to turn your words into the powerful productivity tool.

Model GPT-4

---ARCH---
  • GPT-4 is more than 10x the size of GPT-3 (175 B). We believe it has a total of ~1.8 trillion parameters across 120 layers. Mixture Of Experts (16 experts, each ~111B). Not a dense transformer like e.g. PaLM (or GPT-3). They use MQA instead of MHA (classic at this point).
  • Each forward pass (generation of 1 token) only utilizes ~280B parameters and ~560 TFLOPs. This contrasts with the ~1.8 trillion parameters and ~3,700 TFLOP that would be required per forward pass of a purely dense model.

---DISTRIBUTED---
  • To parallelize across all their A100s GPUs They utilized 8-way tensor parallelism. Beyond that, they are using 15-way pipeline parallelism. Also apparently they used DeepSpeed ZeRo Stage 1 or block-level FSDP.
  • (You can check out my video on all of these strategies here: https://lnkd.in/e8ammDs5 3D parallelism is what you're looking for & ZeRo)

---VISION---
  • They have a separate vision encoder from the text encoder, with cross-attention. The architecture is similar to Google DeepMind's Flamingo (I used to work on this project :) ). This adds more parameters on top of the 1.8T of GPT-4. It is fine-tuned with another ~2 trillion tokens, after the text-only pre-training.

---DATA---
  • Trained on ~13T tokens (multiple epochs, not unique). Plus millions of rows of instruction fine-tuning data from ScaleAI & internally (I guess acquired through ChatGPT + their API before they changed the policy).
  • 8k context length for the pre-training phase. The 32k seqlen version of GPT-4 is based on fine-tuning of the 8k after the pre-training. See e.g. MosaicML's blog on how to achieve this: https://lnkd.in/dagmDAZb)

---COST---
  • OpenAI’s training FLOPS for GPT-4 is ~2.15e25, on ~25,000 A100s for 90 to 100 days at about 32% to 36% MFU. Part of this extremely low utilization is due to an absurd number of failures requiring checkpoints that needed to be restarted from.
  • If their cost in the cloud was about $1 per A100 hour, the training costs for this run alone would be about $63 million.
  • Today, the pre-training could be done with ~8,192 H100 in ~55 days for $21.5 million at $2 per H100 hour

---INFERENCE---
  • OpenAI might be using speculative decoding on GPT-4's inference. See this paper: https://lnkd.in/d-C-QpwW
  • The inference runs on a cluster of 128 GPUs. There are multiple of these clusters in multiple datacenters in different locations (it'll be hard for Elizier to nuke these xD). 8-way tensor parallelism and 16-way pipeline parallelism.
Models in StableDiffusion 
  • Stable Diffusion is a deep learning text-to-image model primarily used to generate detailed images conditioned on text descriptions, though it can also be applied to other tasks such as inpainting, outpainting, and generating image-to-image translations guided by a text prompt
  • Models (checkpoint files) are pre-trained Stable Diffusion weights intended for generating general or a particular genre of images. What images a model can generate depends on the data used to train them. A model won't be able to generate a cat's image if there's never a cat in the training data. HuggingFace is the go-to platform for creators who build AI models for Stable Diffusion. There are currently over 270 models related to Stable Diffusion on the platform. You can use Stable Diffusion online through DreamStudio, without any kind of hardware requirements at all.
  • Stable Diffusion is in a class of models called latent diffusion models. Latent diffusion models encode images and their text captions into vectors (basically a unique numerical representation for each image). During training time, the model adds random values (noise) to the vectors. And then you train a model to go from a slightly more noisy vector to a slightly less noisy vector. In other words, the model tries to reproduce the original numerical representation of every image in its training set, based on that image’s accompanying text caption. So a generated image wants to be similar to the images that most influenced it, by having a similar numerical representation.
  • Stable Diffusion consists of 3 parts: the variational autoencoder (VAE), U-Net, and an optional text encoder. The VAE encoder compresses the image from pixel space to a smaller dimensional latent space, capturing a more fundamental semantic meaning of the image. Gaussian noise is iteratively applied to the compressed latent representation during forward diffusion. The U-Net block, composed of a ResNet backbone, denoises the output from forward diffusion backwards to obtain a latent representation. Finally, the VAE decoder generates the final image by converting the representation back into pixel space. Stable Diffusion makes its source code available https://github.com/CompVis/stable-diffusion
  • Stable Diffusion was trained on pairs of images and captions taken from LAION-5B, a publicly available dataset derived from Common Crawl data scraped from the web, where 5 billion image-text pairs were classified based on language and filtered into separate datasets by resolution, a predicted likelihood of containing a watermark, and predicted "aesthetic" score (e.g. subjective visual quality).
  • To address the limitations of the model's initial training, end-users may opt to implement additional training to fine-tune generation outputs to match more specific use-cases. As visual styles and compositions are not subject to copyright, it is often interpreted that users of Stable Diffusion who generate images of artworks should not be considered to be infringing upon the copyright of visually similar works. However, individuals depicted in generated images may be protected by personality rights if their likeness is used,[48] and intellectual property such as recognizable brand logos still remain protected by copyright. 

New progress on key tech for making AR

Meta has introduced the Segment Anything Model (SAM), which aims to set a new bar for computer-vision-based object segmentation —the ability for computers to understand the difference between individual objects in an image or video. Segmentation will be key for making AR genuinely useful by enabling a comprehensive understanding of the world around the user.Object segmentation is the process of identifying and separating objects in an image or video. With the help of AI, this process can be automated, making it possible to identify and isolate objects in real-time. This technology will be critical for creating a more useful AR experience by giving the system an awareness of various objects in the world around the user.SAM is both a segmentation model and a massive set of training images the company is releasing for others to build upon. The project aims to reduce the need for task-specific modeling expertise. SAM is a general segmentation model that can identify any object in any image or video, even for objects and image types that it didn’t see during training. SAM allows for both automatic and interactive segmentation, allowing it to identify individual objects in a scene with simple inputs from the user. SAM can be ‘prompted’ with clicks, boxes, and other prompts, giving users control over what the system is attempting to identifying at any given moment.

ImageBind: Holistic AI learning across six modalities

When humans absorb information from the world, we innately use multiple senses, such as seeing a busy street and hearing the sounds of car engines. Today, we’re introducing an approach that brings machines one step closer to humans’ ability to learn simultaneously, holistically, and directly from many different forms of information — without the need for explicit supervision (the process of organizing and labeling raw data). Meta have built and are open-sourcing ImageBind, the first AI model capable of binding information from six modalities. The model learns a single embedding, or shared representation space, not just for text, image/video, and audio, but also for sensors that record depth (3D), thermal (infrared radiation), and inertial measurement units (IMU), which calculate motion and position. ImageBind equips machines with a holistic understanding that connects objects in a photo with how they will sound, their 3D shape, how warm or cold they are, and how they move.ImageBind can outperform prior specialist models trained individually for one particular modality. But most important, it helps advance AI by enabling machines to better analyze many different forms of information together. For example, using ImageBind, Meta’s Make-A-Scene could create images from audio, such as creating an image based on the sounds of a rain forest or a bustling market. Other future possibilities include more accurate ways to recognize, connect, and moderate content, and to boost creative design, such as generating richer media more seamlessly and creating wider multimodal search functions.

Glossary


  • Artificial intelligence (AI) is the ability of software to perform tasks that traditionally require human intelligence.
  • Artificial neural networks (ANNs) are composed of interconnected layers of software-based calculators known as “neurons.” These networks can absorb vast amounts of input data and process that data through multiple layers that extract and learn the data’s features.
  • Deep learning is a subset of machine learning that uses deep neural networks, which are layers of connected “neurons” whose connections have parameters or weights that can be trained. It is especially effective at learning from unstructured data such as images, text, and audio.
  • Early and late scenarios are the extreme scenarios of our work-automation model. The “earliest” scenario flexes all parameters to the extremes of plausible assumptions, resulting in faster automation development and adoption, and the “latest” scenario flexes all parameters in the opposite direction. The reality is likely to fall somewhere between the two.
  • Fine-tuning is the process of adapting a pretrained foundation model to perform better in a specific task. This entails a relatively short period of training on a labeled data set, which is much smaller than the data set the model was initially trained on. This additional training allows the model to learn and adapt to the nuances, terminology, and specific patterns found in the smaller data set.
  • Foundation models (FM) are deep learning models trained on vast quantities of unstructured, unlabeled data that can be used for a wide range of tasks out of the box or adapted to specific tasks through fine-tuning. Examples of these models are GPT-4, PaLM, DALL·E 2, and Stable Diffusion.
  • Generative AI is AI that is typically built using foundation models and has capabilities that earlier AI did not have, such as the ability to generate content. Foundation models can also be used for nongenerative purposes (for example, classifying user sentiment as negative or positive based on call transcripts) while offering significant improvement over earlier models. For simplicity, when we refer to generative AI in this article, we include all foundation model use cases.
  • Graphics processing units (GPUs) are computer chips that were originally developed for producing computer graphics (such as for video games) and are also useful for deep learning applications. In contrast, traditional machine learning and other analyses usually run on central processing units (CPUs), normally referred to as a computer’s “processor.”  Unfortunately, many explanations and books about programming GPUs (NVIDIA, AMD) mix up techniques suitable for CPUs, such as spinlocks, mutexes, traditional locks, and thread sleeping/wake-up mechanisms, which are totally at odds with the way GPUs should be programmed. The reason for this, as many tech-bros around here know, lies in how CPU and GPU architectures differ. CPU architecture is built on time-slicing and MIMD instructions (Multiple Instruction, Multiple Data), while GPU groups its super-stellar type of instruction, SIMD (Single Instruction, Multiple Data ), into warps (NVIDIA-like) or wavefronts (AMD-like). From an easy abstract algebra point of view, the reason that GPU and CPU screw up each other lies in the impossible perfect mapping (morphisms) of CPU code into GPU code. Why? Take the easy explanation for category theory: CPU deals with objects (processing units) and arrows (operations) on different data streams in MIMD. Each processing unit (object) can execute different operations (arrows) on independent data streams. So, you have here, instead of a massive amount of threads (GPU), a massive amount of independent categories. From this mathematical perspective, a CPU is nothing more than a lot of independent operations, each with a diverse composition of morphisms, generating a vast number of independent execution paths (and saying 'a lot' is still an understatement).  In contrast, GPU with its SIMD involves only a single operation (arrow in category terms) applied simultaneously across multiple data elements (in category terms: objects). This can be seen as a single category where the same morphism is applied across various objects in parallel (for the math-savvy tech-bros, in reality, we have here an endomorphism operating in the same  category, kinda casting the same object into different types in traditional checked-type programming). Ok, now, after this brief excursion into the realm of category theory, we are ready to understand the core of the problem of why Parallelism used in CPU sucks when transferred to GPU: MIMD architectures support complex, branching operations where each processor can follow a different execution path. Translating this to SIMD would require aligning these diverse paths into a single, unified operation ( kinda a catamorphism from the general morphism of MIMD to the endomorphism within SIMD, not a trivial task, I tell you). The morphisms with MIMD allow processors to operate independently without waiting for others. SIMD requires data and operation synchronization across all processing elements, introducing complexity in aligning independent MIMD operations into a SIMD framework. And thus, my tech-bros, we lose the flexibility and independence in MIMD lost in SIMD.
  • Large language models (LLMs) make up a class of foundation models that can process massive amounts of unstructured text and learn the relationships between words or portions of words, known as tokens. This enables LLMs to generate natural-language text, performing tasks such as summarization or knowledge extraction. GPT-4 (which underlies ChatGPT) and LaMDA (the model behind Bard) are examples of LLMs.
  • Machine learning (ML) is a subset of AI in which a model gains capabilities after it is trained on, or shown, many example data points. Machine learning algorithms detect patterns and learn how to make predictions and recommendations by processing data and experiences, rather than by receiving explicit programming instruction. The algorithms also adapt and can become more effective in response to new data and experiences.
  • Modality is a high-level data category such as numbers, text, images, video, and audio.
  • Prompt engineering refers to the process of designing, refining, and optimizing input prompts to guide a generative AI model toward producing desired (that is, accurate) outputs.
  • Self-attention, sometimes called intra-attention, is a mechanism that aims to mimic cognitive attention, relating different positions of a single sequence to compute a representation of the sequence.
  • Structured data are tabular data (for example, organized in tables, databases, or spreadsheets) that can be used to train some machine learning models effectively.
  • Transformers are a relatively new neural network architecture that relies on self-attention mechanisms to transform a sequence of inputs into a sequence of outputs while focusing its attention on important parts of the context around the inputs. Transformers do not rely on convolutions or recurrent neural networks.
  • Technical automation potential refers to the share of the worktime that could be automated. We assessed the technical potential for automation across the global economy through an analysis of the component activities of each occupation. We used databases published by institutions including the World Bank and the US Bureau of Labor Statistics to break down about 850 occupations into approximately 2,100 activities, and we determined the performance capabilities needed for each activity based on how humans currently perform them.
  • Use cases are targeted applications to a specific business challenge that produces one or more measurable outcomes. For example, in marketing, generative AI could be used to generate creative content such as personalized emails.
  • Unstructured data lack a consistent format or structure (for example, text, images, and audio files) and typically require more advanced techniques to extract insights.
Foundation models (FM)AI is undergoing a paradigm shift with the rise of models trained on broad data (generally using self-supervision at scale) that can be adapted to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character.A foundation model is any model that is trained on broad data (generally using self-supervision at scale) that can be adapted (e.g., fine-tuned) to a wide range of downstream tasks (current examples include BERT, GPT-3, CLIP). From a technological point of view, foundation models are not new — they are based on deep neural networks and self-supervised learning, both of which have existed for decades. However, the sheer scale and scope of foundation models from the last few years have stretched our imagination of what is possible. At the same time, existing foundation models have the potential to accentuate harms, and their characteristics are in general poorly understood.The significance of foundation models can be summarized by two words: emergence and homogenization. Emergence means that the behavior of a system is implicitly induced rather than explicitly constructed; it is both the source of scientific excitement and anxiety about unanticipated consequences. Homogenization indicates the consolidation of methodologies for building machine learning systems across a wide range of applications; it provides strong leverage towards many tasks but also creates single points of failure. To better appreciate emergence and homogenization, let us reflect on their rise in AI research over the last 30 years.The story of AI has been one of increasing emergence and homogenization. With the introduction of machine learning, how a task is performed emerges (is inferred automatically) from examples; with deep learning, the high-level features used for prediction emerge; and with foundation models, even advanced functionalities such as in-context learning emerge. At the same time, machine learning homogenizes learning algorithms (e.g., logistic regression), deep learning homogenizes model architectures (e.g., Convolutional Neural Networks), and foundation models homogenizes the model itself (e.g., GPT-3).
  • Machine learning. Most AI systems today are powered by machine learning, where predictive models are trained on historical data and used to make future predictions. The rise of machine learning within AI started in the 1990s, representing a marked shift from the way AI systems were built previously: rather than specifying how to solve a task, a learning algorithm would induce it based on data — i.e., the how emerges from the dynamics of learning. Machine learning also represented a step towards homogenization: a wide range of applications could now be powered by a single generic learning algorithm such as logistic regression. Despite the ubiquity of machine learning within AI, semantically complex tasks in natural language processing (NLP) and computer vision such as question answering or object recognition, where the inputs are sentences or images, still required domain experts to perform “feature engineering” — that is, writing domain-specific logic to convert raw data into higher-level features (e.g., SIFT) in computer vision) that were more suitable for popular machine learning methods.
  • Deep learning. Around 2010, a revival of deep neural networks under the moniker of deep learning [LeCun et al. 2015] started gaining traction in the field of machine learning. Deep learning was fueled by larger datasets, more computation (notably, the availability of GPUs), and greater audacity. Deep neural networks would be trained on the raw inputs (e.g., pixels), and higher-level features would emerge through training (a process dubbed “representation learning”). This led to massive performance gains on standard benchmarks, for example, in the seminal work of AlexNet [Krizhevsky et al. 2012] on the ImageNet dataset [Deng et al. 2009]. Deep learning also reflected a further shift towards homogenization: rather than having bespoke feature engineering pipelines for each application, the same deep neural network architecture could be used for many applications.
  • Foundation models. Foundation models have taken shape most strongly in NLP, so we focus our story there for the moment. That said, much as deep learning was popularized in computer vision but exists beyond it, we understand foundation models as a general paradigm of AI, rather than specific to NLP in any way. By the end of 2018, the field of NLP was about to undergo another seismic change, marking the beginning of the era of foundation models. On a technical level, foundation models are enabled by transfer learning [Thrun 1998] and scale. The idea of transfer learning is to take the “knowledge” learned from one task (e.g., object recognition in images) and apply it to another task (e.g., activity recognition in videos). Within deep learning, pretraining is the dominant approach to transfer learning: a model is trained on a surrogate task (often just as a means to an end) and then adapted to the downstream task of interest via fine-tuning. Transfer learning is what makes foundation models possible, but scale is what makes but scale is what makes them powerful. Scale required three ingredients: 
  1. improvements in computer hardware — e.g., GPU throughput and memory have increased 10× over the last four years; 
  2. the development of the Transformer model architecture [Vaswani et al. 2017] that leverages the parallelism of the hardware to train much more expressive models than before; and 
  3. the availability of much more training data. 
The importance of the availability of data and the ability to harness it cannot be underestimated. Transfer learning with annotated datasets has been common practice for at least a decade, for example, pretraining on the ImageNet dataset [Deng et al. 2009] for image classification in the computer vision community. However, the non-trivial cost of annotation imposes a practical limit on the benefits of pretraining. In self-supervised learning on the other hand, the pretraining task is derived automatically from unannotated data. For example, the masked language modeling task used to train BERT is to predict a missing word in a sentence given its surrounding context. Self-supervised tasks are not only more scalable, only depending on unlabeled data, but they are designed to force the model to predict parts of the inputs, making them richer and potentially more useful than models trained on a more limited label space. Foundation models have also led to surprising emergence which results from scale. For example, GPT-3, with 175 billion parameters compared to GPT-2’s 1.5 billion, permits in-context learning, in which the language model can be adapted to a downstream task simply by providing it with a prompt (a natural language description of the task), an emergent property that was neither specifically trained for nor anticipated to arise.
Foundation models are scientifically interesting due to their impressive performance and capabilities, but what makes them critical to study is the fact that they are quickly being integrated into realworld deployments of AI systems with far-reaching consequences on people. Before reasoning about the social impact of foundation models, it is important to understand that they are part of a broader ecosystem that stretches from data creation to deployment. At both ends, we highlight the role of people as the ultimate source of data into training of a foundation model, but also as the downstream recipients of any benefits and harms. Thoughtful data curation and adaptation should be part of the responsible development of any AI system. Finally, note that the deployment of adapted foundation models is a decision separate from their construction, which could be for research.
Foundation models have demonstrated raw potential, but we are still in the early days. Despite the impending widespread deployment of foundation models, we currently lack a clear understanding of how they work, when they fail, and what they are even capable of due to their emergent properties.Despite their deployment into the real world, these models are very much research prototypes that are poorly understood. Even the professional norms — what Robert Merton calls the ethos of science [Merton 1979] — around foundation models are underdeveloped. For example, there is lack of agreement on basic questions such as when models are “safe” to release or how the community should react in response to methodological misconduct. Given that the future of foundation models is thus filled with uncertainty, a big question is: who will determine this future?
There are tremendous economic incentives to push the capabilities and scale of foundation models, so we anticipate steady technological progress over the coming years. But the suitability of a technology relying largely on emergent behavior for widespread deployment to people is unclear. What is clear that we need to be cautious, and that now is the time to establish the professional norms that will enable the responsible research and deployment of foundation models. Academia and industry need to collaborate on this: industry ultimately makes concrete decisions about how foundation models will be deployed, but we should also lean on academia, with its disciplinary diversity and non-commercial incentives around knowledge production and social benefit, to provide distinctive guidance on the development and deployment of foundation models that is both technically and ethically grounded.

EU AI Act

In April 2021, the European Commission proposed the first EU regulatory framework for AI, which uses a tiered structure based on risks. It says that AI systems that can be used in different applications are analysed and classified according to the risk they pose to users. The different risk levels will mean more or less regulation.
  • AI applications that pose an “unacceptable risk” would be banned; high-risk applications in such fields as finance, the justice system, and medicine would be subject to strict oversight. 14 June 2023, passed its draft of this law—an important step, but only a step, in the process. Parliament and Council of the European Union, have been proposing amendments to the Act since its 2021 inception. Three-way negotiations over the amendments will begin in July, with hopes of reaching an agreement on a final text by the end of 2023. If the legislation follows a typical timeline, the law will take effect two years later. In the meantime, European officials have suggested that companies worldwide could sign on to a voluntary AI code of conduct, drafting in July 2023 set of rules outlining the norms, rules, and responsibilities or proper practices of an individual party or an organization. 
  • In the US, in October 2022, the White House issued a nonbinding Blueprint for an AI Bill of Rights, which framed AI governance as a civil rights issue, stating that citizens should be protected from algorithmic discrimination, privacy intrusion, and other harms. Blueprint suggests a civil rights approach in hopes of creating flexible rules that could keep up with fast-changing technologies. In April 2023 that he was circulating the draft of a “high level framework” for AI regulations. 
  • Chinese regulations have already been put in force, starting with rules for recommendation algorithms that went into effect in March 2022, requiring transparency from the service providers and a way for citizens to opt out. Next, in January 2023, the Chinese government issued early rules governing generative AI, and further draft rules were proposed in April 2023. China’s initial set of rules for generative AI required websites to label AI-generated content, banned the production of fake news, and required companies to register their algorithms and disclose information about training data and performance. The draft rules go even further, requiring that AI companies verify the veracity of all the data used to train their models. 

In June 19, 2023, parliaments negotiating position on the AI Act. The act vote passed with an overwhelming majority, but the final version is likely to look a bit different.It was a big week in tech policy in Europe with the European Parliament’s vote to approve its draft rules for the AI Act. The AI Act vote passed with an overwhelming majority, and has been heralded as one of the world’s most important developments in AI regulation. The European system is a bit complicated. Next, members of the European Parliament will have to thrash out details with the Council of the European Union and the EU’s executive arm, the European Commission, before the draft rules become legislation. The final legislation will be a compromise between three different drafts from the three institutions, which vary a lot. It will likely take around two years before the laws are actually implemented.Structured similarly to the EU’s Digital Services Act, a legal framework for online platforms, the AI Act takes a “risk-based approach” by introducing restrictions based on how dangerous lawmakers predict an AI application could be. Businesses will also have to submit their own risk assessments about their use of AI. Some applications of AI will be banned entirely if lawmakers consider the risk “unacceptable,” while technologies deemed “high risk” will have new limitations on their use and requirements around transparency.  Here are some of the major implications:
  • Ban on emotion-recognition AI. The European Parliament’s draft text bans the use of AI that attempts to recognize people’s emotions in policing, schools, and workplaces. Makers of emotion-recognition software claim that AI is able to determine when a student is not understanding certain material, or when a driver of a car might be falling asleep. The use of AI to conduct facial detection and analysis has been criticized for inaccuracy and bias, but it has not been banned in the draft text from the other two institutions, suggesting there’s a political fight to come.
  • Ban on real-time biometrics and predictive policing in public spaces. This will be a major legislative battle, because the various EU bodies will have to sort out whether, and how, the ban is enforced in law. Policing groups are not in favor of a ban on real-time biometric technologies, which they say are necessary for modern policing. Some countries, like France, are actually planning to increase their use of facial recognition.
  • Ban on social scoring. Social scoring by public agencies, or the practice of using data about people's social behavior to make generalizations and profiles, would be outlawed. That said, the outlook on social scoring, commonly associated with China and other authoritarian governments, isn’t really as simple as it may seem. The practice of using social behavior data to evaluate people is common in doling out mortgages and setting insurance rates, as well as in hiring and advertising. 
  • New restrictions for gen AI. This draft is the first to propose ways to regulate generative AI, and ban the use of any copyrighted material in the training set of large language models like OpenAI’s GPT-4. OpenAI has already come under the scrutiny of European lawmakers for concerns about data privacy and copyright. The draft bill also requires that AI generated content be labeled as such. That said, the European Parliament now has to sell its policy to the European Commission and individual countries, which are likely to face lobbying pressure from the tech industry.
  • New restrictions on recommendation algorithms on social media. The new draft assigns recommender systems to a “high risk” category, which is an escalation from the other proposed bills. This means that if it passes, recommender systems on social media platforms will be subject to much more scrutiny about how they work, and tech companies could be more liable for the impact of user-generated content.

In Dec. 2023, EU Council and Parliament strike a deal on the AI ActFollowing 3-day ‘marathon’ talks, the Council presidency and the European Parliament’s negotiators have reached a provisional agreement on the proposal on harmonised rules on artificial intelligence (AI), the so-called artificial intelligence act. The draft regulation aims to ensure that AI systems placed on the European market and used in the EU are safe and respect fundamental rights and EU values. This landmark proposal also aims to stimulate investment and innovation on AI in Europe. The European Parliament will vote on the AI Act proposals early next year, but any legislation will not take effect until at least 2025.
In January 21,  2024. the final text was circulated to EU Member StatesIt is rapidly progressing towards a vote by COREPER on 2 February, after a discussion in the Telecom Working Party. 
AI definitionWe now have a final definition of an AI system, which is a machine-based system designed to operate with varying levels of autonomy and that may exhibit adaptiveness after deployment and that, for explicit or implicit objectives, infers, from the input it receives, how to generate outputs such as content, predictions, recommendations, or decisions that can influence physical or virtual environments.TimelinesOnce entered into force, the AI Act will apply to prohibited systems from six months, of the commencement date, to GPAI from 12 months, and high-risk AI from 36 months. Codes of practice must be ready at the latest by nine months from when the AI Act entered into force.ConclusionIn conclusion, the AI Act marks a significant shift in the regulatory environment for organisations involved in AI and IP law, particularly within the Irish and EU contexts. The AI Act necessitates a proactive approach from organisations to ensure ethical compliance, especially in light of prohibitions on certain AI practices like manipulative methods and exploiting vulnerabilities. The careful assessment of high-risk AI systems and general-purpose AI models is essential, not only for compliance but also for shaping product development strategies and managing systemic risks.The advent of sophisticated deep fakes and the requirement for transparency in AI-generated content bring new challenges, particularly for the media and entertainment sectors. Human oversight becomes increasingly crucial in ensuring accountability and reliability of high-risk AI systems, requiring a significant investment in training and expertise. Additionally, organisations need to maintain transparent communication about AI deployment in workplaces, adhering to both European Union and national laws to build trust and avert legal disputes.Simplified technical documentation for SMEs and start-ups does not lessen the importance of accurate compliance, where advisory support can be highly beneficial. Robust data governance practices are imperative in maintaining data integrity and trust in AI systems. Organisations must also be vigilant about the substantial fines for non-compliance and stay updated on evolving standards to develop comprehensive compliance strategies.Preparing for the AI Act’s timelines is critical for strategic planning, with early efforts in compliance recommended, especially regarding prohibited systems and general-purpose AI models. Overall, the AI Act presents multifaceted challenges and opportunities, demanding careful navigation and strategic positioning for organisations to thrive in this new, AI-driven regulatory landscape.
AI ActThe AI act is a flagship legislative initiative with the potential to foster the development and uptake of safe and trustworthy AI across the EU’s single market by both private and public actors. The main idea is to regulate AI based on the latter’s capacity to cause harm to society following a ‘risk-based’ approach: the higher the risk, the stricter the rules. As the first legislative proposal of its kind in the world, it can set a global standard for AI regulation in other jurisdictions, just as the GDPR has done, thus promoting the European approach to tech regulation in the world stage.The main elements of the provisional agreementCompared to the initial Commission proposal, the main new elements of the provisional agreement can be summarised as follows:
  • rules on high-impact general-purpose AI models that can cause systemic risk in the future, as well as on high-risk AI systems
  • a revised system of governance with some enforcement powers at EU level
  • extension of the list of prohibitions but with the possibility to use remote biometric identification by law enforcement authorities in public spaces, subject to safeguards
  • better protection of rights through the obligation for deployers of high-risk AI systems to conduct a fundamental rights impact assessment prior to putting an AI system into use.

In more concrete terms, the provisional agreement covers the following aspects:Definitions and scopeTo ensure that the definition of an AI system provides sufficiently clear criteria for distinguishing AI from simpler software systems, the compromise agreement aligns the definition with the approach proposed by the OECD.The provisional agreement also clarifies that the regulation does not apply to areas outside the scope of EU law and should not, in any case, affect member states’ competences in national security or any entity entrusted with tasks in this area. Furthermore, the AI act will not apply to systems which are used exclusively for military or defence purposes. Similarly, the agreement provides that the regulation would not apply to AI systems used for the sole purpose of research and innovation, or for people using AI for non-professional reasons. Classification of AI systems as high-risk and prohibited AI practicesThe compromise agreement provides for a horizontal layer of protection, including a high-risk classification, to ensure that AI systems that are not likely to cause serious fundamental rights violations or other significant risks are not captured. AI systems presenting only limited risk would be subject to very light transparency obligations, for example disclosing that the content was AI-generated so users can make informed decisions on further use.A wide range of high-risk AI systems would be authorised, but subject to a set of requirements and obligations to gain access to the EU market. These requirements have been clarified and adjusted by the co-legislators in such a way that they are more technically feasible and less burdensome for stakeholders to comply with, for example as regards the quality of data, or in relation to the technical documentation that should be drawn up by SMEs to demonstrate that their high-risk AI systems comply with the requirements.Since AI systems are developed and distributed through complex value chains, the compromise agreement includes changes clarifying the allocation of responsibilities and roles of the various actors in those chains, in particular providers and users of AI systems. It also clarifies the relationship between responsibilities under the AI Act and responsibilities that already exist under other legislation, such as the relevant EU data protection or sectorial legislation.For some uses of AI, risk is deemed unacceptable and, therefore, these systems will be banned from the EU. The provisional agreement bans, for example, cognitive behavioural manipulation, the untargeted scrapping of facial images from the internet or CCTV footage, emotion recognition in the workplace and educational institutions, social scoring, biometric categorisation to infer sensitive data, such as sexual orientation or religious beliefs, and some cases of predictive policing for individuals.General purpose AI systems and foundation modelsNew provisions have been added to take into account situations where AI systems can be used for many different purposes (general purpose AI), and where general-purpose AI technology is subsequently integrated into another high-risk system. The provisional agreement also addresses the specific cases of general-purpose AI (GPAI) systems.Specific rules have been also agreed for foundation models, large systems capable to competently perform a wide range of distinctive tasks, such as generating video, text, images, conversing in lateral language, computing, or generating computer code. 
The provisional agreement provides that foundation models must comply with specific transparency obligations before they are placed in the market. A stricter regime was introduced for ‘high impact’ foundation models. These are foundation models trained with large amount of data and with advanced complexity, capabilities, and performance well above the average, which can disseminate systemic risks along the value chain.Transparency and protection of fundamental rightsThe provisional agreement provides for a fundamental rights impact assessment before a high-risk AI system is put in the market by its deployers. The provisional agreement also provides for increased transparency regarding the use of high-risk AI systems. Notably, some provisions of the Commission proposal have been amended to indicate that certain users of a high-risk AI system that are public entities will also be obliged to register in the EU database for high-risk AI systems.  Moreover, newly added provisions put emphasis on an obligation for users of an emotion recognition system to inform natural persons when they are being exposed to such a system.Measures in support of innovationWith a view to creating a legal framework that is more innovation-friendly and to promoting evidence-based regulatory learning, the provisions concerning measures in support of innovation have been substantially modified compared to the Commission proposal.Notably, it has been clarified that AI regulatory sandboxes, which are supposed to establish a controlled environment for the development, testing and validation of innovative AI systems, should also allow for testing of innovative AI systems in real world conditions. Furthermore, new provisions have been added allowing testing of AI systems in real world conditions, under specific conditions and safeguards. To alleviate the administrative burden for smaller companies, the provisional agreement includes a list of actions to be undertaken to support such operators and provides for some limited and clearly specified derogations. Entry into forceThe provisional agreement provides that the AI act should apply two years after its entry into force, with some exceptions for specific provisions.Next stepsFollowing today’s provisional agreement, work will continue at technical level in the coming weeks to finalise the details of the new regulation. The presidency will submit the compromise text to the member states’ representatives (Coreper) for endorsement once this work has been concluded.The entire text will need to be confirmed by both institutions and undergo legal-linguistic revision before formal adoption by the co-legislators.

EU AI Act: first regulation on artificial intelligence.  AI Act: different rules for different risk levelsDeadlines: prohibitions (6 months)                         GPAI (12 months)                         high risk AI systems (Annex III 24 months, Annex II 36 months)                          all other parts (24 months)The new rules establish obligations for providers and users depending on the level of risk from artificial intelligence. While many AI systems pose minimal risk, they need to be assessed.Unacceptable riskUnacceptable risk AI systems are systems considered a threat to people and will be banned. They include:
  • Cognitive behavioural manipulation of people or specific vulnerable groups: for example voice-activated toys that encourage dangerous behaviour in children
  • Social scoring: classifying people based on behaviour, socio-economic status or personal characteristics
  • Real-time and remote biometric identification systems, such as facial recognition
Some exceptions may be allowed: For instance, “post” remote biometric identification systems where identification occurs after a significant delay will be allowed to prosecute serious crimes but only after court approval.High riskAI systems that negatively affect safety or fundamental rights will be considered high risk and will be divided into two categories:1) AI systems that are used in products falling under the EU’s product safety legislation. This includes toys, aviation, cars, medical devices and lifts.2) AI systems falling into eight specific areas that will have to be registered in an EU database:
  • Biometric identification and categorisation of natural persons
  • Management and operation of critical infrastructure
  • Education and vocational training
  • Employment, worker management and access to self-employment
  • Access to and enjoyment of essential private services and public services and benefits
  • Law enforcement
  • Migration, asylum and border control management
  • Assistance in legal interpretation and application of the law.
All high-risk AI systems will be assessed before being put on the market and also throughout their lifecycle.Generative AIGenerative AI, like ChatGPT, would have to comply with transparency requirements:
  • Disclosing that the content was generated by AI
  • Designing the model to prevent it from generating illegal content
  • Publishing summaries of copyrighted data used for training
Limited risk
Limited-risk AI systems should comply with minimal transparency requirements that would allow users to make informed decisions. After interacting with the applications, the user can then decide whether they want to continue using it. Users should be made aware when they are interacting with AI. This includes AI systems that generate or manipulate image, audio or video content, for example deepfakes.
The AIA is formulated as sector-specific legislation, separate from the e-Commerce Directive, the DSA and the DSM. The Act applies to providers placing AI on the EU market, but also to users in the EU and to providers and users in third countries if the output is used in the EU. The AIA contains substantial risk-assessment elements. The proposal consists of three categories of systems. Chapter II covers systems that are completely prohibited because of their high risk. Article 5 explicitly prohibits four such systems on account of the danger posed by their use. Chapter III covers high-risk systems that are not prohibited as such, but which are subject to significant legal requirements. Finally, Chapters IV and V apply to all systems and introduce mandatory transparency and measures in support of innovation.
13 March 2024.  Following the EU Parliament's plenary session vote (523/46/49), the AI Act will undergo final linguistic approval by lawyer-linguists in April, a step considered a formality, before being published in the Official Journal. It will then come into effect after 21 days, with the prohibited-systems provisions in force six months later, by the end of 2024. Other provisions will come in over the next 2-3 years. The AI Act imposes significant penalties for non-compliance with the prohibited-systems provisions, with fines up to €35 million or 7% of global turnover. What's the message? Make sure you aren't providing or using prohibited AI systems in the EU by the end of the year.
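As a rough illustration of how that penalty ceiling is usually read (the cap being the higher of the fixed amount and the turnover-based amount), here is a minimal sketch; the function name and the example turnover figure are illustrative assumptions only:

    def max_fine_prohibited_practices(global_annual_turnover_eur: float) -> float:
        """Illustrative upper bound of the fine for prohibited AI practices:
        EUR 35 million or 7% of worldwide annual turnover, whichever is higher."""
        return max(35_000_000, 0.07 * global_annual_turnover_eur)

    # Example: a company with EUR 1 billion annual turnover faces a cap of EUR 70 million.
    print(max_fine_prohibited_practices(1_000_000_000))  # 70000000.0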
The following will be prohibited AI systems under the AI Act by year end:
  • Manipulative and deceptive practices.  AI systems that use subliminal techniques to materially distort a person’s decision-making capacity, leading to significant harm, are banned. This includes systems that manipulate behaviour or decisions in a way that the individual would not have otherwise made.
  • Exploitation of vulnerabilities.  The Act prohibits AI systems that target individuals or groups based on age, disability, or socio-economic status to distort behaviour in harmful ways.
  • Biometric categorisation.  It outlaws AI systems that categorise individuals based on biometric data to infer sensitive information like race, political opinions, or sexual orientation. This prohibition does not cover any labelling or filtering of lawfully acquired biometric datasets, such as images. There are also exceptions for law enforcement.
  • Social scoring.  AI systems designed to evaluate individuals or groups over time based on their social behaviour or predicted personal characteristics, leading to detrimental treatment, are banned.
  • Real-time biometric identification.  The use of real-time remote biometric identification systems in publicly accessible spaces for law enforcement is heavily restricted, with allowances only under narrowly defined circumstances that require judicial or independent administrative approval.
  • Risk assessment in criminal offences.  The Act forbids AI systems that assess the risk of individuals committing criminal offences based solely on profiling, except when supporting human assessment already based on factual evidence.
  • Facial recognition databases.  AI systems that create or expand facial recognition databases through untargeted scraping of images are prohibited.
  • Emotion inference in workplaces and educational institutions.  The use of AI to infer emotions in sensitive environments like workplaces and schools is banned, barring exceptions for medical or safety reasons.
The Act mandates prior authorisation for the use of 'real-time' remote biometric identification systems by law enforcement.
May 2024.  The AI Act enters into force (adopted in the EU on 13 March 2024) as the first legal act in the world to regulate the field of artificial intelligence in a binding manner, with the aim of ensuring the progress of society while at the same time respecting the rights and ensuring the safety of the individual.
  • preamble
  • contents: I General provisions; II Prohibited AI practices; III High-risk AI systems; IV Transparency obligations for providers of AI systems; VIIIa General-purpose AI models; V Measures in support of innovation; VI Governance; VII EU database for high-risk AI systems listed in Annex III; VIII Post-market monitoring, information sharing and market surveillance; IX Codes of conduct; X Confidentiality and penalties; XI Delegation of powers and committee procedure; XII Final provisions
  • annexes

Having taken a position on the ethical principles that must guide the technology, the AI Act launches the phase of regulation: turning ethical principles into norms and creating concrete rules. The AI Act grounds the digital space of the AI ecosystem on three pillars:
  • innovative development
  • transparency and accountability
  • a system for assessing the level of risk to society and to individual rights
The Act classifies AI applications according to the assessed level of risk and then, by functionality, sorts their different manifestations into four categories:
  • Unacceptable risk: AI practices considered too harmful to be allowed are prohibited, such as manipulating people's behaviour to their detriment (deceptive techniques that distort behaviour and impair informed decision-making), systems that unfairly categorise individuals (social scoring), and real-time facial recognition software in public places.
  • High risk: divided into two sub-categories: safety-component systems (e.g., medical devices) and systems for sensitive uses (biometrics, critical infrastructure, education, employment, essential services, law enforcement, migration, the judiciary). The category covers AI systems that could significantly affect people's safety or fundamental rights. These systems require strict conformity and safety assessment, oversight, data governance, record keeping, cybersecurity, a system for reporting serious incidents, and more.
  • Limited risk: AI systems that interact directly with users, such as chatbots. They must be transparent about the fact that they are operated by AI, so that users are aware they are not communicating with humans.
  • Minimal risk: most AI applications fall into this category, in which the freedom to innovate is preserved with minimal regulatory intervention. These are systems that pose a negligible risk to the rights or safety of individuals and are used with minimal oversight.

Organisations and companies that develop and use AI systems and software must fully operationalise Responsible AI, which means taking deliberate actions to design, deploy and use AI in order to create value and build trust while protecting users, fellow citizens and society from the potential risks of AI.
Fair Artificial Intelligence.  Companies that develop AI-based systems must apply FairAI (MIM5, Minimal Interoperability Mechanisms 5) as well as all technical cybersecurity measures of the NIS2 cybersecurity directive.
Council of Europe's AI Convention (2023–2024) (Convention on Artificial Intelligence, Human Rights, Democracy and the Rule of Law) aims to protect human rights against the harms of AI. The AI Convention may become the first legally-binding international treaty on AI.

Explainable AI (XAI)

  • Fairness and debiasing: Manage and monitor fairness. Scan your deployment for potential biases. 
  • Model drift mitigation: Analyze your model and make recommendations based on the most logical outcome. Alert when models deviate from the intended outcomes (a minimal drift-check sketch follows this list).
  • Model risk management: Quantify and mitigate model risk. Get alerted when a model performs inadequately. Understand what happened when deviations persist.
  • Lifecycle automation: Build, run and manage models as part of integrated data and AI services. Unify the tools and processes on a platform to monitor models and share outcomes. Explain the dependencies of machine learning models.
  • Multicloud-ready: Deploy AI projects across hybrid clouds including public clouds, private clouds and on premises. Promote trust and confidence with explainable AI.
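A minimal sketch of how the drift alerting mentioned above could be wired up, using a two-sample Kolmogorov-Smirnov test to compare one model input feature at training time and in production; the threshold, feature values and function name are illustrative assumptions, not part of any particular product:

    import numpy as np
    from scipy.stats import ks_2samp

    def drift_alert(train_values, live_values, p_threshold=0.01):
        """Flag drift when the live distribution differs significantly
        from the training distribution (illustrative threshold)."""
        stat, p_value = ks_2samp(train_values, live_values)
        return p_value < p_threshold, stat, p_value

    rng = np.random.default_rng(0)
    train_feature = rng.normal(loc=0.0, scale=1.0, size=5000)  # training-time feature
    live_feature = rng.normal(loc=0.4, scale=1.0, size=5000)   # shifted production feature

    drifted, stat, p = drift_alert(train_feature, live_feature)
    print(f"drift detected: {drifted} (KS statistic {stat:.3f}, p-value {p:.2e})")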

Generative AI (GenAI) and Large Language Models (LLMs) are viewed as critical technologies to transition from being AI-native to becoming intrinsic automation-native, integrating AI deeply into every aspect of network operations. Following the European Commission's technical report on Ethics guidelines for trustworthy AI, AI solutions should pursue trustworthiness. This aligns with the principles that emphasize transparency, fairness, and accountability of AI solutions. In this context, for AI/GenAI automation to be effectively integrated into commercial networks, it is essential to build substantial trust and clarity in the often opaque, black-box nature of AI. By revealing the impact of different inputs on the outputs, AI developers and researchers can identify and rectify biases, inaccuracies, or unforeseen behaviors in AI models. To this end, eXplainable AI (XAI) techniques and measurements become crucial. They help clarify the reasoning behind AI's predictions and decisions, which is instrumental in enhancing the comprehension of causality within AI models. The concept of explainability in AI is crucial to enable the trust and reliability required by critical infrastructures and services. XAI techniques can be applied in different forms within 6G networks, specifically as Ante-hoc, In-hoc, and Post-hoc explanations.
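As one concrete example of a post-hoc explanation technique, the sketch below estimates global feature importance by permutation on a generic scikit-learn model; the synthetic data and the choice of a random-forest classifier are illustrative assumptions only:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance

    # Synthetic stand-in for, e.g., network telemetry features feeding an AI model.
    X, y = make_classification(n_samples=2000, n_features=6, n_informative=3, random_state=0)
    model = RandomForestClassifier(random_state=0).fit(X, y)

    # Post-hoc explanation: how much does shuffling each feature degrade accuracy?
    result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
    for i, importance in enumerate(result.importances_mean):
        print(f"feature {i}: importance {importance:.3f}")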

Responsible AI

Nowadays, Artificial Intelligence (AI) is democratized in our everyday life. Responsible AI is AI that takes into account societal values and moral and ethical considerations. The main pillars are:
  • Accountability refers to the need to explain and justify one's decisions and actions to partners, users and others with whom the system interacts.
  • Responsibility refers to the role of people themselves and to the capability of AI systems to answer for their decisions and to identify errors or unexpected results.
  • Transparency refers to the need to describe, inspect and reproduce the mechanisms through which AI systems make decisions, as well as the governance of the data that is used and created.
  • Fairness refers to the equitable treatment of individuals, or groups of individuals, by an AI system. Bias occurs when an AI system has been designed, intentionally or not, in a way that may make the system's output unfair. Bias can be present both in the algorithm of the AI system and in the data used to train and test it. It can emerge as a result of cultural, social, or institutional expectations; because of technical limitations of its design; or when the system is used in unanticipated contexts or to make decisions about communities that are not considered in the initial design (a minimal fairness-metric sketch follows this list).
  • In order to ensure that systems will uphold human values, design methods are needed that incorporate ethical principles and address societal concerns.  Moreover, implementing ethical actions in machines will help us better understand ethics overall. Responsible AI implies the need for mechanisms that enable AI systems themselves to reason about, and act according to, ethics and human values. This requires models and algorithms to represent and reason about, and take decisions based on, human values, and to justify their decisions according to their effect on those values.
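A minimal sketch of one common fairness check mentioned in the fairness bullet above, the demographic parity difference (the gap in positive-outcome rates between two groups); the toy predictions, group labels and function name are illustrative assumptions:

    import numpy as np

    def demographic_parity_difference(y_pred, group):
        """Difference in positive-prediction rates between two groups (0 and 1)."""
        y_pred, group = np.asarray(y_pred), np.asarray(group)
        rate_g0 = y_pred[group == 0].mean()
        rate_g1 = y_pred[group == 1].mean()
        return abs(rate_g0 - rate_g1)

    # Toy predictions for eight individuals and their (binary) group membership.
    predictions = [1, 0, 1, 1, 0, 0, 1, 0]
    groups      = [0, 0, 0, 0, 1, 1, 1, 1]
    gap = demographic_parity_difference(predictions, groups)
    print(f"demographic parity difference: {gap:.2f}")  # 0.50 here; closer to 0 is fairer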

Ethically Aligned Design (EAD1)

Prioritizing ethical and responsible artificial intelligence has become a widespread goal for society. Important issues of transparency, accountability, algorithmic bias, and value systems are being directly addressed in the design and implementation of autonomous and intelligent systems (A/IS). While this is an encouraging trend, a key question still facing technologists, manufacturers, and policymakers alike is how to assess, understand, measure, monitor, safeguard, and improve the well-being impacts of A/IS on humans.
EAD will provide pragmatic and directional insights and recommendations, serving as a key reference for the work of technologists, educators and policymakers in the coming years. EAD sets forth scientific analysis and resources, high-level principles, and actionable recommendations. It offers specific guidance for standards, certification, regulation or legislation for design, manufacture, and use of A/IS that provably aligns with and improves holistic societal well-being.
As the use and impact of autonomous and intelligent systems (A/IS) become pervasive, we need to establish societal and policy guidelines in order for such systems to remain human-centric, serving humanity's values and ethical principles. These systems must be developed and should operate in a way that is beneficial to people and the environment, beyond simply reaching functional goals and addressing technical problems. This approach will foster the heightened level of trust between people and technology that is needed for its fruitful use in our daily lives.
  • EAD Pillars of Conceptual Framework fall broadly into three areas, reflecting anthropological, political, and technical aspects: universal human values, political self-determination and data agency, and technical dependability. 
  • EAD General Principles have emerged through the continuous work of dedicated, open communities in a multi-year, creative, consensus-building process. They articulate high-level principles that should apply to all types of autonomous and intelligent systems (A/IS). Created to guide behavior and inform standards and policy making, the General Principles define imperatives for the ethical design, development, deployment, adoption, and decommissioning of autonomous and intelligent systems. The Principles consider the role of A/IS creators, i.e., those who design and manufacture, of operators, i.e., those with expertise specific to use of A/IS, other users, and any other stakeholders or affected parties.

The ethical and values-based design, development, and implementation of autonomous and intelligent systems should be guided by the following General Principles:
  • Human rights.  A/IS shall be created and operated to respect, promote, and protect internationally recognized human rights.
  • Well-being.  A/IS creators shall adopt increased human well-being as a primary success criterion for development.
  • Data agency.  A/IS creators shall empower individuals with the ability to access and securely share their data, to maintain people’s capacity to have control over their identity.
  • Effectiveness.  A/IS creators and operators shall provide evidence of the effectiveness and fitness for purpose of A/IS.
  • Transparency.  The basis of a particular A/IS decision should always be discoverable.
  • Accountability.  A/IS shall be created and operated to provide an unambiguous rationale for all decisions made.
  • Awareness of misuse.  A/IS creators shall guard against all potential misuses and risks of A/IS in operation.
  • Competence.  A/IS creators shall specify and operators shall adhere to the knowledge and skill required for safe and effective operation. 

Classical ethics
By drawing from over two thousand five hundred years of classical ethics traditions, the authors explored established ethics systems, addressing both scientific and religious approaches, including secular philosophical traditions, to address human morality in the digital age. Through reviewing the philosophical foundations that define autonomy and ontology, this work addresses the alleged potential for autonomous capacity of intelligent technical systems, morality in amoral systems, and asks whether decisions made by amoral systems can have moral consequences. In doing so, we critique assumptions around concepts such as good and evil, right and wrong, virtue and vice, and we attempt to carry these inquiries into artificial systems' decision-making processes.
Assigning foundations for morality, autonomy, and intelligence.  Classical theories of economy in the Western tradition, starting with Plato and Aristotle, embrace three domains: the individual, the family, and the polis. The formation of the individual character (ethos) is intrinsically related to the others, as well as to the tasks of administration of work within the family (oikos). Eventually, this all expands into the framework of the polis, or public space (poleis). When we discuss ethical issues of A/IS, it becomes crucial to consider these three traditional dimensions, since Western classical ethics was developed from this foundation and has evolved in modernity into an individual morality disconnected from economics and politics. Classical ethics for a technical world:
  • maintaining human autonomy
  • implications of cultural migration in A/IS
  • applying goal-directed behavior (Virtue Ethics) to A/IS
  • requirement for rule-based ethics in practical programming

Arts and AI?

New technologies, and in particular artificial intelligence, are drastically changing the nature of creative processes. Computers are playing very significant roles in creative activities such as music, architecture, fine arts, and science. Indeed, the computer is already a canvas, a brush, a musical instrument, and so on. However, we believe that we must aim at more ambitious relations between computers and creativity. Rather than just seeing the computer as a tool to help human creators, we could see it as a creative entity in its own right. This view has triggered a new subfield of Artificial Intelligence called Computational Creativity. Computational creativity is the study of building software that exhibits behavior that would be deemed creative in humans. Such creative software can be used for autonomous creative tasks, such as inventing mathematical theories, writing poems, painting pictures, and composing music. However, computational creativity studies also enable us to understand human creativity and to produce programs for creative people to use, where the software acts as a creative collaborator rather than a mere tool. Historically, it has been difficult for society to come to terms with machines that purport to be intelligent and even more difficult to admit that they might be creative. A typical statement of detractors of computational creativity is that "simulating artistic techniques means also simulating human thinking and reasoning, especially creative thinking. This is impossible to do using algorithms or information processing systems." We could not disagree more: creativity is not some mystical gift that is beyond scientific study but rather something that can be investigated, simulated, and harnessed for the good of society. Artificial intelligence not only encourages the integration of science and art, but it also allows today's artists to think more deeply about how to build future art. However, an AI computer would be only apparently creative, not really creative, for two main reasons: the lack of intentionality and our reluctance to give AI agents a place in our society. 
Chat Generative Pre-trained Transformer (ChatGPT; OpenAI, San Francisco, CA, USA) is an AI chatbot released in November, 2022. Developed using human feedback and freely accessible, the platform has already attracted millions of interactions. Over the past few years, language models have benefited greatly from the rapid development of Artificial Intelligence (AI) and Natural Language Processing (NLP), making them more accurate, flexible, and useful than ever before [1]. The term "Generative AI" is used to describe a subset of AI models that can generate new information by discovering relevant trends and patterns in already collected information. These models may produce work in a wide range of media, from written to visual to audio [2]. To analyse, comprehend, and produce material that accurately imitates human-generated outcomes, Generative AI models depend on deep learning approaches and neural networks. OpenAI's ChatGPT is one such AI model that has quickly become a popular and versatile resource for a number of different industries. Its human-like text generation is made possible by its foundation in the Generative Pre-trained Transformer (GPT) architecture [3]. It has the ability to comprehend and produce a broad variety of words since it has been trained on an extensive amount of text data. Language translation, text summarisation, and conversation generation are just some of the applications that can benefit from its capacity to create natural-sounding content. ChatGPT can be trained to do a variety of activities, including language recognition, question answering, and paragraph completion. It is also useful for building chatbots and other conversational interfaces. In a nutshell, ChatGPT is a robust NLP model that can comprehend and create natural language for a wide range of applications, including text production, language understanding, and interactive programmes [4]. To understand ChatGPT's role in promoting scientific research, it is essential to comprehend its genesis and evolution. To clarify, ChatGPT is not a Generative Adversarial Network (GAN) model but rather a linguistic model built on the GPT architecture, which is relevant here. GPT models are tailored to NLP activities including text production and language comprehension, as opposed to GANs, which are more commonly employed for activities such as picture generation [5]. The origins of ChatGPT lie in NLP, a subfield of AI that aims to teach computers to comprehend and produce human speech. The motivation behind developing ChatGPT was to establish a powerful and flexible AI language model that could help with a wide range of activities, such as text production, translation, and data analysis. In order to address some of the limitations of prior sequence-to-sequence models for NLP, ChatGPT was built on the foundation of the Transformer architecture. This innovative design made it possible to build powerful language models like OpenAI's GPT series, which included GPT-2 and GPT-3, the versions that came before ChatGPT [6]. The GPT-3.5 architecture is the basis for ChatGPT; it is an improved version of OpenAI's GPT-3 model. Even though GPT-3.5 has fewer parameters, it nevertheless produces excellent results in many areas of NLP, such as language understanding, text generation, and machine translation [6]. 
ChatGPT was trained on a massive body of text data and fine-tuned on the goal of creating conversational replies, allowing it to create responses to user inquiries that are uncannily similar to those of a person. The revolutionary models GPT-2, GPT-3, and ultimately ChatGPT were developed by OpenAI, which has been at the cutting edge of AI innovation [51–53]. OpenAI maintained its research and development efforts after the success of GPT-3, eventually resulting in ChatGPT, which is based on the GPT-4 architecture [7]. ChatGPT is optimised for conversational tasks; it outperforms GPT-3 in terms of contextual comprehension, answer creation, and consistency [51]. GPT models are developed to create coherent and human-like natural language text, including phrases, paragraphs, and complete papers [53]. In order to perform well on subsequent assignments like text categorization and question answering, GPT models must first be pre-trained on massive volumes of text data. In unsupervised pre-training, the model is trained on a huge body of text data, including that found in textbooks or online, without the use of tags or comments. The GPT model is taught to predict the next word in a text sequence by examining the words that came before it in the training data. In the field of NLP, this is known as a language modelling task [3]. The model learns to recognise and generalise linguistic trends, including syntax, vocabulary, and logic, by training on a vast body of text data. If the GPT model is given a smaller labelled dataset after pre-training, it may be adapted to a single downstream task by updating its weights and biases to better match that task [4]. If categorization of text is required downstream, for instance, the model may be trained to determine which label best fits a particular piece of input text. Fig. 1 shows the timeline of GPTs from the first version (GPT-1) to the latest version (ChatGPT). GPT-3.  It is larger and more efficient than its predecessor GPT-2, with 175 billion parameters [10]. Using a language modelling task, GPT-3 was trained on a large collection of text data that comprised books, articles, and online content. The model was taught to anticipate the next word in a string of text based on the words that came before it, and it has since been used to produce convincing, natural-sounding prose. GPT-3's versatility lies in its capacity to handle several NLP tasks while requiring little task-specific training data. These tasks include text categorization, sentiment analysis, and question answering. This is because the model can pick up a variety of language elements and patterns from its pre-training data, allowing it to generalise to a broad variety of activities and contexts. In addition, GPT-3 has a number of cutting-edge features, including multi-task learning and few-shot learning, which permit the system to pick up and master new jobs with minimal training data [11]. Due to its many useful features, GPT-3 is a language model for NLP that can be applied to many different domains. Chatbots, language translation, content production, as well as code generation are just some of the numerous practical applications that have made use of GPT-3. 
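To make the language-modelling objective described above concrete, here is a minimal PyTorch sketch of the next-token prediction loss; the toy embedding-plus-linear "model" stands in for a real transformer and is an illustrative assumption, not the actual GPT architecture:

    import torch
    import torch.nn.functional as F

    vocab_size, d_model = 100, 32
    embed = torch.nn.Embedding(vocab_size, d_model)   # toy stand-in for a transformer
    lm_head = torch.nn.Linear(d_model, vocab_size)

    tokens = torch.randint(0, vocab_size, (1, 16))     # one sequence of 16 token ids
    logits = lm_head(embed(tokens))                    # shape (1, 16, vocab_size)

    # Each position t is trained to predict the token at position t + 1.
    loss = F.cross_entropy(logits[:, :-1].reshape(-1, vocab_size),
                           tokens[:, 1:].reshape(-1))
    print(loss.item())   # pre-training minimises this loss over huge text corpora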
Employing a language modelling task, ChatGPT has been trained on a huge collection of text data [1]. This data includes books, papers, and online content. ChatGPT is able to effectively generate logical and authentic replies in a discussion since it is pre-trained to grasp the patterns and links among phrases and words in natural language [2]. Fig. 2 shows a general architecture of ChatGPT: when a client interacts with ChatGPT to ask a query, the platform is able to deduce the user's purpose using Natural Language Understanding (NLU). The appropriate data is fetched from the underpinning Knowledge Base. Natural Language Generation (NLG) then formulates a response to the client based on the retrieved data. The discussion history is saved so that future interactions may be responded to and tailored to the individual. In order to enhance the future quality of answers, Reinforcement Learning is used to collect customer feedback and take appropriate actions. GPT-4.  Recently released by OpenAI, it includes substantial advances in deep learning scalability [7]. This new model is a large multimodal model that takes in images and text and produces text outputs. GPT-4 has exhibited human-level performance on a variety of professional and academic benchmarks, even if it might not be as proficient as humans in everyday situations [13]. For example, compared to GPT-3.5, it has attained a score in the highest 10% of participants on a simulated legal examination [12]. There remains an opportunity for enhancements to GPT-4's factuality, steerability, and ability to remain within the provided restrictions, but after six months of incremental alignment using lessons from OpenAI's adversarial evaluation programme and ChatGPT, the model achieved its best-ever efficiency [57]. Text generation, question answering, language translation, and emotion evaluation are merely some of the many NLP tasks in which GPT models have proven themselves to be cutting-edge performers. Chatbots, aiding customers, and material production are just some of the practical applications identified in the literature [1,2,5].

AI art generators:

  • Bing Image Creator. AI art models: DALL·E 3. Platform: Web. https://lnkd.in/efhS6qWq
  • DALL·E 3. AI art models: DALL·E 3. Platform: Web (via ChatGPT). https://lnkd.in/emEMRGC3
  • DreamStudio (Stable Diffusion). AI art models: Stable Diffusion. Platform: Web. https://dreamstudio.ai/
  • Midjourney. AI art models: Midjourney. Platform: Discord. https://lnkd.in/gGXymwpR
  • Canva. AI art models: Stable Diffusion. Platform: Web, iOS, Android. https://lnkd.in/g6-yMB4F
  • NightCafe. AI art models: Stable Diffusion, DALL·E 2, CLIP-Guided Diffusion, VQGAN-CLIP. Platform: Web. https://lnkd.in/dEW_ttWk
  • OpenArt. AI art models: Stable Diffusion, DALL·E 2, and other open source models. Platform: Web. https://openart.ai/create
  • Adobe Firefly. AI art models: Firefly. Platform: Web, Adobe Express, Adobe Photoshop, and other Adobe tools. https://firefly.adobe.com/
  • Jasper Art. AI art models: not specified, but likely based on Stable Diffusion. Platform: Web. https://www.jasper.ai/art
  • Prodia. AI art models: Stable Diffusion and other open source models. Platform: Web. https://app.prodia.com/
  • Leap AI. AI art models: Stable Diffusion and other open source models. Platform: Web. https://tryleap.ai/
  • Craiyon. AI art models: based on the original DALL·E model. Platform: Web. https://www.craiyon.com/
  • getimg.ai. AI art models: Stable Diffusion and other open source models. Platform: Web. https://getimg.ai/
  • Shutterstock AI Image Generator. AI art models: DALL·E 2. Platform: Web. https://lnkd.in/dBgRJ7ck
  • Generative AI by Getty Images. AI art models: custom model developed with NVIDIA. Platform: Web. https://lnkd.in/eS-bpYVa
  • Deep Dream Generator. AI art models: custom-trained models. Platform: Web. https://lnkd.in/gRZYrKW

Responsible and ethical Large Language Models

Trust and accountability
  • Foundation Model Transparency Index: specifies 100 fine-grained indicators that comprehensively codify transparency for foundation models, spanning the upstream resources used to build a foundation model (e.g. data, labor, compute), details about the model itself (e.g. size, capabilities, risks), and the downstream use (e.g. distribution channels, usage policies, affected geographies).
  • benefits of using human feedback to guide model behavior (Proximal Policy Optimization, PPO) 
  • control and refine generated content by influencing token probabilities (prevent the model from generating harmful, offensive, or biased content)
  • prompt engineering (identify a limited set of high-priority keywords or topics; use regular expressions or natural language processing techniques to identify patterns in user inputs); a minimal regex-filter sketch follows this list
  • bias detection and correction after training (post-processing tools and techniques)
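A minimal sketch of the regular-expression approach mentioned in the prompt-engineering bullet above; the keyword patterns and function name are placeholder assumptions, not a recommended blocklist:

    import re

    # Illustrative high-priority patterns; a real deployment would maintain a reviewed list.
    BLOCKED_PATTERNS = [r"\bcredit\s+card\s+number\b", r"\bhow\s+to\s+build\s+a\s+weapon\b"]
    COMPILED = [re.compile(p, re.IGNORECASE) for p in BLOCKED_PATTERNS]

    def flag_user_input(text: str) -> bool:
        """Return True if the user input matches any high-priority pattern."""
        return any(p.search(text) for p in COMPILED)

    print(flag_user_input("Please tell me how to build a weapon"))   # True
    print(flag_user_input("Summarise this article for me"))          # False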
Legal and ethical considerations
Using LLMs raises ethical considerations, including the potential for biased outputs, breaches of privacy, and the risk of misuse. Addressing these concerns requires the adoption of transparent development practices, the responsible handling of data, and the integration of fairness mechanisms. Legality means acting according to the law, while ethics is about right and wrong behaviour. This means that some actions might be legal but, in some people's opinion, not ethical. Legality has its basis in ethics, while ethics has its basis in morals. Legal standards are based on written law, while ethical standards are based on human rights and wrongs. Something can be legal but not ethical. Legal standards are written by government officials, while ethical standards are written by societal norms.
The following discussion could play an important role in orienting and framing dialogue on foundation models (FM) and this new paradigm in AI. That said, to ensure the responsible development and deployment of these models on durable foundations, we envision collaboration between different sectors, institutions, and disciplines from the outset to be especially critical.
Foundation models (FM): The nature of human language and NLP
Language is the basis of most human communication and interaction. However, it is not just a means for humans to achieve shared goals: language is central to human thought, to how social and emotional relations are formed, to how we identify ourselves socially and personally, and to how humans record knowledge and develop societal intelligence. Spoken or signed languages arise in every human society, and the languages of the world are both incredibly diverse in the ways that they express and structure the information they convey, while also exhibiting surprising concordance in the richness of what makes a language. There are over 6,000 languages in the world, with estimates varying due to the inherent uncertainty of what constitutes a separate language. Languages are remarkably complex yet efficient systems, acquired consistently by children in a short amount of time, and which evolve and encompass the changing needs and conditions of linguistic communities. Due to this centrality of language in human activities, language understanding and generation is a critical element of research in artificial intelligence. Natural language processing (NLP) is the subfield of artificial intelligence concerned with language and, together with the related fields of automatic speech recognition (ASR) and text-to-speech (TTS), has the goal of giving computers the ability to understand and generate human language in much the same way human beings can. To date in 2021, NLP has been the field most profoundly affected by foundation models. The first generation of foundation models showcased an impressive variety of linguistic abilities, as well as a surprising amount of adaptability to a large range of linguistic situations. Since the introduction of the early foundation models ELMo and BERT in 2018, the field of NLP has become largely centered around using and understanding foundation models. The field has shifted to using foundation models as the primary tool, moving towards more generalized language learning as a central approach and goal. Foundation models have changed the overall process and mentality for training machine learning models for language; however, there are theoretical and practical challenges facing foundation models as they are applied to a broader set of languages and more realistic and complex linguistic situations. The field of NLP has historically focused on defining and engineering systems for challenging linguistic tasks, with the vision that models that are good at these tasks will lead to competent language systems for downstream applications. NLP tasks include classification tasks for a whole sentence or document (e.g., sentiment classification, like predicting whether a movie review is positive or negative), sequence labeling tasks, in which we classify each word or phrase in a sentence or document (e.g., predicting if each word is a verb or a noun, or which spans of words refer to a person or an organization), span relation classification (e.g., relation extraction or parsing, like whether a person and location are linked by a "current residence" relation, or a verb and a noun by a "subject-verb" relation), and generation tasks, producing new text that is conditioned strongly on an input (e.g., producing a translation or summary of a text, recognizing or producing speech, or responding in a conversation). 
In the past, NLP tasks had distinct research communities that developed task-specific architectures, often based on pipelines of different models, each performing a linguistic sub-task such as token segmentation, syntactic parsing, or coreference resolution. By contrast, the dominant modern approach for performing each task is to use a single foundation model and adapt it slightly using relatively small amounts of annotated data specific to each task (sentiment classification, named entity tagging, translation, summarization) to create an adapted model. This has proved to be an extremely successful approach: for the vast majority of the tasks described above, a foundation model that is slightly adapted for a task greatly outperforms previous models or pipelines of models that were built specifically to perform that one task.
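A minimal sketch of the "slightly adapt a foundation model with a small labelled set" recipe described above, here for sentiment classification with the Hugging Face transformers library; the checkpoint name, tiny two-example dataset and single training step are illustrative assumptions, not a complete fine-tuning run:

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    checkpoint = "distilbert-base-uncased"            # example pretrained foundation model
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

    texts = ["A wonderful, moving film.", "A dull and lifeless movie."]
    labels = torch.tensor([1, 0])                     # tiny task-specific annotation set
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
    model.train()
    outputs = model(**batch, labels=labels)           # loss computed against the labels
    outputs.loss.backward()
    optimizer.step()                                  # one adaptation step; real fine-tuning loops over many batches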
Foundation models (FM): Vision
Vision underlies one of the primary modes through which a living organism understands its environment. The ability to see enables the near-constant, long-range gathering of dense signals, a critical capability developed over an evolutionary time-scale in a diverse range of life forms. For a skill executed effortlessly by even simple living creatures, transferring the same abilities to machines has proved remarkably challenging, leading computer vision and robotics researcher Hans Moravec in 1988 to observe a paradox: in AI, (what were considered) hard problems are easy and likewise easy problems are hard, and among the "easiest" problems of them all is the visual acuity which we use each day to continually interpret complex scenes in a matter of milliseconds. The field of computer vision and the challenges we define draw inspiration in many ways from human perception capabilities. Several classical theories [Marr 1982] suggested that humans may perceive real world scenes by contextualizing parts as a larger whole, and pointed the way for computer vision techniques to progressively model the physical world with growing levels of abstraction. Gibson [1979] suggested that human vision is inherently embodied and that interactive ecological environments may play a key role in its development. These ideas continue to motivate the ongoing development of computer vision systems, iterating towards a contextual, interactive, and embodied perception of the world. In the context of computer vision, foundation models translate raw perceptual information from diverse sources and sensors into visual knowledge that may be adapted to a multitude of downstream settings. To a large extent, this effort is a natural evolution of the key ideas that have emerged from the field over the last decade. The introduction of ImageNet [Deng et al. 2009] and the advent of supervised pretraining led to a deep learning paradigm shift in computer vision. This transition marked a new era, where we moved beyond the classic approaches and task-specific feature engineering of earlier days towards models that could be trained once over large amounts of data, and then adapted for a broad variety of tasks, such as image recognition, object detection, and image segmentation. This idea remains at the core of foundation models. The bridge to foundation models comes from the limitations of the previous paradigm. Traditional supervised techniques rely on expensive and carefully-collected labels and annotations, limiting their robustness, generalization and applicability; in contrast, recent advances in self-supervised learning suggest an alternative route for the development of foundation models that could make use of large quantities of raw data to attain a contextual understanding of the visual world. Relative to the broader aims of the field, the capabilities of vision foundation models are currently early-stage: we have observed improvements in traditional computer vision tasks (particularly with respect to generalization capability) and anticipate that the near-term progress will continue this trend. However, in the longer term, the potential for foundation models to reduce dependence on explicit annotations may lead to progress on essential cognitive skills (e.g., commonsense reasoning) which have proven difficult in the current, fully-supervised paradigm. 
In turn, we discuss the potential implications of foundation models for downstream applications, and the central challenges and frontiers that must be addressed moving forward. At a high level, computer vision is the core sub-field of artificial intelligence that explores ways to endow machines with the capacity to interpret and understand the visual world. It encompasses a multitude of tasks, sub-domains and downstream applications, where the community has made continual progress over the last several decades. A selection of example tasks: (1) semantic understanding tasks, which aim to discover the properties and relations among entities within visual scenes; these include image classification, object detection, semantic segmentation, action recognition, and scene graph generation, among others. (2) geometric, motion and 3D tasks, seeking to represent the geometry, pose and structure of still or moving objects; these include depth estimation, structure-from-motion, surface normal detection, curvature line and keypoint estimation, to name a few. (3) multimodal integration tasks, combining semantic and geometric understanding with other modalities such as natural language; these include, for instance, visual question answering, image captioning, and instruction following.
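A minimal sketch of the "train once, adapt for a downstream task" pattern described above for vision: a pretrained backbone is frozen and only a new classification head is trained. The torchvision model choice, the ImageNet weights string and the 10-class head are illustrative assumptions:

    import torch
    from torchvision import models

    # Pretrained backbone as a visual foundation; freeze it and retrain only the head.
    backbone = models.resnet18(weights="IMAGENET1K_V1")
    for param in backbone.parameters():
        param.requires_grad = False
    backbone.fc = torch.nn.Linear(backbone.fc.in_features, 10)   # new 10-class task head

    images = torch.randn(4, 3, 224, 224)        # stand-in batch of images
    labels = torch.randint(0, 10, (4,))
    loss = torch.nn.functional.cross_entropy(backbone(images), labels)
    loss.backward()                              # gradients flow only into the new head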
Foundation models (FM): Robotics
A longstanding challenge of robotics research is to endow robots with the ability to handle the myriad conditions they will encounter in real-world settings. In this section, we discuss how the ideas underlying foundation models can potentially help bring about "generalist" robots that can, for example, cook a new meal in a new house, with a new kitchen. To make progress towards this goal, existing foundation models will not suffice. We need new types of models trained on a multitude of data sources, spanning grounded robotic interaction data to videos of humans performing tasks, amongst others. We focus on how such foundation models can apply to the problem of a robot controlling its own physical embodiment to successfully perform different tasks. This is a high-dimensional and closed-loop decision-making problem: the actions that a robot takes directly influence what it perceives next, which in turn influences the next robot action. This closed-loop aspect is not traditionally studied in language and computer vision, where large offline datasets are dominant and foundation models have already seen success. We focus on how the demonstrated benefits of foundation models — large-scale, self-supervised learning — can be leveraged in this new closed-loop data regime. The promise of a new type of robotic foundation model is in its ability to amplify the potential of robots to improve key facets of daily life ranging from manufacturing [Nof 1999; Sanneman et al. 2020], construction, autonomous driving, to household aid and personal assistance. Our discussion in this section primarily focuses on mobile manipulation robots for household tasks, but we expect its essence to be broadly applicable to the other use-cases of robotics listed above. On the critical path towards building new types of foundation models for robotics is embracing opportunities in task specification and task learning, coupled with tackling challenges in data acquisition and safety and robustness. Consider the following robot learning paradigm: starting with a description of a task capturing what a user might like the robot to do (e.g., "make breakfast") — learn a corresponding policy to generate the desired robot actions. While policies can be parameterized in different ways, a common choice is that of a function that maps the task representation and environment observation (e.g., a scene image from a fixed or egocentric camera, or inputs from alternative sensors like LIDAR) to robot actions. As the robot acts in a task-conditioned manner, the subsequent states are fed back to the policy, generating more actions until the task has been satisfied. Recent breakthroughs in applying foundation models for language and vision suggest several potential benefits of large-scale, self-supervised pretraining for improving generalization. The ability to tap into diverse streams of data to learn meaningful representational priors (akin to those learned by models such as BERT and GPT-3) holds promise for learning powerful robotic foundation models for task specification. Diverse robotic interaction data can be used for learning action-conditional dynamics models or policies indexing general and semantically meaningful skills, thereby holding promise for task learning. Yet while these opportunities exist, the key stumbling block is collecting the right data. 
Unlike language and vision data, robotics data is neither plentiful nor representative of a sufficiently diverse array of embodiments, tasks, and environments — we (as a field) still have not converged on the kinds of data that would be maximally useful for enabling generalist robotics (e.g., offline demonstrations, third-person recordings of humans, egocentric videos, autonomous experience, etc.). Coupled with issues in obtaining the right scale and diversity of data are questions of ensuring safety and robustness: how do we behave in a new environment without causing damage? Building new types of foundation models for robotics thus consists of a dichotomy of opportunities and challenges: opportunities for task specification and learning balanced against challenges of data collection and safe deployment. This section explores both by presenting a picture of how robotic foundation models might help us develop generalist robots, in a way that not only meaningfully addresses the challenges associated with building such systems, but that also embraces the potential of multi-modality — incorporating perception, actuation, and language — as well as human-robot interaction for specification and learning. Robotic foundation models could take a variety of forms: problems in robotics do not easily conform to a one-size-fits-all model, since different problems have different input-output signatures — a contrast to domains like NLP where many problems can be cast into a general "text-in, text-out" signature.
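A minimal sketch of the closed-loop, task-conditioned control pattern described above; policy, get_observation and send_action are hypothetical placeholders, not a real robot API:

    def run_task(policy, task_description, get_observation, send_action, max_steps=200):
        """Closed-loop control: each action changes what the robot observes next."""
        for step in range(max_steps):
            observation = get_observation()                  # e.g., camera image, LIDAR scan
            action = policy(task_description, observation)   # task-conditioned policy
            done = send_action(action)                       # acting changes the world state
            if done:                                         # stop once the task is satisfied
                return True
        return False

    # Usage (with hypothetical implementations of the three callables):
    # success = run_task(policy, "make breakfast", robot.observe, robot.act)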
Foundation models (FM): Reasoning and search
Reasoning and search have been a central theme throughout the history of AI. Classic tests of intellect, from strategy games to abstract mathematical discovery, served as inspirational goal posts that pushed the limits of "machine intelligence" through a need to devise ever smarter ways of searching for winning solutions. In the early days, symbolic methods were the dominant approach for reasoning, but the involved engineering effort and the need to formalize heuristics to tackle intractable search spaces quickly proved cumbersome. More recently, data-driven methods using neural networks have shown encouraging results — e.g., defeating the best humans in Go, a board game with a much larger space of actions than the classic challenge of chess — by exploiting statistical structures and learning useful heuristics. This section outlines existing reasoning tasks, ones that require scaling to ever-larger search spaces and understanding the world broadly. We then argue that foundation models should play a central role in moving towards general reasoning, as vehicles for capturing the statistical regularities of unbounded search spaces (generativity), allowing positive transfer across tasks and scenarios (universality), and exploiting the grounding of knowledge in multi-modal environments (grounding). Multimodality can allow foundation models to not only reason with formal symbolic language, but also exploit visual aspects of the problem, such as equivalence, symmetry, and Euclidean geometry, to prune the infinite search space and find promising constructions for a solution, mimicking the way humans reason. Recently, there has been a surge of interest in applying learning-based approaches to tackle reasoning problems. To overcome the unbounded search space challenge, researchers first started with a constrained search space to make the problem tractable. But such approaches suffered from the limited kinds of actions the solver could issue. For example, the solver could only apply theorems from a known database to prove the target theorem, instead of synthesizing novel theorems and lemmas. Because large language models offered a generic way of modeling the output space as a sequence, they quickly became a more favorable choice, allowing the generation of arbitrary kinds of actions. Researchers have applied these language model-based approaches to various applications, such as predicting protein structures, proving formal theorems, conjecturing theorems, synthesizing programs from natural language, and repairing, generating and understanding code. It has also been shown that scaling model size significantly improves reasoning capabilities, and furthermore standard techniques from language modelling, such as pretraining, can also greatly improve performance on these tasks. 
Foundation models (FM): Interaction
The early forms of foundation models such as GPT-3 and DALL·E have demonstrated a high level of versatility, both in terms of their ability to let even non-ML experts prototype powerful AI-infused applications and their ability to seamlessly integrate modalities ranging from text to images. As the development of foundation models matures, the models' capacity will continue to expand and their versatility may ultimately lead to fundamental changes in how we interact with AI by allowing us to rapidly prototype and build highly dynamic and generative AI-infused applications. In this section, we discuss the opportunities that these changes present from the perspectives of two important stakeholders: (1) application developers who will interact with foundation models to design user experience, and (2) end-users who will use or be affected by the AI-infused applications powered by foundation models. Finally, we consider scenarios in which the line that rigidly separates developers and end-users today may start to blur, affording new opportunities for creating AI-infused applications that more closely satisfy users' needs and values. Unfortunately, the same generalizability and high ceiling that give foundation models their edge can also make these models difficult to work with, as they may be even more unpredictable and complex than single-purpose AI models. Indeed, recent work has shown that it can be difficult to make models like GPT-3 consistently perform the intended task, while understanding what they are capable of is still an active area of research. In an effort to improve the reliability and trustworthiness of AI-infused applications, we recommend that future work should continue to investigate how to achieve more predictable and robust behaviors from foundation models (e.g., through fine-tuning, or, in cases where the main mode of interaction is a natural language prompt, through prompt engineering, calibration, or pre-formatting a task-specific endpoint).
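A minimal sketch of "pre-formatting a task-specific endpoint" as mentioned above: a fixed prompt template constrains a general-purpose model to one task. The template wording, parameter values and call_model callable are hypothetical placeholders for whatever foundation-model API is actually used:

    SUMMARISE_TEMPLATE = (
        "You are a summarisation assistant. Summarise the text below in one sentence.\n"
        "Text: {text}\n"
        "Summary:"
    )

    def summarise(text: str, call_model) -> str:
        """Wrap a general-purpose foundation model as a narrow, more predictable endpoint."""
        prompt = SUMMARISE_TEMPLATE.format(text=text)
        return call_model(prompt, max_tokens=60, temperature=0.0)  # low temperature for stability

    # Usage: summarise(article_text, call_model=my_llm_client)   # my_llm_client is hypothetical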
Foundation models (FM): Philosophy of understanding
There is not a precise technical definition of foundation model. Rather, this is an informal label for a large family of models, and this family of models is likely to grow and change over time in response to new research. This poses challenges to reasoning about their fundamental properties. However, there is arguably one defining characteristic shared by all foundation models: they are self-supervised. In self-supervision, the model's sole objective is to learn abstract co-occurrence patterns in the sequences of symbols it was trained on. This task enables many of these models to generate plausible strings of symbols as well. There is no obvious sense in which this kind of self-supervision tells the model anything about what the symbols mean. The only information it is given directly is information about which words tend to co-occur with which other words. A foundation model might be trained on a wide range of different symbols: not just language but also computer code, database files, images, audio, and sensor readings. As long as it is just learning co-occurrence patterns of the sequences it is exposed to, then it counts as a foundation model by our definition. As part of this learning, the model might come to represent strong associations between a given piece of text and a particular sensor reading, or between a sequence of pixel values and a database entry. These associations might reflect important aspects of the world we inhabit and the language we use to talk about it. Our central question is whether a foundation model could come to understand a natural language. With the above, we can now sharpen it: is self-supervision sufficient for understanding, keeping in mind that there are no constraints on the data used for this supervision? In order to address this question, we first need to define what we mean by understanding. As a start, we find it helpful to make explicit a distinction that is sometimes conflated in discussions of the topic. The distinction is between the metaphysics and the epistemology of understanding. Metaphysics concerns what it would mean ("in principle") for an agent to achieve understanding. Epistemology, by contrast, concerns how ("in practice") we could ever come to know that an agent has achieved the relevant type of understanding. In short, metaphysics is more about our ultimate target, whereas epistemology is more about how (if at all) we could know when we have reached it. Our epistemology thus depends to some extent on our metaphysics. 
Foundation models (FM): Applications
The capabilities of foundation models indicate that they have the potential to transform various sectors and industries, extending the role AI plays in society. Among the myriad applications where foundation models may be applied, we will focus on three disciplines — healthcare, law, and education — that are all foundational to societal function. Within each, we discuss the opportunities that foundation models pose for this domain alongside challenges and concerns. 
Foundation models (FM): Technology
The technological foundations of foundation models give rise to the capabilities that determine their potential. To understand the technology used in development, we consider the data, model architectures and systems used to train, and further adapt, these models, alongside the theory that should be developed to understand this paradigm. To then understand the resulting models, we discuss how to evaluate and interpret them, alongside the importance of robustness, security and privacy, and long-term AI safety for ensuring the reliability of these models when deployed in society.
Foundation models (FM): Theory
Rigorous mathematical theory plays a foundational role in many engineering and science disciplines (e.g., information theory in electrical engineering). We believe that a theory of foundation models can be particularly beneficial in guiding technical decisions and innovations because of the huge computational costs associated with experimenting on foundation models. In addition, theoretical insights help elucidate fundamental limitations and explain surprising empirical phenomena. However, the community currently has a limited theoretical understanding of foundation models, despite much recent progress. Deep neural networks form the backbone of foundation models. Even in the well-studied supervised learning setting, where the train and test scenarios have the same distribution, there are numerous open questions around deep nets, such as understanding non-convex optimization, the implicit regularization effect of optimizers, and expressivity. Foundation models raise questions that go significantly beyond the supervised deep learning setting. The core problem in theoretically analyzing foundation models is understanding why training on one distribution with a possibly unsupervised/self-supervised loss leads to good adaptation performance on different downstream distributions and tasks. 
Foundation models (FM): Key properties
The five key properties of a foundation model are: expressivity — to flexibly capture and represent rich information; scalability — to efficiently consume large quantities of data; multimodality — to connect together various modalities and domains; memory capacity — to store the vast amount of accumulated knowledge; and compositionality — to generalize to new contexts, tasks and environments. 
  • During adaptation, a foundation model is converted into an adapted model (bottom row) in order to reflect updated information, desired behaviors, or deployment constraints.
  • Evaluation gives context to machine learning models: it serves as a means for (1) tracking progress — how do we measure the performance of models and how do we design improved models; (2) understanding — what behaviors do models exhibit and how do they perform on different slices of data; and (3) documentation — how do we efficiently summarize model behavior and communicate this to diverse stakeholders.
  • Foundation models signal a paradigm shift in which increasingly massive quantities of data are “fed” to these models for improved adaptation performance, with the overarching rule of thumb being "the more data the better". As previous sections have mentioned, this focus on data curation has raised concerns around the foundation model data lifecycle, including (1) managing data at such a large scale, (2) integrating data across new modalities, (3) reasoning over licensing and governance regulations — especially when considering the massive web crawls used in foundation model training, and (4) understanding data quality.
  • As central components in critical data-driven decision-making systems, machine learning models must address a variety of security and privacy threats. These threats can be characterized using the traditional CIA triad of computer security. ML systems should protect the Confidentiality of user data against inference and reconstruction attacks. Moreover, the secrecy of trained models themselves can be at risk of model stealing attacks. The Integrity of ML systems can be compromised by adversarial examples and data poisoning attacks. Finally, resource-depletion attacks can threaten the Availability of ML systems.

Foundation models (FM): Societal impact
The societal impact of foundation models, referring both to the construction of the models themselves and to their role in developing applications, requires careful examination. Specifically, we anticipate that foundation models will have wide-ranging societal consequences that are challenging to understand: foundation models are intermediary assets that are not directly deployed, but rather serve as a foundation that is further adapted. As a result, traditional approaches to reasoning about the societal impact of technology are likely complicated; societal impact is easier (but still difficult) to grasp for systems with well-specified purposes. In this chapter, we discuss how we may grapple with and begin to understand the complexity of the societal impact of foundation models. Specifically, we discuss (i) the harms with respect to inequity (fairness) and misuse, (ii) the impact with respect to the economy and environment, and (iii) the broader considerations with respect to the law (legality) and ethics.
  • The intrinsic bias present within foundation models is the byproduct of various training bias sources which, alongside biases introduced during adaptation, determine the extrinsic harms experienced by users in the context of specific downstream applications. We emphasize that the same foundation model is the shared foundation for many different applications; its biases propagate to these many applications as a result. Further, since the harms experienced by users are the result of specific adapted models, attributing these harms to the various processes and sources involved is both crucial and challenging.
  • In this section, we consider misuse of foundation models—situations where people use foundation models as they are intended to be used (e.g., to generate language), but where their capabilities are intentionally leveraged to cause harm to populations or individuals. This definition positions misuse concerns between those of inequity (where models can cause harm without bad intentions) and security (where bad actors exploit unintentional abilities or vulnerabilities in models to cause harm).
  • In this section, we describe how US law may influence, constrain, or foster the creation and use of foundation models. We note that the legal landscape surrounding algorithmic tools remains uncertain. We highlight issues pertaining to (1) model training, (2) liability for model predictions, and (3) protections for model outputs. Though understanding how the law affects foundation models is crucial, it is important to recognize that the law cannot be the only lens through which we evaluate the construction, maintenance, and use of foundation models. Ethical frameworks are necessary to understand where legally permissible applications of foundation models may still be ill-advised for the harms they inflict and are discussed in more depth in ethics and fairness. Studying the potential for misuse and possible security concerns is critical for preventing harmful outcomes ex ante, as opposed to the ex post treatment that legal mechanisms often provide.
  • Foundation models have the potential to substantially improve overall living standards by increasing productivity and innovation. These models can be deployed to substitute for human labor, augment humans, or help in the discovery of new tasks and opportunities, which can lead to increased concentration of ownership and power, or more decentralization. On a broader level, the result can be either increased inequality due to potential centralization, or more broadly shared prosperity due to the easier adaptation of foundation models for a wide range of applications. The ultimate outcomes on all these dimensions are not dictated solely by technology or economics, but by the choices and actions of technologists, policymakers, managers, workers, and other members of society.
The autoencoder model is the basis for training foundation models from a ton of data. We are talking about tens of billions of training examples, like a good portion of the Internet. With that much data, it is not economically feasible to hire humans to label it all and tell a model what its targets are. Thus, people came up with many clever ideas to derive training targets from the training examples themselves, [auto]matically.
  • The most straightforward idea is to just use the training data itself as the targets. 
  • Then, people tried hiding some parts of the training data and using those missing parts as the targets. This is called masking; predicting the hidden or upcoming tokens is essentially how LLMs are trained these days (see the sketch after this list).
  • Then, people tried pairing up text and images and using each as the other's target. This is called contrastive learning; it is the C in the famous CLIP model from OpenAI, which underpins many multimodal foundation models.
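To make the masking idea concrete, here is a minimal Python sketch of how training targets can be derived from the data itself; the [MASK] placeholder and the 15% masking rate are common-practice assumptions rather than any particular model's recipe.

    import random

    MASK_TOKEN = "[MASK]"   # placeholder symbol; an assumption, not tied to a specific model
    MASK_RATE = 0.15        # commonly used masking rate; also an assumption

    def make_masked_example(tokens, mask_rate=MASK_RATE):
        """Derive (inputs, targets) for masked-token prediction from raw tokens.

        The targets come from the data itself, so no human labelling is needed.
        """
        inputs, targets = [], []
        for tok in tokens:
            if random.random() < mask_rate:
                inputs.append(MASK_TOKEN)   # hide the token from the model...
                targets.append(tok)         # ...and ask it to predict the original
            else:
                inputs.append(tok)
                targets.append(None)        # no loss is computed at unmasked positions
        return inputs, targets

    # Usage: the training target is derived automatically from the example itself.
    sentence = "the network is omnipresent and the intelligence is everywhere".split()
    masked_inputs, labels = make_masked_example(sentence)
    print(masked_inputs)
    print(labels)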

GPT-4
  • Scale: GPT-4 reportedly has ~1.8 trillion parameters across 120 layers, over 10 times more than GPT-3.
  • Mixture of experts (MoE): OpenAI reportedly utilizes 16 experts within the model, each with ~111B parameters for the MLP (a minimal routing sketch follows this list).
  • Dataset: GPT-4 is trained on ~13T tokens, including both text-based and code-based data, with some fine-tuning data from ScaleAI and internal sources.
  • Dataset mixture: The training data included CommonCrawl & RefinedWeb, totalling 13T tokens. Speculation suggests additional sources like Twitter, Reddit, YouTube, and a large collection of textbooks.
  • Training cost: The training costs for GPT-4 were around $63 million, taking into account the computational power required and the time of training.
  • Inference cost: GPT-4 costs 3 times more than the 175B parameter Davinci, due to the larger clusters required and lower utilization rates.
  • Inference architecture: The inference runs on a cluster of 128 GPUs, using 8-way tensor parallelism and 16-way pipeline parallelism.
  • Vision multi-modal: GPT-4 includes a vision encoder for autonomous agents to read web pages and transcribe images and videos. This adds more parameters on top and it is fine-tuned with another ~2 trillion tokens.
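As a rough illustration of the mixture-of-experts idea reported above (the figures for GPT-4 are unverified, and this sketch is not OpenAI's implementation), the following Python/NumPy snippet shows top-2 routing of a single token over a handful of toy expert MLPs.

    import numpy as np

    rng = np.random.default_rng(0)
    D_MODEL, D_HIDDEN, N_EXPERTS, TOP_K = 16, 32, 4, 2   # toy sizes, purely illustrative

    # Each expert is a small two-layer MLP; the router is a single linear layer.
    experts = [(rng.standard_normal((D_MODEL, D_HIDDEN)) * 0.1,
                rng.standard_normal((D_HIDDEN, D_MODEL)) * 0.1) for _ in range(N_EXPERTS)]
    router = rng.standard_normal((D_MODEL, N_EXPERTS)) * 0.1

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def moe_layer(x):
        """Route one token vector to its top-k experts and mix their outputs."""
        gate = softmax(x @ router)                     # routing probabilities over experts
        top = np.argsort(gate)[-TOP_K:]                # indices of the k most relevant experts
        weights = gate[top] / gate[top].sum()          # renormalise over the selected experts
        out = np.zeros_like(x)
        for w, i in zip(weights, top):
            w1, w2 = experts[i]
            out += w * (np.maximum(x @ w1, 0.0) @ w2)  # weighted sum of expert MLP outputs
        return out

    token = rng.standard_normal(D_MODEL)
    print(moe_layer(token).shape)   # only TOP_K of the N_EXPERTS experts were evaluated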
GPT-4o - the "o" stands for "omni" - is a conversational model that improves capabilities over text, audio and vision. Unlike previous approaches that chained separate audio-transcription, language ("intelligence"), and text-to-speech models, this new model reasons across voice, text and vision. This makes GPT-4o much faster, but also more conversational and natural. It can understand the context and the tone of the conversation, and can answer using a variety of tones as well. It can also be interrupted, which improves the experience and makes it all the more natural. GPT-4o can be combined with:
  • real-time vision (images and videos), 
  • memory, 
  • GPTs, 
  • browse, and 
  • advanced data analysis.
The intelligence of AI models depends not only on their reasoning capabilities but also on their ability to understand context and emotional tone. New user experiences that enable more natural conversations—in terms of speed, emotional and contextual intelligence, and support for real-time information—are a significant step toward achieving Artificial General Intelligence.
VLM (vision-language model)
In the last few decades, Computer Vision (CV) and Natural Language Processing (NLP) have made several major technological breakthroughs in deep learning research. Following the recent popularity of Large Language Models (LLMs), several attempts have been made to extend them to the visual domain. From a visual assistant that could guide us through unfamiliar environments to generative models that produce images from only a high-level text description, vision-language model (VLM) applications will significantly impact our relationship with technology. However, many challenges need to be addressed to improve the reliability of these models. Whereas language is discrete, vision evolves in a much higher-dimensional space in which concepts cannot always be easily discretized.
Connecting vision to language will unlock several applications that will be key to the current AI-based technological revolution. Even though several works have already extended large language models to vision, connecting language to vision is not completely solved. For example, most models struggle to understand spatial relationships or to count without complicated engineering overhead that relies on additional data annotation. Many VLMs also lack an understanding of attributes and ordering. They often ignore some part of the input prompt, leading to significant prompt-engineering efforts to produce the desired result.
VLM pre-training aims to pre-train a VLM to learn image-text correlation, targeting effective zero-shot predictions on visual recognition tasks. Given image-text pairs, it first employs a text encoder and an image encoder to extract text and image features, and then learns the vision-language correlation with certain pre-training objectives. VLM pre-training has been explored with three typical objectives: contrastive objectives, generative objectives and alignment objectives.
  • Contrastive learning has been widely explored in VLM pre-training; it designs contrastive objectives for learning discriminative image-text features (a minimal sketch of such an objective follows this list).
  • Generative VLM pre-training learns semantic knowledge by learning to generate images or texts via masked image modelling, masked language modelling, masked cross-modal modelling and image-to-text generation. Generative Adversarial Networks (GANs) were a significant breakthrough in generative modeling, with the unique ability to generate realistic and diverse samples. GANs consist of two components, the generator and the discriminator, which engage in a continuous adversarial process. The generator produces synthetic images from random noise, while the discriminator aims to distinguish between these generated images and real images from the training dataset. Through backpropagation and optimization, the generator continually refines its outputs in response to the feedback from the discriminator and generates more realistic images. Diffusion-based methods model the image generation process as a series of diffusion steps, progressively refining the generated image. Diffusion models (DMs), also commonly known as diffusion probabilistic models, are a class of generative models founded on Markov chains and trained through weighted variational inference. The primary objective of DMs is to learn the impact of noise on the available information in a sample, i.e., the degree to which the diffusion process reduces the information available. The two-step process consists of forward and reverse diffusion. In the forward diffusion process, Gaussian noise is successively introduced until the data becomes pure noise. The reverse diffusion process trains a neural network to learn the conditional distribution probabilities, allowing the model to reverse the noise and reconstruct the original data effectively. Inspired by the success of autoregressive Transformers, autoregressive methods focus on sequentially predicting individual pixels or regions in an image, generating the output based on a learned probability distribution.
  • Alignment objectives enforce VLMs to align paired images and texts by learning to predict whether the given text describes the given image correctly. It can be broadly categorized into global image-text matching and local region-word matching for VLM pre-training.
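To make the contrastive objective above concrete, here is a minimal NumPy sketch of a symmetric image-text contrastive (InfoNCE-style) loss in the spirit of CLIP; the batch size, embedding dimension, and temperature are illustrative assumptions, and the random embeddings stand in for encoder outputs.

    import numpy as np

    def l2_normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    def contrastive_loss(img_emb, txt_emb, temperature=0.07):
        """Symmetric image-text contrastive loss over a batch.

        Matched image-text pairs sit on the diagonal of the similarity matrix;
        the loss pulls them together and pushes mismatched pairs apart.
        """
        img, txt = l2_normalize(img_emb), l2_normalize(txt_emb)
        logits = img @ txt.T / temperature            # (batch, batch) similarity matrix
        labels = np.arange(len(logits))               # the i-th image matches the i-th text

        def cross_entropy(lg, lb):
            logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
            return -logp[np.arange(len(lb)), lb].mean()

        # Average the image-to-text and text-to-image directions.
        return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

    rng = np.random.default_rng(0)
    batch, dim = 8, 64                                # stand-ins for encoder outputs
    print(contrastive_loss(rng.standard_normal((batch, dim)),
                           rng.standard_normal((batch, dim))))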

The general architecture of a VLM consists of an image encoder and a text encoder that produce embeddings, which are then combined in an image-text fusion layer; the fused vector is passed through an LLM to generate the final, visually aware text (a simplified sketch follows).
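A deliberately simplified, hypothetical sketch of that pipeline follows; the encoders are placeholders and the fusion is a single linear projection, so nothing here corresponds to a specific published VLM.

    import numpy as np

    rng = np.random.default_rng(0)
    D_IMG, D_TXT, D_LLM = 512, 256, 768   # toy embedding sizes, purely illustrative

    # Placeholders: a real system would use a vision encoder and a tokenizer + text encoder.
    def image_encoder(image):             # image: (H, W, 3) array
        return rng.standard_normal(D_IMG)

    def text_encoder(prompt):             # prompt: str
        return rng.standard_normal(D_TXT)

    # Image-text fusion layer: project both modalities into the LLM's embedding space.
    W_img = rng.standard_normal((D_IMG, D_LLM)) * 0.02
    W_txt = rng.standard_normal((D_TXT, D_LLM)) * 0.02

    def fuse(img_vec, txt_vec):
        return img_vec @ W_img + txt_vec @ W_txt

    def llm_generate(fused_vec):
        # Stand-in for an autoregressive decoder conditioned on the fused representation.
        return "a description grounded in both the image and the prompt"

    image = np.zeros((224, 224, 3))
    prompt = "What is shown in this picture?"
    print(llm_generate(fuse(image_encoder(image), text_encoder(prompt))))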
Core research challenges in multimodal learning: 
  • Representation studies how to represent and summarize multimodal data to reflect the heterogeneity and interconnections between individual modality elements.
  • Alignment aims to identify the connections and interactions across all elements. 
  • Reasoning aims to compose knowledge from multimodal evidence usually through multiple inferential steps for a task.
  • Generation involves learning a generative process to produce raw modalities that reflect cross-modal interactions, structure, and coherence. 
  • Transference aims to transfer knowledge between modalities and their representations. 
  • Quantification involves empirical and theoretical studies to better understand the multimodal learning process.
AI for IoT
The proliferation of IoT devices and sensors, coupled with advancements in AI algorithms and computing technologies, has paved the way for a new era of intelligent IoT systems. AI techniques are increasingly integrated into IoT architectures to enable advanced analytics, autonomous decision-making, and adaptive behaviors. From smart homes and cities to industrial automation and healthcare, AI-powered IoT solutions are revolutionizing the way we interact with and leverage data from connected devices, driving innovation, efficiency, and sustainability. This mini-series aims to explore the intersection of AI and IoT, covering cutting-edge research, real-world applications, and best practices in leveraging AI to enhance IoT systems and services.
  • AI-enabled IoT applications and use cases: Explore innovative applications and use cases where AI enhances IoT functionalities and capabilities, spanning smart healthcare, intelligent transportation, precision agriculture, industrial automation, environmental monitoring, and more.
  • AI-driven data analytics and decision-making: Investigate AI techniques for processing, analyzing, and deriving actionable insights from IoT-generated data streams, enabling predictive maintenance, anomaly detection, personalized recommendations, and intelligent automation (a minimal anomaly-detection sketch follows this list).
  • Generative AI and Large Language Models (LLMs) for IoT: Study the applications of generative AI in IoT environments; explore how LLMs, such as generative pre-trained transformer (GPT) models, can be utilized to generate synthetic data, enhance natural language understanding, and support human-machine interaction in IoT systems.
  • Edge AI and distributed intelligence: Discuss the integration of AI algorithms and models at the edge of IoT networks, enabling real-time inference, adaptive learning, and autonomous decision-making closer to data sources, and minimizing latency and bandwidth requirements.
  • AI-empowered robotics and sensing for IoT: Explore the integration of AI with robotics and sensing technologies in IoT systems. Introduce advancements in sensor technologies and data fusion techniques that enable intelligent data collection, processing, and analysis in dynamic environments.
  • Privacy, security, and trustworthiness: Address the privacy and security implications of AI-enabled IoT systems, including data privacy, confidentiality, integrity, and authenticity. Investigate trustworthiness in AI-enabled IoT systems, emphasizing the need for reliability, transparency, explainability, accountability, and fairness.
  • Standardization and interoperability: Discuss the challenges and opportunities in standardizing AI-enabled IoT technologies to ensure interoperability, compatibility, and seamless integration across heterogeneous IoT ecosystems. Explore emerging standards, protocols, and frameworks for facilitating collaboration and interoperability among AI and IoT technologies.
  • Demonstrations, Proof-of-Concepts, and deployments: Present innovative demonstrations and proof-of-concepts showcasing the integration of AI technologies with IoT systems. Share insights and best practices for deploying AI-powered IoT solutions in diverse environments, including smart cities, healthcare, agriculture, manufacturing, transportation, and energy.
  • Regulation and policy: Explore the regulatory and policy landscape surrounding AI and IoT technologies, including data governance, consumer protection, liability, accountability, and ethical considerations. Discuss the role of regulatory bodies, industry consortia, and international organizations in shaping ethical, legal, and policy frameworks to ensure the responsible development, deployment, and use of AI-powered IoT systems.
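As a small, self-contained illustration of the stream analytics mentioned in the "AI-driven data analytics" item above, the following Python sketch flags anomalous sensor readings with a rolling z-score; the window size, threshold, and synthetic temperature stream are illustrative assumptions rather than a recommended configuration.

    from collections import deque
    import math

    def detect_anomalies(readings, window=20, z_threshold=3.0):
        """Flag readings that deviate strongly from a rolling mean.

        A deliberately simple baseline for IoT telemetry; real deployments would
        tune the window and threshold or use learned models instead.
        """
        history = deque(maxlen=window)
        anomalies = []
        for t, value in enumerate(readings):
            if len(history) == window:
                mean = sum(history) / window
                std = math.sqrt(sum((x - mean) ** 2 for x in history) / window)
                if std > 0 and abs(value - mean) / std > z_threshold:
                    anomalies.append((t, value))
            history.append(value)
        return anomalies

    # Usage on a synthetic temperature stream with one injected spike.
    stream = [21.0 + 0.1 * (i % 5) for i in range(100)]
    stream[60] = 35.0
    print(detect_anomalies(stream))   # -> [(60, 35.0)]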

Processing units: CPU, DPU, GPU, TPU, NPU, APU, QPU

Since the release of ChatGPT in November 2022, Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks. LLMs’ ability to perform general-purpose language understanding and generation is acquired by training billions of model parameters on massive amounts of text data, as predicted by scaling laws. The research area of LLMs, while very recent, is evolving rapidly in many different ways. LLMs are large-scale, pre-trained, statistical language models based on neural networks. The recent success of LLMs is the accumulation of decades of research and development of language models, which can be categorized into four waves with different starting points and velocities:
  • statistical language models
  • neural language models
  • pre-trained language models
  • LLMs
Large language models (LLMs) mainly refer to transformer-based neural language models that contain tens to hundreds of billions of parameters. Emergent abilities include:
  • in-context learning, where LLMs learn a new task from a small set of examples presented in the prompt at inference time (see the prompt-construction sketch after this list)
  • instruction following, where LLMs, after instruction tuning, can follow the instructions for new types of tasks without using explicit examples
  • multi-step reasoning, where LLMs can solve a complex task by breaking down that task into intermediate reasoning steps as demonstrated in the chain-of-thought prompt
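To make the in-context learning item above concrete, here is a minimal Python sketch of how a few-shot prompt is assembled at inference time; the sentiment task, the example reviews, and the formatting are illustrative assumptions, and no model call or parameter update is involved.

    def build_few_shot_prompt(examples, query, instruction):
        """Assemble a few-shot prompt: the 'learning' happens entirely in the context window."""
        lines = [instruction, ""]
        for text, label in examples:
            lines.append(f"Review: {text}")
            lines.append(f"Sentiment: {label}")
            lines.append("")
        lines.append(f"Review: {query}")
        lines.append("Sentiment:")        # the model is expected to complete this line
        return "\n".join(lines)

    examples = [
        ("The battery lasts all week.", "positive"),
        ("The app crashes every time I open it.", "negative"),
    ]
    prompt = build_few_shot_prompt(
        examples,
        query="Setup took two minutes and everything just worked.",
        instruction="Classify the sentiment of each review as positive or negative.",
    )
    print(prompt)   # this string would be sent to the LLM; no parameters are updated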

Multimodal LLMs.  Inspired by the success of LLMs in natural language processing applications, an increasing number of research works now extend LLMs to perceive different modalities of information, such as images, video and audio. Multimodal LLMs (MLLMs) present substantial benefits compared to standard LLMs that process only text. By incorporating information from various modalities, MLLMs can achieve a deeper understanding of context, leading to more intelligent responses infused with a variety of expressions. Importantly, MLLMs align closely with human perceptual experiences, leveraging the synergistic nature of our multisensory inputs to form a comprehensive understanding of the world. Coupled with a user-friendly interface, MLLMs can offer intuitive, flexible, and adaptable interactions, allowing users to engage with intelligent assistants through a spectrum of input methods. According to the way the models are constructed, current MLLMs can generally be divided into three streams, alongside emerging applications:
  • pre-training, which aims to support different modalities using unified end-to-end models
  • derived from instruction tuning for NLP tasks, researchers fine-tune pre-trained LLMs using multimodal instructions (a sketch of such a training record follows this list)
  • prompting technique provides certain context, examples, or instructions to the model, fulfilling specialized tasks without changing the model parameters
  • visual reasoning application: recent visual reasoning systems tend to apply LLMs for better visual information analysis and visual-language integration
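As a hedged illustration of the instruction-tuning stream above, the following Python sketch shows the kind of (image, instruction, response) records that multimodal instruction tuning typically consumes; the field names and file paths are hypothetical and do not refer to any specific dataset.

    from dataclasses import dataclass

    @dataclass
    class MultimodalInstructionExample:
        """One training record for multimodal instruction tuning (hypothetical schema)."""
        image_path: str      # reference to the visual input
        instruction: str     # what the user asks the model to do
        response: str        # the desired, human-written answer

    dataset = [
        MultimodalInstructionExample(
            image_path="images/kitchen_001.jpg",   # hypothetical path
            instruction="List the appliances visible in this photo.",
            response="A refrigerator, an induction hob, and a smart oven.",
        ),
        MultimodalInstructionExample(
            image_path="images/traffic_017.jpg",   # hypothetical path
            instruction="Is it safe for a pedestrian to cross here? Explain briefly.",
            response="No. The pedestrian signal is red and a tram is approaching.",
        ),
    ]

    # During fine-tuning, each image is encoded, the instruction is tokenized, and the
    # model is optimised to produce the reference response for every record.
    for ex in dataset:
        print(ex.image_path, "->", ex.instruction)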