Data Mesh: Understanding Its Applications, Opportunities, and Constraints 


Data has undergone a metamorphosis in its perceived value and management within the corporate sphere. Previously underestimated and frequently discarded, data was often relegated to basic reports or neglected due to a lack of understanding and governance. This limited vision, combined with emerging technologies, led to an overwhelming influx of data with nowhere for it to go. Organizations had little to no governance over their data, and little understanding of what data they held or how long they had held it.

In the early 2000s, enterprises primarily used siloed databases: isolated data sets with limited accessibility. The 2010s saw the rise of Data Warehouses, which brought together disparate datasets but often led to bottlenecks. Data Lakes emerged as a solution to store vast quantities of raw data, but without adequate governance they quickly became swamps. Monolithic IT and data engineering groups struggled to document, catalog, and secure the growing stockpile of data. Product owners and teams that wanted or needed access to data had to request it and wait. Sometimes those requests ended up in a backlog and were forgotten.

In this new dawn of data awareness, the Data Mesh emerges as a revolutionary concept, enabling organizations to efficiently manage, process, and gain insights from their data. As organizations realize data’s pivotal role in digital transformation, it becomes imperative to shift from legacy architectures to more adaptive solutions, making Data Mesh an attractive option.  

 

The Basics of a Data Mesh 

When discussing data architecture concepts, the terms “legacy” or “traditional” imply centralized data management approaches, characterized by monolithic architectures developed and maintained by a data engineering organization within the company. Business units outside of IT would often feel left in the dark, waiting for the data team to address their specific needs, which led to inefficiencies.

First coined in 2019, the Data Mesh paradigm is a decentralized, self-service approach to data architecture. There are four central principles that Data Mesh is based on: Domain ownership, treating data as a product, self-service infrastructure, and federated computational governance. 

With Data Mesh, teams (Domains) are empowered to own and manage their data (Product). This requires stewardship at the team level to effectively manage their own resources to ingest, persist and serve data to their end users. Data stewards are responsible for the quality, reliability, security, and accessibility of the data. Data stewards bridge the gap between decentralized teams and enterprise-level governance and oversight. 

While teams enjoy autonomy, chaos would ensue without a federated governance approach. Federated governance ensures that standards, policies, and best practices are followed across all product owners and data stewards.
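To make the relationship between domain ownership and federated governance more concrete, here is a minimal sketch in Python. It is an illustration only, not a reference implementation: the `DataProduct` fields, the allowed classifications, and the `governance_violations` helper are all hypothetical names chosen for this example. The idea is that each domain describes its own data product, while one shared set of rules is applied to every domain.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """A dataset published by a domain team (all field names are illustrative)."""
    name: str
    domain: str            # owning team, e.g. "sales"
    steward: str           # person accountable for quality and access
    classification: str    # e.g. "public", "internal", "restricted"
    schema: dict = field(default_factory=dict)  # column name -> type name


# Federated governance: rules agreed on centrally and applied to every domain.
ALLOWED_CLASSIFICATIONS = {"public", "internal", "restricted"}


def governance_violations(product: DataProduct) -> list[str]:
    """Return the shared policies a data product fails; an empty list means compliant."""
    violations = []
    if not product.steward:
        violations.append("missing steward")
    if product.classification not in ALLOWED_CLASSIFICATIONS:
        violations.append("unknown classification")
    if not product.schema:
        violations.append("schema not documented")
    return violations


orders = DataProduct(
    name="orders_daily",
    domain="sales",
    steward="jane.doe",
    classification="internal",
    schema={"order_id": "str", "amount": "float"},
)
draft = DataProduct(name="leads_raw", domain="marketing", steward="", classification="secret")

print(governance_violations(orders))  # []
print(governance_violations(draft))   # ['missing steward', 'unknown classification', 'schema not documented']
```

The sales team stays free to shape `orders_daily` however it likes, but it cannot publish without a named steward, a recognized classification, and a documented schema.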

Implementing a Data Mesh requires significant investment, both in infrastructure and in equipping teams with the skills and expertise required to manage their own resources. It requires a fundamental change in a company’s mindset about how it treats data.

While a Lakehouse aims to combine the best of Data Lakes and Data Warehouses, Data Mesh ventures further by decentralizing ownership and control of data. While Data Fabric focuses on seamless data access and integration across disparate sources, Data Mesh emphasizes domain-based ownership. Event-driven architectures, on the other hand, prioritize real-time data flow and reactions, which can be complementary to Data Mesh.

When and Where to Implement Data Mesh 

  1. Large Organizations with Data-Rich Domains: In large organizations, departments often deal with a deluge of data. From Human Resources to Sales, each team has its own requirements for how its data is used, stored, and accessed. As teams consume more data, time to market and development efficiency suffer in centralized architectures. Reliance on external resources and time constraints are often the biggest issues. By implementing Data Mesh, teams can work independently and take control of their data, increasing efficiency and quality. As a result, teams can optimize and enrich their product offering and cut costs by streamlining ELT/ETL processes and workflows. With direct control over their data, teams can tune and tailor their data solutions to better meet customer needs.

  2. Complex Ecosystem: Organizations, especially those operating in dynamic environments with intricate interdependencies, often face challenges in centralized data structures. In such architectures, there’s limited control over resource allocation, utilization, and management, which can hinder teams from maximizing the potential of their data. Centralized approaches can curtail innovation due to rigid schemas, inflexible data pipelines, and lack of domain-specific customization. Data Mesh offers organizations the flexibility to adapt to evolving data needs and utilize domain-specific expertise to curate, process, and consume data tailored to their unique requirements.
  3. Rapidly Growing Data Environments: Today’s digital age sees organizations collecting data at an unprecedented scale. The sheer volume of data can be overwhelming with the influx of IoT devices, vendor integrations, user interactions, and digital transactions. Centralized teams often grapple with scaling issues, processing delays, and the challenge of timely data delivery. Data Mesh addresses this by distributing the data responsibility across different domains or teams. Multiple decentralized units handle the influx as data inflow increases, ensuring timely processing and reducing system downtime. The result is a more resilient data infrastructure ready to meet both current demands and future needs.

When Not to Implement Data Mesh 

  1. Small to Medium-sized Enterprises (SMEs): While Data Mesh presents numerous advantages, it may not be suitable for all organizations or projects. Smaller organizations typically handle lower data volumes and may not possess the resources needed to manage their data independently. In these cases, a centralized data architecture is more suitable, because it minimizes complications in design and maintenance when there are fewer people to manage them.
  2. Mature and Stable Centralized Architectures: Organizations usually only turn to new solutions when they are experiencing problems. If a well-established centralized architecture is performing well and fitting the needs of the company, there isn’t necessarily a need for Data Mesh adoption. Introducing a fundamental change in how data is managed is an expensive and disruptive undertaking. Building new infrastructure, expanding team capabilities, and changing organizational culture all take time.
  3. Short-term Projects: Implementing a Data Mesh requires significant time and resource investment. The benefits of a Data Mesh won’t be seen when building or designing a limited-lifespan project or proof of concept. If a project’s duration doesn’t justify the investment or its scope doesn’t require domain-specific data solutions, then the benefits of a Data Mesh won’t be realized. Traditional data architectures are usually more appropriate for these applications and don’t need the oversight and governance that a Data Mesh requires.

  

Opportunities Offered by Data Mesh 

  1. Scalability: Data Mesh enables organizations to scale their data processing capabilities more effectively by letting teams control how and when their data is processed, optimizing resource use and costs, and ensuring they remain agile amidst expanding data sources and consumer bases.
  2. Enhanced Data Ownership: Treating data as a product rather than a byproduct or a secondary asset is revolutionary. By doing so, Data Mesh promotes a culture with a clear sense of ownership and accountability. Domains or teams that “own” their data are more inclined to ensure its quality, accuracy, and relevance. This fosters an environment where data isn’t just accumulated but is curated, refined, and optimized for its intended purpose. Over time, this leads to richer, more valuable data sets that genuinely serve the organization’s needs.
  3. Speed and Innovation: Decentralization is synonymous with autonomy. When teams have the tools and the mandate to manage their data, they are not bogged down by cross-team dependencies or bureaucratic delays. They can innovate, experiment, and iterate at a faster pace, resulting in expanded data collection and richer data sets. This agility accelerates data product development, enabling organizations to adapt to changing needs quickly, capitalize on new opportunities, and stay ahead of the curve in the competitive market.
  4. Improved Alignment with Modern Architectures: Decentralization isn’t just a trend in data management; it’s a broader shift seen in modern organizational architectures, especially with the rise of microservices. Data Mesh naturally aligns with these contemporary structures, creating a cohesive environment where data and services coexist harmoniously. This alignment reduces friction, simplifies integrations, and ensures that the entire organizational machinery, both services and data, operates in a unified, streamlined manner.
  5. Enhanced Collaboration: As domains take ownership of their data, there’s an inclination to collaborate with other domains. This cross-functional collaboration fosters knowledge sharing, best practices, and a unified approach to data challenges, driving more holistic insights.

Constraints and Challenges 

  1. Cultural Shift: Teams may not want to own their data, or may lack the experience to take on the responsibility. Training initiatives, workshops, and even hiring external experts might be necessary to bridge these skill gaps.
  2. Increased Complexity: Developing an environment that supports a Data Mesh architecture is not without its challenges. As the Data Mesh model expands, managing the growing number of interconnected resources and solving integration issues to ensure smooth communication between various domains can be a considerable obstacle. Planning appropriately to support teams with access, training, and management of a Data Mesh is critical to its evolution and success. This includes well-defined requirements for APIs, data exchange, and interface protocols.
  3. Cost Implications: Transitioning to a Data Mesh could entail substantial upfront costs, including hiring additional resources, training personnel, investing in new infrastructure, and possibly overhauling existing systems.
  4. Governance: Data Governance has become a hot topic as data architectures grow and mature. Ensuring a consistent view of data across all domains can be challenging, especially when multiple teams update or alter their datasets independently. Tools to manage integrity, security, and compliance are a requirement in a Data Mesh architecture. The need for teams to have autonomy in a decentralized environment must be balanced with a flexible but controlled governance model, which is the foundation of federated governance. Designing that model around team requirements can be a challenge at first, but it’s an important step to take as early as possible when building a data platform.
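One way to picture the “well-defined requirements for APIs, data exchange, and interface protocols” mentioned above is a simple data contract that a producing domain publishes and consuming domains can check records against. The Python sketch below is only an assumption about what such a check might look like; `ORDERS_CONTRACT` and `contract_errors` are hypothetical names, and a real platform would more likely rely on a schema registry or a validation library.

```python
# Hypothetical contract for an "orders" data product: field name -> required type.
ORDERS_CONTRACT = {"order_id": str, "customer_id": str, "amount": float}


def contract_errors(record: dict, contract: dict) -> list[str]:
    """List the ways a record breaks the contract; an empty list means it conforms."""
    errors = []
    for field_name, expected_type in contract.items():
        if field_name not in record:
            errors.append(f"missing field: {field_name}")
        elif not isinstance(record[field_name], expected_type):
            errors.append(f"wrong type for {field_name}")
    return errors


good = {"order_id": "A-1", "customer_id": "C-9", "amount": 19.99}
bad = {"order_id": "A-2", "amount": "19.99"}  # no customer_id, amount sent as text

print(contract_errors(good, ORDERS_CONTRACT))  # []
print(contract_errors(bad, ORDERS_CONTRACT))   # ['missing field: customer_id', 'wrong type for amount']
```

Because the contract is plain data, the same check can run on the producer’s side before publishing and on the consumer’s side before loading, which helps keep a consistent view of the data across domains.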

Skillset: Evolving with the Data Mesh Paradigm

Along with an evolved mindset, the Data Mesh paradigm demands expertise that may not have previously been cultivated within traditional data teams. This transition from central data lakes to domain-oriented data products introduces complexities requiring a deep understanding of the data and the specific use cases it serves, both internally and externally. Skills such as collaboration, domain-specific knowledge translation, and data stewardship become vital. As data responsibility becomes decentralized, each team member’s role becomes more critical in ensuring data integrity, relevance, and security. As data solutions evolve, teams must adopt a mindset of perpetual learning, keeping pace with the latest methodologies, tools, and best practices related to managing their data effectively.

Embracing the Data Mesh

In the evolving landscape of data management, the Data Mesh presents a promising alternative to traditional architectures. It’s a journey of empowerment, efficiency, and decentralization. The burgeoning community support for Data Mesh, evident from the increasing number of case studies, forums, and tools developed around it, underscores its pivotal role in the future of data management. However, its success hinges on an organization’s readiness to embrace the cultural and operational shifts it demands. As with all significant transformations, due diligence, meticulous planning, and an understanding of the underlying principles are crucial for its fruitful adoption. Embracing the Data Mesh is more than just a technological shift; it’s a paradigm transformation. Organizations willing to make this leap will find themselves not just keeping up with the rapid pace of data evolution but leading the charge in innovative, data-driven solutions.  

Digital Transformation Trends that Future-Proof Your Business


The core of future-proofing your business lies in the incorporation of cutting-edge technological trends and strategic digitization of your business operations. Combining new, transformative solutions with tried-and-true business methods is not only a practical approach but an essential one when competing in this digital age. Using the latest digital transformation trends as your guide, start envisioning the journey of future-proofing your business in order to unlock the opportunities of tomorrow. 

#1 Personalization  

The importance of personalized customer experiences should not be understated. More than ever, consumers are faced with endless options. To stand out from competitors, businesses must use data and customer behavior insights to curate tailored and dynamic customer journeys that both delight and command their audience. Analyze purchasing history, demographics, web activity, and other data to understand your customer, as well as their likes and dislikes. Use these insights to design customized customer experiences that increase conversion, retention, and ultimately, satisfaction.  

#2 Artificial Intelligence  

AI is everywhere. From autonomous vehicles and smart homes to digital assistants and chatbots, artificial intelligence is being used in a wide array of applications to improve, simplify, and speed up the tasks of everyday life. For businesses, AI and machine learning have the power to extract and decipher large amounts of data that can help predict trends and forecasts, deliver interactive personalized customer experiences, and streamline operational processes. Companies that lean on AI-driven decisions are propelled into a world of efficiency, precision, automation, and competitiveness.  

#3 Sustainability 

Enterprises, particularly those in the manufacturing industry, face increasing pressure to act more responsibly and consider environmental, social, and corporate governance (ESG) goals when making business decisions. Digital transformations are one way to support internal sustainable development because they lead to reduced waste, optimized resource use, and improved transparency. With sustainability in mind, businesses can build their data and technology infrastructures to reduce impact. For example, companies can switch to more energy-efficient hardware or decrease electricity consumption by migrating to the cloud.  

#4 Cloud Migration 

More and more companies are migrating their data from on-premises to the cloud. In fact, by 2027, it is estimated that 50% of all enterprises will use cloud services. What is the reason behind this massive transition? Cost saving is one of the biggest factors. Leveraging cloud storage platforms eliminates the need for expensive data centers and server hardware, thereby reducing major infrastructure expenditures. And while navigating a cloud migration project can seem challenging, many turn to cloud computing partners to lead the data migration and ensure a painless shift.

Future-Proof Your Business Through Digital Transformation with Kopius

By embracing these digital transformation trends, your company is not only adapting to the current business landscape but also unlocking new opportunities for growth. Future-proofing your business requires a combination of strategic acumen and technical expertise. This is precisely where a digital transformation partner, who possesses an intimate grasp of these trends, can equip your business with the resources and solutions to confidently evolve. Reach out to Kopius today and let’s discuss a transformational journey that will future-proof your business for the digital future.  

A Step-By-Step Guide to Customer Experience Personalization


Winning the interest and loyalty of customers means more than just offering a superior product or service. The secret lies in a powerful strategy called personalization, a dynamic approach that tailors the customer experience to meet individual needs and preferences. As businesses across industries strive to create lasting connections with their customers and meet their evolving expectations, the importance of personalization in the customer experience cannot be overstated. Read on to explore the compelling case for customer personalization and a step-by-step guide on how your business can embark on this journey to elevate the customer experience.

Let’s face it, generic offerings are outdated. Today, customers yearn for something more; they want an experience that resonates with their unique tastes. Personalization is the magic ingredient that taps into this desire. By tailoring products, services, and interactions to individual preferences, businesses create a sense of connection that fosters lasting loyalty. And beyond that, research from McKinsey found that companies who implemented a personalization strategy generated 40% more revenue than their counterparts who placed less emphasis on this approach. All signs point to tailored customer journeys.  

Data lies at the heart of personalization, offering insights into customer behaviors. More than ever, companies have access to a wealth of customer information, such as past purchases and browsing habits, that act as the building blocks to these insights. Leveraging advanced analytics and artificial intelligence, businesses can uncover valuable patterns and trends, guiding them to craft personalized experiences for their customers. 

Building a successful personalization strategy requires thoughtful consideration and calculated execution. If you are just getting started, follow these steps to build an improved and tailored customer experience that will drive remarkable results for your business:

Step 1: Gather as Much Customer Data as Possible.

At the core of every successful personalization strategy lies a deep understanding of your customers. To lay this solid foundation, start by gathering valuable data from multiple touchpoints along their journey, including website interactions, purchase history, and customer feedback. Take advantage of powerful tools like customer relationship management (CRM) software, website analytics, and social media insights to gain a holistic view of your customers’ preferences, behaviors, and pain points.

Step 2: Divide Your Customers Into Audience Segments.

With an abundance of data at your fingertips, it is time to move on to segmentation. Divide your customers into distinct groups based on shared traits like demographics, purchase behavior, and interests. Audience segmentation empowers you to personalize your messaging or offerings, address individual customer needs with accuracy, and create a sense of relevance.
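If you want a feel for what segmentation looks like in practice, here is a minimal Python sketch. Everything in it is made up for illustration: the customer records, the age cut-off of 35, and the “six or more orders” threshold are assumptions, not recommendations. Real segments should come from your own data and goals.

```python
from collections import defaultdict

# Made-up customer records; the fields and thresholds are illustrative only.
customers = [
    {"id": 1, "age": 24, "orders_last_year": 12},
    {"id": 2, "age": 41, "orders_last_year": 1},
    {"id": 3, "age": 37, "orders_last_year": 8},
    {"id": 4, "age": 29, "orders_last_year": 0},
]


def segment_for(customer: dict) -> str:
    """Assign a segment from an age band and purchase frequency."""
    age_band = "under_35" if customer["age"] < 35 else "35_plus"
    frequency = "frequent" if customer["orders_last_year"] >= 6 else "occasional"
    return f"{age_band}/{frequency}"


# Group customer ids by the segment they fall into.
segments = defaultdict(list)
for customer in customers:
    segments[segment_for(customer)].append(customer["id"])

for name, ids in segments.items():
    print(name, ids)
```

Each resulting group can then receive its own messaging, for example a loyalty reward for frequent buyers and a win-back offer for occasional ones.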

Step 3: Get Personal With Your Messaging.

Now that you have completed the segmentation process, it’s time to get personal! Start by creating interesting content with tailored product recommendations, and design exclusive offers that cater specifically to the unique preferences of each of your audience segments. By doing so, you will create truly personalized experiences that captivate your audience and leave an impression.

Step 4: Automate Dynamic Content Delivery. 

Offer real-time digital experiences that resonate with your customers’ interests and past interactions. Embracing innovative technologies like artificial intelligence allows you to analyze customer data, predict behavior, and implement an effective personalization strategy that delivers tailored experiences on the fly. AI-powered chatbots take personalized support a step further, offering instant assistance to resolve customer concerns and boost overall customer satisfaction levels.

Step 5: Track Your Personalization Campaigns. 

Monitor the impact of your personalization strategy on customer engagement, satisfaction, and business performance. Evaluate key metrics like conversion rates and customer retention to assess their effectiveness. Utilize any insights gained to identify areas for improvement and modify your approach accordingly. 
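The two metrics named above come down to simple arithmetic. The Python sketch below shows one common way to compute them, using made-up campaign numbers. The function names are our own, and the retention formula (customers at the end of a period, minus new customers, divided by customers at the start) is one widely used definition rather than the only one.

```python
def conversion_rate(conversions: int, visitors: int) -> float:
    """Share of visitors who completed the desired action."""
    return conversions / visitors if visitors else 0.0


def retention_rate(customers_at_start: int, customers_at_end: int, new_customers: int) -> float:
    """Share of the customers you started the period with who are still customers."""
    if not customers_at_start:
        return 0.0
    return (customers_at_end - new_customers) / customers_at_start


# Made-up numbers: a personalized campaign compared with a generic control group.
personalized = conversion_rate(conversions=48, visitors=1200)
control = conversion_rate(conversions=30, visitors=1200)

print(f"personalized: {personalized:.1%}")  # personalized: 4.0%
print(f"control: {control:.1%}")            # control: 2.5%
print(f"retention: {retention_rate(200, 210, 40):.0%}")  # retention: 85%
```

Comparing a personalized segment against a control group, rather than looking at one number in isolation, makes it easier to tell whether the personalization itself is driving the change.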

The possibilities for designing a personalized digital experience are limitless. AI-powered chatbots provide real-time personalized support, making customers feel valued and cared for. Dynamic content delivery ensures website experiences are based on individual preferences. Personalization will enrich the customer journey, increasing engagement and conversion rates. If you are ready to deliver personalized experiences, Kopius is here to help. Let’s team up to create extraordinary customer experiences for your business! 

5 Industries Winning at Artificial Intelligence


By Lindsay Cox

Artificial Intelligence (AI) and Machine Learning (ML) were already the technologies on everyone’s radar when the year started, and the release of foundation models like ChatGPT only increased the excitement about the ways that data technology can change our lives and our businesses. We are excited about these five industries that are winning at artificial intelligence.

Data and AI projects are right in our organization’s sweet spot. ChatGPT is very much in the news right now (and is a super cool tool – you can check it out here if you haven’t already).

I also enjoyed watching Watson play Jeopardy as a former IBMer 😊

Here are a few real-world examples of how organizations in five industries are winning at AI. We have included those use cases along with examples where our clients have been leading the way on AI-related projects.

You can find more case studies about digital transformation, data, and software application development in our Case Studies section of the website.

Consumer brands: Visualizing made easy

Brands are helping customers to visualize the outcome of their products or services using computer vision and AI. Consumers can virtually try on a new pair of glasses, a new haircut, or a fresh outfit, for example.  AI can also be used to visualize a remodeled bathroom or backyard.

We helped a teledentistry, web-first brand develop a solution using computer vision to show a customer how their smile would look after potential treatment. We paired the computer vision solution with a mobile web application so customers could “see their new selfie.” 

Consumer questions can be resolved faster and more accurately

Customer service can make or break customer loyalty, which is why chatbots and virtual assistants are being deployed at scale to reduce average handle time and average speed-of-answer, and to increase first-call resolutions.

We worked with a regional healthcare system to design and develop a “digital front door” to improve patient and provider experiences. The solution includes an interactive web search and chatbot functionality. By getting answers to patients and providers more quickly, the healthcare system is able to increase satisfaction and improve patient care and outcomes.

Finance: Preventing fraud

There’s a big opportunity for financial services organizations to use AI and deep learning solutions to recognize suspicious transactions and thwart credit card fraud, which helps reduce costs. Banks generate huge volumes of data that can be used to train machine learning models to flag fraudulent transactions, a technique also known as anomaly detection.
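To show the basic idea behind anomaly detection, here is a deliberately simple Python sketch. It is not a machine learning model and not how any particular bank does it: it swaps in a plain statistical rule (a z-score against a cardholder’s past purchases) as a stand-in, and the amounts and the threshold of 3 standard deviations are made up for illustration.

```python
from statistics import mean, stdev

# Made-up amounts (in dollars) from one cardholder's past, legitimate purchases.
history = [42.0, 38.5, 51.0, 45.2, 40.1, 47.3, 39.9, 44.0]
baseline_mean = mean(history)
baseline_std = stdev(history)


def is_anomalous(amount: float, threshold: float = 3.0) -> bool:
    """Flag an amount more than `threshold` standard deviations from the baseline."""
    z_score = (amount - baseline_mean) / baseline_std
    return abs(z_score) > threshold


new_transactions = [46.0, 980.0, 41.5]
flagged = [amount for amount in new_transactions if is_anomalous(amount)]
print(flagged)  # [980.0]
```

Production systems look at far more than the amount (merchant, location, time of day, device) and learn their thresholds from data, but the principle is the same: learn what normal looks like, then flag what falls far outside it.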

Agriculture: Supporting ESG goals by operating more sustainably

Data technologies like computer vision can help organizations see things that humans miss. This can help with the climate crisis, because the things they detect can include water waste, energy waste, and misdirected landfill waste.

The agritech industry is already harnessing data and AI, since our food producers and farmers are under extreme pressure to produce more crops with less water. For example, John Deere created a robot called “See and Spray” that uses computer vision technology to monitor cotton plants and spray herbicide on them in precise amounts.

We worked with PrecisionHawk to use computer vision combined with drone-based photography to analyze crops and fields to give growers precise information to better manage crops. The data produced through the computer vision project helped farmers to understand their needs and define strategies faster, which is critical in agriculture.

Healthcare: Identify and prevent disease

AI has an important role to play in healthcare, with uses ranging from patient call support to the diagnosis and treatment of patients.

For example, healthcare companies are creating clinical decision support systems that warn a physician in advance when a patient is at risk of having a heart attack or stroke, adding critical time to their response window.

AI-supported e-learning is also helping to design learning pathways, personalized tutoring sessions, content analytics, targeted marketing, automatic grading, etc. AI has a role to play in addressing the critical healthcare training need in the wake of a healthcare worker shortage.

Artificial intelligence and machine learning are emerging as the most game-changing technologies at play right now. These are a few examples that highlight the broad use and benefits of data technologies across industries. The actual list of use cases and examples is infinite and expanding.

What needs to happen for your company to win at artificial intelligence? To learn more about Artificial Intelligence and Machine Learning, reach out to us today! Kopius is a leader in nearshore digital technology consulting and services.



Addressing AI Bias – Four Critical Questions


By Hayley Pike

As AI becomes even more integrated into business, so does AI bias.

On February 2, 2023, Microsoft released a statement from Vice Chair & President Brad Smith about responsible AI. In the wake of the newfound influence of ChatGPT and Stable Diffusion, considering the history of racial bias in AI technologies is more important than ever.

The discussion around racial bias in AI has been going on for years, and with it, there have been signs of trouble. Google fired two of its researchers, Dr. Timnit Gebru and Dr. Margaret Mitchell, after they published research papers outlining how Google’s language and facial recognition AI were biased against women of color. And speech recognition software from Amazon, Microsoft, Apple, Google, and IBM misidentified speech from Black people at a rate of 35%, compared to 19% for speech from White people.

In more recent news, DEI tech startup Textio analyzed ChatGPT and showed that it skewed toward writing job postings for younger, male, White candidates, and that the bias increased for prompts about more specific jobs.

If you are working on an AI product or project, you should take steps to address AI bias. Here are four important questions to help make your AI more inclusive:

  1. Have we incorporated ethical AI assessments into the production workflow from the beginning of the project? Microsoft’s Responsible AI resources include a project assessment guide.
  2. Are we ready to disclose our data source strengths and limitations? Artificial intelligence is as biased as the data sources it draws from. The project should disclose who the data is prioritizing and who it is excluding.
  3. Is our AI production team diverse? How have we accounted for the perspectives of people who will use our AI product but are not represented in the project team or tech industry?
  4. Have we listened to diverse AI experts? Dr. Joy Buolamwini and Dr. Inioluwa Deborah Raji, currently at the MIT Media Lab, are two Black female researchers who are pioneers in the field of racial bias in AI.

Rediet Abebe is a computer scientist and co-founder of Black in AI. Abebe sums it up like this:

“AI research must also acknowledge that the problems we would like to solve are not purely technical, but rather interact with a complex world full of structural challenges and inequalities. It is therefore crucial that AI researchers collaborate closely with individuals who possess diverse training and domain expertise.”

To learn more about artificial intelligence and machine learning, reach out to us today! Kopius is a leader in nearshore digital technology consulting and services.



ChatGPT and Foundation Models: The Future of AI-Assisted Workplace


By Yuri Brigance

The rise of generative models such as ChatGPT and Stable Diffusion has generated a lot of discourse about the future of work and the AI-assisted workplace. There is tremendous excitement about the awesome new capabilities such technology promises, as well as concerns over losing jobs to automation. Let’s look at where we are today, how we can leverage these new AI-generated text technologies to supercharge productivity, and what changes they may signal to a modern workplace.

Will ChatGPT Take Away Your Job?

That’s the question on everyone’s mind. AI can generate images, music, text, and code. Does this mean that your job as a designer, developer, or copywriter is about to be automated? Well, yes. Your job will be automated in the sense that it is about to become a lot more efficient, but you’ll still be in the driver’s seat.

First, not all automation is bad. Before personal computers became mainstream, taxes were completed with pen and paper. Did modern tax software put accountants out of business? Not at all. It made their job easier by automating repetitive, boring, and boilerplate tasks. Tax accountants are now more efficient than ever and can focus on mastering tax law rather than wasting hours pushing paper. They handle more complicated tax cases, those personalized and tailored to you or your business. Similarly, it’s fair to assume that these new generative AI tools will augment creative jobs and make them more efficient and enjoyable, not supplant them altogether.

Second, generative models are trained on human-created content. This ruffles many feathers, especially among those in the creative industry whose art is being used as training data without the artist’s explicit permission, allowing the model to replicate their unique artistic style. Stability.ai plans to address this problem by enabling artists to opt out of having their work be part of the dataset, but realistically there is no way to guarantee compliance and no definitive way to prove whether your art is still being used to train models. But this does open interesting opportunities. What if you licensed your style to an AI company? If you are a successful artist and your work is in demand, there could be a future where you license your work to be used as training data and get paid any time a new image is generated based on your past creations. It is possible that responsible AI creators could measure the gradient updates during training, and the percentage of neuron activations associated with specific samples of data, to estimate how much of your licensed art was used by the model to generate an output. This would work much like Spotify paying a small fee to the musician every time someone plays one of their songs, or websites like Flaticon.com paying a fee to the designer every time one of their icons is downloaded. Long story short, it is likely that soon we’ll see stricter controls over how training datasets are constructed with regard to licensed work versus public domain.

Let’s look at some positive implications of this AI-assisted workplace and technology as it relates to a few creative roles and how this technology can streamline certain tasks.

As a UI designer, when designing web and mobile interfaces you likely spend significant time searching for stock imagery. The images must be relevant to the business, have the right colors, allow for some space for text to be overlaid, etc. Some images may be obscure and difficult to find. Hours could be spent finding the perfect stock image. With AI, you can simply generate an image based on text prompts. You can ask the model to change the lighting and colors. Need to make room for a title? Use inpainting to clear an area of the image. Need to add a specific item to the image, like an ice cream cone? Show AI where you want it, and it’ll seamlessly blend it in. Need to look up complementary RGB/HEX color codes? Ask ChatGPT to generate some combinations for you.

Will this put photographers out of business? Most likely not. New devices continue to come out, and they need to be incorporated into the training data periodically. If we are clever about licensing such assets for training purposes, you might end up making more revenue than before, since AI can use a part of your image and pay you a partial fee for each request many times a day, rather than having one user buy one license at a time. Yes, work needs to be done to enable this functionality, so it is important to bring this up now and work toward a solution that benefits everyone. But generative models trained today will be woefully outdated in ten years, so the models will continue to require fresh human-generated real-world data to keep them relevant. AI companies will have a competitive edge if they can license high-quality datasets, and you never know which of your images the AI will use – you might even figure out which photos to take more of to maximize that revenue stream.

Software engineers, especially those in professional services, frequently need to switch between multiple programming languages. Even on the same project, they might use Python, JavaScript / TypeScript, and Bash at the same time. It is difficult to context switch and remember all the peculiarities of a particular language’s syntax. How do you efficiently write a for-loop in Python vs Bash? How do you deploy a Cognito User Pool with a Lambda authorizer using AWS CDK? We end up Googling these snippets because working with this many languages forces us to remember high-level concepts rather than specific syntactic sugar. GitHub Gist exists for the sole purpose of offloading snippets of useful code from local memory (your brain) to external storage. With so much to learn, and things constantly evolving, it’s easier to be aware that a particular technique or algorithm exists (and where to look it up) rather than remember it in excruciating detail as if reciting a poem. Tools like ChatGPT integrated directly into the IDE would reduce the amount of time developers spend remembering how to create a new class in a language they haven’t used in a while, how to set up branching logic, or how to build a script that moves a bunch of files to AWS S3. They could simply ask the IDE to fill in this boilerplate and move on to solving the more interesting algorithmic challenges.

An example of asking ChatGPT how to use Python decorators. The text and example code snippet are very informative.
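
For readers who want something concrete, here is a minimal sketch of the kind of decorator example such a prompt might return. It is not ChatGPT’s actual output, and the names `timed` and `add` are hypothetical, chosen only for illustration.

```python
import functools
import time


def timed(func):
    """Decorator that records how long the most recent call to `func` took."""

    @functools.wraps(func)  # keep the original function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        wrapper.last_elapsed = time.perf_counter() - start
        return result

    wrapper.last_elapsed = None  # no call has been timed yet
    return wrapper


@timed
def add(a, b):
    """Return the sum of two numbers."""
    return a + b
```

Calling `add(2, 3)` still returns the sum, while `add.last_elapsed` holds the duration of that call in seconds.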

For copywriters, it can be difficult to overcome the writer’s block of not knowing where to start or how to conclude an article. Sometimes it’s challenging to concisely describe a complicated concept. ChatGPT can be helpful in this regard, especially as a tool to quickly look up clarifying information about a topic. Caution is justified, though, as demonstrated recently by Stephen Wolfram, CEO of Wolfram Research and creator of Wolfram|Alpha, who makes a compelling argument that ChatGPT’s answers should not always be taken at face value. So doing your own research is key. That being the case, OpenAI’s model usually provides a good starting point for explaining a concept, and at the very least it can provide pointers for further research. But for now, writers should always verify its answers. Let’s also be reminded that ChatGPT has not been trained on any new information created after the year 2021, so it is not aware of new developments in the war in Ukraine, current inflation figures, or the recent fluctuations of the stock market, for example.

In Conclusion

Foundation models like ChatGPT and Stable Diffusion can augment and streamline workflows, and they are still far from being able to directly threaten a job. They are useful tools that are far more capable than narrowly focused deep learning models, and they require a degree of supervision and caution. Will these models become even better 5-10 years from now? Undoubtedly so. And by that time, we might just get used to them and have several years of experience working with these AI agents, including their quirks and bugs.

There is one important thing to take away about foundation models and the future of the AI-assisted workplace: today they are still very expensive to train. They are not connected to the internet and can’t consume information in real time, in an online incremental training mode. There is no database to load new data into, which means that to incorporate new knowledge, the dataset must grow to encapsulate recent information, and the model must be fine-tuned or re-trained from scratch on this larger dataset. It’s difficult to verify that the model outputs factually correct information since the training dataset is unlabeled and the training procedure is not fully supervised. There are interesting open source alternatives on the horizon (such as the U-Net-based Stable Diffusion), and techniques to fine-tune portions of the larger model to a specific task at hand, but those are more narrowly focused, require a lot of tinkering with hyperparameters, and are generally out of scope for this particular article.

It is difficult to predict exactly where foundation models will be in five years and how they will impact the AI-assisted workplace since the field of machine learning is rapidly evolving. However, it is likely that foundation models will continue to improve in terms of their accuracy and ability to handle more complex tasks. For now, though, it feels like we still have a bit of time before seriously worrying about losing our jobs to AI. We should take advantage of this opportunity to hold important conversations now to ensure that the future development of such systems maintains an ethical trajectory.

To learn more about our generative AI solutions, reach out to us today! Kopius is a leader in nearshore digital technology consulting and services.



What Separates ChatGPT and Foundation Models from Regular AI Models?


By Yuri Brigance

This post introduces what separates foundation models from regular AI models. We explore the reasons these models are difficult to train and how to understand them in the context of more traditional AI models.

What Are Foundation Models?

What are foundation models, and how are they different from traditional deep learning AI models? Stanford’s Center for Research on Foundation Models, part of the Stanford Institute for Human-Centered AI, defines a foundation model as “any model that is trained on broad data (generally using self-supervision at scale) that can be adapted to a wide range of downstream tasks”. This describes a lot of narrow AI models as well, such as MobileNets and ResNets, which can also be fine-tuned and adapted to different tasks.

The key distinctions here are “self-supervision at scale” and “wide range of tasks”.

Foundation models are trained on massive amounts of unlabeled/semi-labeled data, and the model contains orders of magnitude more trainable parameters than a typical deep learning model meant to run on a smartphone. This makes foundation models capable of generalizing to a much wider range of tasks than smaller models trained on domain-specific datasets. It is a common misconception that throwing lots of data at a model will suddenly make it do anything useful without further effort. In reality, such large models are very good at finding and encoding intricate patterns in the data with little to no supervision. Those patterns can be exploited in a variety of interesting ways, but a good amount of work needs to happen before this learned hidden knowledge becomes useful.

The Architecture of AI Foundation Models

Unsupervised, semi-supervised, and transfer learning are not new concepts, and to a degree, foundation models fall into this category as well. These learning techniques trace their roots back to the early days of generative modeling such as Restricted Boltzmann Machines and Autoencoders. These simpler models consist of two parts: an encoder and a decoder. The goal of an autoencoder is to learn a compact representation (known as encoding or latent space) of the input data that captures the important features or characteristics of the data, aka “progressive linear separation” of the features that define the data. This encoding can then be used to reconstruct the original input data or generate entirely new synthetic data by feeding cleverly modified latent variables into the decoder.

An example of a convolutional image autoencoder architecture. The model is trained to reconstruct its own input (e.g., images). Intelligently modifying the latent space allows us to generate entirely new images. One can expand this by adding an extra model that encodes text prompts into latent representations understood by the decoder to enable text-to-image functionality.
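
To make the encoder, latent space, and decoder concrete, here is a toy sketch in plain Python. It is a linear autoencoder on made-up 2-D points, not a convolutional image model; the data, starting weights, and learning rate are all assumptions chosen for illustration.

```python
# Toy linear autoencoder: 2-D input -> 1-D latent code -> 2-D reconstruction.
# All data and hyperparameters below are made up for illustration.

def encode(w_enc, x):
    """Compress a 2-D point into a single latent value."""
    return w_enc[0] * x[0] + w_enc[1] * x[1]


def decode(w_dec, z):
    """Expand a latent value back into a 2-D point."""
    return [w_dec[0] * z, w_dec[1] * z]


def reconstruction_loss(w_enc, w_dec, data):
    """Sum of squared differences between inputs and their reconstructions."""
    total = 0.0
    for x in data:
        x_hat = decode(w_dec, encode(w_enc, x))
        total += (x_hat[0] - x[0]) ** 2 + (x_hat[1] - x[1]) ** 2
    return total


def train(data, epochs=500, lr=0.02):
    """Fit encoder and decoder weights with per-sample gradient descent."""
    w_enc, w_dec = [0.5, 0.1], [0.1, 0.5]  # arbitrary starting weights
    for _ in range(epochs):
        for x in data:
            z = encode(w_enc, x)
            x_hat = decode(w_dec, z)
            err = [x_hat[0] - x[0], x_hat[1] - x[1]]
            # Gradients of the squared reconstruction error for this sample.
            grad_dec = [2 * err[0] * z, 2 * err[1] * z]
            shared = 2 * (err[0] * w_dec[0] + err[1] * w_dec[1])
            grad_enc = [shared * x[0], shared * x[1]]
            w_dec = [w_dec[i] - lr * grad_dec[i] for i in range(2)]
            w_enc = [w_enc[i] - lr * grad_enc[i] for i in range(2)]
    return w_enc, w_dec


# Points that all lie on one line, so a 1-D code can describe them exactly.
DATA = [[t, 2 * t] for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]
```

After `w_enc, w_dec = train(DATA)`, passing a latent value that never occurred in training to `decode` produces a new point on the same line, which is the toy analogue of generating new images from modified latent variables.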

Many modern ML models use this architecture, and the encoder portion is sometimes referred to as the backbone with the decoder being referred to as the head. Sometimes the models are symmetrical, but frequently they are not. Many model architectures can serve as the encoder or backbone, and the model’s output can be tailored to a specific problem by modifying the decoder or head. There is no limit to how many heads a model can have, or how many encoders. Backbones, heads, encoders, decoders, and other such higher-level abstractions are modules or blocks built using multiple lower-level linear, convolutional, and other types of basic neural network layers. We can swap and combine them to produce different tailor-fit model architectures, just like we use different third-party frameworks and libraries in traditional software development. This, for example, allows us to encode a phrase into a latent vector which can then be decoded into an image.

Foundation Models for Natural Language Processing

Modern Natural Language Processing (NLP) models like ChatGPT fall into the category of Transformers. The transformer concept was introduced in the 2017 paper “Attention Is All You Need” by Vaswani et al. and has since become the basis for many state-of-the-art models in NLP. The key innovation of the transformer model is the use of self-attention mechanisms, which allow the model to weigh the importance of different parts of the input when making predictions. These models make use of something called an “embedding”, which is a mathematical representation of a discrete input, such as a word, a character, or an image patch, in a continuous, high-dimensional space. Embeddings are used as input to the self-attention mechanisms and other layers in the transformer model to perform the specific task at hand, such as language translation or text summarization. ChatGPT isn’t the first, nor the only transformer model around. In fact, transformers have been successfully applied in many other domains such as computer vision and sound processing.
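
As a rough illustration of how self-attention weighs different parts of the input, here is a minimal sketch of scaled dot-product attention in plain Python. It is a simplification: real transformers first project each embedding into learned query, key, and value vectors and use multiple attention heads, whereas here the embedding itself plays all three roles. The three 2-D embeddings are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def self_attention(embeddings):
    """Scaled dot-product self-attention with identity projections."""
    dim = len(embeddings[0])
    outputs, all_weights = [], []
    for query in embeddings:
        # Similarity of this token to every token, scaled by sqrt(dim).
        scores = [dot(query, key) / math.sqrt(dim) for key in embeddings]
        weights = softmax(scores)
        # Each output is a weighted mix of all token embeddings.
        mixed = [
            sum(w * value[i] for w, value in zip(weights, embeddings))
            for i in range(dim)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical 2-D embeddings for a three-token input.
TOKENS = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
```

In this toy input the first token attends more strongly to the second token, whose embedding is similar, than to the third, whose embedding points in a different direction.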

So if ChatGPT is built on top of existing concepts, what makes it so different from all the other state-of-the-art model architectures already in use today? A simplified explanation of what distinguishes a foundation model from a “regular” deep learning model is the immense scale of the training dataset as well as the number of trainable parameters that a foundation model has over a traditional generative model. An exceptionally large neural network trained on a truly massive dataset gives the resulting model the ability to generalize to a wider range of use cases than its more narrowly focused brethren, hence serving as a foundation for an untold number of new tasks and applications. Such a large model encodes many useful patterns, features, and relationships in its training data. We can mine this body of knowledge without necessarily re-training the entire encoder portion of the model. We can attach different new heads and use transfer learning and fine-tuning techniques to adapt the same model to different tasks. This is how just one model (like Stable Diffusion) can perform text-to-image, image-to-image, inpainting, super-resolution, and even music generation tasks all at once.

Challenges in Training Foundation Models

The GPU computing power and human resources required to train a foundation model like GPT from scratch dwarf those available to individual developers and small teams. The models are simply too large, and the dataset is too unwieldy. Such models cannot (as of now) be cost-effectively trained end-to-end and iterated using commodity hardware.

Although the concepts may be well explained by published research and understood by many data scientists, the engineering skills and eye-watering costs required to wire up hundreds of GPU nodes for months at a time would stretch the budgets of most organizations. And that’s ignoring the costs of dataset access, storage, and data transfer associated with feeding the model massive quantities of training samples.

There are several reasons why models like ChatGPT are currently out of reach for individuals to train:

  1. Data requirements: Training a large language model like ChatGPT requires a massive amount of text data. This data must be high-quality and diverse and is typically obtained from a variety of sources such as books, articles, and websites. This data is also preprocessed to get the best performance, which is an additional task that requires knowledge and expertise. Storage, data transfer, and data loading costs are substantially higher than what is used for more narrowly focused models.
  2. Computational resources: ChatGPT requires significant computational resources to train. This includes networked clusters of powerful GPUs and a large amount of memory, both volatile and non-volatile. Running such a computer cluster can easily cost hundreds of thousands of dollars per experiment.
  3. Training time: Training a foundation model can take several weeks or even months, depending on the computational resources available. Wiring up and renting this many resources requires a lot of skill and a generous time commitment, not to mention associated cloud computing costs.
  4. Expertise: Getting a training run to complete successfully requires knowledge of machine learning, natural language processing, data engineering, cloud infrastructure, networking, and more. Such a large cross-disciplinary set of skills is not something that can be easily picked up by most individuals.

Accessing Pre-Trained AI Models

That said, there are pre-trained models available, and some can be fine-tuned with a smaller amount of data and resources for a more specific and narrower set of tasks, which is a more accessible option for individuals and smaller organizations.

Stable Diffusion took $600k to train, the equivalent of 150K GPU hours. That is a cluster of 256 GPUs running 24/7 for nearly a month. Stable Diffusion is considered a cost reduction compared to GPT. So, while it is indeed possible to train your own foundation model using commercial cloud providers like AWS, GCP, or Azure, the time, effort, required expertise, and overall cost of each iteration impose limitations on their use. There are many workarounds and techniques to re-purpose and partially re-train these models, but for now, if you want to train your own foundation model from scratch your best bet is to apply to one of the few companies which have access to resources necessary to support such an endeavor.
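
The figures quoted above can be sanity-checked with a few lines of arithmetic. This sketch uses only the numbers from the paragraph; the implied price per GPU hour is derived from them and is not a quoted cloud rate.

```python
GPU_HOURS = 150_000          # total GPU hours quoted above
TRAINING_COST_USD = 600_000  # total training cost quoted above
CLUSTER_SIZE = 256           # GPUs running in parallel

# Wall-clock time if all GPUs run around the clock.
wall_clock_days = GPU_HOURS / CLUSTER_SIZE / 24

# Implied average price of one GPU hour.
cost_per_gpu_hour = TRAINING_COST_USD / GPU_HOURS

print(f"{wall_clock_days:.1f} days at ${cost_per_gpu_hour:.2f} per GPU hour")
```

That works out to a little over 24 days of continuous training, consistent with "nearly a month".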

Contact Us for AI Services

If you are ready to leverage artificial intelligence and machine learning solutions, reach out to us today! Kopius is a leader in nearshore digital technology consulting and services.



Data Trends: Six Ways Data Will Change Business in 2023 and Beyond


By Kristina Scott

Data is big and getting bigger. We’ve tracked six major data-driven trends for the coming year.

Data is one of the fastest-growing and most innovative opportunities today to shape the way we work and lead. IDC predicts that by 2024, the inability to perform data- and AI-driven strategy will negatively affect 75% of the world’s largest public companies. And by 2025, 50% of those companies will promote data-informed decision-making by embedding analytics in their enterprise software (up from 33% in 2022), boosting demand for more data solutions and data-savvy employees.

Here is how data trends will shift in 2023 and beyond:

1. Data Democratization Drives Data Culture

If you think data is only relevant to analysts with advanced knowledge of data science, we’ve got news for you. Data democratization is one of the most important trends in data. Gartner research forecasts that 80% of data-driven initiatives that are focused on business outcomes will become essential business functions by 2025.

Organizations are creating a data culture by attracting data-savvy talent and promoting data use and education for employees at all levels. To support data democratization, data must be exact, easily digestible, and accessible.

Research by McKinsey found that high-performing companies have a data leader in the C-suite and make data and self-service tools universally accessible to frontline employees.

2. Hyper-Automation and Real-Time Data Lower Costs

Real-time data and its automation will be the most valuable big data tools for businesses in the coming years. Gartner forecasts that by 2024, rapid hyper-automation will allow organizations to lower operational costs by 30%. And by 2025, the market for hyper-automation software will hit nearly $860 billion.

3. Artificial Intelligence and Machine Learning (AI & ML) Continue to Revolutionize Operations

The ability to implement AI and ML in operations will be a significant differentiator. Verta Insights found that industry leaders that outperform their peers financially are more than 2x as likely to ship AI projects, products, or features, and have made AI/ML investments at a higher level than their peers.

AI and ML technologies will boost the Natural Language Processing (NLP) market. NLP enables machines to understand and communicate with us in spoken and written human languages. The NLP market size will grow from $15.7 billion in 2022 to $49.4 billion by 2027, according to research from MarketsandMarkets.

We have seen the wave of interest in OpenAI’s ChatGPT, a conversational language-generation tool. This highly scalable technology could revolutionize a range of use cases, from summarizing changes to legal documents to completely changing how we research information through dialogue-like interactions, says CNBC.

This can have implications in many industries. For example, the healthcare sector already employs AI for diagnosis and treatment recommendations, patient engagement, and administrative tasks. 

4. Data Architecture Leads to Modernization

Data architecture accelerates digital transformation because it solves complex data problems through the automation of baseline data processes, increases data quality, and minimizes silos and manual errors. Companies modernize by leaning on data architecture to connect data across platforms and users. Companies will adopt new software, streamline operations, find better ways to use data, and discover new technological needs.

According to MuleSoft, organizations are ready to automate decision-making, dynamically improve data usage, and cut data management efforts by up to 70% by embedding real-time analytics in their data architecture.

5. Multi-Cloud Solutions Optimize Data Storage

Cloud use is accelerating. Companies will increasingly opt for a hybrid cloud, which combines the best aspects of private and public clouds.

Companies can access data collected by third-party cloud services, which reduces the need to build custom data collection and storage systems, which are often complex and expensive.

In the Flexera State of Cloud Report, 89% of respondents have a multi-cloud strategy, and 80% are taking a hybrid approach.

6. Enhanced Data Governance and Regulation Protect Users

Effective data governance will become the foundation for impactful and valuable data. 

As more countries introduce laws to regulate the use of various types of data, data governance comes to the forefront of data practices. European GDPR, Canadian PIPEDA, and Chinese PIPL won’t be the last laws that are introduced to protect citizen data.

Gartner has predicted that by 2023, 65% of the world’s population will be covered by regulations like GDPR. In turn, users will be more likely to trust companies with their data if they know it is more regulated.

Valence works with clients to implement a governance framework, find sources of data and data risk, and activate the organization around this innovative approach to data and process governance, including education, training, and process development. Learn more.

What these data trends add up to

As we step into 2023, organizations that understand current data trends can harness data to become more innovative, strategic, and adaptable. Our team helps clients with data assessments, by designing and structuring data assets, and by building modern data management solutions. We strategically integrate data into client businesses, use machine learning and artificial intelligence to create proactive insights, and create data visualizations and dashboards to make data meaningful.  

We help clients to develop a solution and create a modern data architecture that supports differentiated, cloud-enabled scalability, self-service capability, and faster time-to-market for new data products and solutions. Learn more.


Training the Machines: An Introduction To Types of Machine Learning


by Yuri Brigance

I previously wrote about deep learning at the Edge. In this post I’m going to describe the process of setting up an end-to-end Machine Learning (ML) workflow for different types of machine learning.

There are three common types of machine learning training approaches, which we will review here:

  1. Supervised
  2. Unsupervised
  3. Reinforcement

And since all learning approaches require some type of training data, I will also share three methods to build out your training dataset via:

  1. Human Annotation
  2. Machine Annotation
  3. Synthesis / Simulation

Supervised Learning:

Supervised learning uses a labeled training set of both inputs and outputs to teach a model to yield the desired outcome. This approach typically relies on a loss function, which measures the error between the model’s predictions and the labels. Training continues until that error has been sufficiently minimized.
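
As a minimal sketch of this loop, the example below fits a straight line to four labeled points by gradient descent on a mean-squared-error loss. The data, learning rate, and function names are assumptions chosen for illustration; real projects would normally use an ML framework.

```python
def predict(weight, bias, x):
    """The model: a straight line."""
    return weight * x + bias


def mean_squared_error(weight, bias, samples):
    """Loss function: average squared gap between prediction and label."""
    return sum((predict(weight, bias, x) - y) ** 2 for x, y in samples) / len(samples)


def fit(samples, epochs=2000, lr=0.05):
    """Repeatedly nudge the parameters in the direction that lowers the loss."""
    weight, bias = 0.0, 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = sum(2 * (predict(weight, bias, x) - y) * x for x, y in samples) / n
        grad_b = sum(2 * (predict(weight, bias, x) - y) for x, y in samples) / n
        weight -= lr * grad_w
        bias -= lr * grad_b
    return weight, bias


# Labeled examples as (input, correct output) pairs, generated from y = 2x + 1.
LABELED = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
```

Running `fit(LABELED)` should recover a weight close to 2 and a bias close to 1, at which point the loss is near zero.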

This type of learning approach is arguably the most common, and in a way, it mimics how a teacher explains the subject matter to a student through examples and repetition.

One downside to supervised learning is that this approach requires large amounts of accurately labeled training data. This training data can be annotated manually (by humans), annotated by machine (by other models or algorithms), or generated synthetically (ex: rendered images or simulated telemetry). Each approach has its pros and cons, and they can be combined as needed.

Unsupervised Learning:

Unlike supervised learning, where a teacher explains a concept or defines an object, unsupervised learning gives the machine the latitude to develop understanding on its own. Often with unsupervised learning, the machines can find trends and patterns that a person would otherwise miss. Frequently these correlations elude common human intuition and can be described as non-semantic. For this reason, the term “black box” is commonly applied to models like the awe-inspiring GPT-3.

With unsupervised learning, we give data to the machine learning model that is unlabeled and unstructured. The computer then identifies clusters of similar data or patterns in the data. The computer might not find the same patterns or clusters that we expected, as it learns to recognize the clusters and patterns on its own. In many cases, being unrestricted by our preconceived notions can reveal unexpected results and opportunities.   
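
Clustering is one common form of this. The sketch below is a bare-bones one-dimensional k-means in plain Python. The numbers and starting centers are made up, and k-means is only one of many unsupervised techniques; it is used here as an assumed example.

```python
def k_means_1d(values, centers, iterations=10):
    """Assign each value to its nearest center, then move centers to cluster means."""
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Unlabeled measurements: nothing tells the algorithm there are two groups.
UNLABELED = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
```

Calling `k_means_1d(UNLABELED, centers=[0.0, 5.0])` separates the small values from the large ones without ever being told which is which.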

Reinforcement Learning:

Reinforcement learning teaches a machine to act through a semi-supervised approach. The machine is rewarded for correct answers and tries to collect as much reward as possible. Reinforcement learning is an efficient way to train a machine to learn a complicated task, such as playing video games or making a legged robot walk.

The machine is motivated to be rewarded, but the machine doesn’t share the operator’s goals. So if the machine can find a way to “game the system” and get more reward at the cost of accuracy, it will greedily do so. Just as machines can find patterns that humans miss in unsupervised learning, machines can also find missed patterns in reinforcement learning, and exploit those invisible patterns to receive additional reinforcement. This is why your experiment needs to be airtight to minimize exploitation by the machines.

For example, an AI twitterbot that was trained with reinforcement learning was rewarded for maximizing engagement. The twitterbot learned that engagement was extremely high when it posted about Hitler.

This machine behavior isn’t always a problem. For example, reinforcement learning helps machines find bugs in video games that can be exploited if they aren’t resolved.
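
The reward-seeking behavior described in this section can be sketched with a tiny epsilon-greedy agent choosing between two actions. This is a deliberately simple stand-in (a two-armed bandit) rather than a full reinforcement learning setup, and the reward probabilities are hypothetical. Note that the agent only ever sees the reward signal, so it will favor whatever action pays out most, whether or not that matches the operator’s intent.

```python
import random


def run_bandit(reward_probs, steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent: mostly exploit the best-known action, sometimes explore."""
    rng = random.Random(seed)
    counts = [0] * len(reward_probs)
    estimates = [0.0] * len(reward_probs)
    total_reward = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random action.
            action = rng.randrange(len(reward_probs))
        else:
            # Exploit: pick the action with the highest estimated reward.
            action = max(range(len(estimates)), key=lambda a: estimates[a])
        reward = 1 if rng.random() < reward_probs[action] else 0
        counts[action] += 1
        # Running average of the rewards seen for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
        total_reward += reward
    return estimates, total_reward
```

With `run_bandit([0.2, 0.8])`, the agent should learn to prefer the second action because it is rewarded more often.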

Datasets:

Machine Learning implies that you have data to learn from. The quality and quantity of your training data have a lot to do with how well your algorithm can perform. A training dataset typically consists of samples, or observations. Each training sample can be an image, audio clip, text snippet, sequence of historical records, or any other type of structured data. Depending on which machine learning approach you take, each sample may also include annotations (correct outputs / solutions) that are used to teach the model and verify the results. Training datasets are commonly split into groups where the model only trains on a subset of all available data. This allows a portion of the dataset to be used for validation of the model, to ensure that the model has generalized well enough to perform well on data it has not seen before.
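
A minimal sketch of such a split is shown below. The 80/20 ratio and the fixed seed are assumptions chosen for illustration; the right ratio depends on how much data you have.

```python
import random


def train_validation_split(samples, validation_fraction=0.2, seed=42):
    """Shuffle the samples, then hold out a fraction for validation."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_val = int(len(shuffled) * validation_fraction)
    validation = shuffled[:n_val]
    training = shuffled[n_val:]
    return training, validation
```

The model trains only on the first returned list, and the second list is kept aside to check how well it handles data it has not seen.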

Regardless of which training approach you take, your model can be prone to bias which may be inadvertently introduced through unbalanced training data, or selection of the wrong inputs. One example is an AI criminal risk assessment tool used by courts to evaluate how likely a defendant is to reoffend based on their profile as input. Because the model was trained on historical data, which included years of disproportionate targeting by law enforcement of low-income and minority groups, the resulting model produced higher risk scores for low-income and minority individuals. It is important to remember that most machine learning models pick up on statistical correlations, and not necessarily causations.

Therefore, it is highly desirable to have a large and balanced training dataset for your algorithm, which is not always readily available or easy to obtain. This is a task which may initially be overlooked by businesses excited to apply machine learning to their use cases. Dataset acquisition is as important as the model architecture itself.

One way to ensure that the training dataset is balanced is through a Design of Experiments (DOE) approach, where controlled experiments are planned and analyzed to evaluate the factors which control the value of an output parameter or group of parameters. DOE allows for multiple input factors to be manipulated, determining their effect on the model’s response. This gives us the ability to exclude certain inputs which may lead to biased results, as well as gain a better understanding of the complex interactions that occur inside the model.
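
One simple DOE layout is a full factorial design, which enumerates every combination of factor levels so that each level is represented equally. The sketch below assumes three hypothetical factors for an image dataset; it only plans the runs and does not analyze results.

```python
from itertools import product


def full_factorial(factors):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    level_lists = [factors[name] for name in names]
    return [dict(zip(names, levels)) for levels in product(*level_lists)]


# Hypothetical factors and levels for collecting a balanced image dataset.
FACTORS = {
    "lighting": ["day", "night"],
    "camera_angle": ["front", "side", "top"],
    "background": ["plain", "cluttered"],
}
```

`full_factorial(FACTORS)` yields 2 x 3 x 2 = 12 planned conditions, with each lighting level appearing in exactly half of them.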

Here are three examples of how training data is collected, and in some cases generated:

1. Human Labeled Data:

What we refer to as human-labeled data is anything that has been annotated by a living human, either through crowdsourcing or by querying a database and organizing the dataset. An example of this could be annotating facial landmarks around the eyes, nose, and mouth. These annotations are pretty good, but in certain instances can be imprecise. For example, the definition of “the tip of the nose” can be interpreted differently by different humans who are tasked with labeling the dataset. Even simple tasks, like drawing a bounding box around apples in photos, can have “noise” because the bounding box may have more or less padding, may be slightly off center, and so on.

Human-labeled data is a great start if you have it. But hiring human annotators can be expensive, and their work is prone to error. Various services and tools exist, from Amazon SageMaker Ground Truth to several startups, which make the labeling job easier for the annotators and also connect annotation vendors with clients.

It might be possible to find an existing dataset in the public domain. In an example with facial landmarks, we have WFLW, iBUG, and other publicly available datasets which are perfectly suitable for training. Many have licenses that allow commercial use. It’s a good idea to research whether someone has already produced a dataset that fits your needs, and it might be worth paying for a small dataset to bootstrap your learning process.

2. Machine Annotation:

In plain terms, machine annotation is where you take an existing algorithm or build a new algorithm to add annotations to your raw data automatically. It sounds like a chicken and egg situation, but it’s more feasible than it initially seems.

For example, you might already have a partially labeled dataset. Let’s imagine you are labeling flowers in bouquet photos, and you want to identify each flower. Maybe you had some portion of these images already annotated with tulips, sunflowers, and daffodils. But there are still images in the training dataset that contain tulips which have not been annotated, and new images keep coming in from your photographers.

So, what can you do? In this case, you can take all the existing images where the tulips have already been annotated and train a simple tulip-only detector model. Once this model reaches sufficient accuracy, you can fill in the remaining missing tulip annotations automatically. You can keep doing this for the other flowers. In fact, you can crowdsource humans to annotate just a small batch of images with a specific new flower, and that should be enough to build a dedicated detector that can machine-annotate your remaining samples. In this way, you save time and money by not having humans annotate every single image in your training set or every new raw image that comes in. The resulting dataset can be used to train a more complete production-grade detector, which can detect all the different types of flowers. Machine annotation also gives you the ability to continue improving your production model by continuously and automatically annotating new raw data as it arrives. This achieves a closed-loop continuous training and improvement cycle.

Another example is where you have incompatible annotations. For example, you might want to detect 3D positions of rectangular boxes from webcam images, but all you have are 2D landmarks for the visible box corners. How do you estimate and annotate the occluded corners of each box, let alone figure out their position in 3D space? Well, you can use a Principal Component Analysis (PCA) morphable model of a box and fit it to 2D landmarks, then de-project the detected 3D shape into 3D space using camera intrinsics. This gives you full 3D annotations, including the occluded corners. Now you can train a model that does not require PCA fitting.

In many cases you can put together a conventional deterministic algorithm to annotate your images. Sure, such algorithms might be too slow to run in real-time, but that’s not the point. The point is to label your raw data so you can train a model that can run inference in milliseconds.

Machine annotation is an excellent choice to build up a huge training dataset quickly, especially if your data is already partially labeled. However, just like with human annotations, machine annotation can introduce errors and noise. Carefully consider which annotations should be thrown out based on a confidence metric or some human review, for example. Even if you include a few bad samples, the model will likely generalize successfully with a large enough training set, and bad samples can be filtered out over time.
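One way to act on a confidence metric is to sort machine annotations into three buckets: keep, send to human review, or discard. The sketch below assumes each annotation carries a confidence score between 0 and 1. The `Annotation` fields and both thresholds are hypothetical and would need tuning for a real detector.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    image_id: str
    label: str
    confidence: float  # detector score in [0, 1]


def triage(annotations, accept_at=0.9, review_at=0.5):
    """Split machine annotations into accepted, needs-review, and rejected."""
    accepted, review, rejected = [], [], []
    for ann in annotations:
        if ann.confidence >= accept_at:
            accepted.append(ann)   # trusted enough to train on directly
        elif ann.confidence >= review_at:
            review.append(ann)     # uncertain: queue for a human annotator
        else:
            rejected.append(ann)   # too noisy to keep
    return accepted, review, rejected


batch = [
    Annotation("img_001", "tulip", 0.97),
    Annotation("img_002", "tulip", 0.62),
    Annotation("img_003", "tulip", 0.18),
]
accepted, review, rejected = triage(batch)
```

Only the middle bucket costs human time, so the thresholds effectively set the trade-off between labeling budget and label noise.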

3. Synthetic Data

With synthetic data, machines are trained on renderings or in hyper-realistic simulations – think of a video game of a city commute, for example. For Computer Vision applications, a lot of synthetic data is produced via rendering, whether you are rendering people, cars, entire scenes, or individual objects. Rendered 3D objects can be placed in a variety of simulated environments to approximate the desired use case. We’re not limited to renderings either, as it is possible to produce synthetic data for numeric simulations where the behavior of individual variables is well known. For example, modeling fluid dynamics or nuclear fusion is extremely computationally intensive, but the rules are well understood – they are the laws of physics. So, if we want to approximate fluid dynamics or plasma interactions quickly, we might first produce simulated data using classical computing, then feed this data into a machine learning model to speed up prediction via ML inference.
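As a toy illustration of that simulate-then-learn pattern, the sketch below uses a step-by-step simulation of a falling object as the slow "classical" source of data, then fits a one-parameter least-squares model that predicts the result instantly. The physics problem, the function names, and the choice of a simple least-squares fit in place of a neural network are all assumptions made to keep the example small.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2


def simulate_fall_time(height_m, dt=1e-3):
    """Slow path: step a falling object forward in time until it lands."""
    y, v, t = height_m, 0.0, 0.0
    while y > 0.0:
        v += G * dt
        y -= v * dt
        t += dt
    return t


def fit_surrogate(heights):
    """Fit t ~ a * sqrt(h) by least squares on simulated samples."""
    xs = [math.sqrt(h) for h in heights]
    ts = [simulate_fall_time(h) for h in heights]  # synthetic training data
    return sum(x * t for x, t in zip(xs, ts)) / sum(x * x for x in xs)


def predict_fall_time(a, height_m):
    """Fast path: a single multiplication instead of thousands of steps."""
    return a * math.sqrt(height_m)


coefficient = fit_surrogate(range(1, 21))
estimate = predict_fall_time(coefficient, 12.5)
```

The simulation generates the labeled training set, and the fitted model replaces it at prediction time. A real surrogate for fluid dynamics or plasma would follow the same shape with a far richer simulator and model.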

There are many commercial applications of synthetic data. For example, what if we needed to annotate the purchase receipts for a global retailer, starting with unprocessed scans of paper receipts? Without any existing metadata, we would need humans to manually review and annotate thousands of receipt images to assess buyer intentions and semantic meaning. With a synthetic data generator, we can parameterize the variations of a receipt and accurately render them to produce synthetic images with full annotations. If we find that our model is not performing well under a particular scenario, we can just render more samples as needed to fill in the gaps and re-train.
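A parameterized generator of this kind can be sketched in a few lines. The example below produces the text of a receipt together with its ground-truth annotations, and leaves the actual image rendering out of scope. The catalog, the layout, and the annotation field names are hypothetical.

```python
import random

# Hypothetical catalog: item name -> unit price in cents.
CATALOG = {"bagel": 225, "coffee": 350, "juice": 400, "muffin": 275}


def format_cents(cents):
    """Format an integer number of cents as a decimal amount."""
    return f"{cents // 100}.{cents % 100:02d}"


def make_receipt(rng):
    """Return (text_lines, annotations) for one randomly parameterized receipt."""
    names = rng.sample(sorted(CATALOG), rng.randint(1, len(CATALOG)))
    lines, items, total = [], [], 0
    for name in names:
        qty = rng.randint(1, 3)
        line_total = qty * CATALOG[name]
        total += line_total
        lines.append(f"{qty} x {name:<8}{format_cents(line_total):>8}")
        # The annotation records which line holds the item, so it is exact.
        items.append({"line": len(lines) - 1, "name": name, "qty": qty,
                      "line_total_cents": line_total})
    lines.append(f"{'TOTAL':<12}{format_cents(total):>8}")
    annotations = {"items": items, "total_cents": total,
                   "total_line": len(lines) - 1}
    return lines, annotations


# A seeded generator makes every sample reproducible.
lines, annotations = make_receipt(random.Random(42))
```

Because the generator knows exactly what it wrote on each line, the annotations need no human review, and under-represented scenarios can be covered by sampling more receipts with the relevant parameters.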

Another real-world example is in manufacturing where “pick-and-place” robots use computer vision on an assembly line to pack or arrange and assemble products and components. Synthetic data can be applied in this scenario because we can use the same 3D models that were used to create injection molds of the various components to make renderings as training samples that teach the machines. You can easily render thousands of variations of such objects being flipped and rotated, as well as simulate different lighting conditions. The synthetic annotations will always be 100% precise.

Aside from rendering, another approach is to use Generative Adversarial Network (GAN) generated imagery to create variation in the dataset. Training GAN models usually requires a decent number of raw samples. With a fully trained GAN generator it is possible to explore the latent space and tweak parameters to create additional variation. Although they are more complex than classical rendering engines, GANs are gaining steam and have their place in the synthetic data generation realm. Just look at these generated portraits of fake cats!

Choosing the right approach:

Machine learning is on the rise across industries and in businesses of all sizes. Depending on the type of data, the quantity, and how it is stored and structured, Valence can recommend a path forward which might use a combination of the data generation and training approaches outlined in this post. The order in which these approaches are applied varies by project, and boils down to roughly four phases:

  1. Bootstrapping your training process. This includes gathering or generating initial training data and developing a model architecture and training approach. Some statistical analysis, such as design of experiments (DOE), may be involved to determine the best inputs to produce the desired outputs and predictions.
  2. Building out the training infrastructure. Access to Graphics Processing Unit (GPU) compute in the cloud can be expensive. While some models can be trained on local hardware at the beginning of the project, in the long term a scalable, serverless training infrastructure and a proper ML experiment lifecycle management strategy are desirable.
  3. Running experiments. In this phase we begin training the model, adjusting the dataset, experimenting with the model architecture and hyperparameters. We will collect lots of experiment metrics to gauge improvement.
  4. Inference infrastructure. This includes integrating the trained model into your system and putting it to work. This can be cloud-based inference, in which case we’ll pick the best serverless approach that minimizes cloud expenses while maximizing throughput and stability. It might also be edge inference, in which case we may need to optimize the model to run on a low-powered edge CPU, GPU, TPU, VPU, FPGA, or a combination thereof.

What I wish every reader understood is that these models are simple in their sophistication. There is a discovery process at the onset of every project where we identify the training data needs and which model architecture and training approach will get the desired result. It sounds relatively straightforward to unleash a neural network on a large amount of data, but there are many details to consider when setting up Machine Learning workflows. Just like real-world physical research, Machine Learning requires us to set up a “digital lab” which contains the necessary tools and raw materials to investigate hypotheses and evaluate outcomes – which is why we call AI training runs “experiments”. Machine Learning has such an array of truly incredible applications that there is likely a place for it in your organization as part of your digital journey.

Additional resources: