Rather than use a typical assortment of stock photos of data centers, Macs, and servers, we decided to leverage AI to provide the illustrations for this article. To do this, we uploaded a draft of this article to an LLM (Large Language Model) and asked it to write image prompts for a diverse set of illustrations for each point, utilizing a range of artistic styles. We then used Draw Things on Mac Studio with Flux.1 Schnell to generate images for each prompt. We’ve included our favorites in the article, along with the prompt used to generate each image.
When working with large language models, context is key, as it drives the predictive model’s ability to synthesize information. Without sufficient context, AI models generate generic, unhelpful responses, because they can only build those responses from a combination of the provided context and the general knowledge embedded in the model.
If you have experience with common AI platforms, it’s not difficult to spot AI-generated content: it’s written in the same generic style and tone and often omits relevant details. This usually comes down to a lack of context, as the model must infer any details that aren’t explicitly provided in the prompt.
This is amplified in business use, especially when trying to answer difficult questions or generate content for a specific audience. Every organization is slightly different in how it operates and has domain-specific proprietary knowledge that is not included in the general datasets used to train AI models.
Knowing which context to include is often difficult. For example, you may have a specific dataset behind a question you’re looking to answer, but how much of that dataset depends on specific organizational knowledge? Can the question be answered with only the dataset, the question itself, and a general understanding of human knowledge? If not, what else needs to be included? Determining which information to include as context can be time-consuming, and you don’t know what you don’t know: there may be specific details or patterns in the data that you’re missing but that an AI model with the appropriate contextual knowledge would recognize.
Additionally, much of the most important contextual knowledge is either proprietary or, in some cases, personally identifying, and cannot be entrusted to public AI providers. This is especially true in heavily regulated industries.
Many public AI providers offer plans and products that promise data privacy and isolation, but your data is still processed on shared infrastructure. And for providers that train their own models, the incentives around data privacy are misaligned: they need additional data to improve new versions of those models.
When you set up a private AI server, you have complete control over the data processed by the server. All of the AI models run locally on the server, so there is no need to send the data to a third party for processing. This allows you to use proprietary or private information as context when chatting with AI models, enabling much more focused answers and deeper analysis of business problems.
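As a concrete illustration, chatting with a locally hosted model can be a single HTTP request to the machine itself. This is a minimal sketch assuming an Ollama instance on its default port; the model name and the proprietary context shown are hypothetical placeholders:

```python
import requests

# All processing happens on the local server: the prompt (and any
# proprietary context it carries) never leaves the machine.
response = requests.post(
    "http://localhost:11434/api/chat",  # Ollama's default local endpoint
    json={
        "model": "llama3.1:70b",  # hypothetical model choice
        "messages": [
            # Proprietary context that couldn't be sent to a public provider:
            {"role": "system", "content": "Internal pricing policy: ..."},
            {"role": "user", "content": "Which of our plans fits a 50-seat team?"},
        ],
        "stream": False,
    },
)
print(response.json()["message"]["content"])
```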
Compared to traditional x86 servers, the Mac is uniquely suited to running very large AI models on a single server or workstation. At scale, AI models are typically run on large clusters of interconnected servers using powerful dedicated GPU cards with limited onboard memory.
These dedicated GPU cards are very fast, but to run larger models you need multiple GPUs, due to the RAM constraints of a single card. For private instances, this gets very expensive, especially for smaller teams, with a single H100 costing thousands of dollars per month to rent.
With Apple silicon, Mac computers use unified memory: the system RAM is shared by both the CPU and the onboard GPU, allowing a single server to run very large models at a price that’s practical for smaller teams that don’t need the performance of datacenter GPUs.
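To see why this matters, a rough back-of-the-envelope calculation relates parameter count, quantization level, and memory footprint (these figures approximate weight storage only and ignore KV-cache and runtime overhead):

```python
def model_memory_gb(params_billions: float, bits_per_param: float) -> float:
    """Approximate memory needed for model weights alone."""
    return params_billions * 1e9 * (bits_per_param / 8) / 1e9

# A 70-billion-parameter model:
print(model_memory_gb(70, 16))  # ~140 GB at 16-bit -- beyond any single GPU card
print(model_memory_gb(70, 4))   # ~35 GB at 4-bit -- fits in unified RAM
```

At 16-bit precision, a 70-billion-parameter model needs roughly 140GB for its weights alone, which is why multi-GPU clusters are the norm; a single Mac with enough unified RAM can hold the same model on one machine.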
Additionally, the AI software stack available on macOS is both mature and easily managed, especially when compared to datacenter-scale GPU solutions. This is due in part to Apple’s long-term focus on machine learning applications and frameworks, even predating the advent of LLMs. Anecdotally, the vast majority of developers at frontier AI companies use the Mac as their local workstation, even when primarily working with remote large-scale GPU clusters.
RAG (Retrieval-Augmented Generation) addresses the need for broad organizational knowledge as part of the context. When a prompt is submitted, RAG searches stored document content for snippets similar to the prompt and inserts those snippets into the prompt’s context before generating a response. This adds a wide range of relevant context to the prompt, enabling the model to generate more accurate results.
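The retrieval step can be illustrated with a toy sketch. Real RAG pipelines use a dedicated embedding model and a vector database; the word-count “embedding” below is just a self-contained stand-in for the similarity search:

```python
import numpy as np

# Toy stand-in for a real embedding model: counts word occurrences.
def embed(text: str, vocab: list[str]) -> np.ndarray:
    words = text.lower().split()
    return np.array([words.count(w) for w in vocab], dtype=float)

documents = [
    "Refunds are processed within 14 days of a return request.",
    "Enterprise plans include SSO and a dedicated support channel.",
    "Our fiscal year starts on February 1st.",
]

query = "When does the fiscal year begin?"
vocab = sorted(set(" ".join(documents + [query]).lower().split()))

# Rank stored snippets by cosine similarity to the prompt...
q = embed(query, vocab)
scores = [
    np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d) + 1e-9)
    for d in (embed(doc, vocab) for doc in documents)
]
best = documents[int(np.argmax(scores))]

# ...and insert the best match into the prompt before generation.
augmented_prompt = f"Context:\n{best}\n\nQuestion: {query}"
print(augmented_prompt)
```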
Context insertion allows the user to include an entire document as context for a chat. This is useful if you have summary documents that define and explain organizational terms, or frameworks that you want the AI to follow when generating content. It’s also useful if you have a specific set of data that you want to analyze or pull answers from. One thing to be aware of here is the context window: it differs by model, but it is the total amount of text that can be considered when generating a response, and it is also consumed by RAG snippets and the conversation history. As a result, inserted documents should be clear and concise, including only relevant text if possible.
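To make the budgeting concrete, here is a rough sketch; the window size, token estimates, and the four-characters-per-token heuristic are all illustrative assumptions rather than properties of any particular model:

```python
CONTEXT_WINDOW = 8192  # tokens; varies by model

def approx_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text
    return len(text) // 4

history_tokens = 1500   # prior conversation turns (example figure)
rag_tokens = 1200       # retrieved RAG snippets (example figure)
reply_reserve = 1024    # space reserved for the model's answer

budget = CONTEXT_WINDOW - history_tokens - rag_tokens - reply_reserve

document = "Glossary of internal terms and their definitions. " * 400  # stand-in document
doc_tokens = approx_tokens(document)
print(f"Document: ~{doc_tokens} tokens; room available: ~{budget} tokens")
if doc_tokens > budget:
    print("Too large: trim the document to only the relevant sections.")
```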
To effectively manage the server, it’s important to understand what each component does, as components are interchangeable and, given the pace of AI development, frequently superseded by newer competing solutions.
For the server itself, a machine with ample RAM and GPU resources is highly recommended. In the Mac lineup, the Mac Studio is optimal, as models with up to 192GB of unified RAM are available. The larger form-factor of the Mac Studio also allows the use of Apple silicon M series Max and Ultra chips, which are equipped with many more GPU cores than Pro and base model Apple silicon M series chips, offering over double the GPU performance of the highest-end Mac mini.
Large Language Models are what most people think of as “AI models.” The LLM is the component that generates text based on the provided context, supplying the reasoning and apparent intelligence. Popular proprietary LLMs include OpenAI’s GPT models (which power ChatGPT), Anthropic’s Claude, and Google’s Gemini.
LLMs use billions of learned parameters to predict the text that follows from a prompt, and the parameter count greatly affects the quality of generated output. Generally speaking, within a given generation of LLMs, more parameters mean better output, with larger models having dozens or even hundreds of billions of parameters. That said, AI is a rapidly evolving space, and newer-generation small models can outperform older-generation large models.