<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Mundher's blog]]></title><description><![CDATA[Mundher Al-Shabi]]></description><link>https://mundher.com</link><generator>RSS for Node</generator><lastBuildDate>Tue, 14 Apr 2026 17:31:24 GMT</lastBuildDate><atom:link href="https://mundher.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[AI Agent Experience (AX)]]></title><description><![CDATA[In an article I recently co-authored, we argued that a fundamental shift is underway in product design. The traditional principles of User Experience (UX), which rest on direct user control and manipulation, are becoming obsolete with the rise of tru...]]></description><link>https://mundher.com/ai-agent-experience-ax</link><guid isPermaLink="true">https://mundher.com/ai-agent-experience-ax</guid><dc:creator><![CDATA[Mundher Al-Shabi, PhD]]></dc:creator><pubDate>Mon, 14 Jul 2025 16:00:13 GMT</pubDate><content:encoded><![CDATA[<p>In an article I recently co-authored, we argued that a fundamental shift is underway in product design. The traditional principles of User Experience (UX), which rest on direct user control and manipulation, are becoming obsolete with the rise of true AI agents.</p>
<p>Here is a summary of our main points:</p>
<blockquote>
<p>We defined this new paradigm as "Agent Experience" (AX), where the user’s role evolves from an active operator to a supervisor. Unlike simple assistants like Siri, true AI agents can independently manage complex, multi-step goals. This means users will no longer navigate intricate workflows but will instead state their objectives and oversee the AI's execution.</p>
<p>Our article posits that designers must now build "cockpits" or dashboards for monitoring and intervention, rather than step-by-step task flows. In this new world, natural language becomes the primary interface, and the core design challenge is to establish trust by ensuring the user remains the ultimate authority with clear pathways to step in and manage their AI counterparts. We concluded that the companies that succeed will be those that best empower users to supervise these increasingly autonomous systems confidently.</p>
</blockquote>
<p><a target="_blank" href="https://www.productvoyagers.com/p/ai-agent-experience-ax">Read the Full Article Here</a></p>
]]></content:encoded></item><item><title><![CDATA[The Messy Reality of Evaluating GenAI Systems]]></title><description><![CDATA[For years, evaluating traditional machine learning models, while never simple, followed a well-trodden path. Your team knew the drill: assemble a labeled dataset, define success with metrics like precision and recall, and track performance. The core ...]]></description><link>https://mundher.com/the-messy-reality-of-evaluating-genai-systems</link><guid isPermaLink="true">https://mundher.com/the-messy-reality-of-evaluating-genai-systems</guid><dc:creator><![CDATA[Mundher Al-Shabi, PhD]]></dc:creator><pubDate>Sun, 29 Jun 2025 14:41:55 GMT</pubDate><content:encoded><![CDATA[<p>For years, evaluating traditional machine learning models, while never simple, followed a well-trodden path. Your team knew the drill: assemble a labeled dataset, define success with metrics like precision and recall, and track performance. The core of the work was getting the data right to build a predictable, robust system.</p>
<p>Then came the Generative AI explosion. Suddenly, the old playbook feels inadequate. We're no longer just predicting a "churn" vs. "no churn" label. We’re generating nuanced text for marketing, complex code for features, and intricate product designs. The very definition of a "good" output has become subjective and context-dependent.</p>
<p>This paradigm shift is forcing us to rethink evaluation from the ground up. For those on the front lines building these products, this isn't just an academic exercise; it's a critical bottleneck to shipping reliable software.</p>
<h2 id="heading-the-two-worlds-of-genai-evaluation-closed-and-open-ended"><strong>The Two Worlds of GenAI Evaluation: Closed and Open-Ended</strong></h2>
<p>The first step to building a coherent strategy around GenAI is to distinguish between the two fundamental types of tasks your system might be performing.</p>
<h3 id="heading-closed-ended-predictions"><strong>Closed-Ended Predictions:</strong></h3>
<p>This is familiar territory. The model's job is to produce a specific, constrained output. Because there's a definite "right" answer, we can lean on our traditional toolkit.</p>
<p><strong>Example:</strong> You're building a feature to automatically categorize incoming support tickets ("Billing Issue", "Technical Glitch", "Feature Request").</p>
<p><strong>Why it's closed-ended:</strong> There's a predefined, finite set of correct labels.</p>
<p><strong>How you measure it:</strong> You can use <strong>precision</strong> (of all the tickets we labeled "Billing Issue," how many actually were?) and <strong>recall</strong> (of all the actual "Billing Issue" tickets, how many did we find?). These are clear, quantifiable KPIs you can build dashboards around and track release over release.</p>
<h3 id="heading-open-ended-predictions"><strong>Open-Ended Predictions:</strong></h3>
<p>This is where the real challenge begins. Think of tasks where there are many possible "good" answers.</p>
<p><strong>Example:</strong> You're launching an AI-powered feature that summarizes long customer feedback emails for internal teams.</p>
<p><strong>Why it's open-ended:</strong> A 500-word email can be summarized effectively in dozens of different ways. Which one is best? It depends on what the reader needs to know.</p>
<p><strong>The problem:</strong> Simple one-to-one comparisons with a "golden" summary in a test set are no longer sufficient. This is the core challenge for teams building the next generation of AI-powered features.</p>
<h2 id="heading-the-evaluation-toolkit-for-open-ended-generation"><strong>The Evaluation Toolkit for Open-Ended Generation</strong></h2>
<p>While the problem is complex, we're not flying completely blind. The field has developed several methods to bring structure to this ambiguity.</p>
<h3 id="heading-the-classics-bleu-and-rouge"><strong>The Classics: BLEU and ROUGE</strong></h3>
<p>These metrics are your first-line, automated tools. They compare the words and phrases in the model's output to one or more human-written reference examples.</p>
<p>Let's take our feedback summarizer feature. Suppose the original feedback is: <em>"The new dashboard is visually appealing, but the process to export reports is now much slower and requires three extra clicks. I also can't find the date filter easily."</em></p>
<p><strong>Reference Summary:</strong> "User likes the new dashboard's look but finds report exporting slower and the date</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td></td><td><strong>Generated Summary</strong></td><td><strong>The Takeaway</strong></td></tr>
</thead>
<tbody>
<tr>
<td><strong>A (High ROUGE score)</strong></td><td>"User says the dashboard's look is good but exporting reports is slow and the date filter is hard to find."</td><td>This scores well because it uses many of the same keywords. It's a good initial signal for factual recall, often useful for CI/CD checks to prevent basic regressions.</td></tr>
<tr>
<td><strong>B (Low ROUGE score)</strong></td><td>"The user praised the UI redesign. However, they reported significant workflow regressions, specifically with export speed and filter visibility."</td><td>This summary is arguably <em>more useful</em> because it synthesizes the feedback with more professional terminology. However, it would score lower on ROUGE, highlighting the limitations of relying solely on lexical metrics.</td></tr>
</tbody>
</table>
</div><h3 id="heading-the-modern-approaches-semantic-similarity-and-llm-as-a-judge"><strong>The Modern Approaches: Semantic Similarity and LLM-as-a-Judge</strong></h3>
<p>To get closer to measuring actual quality, we need more sophisticated tools.</p>
<p><strong>1. Semantic Evaluation:</strong> This measures if the <em>meaning</em> is the same, even if the words are different. In the example above, a good semantic similarity metric would score "Generated Summary B" highly against the reference because the core concepts are identical. This is a much better signal of true understanding.</p>
<p><strong>2. LLM-as-a-Judge:</strong> This is a game-changer for iterating quickly. You use a powerful LLM as a tireless, automated evaluator. As a team, you define the rubric.</p>
<p><strong>Example:</strong> To evaluate the summaries from your AI feature, you can use an API call to a "Judge" LLM with a prompt like this:</p>
<blockquote>
<p><em>You are an expert editor. Please evaluate the following AI-generated summary based on the original customer feedback.</em></p>
<p><em>Original Feedback: [Insert original email here]</em></p>
<p><em>AI-Generated Summary: [Insert summary to be evaluated here]</em></p>
<p><em>Please score the summary on a scale of 1-5 for the following criteria:</em></p>
<p><em>1.  Conciseness: Is the summary brief and to the point?</em></p>
<p><em>2.  Factual Accuracy: Does the summary correctly represent all key points from the original feedback?</em></p>
<p><em>3.  Clarity: Is the summary unambiguous and easy to understand?</em></p>
<p><em>Provide a score and a brief justification for each.</em></p>
</blockquote>
<p>This approach is faster and cheaper than constant human review and allows you to scale your evaluation based on criteria that your team defines as important for the product.</p>
<h3 id="heading-special-case-evaluating-rag-systems"><strong>Special Case: Evaluating RAG Systems</strong></h3>
<p>Retrieval-Augmented Generation (RAG) is the architecture powering most modern enterprise chatbots. It has two parts: finding the right info (Retrieval) and then using it to talk (Generation). To debug effectively, you must evaluate them separately.</p>
<ul>
<li><p><strong>Example:</strong> A customer support chatbot that answers questions based on a company's knowledge base. A user asks, "How do I get a refund for my subscription?"</p>
</li>
<li><p><strong>1. Evaluate the Retrieval:</strong> Did the system pull the correct document?</p>
<ul>
<li><p><strong>Success:</strong> The RAG system retrieves the "Refund Policy" article.</p>
</li>
<li><p><strong>Failure:</strong> The system retrieves the "Subscription Upgrade" article.</p>
</li>
<li><p><strong>The Metric:</strong> Use classic search metrics like <strong>NDCG</strong> or simple <strong>top-k hit rate</strong> (e.g., "was the correct document in the top 3 results 95% of the time?"). This tells you if your retrieval component—be it a vector database or a search API—is effective.</p>
</li>
</ul>
</li>
<li><p><strong>2. Evaluate the Generation:</strong> Assuming it found the right document, did it generate a good answer?</p>
<ul>
<li><p><strong>Success:</strong> "To get a refund, go to your Account Settings, click 'Subscription,' and follow the 'Cancel and Refund' link. You are eligible for a refund if you cancel within 14 days of purchase."</p>
</li>
<li><p><strong>Failure:</strong> "The document mentions refunds." (Accurate but useless).</p>
</li>
<li><p><strong>The Metric:</strong> Here you use the open-ended techniques: LLM-as-a-Judge to check for helpfulness, or human spot-checks.</p>
</li>
</ul>
</li>
</ul>
<p>By separating these, you can pinpoint the real problem. Is the chatbot failing because the knowledge base is bad (a content/retrieval problem) or because it's bad at explaining things (a generation problem)? This distinction is critical for assigning bugs and planning sprints.</p>
<h2 id="heading-it-all-comes-down-to-data-and-a-hybrid-approach"><strong>It All Comes Down to Data and a Hybrid Approach</strong></h2>
<p>The uncomfortable truth for any team building with AI is that the biggest challenge isn't the metrics; it's the <strong>test data</strong>. Creating a high-quality, diverse evaluation set that reflects real-world usage is an expensive, ongoing engineering and product effort. I’ve seen many teams deprioritize this, only to build evaluation systems that show green lights while the actual user experience stagnates.</p>
<p>The most successful GenAI companies, like OpenAI and Anthropic, have a core competency in data curation for evaluation. This is not a coincidence.</p>
<p><strong>An Action Plan for Teams:</strong></p>
<ol>
<li><p><strong>Treat the Test Set as a Product:</strong> It needs to be versioned, maintained, and expanded alongside your main product. This is a shared responsibility: product managers define the key use cases to cover, and engineers build the infrastructure to test against them reliably.</p>
</li>
<li><p><strong>Build a Hybrid Evaluation Dashboard:</strong> Don't rely on one number. Your team's dashboard should provide a complete picture:</p>
<ul>
<li><p><strong>Automated Metrics (ROUGE/Semantic Similarity):</strong> For a quick, directional pulse in your CI/CD pipeline.</p>
</li>
<li><p><strong>LLM-as-a-Judge Scores:</strong> For scalable, criteria-based quality checks on nuanced attributes like "helpfulness" or "brand tone."</p>
</li>
<li><p><strong>Human Feedback:</strong> A direct pipeline from a "thumbs up/down" button in the UI or periodic human reviews on critical flows.</p>
</li>
</ul>
</li>
<li><p><strong>Define "Quality" Holistically:</strong> Your team's definition of done must go beyond factual correctness. Your evaluation rubric should include:</p>
<ul>
<li><p><strong>Safety:</strong> Does it avoid harmful or inappropriate language?</p>
</li>
<li><p><strong>Tone of Voice:</strong> Does it sound like your brand?</p>
</li>
<li><p><strong>User Satisfaction:</strong> Is the user actually achieving their goal?</p>
</li>
</ul>
</li>
<li><p><strong>Deconstruct Your Systems:</strong> For multi-step systems like RAG, insist on component-level metrics. When a metric drops, you need to know <em>which part</em> of the system broke to debug efficiently.</p>
</li>
</ol>
<p>Building in the GenAI era requires a new level of rigor in how we define and measure quality. Moving past simplistic scores to a holistic evaluation framework is no longer optional—it's the core work of building AI products that are not just impressive demos, but are also reliable, safe, and genuinely useful.</p>
]]></content:encoded></item><item><title><![CDATA[The Decline of the "Prompt Expert": Why AI Is Making Prompt Engineering Obsolete]]></title><description><![CDATA[For the past few years, the rise of large language models (LLMs) has fueled a growing industry of so-called "prompt experts"—people who claim to have mastered the art of crafting precise instructions to extract the best results from AI. But as LLMs b...]]></description><link>https://mundher.com/the-decline-of-the-prompt-expert-why-ai-is-making-prompt-engineering-obsolete</link><guid isPermaLink="true">https://mundher.com/the-decline-of-the-prompt-expert-why-ai-is-making-prompt-engineering-obsolete</guid><dc:creator><![CDATA[Mundher Al-Shabi, PhD]]></dc:creator><pubDate>Sun, 09 Feb 2025 18:07:55 GMT</pubDate><content:encoded><![CDATA[<p>For the past few years, the rise of large language models (LLMs) has fueled a growing industry of so-called "prompt experts"—people who claim to have mastered the art of crafting precise instructions to extract the best results from AI. But as LLMs become more advanced, the importance of prompt engineering is rapidly diminishing. The reality is simple: AI is getting better at understanding natural language, making elaborate prompting techniques increasingly unnecessary.</p>
<h2 id="heading-ai-is-becoming-more-intuitive">AI Is Becoming More Intuitive</h2>
<p>The early days of LLMs often required users to experiment with different phrasings to get the best results. However, modern AI models are trained on vast amounts of data and improved architectures that enable them to interpret instructions more naturally. Instead of needing a carefully structured prompt, today’s models can process vague, casual, or even slightly ambiguous commands with ease.</p>
<p>For example, early models required precise formatting, explicit step-by-step breakdowns, and structured wording. Now, newer models can infer context, understand implied meaning, and generate useful outputs without the need for complex prompt tuning. This means that instead of focusing on how to "trick" the AI into giving the best answer, users can simply ask questions as they would to a knowledgeable human.</p>
<h2 id="heading-the-overhyped-industry-of-prompt-engineering">The Overhyped Industry of "Prompt Engineering"</h2>
<p>As with any emerging technology, a subset of self-proclaimed experts have positioned themselves as gatekeepers, offering courses, guides, and consulting services on how to craft the perfect prompt. While some strategies may have been helpful in the past, the need for such expertise is rapidly fading.</p>
<p>Most prompt engineering advice boils down to common-sense practices like being clear, specifying output format, or providing context—all things that even casual users can figure out intuitively. The AI itself is improving at handling ambiguity, reducing the necessity for highly refined prompts. As a result, the idea that businesses need dedicated "prompt specialists" is becoming increasingly outdated.</p>
<h2 id="heading-the-future-conversational-ai-not-manual-tweaking">The Future: Conversational AI, Not Manual Tweaking</h2>
<p>Instead of relying on highly specific prompts, the future of LLMs is in their ability to engage in dynamic, natural conversations. AI systems are evolving to ask clarifying questions, refine their own outputs, and adapt based on user feedback. This means that rather than needing a human to master a rigid prompting technique, AI itself will adjust based on user intent.</p>
<p>Think about how we interact with human assistants: we don’t script perfect instructions in advance; we communicate, clarify, and refine our requests in real time. That’s exactly where AI is heading. The need to manually craft prompts will soon be seen as an unnecessary relic of early AI experimentation.</p>
]]></content:encoded></item><item><title><![CDATA[What Can Machine Learning Engineers Learn from Site Reliability Engineering?]]></title><description><![CDATA[Machine learning engineers transitioning from experimental models to production systems can significantly benefit from adopting principles established in Site Reliability Engineering (SRE). By integrating SRE practices, ML engineers can build systems...]]></description><link>https://mundher.com/what-can-machine-learning-engineers-learn-from-site-reliability-engineering</link><guid isPermaLink="true">https://mundher.com/what-can-machine-learning-engineers-learn-from-site-reliability-engineering</guid><dc:creator><![CDATA[Mundher Al-Shabi, PhD]]></dc:creator><pubDate>Wed, 05 Feb 2025 10:48:39 GMT</pubDate><content:encoded><![CDATA[<p>Machine learning engineers transitioning from experimental models to production systems can significantly benefit from adopting principles established in Site Reliability Engineering (SRE). By integrating SRE practices, ML engineers can build systems that are not only accurate but also robust, scalable, and reliable. Below are key lessons drawn from SRE that directly apply to ML engineering:</p>
<hr />
<h2 id="heading-1-define-slis-and-slos-beyond-model-accuracy"><strong>1. Define SLIs and SLOs Beyond Model Accuracy</strong></h2>
<p>Traditional ML metrics like accuracy or F1 scores are insufficient for production systems. SRE emphasizes <strong>Service Level Indicators (SLIs)</strong> and <strong>Service Level Objectives (SLOs)</strong> to quantify reliability. For ML systems, this includes:</p>
<ul>
<li><p><strong>Latency</strong>: Response time for model inference.</p>
</li>
<li><p><strong>Availability</strong>: Uptime of ML APIs or services .</p>
</li>
<li><p><strong>Data Drift</strong>: Monitoring input distribution shifts that degrade model performance.<br />  By setting SLOs for these metrics, teams can prioritize reliability alongside accuracy.</p>
</li>
</ul>
<hr />
<h2 id="heading-2-automate-deployment-and-monitoring"><strong>2. Automate Deployment and Monitoring</strong></h2>
<p>SRE reduces manual toil through automation, a practice critical for ML workflows:</p>
<ul>
<li><p><strong>CI/CD Pipelines</strong>: Automate model deployment with rollback capabilities to handle faulty updates.</p>
</li>
<li><p><strong>Self-Healing Systems</strong>: Use ML to detect anomalies (e.g., data pipeline failures) and trigger remediation.</p>
</li>
<li><p><strong>Testing</strong>: Integrate automated canary testing to validate model performance in staging before full rollout.<br />  Automation minimizes human error and accelerates iteration.</p>
</li>
</ul>
<hr />
<h2 id="heading-3-prioritize-observability-for-silent-failures"><strong>3. Prioritize Observability for Silent Failures</strong></h2>
<p>ML systems often fail silently (e.g., gradual accuracy decay). SRE-inspired observability includes:</p>
<ul>
<li><p><strong>Model Metrics</strong>: Track precision/recall over time and correlate with infrastructure health.</p>
</li>
<li><p><strong>Data Lineage</strong>: Monitor data pipelines to catch preprocessing errors or missing features.</p>
</li>
<li><p><strong>Root Cause Analysis</strong>: Use tools like tracing to link model failures to specific code or data changes.<br />  Comprehensive observability helps detect issues before users are impacted.</p>
</li>
</ul>
<hr />
<h2 id="heading-4-formalize-incident-response-for-model-failures"><strong>4. Formalize Incident Response for Model Failures</strong></h2>
<p>Treat model failures like system outages using SRE incident management practices:</p>
<ul>
<li><p><strong>Runbooks</strong>: Document steps to diagnose and resolve common issues (e.g., data drift).</p>
</li>
<li><p><strong>Blameless Postmortems</strong>: Analyze failures to improve processes rather than assign blame.</p>
</li>
<li><p><strong>Escalation Paths</strong>: Define roles for triaging severe incidents (e.g., automated rollbacks vs. human intervention).<br />  Proactive incident management reduces downtime and builds trust.</p>
</li>
</ul>
<hr />
<h2 id="heading-5-design-for-resilience"><strong>5. Design for Resilience</strong></h2>
<p>SRE emphasizes building systems that withstand failures. ML engineers should:</p>
<ul>
<li><p><strong>Implement Fallbacks</strong>: Deploy simpler models (e.g., rule-based systems) as backups during outages.</p>
</li>
<li><p><strong>Redundancy</strong>: Replicate data pipelines and model servers to avoid single points of failure.</p>
</li>
<li><p><strong>Chaos Engineering</strong>: Test system resilience by intentionally injecting failures (e.g., synthetic data corruption).<br />  Resilient design ensures graceful degradation under stress.</p>
</li>
</ul>
<hr />
<h3 id="heading-conclusion"><strong>Conclusion</strong></h3>
<p>Adopting SRE principles bridges the gap between experimental ML and production-grade systems. By focusing on reliability metrics, automation, observability, and resilience, ML engineers can create solutions that are not just innovative but also dependable at scale. As ML systems grow in complexity, the SRE mindset—proactive, data-driven, and iterative—will be indispensable for maintaining performance and user trust .</p>
]]></content:encoded></item><item><title><![CDATA[The EU AI Act: First Compliance Deadline is Here]]></title><description><![CDATA[February 2 marks the first compliance deadline for the EU’s AI Act, the groundbreaking regulatory framework that officially took effect last August. This legislation sets a global precedent in defining clear boundaries for the development and deploym...]]></description><link>https://mundher.com/the-eu-ai-act-first-compliance-deadline-is-here</link><guid isPermaLink="true">https://mundher.com/the-eu-ai-act-first-compliance-deadline-is-here</guid><dc:creator><![CDATA[Mundher Al-Shabi, PhD]]></dc:creator><pubDate>Wed, 05 Feb 2025 10:44:26 GMT</pubDate><content:encoded><![CDATA[<p>February 2 marks the first compliance deadline for the EU’s AI Act, the groundbreaking regulatory framework that officially took effect last August. This legislation sets a global precedent in defining clear boundaries for the development and deployment of artificial intelligence within the European Union.</p>
<h3 id="heading-key-provisions-and-banned-ai-applications">Key Provisions and Banned AI Applications</h3>
<p>As of today, certain AI applications are outright banned to protect fundamental rights, privacy, and societal well-being. These include:</p>
<p>❌ <strong>Social Scoring Based on Personal Behavior:</strong> AI systems that rank individuals based on their social conduct, similar to systems used in some authoritarian regimes.</p>
<p>❌ <strong>Manipulative or Deceptive AI:</strong> Technologies designed to influence user decisions through covert manipulation or exploitation of psychological vulnerabilities.</p>
<p>❌ <strong>AI Exploiting Vulnerabilities:</strong> Systems targeting individuals based on specific vulnerabilities related to age, disability, or socioeconomic status.</p>
<p>❌ <strong>Crime Prediction Based on Appearance:</strong> AI models making predictions about criminal behavior based solely on physical traits.</p>
<p>❌ <strong>Biometric AI Inferring Personal Characteristics:</strong> Technologies that deduce sensitive personal information, such as sexual orientation, from biometric data.</p>
<p>❌ <strong>Real-Time Biometric Surveillance in Public Spaces:</strong> The use of AI for continuous biometric monitoring in public areas without stringent legal oversight.</p>
<p>❌ <strong>Emotion Recognition at Work or School:</strong> AI tools aimed at analyzing emotions in professional or educational environments, which can lead to invasive surveillance.</p>
<p>❌ <strong>Facial Recognition Databases from Online Scraping:</strong> Databases created by harvesting facial images from the internet without explicit consent.</p>
<h3 id="heading-what-this-means-for-businesses">What This Means for Businesses</h3>
<p>Organizations operating within the EU or offering AI products and services in the region must ensure compliance with these regulations. Non-compliance can result in hefty fines and reputational damage. Businesses should:</p>
<ul>
<li><p><strong>Conduct thorough audits</strong> of their AI systems.</p>
</li>
<li><p><strong>Eliminate or modify</strong> prohibited functionalities.</p>
</li>
<li><p><strong>Implement robust governance frameworks</strong> to oversee AI ethics and compliance.</p>
</li>
</ul>
<h3 id="heading-looking-ahead">Looking Ahead</h3>
<p>The EU AI Act represents a significant shift towards ethical AI development and responsible deployment. As additional compliance deadlines approach, businesses must stay proactive, continuously adapting to meet evolving regulatory requirements. This new era of AI governance is not just about legal compliance—it’s about fostering trust, accountability, and fairness in technology.</p>
]]></content:encoded></item><item><title><![CDATA[Why Optimizing for Long-Term Value (LTV) Beats Just Chasing Clicks in AdTech]]></title><description><![CDATA[In the fast-paced world of digital advertising, it’s tempting to focus on the metric that’s easiest to measure: Click-Through Rate (CTR). After all, clicks provide immediate feedback, making it seem like a straightforward indicator of campaign perfor...]]></description><link>https://mundher.com/why-optimizing-for-long-term-value-ltv-beats-just-chasing-clicks-in-adtech</link><guid isPermaLink="true">https://mundher.com/why-optimizing-for-long-term-value-ltv-beats-just-chasing-clicks-in-adtech</guid><dc:creator><![CDATA[Mundher Al-Shabi, PhD]]></dc:creator><pubDate>Wed, 05 Feb 2025 10:41:20 GMT</pubDate><content:encoded><![CDATA[<p>In the fast-paced world of digital advertising, it’s tempting to focus on the metric that’s easiest to measure: Click-Through Rate (CTR). After all, clicks provide immediate feedback, making it seem like a straightforward indicator of campaign performance. But here’s the hard truth—a click doesn’t always equal value.</p>
<h3 id="heading-enter-deep-reinforcement-learning-drl">Enter Deep Reinforcement Learning (DRL)</h3>
<p>Unlike traditional models that optimize for short-term gains, such as achieving an immediate click, Deep Reinforcement Learning (DRL) takes a broader perspective. It focuses on maximizing <strong>user lifetime value (LTV)</strong>, a metric that captures the long-term financial contribution of a user.</p>
<h3 id="heading-what-does-this-mean-in-practice">🔍 What Does This Mean in Practice?</h3>
<h4 id="heading-1-beyond-the-first-click">🔥 1. Beyond the First Click</h4>
<p>DRL models evaluate how each ad impression influences not just initial clicks but also <strong>downstream actions</strong> like sign-ups, purchases, repeat visits, and brand loyalty. It shifts the focus from “Did they click?” to “Did that click lead to meaningful engagement?” By analyzing long-term user behavior, DRL ensures that ads are optimized to attract users who are more likely to convert into loyal customers.</p>
<h4 id="heading-2-handling-delayed-rewards">♻️ 2. Handling Delayed Rewards</h4>
<p>In traditional models, the value of an ad is often judged immediately after the click. DRL, however, excels in environments where <strong>rewards are delayed</strong>. Even if a user doesn’t convert right away, DRL algorithms, such as Q-learning with discount factors, track the impact of that interaction over time. This approach allows advertisers to understand the cumulative value of each ad impression, accounting for both immediate responses and future actions.</p>
<h4 id="heading-3-smarter-budget-allocation">🎯 3. Smarter Budget Allocation</h4>
<p>When advertisers chase clicks, they risk overspending on ads that generate high CTR but low conversion rates. DRL changes the game by enabling <strong>smarter budget allocation</strong>. It helps advertisers identify and invest in strategies that nurture long-term customer relationships. This leads to <strong>maximized ROI</strong>, as funds are directed toward campaigns that drive sustainable growth rather than fleeting engagement.</p>
<h3 id="heading-the-bottom-line">The Bottom Line</h3>
<p>While CTR can provide quick insights, it doesn’t capture the full story. Optimizing for <strong>long-term value</strong> through DRL allows advertisers to build deeper connections with their audience, enhance brand loyalty, and achieve greater financial returns. In the evolving landscape of AdTech, focusing on LTV isn’t just a smarter strategy—it’s the future of digital advertising.</p>
]]></content:encoded></item><item><title><![CDATA[Instead of using retrieval to enhance ChatGPT, why not use ChatGPT to improve the retrieval?]]></title><description><![CDATA[Given a query, instruct a generative model (ChatGPT) to write a passage to answer the question. The passage may contain factual errors, but it looks like a good answer!

The generated passage is passed through an Encoder (Contriever) to get the embed...]]></description><link>https://mundher.com/instead-of-using-retrieval-to-enhance-chatgpt-why-not-use-chatgpt-to-improve-the-retrieval</link><guid isPermaLink="true">https://mundher.com/instead-of-using-retrieval-to-enhance-chatgpt-why-not-use-chatgpt-to-improve-the-retrieval</guid><category><![CDATA[chatgpt]]></category><dc:creator><![CDATA[Mundher Al-Shabi, PhD]]></dc:creator><pubDate>Thu, 28 Dec 2023 11:03:10 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1703702599369/da986bdc-a123-45d7-8a19-d9a240126b87.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<ol>
<li><p>Given a query, instruct a generative model (ChatGPT) to write a passage to answer the question. The passage may contain factual errors, but it looks like a good answer!</p>
</li>
<li><p>The generated passage is passed through an Encoder (Contriever) to get the embedding of the passage. The encoder acts like a lossy compressor, where the extra (hallucinated) details are filtered out from the embedding.</p>
</li>
<li><p>A vector to search is performed against the corpus embeddings. The most similar real documents are retrieved and returned.</p>
</li>
</ol>
<p>    <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1703702594304/89f8f21f-8c99-4001-bc19-67662bb2416a.png" alt class="image--center mx-auto" /></p>
<p>Paper: <a target="_blank" href="https://arxiv.org/abs/2212.10496">https://arxiv.org/abs/2212.10496</a></p>
]]></content:encoded></item><item><title><![CDATA[In-Context Few-Shots Prompting Approach]]></title><description><![CDATA[In the Few-Shot Prompting approach, through a few demonstrations, generative models quickly adapt to a specific domain and learn to follow the task format. However, the few-shots examples are fixed for all test examples (during inference). This neces...]]></description><link>https://mundher.com/in-context-few-shots-prompting-approach</link><guid isPermaLink="true">https://mundher.com/in-context-few-shots-prompting-approach</guid><category><![CDATA[search]]></category><category><![CDATA[chatgpt]]></category><dc:creator><![CDATA[Mundher Al-Shabi, PhD]]></dc:creator><pubDate>Fri, 22 Dec 2023 10:43:51 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1703184502346/a3fdbfb5-7f57-421c-b71e-e3f84f8782dc.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the Few-Shot Prompting approach, through a few demonstrations, generative models quickly adapt to a specific domain and learn to follow the task format. However, the few-shots examples are fixed for all test examples (during inference). This necessitates that the few-shot examples selected are broadly representative and relevant to a wide distribution of text examples.</p>
<p>In the alternative, we can have a few-more-shots, and then during the inference, we dynamically select few-shots of them and provide them to the LLM. the criteria for selecting the examples are based on their embedding similarity to the query (KNN). This method is called In-Context few-shots.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1703184954001/4f29f7b1-c8e0-4619-a3fe-ceab82e77f15.png" alt class="image--center mx-auto" /></p>
<p>Paper: <a target="_blank" href="https://arxiv.org/abs/2101.06804">https://arxiv.org/abs/2101.06804</a></p>
]]></content:encoded></item><item><title><![CDATA[Is Gemini Really Better than ChatGPT?]]></title><description><![CDATA[A new third-party study finds Gemini’s Pro model achieved comparable but slightly inferior accuracy compared to the current version of OpenAI’s GPT 3.5 Turbo. However, It outperforms Mixtral on every task.
Furthermore, Gemini performed better than GP...]]></description><link>https://mundher.com/is-gemini-really-better-than-chatgpt</link><guid isPermaLink="true">https://mundher.com/is-gemini-really-better-than-chatgpt</guid><category><![CDATA[gemini]]></category><category><![CDATA[chatgpt]]></category><dc:creator><![CDATA[Mundher Al-Shabi, PhD]]></dc:creator><pubDate>Tue, 19 Dec 2023 19:27:18 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1703013984215/a0daf859-1db1-4fa6-9532-dd75e290ad9f.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A new third-party study finds Gemini’s Pro model achieved comparable but slightly inferior accuracy compared to the current version of OpenAI’s GPT 3.5 Turbo. However, It outperforms Mixtral on every task.</p>
<p>Furthermore, Gemini performed better than GPT 3.5 Turbo on particularly long and complex reasoning tasks and was also adept multilingually in tasks where responses were not filtered.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1703013952791/e377d739-d506-4e66-968d-f4959dedec85.png" alt class="image--center mx-auto" /></p>
<p>source: <a target="_blank" href="https://arxiv.org/abs/2312.11444">https://arxiv.org/abs/2312.11444</a></p>
<p>However, the result on Mixteral should be taken with a grab of salt, as a user on X raised issues on the Mixteral experimental setup (<a target="_blank" href="https://x.com/fluffykittnmeow/status/1737044933339472254?s=20">https://x.com/fluffykittnmeow/status/1737044933339472254?s=20</a>)</p>
]]></content:encoded></item><item><title><![CDATA[Managing AI Risks in an Era of Rapid Progress]]></title><description><![CDATA[Prominent AI researchers, including Geoffrey Hinton, Yoshua Bengio, Stuart Russell, and others, are urging the establishment of global regulations to ensure AI is used responsibly. They propose the creation of a supervisory body, similar to the Nucle...]]></description><link>https://mundher.com/managing-ai-risks-in-an-era-of-rapid-progress</link><guid isPermaLink="true">https://mundher.com/managing-ai-risks-in-an-era-of-rapid-progress</guid><category><![CDATA[AI]]></category><dc:creator><![CDATA[Mundher Al-Shabi, PhD]]></dc:creator><pubDate>Sun, 29 Oct 2023 21:03:29 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1698613377749/44f3cb37-6786-44af-a591-3a9518f2d597.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Prominent AI researchers, including Geoffrey Hinton, Yoshua Bengio, Stuart Russell, and others, are urging the establishment of global regulations to ensure AI is used responsibly. They propose the creation of a supervisory body, similar to the Nuclear Energy Agency, to oversee (watchdog) the most advanced AI systems developed using high-end supercomputers. Concurrently, they recommend exempting smaller, low-risk AI models and academic studies from such regulations. This AI watchdog agency needs access to advanced AI systems before deployment to evaluate them for dangerous capabilities.</p>
<h2 id="heading-my-takeaway">My takeaway</h2>
<p>I’m up for such regulation as long as the small to mid-AI companies aren’t affected.</p>
<p>But, how would they differentiate between the smaller, low-risk AI models and the high-risk ones?</p>
<p>And how can we make countries that don’t trust each other agree on such regulations?</p>
<p>Moreover, how to enforce such regulations globally?</p>
<p>Would it be like the Climate Change Conferences where the world failed to secure a solid commitment?</p>
<p>Paper: <a target="_blank" href="https://managing-ai-risks.com/">https://managing-ai-risks.com/</a></p>
]]></content:encoded></item><item><title><![CDATA[Can LLMs Self-critiquing Their Own Answers?]]></title><description><![CDATA[Self-correction/critiquing is a methodology proposed to improve the accuracy and appropriateness of the generated content by Large Language Models (LLMs). It involves an LLM reviewing its own responses, identifying problems or errors, and revising it...]]></description><link>https://mundher.com/can-llms-self-critiquing-their-own-answers</link><guid isPermaLink="true">https://mundher.com/can-llms-self-critiquing-their-own-answers</guid><category><![CDATA[llm]]></category><dc:creator><![CDATA[Mundher Al-Shabi, PhD]]></dc:creator><pubDate>Sun, 22 Oct 2023 16:09:08 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1697990710760/ee0b8230-bb8c-4ef0-8e5f-66dae992f8ad.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Self-correction/critiquing is a methodology proposed to improve the accuracy and appropriateness of the generated content by Large Language Models (LLMs). It involves an LLM reviewing its own responses, identifying problems or errors, and revising its answers accordingly.</p>
<p>But, If an LLM possesses the ability to self-correct, why doesn’t it simply offer the correct answer in its initial attempt?</p>
<p>In this month (October), two research papers showed that LLMs are not yet capable of self-correcting their reasoning basically because LLMs cannot verify the solution.</p>
<p>Moreover, the iterative mode, where the question and the generated answer are feedback to the LLM over and over is degrading the quality of the answer significantly.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1697990816540/27aa85e3-f7c9-402b-b206-14b06044fe64.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-references"><strong>References:</strong></h2>
<p><a target="_blank" href="https://arxiv.org/abs/2310.01798">LARGE LANGUAGE MODELS CANNOT SELF-CORRECT REASONING YET</a>: by J Huang et. al</p>
<p><a target="_blank" href="https://arxiv.org/abs/2310.12397">GPT-4 Doesn't Know It's Wrong: An Analysis of Iterative Prompting for Reasoning Problems:</a> by K Valmeekam et. all</p>
]]></content:encoded></item><item><title><![CDATA[5 factors should be considered when selecting the Vector Search]]></title><description><![CDATA[Here are 5 factors that should be considered when selecting the Vector Search/Index algorithms.
Data size:

For data sizes under 100K, a brute-force solution utilizing a FLAT index is sufficiently efficient.

Advanced algorithms may not offer signifi...]]></description><link>https://mundher.com/5-factors-should-be-considered-when-selecting-the-vector-search</link><guid isPermaLink="true">https://mundher.com/5-factors-should-be-considered-when-selecting-the-vector-search</guid><category><![CDATA[VectorSearch]]></category><dc:creator><![CDATA[Mundher Al-Shabi, PhD]]></dc:creator><pubDate>Sun, 08 Oct 2023 17:59:24 GMT</pubDate><content:encoded><![CDATA[<p>Here are 5 factors that should be considered when selecting the Vector Search/Index algorithms.</p>
<h3 id="heading-data-size">Data size:</h3>
<ul>
<li><p>For data sizes under 100K, a brute-force solution utilizing a FLAT index is sufficiently efficient.</p>
</li>
<li><p>Advanced algorithms may not offer significant speed improvements in such scenarios.</p>
</li>
</ul>
<h3 id="heading-speed-recall-trade-off">Speed-Recall trade-off:</h3>
<ul>
<li><p>When an exact match is important then brute force is the right solution</p>
</li>
<li><p>Significant query latency reduction is achievable with a minor sacrifice in recall.</p>
</li>
</ul>
<h3 id="heading-memory-limitation">Memory limitation:</h3>
<ul>
<li><p>Some algorithms like HNSW are memory-hungry.</p>
</li>
<li><p>Scalar and Product Quantization significantly reduces storage consumption, at the expense of the Recall.</p>
</li>
</ul>
<h3 id="heading-cpu-vs-gpu">CPU vs GPU:</h3>
<ul>
<li><p>Usually moving from CPU to GPU provides a speed boost</p>
</li>
<li><p>Not all algorithms are optimized for GPU</p>
</li>
</ul>
<h3 id="heading-buildingindexing-time">Building/Indexing time:</h3>
<ul>
<li><p>Sometimes the building/indexing time is crucial.</p>
</li>
<li><p>IVF has a shorter indexing time compared to HNSW.</p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Document Order and Retrieval Strategies for Enhanced LLM Performance]]></title><description><![CDATA[Language models often struggle to use information in the middle of long input contexts, and that performance decreases as the input context grows longer.  
In a recent paper by NF Liu et. al, they discovered to get the best results with RAG (Retrieva...]]></description><link>https://mundher.com/document-order-and-retrieval-strategies-for-enhanced-llm-performance</link><guid isPermaLink="true">https://mundher.com/document-order-and-retrieval-strategies-for-enhanced-llm-performance</guid><category><![CDATA[llm]]></category><category><![CDATA[LLM-Retrieval ]]></category><dc:creator><![CDATA[Mundher Al-Shabi, PhD]]></dc:creator><pubDate>Sat, 07 Oct 2023 13:22:10 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1696684952754/8b228833-bf3a-49c5-9145-276308d1ed65.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Language models often struggle to use information in the middle of long input contexts, and that performance decreases as the input context grows longer.  </p>
<p>In a recent paper by NF Liu et. al, they discovered to get the best results with RAG (Retrieval-Augmented Generation), you should put the most important documents at the beginning or the end. This is shown clearly in Figure 1, where LLMs are better at using relevant information that occurs at the very beginning or end of its input context,</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1696684814275/3169ff33-9149-4145-aff6-515275176262.png" alt class="image--center mx-auto" /></p>
<p>In the second observation, the accuracy drops as we increase the number of documents as shown in Fig. 2. Hence, the retrieval should only pass a handful of documents to the LLM.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1696684832847/76807d61-2985-4339-912f-c19c3785b07a.png" alt class="image--center mx-auto" /></p>
<p>This underscores the significance of document order in the Retrieval part. Here are the things that we can do to increase the accuracy of the Retrieval at the expense of increasing the complexity, cost, and latency:  </p>
<p>- Instead of using simple retrieval, try a hybrid of semantic and lexical search.<br />- Use a Re-ranker as a second stage. So first we select K candidate documents using a retrieval, then re-rank the K document using a ML/DL-based re-ranker.</p>
<p>Reference:<br /><a target="_blank" href="https://arxiv.org/abs/2307.03172">How Language Models Use Long Contexts by NF Liu et. al</a></p>
]]></content:encoded></item><item><title><![CDATA[Vector Search vs. Vector Database]]></title><description><![CDATA[Vector search and vector database share many features and people use the terms interchangeably. But, they aren’t quite the same.  
Vector search is a process wherein a query vector is compared against a collection of vectors to find the most similar ...]]></description><link>https://mundher.com/vector-search-vs-vector-database</link><guid isPermaLink="true">https://mundher.com/vector-search-vs-vector-database</guid><category><![CDATA[VectorSearch]]></category><category><![CDATA[vector database]]></category><dc:creator><![CDATA[Mundher Al-Shabi, PhD]]></dc:creator><pubDate>Sat, 07 Oct 2023 13:17:11 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1696684569806/77b1d041-6095-47bd-98ac-699361e2c07c.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Vector search and vector database share many features and people use the terms interchangeably. But, they aren’t quite the same.  </p>
<p>Vector search is a process wherein a query vector is compared against a collection of vectors to find the most similar vectors based on a certain similarity measure (e.g., cosine similarity, Euclidean distance). Its ascendancy can be largely attributed to the progressive advancements in Approximate Nearest Neighbors (ANN) algorithms. These algorithms, by design, prioritize speed over absolute precision, making vector searches considerably faster and more scalable.  </p>
<p>Examples of vector search engines are Faiss, Annoy, HNSWLIB, and Google Vector Search (previously known as Matching Engine).  </p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1696684560826/0fd001e1-8df1-42c0-8c09-a739eee87ffe.png" alt class="image--center mx-auto" /></p>
<p>On the other hand, a vector database is a storage system designed to handle vector data efficiently. It provides built-in vector search capabilities along with indexing to speed up search operations. It supports CRUD (Create, Read, Update, Delete) operations even during the importing of the data. Unlike Vector Search, a vector database accommodates various data types, making it more versatile than Vector Search, which primarily handles vectors.  </p>
<p>Examples of Vector Search are Qdrant, Milvus, Weaviate, and Pinecone.  </p>
<p>Recently, storage and search engines like PostgreSQL, Redis, and Elasticsearch have integrated vector operations within their solutions.</p>
]]></content:encoded></item><item><title><![CDATA[AI emits 130 to 2900 times less CO2e than a human when doing the same task!]]></title><description><![CDATA[Bill Tomlinson et. al find that an AI (ChatGPT, BLOOM,DALL-E2, Midjourney) writing a page of text emits 130 to 1500 times less CO2e than a human doing so. Similarly, an AI creating an image emits 310 to 2900 times less.  
The figure shown below compa...]]></description><link>https://mundher.com/ai-emits-130-to-2900-times-less-co2e-than-a-human-when-doing-the-same-task</link><guid isPermaLink="true">https://mundher.com/ai-emits-130-to-2900-times-less-co2e-than-a-human-when-doing-the-same-task</guid><category><![CDATA[chatgpt]]></category><dc:creator><![CDATA[Mundher Al-Shabi, PhD]]></dc:creator><pubDate>Sat, 07 Oct 2023 13:12:07 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1696684245491/fcd132e1-ff9e-4338-b268-2bd78422f924.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Bill Tomlinson et. al find that an AI (ChatGPT, BLOOM,DALL-E2, Midjourney) writing a page of text emits 130 to 1500 times less CO2e than a human doing so. Similarly, an AI creating an image emits 310 to 2900 times less.  </p>
<p>The figure shown below compares the CO2e emissions of AI and humans engaged in the task of creating one image. AI produces many times less CO2e than computer usage to support humans in making images.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1696684259914/c5ae4189-ae17-405e-ad2d-ff85168f88f4.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-ia"> </h3>
<p>Reference:</p>
<p><a target="_blank" href="https://arxiv.org/abs/2303.06219">The Carbon Emissions of Writing and Illustrating Are Lower for AI than for Humans by B Tomlinson</a></p>
]]></content:encoded></item><item><title><![CDATA[No, you don’t need vector search to build your RAG]]></title><description><![CDATA[The Retrieval-Augmented Generation (RAG) framework enhances the quality of the Large Language Model (LLM) generated responses by grounding them on external sources, reducing hallucinations or fabricated information. Significantly, RAG allows for the ...]]></description><link>https://mundher.com/no-you-dont-need-vector-search-to-build-your-rag</link><guid isPermaLink="true">https://mundher.com/no-you-dont-need-vector-search-to-build-your-rag</guid><category><![CDATA[LLM-Retrieval ]]></category><category><![CDATA[llm]]></category><dc:creator><![CDATA[Mundher Al-Shabi, PhD]]></dc:creator><pubDate>Sat, 07 Oct 2023 13:04:21 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1696683948773/166cc0b6-5079-4bf1-ae14-e9367fe2e1b5.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The Retrieval-Augmented Generation (RAG) framework enhances the quality of the Large Language Model (LLM) generated responses by grounding them on external sources, reducing hallucinations or fabricated information. Significantly, RAG allows for the integration of updated information from these sources without the need for re-training the generative model.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1696683805154/2e8144fb-8f2f-4118-a4cc-6669f35fee31.jpeg" alt class="image--center mx-auto" /></p>
<p>numerous online tutorials and examples frequently showcase the use of Embeddings with Vector Search, to build the retrieval part of the RAG framework.</p>
<p>Here I will argue that at least some RAG applications don’t need Embeddings or Vector Search, and a BM25 with an Inverted Index is sufficient if not better.</p>
<p>Computational Efficiency: Lexical matching and BM25 provide computational efficiency, quick responses in large databases, and faster data indexing due to their straightforward nature. On the other hand, an Encoder-based deep neural network is needed to generate the embeddings.</p>
<p>Ease of Implementation: Implementing BM25 with Elasticsearch is straightforward, requiring no specialized neural network knowledge or extensive tuning, making it accessible for teams of varied skill levels.</p>
<p>Transparent and Interpretable: The transparency of lexical matching and BM25, unlike black-box neural networks, simplifies troubleshooting and is crucial for understanding search algorithms in business-critical applications.</p>
<p>Consistent Performance: Lexical and BM25 algorithms deliver consistent, predictable performance across various datasets and domains, often valued in production environments despite not capturing nuanced semantic relationships like neural embeddings.</p>
]]></content:encoded></item><item><title><![CDATA[Top-3 Tools for Detection/Preventing Prompt Injection]]></title><description><![CDATA[Rebuff.ai:Rebuff offers 4 layers of defense:- Heuristics: Filter out potentially malicious input before it reaches the LLM.- LLM-based detection: Use a dedicated LLM to analyze incoming prompts - and identify potential attacks.- VectorDB: Store embed...]]></description><link>https://mundher.com/top-3-tools-for-detectionpreventing-prompt-injection</link><guid isPermaLink="true">https://mundher.com/top-3-tools-for-detectionpreventing-prompt-injection</guid><category><![CDATA[llm]]></category><dc:creator><![CDATA[Mundher Al-Shabi, PhD]]></dc:creator><pubDate>Sat, 07 Oct 2023 13:00:53 GMT</pubDate><content:encoded><![CDATA[<p><a target="_blank" href="https://github.com/protectai/rebuff">Rebuff.ai</a>:<a target="_blank" href="https://lnkd.in/g4SidZW6%EF%BF%BCRebuff"><br />Rebuff</a> offers 4 layers of defense:<br />- Heuristics: Filter out potentially malicious input before it reaches the LLM.<br />- LLM-based detection: Use a dedicated LLM to analyze incoming prompts - and identify potential attacks.<br />- VectorDB: Store embeddings of previous attacks in a vector database to recognize and prevent similar attacks in the future.<br />- Canary tokens: Add canary tokens to prompts to detect leakages</p>
<p><a target="_blank" href="https://github.com/leondz/garak/">Garak</a>:<a target="_blank" href="https://lnkd.in/gitkmNtw%EF%BF%BCIt%E2%80%99s"><br />It’s</a> a LLM vulnerability scanner (nmap for LLMs). It supports:<br />- probes for hallucination<br />- data leakage<br />- prompt injection<br />- misinformation<br />- toxicity generation<br />- jailbreaks</p>
<p><a target="_blank" href="https://github.com/utkusen/promptmap">Promptmap</a> was developed by my colleague Utku Sen. It is a tool that automatically tests prompt injection attacks and supports the following attack types:<br />- Basic Injection<br />- Translation Injection<br />- Math Injection<br />- Context-Switch<br />- External Browsing</p>
]]></content:encoded></item><item><title><![CDATA[Mitigate Prompt Injections]]></title><description><![CDATA[Prompt injections 💉 can manipulate language models like ChatGPT when connected to malicious information sources controlled by attackers. This vulnerability is similar to running untrusted code on a computer, but with natural language instructions fo...]]></description><link>https://mundher.com/mitigate-prompt-injections</link><guid isPermaLink="true">https://mundher.com/mitigate-prompt-injections</guid><category><![CDATA[chatgpt]]></category><category><![CDATA[llm]]></category><dc:creator><![CDATA[Mundher Al-Shabi, PhD]]></dc:creator><pubDate>Sat, 07 Oct 2023 12:56:58 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1696684382128/fc109d19-fcb2-4932-ad11-e6992102d969.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Prompt injections 💉 can manipulate language models like ChatGPT when connected to malicious information sources controlled by attackers. This vulnerability is similar to running untrusted code on a computer, but with natural language instructions for the language model. Despite efforts to mitigate the threat, current solutions only increase the time required to exploit the system, rather than eliminating the risk.</p>
<p>Prompt injection poses a challenge as the prompt can contain data, instructions, or both, reminiscent of familiar computer science issues like SQL Injection. A potential remedy is to separate instructions from data using a special Token, such as BERT's [SEP] token. However, many LLM applications, like ReAct and Chatbots, find this separation challenging.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1696683245676/9f3a8a55-9fc5-4959-805c-2e405dedf896.jpeg" alt class="image--center mx-auto" /></p>
<p>A viable alternative is to employ two distinct prompts, differentiated by a unique token: one for system-level instructions and another for user inputs. Both can contain instructions and data, but the system prompt should always take precedence. For this to work, models, during their RLHF training phase, must be conditioned to prioritize the system prompt over conflicting user prompts. An example of such a solution is the System prompt in ChatGPT 4, while it’s a step in this direction, remains vulnerable to sophisticated attacks.</p>
<p>Another approach is pre and post-input sanitization, a mechanism that scrutinizes data entering and exiting the LLM to identify malicious content. This sanitization tool can range from a straightforward rule-based system, equipped with a predefined list of blacklisted keywords, to a more advanced machine-learning classifier. Nevertheless, these systems can be outmaneuvered by sophisticated attacks and necessitate regular updates to the machine-learning model or the blacklist to stay abreast of evolving attack strategies.</p>
]]></content:encoded></item><item><title><![CDATA[Model-Based Retrieval System]]></title><description><![CDATA[Information retrieval aims to locate text-based information relevant to a query within a substantial pool of potential sources. In its early stages, sparse retrieval primarily emphasized term matching using sparse representations and inverted indexes...]]></description><link>https://mundher.com/model-based-retrieval-system</link><guid isPermaLink="true">https://mundher.com/model-based-retrieval-system</guid><category><![CDATA[llm]]></category><category><![CDATA[information retrival]]></category><dc:creator><![CDATA[Mundher Al-Shabi, PhD]]></dc:creator><pubDate>Sat, 07 Oct 2023 12:11:52 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1696684450948/b7dd39f0-831b-4dda-9f61-076c71a2280d.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Information retrieval aims to locate text-based information relevant to a query within a substantial pool of potential sources. In its early stages, sparse retrieval primarily emphasized term matching using sparse representations and inverted indexes like BM25. However, recently, due to the revival of deep learning and the advent of Large Language Models (LLMs), dense retrieval has surpassed traditional sparse retrieval in performance across various tasks.</p>
<p>The most well-established type of dense retrieval is called Dual-Encoders. It consists of two identical encoders that convert the text into embeddings. These embeddings are designed to have similar representations for similar input data and dissimilar representations for dissimilar data.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1696680400794/e6bd2c46-6602-4f14-9149-644a52a357a4.jpeg" alt class="image--center mx-auto" /></p>
<p>The second type is the model-based originally known as Differentiable Search Index (DSI).  This approach involves training a text-to-text model that directly transforms textual queries to the corresponding document identifiers (docids). Basically, the model-based provides query responses solely through its parameters, dramatically simplifying the whole retrieval<br />Process. That means we don’t need any type of index such as Approximate Nearest Neighbors (ANN). Yannic made two videos explaining this approach (check the references).</p>
<p>A recent research paper called TOME suggests breaking down the Model-based into two stages. It tries to solve the problems introduced by the Model-based such as the discrepancy between pre-training and finetuning, and the discrepancy between training and inference. In the first stage, it generates the passage given the query. which we know that LLMs are good at. The second stage, given passage, it returns the URL. The URLs have semantic meaning compared to docids. Therefore, it is easier for the LLM to generate a URL. Another advantage is the architectural similarity between the training and the inference phase.</p>
<p>However, using model-based comes with costs and limitations. ​​First, using model-based types of architectures for information retrieval can result in high latency due to the computational resources required for processing large amounts of data, making real-time or low-latency applications challenging to implement. The second problem is the index (or model) update, where updating the inverted index or vector database is much easier and faster than model update (model incremental update). Finally, because in the model update, we rely on the model parameters to index the document, the model is prone to hallucination</p>
<h3 id="heading-references">References:</h3>
<p><a target="_blank" href="https://arxiv.org/abs/2202.06991">Transformer Memory as a Differentiable Search Index by Y Tay</a></p>
]]></content:encoded></item></channel></rss>