Why LLM Answers to Investment Questions are Often Wrong

Opinions

Sep 7, 2025

When we ask our users how they are using AI tools today, almost all heavy users express both excitement and frustration at the same time: Deep Research, ChatGPT, Claude, Perplexity, and Grok are impressive, but they are often wrong or incomplete, and not dependable for serious investment research work.

Typical Patterns & Issues

Let’s use Perplexity and ChatGPT with “Web Search” as examples (others largely follow the same design pattern). When a user sends in a query, the LLM interprets it and generates search query terms (Step 1 in the illustration above), then conducts searches (Step 2). The search terms are handled just as they would be in a manual Google search. The results are scraped algorithmically (Step 3) and then fed to the LLM for synthesis against the original user query (Step 4). The final answer is then displayed to the user.
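The four steps above can be sketched as a simple loop. This is a toy illustration, not any vendor's actual implementation: `llm`, `search`, and `scrape` below are hypothetical stubs standing in for the real services.

```python
def llm(prompt: str) -> str:
    """Toy stand-in for an LLM call: returns canned text."""
    if prompt.startswith("queries:"):
        return "germany infrastructure budget japan suppliers"
    return "synthesized answer based on provided sources"

def search(term: str, top_k: int = 5) -> list:
    """Toy stand-in for a search engine: returns fake result URLs."""
    return [f"https://example.com/{term.split()[0]}/{i}" for i in range(top_k)]

def scrape(url: str) -> str:
    """Toy scraper; real scrapers often fail on paywalled or JS-heavy pages."""
    return f"article text from {url}"

def answer_with_web_search(user_query: str) -> str:
    # Step 1: the LLM rewrites the user's question into search query terms
    terms = llm("queries: " + user_query).splitlines()
    # Step 2: each term is searched, much like manual Googling
    hits = [url for t in terms for url in search(t)]
    # Step 3: the top results are scraped algorithmically
    pages = [scrape(u) for u in hits[:5]]
    # Step 4: scraped text plus the original question go back to the LLM for synthesis
    return llm("answer " + user_query + " using: " + " | ".join(pages))

print(answer_with_web_search("Which Japan-listed companies benefit from Germany's infrastructure budget?"))
```

Each stage is a lossy hand-off: an error introduced at any step propagates to the final answer, which is what the rest of this section walks through.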

Each of those four steps can introduce errors when dealing with a finance-related task.

Errors in Step 1: Imperfect and Inconsistent Queries

For example, suppose I’m interested in which companies listed in Japan could benefit from Germany’s recent infrastructure budget increase. Ask a human to come up with the right Google search term, and you will find a variety of potential queries to try. Perplexity will pick maybe 2-3 queries. It may also use different query terms in the afternoon than in the morning, due to the LLM’s innate variability in token generation. So even if a good article on this topic exists, imperfect query terms may miss it, or yield inconsistent results.

Another way to push horizontal LLM apps until they break: ask a question that requires synthesizing information from multiple sources or time periods, for example “give me each month’s seat capacity and utilization figures for Cathay Pacific’s fleet over the last 12 months” or “Cathay Pacific’s track record of exceeding or missing management targets over each of the last 5 years”. The LLM can either try such queries literally, in case some article happens to cover them, or break the task down into sub-problems, generate many queries, and assemble an answer from the pieces. The second way is agentic, but it requires the right domain knowledge to decompose the task properly. The first way is the default for most LLM systems today, even under Deep Research mode.
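The difference between the two strategies can be made concrete. Below is a hypothetical sketch, not any product's actual logic: the literal strategy issues the question verbatim, while the decomposed strategy encodes domain knowledge (airlines report monthly traffic releases with ASK and load-factor figures) to generate one targeted query per month.

```python
from datetime import date

def literal_queries(question: str) -> list:
    # Strategy 1 (the usual default): search the question verbatim and hope
    # some single article already aggregates the full answer.
    return [question]

def decomposed_queries(asof: date, n_months: int = 12) -> list:
    # Strategy 2 (agentic): split the task into sub-problems, one search per
    # monthly traffic release. Knowing that "ASK" and "load factor" are the
    # metrics airlines actually disclose is the domain knowledge being encoded.
    queries = []
    y, m = asof.year, asof.month
    for _ in range(n_months):
        queries.append(f"Cathay Pacific traffic figures {y}-{m:02d} ASK load factor")
        m -= 1
        if m == 0:
            y, m = y - 1, 12
    return queries

print(len(decomposed_queries(date(2025, 9, 1))))  # 12 targeted queries vs. 1 literal one
```

Without that domain knowledge, a system cannot know which sub-queries to generate, which is why the literal strategy remains the default.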

Errors in Step 2: Bad Search Results

This is very common. Google is not perfect. Even if the LLM gets the query terms right, and those terms are precise and straightforward, the top search results may not be good hits. Humans sometimes have to click around and dig hard to find the right one among the results; ChatGPT and Perplexity usually just look at the top 5-10.

For example, if you search “Q2 2024 earnings report for Mitsubishi Heavy” on Google, the link to the right PDF file is not even among the top 5 results. If Perplexity or ChatGPT cannot find the right source, how can it give the right answer?

Another issue is that the top search results may be outdated (e.g. wrong CEO name shows up at the top if you search a few hours or days after a company changes its CEO). 

Search engines are also bad at retrieving older content and information from the past. For example, if you search for events or news around Mitsubishi Heavy on 2024/3/6, most of the results do not match the requested date. This matters a lot for investor use cases that involve studying a company’s price and event history.

Errors in Step 3: Higher-Quality Content Is Unscrapable

Bloomberg, WSJ, and the Financial Times probably publish much better opinion pieces than Benzinga, Zacks, and Seeking Alpha. But when results from all of these sites show up in Step 3, LLM apps can only access the content of the lower-quality ones; the paywalled content is out of the picture. This tends to lead to low-quality or incomplete answers. It is why, in Perplexity Finance, the bullish / bearish views always come from Zacks, Morningstar, Benzinga, and maybe investing.com. Some may not mind this, but serious investors do; in fact, many do not want their thinking or news sources contaminated by those outlets.

Errors in Step 4: Hallucination, Needle in a Haystack, and a Poor Sense of Time

Even if the right content is scraped and provided to the LLM for synthesis, it may still fail the job. It may hallucinate when information is missing from the content provided. And when a lot of information is packed into the context (from multiple long articles or documents), the LLM may not accurately locate the most relevant piece (the needle-in-a-haystack challenge). These are well-documented LLM issues. For example, ChatGPT and Perplexity both failed to identify Japan’s Ministry of Defense as a 10%+ revenue contributor to Mitsubishi Heavy, a fact clearly disclosed in all recent annual reports.

We also noticed another issue, likely an engineering flaw: Perplexity is sometimes unaware of chronological order when formulating answers from search results (see below for an example).

Other issues: lack of data integration and domain-specific agentic workflows

Most of today’s products are also early in achieving the breadth of data integration needed for professional investment research. Financials, stock prices, consensus estimates, and multiples are must-haves, but ChatGPT and Gemini cannot access them. Claude recently started integrating with CapIQ, and Perplexity enabled a FactSet integration, but those are only available to users who already hold those licenses, and the integrations will need more time to fully work well. Claude also lacks a good integration for historical news and events. On top of all this, no system today, except Distilla, preprocesses and indexes qualitative knowledge: supply chains, business models, historical MD&A topics, targets and guidance, and much more.

We believe the investment domain requires its own vertical agentic system. It takes years to train a seasoned investment analyst to know what questions to ask and how to approach and break down those questions. Just as a brilliant PhD without investment experience cannot excel at a hedge fund, a general LLM tool or agentic system will not do the right work in the professional investment domain.

50% Success Rate or Maybe Lower

In short, assume an 80% success rate in translating the user’s task into the best query terms, 80% in having the best content appear in the top search results, 80% in having that content successfully scraped into the context, and 80% in conducting the right synthesis from it. The final success rate compounds to around 40% (0.8⁴ ≈ 41%). For a non-finance query, the success rate may be higher, but the unique ecosystem and complexity of investing make these 80% step-wise assumptions likely still higher than reality: high-value content is more rigorously paywalled, financial documents tend to be long, and financial analysis tasks tend to be complex and dynamic, with no clear right-or-wrong boundary.
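The back-of-the-envelope math is just independent probabilities multiplied together; the 80% figures are illustrative assumptions, not measurements.

```python
# Assumed per-step success rates for the four-step pipeline described above.
step_success = {
    "query generation": 0.80,   # Step 1: task -> best query terms
    "search ranking":   0.80,   # Step 2: best content in top results
    "scraping":         0.80,   # Step 3: content actually scraped
    "synthesis":        0.80,   # Step 4: correct synthesis from context
}

overall = 1.0
for step, p in step_success.items():
    overall *= p  # errors compound: every step must succeed

print(f"end-to-end success: {overall:.0%}")  # 41%
```

Lowering any single step to 60% (plausible for paywalled finance content in Step 3) drags the end-to-end rate to roughly 31%, which is why the post argues the real number is likely below 40%.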

Distilla’s Unique Solution 

Instead of lazy, on-demand web or document searches, we ingest and pre-process the information sources typically important and necessary for professional investors, and painstakingly optimize LLM pipelines to extract structured knowledge from them.

By storing the most important and commonly used information in our system, we avoid the common pitfalls mentioned above. How do we ensure we get that information right?

  • We structure knowledge by industry and refine multi-step prompts to extract information from filings and transcripts at a 90%+ accuracy rate (vs. 60-70% for unrefined RAG). We also make sure the filings we use contain maximal information, e.g. we use Japanese filings for Japan-listed companies, which disclose more than the English versions (when an English version exists at all).

  • We ingest, dedupe, and synthesize news daily from a curated list of sources, going back 2+ years, so that our knowledge base contains a time-indexed event list for each company we cover, all from top business and finance sites. Even where we cannot scrape paywalled sites, we work with 3rd-party vendors to obtain summaries of those articles.

  • We codify typical professional investors’ workflows into our system, such as finding drivers, writing primers, and conducting technical analyses. We also let our agent break down user requests against our knowledge base first, with web searches as a fallback and supplement. This offers a more definitive task-breakdown path, similar to a human analyst’s, instead of the open-ended search method.
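The knowledge-base-first routing in the last bullet reduces to a simple precedence rule. This is a toy sketch under stated assumptions, not Distilla's actual internals: `KB`, `lookup_kb`, and `web_search` are hypothetical stand-ins.

```python
# Toy pre-processed knowledge base: query -> structured, time-indexed fact.
KB = {
    "mitsubishi heavy customers":
        "Japan's Ministry of Defense contributes 10%+ of revenue (per annual reports)",
}

def lookup_kb(query: str):
    """Return a pre-extracted fact if the knowledge base covers the query."""
    return KB.get(query.lower())

def web_search(query: str) -> str:
    """Stand-in for open-ended web search, used only as a fallback."""
    return f"(fallback) top web results for: {query}"

def answer(query: str) -> str:
    # Try the curated, pre-processed knowledge base first...
    hit = lookup_kb(query)
    if hit is not None:
        return hit
    # ...and only fall back to open-ended web search when it has nothing.
    return web_search(query)

print(answer("Mitsubishi Heavy customers"))
```

The design choice is that pre-extracted facts avoid the query-generation, ranking, and scraping failure modes described earlier, so the error-prone search path is exercised only when the knowledge base has no coverage.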

In short, Distilla has pre-processed a broad set of info into a proprietary knowledge base using AI, and this unique approach yields a more robust vertical solution for professional investors. 

Why does no one else do this? Not even Bloomberg has structured qualitative knowledge on each company in its database. Because only recent LLM developments have made it economical and feasible, and because it requires a great deal of hard work and domain knowledge.

We are still in the early days, and we recognize there are many improvements still to make. We hope to build Distilla over time into a system that professional investors can fully trust and rely on 24/7. If you are interested in our vision and product, please drop us a note. We will respond within 12 hours and would be more than happy to jump on a call.

About Distilla

Distilla is an AI-powered insight-generation engine, made by veteran investors for serious fundamental investors. Designed as a full-cycle acceleration platform, Distilla’s agents and AI content help make investors more efficient in ideation, initiation, analysis, thesis iteration, and tracking. Powered by a proprietary knowledge base and analytical frameworks codified from the best investors, Distilla delivers higher-quality outputs and better insights. Get in touch with us at info@distilla.ai.

440 N Wolfe Rd, Sunnyvale, CA 94085, United States

Copyright ©2025 Distilla, Inc. All rights reserved.
