Can AI Write a CMA? Testing AI Property Valuations

If you've spent any time online in the last two years, you've probably typed an address into a chatbot just to see what would happen. Maybe you asked ChatGPT "what's my house worth?" Maybe you pasted in a property description and asked for a price range. And maybe — like a lot of people — you got an answer that sounded confident, specific, and almost convincing.
That's the problem.
AI tools are remarkably good at sounding like they know what they're talking about, even when they don't. And nowhere is that more dangerous than in real estate, where a single percentage point of error on a home's value can mean tens of thousands of dollars in a buyer's offer, a seller's listing price, or a refinance decision.
This article is an experiment. We took a real property, ran it through several AI tools to see what kind of valuation and comparative market analysis (CMA) each one would produce, and then compared those outputs against a CMA built the way a human real estate professional actually builds one — using live MLS data, on-the-ground knowledge of the neighborhood, and adjustments based on the physical realities of the property.
The goal isn't to bash AI. Large language models and automated valuation models (AVMs) are genuinely useful tools, and ignoring them isn't realistic. The goal is to map out, in concrete terms, where AI-generated valuations are helpful, where they're neutral, and where they can actively mislead a buyer or seller into a costly mistake.
If you're searching for information on AI property valuation and CMA accuracy, this is the long-form breakdown you've been looking for.

What Is a CMA, and Why Does It Matter?

Before diving into the AI side of things, it's worth being precise about what a CMA actually is — because part of the confusion around "can AI do this?" comes from people not knowing what "this" really involves.
A Comparative Market Analysis (CMA) is a report, typically prepared by a real estate agent or broker, that estimates a property's market value by comparing it to similar properties that have recently sold, are currently listed, or were listed but didn't sell (expired or withdrawn listings). A good CMA isn't just a list of numbers — it's an analysis. It involves:
- Selecting truly comparable properties. Same general area, similar size, similar age, similar style, and ideally sold within the last three to six months.
- Adjusting for differences. If the subject property has a finished basement and the comparable doesn't, that needs a dollar adjustment. Same goes for garage spaces, lot size, views, renovations, and dozens of other variables.
- Weighting the comparables. Not every comp is equally relevant. A property that sold six months ago in a declining market carries less weight than one that sold last week.
- Accounting for local market conditions. Is inventory tight? Are days-on-market shrinking or growing? Is the area trending up or down relative to the broader market?
- Applying local knowledge that data alone can't capture. Things like: this street backs up to a busy road and that hurts value, or this subdivision has an HOA that scares off certain buyers, or homes on the north side of this neighborhood get less afternoon sun and sell for slightly less.
A licensed appraiser does something similar but more rigorous, governed by the Uniform Standards of Professional Appraisal Practice (USPAP), and the resulting report carries legal and financial weight for lending purposes.
The question we're testing is whether an AI tool — without direct, live access to MLS data, without the ability to physically walk the property, and without years of hyperlocal market experience — can replicate any meaningful portion of that process.

The Experiment: Setting Up a Real-World Test

To make this a fair test, we selected a real single-family residential property that is currently or was recently active in a local MLS. It sits in a suburban/exurban market with a mix of property styles and price points, the kind of market where "comps" aren't cookie-cutter subdivision homes but a patchwork of different lot sizes, ages, and finish levels. This is intentionally a harder test than a tract-home subdivision where every house is nearly identical.
The subject property characteristics, generalized for this test:
- Single-family home, built in the 1990s
- Approximately 2,400 finished square feet
- A walkout lower level with roughly 900 additional finished square feet
- 4 bedrooms, 3 bathrooms
- 2-car attached garage
- Lot size just under 1 acre
- Updated kitchen (within the last 5 years), original bathrooms
- No mountain or water view, but wooded lot with privacy
We then ran this property — using both the address (where the tool allowed it) and a detailed written description (for tools without address lookup) — through the following:
- A general-purpose AI chatbot (the kind most people reach for first)
- A second general-purpose AI chatbot from a different provider, for comparison
- A consumer automated valuation model (AVM) — the type of "instant estimate" tool built into major real estate listing sites
- A human-prepared CMA, built using actual MLS data, current and sold comparables, and standard appraisal-adjustment methodology
Each was asked, in some form, the same underlying question: What is this property worth, and why?

Round 1: The General-Purpose AI Chatbot

When you ask a general-purpose AI chatbot to value a property, one of two things tends to happen.
Scenario A: The chatbot is honest about its limitations. It tells you it doesn't have access to real-time MLS data, can't browse current listings, and can only give a rough estimate based on general knowledge of the area and any details you provide. This is the correct answer, and to its credit, when given an address it had no way to verify, the chatbot in our test acknowledged this limitation upfront.
Scenario B: The chatbot tries anyway — and gets specific. Despite the disclaimer, when pushed for a number, the chatbot produced a price range. And here's where things got interesting: the range it produced was based on a blend of general price-per-square-foot assumptions for the broader region, not the specific submarket the property was in.
The chatbot's estimate for the subject property landed meaningfully below the eventual human CMA range — and the reason why is instructive. The AI applied a price-per-square-foot figure to the total square footage (finished main levels plus the walkout basement) as if all square footage were created equal. But in this market — and in most markets — below-grade square footage does not sell for the same price per square foot as above-grade space. A walkout basement with good natural light and quality finishes might be worth somewhere in the neighborhood of 90-95% of above-grade value per square foot, but a fully subterranean basement with small windows might be worth considerably less. The AI had no way to know which scenario applied, so it either ignored the distinction entirely or applied a generic discount that didn't reflect this specific property.
This one issue (treating below-grade and above-grade square footage as interchangeable, or applying a one-size-fits-all discount) was the largest single source of valuation error across every AI tool we tested. It's worth sitting with that for a moment, because it's not a minor rounding error. On a property with roughly 900 square feet of finished basement, a miscalculation of $20-40 per square foot on that space alone translates to an $18,000-$36,000 swing in estimated value.
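The arithmetic behind that swing is simple enough to check directly. The short Python sketch below just multiplies the finished below-grade area by the per-square-foot error; the function name is ours, and the inputs are the figures quoted above.

```python
def basement_value_swing(below_grade_sqft: int, error_low: int, error_high: int) -> tuple[int, int]:
    """Return the (low, high) dollar swing caused by mispricing below-grade space.

    error_low and error_high are the per-square-foot pricing errors, in dollars.
    """
    return below_grade_sqft * error_low, below_grade_sqft * error_high


low, high = basement_value_swing(900, 20, 40)
print(f"${low:,} to ${high:,}")  # $18,000 to $36,000
```

The same function makes it easy to see how quickly the error grows on homes with larger finished basements.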
When asked to explain its reasoning, the chatbot provided a reasonable-sounding narrative: it referenced "typical" comparable sales in "the area," cited general trends like rising interest rates affecting buyer demand, and noted that homes with updated kitchens "typically command a premium." All of this was true in a general sense. None of it was tied to actual, verifiable sales data for this specific property's micro-market.
This is the core tension with AI-generated valuations: the writing is excellent and the reasoning sounds plausible, but the underlying data is generic, outdated, or in some cases fabricated.

Round 2: A Second AI Chatbot, Same Property

To check whether the first result was a fluke, we ran the identical property description through a second major AI chatbot.
The result was different — and that difference is itself the point. The second tool produced an estimate that was higher than the first by roughly 8-10%, despite being given the exact same inputs. When two AI tools, given identical information, produce estimates that differ by tens of thousands of dollars, that's a signal that neither estimate should be treated as authoritative.
The second chatbot also did something the first one didn't: when asked to list "comparable sales" supporting its estimate, it generated what looked like specific property references — approximate addresses, sale prices, and dates. On closer inspection, these were not real, verifiable transactions. They were plausible-sounding constructions based on the kind of data the model had been trained on, presented with enough specificity to seem real.
This is arguably the single most dangerous failure mode in AI-generated CMAs: fabricated comparables presented with real-looking detail. A consumer who doesn't know better might take that list of "comps" at face value, not realizing that no such sales exist in the public record. If that consumer then uses those numbers to justify an offer price, a listing price, or a negotiation position, they're building a financial decision on a foundation that doesn't exist.
To be clear, this isn't a case of the AI being malicious; it's a known characteristic of how these models generate text. They predict plausible-sounding sequences of words based on patterns, and a line like "123 Oak Street, sold for $X on [date]" is a very plausible-looking pattern, regardless of whether 123 Oak Street exists or sold for that price.
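One practical defense is to treat every AI-listed comp as unverified until it matches a record from a trusted source. The Python sketch below shows one way that cross-check might look, assuming you already have a list of verified sales (for example, an export your agent pulled from the MLS). The `Sale` type, the helper names, and all of the sample records are hypothetical and made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sale:
    address: str
    price: int
    sale_date: str  # ISO format, e.g. "2024-05-01"


def normalize(address: str) -> str:
    """Lowercase, drop periods, and collapse whitespace so formatting differences don't matter."""
    return " ".join(address.lower().replace(".", "").split())


def unverified_comps(claimed: list[Sale], verified: list[Sale]) -> list[Sale]:
    """Return the claimed comps that have no exact match in the verified records."""
    verified_keys = {(normalize(s.address), s.price, s.sale_date) for s in verified}
    return [
        c for c in claimed
        if (normalize(c.address), c.price, c.sale_date) not in verified_keys
    ]


# Made-up records for illustration only; none of these are real transactions.
verified_sales = [Sale("12 Elm Rd.", 610_000, "2024-05-01")]
chatbot_comps = [
    Sale("12 elm rd", 610_000, "2024-05-01"),
    Sale("123 Oak Street", 595_000, "2024-04-15"),
]

for comp in unverified_comps(chatbot_comps, verified_sales):
    print(f"No matching record: {comp.address}")
```

An exact match on address, price, and date is a deliberately strict rule. A comp that fails it is not necessarily fabricated, but it should not be relied on until someone confirms it against the actual record.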

Round 3: The Automated Valuation Model (AVM)

Unlike a chatbot, a consumer AVM — the kind of "instant home value estimate" tool built into major listing portals — does have access to real property records, tax assessment data, and historical sales data for the actual address. This is a meaningful step up.
The AVM produced a value range for the subject property that was closer to the human CMA than either chatbot, which makes sense — it's working from real data rather than general knowledge.
However, the AVM's estimate still missed in a specific and predictable way: it could not account for the kitchen renovation. AVMs are built primarily from public records (square footage, bed/bath count, lot size, year built, assessed value, and sale history) and from broad statistical models of how prices move in an area. They generally cannot see inside a home. A kitchen that was gutted and rebuilt five years ago with new cabinets, countertops, and appliances looks, on paper, identical to a kitchen that hasn't been touched since the house was built — same square footage, same room count, same year-built date.
In our test, the AVM's estimate landed below the human CMA by an amount that, when discussed with a local agent familiar with the property, lined up almost exactly with the value typically added by a quality kitchen renovation in that price range. The AVM wasn't wrong given what it had access to — it was simply blind to a major value driver that no public database tracks.
The AVM also showed a wide "confidence range" — often 10% or more on either side of its point estimate. To its credit, this honesty about uncertainty is more useful than a chatbot's confident-sounding single number. But a 20%-wide range on a property in this price bracket still represents a difference of well over $100,000 between the low and high end — not precise enough to set a listing price or structure an offer.
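To see why a range like that is too wide to act on, it helps to put dollar figures on it. The Python sketch below computes the low end, high end, and total width of a range that extends a given percentage on either side of a point estimate. The $550,000 point estimate is a hypothetical figure chosen for illustration, not the subject property's actual value.

```python
def avm_range(point_estimate: int, pct_each_side: int = 10) -> tuple[int, int, int]:
    """Return (low, high, width) for a range of +/- pct_each_side percent."""
    half_width = point_estimate * pct_each_side // 100
    low = point_estimate - half_width
    high = point_estimate + half_width
    return low, high, high - low


# Hypothetical point estimate, for illustration only.
low, high, width = avm_range(550_000)
print(f"${low:,} to ${high:,} (width ${width:,})")  # $495,000 to $605,000 (width $110,000)
```

At 10% on either side, any point estimate above $500,000 yields a range more than $100,000 wide.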

Round 4: The Human CMA

Now for the comparison point: a CMA built the traditional way, using a human real estate professional's access to live MLS data and local market knowledge.
The process looked like this:
Step 1: Pull active, pending, and sold comparables. Rather than relying on a wide geographic radius, the agent pulled comparables from the specific micro-market — the same subdivision or immediately adjacent areas with similar lot characteristics (acreage, treed vs. open, walkout potential), narrowing to sales within the last 3-6 months where possible, with a few slightly older sales included for trend context.
Step 2: Separate above-grade and below-grade square footage for every comparable. This is the step that AI tools consistently skipped or flattened. The agent calculated a price-per-square-foot figure for above-grade living space separately from below-grade finished space, based on how those specific comparables actually sold — not a generic national or regional assumption.
Step 3: Apply the appraisal-style adjustment for below-grade space. Based on the quality of the walkout — natural light, ceiling height, finish level, egress windows — the agent applied a percentage-of-above-grade-value adjustment to the basement square footage. This is a methodology rooted in how appraisers actually treat below-grade space, and it varies meaningfully based on whether the basement is a true daylight walkout versus a partially subterranean space with small windows.
Step 4: Adjust for the kitchen renovation. Using comparables that had and hadn't been recently updated, the agent estimated a dollar value for the renovation based on what the market actually paid for that upgrade in similar homes — not a generic "renovations add X%" rule of thumb.
Step 5: Adjust for lot size, garage, and condition differences between the subject property and each comparable, using a standard adjustment grid.
Step 6: Weight the comparables based on recency, proximity, and similarity, and arrive at a final value range with a recommended list price (for a seller) or offer range (for a buyer).
The resulting CMA produced a value range that was notably narrower than the AVM's range and meaningfully different from both chatbot estimates — driven almost entirely by the below-grade square footage adjustment and the renovation adjustment, the two factors that every AI tool either got wrong or couldn't see at all.
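For readers who think in code, the adjustment-and-weighting logic in Steps 2 through 6 can be sketched in a few lines of Python. This is a simplified illustration, not the agent's actual worksheet: the adjustment rates are placeholder numbers rather than market data, the two comparables are invented, and lot-size and condition adjustments are left out for brevity. Only the subject's square footage, kitchen, and garage details come from the description earlier in this article.

```python
from dataclasses import dataclass


@dataclass
class Home:
    above_sqft: int
    below_sqft: int
    renovated_kitchen: bool
    garage_spaces: int


@dataclass
class Comp:
    home: Home
    sale_price: int
    weight: int  # judgment call on recency, proximity, and similarity


# Placeholder rates, NOT market data. A real CMA derives these from
# how comparables in the specific micro-market actually sold.
ABOVE_PPSF = 200        # dollars per above-grade square foot
BELOW_PPSF = 150        # dollars per finished below-grade square foot
KITCHEN_VALUE = 25_000  # value of a recent kitchen renovation
GARAGE_SPACE_VALUE = 10_000


def adjusted_price(comp: Comp, subject: Home) -> int:
    """Adjust a comp's sale price toward the subject: add for what the
    subject has that the comp lacks, subtract for the reverse."""
    h = comp.home
    adjustment = (
        (subject.above_sqft - h.above_sqft) * ABOVE_PPSF
        + (subject.below_sqft - h.below_sqft) * BELOW_PPSF
        + (int(subject.renovated_kitchen) - int(h.renovated_kitchen)) * KITCHEN_VALUE
        + (subject.garage_spaces - h.garage_spaces) * GARAGE_SPACE_VALUE
    )
    return comp.sale_price + adjustment


def weighted_value(comps: list[Comp], subject: Home) -> float:
    """Weighted average of the adjusted comp prices."""
    total_weight = sum(c.weight for c in comps)
    return sum(adjusted_price(c, subject) * c.weight for c in comps) / total_weight


subject = Home(above_sqft=2400, below_sqft=900, renovated_kitchen=True, garage_spaces=2)
comps = [  # invented comparables, for illustration only
    Comp(Home(2300, 900, False, 2), sale_price=600_000, weight=2),
    Comp(Home(2400, 800, True, 2), sale_price=640_000, weight=1),
]

for c in comps:
    print(f"Adjusted comp price: ${adjusted_price(c, subject):,}")
print(f"Weighted indication of value: ${weighted_value(comps, subject):,.0f}")
```

The design choice that matters is keeping `ABOVE_PPSF` and `BELOW_PPSF` as separate rates. Collapsing them into one figure reproduces exactly the flattening error the chatbots made.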

Side-by-Side: Where Each Tool Landed

To summarize the experiment in plain terms:
- Chatbot #1 underestimated the property's value, primarily because it treated all square footage (above and below grade) as equal, applying a single price-per-square-foot figure across the entire finished area.
- Chatbot #2 produced a higher estimate than Chatbot #1 using identical inputs, and supported its number with comparable sales that could not be verified as real transactions.
- The AVM landed closer to the human CMA than either chatbot but underestimated value because it had no visibility into the kitchen renovation — a classic blind spot for any tool relying purely on public records.
- The human CMA produced the narrowest, most defensible range, built on verified live MLS data, a methodology-driven adjustment for below-grade space, and a market-based adjustment for the renovation.
The spread between the lowest AI estimate and the human CMA's midpoint was substantial enough to materially affect a listing price decision, a buyer's opening offer, or a refinance conversation — easily in the range of 10-15% of the property's value.

Where AI Genuinely Helps

It would be a mistake to walk away from this experiment thinking AI has no place in the valuation process. It has a real place — just not the place most people assume.
1. Summarizing and organizing data you already have. If you feed an AI tool a list of actual, verified comparable sales — pulled from the MLS by a licensed agent — it can be genuinely useful for organizing that data, spotting patterns, calculating averages, and drafting narrative summaries explaining why certain comps were chosen. The AI isn't generating the data; it's helping process data that's already been verified.
2. Drafting the narrative portions of a CMA report. A lot of a CMA isn't just numbers — it's the written explanation of market conditions, neighborhood trends, and the reasoning behind the value range. AI tools are quite good at turning a set of bullet points and figures into clear, readable prose for a client-facing report, dramatically speeding up the time it takes to produce a polished document.
3. Educating consumers on the process. If someone wants to understand what a CMA is, what adjustments typically get made, or what questions to ask their agent about a valuation, AI tools can explain the concepts well — as this article itself demonstrates. The failure mode isn't explaining the methodology; it's applying the methodology to a specific property without real data.
4. Identifying questions to ask. A consumer who runs their property description through an AI tool and gets back a value range can use that as a prompt to ask their agent better questions: "Why might my below-grade square footage be valued differently? How does my kitchen renovation factor in? What comps are you using and why?" Used this way, AI becomes a tool for better conversations rather than a replacement for them.
5. Spotting outliers and red flags. AI can be useful for sanity-checking — if a human-prepared CMA comes back wildly different from what an AVM or chatbot suggests, that gap is worth investigating. Sometimes the human CMA is right and the AI is wrong (as in this experiment). But occasionally a large gap signals that the human analysis missed something too — a recent comparable sale that hasn't been fully processed, for instance. The AI's role here is as a second opinion that prompts a "why is this different?" conversation, not as a tiebreaker.
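The sanity check in item 5 comes down to measuring the gap between two numbers and deciding when it is large enough to ask questions. Here is a minimal Python sketch of that idea. The 10% threshold and the dollar figures are assumptions chosen for illustration, and the function names are ours.

```python
def valuation_gap_pct(cma_midpoint: float, other_estimate: float) -> float:
    """Signed gap of another estimate relative to the CMA midpoint, in percent."""
    return (other_estimate - cma_midpoint) / cma_midpoint * 100


def needs_review(cma_midpoint: float, other_estimate: float, threshold_pct: float = 10.0) -> bool:
    """True when the gap is large enough to prompt a 'why is this different?' conversation."""
    return abs(valuation_gap_pct(cma_midpoint, other_estimate)) >= threshold_pct


# Hypothetical figures, for illustration only.
cma_midpoint = 600_000
for label, estimate in [("AVM", 570_000), ("Chatbot", 520_000)]:
    gap = valuation_gap_pct(cma_midpoint, estimate)
    flag = "review" if needs_review(cma_midpoint, estimate) else "ok"
    print(f"{label}: {gap:+.1f}% vs CMA midpoint ({flag})")
```

A flagged gap says nothing about which number is right. It only marks the estimates worth discussing with your agent.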

Where AI Dangerously Misses

The flip side deserves equal weight, because the failure modes here aren't cosmetic — they're the kind of errors that cost real money.
1. No access to live, verified MLS data. This is the foundational problem. General-purpose AI chatbots are working from training data that may be months or years old, and even AVMs with access to public records often lag behind the most recent sales. A property market can shift meaningfully in a matter of months, and a valuation built on stale data will be stale, no matter how confidently it's presented.
2. Fabricated or unverifiable comparables. As demonstrated in this experiment, when pushed for specifics, AI chatbots can generate comparable sales that sound real — specific-seeming addresses, prices, and dates — without those transactions actually existing. This is perhaps the single most dangerous failure mode, because it's the hardest for a non-expert to detect. A list of "comps" looks authoritative regardless of whether it's real.
3. Flattening above-grade and below-grade square footage. This was the largest single source of error in our test, and it's not unique to this property — it's a structural blind spot. Any property with a meaningful amount of finished basement space, a converted attic, or other non-standard living area is at risk of being significantly mis-valued by tools that simply add up "total square footage" and apply one price-per-square-foot figure.
4. No visibility into renovations, condition, or finish quality. AVMs work from public records, which don't track whether a kitchen was renovated last year or never.
