
12 Questions HR Leaders Should Ask AI Vendors
HR leaders are trying to evaluate AI tools without clear guidance on what good vendor answers actually look like. This article breaks down the questions that matter and how to interpret the responses.

Written by
Stacey Nordwall, People and Product
As HR leaders began evaluating AI tools, many were figuring it out as they went. They asked their networks what to look for, what to worry about, and what questions made sense to ask vendors. I kept seeing the same questions come up, but with little clarity on what good answers should actually sound like. So I chatted through them with Jon Williams, Pyn’s co-founder and lead engineer, who has spent the last 15 years building HR tech products. The goal isn’t to turn HR into AI experts, but to offer clearer signals: where vendor answers reflect real maturity, where they should raise concern, and where it’s reasonable to push for more detail.
1. What data was used to train this model — and what data was explicitly excluded?
Why HR should ask this: This is one of the most common questions HR leaders are advised to ask, and the instinct behind it is sound: training data shapes outcomes. If biased or inappropriate data is baked in, HR will be left explaining and defending decisions it didn't design.
But here's the nuance: Most AI tools in the HR space are built on top of base AI models (the underlying systems from companies like OpenAI, Anthropic, or Google), and those models are all trained on broadly similar data: massive amounts of messy internet text. Asking a vendor what data trained the base model usually won't differentiate good vendors from bad ones.
The better question to ask: "What was done to this model after its initial training to make it appropriate for HR decisions, and what data or principles were used in that process?"
What a good answer sounds like: The vendor can explain what work was done after the base model was created: additional training focused on HR use cases, safety testing, or rules that filter out inappropriate outputs. They can name what data was used in that phase, what was explicitly excluded (e.g., scraped résumés, inferred protected characteristics), and who made those decisions.
Jon says: Base AI models are all trained on internet-scale data, which includes a lot of junk. That's true across the industry. What actually matters is the post-training work: additional training, safety testing, and the guardrails layered on top. That's where vendors differentiate, and that's where you should focus your diligence. If a vendor can't clearly explain what they did after the base model was built, they're either reselling a generic model with minimal adaptation or they haven't done the work to make it safe for HR contexts.
2. Was this model trained to avoid unfair or discriminatory outcomes? If so, how?
What a good answer sounds like: The vendor can explain, in plain language, whether the model was trained with principles like fairness, harm minimization, or uncertainty handling, and how those principles were put into practice. If not, they can clearly explain what prevents the model from defaulting to historical inequities.
Why HR should ask this: AI systems are not neutral. HR needs to know whose values are shaping recommendations that affect people's jobs, pay, and careers.
Jon says: Technically, what you're asking about is whether the vendor used specific techniques for building fairness into the model. They should be able to name those techniques, not just say "we trained it to be fair." This matters all the more because fairness is inherently subjective. Also ask: were these principles built in during training, or just added at the end as a filter on outputs? Both are useful, but filters are easier to bypass than deeply trained behaviors.
3. What kinds of outputs is the model designed to refuse? How were those boundaries defined?
What a good answer sounds like: A mature answer includes concrete examples of refusal (e.g., inferring protected characteristics, making unsupported judgments about individuals) and explains who set those boundaries and why. Refusal is treated as a safety feature, not a weakness.
Why HR should ask this: Some questions should not be answered by a system. HR needs guardrails that prevent illegal, unethical, or indefensible outputs before they reach people.
Jon says: This is a great question to separate serious vendors from those who haven't thought deeply about safety. From an engineering standpoint, refusal behaviors should be documented and testable. Ask if they have a documented list of what the system will and won't do, and whether they've stress-tested the system by having people actively try to break it. Also look for a vendor that works to strike a balance: over-refusal can make a tool useless, while under-refusal exposes you to risk.
4. How does the model behave when it encounters ambiguity, incomplete data, or conflicting signals about a person?
What a good answer sounds like: The vendor can explain whether the model flags uncertainty, asks for more input, reduces confidence, or declines to make a recommendation. Strong answers acknowledge that people data is often messy. Critically, the vendor should confirm that the system does not default to producing a recommendation anyway when data is insufficient. A system that always gives an answer, regardless of data quality, is shifting risk onto HR without signaling it.
Why HR should ask this: Guessing in the face of ambiguity is how bias and unfairness creep in, and HR is accountable for the consequences.
Jon says: This question gets at something engineers call "calibration" — meaning, does the system know what it doesn't know? A well-designed system should output confidence levels or explicitly flag low-certainty situations. Ask whether the system distinguishes between "I have high confidence" and "I'm making a best guess." If every output looks equally authoritative, that’s an automatic red flag.
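To make "calibration" concrete, here is a minimal, purely illustrative sketch of the behavior Jon describes: a system whose output carries an explicit confidence level and which declines to recommend when data coverage is too thin. The field names, thresholds, and scoring logic are invented for illustration, not drawn from any real vendor's product.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Recommendation:
    answer: Optional[str]   # None means the system declined to recommend
    confidence: float       # 0.0 to 1.0
    reason: str             # human-readable explanation of the confidence level

def recommend(score: float, fields_present: int, fields_expected: int) -> Recommendation:
    """Return a recommendation that signals its own certainty.

    Hypothetical thresholds: below 60% data coverage we decline rather
    than guess; below 0.5 confidence we flag the output for human review.
    """
    coverage = fields_present / fields_expected
    if coverage < 0.6:
        # Insufficient data: refuse to answer instead of producing a guess.
        return Recommendation(None, 0.0,
                              f"only {coverage:.0%} of expected data present")
    confidence = score * coverage
    if confidence < 0.5:
        return Recommendation("needs human review", confidence,
                              "low-certainty output flagged")
    return Recommendation("proceed", confidence,
                          "high-confidence recommendation")

print(recommend(0.9, 3, 10))   # declines: too little data
print(recommend(0.9, 9, 10))   # answers, with an explicit confidence score
```

The point is not the specific thresholds but the shape of the output: every answer is paired with a confidence signal, and "no answer" is a valid result. A system whose every output looks equally authoritative cannot behave this way.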
5. Does the model's behavior change over time?
What a good answer sounds like: A clear explanation of whether the model stays the same or evolves, how updates and feedback are incorporated, and what controls exist to prevent unmonitored drift. Strong answers distinguish intentional improvement from uncontrolled change.
Why HR should ask this: A system that evolves without visibility can quietly invalidate earlier risk reviews, bias audits, and approvals HR already signed off on.
Jon says: This is where I'd dig in hard. There are three types of change: (1) the vendor pushes a new model version, (2) the model adapts based on your organization's feedback or data, or (3) the model quietly drifts in ways no one intended. You need to know which of these is happening, and when. In the coding world we have tools that actively monitor this, such as https://marginlab.ai/trackers/claude-code/, but this sort of tracking is incomplete in other domains and much harder for open-loop, human-oriented processes.
It’s not uncommon for vendors to tweak parameters or A/B test model behavior, which can lead to significant variation when executing similar (or identical) tasks.
Related: If they use your data to improve the model, does that improvement flow back to other customers? That has confidentiality implications.
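One common pattern behind the monitoring Jon describes is replaying a frozen evaluation set against the current model and diffing the results against a stored baseline. The sketch below assumes a placeholder `call_model` function standing in for whatever API the vendor exposes; the prompts and responses are invented for illustration.

```python
def call_model(prompt: str) -> str:
    # Placeholder: a real check would call the deployed system's API.
    return "stable answer"

# Baseline responses captured when the system was reviewed and approved.
BASELINE = {
    "Summarize this performance review": "stable answer",
    "Flag policy risks in this message": "stable answer",
}

def drift_report(baseline: dict) -> dict:
    """Re-run each evaluation prompt and collect any changed responses."""
    report = {}
    for prompt, expected in baseline.items():
        actual = call_model(prompt)
        if actual != expected:
            report[prompt] = (expected, actual)
    return report

report = drift_report(BASELINE)
if report:
    print(f"{len(report)} evaluation prompts changed; re-review before trusting outputs")
else:
    print("no drift detected on the evaluation set")
```

Real monitoring is fuzzier than exact string matching (responses vary run to run, so vendors typically score outputs rather than diff them), but the principle is the same: a fixed test set, a stored baseline, and an alert when behavior moves.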
6. How do you monitor for unintended behaviors after deployment? How would we, as a customer, know if something changed?
What a good answer sounds like: The vendor describes active, ongoing monitoring on their end including what signals they track, how often reviews happen, and who responds to issues. Equally important, they explain what visibility customers get: release notes, alerts, version histories, audit reports, or dashboards. A strong answer treats monitoring as a shared responsibility and gives HR the information it needs to conduct its own reviews.
Why HR should ask this: Because some risks only emerge after sustained use, and because systems that affect employees need regular review. HR can only do that review if the vendor is watching for problems and keeping customers informed when something changes.
Jon says: In this case, you want to know the frequency and detail of what they actually monitor. The continuous calibration of these models is a lot of their value, so you should see very tight feedback loops.
7. If we believe the model is producing inappropriate, biased, or drifting outputs, how can we report that? What will happen next?
What a good answer sounds like: A defined reporting and escalation process, clear expectations for response time and remediation, and transparency about how issues are investigated and resolved.
Why HR should ask this: Because when something goes wrong, HR needs to know what their options are. Can you pause or disable the affected feature while the issue is being investigated? Will the vendor roll back to a previous version? How long might you be exposed to a known problem and what should you tell employees in the meantime? This question isn't just about reporting; it's about understanding what recourse you actually have.
Jon says: I think this is less a technical issue than a relationship one. Do they have an easy way to report and track issues, with clear escalation paths and transparency?
8. Can you explain how the model arrived at a specific output affecting an individual employee or candidate?
Why HR should ask this: If HR can't explain a decision, it can't defend it, so this question feels essential.
But here's the nuance: For complex AI systems, especially those built on large language models (which many are), full transparency is often technically impossible. The models don't reason the way humans do, and their internal workings can't always be translated into a simple narrative. Asking for perfect explainability sets up an expectation that vendors will either overpromise to meet or dodge with "it's a black box" deflection.
The better question to ask: "What level of explanation can you provide for individual decisions — and is it enough for us to answer to employees and meet the documentation requirements in emerging AI legislation?"
What a good answer sounds like: The vendor is honest about the limits of explainability but can describe what is available: which inputs weighed heavily, what would have changed the output, confidence levels, or audit trails. Strong answers focus on reasonable justification rather than claiming perfect transparency or hiding behind complexity.
Jon says: Explainability is one of the hardest unsolved problems in AI, so calibrate your expectations. Most of the time you can’t really get a neat answer of "the model rejected this candidate because of X." What you can reasonably ask for: which inputs were influential, what the confidence level was, and whether the system can surface alternative interpretations. Be skeptical of vendors who claim full explainability. The real question is whether you'll have enough to construct a reasonable justification when someone asks "why did this happen?"
9. Where does this system expect a human to make the final decision? How does it support (rather than undermine) that human judgment?
What a good answer sounds like: The vendor can clearly identify which outputs are recommendations versus decisions, and where human review is expected or required. They can explain how the system presents information to support human judgment — such as showing confidence levels, flagging edge cases, or surfacing alternative options. Strong answers acknowledge that humans can be biased toward accepting AI recommendations, and describe how the system is designed to encourage genuine review rather than rubber-stamping.
Why HR should ask this: Because "human in the loop" only works if the human is genuinely empowered to override, question, or reject the system's output. If the system presents recommendations as foregone conclusions or makes it procedurally difficult to deviate, then the human review becomes theater, and HR is still accountable for outcomes it didn't truly control.
Jon says: This is one of the most important questions on the list, and it's often overlooked. It is also often more about how the AI is integrated and presented than about the core operation of the model.
From a design standpoint, ask how recommendations are presented: Does the system show one answer or a range of options? Does it display confidence levels? Does it make it easy to override or difficult?
People tend to defer to AI outputs, which means the model needs to understand and communicate its limits and confidence level (see Question 4).
There are design techniques that help. A well-designed system actively counteracts this deference by requiring justification for acceptance (not just rejection), randomizing the order of options, or flagging when the model's confidence is low.
10. What documentation or artifacts are available to support internal review, employee questions, or legal scrutiny?
What a good answer sounds like: The vendor can name concrete materials HR could rely on — such as audit logs, rationale summaries, version history, or decision traces — and explain how they're accessed. They may also provide standard AI documentation like model cards, data sheets, or impact assessments that describe how the system works and what risks were considered.
Why HR should ask this: Because employers are increasingly expected to document and justify how decisions are made, especially when automated systems are involved.
Jon says: This is a procurement and legal question as much as a technical one, but engineers should care too. Ask whether audit logs are protected from editing and how long they're retained. If you ever face a legal challenge, you'll need to reconstruct what the system knew and did at a specific point in time. If they can't produce this, you're taking on the documentation burden yourself.
HR knows the criticality of this documentation. The vendor may not.
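One way vendors protect audit logs from editing, as Jon suggests asking about, is a hash chain: each entry includes a hash of the previous one, so any retroactive edit breaks the chain and is detectable. This is an illustrative sketch of the technique, not any particular vendor's implementation; the event fields are invented.

```python
import hashlib
import json

def append_entry(log: list, event: dict) -> None:
    """Append an event whose hash covers both the event and the prior entry."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    body = json.dumps({"event": event, "prev": prev_hash}, sort_keys=True)
    log.append({"event": event, "prev": prev_hash,
                "hash": hashlib.sha256(body.encode()).hexdigest()})

def verify(log: list) -> bool:
    """Recompute every hash; any edit to an earlier entry breaks the chain."""
    prev_hash = "0" * 64
    for entry in log:
        body = json.dumps({"event": entry["event"], "prev": prev_hash},
                          sort_keys=True)
        if entry["prev"] != prev_hash or \
           entry["hash"] != hashlib.sha256(body.encode()).hexdigest():
            return False
        prev_hash = entry["hash"]
    return True

log = []
append_entry(log, {"decision": "screened in", "model_version": "v1.2"})
append_entry(log, {"decision": "flagged for review", "model_version": "v1.2"})
print(verify(log))                              # chain intact: True
log[0]["event"]["decision"] = "screened out"    # retroactive edit
print(verify(log))                              # tampering detected: False
```

You don't need to read the code to use the idea: ask the vendor whether their logs are append-only and tamper-evident in this spirit, and how they would prove an entry wasn't altered after the fact.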
11. What happens to employee or candidate data after it's processed? How can we ensure it's deleted when appropriate?
What a good answer sounds like: The vendor can clearly explain their data retention policy: what data is stored, where, for how long, and under what conditions. They can describe how deletion requests are handled, how quickly they're executed, and whether deletion is complete or if residual data remains in backups or training pipelines. Strong answers also clarify whether customer data is ever used to improve the model for other customers — and crucially, whether you can opt out of that entirely. If opting out isn't an option, that's worth knowing before you sign.
Why HR should ask this: Because employee and candidate data doesn't belong to the vendor — it belongs to the people it describes, governed by your policies and applicable law. HR needs to know that data isn't retained longer than necessary, used for purposes beyond what was disclosed, or impossible to remove when someone exercises their rights under privacy laws like GDPR or CCPA, or under internal policy.
Jon says: This is a question about how data is handled from start to finish, and it's more nuanced than it sounds. Ask specifically: Is data stored permanently or only used in the moment and then discarded? If stored, is it encrypted and access-controlled? Does the vendor use customer data to retrain or improve models, and is that opt-in or opt-out?
If you can't opt out, ask what happens to your data if it gets used for training: it may become impossible to remove. We also see cases where that data is “accessible” via the right prompts. For example, early versions of coding assistants were shown to have been trained on private codebases and then, through prompting, revealed their contents. See:
GitHub Copilot and secret leakage
Research on training data extraction
12. What risks or harms are you not confident this system can prevent?
What a good answer sounds like: The strongest answers acknowledge limits. The vendor can name residual risks, edge cases, or tradeoffs and explain how customers should account for them in deployment.
Why HR should ask this: No system is risk-free. HR needs to understand the risks it's being asked to manage.
Jon says: This is my favorite question on the list because it tests for intellectual honesty. Every system has ways it can fail. The question is whether the vendor knows them and can articulate them. A good engineer will tell you: "Here's where the model struggles. Here's the population where accuracy drops. Here's the edge case we haven't solved." If a vendor says their system has no meaningful risks, either they haven't looked hard enough or they're not being candid.
At Pyn, we believe HR leaders should have clear visibility into how technology shapes employee experiences — especially when AI is involved. That’s why we design our platform with transparency, control, and human oversight at the center, and why any AI features in Pyn are opt-in by design. HR teams should be able to decide what’s used, when it’s used, and how it impacts their people. If you’re curious how this approach shows up in practice, you can explore Pyn here.

Stacey loves to hike and read. Her goal is to create inclusive workplaces. Before Pyn, she was an early member of Culture Amp’s people team.