I taught my wife how to use RediSearch 2.0

Have you tried to find your health care provider in the insurance company network? There are no naming standards, and each insurance company collects data in a different format, as I learned from my wife.

Mikhail Volkov
7 min read · Mar 17, 2021

She told me about her recent use case of matching health care provider names in multiple datasets, and I decided to teach her how to use RediSearch.

This article discusses the National Provider Identifier (NPI) standard and shows how to use RediSearch 2.0 to find the NPIs of care providers connected to the CommonWell Health Alliance.

RediSearch 2.0

RediSearch, a real-time secondary index with full-text search capabilities for Redis, is one of the most mature and feature-rich Redis modules. RediSearch 2.0’s new architecture improves the developer experience of creating indices for existing data within Redis seamlessly and removes the need to migrate your Redis data to another RediSearch-enabled database.

This new architecture enables RediSearch to follow and auto-index other data structures, such as Streams or Strings, in future releases. For more information about RediSearch 2.0, please take a look at Getting Started with RediSearch 2.0.

National Provider Identifier Standard

NPI is a Health Insurance Portability and Accountability Act (HIPAA) Administrative Simplification Standard. The NPI is a unique identification number for covered health care providers.

You can download NPI data from the Centers for Medicare & Medicaid Services (CMS) website free of charge. This file is updated monthly and contains 6.5M records nationwide.

How to index NPI data with RediSearch

With RediSearch 2.0, indexing has become effortless compared to the previous version:

  • Create hashes npi:XXXXXXXX with properties for the NPI, provider’s name, other known names, and State:
127.0.0.1:6379> HGETALL npi:1013033943
1) "npi"
2) "1013033943"
3) "provider"
4) "WILSON RON D. D.M.D."
5) "other"
6) ""
7) "state"
8) "GA"
  • Create the RediSearch index named idx:npi for hash keys that start with npi:
FT.CREATE idx:npi ON hash PREFIX 1 "npi:" SCHEMA provider TEXT SORTABLE other TEXT SORTABLE state TEXT SORTABLE
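The same index can also be created from a Node.js script instead of the Redis CLI. The sketch below is a minimal example under a few assumptions: buildCreateIndexArgs is a hypothetical helper (it is not part of ioredis or RediSearch) that assembles the FT.CREATE arguments shown above, and the commented usage assumes an ioredis client named redis, whose generic call method sends commands that have no dedicated wrapper.

```javascript
/**
 * Hypothetical helper: builds the argument list for FT.CREATE.
 * Every field is declared as TEXT SORTABLE, matching the command above.
 */
function buildCreateIndexArgs(index, prefix, fields) {
  const schema = fields.flatMap((field) => [field, "TEXT", "SORTABLE"]);
  return [index, "ON", "HASH", "PREFIX", "1", prefix, "SCHEMA", ...schema];
}

const args = buildCreateIndexArgs("idx:npi", "npi:", [
  "provider",
  "other",
  "state",
]);

// Print the command that would be sent to Redis.
console.log(["FT.CREATE", ...args].join(" "));

/**
 * Usage with an ioredis client (assumed to be named `redis`):
 *
 *   await redis.call("FT.CREATE", ...args);
 *
 * FT.CREATE replies with an error if the index already exists.
 */
```

Because RediSearch 2.0 follows hashes by prefix, the index can be created before or after the data is loaded.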

The NPI data file is ~7.8 GB, so it should be processed line by line using streams. To push data to Redis, I used ioredis, a robust, performance-focused, and full-featured Redis client for Node.js.

/**
 * Readline and File System
 */
const { createReadStream } = require("fs");
const { createInterface } = require("readline");

/**
 * A robust, performance-focused and full-featured Redis client for Node.js.
 *
 * @see https://github.com/luin/ioredis
 */
const Redis = require("ioredis");

/**
 * You can also specify connection options as a redis:// URL or rediss:// URL when using TLS encryption:
 */
const redis = new Redis("redis://localhost:6379", {
  enableAutoPipelining: true,
});

/**
 * File with NPI data
 */
const filename = "npidata.csv";

/**
 * Loader
 */
async function loader() {
  /**
   * Stream reader
   */
  const rl = createInterface({
    input: createReadStream(filename),
    crlfDelay: Infinity,
  });

  /**
   * Process line by line
   */
  for await (const line of rl) {
    const npi = line.replace(/^"/g, "").replace(/"$/g, "").split('","');

    /**
     * Verify that line contains npi record in the correct format
     */
    if (!npi.length || !npi[0] || !npi[1]) {
      continue;
    }

    /**
     * Add hash based on fields in CSV file
     * Provider's name, if 1 then concatenate last, first, middle name and credential - fields 5-7,10
     * if 2, then organization business name - field 4
     * HSET accepts multiple field/value pairs (Redis 4.0+); HMSET is deprecated
     */
    await redis.hset(
      `npi:${npi[0]}`,
      "npi",
      npi[0],
      "provider",
      npi[1] == 1 ? [npi[5], npi[6], npi[7], npi[10]].join(" ") : npi[4],
      "other",
      npi[11],
      "state",
      npi[23]
    );
  }

  /**
   * Close connection when file processed.
   */
  await redis.quit();
}

/**
 * Parse file and push to Redis
 */
loader();
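One caveat with the loader above: each write is awaited before the next line is read, so only one command is in flight at a time and, as far as I understand ioredis, enableAutoPipelining has little to batch. A possible variation, sketched below under that assumption, starts a batch of writes and awaits them together. writeInBatches and its write callback are hypothetical names; with ioredis, write would return the promise from the Redis command.

```javascript
/**
 * Hypothetical helper: reads items from any (async) iterable, calls
 * `write(item)` for each one, and awaits the returned promises in batches.
 * Returns the number of items passed to `write`.
 */
async function writeInBatches(items, write, batchSize = 1000) {
  let pending = [];
  let count = 0;

  for await (const item of items) {
    pending.push(write(item));
    count += 1;

    // Wait for the whole batch before reading more lines. This also keeps
    // memory bounded while streaming a large file.
    if (pending.length >= batchSize) {
      await Promise.all(pending);
      pending = [];
    }
  }

  // Flush the last, possibly partial, batch.
  await Promise.all(pending);
  return count;
}

/**
 * Usage inside loader(), assuming `rl` and `redis` from the script above
 * (only two of the hash fields are shown to keep the example short):
 *
 *   await writeInBatches(rl, (line) => {
 *     const npi = line.replace(/^"/g, "").replace(/"$/g, "").split('","');
 *     return npi[0] && npi[1]
 *       ? redis.hset(`npi:${npi[0]}`, "npi", npi[0], "state", npi[23])
 *       : Promise.resolve();
 *   });
 */
```

The batch size of 1000 is an arbitrary starting point, not a measured optimum.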

Observing the data loading progress

Since we introduced the Redis Application plugin for Grafana, I always use it with my Redis databases.

Watching a predefined streaming dashboard, I can observe my Redis database performance and metrics:

  • Ops/sec and number of clients
  • System and used memory
  • Network Input/Output
  • Number of keys

Having the Redis CLI panel handy, I can check keys and look for more information about the Redis database using the command-line interface:

Redis Data Source dashboard with Redis CLI panel.

I am running a redismod image in a Docker container on my MacBook Pro to load and search the data. If you are looking for better performance, consider Redis Enterprise software or a Cloud subscription with the RediSearch module enabled.

The dashboard is very informative, but how can you observe RediSearch index statistics?

The latest version of Redis Data Source includes a custom interface for the FT.INFO command, which returns information and statistics on the index. Returned values include:

  • Number of documents and distinct terms
  • Indexing state and percentage as well as failures
  • Size and capacity of the index buffers
Streaming Redis Search statistics.
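
Outside Grafana, the same statistics can be read from a script. In RediSearch 2.0, FT.INFO replies with a flat list of alternating names and values, so a small helper can turn it into an object. This is only a sketch: parseInfo is a hypothetical helper, the sample reply is made up for illustration, and the field names (num_docs, num_terms, indexing, percent_indexed, hash_indexing_failures) are the ones I expect from RediSearch 2.0 and should be checked against your server's output.

```javascript
/**
 * Hypothetical helper: converts the flat FT.INFO reply
 *   [name1, value1, name2, value2, ...]
 * into a plain object keyed by name.
 */
function parseInfo(reply) {
  const info = {};
  for (let i = 0; i + 1 < reply.length; i += 2) {
    info[reply[i]] = reply[i + 1];
  }
  return info;
}

// Illustrative values only. A real reply would come from an ioredis client:
//   const reply = await redis.call("FT.INFO", "idx:npi");
const sample = [
  "index_name", "idx:npi",
  "num_docs", "1000",
  "num_terms", "2500",
  "indexing", "1",
  "percent_indexed", "0.5",
  "hash_indexing_failures", "0",
];

const info = parseInfo(sample);
console.log(`${info.num_docs} documents, ${info.num_terms} terms`);
console.log(`indexed: ${(Number(info.percent_indexed) * 100).toFixed(1)}%`);
console.log(`failures: ${info.hash_indexing_failures}`);
```

Polling this during the load gives a rough progress indicator without opening the dashboard.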

Data is indexed. Let’s search

As I mentioned, my wife was looking at matching the care providers' database of CommonWell Health Alliance with other data using NPI. The Alliance’s database can be downloaded for free, for all states or for specific ones, and contains 15,462 providers nationwide.

We will take each provider’s name and search for its NPI in the indexed nationwide data using the RediSearch FT.SEARCH command:

/**
 * Readline and File System
 */
const {
  createReadStream,
  writeFileSync,
  mkdirSync,
  existsSync,
  rmSync,
} = require("fs");
const { createInterface } = require("readline");

/**
 * A robust, performance-focused and full-featured Redis client for Node.js.
 *
 * @see https://github.com/luin/ioredis
 */
const Redis = require("ioredis");

/**
 * You can also specify connection options as a redis:// URL or rediss:// URL when using TLS encryption:
 */
const redis = new Redis("redis://localhost:6379");

/**
 * File with name to search
 * Directory with results
 */
const filename = "search.csv";
const dir = "data";

/**
 * Search
 */
async function search() {
  const match = {};
  const names = {};

  /**
   * Stream reader
   */
  const rl = createInterface({
    input: createReadStream(filename),
    crlfDelay: Infinity,
  });

  /**
   * Process line by line
   */
  for await (const line of rl) {
    const provider = line.replace(/^"/g, "").replace(/"$/g, "").split('","');

    /**
     * Verify that line contains record in the correct format
     */
    if (!provider.length || !provider[0] || !provider[3]) {
      console.log(`Wrong format: ${provider}`);
      continue;
    }

    /**
     * Remove State in front
     * Replace comma, dot and dash
     * Remove single letters
     * Remove abbreviations
     */
    const name = provider[0]
      .replace(/[,.\-]/g, " ")
      .split(" ")
      .filter((x) => x.length > 1)
      .filter(
        (x) =>
          [
            "MD",
            "LLC",
            "PLLC",
            "FL",
            "INC",
            "HCA",
            "PA",
            "DO",
            "AND",
            "OF",
            "DR",
            "FOR",
            "PL",
            "DPM",
            "THE",
          ].indexOf(x.toUpperCase()) < 0
      )
      .join(" ");

    names[provider[0]] = name;

    try {
      /**
       * Search
       */
      const result = await redis.call(
        `FT.SEARCH`,
        "idx:npi",
        `@provider|other:${name} @state:${provider[3]}`
      );

      /**
       * Add to results, grouped by the number of found records
       */
      !match[result[0]]
        ? (match[result[0]] = [[provider[0], name, result]])
        : match[result[0]].push([provider[0], name, result]);
    } catch {
      console.log(`Error: ${provider[0]}`);
    }
  }

  /**
   * Print results
   * Save JSON files
   */
  Object.keys(match).forEach((i) => {
    console.log(`${i}, ${match[i]?.length}`);

    writeFileSync(
      `${dir}/result-${i}.txt`,
      JSON.stringify(match[i], null, 2),
      "utf-8"
    );
  });

  writeFileSync(`${dir}/names.txt`, JSON.stringify(names, null, 2), "utf-8");

  /**
   * Close connection when file processed.
   */
  await redis.quit();
}

/**
 * Data Folder clean-up
 * rmSync replaces the deprecated rmdirSync with the recursive option
 */
if (existsSync(dir)) {
  rmSync(dir, { recursive: true, force: true });
}
mkdirSync(dir);

/**
 * Search file
 */
search();
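
Two details of the script above are worth a closer look. Its comment mentions removing the state in front of the name, but the listing only drops "FL" through the abbreviation list, and the query interpolates a multi-word name directly after the field modifier. The sketch below shows one way to handle both, under stated assumptions: normalizeName and buildQuery are hypothetical helpers, and the parentheses reflect my understanding that RediSearch otherwise scopes a field modifier to the next term only. It is a variation, not the query used for the results in this article.

```javascript
// Abbreviations dropped from names, copied from the script above.
const STOP_WORDS = new Set([
  "MD", "LLC", "PLLC", "FL", "INC", "HCA", "PA", "DO",
  "AND", "OF", "DR", "FOR", "PL", "DPM", "THE",
]);

/**
 * Hypothetical helper: cleans a provider name the same way as the script
 * above and additionally drops a leading state code such as "MI - ".
 */
function normalizeName(raw, state) {
  const tokens = raw
    .replace(/[,.\-]/g, " ")
    .split(/\s+/)
    .filter((x) => x.length > 1)
    .filter((x) => !STOP_WORDS.has(x.toUpperCase()));

  if (tokens.length && tokens[0].toUpperCase() === state.toUpperCase()) {
    tokens.shift();
  }
  return tokens.join(" ");
}

/**
 * Hypothetical helper: builds the FT.SEARCH query string.
 * Parentheses group the name so that every word is matched against the
 * provider/other fields (assumption about field modifier scoping).
 * Returns null for an empty name, which would be a syntax error.
 */
function buildQuery(name, state) {
  if (!name) {
    return null;
  }
  return `@provider|other:(${name}) @state:${state}`;
}

const name = normalizeName("MI - David M Viviano MD", "MI");
console.log(buildQuery(name, "MI"));
// prints: @provider|other:(David Viviano) @state:MI
```

Names that still contain characters with special meaning in the query syntax would need escaping or stripping; the try/catch in the script above currently absorbs those cases.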

This script saves the results in the data folder as JSON, separated into files by the number of records found, for further processing and verification. One of the matching records:

[
  "MI - David M Viviano MD",
  "David Viviano",
  [
    1,
    "npi:1508816539",
    [
      "npi",
      "1508816539",
      "provider",
      "VIVIANO DAVID MATTHEW MD",
      "other",
      "",
      "state",
      "MI"
    ]
  ]
],
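
For further processing, the raw FT.SEARCH reply is easier to work with as objects. A minimal sketch, assuming the reply layout shown in the record above (total count, then alternating keys and flat field lists); parseSearchReply is a hypothetical helper, not an ioredis or RediSearch function.

```javascript
/**
 * Hypothetical helper: converts a raw FT.SEARCH reply
 *   [total, key1, [field, value, ...], key2, [field, value, ...], ...]
 * into { total, docs: [{ key, field: value, ... }] }.
 */
function parseSearchReply(reply) {
  const [total, ...rest] = reply;
  const docs = [];

  for (let i = 0; i + 1 < rest.length; i += 2) {
    const doc = { key: rest[i] };
    const pairs = rest[i + 1];
    for (let j = 0; j + 1 < pairs.length; j += 2) {
      doc[pairs[j]] = pairs[j + 1];
    }
    docs.push(doc);
  }

  return { total, docs };
}

// Reply taken from the matching record shown above.
const reply = [
  1,
  "npi:1508816539",
  ["npi", "1508816539", "provider", "VIVIANO DAVID MATTHEW MD", "other", "", "state", "MI"],
];

const { total, docs } = parseSearchReply(reply);

// Treat a provider as matched only when exactly one record was found.
if (total === 1) {
  console.log(`${docs[0].npi} ${docs[0].provider}`);
}
```

Note that total counts all matches, while FT.SEARCH returns only the first page of documents unless a LIMIT is given, so docs can be shorter than total.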

Found results

Within seconds on my MacBook Pro, I was able to match more than 55% of the 15,462 records against the 6.5M nationwide records, based on state and provider names. The chart below shows the results grouped by the number of records found:

Matching results using RediSearch.

The datasets I was working with contain multiple duplicates and missing NPIs. The nationwide NPI dataset also has an additional list of other names that could improve results, but I did not use it for this article.

I could spend more time analyzing the results to reduce the number of records not found, but overall my wife was thrilled with how easily and quickly we could get the results for her project.

Volkov Labs is an agency founded by long-time Grafana contributor Mikhail Volkov. We find elegant solutions for non-standard tasks.

Check out the latest plugins and projects at https://volkovlabs.io
