Pattern Recognition in Patient Data: Part 1
Published: May 18, 2026
Key Points in Brief
Billing in hospitals is an important but laborious and error-prone process, as clinical and administrative data must be taken into account in increasing quantities. It therefore makes sense to use machine learning - often equated with artificial intelligence - for pattern recognition in order to automatically provide coding specialists and medical controllers with time-saving and revenue-generating recommendations. In this blog article, we would like to explain the principles of this approach in order to make the advantages, but also the challenges, easier to understand.
When a patient is discharged after treatment in a hospital, a clinical coding specialist interprets the basic patient data (in particular diagnoses and therapies) using clinical data (e.g. findings, doctor's letter, surgical reports) from an economic and quality assurance perspective and describes these using codes from primarily two catalogs, the International Statistical Classification of Diseases and Related Health Problems (ICD) and the Operation and Procedure Codes (OPS).
For coding specialists, it is essential to know the structures within patient documentation and, among other things, to code doctors' diagnoses and therapies in a time-efficient and high-quality manner. This is because coding gaps or inconsistencies pose a revenue risk for the hospital. One approach to improving coding quality is to use probabilistic statistics or machine learning methods to automatically derive knowledge from disease data, apply this knowledge in daily work, and provide specialists with automated decision support.
Useful patterns in hospital billing
Healthcare professionals are interested in whether certain diseases typically indicate other diseases, i.e. whether there are patterns in patient data:
Patients for whom circumstances A, B, C apply often also have circumstance D.
This is the task of automatic classification, which can be solved in a "supervised" or "unsupervised" manner by learning patterns (also called models or classifiers).
Nowadays, deep learning, i.e. the use of deep neural networks, is often chosen for "supervised" classification, as it achieves very good results even with huge numbers of parameters, provided that sufficient pre-classified data and computing power are available.
One example of "unsupervised" classification is association rule learning. It offers an automated method for calculating statistical relationships and conclusions between all available data parameters. Such methods have already been used, for example, to automatically check coding consistency and coding quality (https://doi.org/10.1016/j.jbi.2018.02.001).
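To make the idea of a rule such as "patients with A often also have D" more tangible, here is a minimal sketch of the three measures commonly used in association rule learning: support, confidence and lift. The toy cases and the function names are our own assumptions for illustration; they are not taken from the cited study or from a production system.

```python
from typing import FrozenSet, List

# Hypothetical toy data: each case is the set of three-digit ICD codes billed for it.
cases: List[FrozenSet[str]] = [
    frozenset({"J18", "J91", "B95"}),
    frozenset({"J18", "J91", "E11"}),
    frozenset({"J18", "J91", "I46"}),
    frozenset({"E11", "E03"}),
    frozenset({"M86", "E03"}),
    frozenset({"J91", "E11"}),
]


def support(itemset: FrozenSet[str], cases: List[FrozenSet[str]]) -> float:
    """Fraction of cases that contain every code in `itemset`."""
    return sum(1 for case in cases if itemset <= case) / len(cases)


def confidence(antecedent: FrozenSet[str], consequent: FrozenSet[str],
               cases: List[FrozenSet[str]]) -> float:
    """Estimated share of cases with `antecedent` that also contain `consequent`."""
    s_antecedent = support(antecedent, cases)
    if s_antecedent == 0:
        return 0.0  # the rule never fires, so it carries no evidence
    return support(antecedent | consequent, cases) / s_antecedent


def lift(antecedent: FrozenSet[str], consequent: FrozenSet[str],
         cases: List[FrozenSet[str]]) -> float:
    """Confidence relative to the base rate of `consequent`; > 1 means positive association."""
    s_consequent = support(consequent, cases)
    if s_consequent == 0:
        return 0.0
    return confidence(antecedent, consequent, cases) / s_consequent


rule_confidence = confidence(frozenset({"J91"}), frozenset({"J18"}), cases)
rule_lift = lift(frozenset({"J91"}), frozenset({"J18"}), cases)
```

In this made-up data, three of the four cases with J91 also contain J18, so the rule "J91 implies J18" has a confidence of 0.75 and a lift of 1.5. Real rule miners search all frequent code combinations rather than a single hand-picked rule.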
In such methods, usable documentation and coding patterns can be identified. These are of interest to medical controllers if medically plausible dependencies between facts allow supplementary or corrective coding to be recommended and billed on the basis of existing coding or documentation. Equally helpful are (exclusion) rules that assign a low probability to an incorrectly coded code and recommend corrective removal of the coding. The explainability of a recommendation is particularly important here, without which a coding specialist can hardly be helped in a meaningful way, as they would otherwise have to spend a lot of time researching and documenting the reason for a recommendation.
Application of an explainable machine learning procedure to billing data
The basis for documentation and coding patterns are historical patient cases. These are automatically or manually sorted into classes. Let's take the ICD code J18 as an example.
J18.- Pneumonia, pathogen unspecified
We will consistently refer to this ICD code J18 as the target ICD code, as we understand it as the target of our prediction. To simplify matters, we limit ourselves to three-digit ICD codes, i.e. we truncate exact terminal codes such as E11.7 (Diabetes mellitus, type 2: With multiple complications) after the third digit and only consider the three-digit ICD category, in this example E11.- (Diabetes mellitus, type 2). Depending on whether a J18 pneumonia was billed for a patient case or not, we assign the patient case to one of two classes (see Table 1):
"J18 True Class" if J18 was actually billed as a disease in the patient case.
"J18 False Class" if J18 was not billed as a disease in the patient case.
After we have assigned the patient cases to the respective classes "True Class" and "False Class", we start with the machine learning algorithm. For this purpose, the available cases are divided into training data (80%) and test data (20%) and examined for patterns for the target ICD code. The resulting classifier is then used to create a recommendation for our target ICD code. This recommendation captures the predictive power with which the ICD codes seen in the training data predict the target ICD code as an additional code to be billed. In the same way, we also record those ICD codes that explicitly speak against billing the target ICD code.
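As a rough illustration of this step, the following sketch performs an 80/20 split and then learns one weight per ICD code. The classifier is a deliberately simple naive-Bayes-style log-odds scorer that only looks at the codes present in a case; it is our own stand-in and not the model used in the article. All names and the toy data are hypothetical.

```python
import math
import random
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

Case = Tuple[Set[str], bool]  # (three-digit ICD codes, True if the target was billed)


def split_cases(cases: List[Case], test_fraction: float = 0.2,
                seed: int = 0) -> Tuple[List[Case], List[Case]]:
    """Shuffle reproducibly and split into (training data, test data)."""
    shuffled = cases[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_log_odds(train: List[Case]) -> Tuple[float, Dict[str, float]]:
    """Learn a prior and one smoothed log-odds weight per code (positive = supports target)."""
    n_pos = sum(1 for _, label in train if label)
    n_neg = len(train) - n_pos
    pos_counts: Counter = Counter()
    neg_counts: Counter = Counter()
    for codes, label in train:
        (pos_counts if label else neg_counts).update(codes)
    prior = math.log((n_pos + 1) / (n_neg + 1))
    weights = {
        code: math.log((pos_counts[code] + 1) / (n_pos + 2))
        - math.log((neg_counts[code] + 1) / (n_neg + 2))
        for code in set(pos_counts) | set(neg_counts)
    }
    return prior, weights


def predict_proba(codes: Iterable[str], prior: float, weights: Dict[str, float]) -> float:
    """Probability-like score that the target code should be billed for this case."""
    score = prior + sum(weights.get(code, 0.0) for code in codes)
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    return math.exp(score) / (1.0 + math.exp(score))  # numerically stable branch


# Hypothetical toy data: J91/B95 accompany the target, M86/E03 do not.
toy_cases: List[Case] = [({"J91", "B95"}, True)] * 5 + [({"M86", "E03"}, False)] * 5
train, test = split_cases(toy_cases)
prior, weights = fit_log_odds(train)
```

Codes with a positive weight speak for billing the target ICD code, codes with a negative weight speak against it. In practice the held-out 20% would be used to measure how well these recommendations generalize.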
When applied to a patient case, such documentation and coding patterns then result in coding recommendations, as shown below for an example case with the codes E11, I79, A09, A41, R65, B37, N37, D62, E03, B95, etc. The learned pattern recommends additionally coding J18 with a probability of over 99%, and not coding J18 with a probability of less than 1%.
Fig. 2: Textual representation of the case, in which all terminal codes were abbreviated to three-digit codes. Positive (orange) and negative (blue) correlations for the billing prediction of the target ICD code J18 are highlighted; in the example, the system is >99% sure that a J18 code should be recommended.
In addition, the recommendation of the learned pattern is "explained": the darker the orange of an ICD code, the more strongly it contributes to the billing of the target ICD code J18 (e.g. J91, I46, B95, etc.), while ICD codes with a negative correlation to the billing of J18 (M86.-, M89.- and E03.-) appear in blue. The explainability of ICD codes in the prediction of our assumed target ICD code was calculated using the "LIME Explainer" (https://arxiv.org/abs/1602.04938).
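The intuition behind such per-code contributions can be sketched without the LIME library. The example below is not LIME: it uses a much simpler leave-one-out attribution, which removes one code at a time and records how far the predicted probability drops. LIME instead samples many perturbed versions of the case and fits a weighted linear model to them. The weights and the prediction function are made up for illustration.

```python
import math
from typing import Callable, Dict, Set


def leave_one_out_attribution(codes: Set[str],
                              predict: Callable[[Set[str]], float]) -> Dict[str, float]:
    """Per code: predicted probability with the code minus the probability without it."""
    baseline = predict(codes)
    return {code: baseline - predict(codes - {code}) for code in codes}


# Hypothetical stand-in for a trained classifier (weights are invented).
HYPOTHETICAL_WEIGHTS = {"J91": 2.0, "I46": 1.0, "M86": -1.5, "E03": -0.5}


def toy_predict(codes: Set[str]) -> float:
    """Logistic score over the invented weights; unknown codes contribute nothing."""
    score = sum(HYPOTHETICAL_WEIGHTS.get(code, 0.0) for code in codes)
    return 1.0 / (1.0 + math.exp(-score))


case = {"J91", "I46", "M86", "E11"}
attributions = leave_one_out_attribution(case, toy_predict)
# Strongest supporters of the target first, strongest opponents last.
ranked = sorted(attributions.items(), key=lambda item: item[1], reverse=True)
```

A positive value corresponds to an orange code in the highlighting (it pushes the prediction toward J18), a negative value to a blue one. In this toy case J91 ranks first and M86 last.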
The illustration of the explanation in Fig. 3 expresses the same facts in a different visualization:
The graph illustrates the extent to which individual ICD codes contribute to, or "count toward", the need to bill the target ICD code. The green bars on the right-hand side of the graph show the ICD codes that are most important in explaining the billing of our target ICD code. In this sense, J91.- (pleural effusion in diseases classified elsewhere) is to be understood as the most important ICD code for billing the target ICD code, as its internal probability in the explanation (0.006) is the highest among the green bars.
At the same time, the red bars of the explanation show us the three most important ICD codes, M86.-, M89.- and E03.-, which explicitly speak against billing pneumonia with the target ICD code. M86.- (osteomyelitis, an infectious disease of the skeletal system) has the greatest negative impact; coding specialists can read this as meaning that a J18 pneumonia should be billed only in rare cases once an M86.- has been diagnosed. Accordingly, the explanation provides coding specialists not only with potential positive rules describing which ICD codes support the billing of a J18 pneumonia, but also with negative rules identifying ICD codes that speak against coding our target ICD code.
Conclusion:
Machine learning offers genuine added value in the clinical coding process: it recognizes patterns in historical patient data and provides coding specialists with precise, traceable recommendations, including justifications. Explainability is particularly crucial, as only those who understand why a code is being recommended can make fast and confident decisions. This allows coding gaps to be reduced, billing quality to be improved, and revenue risks for hospitals to be minimized.
Empolis