Revealing Bid Document Similarity Issues Using Python

To determine whether bidders for different sections of the same project have colluded or manipulated the bidding, auditors need to review the bidders’ technical documents for similarity. Traditionally, auditors compared these unstructured documents by hand, which consumed a great deal of time and manpower. In an audit of a government investment project, auditors used Python text analysis to compare the bidding technical documents and identify suspicious passages, with good results. The audit background and specific methods are as follows.

The project is a municipal road engineering project divided into 5 sections for separate implementation, with similar content in each section. For each section, the participating units (EPC general contractor, supervision, third-party inspection, and full-process cost consulting) were selected through public bidding. Auditors needed to perform text data analysis on the bidding documents of participating units of the same type across the 5 sections to check for identical or partially similar content.

Establish a bidding technical document data analysis database. The organized bidding technical documents are read into Python using the docx and os modules and converted into a uniform format. The jieba module is then used to perform word segmentation on the document content. For example, the text “equipped with fire-fighting equipment to prevent fire incidents” is segmented and counted as “‘equipped’: 1, ‘fire-fighting’: 1, ‘equipment’: 3, ‘prevent’: 6, ‘fire’: 1, ‘incident’: 3” and so on, where each number is the frequency of that segment in the entire document. The segmented document is written into a data dictionary to generate the database used for duplicate comparison.

Establish an EPC general contracting bidding technical document data analysis database. First, the docx module is called to read the locally stored bidding technical documents one by one in a loop, and the jieba module is called to segment each document. Single-character words (such as the function words “的” and “了”) are generally considered to have no value for comparison, so auditors count only segments longer than one character. Next, the segments of each bidding technical document are written into the data dictionary cuts together with their frequency in the entire document, which generates the data analysis database.
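The database-building step might look roughly like the sketch below. It is not the auditors’ original code: the function names read_docx_text, build_cuts and build_database, the one-folder-of-.docx-files layout, and the optional segment parameter are all assumptions made for illustration. The sketch reads only body paragraphs, so text inside tables is ignored.

```python
from collections import Counter
from pathlib import Path


def read_docx_text(path):
    """Return the body-paragraph text of one .docx file as a single string."""
    from docx import Document  # python-docx, imported only when a file is read
    return "\n".join(p.text for p in Document(str(path)).paragraphs)


def build_cuts(text, segment=None):
    """Map each segment longer than one character to its frequency in text.

    segment is a hypothetical hook: any callable that turns a string into a
    list of tokens. When omitted, jieba.lcut is used.
    """
    if segment is None:
        import jieba
        segment = jieba.lcut
    tokens = (tok.strip() for tok in segment(text))
    # Single-character tokens (e.g. "的", "了", punctuation) are skipped.
    return dict(Counter(tok for tok in tokens if len(tok) > 1))


def build_database(folder, segment=None):
    """Build {file name: cuts dictionary} for every .docx file in folder."""
    return {
        path.name: build_cuts(read_docx_text(path), segment)
        for path in sorted(Path(folder).glob("*.docx"))
    }
```

Passing a different callable as segment makes it possible to try the counting logic on short strings without installing jieba, which is the only reason the parameter exists in this sketch.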

Write code to calculate the duplication rate. When comparing files A and B, auditors first use a for loop to iterate through the keys (segments) in the data dictionary cuts of file B. If a segment from file B also appears in file A, the duplicated segment is written into a new data dictionary jg. A second for loop then iterates through the key-value pairs in jg and calculates the total number of duplicated characters by multiplying each segment’s length by its duplication frequency. The total number of characters in files A and B is calculated the same way, and the duplicated total is divided by the smaller of the two file totals to obtain the duplication rate. Finally, a nested for loop applies this calculation to every pair of bidding technical documents and outputs the data analysis results.
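A minimal sketch of this calculation is given below, assuming each document has already been reduced to a dictionary of segment frequencies. The article does not say how the duplication frequency of a shared segment is chosen, so the sketch assumes the smaller of the two documents’ counts, which keeps the rate between 0 and 1. The function names total_chars, duplication_rate and compare_all are hypothetical.

```python
def total_chars(cuts):
    """Total characters in a cuts dictionary: segment length times frequency."""
    return sum(len(seg) * count for seg, count in cuts.items())


def duplication_rate(cuts_a, cuts_b):
    """Share of the shorter document that is made up of segments found in both."""
    jg = {}
    for seg, count_b in cuts_b.items():
        if seg in cuts_a:
            # Assumption: a shared segment counts min(count in A, count in B) times.
            jg[seg] = min(cuts_a[seg], count_b)
    shorter = min(total_chars(cuts_a), total_chars(cuts_b))
    return total_chars(jg) / shorter if shorter else 0.0


def compare_all(database):
    """Nested loop over every unordered pair of documents in the database."""
    names = sorted(database)
    results = {}
    for i, name_a in enumerate(names):
        for name_b in names[i + 1:]:
            results[(name_a, name_b)] = duplication_rate(
                database[name_a], database[name_b]
            )
    return results


if __name__ == "__main__":
    # Toy dictionaries standing in for two segmented bid documents.
    demo = {
        "section1.docx": {"ab": 2, "cde": 1},
        "section2.docx": {"ab": 1, "xy": 3},
    }
    for pair, rate in compare_all(demo).items():
        print(pair, f"{rate:.2%}")
```

Dividing by the shorter document’s total means that a short document copied wholesale into a much longer one still scores close to 100%, which suits a collusion check better than dividing by the combined length.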

Through text data analysis of the bidding technical documents, auditors found that one bidding unit’s project management organization scheme had a duplication rate of 73.04% with that of another bidding unit, and that some documents from other bidding units also contained similar content. The approach allowed these issues to be identified and verified efficiently and accurately.

(Author: Lu Jing, Audit Bureau of Xiangyang City, Hubei Province)

(This article is reprinted from “China Audit”, Issue 5, 2023)
