Skip to content

scanner: refactor new_plain_scanner to accept file_path, file_base an…#27199

Open
whisky-jb wants to merge 1 commit into
vlang:masterfrom
whisky-jb:master
Open

scanner: refactor new_plain_scanner to accept file_path, file_base an…#27199
whisky-jb wants to merge 1 commit into
vlang:masterfrom
whisky-jb:master

Conversation

@whisky-jb

Copy link
Copy Markdown
Contributor

Refactor scanner, duplicated code consolidated.

@medvednikov

Copy link
Copy Markdown
Member

15 ci errors (master is green now)

@whisky-jb

whisky-jb commented May 21, 2026

Copy link
Copy Markdown
Contributor Author

15 ci errors (master is green now)

Tried to merge latest master, but still getting errors, unrelated to my change?

@medvednikov

Copy link
Copy Markdown
Member

1 sanitized job times out on master. Your PR has 13 failing.

@whisky-jb

Copy link
Copy Markdown
Contributor Author

@medvednikov

I'm unclear why this is not merged, can you provide me some more details? Still learning how to contribute, so some guidance appreciated.

@whisky-jb

Copy link
Copy Markdown
Contributor Author

try again

@whisky-jb whisky-jb reopened this May 25, 2026
  new_scanner_file now uses new_plain_scanner as its base, eliminating the duplicated Scanner struct initialization.
@medvednikov

Copy link
Copy Markdown
Member

Don't worry, I'll fix your PR myself today.

@whisky-jb

Copy link
Copy Markdown
Contributor Author

@medvednikov did you have any chance to look into this?

@medvednikov

Copy link
Copy Markdown
Member

I did, this change is not as trivial as it seems. I'll post more details a bit later.

@whisky-jb

Copy link
Copy Markdown
Contributor Author

The change is relatively simple, I try to avoid twice the similar call to &Scanner with the same parameters.

btw. working through a compiler book, that's why I started with the scanner and got stuck there ;-)

@GGRei

GGRei commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

I think the tricky part is that these two Scanner initializations are similar structurally, but they do not represent the same source contract.

new_scanner_file() creates a file-backed scanner: file_path/file_base/file_idx must describe a real file registered in table.filelist, and file_idx is copied into every token during scanning.

new_scanner() creates a scanner for virtual/generated/in-memory code: it intentionally uses internally_generated_v_code and leaves file_idx == -1.

So the duplication is not only mechanical duplication. It also makes the provenance mode explicit.
If new_plain_scanner() starts accepting arbitrary file_path/file_base, every caller now has to know whether it is creating a real-file scanner or a generated-text scanner, which is easier to misuse later.

Anyway, this is only my personal opinion, and for me the main risk of this PR is exactly there, it may look like a small deduplication, but it touches the scanner source/provenance contract. :)

@whisky-jb

Copy link
Copy Markdown
Contributor Author

I think the tricky part is that these two Scanner initializations are similar structurally, but they do not represent the same source contract.

new_scanner_file() creates a file-backed scanner: file_path/file_base/file_idx must describe a real file registered in table.filelist, and file_idx is copied into every token during scanning.

new_scanner() creates a scanner for virtual/generated/in-memory code: it intentionally uses internally_generated_v_code and leaves file_idx == -1.

So the duplication is not only mechanical duplication. It also makes the provenance mode explicit. If new_plain_scanner() starts accepting arbitrary file_path/file_base, every caller now has to know whether it is creating a real-file scanner or a generated-text scanner, which is easier to misuse later.

Anyway, this is only my personal opinion, and for me the main risk of this PR is exactly there, it may look like a small deduplication, but it touches the scanner source/provenance contract. :)

Let me disagree!

new_plain_scanner is not declared public, so everybody will use either new_scanner_file or new_scanner and for them, the contract stays the same!

@GGRei

GGRei commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Yes, new_plain_scanner is not public, so this does not change the external API.

My concern is internal, not external.

In V, non-public items are still callable from other files in the same module. scanner_test.v is already an example of that, it is in module scanner and this PR updates its direct new_plain_scanner(...) calls to the new signature.

The fragile part is that after this change, new_plain_scanner(...) can create a scanner with a real-looking file_path/file_base while file_idx is still the default -1.
Tokens copy s.file_idx when they are produced, so if any internal caller scans before setting file_idx, the produced tokens keep file_idx == -1.

So the current new_scanner_file() order looks OK today, because it sets s.file_idx before scanning. But the helper now allows a mixed provenance state: "real file path" plus "not registered in filelist".

That is why I think this is more than a public/private API question. It is an internal invariant issue.

For example, after this PR, code inside the scanner module can now do this:

prefs := &pref.Preferences{}
mut s := new_plain_scanner('module main\nfn main() {}', .skip_comments, prefs,
    '/tmp/demo.v', 'demo.v')

// file_idx was not set
tok := s.text_scan()

// The scanner looks file-backed...
assert s.file_path == '/tmp/demo.v'

// ...but the token says "no registered file".
assert tok.file_idx == -1

Again, I am not saying that the current PR path is broken today. As far as I can see, the current new_scanner_file() flow still works because it sets file_idx before scanning but my point is that the refactor removes a protection that existed in new_plain_scanner() itself. That mixed internal state becomes possible through the helper, as shown in the example above.

@whisky-jb

Copy link
Copy Markdown
Contributor Author

It is obviously possible, that inside the module, one can screw up things, there is no protection against this. On the other hand, we have 2 definitions for the scanner which are almost identical, this is not best practice. So my aim was to get ride of the code duplication. Not sure if there would be better ways of doing it?

@GGRei

GGRei commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Removing the duplicated Scanner{ ... } initialization seems like a good goal.

I think the safer shape would be to deduplicate only the common fields, while keeping the two provenance modes explicit.

So new_plain_scanner(text, mode, pref) could stay the generated/in-memory helper, using internally_generated_v_code and leaving file_idx == -1.

For file-backed source, there could be a second small internal helper that receives the already-read raw_text, plus file_path and file_idx. It can derive file_base from file_path, then set file_path, file_base, and file_idx together before any scan happens. new_scanner_file() would still handle os.is_file and util.read_file, then call that helper.

That removes the duplicated common initialization, but avoids a scanner state where file_path/file_base look file-backed while file_idx is still -1.

Names aside, the important part is that file_path, file_base, and file_idx move as one provenance unit. To me, that seems like the safest compromise between the intention of the PR and the protection the old structure provided.

Of course, this is only my personal opinion, not something mandatory.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants